Re: pgbench - adding pl/pgsql versions of tests

2024-02-02 Thread Hannu Krosing
My justification for adding pl/pgsql tests as part of the immediately
available tests is that pl/pgsql itself is always enabled, so having a
no-effort way to test its performance benefits would be really helpful.
We also should have "tps-b-like as SQL function" to round up the "test
what's available in server" set.
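
For illustration, a pl/pgsql wrapper around the built-in tpcb-like transaction
could look roughly like the sketch below (the function name and the
custom-script wiring are made up here; nothing like this is shipped with
pgbench today):

CREATE FUNCTION pgbench_tpcb_like(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int LANGUAGE plpgsql AS
$$
DECLARE
    v_abalance int;
BEGIN
    -- the same statements as the built-in tpcb-like script, inside one function call
    UPDATE pgbench_accounts SET abalance = abalance + p_delta WHERE aid = p_aid;
    SELECT abalance INTO v_abalance FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + p_delta WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        VALUES (p_tid, p_bid, p_aid, p_delta, CURRENT_TIMESTAMP);
    RETURN v_abalance;
END;
$$;

Until something like this is built in, it can already be exercised as a custom
script (pgbench -f tpcb_func.sql) containing:

\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
SELECT pgbench_tpcb_like(:aid, :bid, :tid, :delta);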

---
Hannu

On Fri, Feb 2, 2024 at 9:44 PM Robert Haas  wrote:

> On Tue, Aug 15, 2023 at 11:41 AM Nathan Bossart
>  wrote:
> > Why's that?  I'm not aware of any project policy that prohibits such
> > enhancements to pgbench.  It might take some effort to gather consensus on
> > a proposal like this, but IMHO that doesn't mean we shouldn't try.  If the
> > prevailing wisdom is that we shouldn't add more built-in scripts because
> > there is an existing way to provide custom ones, then it's not clear that
> > we should proceed with $SUBJECT, anyway.
>
> I don't think there's a policy against adding more built-in scripts to
> pgbench, but I'm skeptical of such efforts because I don't see how to
> decide which ones are worthy of inclusion and which are not. Adding
> everyone's favorite thing will be too cluttered, and adding nothing
> forecloses nothing because people can always provide their own. If we
> could establish that certain custom scripts are widely used across
> many people, then those might be worth adding.
>
> I have a vague recollection of someone proposing something similar to
> this in the past, possibly Jeff Davis. If there is in fact a paper
> trail showing that the same thing has been proposed more than once by
> unrelated people, that would be a point in favor of adding that
> particular thing.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Alexander Lakhin

Hello Tom and Noah,

03.02.2024 04:24, Tom Lane wrote:

I'm still wondering how come the failure seems to have suddenly gotten
way more common.  The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.


I think the failure rate increased because tamandua, calliphoridae,
flaviventris, and kestrel were switched from make to meson recently.
The last `make` builds for them:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-01-31%2016%3A51%3A31
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-31%2016%3A51%3A38
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-01-31%2016%3A52%3A37
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2016%3A51%3A53

and since that switch the 031_column_list duration increased significantly,
e.g., on the last tamandua `make` run it executed for 7 seconds, but
successful `meson` runs give much longer duration:
280/283 postgresql:subscription / subscription/031_column_list    OK    38.27s   36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list    OK   126.13s   36 subtests passed
280/283 postgresql:subscription / subscription/031_column_list    OK    31.93s   36 subtests passed
279/283 postgresql:subscription / subscription/031_column_list    OK    99.76s   36 subtests passed

So, looking at the tamandua's failure log, I see:
2024-02-02 19:41:19.750 UTC [1579219][postmaster][:0] LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by 
gcc-12.3.0, 64-bit

...
2024-02-02 19:42:19.973 UTC [1629333][client backend][4/213:0] LOG: statement: 
ALTER SUBSCRIPTION sub1 SET PUBLICATION pub7
2024-02-02 19:42:20.131 UTC [1625765][logical replication apply worker][3/122:0] LOG:  logical replication worker for 
subscription "sub1" will restart because of a parameter change
2024-02-02 19:42:20.137 UTC [1629333][client backend][:0] LOG: disconnection: session time: 0:00:00.212 user=bf 
database=postgres host=[local]
2024-02-02 19:42:20.191 UTC [1629535][logical replication apply worker][3/124:0] LOG:  logical replication apply worker 
for subscription "sub1" has started

...
2024-02-02 19:42:20.445 UTC [1629535][logical replication apply worker][3/0:0] ERROR:  could not receive data from WAL 
stream: ERROR:  publication "pub7" does not exist

    CONTEXT:  slot "sub1", output plugin "pgoutput", in the change callback, 
associated LSN 0/15C7698

(The interval between subscriber start and the error is ~ 4 * 15 seconds.)

Thus it still may be explained by bgwriter activity, though perhaps
autovacuum/checkpointer can add something as well.

Best regards,
Alexander




Re: Documentation: warn about two_phase when altering a subscription

2024-02-02 Thread Amit Kapila
On Wed, Jan 31, 2024 at 11:25 AM Bertrand Drouvot
 wrote:
>
> They make sense to me so applied both in v2 attached.
>

This patch looks good to me but I think it is better to commit this
after we push some more work on failover slots, especially the core
sync slot functionality because we are changing the existing wording
of the related work.

-- 
With Regards,
Amit Kapila.




Re: src/bin/pg_upgrade/t/004_subscription.pl test comment fix

2024-02-02 Thread Amit Kapila
On Thu, Feb 1, 2024 at 5:58 AM Peter Smith  wrote:
>
> OK. Now using your suggested 2nd sentence:
>
> +# The subscription's running status and failover option should be preserved
> +# in the upgraded instance. So regress_sub1 should still have subenabled,subfailover
> +# set to true, while regress_sub2 should have both set to false.
>
> ~
>
> I also tweaked some other nearby comments/messages which did not
> mention the 'failover' preservation.
>

Looks mostly good to me. One minor nitpick:
*
along with retaining the replication origin's remote lsn
-# and subscription's running status.
+# and subscription's running status and failover option.

I don't think we need to use 'and' twice in the above sentence. We
should use ',' between different properties. I can change this on
Monday and push it unless you think otherwise.

-- 
With Regards,
Amit Kapila.




on_error table, saving error info to a table

2024-02-02 Thread jian he
Hi.
I previously did some work on making COPY FROM save error information to a table,
still based on this suggestion:
https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us
Now I have refactored it.

the syntax:
ON_ERROR 'table', TABLE 'error_saving_tbl'

If TABLE is specified but ON_ERROR is not set to 'table', we raise an error.
If ON_ERROR is set to 'table' but TABLE is not specified, or
error_saving_tbl does not exist, we also raise an error.

In BeginCopyFrom, we check the data definition of error_saving_table,
we also check whether the user has INSERT privilege on error_saving_table
(all the columns),
and we do a preliminary check of the lock on error_saving_table.

If these conditions are not met, we quickly error out.
error_saving_table will be in the same schema as the table being copied into.
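
For illustration, the two rejected combinations would look like this (a sketch
only, reusing the demo tables created below; the exact error messages are not
shown here):

-- TABLE specified without on_error 'table': error
COPY t_copy_tbl FROM STDIN WITH (table err_tbl);
-- on_error 'table' specified without TABLE (or with a nonexistent table): error
COPY t_copy_tbl FROM STDIN WITH (on_error 'table');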

Because "table" is a keyword, I have to add the following changes to gram.y.
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3420,6 +3420,10 @@ copy_opt_item:
  {
  $$ = makeDefElem("null", (Node *) makeString($3), @1);
  }
+ | TABLE opt_as Sconst
+ {
+ $$ = makeDefElem("table", (Node *) makeString($3), @1);
+ }

since "table" is already a keyword, so there is no influence on the
parsing speed?

demo:

create table err_tbl(
userid oid, -- the user oid while copy generated this entry
copy_tbl oid, --copy table
filename text,
lineno  int8,
line    text,
colname text,
raw_field_value text,
err_message text,
err_detail text,
errorcode text
);
create table t_copy_tbl(a int, b int, c int);
COPY t_copy_tbl FROM STDIN WITH (delimiter ',', on_error 'table',
table err_tbl);
1,2,a
\.

table err_tbl \gx
-[ RECORD 1 ]---+---
userid  | 10
copy_tbl| 17920
filename| STDIN
lineno  | 1
line| 1,2,a
colname | c
raw_field_value | a
err_message | invalid input syntax for type integer: "a"
err_detail  |
errorcode   | 22P02
From bb6f263ef4d37c2871086db44dca217fa91f5080 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Sat, 3 Feb 2024 11:13:14 +0800
Subject: [PATCH v1 1/1] on_error table, saving error info to a table

introduce on_error table option for COPY FROM.
the syntax is {on_error table, table 'error_saving_tbl'}.

We first check table error_saving_tbl's existence and data definition.
If it does not meet our criteria, we quickly abort the COPY operation.
We also do a preliminary check of the lock on the error saving table
so that COPY can insert tuples into it.

Once an error happens, we save the error metadata
and insert it into the error_saving_table.
---
 contrib/file_fdw/file_fdw.c  |   4 +-
 doc/src/sgml/ref/copy.sgml   | 106 +-
 src/backend/commands/copy.c  |  26 ++
 src/backend/commands/copyfrom.c  | 129 ++-
 src/backend/commands/copyfromparse.c |  52 ++-
 src/backend/parser/gram.y|   4 +
 src/include/commands/copy.h  |   4 +-
 src/test/regress/expected/copy2.out  |  93 +++
 src/test/regress/sql/copy2.sql   |  80 +
 9 files changed, 492 insertions(+), 6 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 249d82d3..1d536e9e 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -751,7 +751,7 @@ fileIterateForeignScan(ForeignScanState *node)
 	 */
 	oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
 	found = NextCopyFrom(festate->cstate, econtext,
-		 slot->tts_values, slot->tts_isnull);
+		 slot->tts_values, slot->tts_isnull, NULL);
 	if (found)
 		ExecStoreVirtualTuple(slot);
 
@@ -1183,7 +1183,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
 		MemoryContextReset(tupcontext);
 		MemoryContextSwitchTo(tupcontext);
 
-		found = NextCopyFrom(cstate, NULL, values, nulls);
+		found = NextCopyFrom(cstate, NULL, values, nulls, NULL);
 
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a0..ae13c3b6 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { table_name [ ( column_name [, ...] ) | * }
 FORCE_NULL { ( column_name [, ...] ) | * }
 ON_ERROR 'error_action'
+TABLE 'error_saving_tbl'
 ENCODING 'encoding_name'
 
  
@@ -380,12 +381,14 @@ COPY { table_name [ ( 
   Specifies which 
   error_action to perform when there is malformed data in the input.
-  Currently, only stop (default) and ignore
+  Currently, only stop (default) , ignore, table
   values are supported.
   If the stop value is specified,
   COPY stops operation at the first error.
   If the ignore value is specified,
   COPY skips malformed data and continues copying data.
+  If the table value is specified,
+  COPY skips malformed data and continues copying data, it aslo insert error related 

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-02-02 Thread jian he
The idea of on_error is to tolerate errors, I think.
If a column has a NOT NULL constraint, it cannot be used with
(on_error 'null').

Based on this, I've made a patch.
Following the COPY synopsis (ON_ERROR 'error_action'), for
on_error 'null' the keyword null should be single-quoted.

demo:
COPY check_ign_err FROM STDIN WITH (on_error 'null');
1 {1} a
2 {2} 1
3 {3} 2
4 {4} b
a {5} c
\.

\pset null NULL

SELECT * FROM check_ign_err;
  n   |  m  |  k
------+-----+------
    1 | {1} | NULL
    2 | {2} |    1
    3 | {3} |    2
    4 | {4} | NULL
 NULL | {5} | NULL
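
To illustrate the NOT NULL restriction (a sketch with a made-up table; the
error text below is the one added by the patch, raised in BeginCopyFrom before
any data is read):

CREATE TABLE check_notnull_err (n int NOT NULL, m int[], k int);
COPY check_notnull_err FROM STDIN WITH (on_error 'null');
ERROR:  copy on_error 'null' cannot be used with not null constraint column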
From 19afa942af22fd3d2ed2436c6bc7ce02f00bb570 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Sat, 3 Feb 2024 14:04:08 +0800
Subject: [PATCH v1 1/1] introduce copy on_error 'null' option

on_error 'null': null needs to be single-quoted.
Any data type conversion error will set that column's value to NULL.
It will not work with columns that have a NOT NULL constraint;
we check this in BeginCopyFrom.

discussion: https://www.postgresql.org/message-id/CAKFQuwawy1e6YR4S=j+y7pXqg_Dw1WBVrgvf=bp3d1_asfe...@mail.gmail.com
---
 doc/src/sgml/ref/copy.sgml   |  1 +
 src/backend/commands/copy.c  |  2 ++
 src/backend/commands/copyfrom.c  | 28 
 src/backend/commands/copyfromparse.c | 27 +--
 src/include/commands/copy.h  |  1 +
 src/test/regress/expected/copy2.out  | 26 ++
 src/test/regress/sql/copy2.sql   | 28 
 7 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 55764fc1..d3c4ebdc 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -390,6 +390,7 @@ COPY { table_name [ ( error_action value of
   stop means fail the command, while
   ignore means discard the input row and continue with the next one.
+  null means the input value will be null and continue with the next one.
   The default is stop.
  
  
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cc0786c6..01ce47b0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -422,6 +422,8 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
 		return COPY_ON_ERROR_STOP;
 	if (pg_strcasecmp(sval, "ignore") == 0)
 		return COPY_ON_ERROR_IGNORE;
+	if (pg_strcasecmp(sval, "null") == 0)
+		return COPY_ON_ERROR_NULL;
 
 	ereport(ERROR,
 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 1fe70b91..f4c5704e 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1005,6 +1005,7 @@ CopyFrom(CopyFromState cstate)
 			 * information according to ON_ERROR.
 			 */
 			if (cstate->opts.on_error == COPY_ON_ERROR_IGNORE)
+			{
 
 /*
  * Just make ErrorSaveContext ready for the next NextCopyFrom.
@@ -1013,11 +1014,18 @@ CopyFrom(CopyFromState cstate)
  */
 cstate->escontext->error_occurred = false;
 
-			/* Report that this tuple was skipped by the ON_ERROR clause */
-			pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
-		 ++skipped);
+/* Report that this tuple was skipped by the ON_ERROR clause */
+pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
+			++skipped);
 
-			continue;
+continue;
+			}
+			/*
+			 * Just make ErrorSaveContext ready for the next NextCopyFrom.
+			 *
+			*/
+			if (cstate->opts.on_error == COPY_ON_ERROR_NULL)
+cstate->escontext->error_occurred = false;
 		}
 
 		ExecStoreVirtualTuple(myslot);
@@ -1313,6 +1321,7 @@ CopyFrom(CopyFromState cstate)
 	error_context_stack = errcallback.previous;
 
 	if (cstate->opts.on_error != COPY_ON_ERROR_STOP &&
+		cstate->opts.on_error != COPY_ON_ERROR_NULL &&
 		cstate->num_errors > 0)
 		ereport(NOTICE,
 errmsg_plural("%llu row was skipped due to data type incompatibility",
@@ -1468,6 +1477,8 @@ BeginCopyFrom(ParseState *pstate,
 		 */
 		if (cstate->opts.on_error == COPY_ON_ERROR_IGNORE)
 			cstate->escontext->details_wanted = false;
+		else if (cstate->opts.on_error == COPY_ON_ERROR_NULL)
+			cstate->escontext->details_wanted = false;
 	}
 	else
 		cstate->escontext = NULL;
@@ -1621,6 +1632,15 @@ BeginCopyFrom(ParseState *pstate,
 		if (att->attisdropped)
 			continue;
 
+		/*
+		 * we can specify on_error 'null', but it can only apply to columns
+		 * don't have not null constraint.
+		*/
+		if (att->attnotnull && cstate->opts.on_error == COPY_ON_ERROR_NULL)
+			ereport(ERROR,
+	(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+	 errmsg("copy on_error 'null' cannot be used with not null constraint column")));
+
 		/* Fetch the input function and typioparam info */
 		if (cstate->opts.binary)
 			getTypeBinaryInputInfo(att->atttypid,
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 7cacd0b7..9475a9dc 100644
--- 

Re: Is this a problem in GenericXLogFinish()?

2024-02-02 Thread Amit Kapila
On Fri, Feb 2, 2024 at 12:40 PM Michael Paquier  wrote:
>
> Amit, this has been applied as of 861f86beea1c, and I got pinged about
> the fact this triggers inconsistencies because we always set the LSN
> of the write buffer (wbuf in _hash_freeovflpage) but
> XLogRegisterBuffer() would *not* be called when the two following
> conditions happen:
> - When xlrec.ntups <= 0.
> - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt
>
> And it seems to me that there is still a bug here: there should be no
> point in setting the LSN on the write buffer if we don't register it
> in WAL at all, no?
>

Right, I see the problem. I'll look into it further.

-- 
With Regards,
Amit Kapila.




Re: Draft release notes for minor releases are up

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 08:18:50PM -0500, Tom Lane wrote:
> Noah Misch  writes:
> > Shall the top of the notes advise to reindex GIN indexes?
> 
> I thought about that, but it's a pretty low-probability failure
> I think, so I didn't write that advice.  Maybe I misjudged it.

I can see there being failures so low-probability as to justify omitting that
text, e.g. a failure identified theoretically and requiring a process to lose
the CPU for hours.  For this one, the reporter seems to have arrived at it
without a deliberate search.  This one just needs a recovery at the right WAL
record, then two processes reaching the incomplete split concurrently.




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Tom Lane
Noah Misch  writes:
> On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
>> If you look at the buildfarm's failures page and filter down to
>> just subscriptionCheck failures, what you find is that all of the
>> last 6 such failures are in 031_column_list.pl:
>> ...
>> I don't see anything that 031_column_list.pl is doing that is much
>> different from other subscription tests, so why is it the only one
>> failing?  And more to the point, what's going wrong exactly?

> I don't know, but
> https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> reported a replacement BgWriterDelay value reproducing it.  That hasn't
> reproduced it in ~10 runs on my machine, though.

Ah, thanks for that link.  I like the theory proposed in that thread
that the walsender is starting up at an LSN somewhere before where
the publication is created.  I'm tempted to add some more queries to
the test script to see if that can be proven.

I'm still wondering how come the failure seems to have suddenly gotten
way more common.  The only changes that are in vaguely-related places
and fit the time frame are Amit's 732924043 and 776621a5e, but I sure
don't see a connection.

regards, tom lane




Re: Draft release notes for minor releases are up

2024-02-02 Thread Tom Lane
Noah Misch  writes:
> Shall the top of the notes advise to reindex GIN indexes?

I thought about that, but it's a pretty low-probability failure
I think, so I didn't write that advice.  Maybe I misjudged it.

> Things I'd like to capture for this one:
> - Commit 0b6517a3b deals with crash recovery, not base backups.
> - Connection establishment will fail if one of these bugs corrupted the
>   database, so there's no need to worry about silent corruption.  (My commit
>   messages didn't make that clear.)
> Perhaps like this:

Thanks, I'll work on that one some more.

regards, tom lane




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > If you look at the buildfarm's failures page and filter down to
> > just subscriptionCheck failures, what you find is that all of the
> > last 6 such failures are in 031_column_list.pl:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
> > 
> > There are some further back too:
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
> > 
> > but this definitely got way more common in the last few days.
> 
> > I don't see anything that 031_column_list.pl is doing that is much
> > different from other subscription tests, so why is it the only one
> > failing?  And more to the point, what's going wrong exactly?
> 
> I don't know, but
> https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> reported a replacement BgWriterDelay value reproducing it.

Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
BgWriterDelay.

> That hasn't reproduced it in ~10 runs on my machine, though.

After 207 successes, it did fail once for me.

> > I am suspicious that this somehow represents a failure of the
> > historical catalog decoding logic, but I don't see how that theory
> > explains this only breaking in one test script.




Re: Small fix on COPY ON_ERROR document

2024-02-02 Thread Alexander Korotkov
On Fri, Feb 2, 2024 at 10:07 PM David G. Johnston
 wrote:
> On Thu, Feb 1, 2024 at 11:59 PM Yugo NAGATA  wrote:
>>
>>
>> I attached a updated patch including fixes you pointed out above.
>>
>
> Removed "which"; changed "occupying" to "occupy"
> Removed one of the two "amounts"
> Changed "unacceptable to the input function" to just "converting" as that is 
> what the average reader is more likely to be thinking.
> The rest was good.

Thanks to everybody in the thread.
Pushed.

--
Regards,
Alexander Korotkov




Re: Draft release notes for minor releases are up

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 12:54:48PM -0500, Tom Lane wrote:
> First-draft release notes for 16.2 are available at
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=87dcc5e45fad3021514f01360d3a2abd4e6480ee

> +
> +
> + 
> +  Fix insufficient locking when cleaning up an incomplete split of
> +  a GIN index's internal page (Fei Changhong, Heikki Linnakangas)
> + 
> +
> + 
> +  The code tried to do this with shared rather than exclusive lock on
> +  the buffer.  This could lead to index corruption if two processes
> +  attempted the cleanup concurrently.
> + 
> +

Shall the top of the notes advise to reindex GIN indexes?

> +
> +
> + 
> +  Add more interlocks between CREATE DATABASE and
> +  base backup (Noah Misch)
> + 
> +
> + 
> +  This fixes some cases where a base backup taken concurrently
> +  with CREATE DATABASE could produce a corrupt
> +  image of the new database.
> + 
> +

Things I'd like to capture for this one:

- Commit 0b6517a3b deals with crash recovery, not base backups.
- Connection establishment will fail if one of these bugs corrupted the
  database, so there's no need to worry about silent corruption.  (My commit
  messages didn't make that clear.)

Perhaps like this:

diff --git a/doc/src/sgml/release-16.sgml b/doc/src/sgml/release-16.sgml
index 21387e3..8997279 100644
--- a/doc/src/sgml/release-16.sgml
+++ b/doc/src/sgml/release-16.sgml
@@ -750,15 +750,15 @@ Branch: REL_16_STABLE [48a6bf5c4] 2024-02-01 13:44:22 
-0800
 Branch: REL_15_STABLE [8fa4a1ac6] 2024-02-01 13:44:23 -0800
 -->
  
-  Add more interlocks between CREATE DATABASE and
-  base backup (Noah Misch)
+  Fix durability of CREATE DATABASE (Noah Misch)
  
 
  
-  This fixes some cases where a base backup taken concurrently
-  with CREATE DATABASE could produce a corrupt
-  image of the new database.
+  Recovery failed, or establishing connections to the new database failed.
+  Effects required an operating system crash or base backup, concurrently
+  with or shortly after the CREATE DATABASE.
  
+
 
 
 




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-02-02 Thread Jelte Fennema-Nio
On Fri, 2 Feb 2024 at 16:06, Alvaro Herrera  wrote:
> Now, looking at this list, I think it's surprising that the nonblocking
> request for a cancellation is called PQcancelPoll.  PQcancelSend() is at
> odds with the asynchronous query API, which uses the verb "send" for the
> asynchronous variants. This would suggest that PQcancelPoll should
> actually be called PQcancelSend or maybe PQcancelStart (mimicking
> PQconnectStart).  I'm not sure what's a good alternative name for the
> blocking one, which you have called PQcancelSend.

I agree that Send is an unfortunate suffix. I'd love to use PQcancel
for this, but obviously that one is already taken. Some other options
that I can think of are (from favorite to less favorite):
- PQcancelBlocking
- PQcancelAndWait
- PQcancelGo
- PQcancelNow

Finally, another option would be to rename PQcancelConn to
PQgetCancelConn and then rename PQcancelSend to PQcancelConn.

Regarding PQcancelPoll, I think it's a good name for the polling
function, but I agree it's a bit confusing to use it to also start
sending the connection. Even the code of PQcancelPoll basically admits
that this is confusing behaviour:

/*
 * Before we can call PQconnectPoll we first need to start the connection
 * using pqConnectDBStart. Non-cancel connections already do this whenever
 * the connection is initialized. But cancel connections wait until the
 * caller starts polling, because there might be a large delay between
 * creating a cancel connection and actually wanting to use it.
 */
if (conn->status == CONNECTION_STARTING)
{
    if (!pqConnectDBStart(&cancelConn->conn))
    {
        cancelConn->conn.status = CONNECTION_STARTED;
        return PGRES_POLLING_WRITING;
    }
}

The only reasonable thing I can think of to make that situation better
is to move that part of the function outside of PQcancelPoll and
create a dedicated PQcancelStart function for it. It introduces an
extra function, but it does seem more in line with how we do the
regular connection establishment. Basically you would have code like
this then, which looks quite nice honestly:

cancelConn = PQcancelConn(conn);
if (!PQcancelStart(cancelConn))
pg_fatal("bad cancel connection: %s", PQcancelErrorMessage(cancelConn));
while (true)
{
 // polling using PQcancelPoll here
}




Re: Flushing large data immediately in pqcomm

2024-02-02 Thread Andres Freund
Hi,

On 2024-02-01 15:02:57 -0500, Robert Haas wrote:
> On Thu, Feb 1, 2024 at 10:52 AM Robert Haas  wrote:
> There was probably a better way to phrase this email ... the sentiment
> is sincere, but there was almost certainly a way of writing it that
> didn't sound like I'm super-annoyed.

NP - I could have phrased mine better as well...


> > On Wed, Jan 31, 2024 at 10:24 PM Andres Freund  wrote:
> > > While not perfect - e.g. because networks might use jumbo packets / large MTUs
> > > and we don't know how many outstanding bytes there are locally, I think a
> > > decent heuristic could be to always try to send at least one packet worth of
> > > data at once (something like ~1400 bytes), even if that requires copying some
> > > of the input data. It might not be sent on its own, but it should make it
> > > reasonably unlikely to end up with tiny tiny packets.
> >
> > I think that COULD be a decent heuristic but I think it should be
> > TESTED, including against the ~3 or so other heuristics proposed on
> > this thread, before we make a decision.
> >
> > I literally mentioned the Ethernet frame size as one of the things
> > that we should test whether it's relevant in the exact email to which
> > you're replying, and you replied by proposing that as a heuristic, but
> > also criticizing me for wanting more research before we settle on
> > something.

I mentioned the frame size thing because afaict nobody in the thread had
mentioned our use of TCP_NODELAY (which basically forces the kernel to send
out data immediately instead of waiting for further data to be sent). Without
that it'd be a lot less problematic to occasionally send data in small
increments inbetween larger sends. Nor would packet sizes be as relevant.


> > Are we just supposed to assume that your heuristic is better than the
> > others proposed here without testing anything, or, like, what? I don't
> > think this needs to be a completely exhaustive or exhausting process, but
> > I think trying a few different things out and seeing what happens is
> > smart.

I wasn't trying to say that my heuristic necessarily is better. What I was
trying to get at - and expressed badly - was that I doubt that testing can get
us all that far here. It's not too hard to test the effects of our buffering
with regards to syscall overhead, but once you actually take network effects
into account it gets quite hard. Bandwidth, latency, the specific network
hardware and operating systems involved all play a significant role. Given
how, uh, naive our current approach is, I think analyzing the situation from
first principles and then doing some basic validation of the results makes
more sense.

Separately, I think we shouldn't aim for perfect here. It's obviously
extremely inefficient to send a larger amount of data by memcpy()ing and
send()ing it in 8kB chunks. As mentioned by several folks upthread, we can
improve upon that without having worse behaviour than today.  Medium-long term
I suspect we're going to want to use asynchronous network interfaces, in
combination with zero-copy sending, which requires larger changes. Not that
relevant for things like query results, quite relevant for base backups etc.


It's perhaps also worth mentioning that the small send buffer isn't great for
SSL performance, the encryption overhead increases when sending in small
chunks.


I hacked up Melih's patch to send the pending data together with the first bit
of the large "to be sent" data and also added a patch to increased
SINK_BUFFER_LENGTH by 16x. With a 12GB database I tested the time for
  pg_basebackup -c fast -Ft --compress=none -Xnone -D - -d "$conn" > /dev/null

                       time via
test                   unix     tcp      tcp+ssl
master                 6.305s   9.436s   15.596s
master-larger-buffer   6.535s   9.453s   15.208s
patch                  5.900s   7.465s   13.634s
patch-larger-buffer    5.233s   5.439s   11.730s


The increase when using tcp is pretty darn impressive.  If I had remembered in
time to disable manifest checksums, the win would have been even bigger.


The bottleneck for SSL is that it still ends up with ~16kB sends, not sure
why.

Greetings,

Andres Freund




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Noah Misch
On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> If you look at the buildfarm's failures page and filter down to
> just subscriptionCheck failures, what you find is that all of the
> last 6 such failures are in 031_column_list.pl:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23
> 
> There are some further back too:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24
> 
> but this definitely got way more common in the last few days.

> I don't see anything that 031_column_list.pl is doing that is much
> different from other subscription tests, so why is it the only one
> failing?  And more to the point, what's going wrong exactly?

I don't know, but
https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
reported a replacement BgWriterDelay value reproducing it.  That hasn't
reproduced it in ~10 runs on my machine, though.

> I am suspicious that this somehow represents a failure of the
> historical catalog decoding logic, but I don't see how that theory
> explains this only breaking in one test script.




Why is subscription/t/031_column_list.pl failing so much?

2024-02-02 Thread Tom Lane
If you look at the buildfarm's failures page and filter down to
just subscriptionCheck failures, what you find is that all of the
last 6 such failures are in 031_column_list.pl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-02-02%2019%3A33%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-02-02%2011%3A21%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2020%3A34%3A29
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-01%2016%3A57%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-01-31%2022%3A18%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-30%2011%3A29%3A23

There are some further back too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-17%2018%3A28%3A24

but this definitely got way more common in the last few days.

Digging down into the logs, these all look pretty similar.  Somehow
things get into a state where replication connections fail with
the publisher reporting "publication does not exist":

2024-02-02 19:42:23.187 UTC [1631708][not initialized][:0] LOG:  connection 
received: host=[local]
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG:  connection 
authenticated: user="bf" method=trust 
(/home/bf/bf-build/tamandua/HEAD/pgsql.build/testrun/subscription/031_column_list/data/t_031_column_list_publisher_data/pgdata/pg_hba.conf:117)
2024-02-02 19:42:23.189 UTC [1631708][walsender][4/287:0] LOG:  replication 
connection authorized: user=bf application_name=sub1
2024-02-02 19:42:23.214 UTC [1631708][walsender][4/288:0] LOG:  statement: 
SELECT pg_catalog.set_config('search_path', '', false);
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  received 
replication command: IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  
IDENTIFY_SYSTEM
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  received 
replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 
(proto_version '4', origin 'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] LOG:  acquired logical 
replication slot "sub1"
2024-02-02 19:42:23.226 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] LOG:  starting logical 
decoding for slot "sub1"
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] DETAIL:  Streaming 
transactions committing after 0/15BCDD0, reading WAL from 0/15BCDD0.
2024-02-02 19:42:23.242 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] LOG:  logical decoding 
found consistent point at 0/15BCDD0
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] DETAIL:  There are no 
running transactions.
2024-02-02 19:42:23.243 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] ERROR:  publication 
"pub7" does not exist
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] CONTEXT:  slot "sub1", 
output plugin "pgoutput", in the change callback, associated LSN 0/15C7698
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/15BCDD0 (proto_version '4', origin 
'any', publication_names '"pub7"')
2024-02-02 19:42:23.244 UTC [1631708][walsender][4/0:0] LOG:  released logical 
replication slot "sub1"
2024-02-02 19:42:23.834 UTC [1631708][walsender][:0] LOG:  disconnection: 
session time: 0:00:00.647 user=bf database=postgres host=[local]

and then we just repeat that until the test times out.  It fails at
different points in the test script (hence, different publication
names), but the pattern looks about the same.

I don't see anything that 031_column_list.pl is doing that is much
different from other subscription tests, so why is it the only one
failing?  And more to the point, what's going wrong exactly?

I am suspicious that this somehow represents a failure of the
historical catalog decoding logic, but I don't see how that theory
explains this only breaking in one test script.

regards, tom lane




Re: Postgres and --config-file option

2024-02-02 Thread David G. Johnston
On Fri, Feb 2, 2024 at 2:23 PM David G. Johnston 
wrote:

> On Mon, Jan 15, 2024 at 4:35 AM Aleksander Alekseev <
> aleksan...@timescale.com> wrote:
>
>> PFA the patch. It's short but I think it mitigates the problem.
>>
>>
> I took a look at where these options are discussed in the documentation
> and now feel that we should make these options clear more broadly (config
> and libpq, plus pointing to --name from -c in a couple of places).  It
> doesn't add much verbosity and, frankly, if I was to pick one
> "--name=value" would win and so I'd rather document it, leaving -c alone
> for historical reasons.
>
> I've attached a replacement patch with the additional changes.
>
>
And I just saw one more apparently undocumented requirement (or a typo)

You must specify the --config-file

The actual parameter is "config_file", so apparently we are supposed to
either convert underscores to hyphens or we have a typo.

David J.


Re: Postgres and --config-file option

2024-02-02 Thread David G. Johnston
On Mon, Jan 15, 2024 at 4:35 AM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> PFA the patch. It's short but I think it mitigates the problem.
>
>
I took a look at where these options are discussed in the documentation and
now feel that we should make these options clear more broadly (config and
libpq, plus pointing to --name from -c in a couple of places).  It doesn't
add much verbosity and, frankly, if I was to pick one "--name=value" would
win and so I'd rather document it, leaving -c alone for historical reasons.

I've attached a replacement patch with the additional changes.

David J.
From adc6c807d2bc0d73271b47e4d1908e5c069e5b24 Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Fri, 2 Feb 2024 14:16:24 -0700
Subject: [PATCH] v4 configs review

---
 doc/src/sgml/config.sgml   | 11 ++-
 doc/src/sgml/libpq.sgml|  7 ---
 doc/src/sgml/ref/postgres-ref.sgml |  8 +---
 src/backend/main/main.c|  4 ++--
 src/backend/utils/misc/guc.c   |  2 +-
 5 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 61038472c5..1eb58c0793 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -333,9 +333,10 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter
   
During server startup, parameter settings can be
passed to the postgres command via the
-   -c command-line parameter.  For example,
+   -c name=value command-line parameter, or its equivalent
+   --name=value variation.  For example,
 
-postgres -c log_connections=yes -c log_destination='syslog'
+postgres -c log_connections=yes --log_destination='syslog'
 
Settings provided in this way override those set via
postgresql.conf or ALTER SYSTEM,
@@ -352,10 +353,10 @@ postgres -c log_connections=yes -c log_destination='syslog'
   of the session, but do not affect other sessions.
   For historical reasons, the format of PGOPTIONS is
   similar to that used when launching the postgres
-  command; specifically, the -c flag must be specified.
-  For example,
+  command; specifically, the -c , or prepended
+  --, before the name must be specified. For example,
 
-env PGOPTIONS="-c geqo=off -c statement_timeout=5min" psql
+env PGOPTIONS="-c geqo=off --statement_timeout=5min" psql
 
  
 
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index d0d5aefadc..89f7aa590d 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1374,9 +1374,10 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   

 Specifies command-line options to send to the server at connection
-start.  For example, setting this to -c geqo=off sets the
-session's value of the geqo parameter to
-off.  Spaces within this string are considered to
+start.  For example, setting this to -c geqo=off
+or --geqo=off sets the session's value of the
+geqo parameter to off.
+Spaces within this string are considered to
 separate command-line arguments, unless escaped with a backslash
 (\); write \\ to represent a literal
 backslash.  For a detailed discussion of the available
diff --git a/doc/src/sgml/ref/postgres-ref.sgml b/doc/src/sgml/ref/postgres-ref.sgml
index b13a16a117..cbf06d86cb 100644
--- a/doc/src/sgml/ref/postgres-ref.sgml
+++ b/doc/src/sgml/ref/postgres-ref.sgml
@@ -123,7 +123,8 @@ PostgreSQL documentation
 described in . Most of the
 other command line options are in fact short forms of such a
 parameter assignment.  -c can appear multiple times
-to set multiple parameters.
+to set multiple parameters.  See the --name
+option below for an alternate syntax.

   
  
@@ -342,8 +343,9 @@ PostgreSQL documentation
   --name=value
   

-Sets a named run-time parameter; a shorter form of
--c.
+Sets the named run-time parameter; a shorter form of
+-c.  See 
+for a listing of parameters.

   
  
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 51ffb8e773..f1b88faa5c 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -324,7 +324,7 @@ help(const char *progname)
 	printf(_("Usage:\n  %s [OPTION]...\n\n"), progname);
 	printf(_("Options:\n"));
 	printf(_("  -B NBUFFERSnumber of shared buffers\n"));
-	printf(_("  -c NAME=VALUE  set run-time parameter\n"));
+	printf(_("  -c NAME=VALUE  set run-time parameter (see also --NAME)\n"));
 	printf(_("  -C NAMEprint value of run-time parameter, then exit\n"));
 	printf(_("  -d 1-5 debugging level\n"));
 	printf(_("  -D DATADIR database directory\n"));
@@ -341,7 +341,7 @@ help(const char *progname)
 	printf(_("  -s show statistics after each query\n"));
 	printf(_("  -S 

Re: pgbench - adding pl/pgsql versions of tests

2024-02-02 Thread Robert Haas
On Tue, Aug 15, 2023 at 11:41 AM Nathan Bossart
 wrote:
> Why's that?  I'm not aware of any project policy that prohibits such
> enhancements to pgbench.  It might take some effort to gather consensus on
> a proposal like this, but IMHO that doesn't mean we shouldn't try.  If the
> prevailing wisdom is that we shouldn't add more built-in scripts because
> there is an existing way to provide custom ones, then it's not clear that
> we should proceed with $SUBJECT, anyway.

I don't think there's a policy against adding more built-in scripts to
pgbench, but I'm skeptical of such efforts because I don't see how to
decide which ones are worthy of inclusion and which are not. Adding
everyone's favorite thing will be too cluttered, and adding nothing
forecloses nothing because people can always provide their own. If we
could establish that certain custom scripts are widely used across
many people, then those might be worth adding.

I have a vague recollection of someone proposing something similar to
this in the past, possibly Jeff Davis. If there is in fact a paper
trail showing that the same thing has been proposed more than once by
unrelated people, that would be a point in favor of adding that
particular thing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: to_regtype() Raises Error

2024-02-02 Thread David E. Wheeler
On Feb 2, 2024, at 3:23 PM, David G. Johnston  
wrote:

> Seems like most just want to leave well enough alone and deal with the rare 
> question for oddball input on the mailing list.  If you are interested enough 
> to come back after 4 months I'd suggest you write up and submit a patch.  I'm 
> happy to review it and see if that is enough to get a committer to respond.

LOL, “interested enough” is less the right term than “triaging email backlog 
and following up on a surprisingly controversial question.” I also just like to 
see decisions made and issues closed one way or another.

Anyway, I’m happy to submit a documentation patch along the lines you suggested.

D





Re: to_regtype() Raises Error

2024-02-02 Thread David G. Johnston
On Mon, Jan 29, 2024 at 8:45 AM David E. Wheeler 
wrote:

> Hey there, coming back to this. I poked at the logs in the master branch
> and saw no mention of to_regtype; did I miss it?
>

With no feedback regarding my final suggestion I lost interest in it and
never produced a patch.


> On Sep 17, 2023, at 10:58 PM, David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>
> > Parses a string of text, extracts a potential type name from it, and
> translates that name into an OID.  Failure to extract a valid potential
> type name results in an error while a failure to determine that the
> extracted name is known to the system results in a null output.
> >
> > I take specific exception to describing your example as a “textual type
> name”.
>
> More docs seem like a reasonable compromise. Perhaps it’d be useful to
> also describe when an error is likely and when it’s not.
>
>
Seems like most just want to leave well enough alone and deal with the rare
question for oddball input on the mailing list.  If you are interested
enough to come back after 4 months I'd suggest you write up and submit a
patch.  I'm happy to review it and see if that is enough to get a committer
to respond.
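
For what it's worth, a few illustrative calls matching the quoted description
(my own examples; the exact error text is not reproduced here):

SELECT to_regtype('integer');          -- 'integer'::regtype: a known type name
SELECT to_regtype('no_such_type');     -- NULL: a valid potential type name, but unknown to the system
SELECT to_regtype('inteval second');   -- error: cannot be parsed as a type name at all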

David J.


Re: Correct SQLSTATE for ENOMEM in file access

2024-02-02 Thread Tom Lane
Alexander Kuzmenkov  writes:
> On Fri, Feb 2, 2024 at 8:12 PM Tom Lane  wrote:
>> Hmm, do you think this is actually reachable?  AFAIK we should only be
>> calling errcode_for_file_access() after functions that are unlikely to
>> report ENOMEM.

> It's reachable, that's how I noticed. I'm seeing logs like "XX000:
> could not load library \"/usr/lib/postgresql/15/lib/plpgsql.so\": out
> of memory" from internal_load_library and so on. Not sure what is the
> exact configuration required to reproduce this, probably at least the
> overcommit should be disabled.

OK, can't argue with experimental evidence ;-)

regards, tom lane




Re: Small fix on COPY ON_ERROR document

2024-02-02 Thread David G. Johnston
On Thu, Feb 1, 2024 at 11:59 PM Yugo NAGATA  wrote:

>
> I attached a updated patch including fixes you pointed out above.
>
>
Removed "which"; changed "occupying" to "occupy"
Removed one of the two "amounts"
Changed "unacceptable to the input function" to just "converting" as that
is what the average reader is more likely to be thinking.
The rest was good.

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 3c2feaa11a..55764fc1f2 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -385,8 +385,8 @@ COPY { table_name [ ( ON_ERROR
 
  
-  Specifies which how to behave when encountering an error due to column values
-  unacceptable to the input function of each attribute's data type.
+  Specifies how to behave when encountering an error converting a column's
+  input value into its data type.
   An error_action value of
   stop means fail the command, while
   ignore means discard the input row and continue with the next one.
@@ -589,7 +589,7 @@ COPY count
 The COPY FROM command physically inserts input rows
 into the table as it progresses.  If the command fails, these rows are
 left in a deleted state; these rows will not be visible, but still
-occupying disk space. This might amount to a considerable amount of
+occupy disk space. This might amount to considerable
 wasted disk space if the failure happened well into a large copy
 operation. VACUUM should be used to recover the
 wasted space.

David J.


Re: ResourceOwner refactoring

2024-02-02 Thread Heikki Linnakangas

On 02/02/2024 11:00, Alexander Lakhin wrote:

Please try the following script:
mkdir /tmp/50m
sudo mount -t tmpfs -o size=50M tmpfs /tmp/50m
export PGDATA=/tmp/50m/tmpdb

initdb
pg_ctl -l server.log start

cat << 'EOF' | psql
CREATE TEMP TABLE t (a name, b name, c name, d name);
INSERT INTO t SELECT 'a', 'b', 'c', 'd' FROM generate_series(1, 1000) g;

COPY t TO '/tmp/t.data';
SELECT 'COPY t FROM ''/tmp/t.data''' FROM generate_series(1, 100)
\gexec
EOF

which produces an unexpected error, a warning, and an assertion failure,
starting from b8bff07da:


Fixed, thanks for the report!

Comparing ExtendBufferedRelLocal() and ExtendBufferedRelShared(), it's 
easy to see that ExtendBufferedRelLocal() was missing a 
ResourceOwnerEnlarge() call in the loop. But it's actually a bit more 
subtle: it was correct without the ResourceOwnerEnlarge() call until 
commit b8bff07da, because ExtendBufferedRelLocal() unpins the old buffer
before pinning the new one, while ExtendBufferedRelShared() does it the other
way 'round. The implicit assumption was that unpinning the old buffer 
ensures that you can pin a new one. That no longer holds with commit 
b8bff07da. Remembering a new resource expects there to be a free slot in 
the fixed-size array, but if the forgotten resource was in the hash, 
rather than the array, forgetting it doesn't make space in the array.


We also make that assumption here in BufferAlloc:


/*
 * Got a collision. Someone has already done what we were about 
to do.
 * We'll just handle this as if it were found in the buffer 
pool in
 * the first place.  First, give up the buffer we were planning 
to
 * use.
 *
 * We could do this after releasing the partition lock, but 
then we'd
 * have to call ResourceOwnerEnlarge() & 
ReservePrivateRefCountEntry()
 * before acquiring the lock, for the rare case of such a 
collision.
 */
UnpinBuffer(victim_buf_hdr);


It turns out to be OK in that case, because it unpins the buffer that 
was the last one pinned. That does ensure that you have one free slot in 
the array, but forgetting anything other than the most recently 
remembered resource does not.


I've added a note to that in ResourceOwnerForget. I read through the 
other callers of ResourceOwnerRemember and PinBuffer, but didn't find 
any other unsafe uses. I'm not too happy with this subtlety, but at 
least it's documented now.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Correct SQLSTATE for ENOMEM in file access

2024-02-02 Thread Alexander Kuzmenkov
On Fri, Feb 2, 2024 at 8:12 PM Tom Lane  wrote:
> Hmm, do you think this is actually reachable?  AFAIK we should only be
> calling errcode_for_file_access() after functions that are unlikely to
> report ENOMEM.

It's reachable, that's how I noticed. I'm seeing logs like "XX000:
could not load library \"/usr/lib/postgresql/15/lib/plpgsql.so\": out
of memory" from internal_load_library and so on. Not sure what is the
exact configuration required to reproduce this, probably at least the
overcommit should be disabled.




Re: Correct SQLSTATE for ENOMEM in file access

2024-02-02 Thread Tom Lane
Alexander Kuzmenkov  writes:
> Looking through the logs of some server that were experiencing out of
> memory errors, I noticed that errcode_for_file_access() reports
> ERRCODE_INTERNAL_ERROR for ENOMEM, while the correct SQLSTATE for this
> should probably be ERRCODE_OUT_OF_MEMORY. Attached is a small patch to
> fix this.

Hmm, do you think this is actually reachable?  AFAIK we should only be
calling errcode_for_file_access() after functions that are unlikely to
report ENOMEM.

regards, tom lane




Re: POC, WIP: OR-clause support for indexes

2024-02-02 Thread Alena Rybakina


On 01.02.2024 08:00, jian he wrote:

On Wed, Jan 31, 2024 at 7:10 PM Alena Rybakina
  wrote:

Hi, thank you for your review and interest in this subject.

On 31.01.2024 13:15, jian he wrote:

On Wed, Jan 31, 2024 at 10:55 AM jian he  wrote:

based on my understanding of
https://www.postgresql.org/docs/current/xoper-optimization.html#XOPER-COMMUTATOR
I think you need move commutator check right after the `if
(get_op_rettype(opno) != BOOLOID)` branch

I was wrong about this part. sorry for the noise.


I have made some changes (attachment).
* if the operator expression left or right side type category is
{array | domain | composite}, then don't do the transformation.
(i am not 10% sure with composite)

To be honest, I'm not sure about this check, because we check the type of 
variable there:

if (!IsA(orqual, OpExpr))
 {
 or_list = lappend(or_list, orqual);
 continue;
 }
And below:
if (IsA(leftop, Const))
 {
 opno = get_commutator(opno);

 if (!OidIsValid(opno))
 {
 /* Commuter doesn't exist, we can't reverse the order */
 or_list = lappend(or_list, orqual);
 continue;
 }

 nconst_expr = get_rightop(orqual);
 const_expr = get_leftop(orqual);
 }
 else if (IsA(rightop, Const))
 {
 const_expr = get_rightop(orqual);
 nconst_expr = get_leftop(orqual);
 }
 else
 {
 or_list = lappend(or_list, orqual);
 continue;
 }

Isn't that enough?

alter table tenk1 add column arr int[];
set enable_or_transformation to on;
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1
WHERE arr = '{1,2,3}' or arr = '{1,2}';

the above query will not do the OR transformation, because an array type
does not itself have an array type.
`
scalar_type = entry->key.exprtype;
if (scalar_type != RECORDOID && OidIsValid(scalar_type))
array_type = get_array_type(scalar_type);
else
array_type = InvalidOid;
`

If either side of the operator expression is array or array related type,
we can be sure it cannot do the transformation
(get_array_type will return InvalidOid for anyarray type).
we can check it earlier, so hash related code will not be invoked for
array related types.

Agree.

Besides, some of examples (with ARRAY) works fine:

postgres=# CREATE TABLE sal_emp (
 pay_by_quarter  integer[],
 pay_by_quater1 integer[]
);
CREATE TABLE
postgres=# INSERT INTO sal_emp
 VALUES (
 '{1, 1, 1, 1}',
 '{1,2,3,4}');
INSERT 0 1
postgres=# select * from sal_emp where pay_by_quarter[1] = 1 or 
pay_by_quarter[1]=2;
 pay_by_quarter | pay_by_quater1
----------------+----------------
 {1,1,1,1}      | {1,2,3,4}
(1 row)

postgres=# explain select * from sal_emp where pay_by_quarter[1] = 1 or 
pay_by_quarter[1]=2;
                        QUERY PLAN
-----------------------------------------------------------
 Seq Scan on sal_emp  (cost=0.00..21.00 rows=9 width=64)
   Filter: (pay_by_quarter[1] = ANY ('{1,2}'::integer[]))
(2 rows)

* if the left side of the operator expression node contains volatile
functions, then don't do the transformation.

I'm also not sure about the volatility check function, because we perform such 
a conversion at the parsing stage, and at this stage we don't have a RelOptInfo 
variable and especially a RestrictInfo such as PathTarget.


  see the example in here:
https://www.postgresql.org/message-id/CACJufxGXhJ823cdAdp2Ho7qC-HZ3_-dtdj-myaAi_u9RQLn45g%40mail.gmail.com

set enable_or_transformation to on;
create or replace function retint(int) returns int as
$func$
begin raise notice 'hello';
 return $1 + round(10 * random()); end
$func$ LANGUAGE plpgsql;

SELECT count(*) FROM tenk1 WHERE thousand = 42;
will return 10 rows.

SELECT count(*) FROM tenk1 WHERE thousand = 42 AND (retint(1) = 4 OR
retint(1) = 3);
this query should return 20 'hello' notices, but now only 10.

EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1
WHERE thousand = 42 AND (retint(1) = 4 OR  retint(1) = 3);
                               QUERY PLAN
--------------------------------------------------------------------------
 Aggregate
   ->  Seq Scan on tenk1
         Filter: ((thousand = 42) AND (retint(1) = ANY ('{4,3}'::integer[])))
(3 rows)


Agree.

I added your code to the patch.

--
Regards,
Alena Rybakina
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company
From 0fcad72153c9eaaf436e5b417c803a70fbeaf8ac Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Fri, 2 Feb 2024 22:01:09 +0300
Subject: [PATCH] Transform OR clause to ANY expressions.

Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) on the
preliminary stage of optimization when we are still working with the tree
expression.
Here C is a constant expression, 'expr' is non-constant expression, 'op' is
an 

Correct SQLSTATE for ENOMEM in file access

2024-02-02 Thread Alexander Kuzmenkov
Looking through the logs of some servers that were experiencing out of
memory errors, I noticed that errcode_for_file_access() reports
ERRCODE_INTERNAL_ERROR for ENOMEM, while the correct SQLSTATE for this
should probably be ERRCODE_OUT_OF_MEMORY. Attached is a small patch to
fix this.

---
Alexander Kuzmenkov
Timescale
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 2c7a20e3d3..9860bf079a 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -922,6 +922,10 @@ errcode_for_file_access(void)
 			edata->sqlerrcode = ERRCODE_DISK_FULL;
 			break;
 
+		case ENOMEM:	/* Out of memory */
+			edata->sqlerrcode = ERRCODE_OUT_OF_MEMORY;
+			break;
+
 		case ENFILE:			/* File table overflow */
 		case EMFILE:			/* Too many open files */
 			edata->sqlerrcode = ERRCODE_INSUFFICIENT_RESOURCES;


Re: pg_language(langispl) column apparently not needed

2024-02-02 Thread Tom Lane
Antonin Houska  writes:
> I couldn't find a reference to the 'langispl' attribute, so I removed it (see
> the diff attached) and the master branch compiled cleanly. Is there still a
> reason to keep it?

You didn't test pg_dump, I take it.

regards, tom lane




pg_language(langispl) column apparently not needed

2024-02-02 Thread Antonin Houska
I couldn't find a reference to the 'langispl' attribute, so I removed it (see
the diff attached) and the master branch compiled cleanly. Is there still a
reason to keep it?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/commands/proclang.c b/src/backend/commands/proclang.c
index c849d65e62..d353b4e3f7 100644
--- a/src/backend/commands/proclang.c
+++ b/src/backend/commands/proclang.c
@@ -112,7 +112,6 @@ CreateProceduralLanguage(CreatePLangStmt *stmt)
 	namestrcpy(, languageName);
 	values[Anum_pg_language_lanname - 1] = NameGetDatum();
 	values[Anum_pg_language_lanowner - 1] = ObjectIdGetDatum(languageOwner);
-	values[Anum_pg_language_lanispl - 1] = BoolGetDatum(true);
 	values[Anum_pg_language_lanpltrusted - 1] = BoolGetDatum(stmt->pltrusted);
 	values[Anum_pg_language_lanplcallfoid - 1] = ObjectIdGetDatum(handlerOid);
 	values[Anum_pg_language_laninline - 1] = ObjectIdGetDatum(inlineOid);
diff --git a/src/include/catalog/pg_language.h b/src/include/catalog/pg_language.h
index 581372addd..7ff3fb0d08 100644
--- a/src/include/catalog/pg_language.h
+++ b/src/include/catalog/pg_language.h
@@ -36,9 +36,6 @@ CATALOG(pg_language,2612,LanguageRelationId)
 	/* Language's owner */
 	Oid			lanowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid);
 
-	/* Is a procedural language */
-	bool		lanispl BKI_DEFAULT(f);
-
 	/* PL is trusted */
 	bool		lanpltrusted BKI_DEFAULT(f);
 


Draft release notes for minor releases are up

2024-02-02 Thread Tom Lane
First-draft release notes for 16.2 are available at

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=87dcc5e45fad3021514f01360d3a2abd4e6480ee

Please send comments/corrections before Sunday.

regards, tom lane




Re: pgbench - adding pl/pgsql versions of tests

2024-02-02 Thread Hannu Krosing
Thanks for the update.

I will give it another go over the weekend

Cheers,
Hannu

On Thu, Feb 1, 2024 at 7:33 PM vignesh C  wrote:

> On Fri, 18 Aug 2023 at 23:04, Hannu Krosing  wrote:
> >
> > I will address the comments here over this coming weekend.
>
> The patch which you submitted has been awaiting your attention for
> quite some time now.  As such, we have moved it to "Returned with
> Feedback" and removed it from the reviewing queue. Depending on
> timing, this may be reversible.  Kindly address the feedback you have
> received, and resubmit the patch to the next CommitFest.
>
> Regards,
> Vignesh
>


Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-02 Thread Robert Haas
On Thu, Feb 1, 2024 at 10:51 AM Alvaro Herrera  wrote:
> I think this works similarly but not identically to tablespace defaults,
> and the difference could be confusing.  You seem to have made it so that
> the partitioned table _always_ have a table AM, so the partitions can
> always inherit from it.  I think it would be more sensible to _allow_
> partitioned tables to have one, but not mandatory; if they don't have
> it, then a partition created from it would use default_table_access_method.

I agree that we don't want this feature to invent any new behavior. If
it's clearly and fully parallel to what we do for tablespaces, then I
think it's probably OK, but anything less than that would be a cause
for concern for me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC: Extension for adding distributed tracing - pg_tracing

2024-02-02 Thread Jan Katins
Hi!

On Fri, 2 Feb 2024 at 13:41, Heikki Linnakangas  wrote:

> PS. Does any other DBMS implement this? Any lessons to be learned from
> them?
>

MySQL 8.3 has an OpenTelemetry component, so it is very fresh:

https://dev.mysql.com/doc/refman/8.3/en/telemetry-trace-install.html
https://blogs.oracle.com/mysql/post/mysql-telemetry-tracing-with-oci-apm

MySQL people are already waiting for PG to catch up (in German):
https://chaos.social/@isotopp/111421160719915296 :-)

Best regards,

Jan


Re: Commitfest 2024-01 is now closed

2024-02-02 Thread Robert Haas
On Fri, Feb 2, 2024 at 12:28 AM Tom Lane  wrote:
> Thanks for all the work you did running it!  CFM is typically a
> thankless exercise in being a nag, but I thought you put in more
> than the usual amount of effort to keep things moving.

+1. The extra effort was really noticeable.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cataloguing NOT NULL constraints

2024-02-02 Thread Alexander Lakhin

Hello Alvaro,

Please look at an anomaly introduced with b0e96f311.
The following script:
CREATE TABLE a ();
CREATE TABLE b (i int) INHERITS (a);
CREATE TABLE c () INHERITS (a, b);

ALTER TABLE a ADD COLUMN i int NOT NULL;

results in:
NOTICE:  merging definition of column "i" for child "b"
NOTICE:  merging definition of column "i" for child "c"
ERROR:  tuple already updated by self

(This is similar to bug #18297, but ATExecAddColumn() isn't guilty in this
case.)

Best regards,
Alexander




Re: XLog size reductions: smaller XLRec block header for PG17

2024-02-02 Thread Robert Haas
On Fri, Feb 2, 2024 at 8:52 AM Heikki Linnakangas  wrote:
> To shrink OIDs fields, you could refer to earlier WAL records. A special
> value for "same relation as in previous record", or something like that.
> Now we're just re-inventing LZ-style compression though. Might as well
> use LZ4 or Snappy or something to compress the whole WAL stream. It's a
> bit tricky to get the crash-safety right, but shouldn't be impossible.
>
> Has anyone seriously considered implementing wholesale compression of WAL?

I thought about the idea of referring to earlier WAL and/or undo
records when working on zheap. It seems tricky, because what if replay
starts after those WAL records and you can't refer back to them? It's
OK if you can make sure that you never depend on anything prior to the
latest checkpoint, but the natural way to make that work is to add
more looping like what we already do for FPIs, and then you have to
worry about whether that extra logic is going to be more expensive
than what you save. FPIs are so expensive that we can afford to go to
a lot of trouble to avoid them and still come out ahead, but storing
the full OID instead of an OID reference isn't nearly in the same
category.

I also thought about trying to refer to earlier items on the same
page, thinking that the locality would make things easier. But it
doesn't, because we don't know which page will ultimately contain the
WAL record until quite late, so we can't reason about what's on the
same page when constructing it.

Wholesale compression of WAL might run into some of the same issues,
e.g. if you don't want to compress each record individually, that
means you can't compress until you know the insert position. And even
then, if you want the previous data to be present in the compressor as
context, you almost need all the WAL compression to be done by a
single process. But if the LSNs are relative to the compressed stream,
you have to wait for that compression to finish before you can
determine the LSN of the next record, which seems super-painful, and
if they're relative to the uncompressed stream, then mapping them onto
fixed-size files gets tricky.

My hunch is that we can squeeze more out of the existing architecture
with a lot less work than it would take to do major rearchitecture
like compressing everything. I don't know how we can agree on a way of
doing that because everybody's got slightly different ideas about the
right way to do this. But if agreeing on how to evolve the system
we've got seems harder than rewriting it, we need to stop worrying
about WAL overhead and learn how to work together better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-02-02 Thread Alvaro Herrera
Hello,

The patched docs claim that PQrequestCancel is insecure, but neither the
code nor docs explain why.  The docs for PQcancel on the other hand do
mention that encryption is not used; does that apply to PQrequestCancel
as well and is that the reason?  If so, I think we should copy the
warning and perhaps include a code comment about that.  Also, maybe that
final phrase in PQcancel should be a box: remove from "So, for
example" and add "Because gssencmode and sslencmode are
not preserved from the original connection, the cancel request is not
encrypted." or something like that.


I wonder if Section 33.7 Canceling Queries in Progress should be split
in three subsections, and I propose the following order:

33.7.1 PGcancelConn-based Cancellation API
  PQcancelConn  -- we first document the basics
  PQcancelSend
  PQcancelFinish
  PQcancelPoll  -- the nonblocking interface is documented next
  PQcancelReset -- reuse a cancelconn, later in docs because it's more 
advanced
  PQcancelStatus-- accessors go last
  PQcancelSocket
  PQcancelErrorMessage

33.7.2 Obsolete interface
  PQgetCancel
  PQfreeCancel
  PQcancel

33.7.3 Deprecated and Insecure Methods
  PQrequestCancel

I have a hard time coming up with good subsection titles though.

Now, looking at this list, I think it's surprising that the nonblocking
request for a cancellation is called PQcancelPoll.  PQcancelSend() is at
odds with the asynchronous query API, which uses the verb "send" for the
asynchronous variants.  This would suggest that PQcancelPoll should
actually be called PQcancelSend or maybe PQcancelStart (mimicking
PQconnectStart).  I'm not sure what's a good alternative name for the
blocking one, which you have called PQcancelSend.

I see upthread that the names of these functions were already quite
heavily debated.  Sorry to beat that dead horse some more ... I'm just
not sure it's decided matter.

Lastly -- the doc blurbs that say simply "a version of XYZ that can be
used for cancellation connections" are a bit underwhelming.  Shouldn't
we document these more fully instead of making users go read the docs
for the other functions and wonder what the differences might be, if
any?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Before you were born your parents weren't as boring as they are now. They
got that way paying your bills, cleaning up your room and listening to you
tell them how idealistic you are."  -- Charles J. Sykes' advice to teenagers




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-02-02 Thread Jelte Fennema-Nio
On Fri, 2 Feb 2024 at 13:19, Alvaro Herrera  wrote:
> Thank you, looks good.
>
> I propose the following minor/trivial fixes over your initial 3 patches.

All of those seem good like fixes. Attached is an updated patchset
where they are all applied. As well as adding a missing word ("been")
in a comment that I noticed while reading your fixes.
From 7736e940567878c32355c2143cddba3b13bfa71e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Fri, 26 Jan 2024 16:47:51 +0100
Subject: [PATCH v30 3/5] libpq: Change some static functions to extern

This is in preparation of a follow up commit that starts using these
functions from fe-cancel.c.
---
 src/interfaces/libpq/fe-connect.c | 87 +++
 src/interfaces/libpq/libpq-int.h  |  6 +++
 2 files changed, 47 insertions(+), 46 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 079abfca9e..7d8616eb6d 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -387,15 +387,10 @@ static const char uri_designator[] = "postgresql://";
 static const char short_uri_designator[] = "postgres://";
 
 static bool connectOptions1(PGconn *conn, const char *conninfo);
-static bool connectOptions2(PGconn *conn);
-static int	connectDBStart(PGconn *conn);
-static int	connectDBComplete(PGconn *conn);
 static PGPing internal_ping(PGconn *conn);
-static PGconn *makeEmptyPGconn(void);
 static void pqFreeCommandQueue(PGcmdQueueEntry *queue);
 static bool fillPGconn(PGconn *conn, PQconninfoOption *connOptions);
 static void freePGconn(PGconn *conn);
-static void closePGconn(PGconn *conn);
 static void release_conn_addrinfo(PGconn *conn);
 static int	store_conn_addrinfo(PGconn *conn, struct addrinfo *addrlist);
 static void sendTerminateConn(PGconn *conn);
@@ -644,8 +639,8 @@ pqDropServerData(PGconn *conn)
  * PQconnectStart or PQconnectStartParams (which differ in the same way as
  * PQconnectdb and PQconnectdbParams) and PQconnectPoll.
  *
- * Internally, the static functions connectDBStart, connectDBComplete
- * are part of the connection procedure.
+ * The non-exported functions pqConnectDBStart, pqConnectDBComplete are
+ * part of the connection procedure implementation.
  */
 
 /*
@@ -678,7 +673,7 @@ PQconnectdbParams(const char *const *keywords,
 	PGconn	   *conn = PQconnectStartParams(keywords, values, expand_dbname);
 
 	if (conn && conn->status != CONNECTION_BAD)
-		(void) connectDBComplete(conn);
+		(void) pqConnectDBComplete(conn);
 
 	return conn;
 }
@@ -731,7 +726,7 @@ PQconnectdb(const char *conninfo)
 	PGconn	   *conn = PQconnectStart(conninfo);
 
 	if (conn && conn->status != CONNECTION_BAD)
-		(void) connectDBComplete(conn);
+		(void) pqConnectDBComplete(conn);
 
 	return conn;
 }
@@ -785,7 +780,7 @@ PQconnectStartParams(const char *const *keywords,
 	 * to initialize conn->errorMessage to empty.  All subsequent steps during
 	 * connection initialization will only append to that buffer.
 	 */
-	conn = makeEmptyPGconn();
+	conn = pqMakeEmptyPGconn();
 	if (conn == NULL)
 		return NULL;
 
@@ -819,15 +814,15 @@ PQconnectStartParams(const char *const *keywords,
 	/*
 	 * Compute derived options
 	 */
-	if (!connectOptions2(conn))
+	if (!pqConnectOptions2(conn))
 		return conn;
 
 	/*
 	 * Connect to the database
 	 */
-	if (!connectDBStart(conn))
+	if (!pqConnectDBStart(conn))
 	{
-		/* Just in case we failed to set it in connectDBStart */
+		/* Just in case we failed to set it in pqConnectDBStart */
 		conn->status = CONNECTION_BAD;
 	}
 
@@ -863,7 +858,7 @@ PQconnectStart(const char *conninfo)
 	 * to initialize conn->errorMessage to empty.  All subsequent steps during
 	 * connection initialization will only append to that buffer.
 	 */
-	conn = makeEmptyPGconn();
+	conn = pqMakeEmptyPGconn();
 	if (conn == NULL)
 		return NULL;
 
@@ -876,15 +871,15 @@ PQconnectStart(const char *conninfo)
 	/*
 	 * Compute derived options
 	 */
-	if (!connectOptions2(conn))
+	if (!pqConnectOptions2(conn))
 		return conn;
 
 	/*
 	 * Connect to the database
 	 */
-	if (!connectDBStart(conn))
+	if (!pqConnectDBStart(conn))
 	{
-		/* Just in case we failed to set it in connectDBStart */
+		/* Just in case we failed to set it in pqConnectDBStart */
 		conn->status = CONNECTION_BAD;
 	}
 
@@ -895,7 +890,7 @@ PQconnectStart(const char *conninfo)
  * Move option values into conn structure
  *
  * Don't put anything cute here --- intelligence should be in
- * connectOptions2 ...
+ * pqConnectOptions2 ...
  *
  * Returns true on success. On failure, returns false and sets error message.
  */
@@ -933,7 +928,7 @@ fillPGconn(PGconn *conn, PQconninfoOption *connOptions)
  *
  * Internal subroutine to set up connection parameters given an already-
  * created PGconn and a conninfo string.  Derived settings should be
- * processed by calling connectOptions2 next.  (We split them because
+ * processed by calling pqConnectOptions2 next.  (We split them because
  * PQsetdbLogin overrides 

Re: XLog size reductions: smaller XLRec block header for PG17

2024-02-02 Thread Heikki Linnakangas

On 22/01/2024 19:23, Robert Haas wrote:

In the case of this particular patch, I think the problem is that
there's no consensus on the design. There's not a ton of debate on
this thread, but thread [1] linked in the original post contains a lot
of vigorous debate about what the right thing to do is here and I
don't believe we reached any meeting of the minds.


Yeah, so it seems.


It looks like I never replied to
https://www.postgresql.org/message-id/20221019192130.ebjbycpw6bzjry4v%40awork3.anarazel.de
but, FWIW, I agree with Andres that applying the same technique to
multiple fields that are stored together (DB OID, TS OID, rel #, block
#) is unlikely in practice to produce many cases that regress. But the
question for this thread is really more about whether we're OK with
using ad-hoc bit swizzling to reduce the size of xlog records or
whether we want to insist on the use of a uniform varint encoding.
Heikki and Andres both seem to favor the latter. IIRC, I was initially
more optimistic about ad-hoc bit swizzling being a potentially
acceptable technique, but I'm not convinced enough about it to argue
against two very smart committers both of whom know more about
micro-optimizing performance than I do, and nobody else seems to be
making this argument on this thread either, so I just don't really see
how this patch is ever going to go anywhere in its current form.


I don't have a clear idea of how to proceed with this either. Some 
thoughts I have:


Using varint encoding makes sense for length fields. The common values 
are small, and if a length of anything is large, then the size of the 
length field itself is insignificant compared to the actual data.
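
For illustration, here is a standalone sketch of the kind of varint
encoding meant here (unsigned LEB128-style; this is generic C, not
PostgreSQL code, and the function name is made up): each output byte
carries 7 payload bits plus a continuation flag, so values below 128
cost a single byte.

#include <stdint.h>
#include <stddef.h>

static size_t
encode_varint(uint64_t value, uint8_t *out)
{
	size_t		n = 0;

	do
	{
		uint8_t		b = value & 0x7F;	/* low 7 payload bits */

		value >>= 7;
		if (value != 0)
			b |= 0x80;					/* more bytes follow */
		out[n++] = b;
	} while (value != 0);

	return n;		/* 1 byte for values < 128, at most 10 for 64-bit input */
}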


I don't like using varint encoding for OID. They might be small in 
common cases, but it feels wrong to rely on that. They're just arbitrary 
numbers. We could pick them randomly, it's just an implementation detail 
that we use a counter to choose the next one. I really dislike the idea 
that someone would do a pg_dump + restore, just to get smaller OIDs and 
smaller WAL as a result.


It does make sense to have a fast-path (small-path?) for 0 OIDs though.

To shrink OIDs fields, you could refer to earlier WAL records. A special 
value for "same relation as in previous record", or something like that. 
Now we're just re-inventing LZ-style compression though. Might as well 
use LZ4 or Snappy or something to compress the whole WAL stream. It's a 
bit tricky to get the crash-safety right, but shouldn't be impossible.


Has anyone seriously considered implementing wholesale compression of WAL?

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: InstallXLogFileSegment() vs concurrent WAL flush

2024-02-02 Thread Thomas Munro
On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA  wrote:
> On Fri, 2 Feb 2024 11:18:18 +0100
> Thomas Munro  wrote:
> > One simple way to address that would be to make XLogFileInitInternal()
> > wait for InstallXLogFileSegment() to finish.  It's a little
>
> Or, can we make sure the rename is durable by calling fsync before
> returning the fd, as a patch attached here?

Right, yeah, that works too.  I'm not sure which way is better.




Re: POC: Extension for adding distributed tracing - pg_tracing

2024-02-02 Thread Heikki Linnakangas

Hi Anthonin,

I'm only now catching up on this thread. Very exciting feature!

My first observation is that you were able to implement this purely as 
an extension, without any core changes. That's very cool, because:


- You don't need buy-in or blessing from anyone else. You can just put 
this on github and people can immediately start using it.
- As an extension, it's not tied to the PostgreSQL release cycle. Users 
can immediately start using it with existing PostgreSQL versions.


There are benefits to being in contrib/ vs. an extension outside the 
PostgreSQL source tree, but there are significant downsides too. 
Out-of-tree, you get more flexibility, and you can develop much faster. 
I think this should live out-of-tree, at least until it's relatively 
stable. By stable, I mean "not changing much", not that it's bug-free.


# Core changes

That said, there are a lot of things that would make sense to integrate 
in PostgreSQL itself. For example:


- You're relying on various existing hooks, but it might be better to 
have the 'begin_span'/'end_span' calls directly in PostgreSQL source 
code in the right places. There are more places where spans might be 
nice, like when performing a large sort in tuplesort.c to name one 
example. Or even across every I/O; not sure how much overhead the spans 
incur and how much we'd be willing to accept. 'begin_span'/'end_span' 
could be new hooks that normally do nothing, but your extension would 
implement them (a rough sketch of that hook shape follows after this list).


- Other extensions could include begin_span/end_span calls too. (It's 
complicated for an extension to call functions in another extension.)


- Extensions like postgres_fdw could get the tracing context and 
propagate it to the remote system too.


- How to pass the tracing context from the client? You went with 
SQLCommenter and others proposed a GUC. A protocol extension would make 
sense too. I can see merits to all of those. It probably would be good 
to support multiple mechanisms, but some might need core changes. It 
might be good to implement the mechanism for accepting trace context in 
core. Without the extension, we could still include the trace ID in the 
logs, for example.
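
Returning to the begin_span/end_span idea in the first point: a rough 
sketch of what such no-op core hooks might look like, following the usual 
PostgreSQL hook conventions. The names, signatures and macros below are 
purely hypothetical; nothing like this exists in core today.

typedef void (*begin_span_hook_type) (const char *span_name);
typedef void (*end_span_hook_type) (void);

extern PGDLLIMPORT begin_span_hook_type begin_span_hook;
extern PGDLLIMPORT end_span_hook_type end_span_hook;

/* core call sites stay near-free when no tracing extension is loaded */
#define BEGIN_SPAN(name) \
	do { if (begin_span_hook) begin_span_hook(name); } while (0)
#define END_SPAN() \
	do { if (end_span_hook) end_span_hook(); } while (0)

An extension would then fill in the pointers from its _PG_init(), much 
like the existing *_hook variables.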


# Further ideas

Some ideas on cool extra features on top of this:

- SQL functions to begin/end a span. Could be handy if you have 
complicated PL/pgSQL functions, for example.

- Have spans over subtransactions.
- Include elog() messages in the traces. You might want to have a lower 
elevel for what's included in traces; something like the 
client_min_messages and log_min_messages settings.

- Include EXPLAIN plans in the traces.

# Comments on the implementation

There was discussion already on push vs pull model. Currently, you 
collect the traces in memory / files, and require a client to poll them. 
A push model is more common in applications that support tracing. If 
Postgres could connect directly to the OTEL collector, you'd need one 
fewer running component. You could keep the traces in backend-private 
memory, no need to deal with shared memory and spilling to files.


Admittedly supporting the OTEL wire protocol is complicated unless you 
use an existing library. Nevertheless, it might be a better tradeoff. 
OTEL/HTTP with JSON format seems just about feasible to implement from 
scratch. Or bite the bullet and use some external library. If this lives 
as an extension, you have more freedom to rely on 3rd party libraries.


(If you publish this as an out-of-tree extension, then this is up to 
you, of course, and doesn't need consensus here on pgsql-hackers)


# Suggested next steps with this

Here's how I'd suggest to proceed with this:

1. Publish this as an extension on github, as it is. I think you'll get 
a lot of immediate adoption.


2. Start a new pgsql-hackers thread on in-core changes that would 
benefit the extension. Write patches for them. I'm thinking of the 
things I listed in the Core changes section above, but maybe there are 
others.



PS. Does any other DBMS implement this? Any lessons to be learned from them?

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-02-02 Thread Alvaro Herrera
On 2024-Jan-29, Jelte Fennema-Nio wrote:

> On Mon, 29 Jan 2024 at 12:44, Alvaro Herrera  wrote:
> > Thanks!  I committed 0001 now.  I also renamed the new
> > pq_parse_int_param to pqParseIntParam, for consistency with other
> > routines there.  Please rebase the other patches.
> 
> Awesome! Rebased, and renamed pq_release_conn_hosts to
> pqReleaseConnHosts for the same consistency reasons.

Thank you, looks good.

I propose the following minor/trivial fixes over your initial 3 patches.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)
From 92ca8dc2739a777ff5a0df990d6e9818c5729ac5 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 2 Feb 2024 10:57:02 +0100
Subject: [PATCH 1/4] pgindent

---
 src/interfaces/libpq/fe-cancel.c | 14 +++---
 src/interfaces/libpq/libpq-fe.h  | 17 +
 src/tools/pgindent/typedefs.list |  1 +
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/src/interfaces/libpq/fe-cancel.c b/src/interfaces/libpq/fe-cancel.c
index 7416791d9f..d75c9628e7 100644
--- a/src/interfaces/libpq/fe-cancel.c
+++ b/src/interfaces/libpq/fe-cancel.c
@@ -137,7 +137,7 @@ oom_error:
  * Returns 1 if successful 0 if not.
  */
 int
-PQcancelSend(PGcancelConn * cancelConn)
+PQcancelSend(PGcancelConn *cancelConn)
 {
if (!cancelConn || cancelConn->conn.status == CONNECTION_BAD)
return 1;
@@ -157,7 +157,7 @@ PQcancelSend(PGcancelConn * cancelConn)
  * Poll a cancel connection. For usage details see PQconnectPoll.
  */
 PostgresPollingStatusType
-PQcancelPoll(PGcancelConn * cancelConn)
+PQcancelPoll(PGcancelConn *cancelConn)
 {
PGconn *conn = (PGconn *) cancelConn;
int n;
@@ -249,7 +249,7 @@ PQcancelPoll(PGcancelConn * cancelConn)
  * Get the status of a cancel connection.
  */
 ConnStatusType
-PQcancelStatus(const PGcancelConn * cancelConn)
+PQcancelStatus(const PGcancelConn *cancelConn)
 {
return PQstatus((const PGconn *) cancelConn);
 }
@@ -260,7 +260,7 @@ PQcancelStatus(const PGcancelConn * cancelConn)
  * Get the socket of the cancel connection.
  */
 int
-PQcancelSocket(const PGcancelConn * cancelConn)
+PQcancelSocket(const PGcancelConn *cancelConn)
 {
return PQsocket((const PGconn *) cancelConn);
 }
@@ -271,7 +271,7 @@ PQcancelSocket(const PGcancelConn * cancelConn)
  * Get the socket of the cancel connection.
  */
 char *
-PQcancelErrorMessage(const PGcancelConn * cancelConn)
+PQcancelErrorMessage(const PGcancelConn *cancelConn)
 {
return PQerrorMessage((const PGconn *) cancelConn);
 }
@@ -283,7 +283,7 @@ PQcancelErrorMessage(const PGcancelConn * cancelConn)
  * request.
  */
 void
-PQcancelReset(PGcancelConn * cancelConn)
+PQcancelReset(PGcancelConn *cancelConn)
 {
pqClosePGconn((PGconn *) cancelConn);
cancelConn->conn.status = CONNECTION_STARTING;
@@ -299,7 +299,7 @@ PQcancelReset(PGcancelConn * cancelConn)
  * Closes and frees the cancel connection.
  */
 void
-PQcancelFinish(PGcancelConn * cancelConn)
+PQcancelFinish(PGcancelConn *cancelConn)
 {
PQfinish((PGconn *) cancelConn);
 }
diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h
index 857ba54d94..851e549355 100644
--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -329,17 +329,18 @@ extern PostgresPollingStatusType PQresetPoll(PGconn 
*conn);
 extern void PQreset(PGconn *conn);
 
 /* Create a PGcancelConn that's used to cancel a query on the given PGconn */
-extern PGcancelConn * PQcancelConn(PGconn *conn);
+extern PGcancelConn *PQcancelConn(PGconn *conn);
+
 /* issue a blocking cancel request */
-extern int PQcancelSend(PGcancelConn * conn);
+extern int PQcancelSend(PGcancelConn *conn);
 
 /* issue or poll a non-blocking cancel request */
-extern PostgresPollingStatusType PQcancelPoll(PGcancelConn * cancelConn);
-extern ConnStatusType PQcancelStatus(const PGcancelConn * cancelConn);
-extern int PQcancelSocket(const PGcancelConn * cancelConn);
-extern char *PQcancelErrorMessage(const PGcancelConn * cancelConn);
-extern void PQcancelReset(PGcancelConn * cancelConn);
-extern void PQcancelFinish(PGcancelConn * cancelConn);
+extern PostgresPollingStatusType PQcancelPoll(PGcancelConn *cancelConn);
+extern ConnStatusType PQcancelStatus(const PGcancelConn *cancelConn);
+extern int PQcancelSocket(const PGcancelConn *cancelConn);
+extern char *PQcancelErrorMessage(const PGcancelConn *cancelConn);
+extern void PQcancelReset(PGcancelConn *cancelConn);
+extern void PQcancelFinish(PGcancelConn *cancelConn);
 
 
 /* request a cancel structure */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b..9ffb169e9d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ 

Re: Synchronizing slots from primary to standby

2024-02-02 Thread shveta malik
On Fri, Feb 2, 2024 at 12:25 PM Peter Smith  wrote:
>
> Here are some review comments for v750002.

Thanks for the feedback Peter. Addressed all in v76 except one.

> (this is a WIP but this is what I found so far...)

> I wonder if it is better to log all the problems in one go instead of
> making users stumble onto them one at a time after fixing one and then
> hitting the next problem. e.g. just set some variable "all_ok =
> false;" each time instead of all the "return false;"
>
> Then at the end of the function just "return all_ok;"

If we do it this way, then we need to find a way to combine the messages as
well; otherwise the same message will be repeated multiple times. For the
functionality concerned (which needs a one-time configuration effort by the
user), I feel the existing way looks okay. We may consider optimizing it if
we get more comments here.

thanks
Shveta




Re: Checking MINIMUM_VERSION_FOR_WAL_SUMMARIES

2024-02-02 Thread Nazir Bilal Yavuz
Hi,

On Fri, 2 Feb 2024 at 12:11, Artur Zakirov  wrote:
>
> On Fri, 2 Feb 2024 at 09:41, Nazir Bilal Yavuz  wrote:
> > You seem right, nice catch. Also, this change makes the check in
> >
> > snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
> >  basedir,
> >  PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
> >  "pg_xlog" : "pg_wal");
> >
> > redundant. PQserverVersion(conn) will always be higher than
> > MINIMUM_VERSION_FOR_PG_WAL.
>
> Thank you both for the comments. Indeed, that part now looks redundant.
> I've attached a patch to remove checking MINIMUM_VERSION_FOR_PG_WAL.

Thanks for the update. The patch looks good to me.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: InstallXLogFileSegment() vs concurrent WAL flush

2024-02-02 Thread Yugo NAGATA
On Fri, 2 Feb 2024 11:18:18 +0100
Thomas Munro  wrote:

> Hi,
> 
> New WAL space is created by renaming a file into place.  Either a
> newly created file with a temporary name or, ideally, a recyclable old
> file with a name derived from an old LSN.  I think there is a data
> loss window between rename() and fsync(parent_directory).  A
> concurrent backend might open(new_name), write(), fdatasync(), and
> then we might lose power before the rename hits the disk.  The data
> itself would survive the crash, but recovery wouldn't be able to find
> and replay it.  That might break the log-before-data rule or forget a
> transaction that has been reported as committed to a client.
> 
> Actual breakage would presumably require really bad luck, and I
> haven't seen this happen or anything, it just occurred to me while
> reading code, and I can't see any existing defences.
> 
> One simple way to address that would be to make XLogFileInitInternal()
> wait for InstallXLogFileSegment() to finish.  It's a little

Or, can we make sure the rename is durable by calling fsync before
returning the fd, as a patch attached here?

Regards,
Yugo Nagata

> pessimistic to do that unconditionally, though, as then you have to
> wait even for rename operations for segment files later than the one
> you're opening, so I thought about how to skip waiting in that case --
> see 0002.  I'm not sure if it's worth worrying about or not.


-- 
Yugo NAGATA 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..862109ca3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3062,7 +3062,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 errmsg("could not open file \"%s\": %m", path)));
 	}
 	else
+	{
+		fsync_fname_ext(path, false, false, ERROR);
+		fsync_parent_path(path, ERROR);
 		return fd;
+	}
 
 	/*
 	 * Initialize an empty (all zeroes) segment.  NOTE: it is possible that


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-02 Thread John Naylor
On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada  wrote:
> I've attached the new patch set (v56). I've squashed previous updates
> and addressed review comments on v55 in separate patches. Here are the
> update summary:
>
> 0004: fix compiler warning caught by ci test.
> 0005-0008: address review comments on radix tree codes.
> 0009: cleanup #define and #undef
> 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
> undefined after including radixtree.h so we should not use it in test
> code.

Great, thanks!

I have a few questions and comments on v56, then I'll address yours
below with the attached v57, which is mostly cosmetic adjustments.

v56-0003:

(Looking closer at tests)

+static const bool rt_test_stats = false;

I'm thinking we should just remove everything that depends on this,
and keep this module entirely about correctness.

+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);

I'm not sure what the test_node_types_* functions are testing that
test_basic doesn't. They have a different, and confusing, way to stop
at every size class and check the keys/values. It seems we can replace
all that with two more calls (asc/desc) to test_basic, with the
maximum level.

It's pretty hard to see what test_pattern() is doing, or why it's
useful. I wonder if instead the test could use something like the
benchmark where random integers are masked off. That seems simpler. I
can work on that, but I'd like to hear your side about test_pattern().

v56-0007:

+ *
+ * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory,
+ * the caller doesn't need to take a lock.

Maybe something like "Since dsa_get_total_size() does appropriate locking ..."?

v56-0008

Thanks, I like how the tests look now.

-NOTICE:  testing node   4 with height 0 and  ascending keys
...
+NOTICE:  testing node   1 with height 0 and  ascending keys

Now that the number is not intended to match a size class, "node X"
seems out of place. Maybe we could have a separate array with strings?
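
Something like the following, for instance (a sketch only; the array name,
labels, and loop variables are made up here):

/* name table parallel to the per-class test parameters */
static const char *const rt_class_names[] = {
	"node-4",		/* RT_CLASS_4 */
	"node-16-lo",	/* RT_CLASS_16_LO */
	"node-16-hi",	/* RT_CLASS_16_HI */
	"node-48",		/* RT_CLASS_48 */
	"node-256",		/* RT_CLASS_256 */
};

/* ... and the NOTICE could then read: */
elog(NOTICE, "testing %s with height %d and %s keys",
	 rt_class_names[i], height, asc ? "ascending" : "descending");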

+ 1, /* RT_CLASS_4 */

This should be more than one, so that the basic test still exercises
paths that shift elements around.

+ 100, /* RT_CLASS_48 */

This node currently holds 64 for local memory.

+ 255 /* RT_CLASS_256 */

This is the only one where we know exactly how many it can take, so
may as well keep it at 256.

v56-0012:

The test module for tidstore could use a few more comments.

v56-0015:

+typedef dsa_pointer TidStoreHandle;
+

-TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
 {
  TidStore *ts;
+ dsa_pointer rt_dp = handle;

My earlier opinion was that "handle" was a nicer variable name, but
this brings back the typedef and also keeps the variable name I didn't
like, but pushes it down into the function. I'm a bit confused, so
I've kept these not-squashed for now.

---

Now, for v57:

> Looking at overall changes, there are still XXX and TODO comments in
> radixtree.h:

That's fine, as long as it's intentional as a message to readers. That
said, we can get rid of some:

> ---
>  * XXX There are 4 node kinds, and this should never be increased,
>  * for several reasons:
>  * 1. With 5 or more kinds, gcc tends to use a jump table for switch
>  *statements.
>  * 2. The 4 kinds can be represented with 2 bits, so we have the option
>  *in the future to tag the node pointer with the kind, even on
>  *platforms with 32-bit pointers. This might speed up node traversal
>  *in trees with highly random node kinds.
>  * 3. We can have multiple size classes per node kind.
>
> Can we just remove "XXX"?

How about "NOTE"?

> ---
>  * WIP: notes about traditional radix tree trading off span vs height...
>
> Are you going to write it?

Yes, when I draft a rough commit message, (for next time).

> ---
> #ifdef RT_SHMEM
> /*  WIP: do we really need this? */
> typedef dsa_pointer RT_HANDLE;
> #endif
>
> I think it's worth having it.

Okay, removed WIP in v57-0004.

> ---
>  * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
>  * inside a single bitmapword on most platforms, so it's a good starting
>  * point. We can make it higher if we need to.
>  */
> #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)
>
> Are you going to work something on this?

Hard-coded 64 for readability, and changed this paragraph to explain
the current rationale more clearly:

"The paper uses at most 64 for this node kind, and one advantage for us
is that "isset" is a single bitmapword on most platforms, rather than
an array, allowing the compiler to get rid of loops."

> ---
> /* WIP: We could go first to the higher node16 size class */
> newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);
>
> Does it mean to go to RT_CLASS_16_HI and then further go to
> RT_CLASS_16_LO upon further deletion?

Yes. It wouldn't be much work to make 

Re: An improvement on parallel DISTINCT

2024-02-02 Thread David Rowley
On Fri, 2 Feb 2024 at 23:39, David Rowley  wrote:
> I'll push the patch shortly.

I've pushed the partial path sort part.

Now for the other stuff you had.   I didn't really like this part:

+ /*
+ * Set target for partial_distinct_rel as generate_useful_gather_paths
+ * requires that the input rel has a valid reltarget.
+ */
+ partial_distinct_rel->reltarget = cheapest_partial_path->pathtarget;

I think we should just make it work the same way as
create_grouping_paths(), where grouping_target is passed as a
parameter.

I've done it that way in the attached.

David
diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index 84c4de488a..d404fbf262 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3053,10 +3053,10 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo 
*rel, RangeTblEntry *rte)
  *
  * If we're generating paths for a scan or join relation, override_rows will
  * be false, and we'll just use the relation's size estimate.  When we're
- * being called for a partially-grouped path, though, we need to override
- * the rowcount estimate.  (It's not clear that the particular value we're
- * using here is actually best, but the underlying rel has no estimate so
- * we must do something.)
+ * being called for a partially-grouped or partially-distinct path, though, we
+ * need to override the rowcount estimate.  (It's not clear that the
+ * particular value we're using here is actually best, but the underlying rel
+ * has no estimate so we must do something.)
  */
 void
 generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
diff --git a/src/backend/optimizer/plan/planner.c 
b/src/backend/optimizer/plan/planner.c
index acc324122f..be4e182869 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -190,10 +190,12 @@ static void create_one_window_path(PlannerInfo *root,
   
WindowFuncLists *wflists,
   List 
*activeWindows);
 static RelOptInfo *create_distinct_paths(PlannerInfo *root,
-   
 RelOptInfo *input_rel);
+   
 RelOptInfo *input_rel,
+   
 PathTarget *target);
 static void create_partial_distinct_paths(PlannerInfo *root,

  RelOptInfo *input_rel,
-   
  RelOptInfo *final_distinct_rel);
+   
  RelOptInfo *final_distinct_rel,
+   
  PathTarget *target);
 static RelOptInfo *create_final_distinct_paths(PlannerInfo *root,

   RelOptInfo *input_rel,

   RelOptInfo *distinct_rel);
@@ -1644,8 +1646,8 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 */
root->upper_targets[UPPERREL_FINAL] = final_target;
root->upper_targets[UPPERREL_ORDERED] = final_target;
-   root->upper_targets[UPPERREL_PARTIAL_DISTINCT] = 
sort_input_target;
root->upper_targets[UPPERREL_DISTINCT] = sort_input_target;
+   root->upper_targets[UPPERREL_PARTIAL_DISTINCT] = 
sort_input_target;
root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;
 
@@ -1695,7 +1697,8 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
if (parse->distinctClause)
{
current_rel = create_distinct_paths(root,
-   
current_rel);
+   
current_rel,
+   
sort_input_target);
}
}   /* end of if 
(setOperations) */
 
@@ -4568,12 +4571,14 @@ create_one_window_path(PlannerInfo *root,
  * Build a new upperrel containing Paths for SELECT DISTINCT evaluation.
  *
  * input_rel: contains the source-data Paths
+ * target: the pathtarget for the result Paths to compute
  *
  * Note: input paths should already compute the desired pathtarget, since
  * Sort/Unique won't project anything.
  */
 static 

Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)

2024-02-02 Thread Ranier Vilela
Em sex., 2 de fev. de 2024 às 03:48, Michael Paquier 
escreveu:

> On Fri, Feb 02, 2024 at 12:04:39AM +0530, vignesh C wrote:
> > The patch which you submitted has been awaiting your attention for
> > quite some time now.  As such, we have moved it to "Returned with
> > Feedback" and removed it from the reviewing queue. Depending on
> > timing, this may be reversible.  Kindly address the feedback you have
> > received, and resubmit the patch to the next CommitFest.
>
> Even with that, it seems to me that this is not required now that
> 21d9c3ee4ef7 outlines better how long SMgrRelation pointers should
> live, no?
>
Correct, Michael; the best thing would be to remove the patch now,
since it has completely lost its meaning.

Best regards,
Ranier Vilela


Re: An improvement on parallel DISTINCT

2024-02-02 Thread David Rowley
On Fri, 2 Feb 2024 at 20:47, Richard Guo  wrote:
>
>
> On Fri, Feb 2, 2024 at 11:26 AM David Rowley  wrote:
>>
>> In light of this, do you still think it's worthwhile making this change?
>>
>> For me, I think all it's going to result in is extra planner work
>> without any performance gains.
>
>
> Hmm, with the query below, I can see that the new plan is cheaper than
> the old plan, and the cost difference exceeds STD_FUZZ_FACTOR.
>
> create table t (a int, b int);
> insert into t select i%10, i from generate_series(1,1000)i;
> analyze t;
>
> explain (costs on) select distinct a from t order by a limit 1;

OK, a LIMIT clause... I didn't think of that.  Given the test results
below, I'm pretty convinced we should make the change.

Performance testing on an AMD 3990x with work_mem=4MB and hash_mem_multiplier=2.

$ cat bench.sql
select distinct a from t order by a limit 1;
$ pgbench -n -T 60 -f bench.sql postgres

-- Master

max_parallel_workers_per_gather=2;
latency average = 470.310 ms
latency average = 468.673 ms
latency average = 469.463 ms

max_parallel_workers_per_gather=4;
latency average = 346.012 ms
latency average = 346.662 ms
latency average = 347.591 ms

max_parallel_workers_per_gather=8; + alter table t set (parallel_workers=8);
latency average = 300.298 ms
latency average = 300.029 ms
latency average = 300.314 ms

-- Patched

max_parallel_workers_per_gather=2;
latency average = 424.176 ms
latency average = 431.870 ms
latency average = 431.870 ms (9.36% faster than master)

max_parallel_workers_per_gather=4;
latency average = 279.837 ms
latency average = 280.893 ms
latency average = 281.518 ms (23.51% faster than master)

max_parallel_workers_per_gather=8; + alter table t set (parallel_workers=8);
latency average = 178.585 ms
latency average = 178.780 ms
latency average = 179.768 ms (67.68% faster than master)

So the gains increase with more parallel workers due to pushing more
work to the worker. Amdahl's law approves of this.

I'll push the patch shortly.

David




InstallXLogFileSegment() vs concurrent WAL flush

2024-02-02 Thread Thomas Munro
Hi,

New WAL space is created by renaming a file into place.  Either a
newly created file with a temporary name or, ideally, a recyclable old
file with a name derived from an old LSN.  I think there is a data
loss window between rename() and fsync(parent_directory).  A
concurrent backend might open(new_name), write(), fdatasync(), and
then we might lose power before the rename hits the disk.  The data
itself would survive the crash, but recovery wouldn't be able to find
and replay it.  That might break the log-before-data rule or forget a
transaction that has been reported as committed to a client.
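
For context, the general "durable rename" pattern that this window is
about, sketched with plain POSIX calls (generic C, not the actual
xlog.c code): the new name only becomes crash-safe once the containing
directory has been fsync'd, and anything that relies on the new name
before that fsync sits in the window described above.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int
durable_rename_sketch(const char *oldpath, const char *newpath,
					  const char *parentdir)
{
	int			dirfd;

	if (rename(oldpath, newpath) < 0)
		return -1;

	dirfd = open(parentdir, O_RDONLY);
	if (dirfd < 0)
		return -1;
	if (fsync(dirfd) < 0)		/* makes the rename itself durable */
	{
		close(dirfd);
		return -1;
	}
	return close(dirfd);
}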

Actual breakage would presumably require really bad luck, and I
haven't seen this happen or anything, it just occurred to me while
reading code, and I can't see any existing defences.

One simple way to address that would be to make XLogFileInitInternal()
wait for InstallXLogFileSegment() to finish.  It's a little
pessimistic to do that unconditionally, though, as then you have to
wait even for rename operations for segment files later than the one
you're opening, so I thought about how to skip waiting in that case --
see 0002.  I'm not sure if it's worth worrying about or not.


0001-Fix-InstallXLogFileSegment-concurrency-bug.patch
Description: Binary data


0002-Track-end-of-installed-WAL-space-in-shared-memory.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-02-02 Thread Bertrand Drouvot
Hi,

On Fri, Feb 02, 2024 at 12:25:30PM +0530, Amit Kapila wrote:
> On Thu, Feb 1, 2024 at 5:29 PM shveta malik  wrote:
> >
> > On Thu, Feb 1, 2024 at 2:35 PM Amit Kapila  wrote:
> > >
> > > Agreed, and I am fine with merging 0001, 0002, and 0004 as suggested
> > > by you though I have a few minor comments on 0002 and 0004. I was
> > > thinking about what will be a logical way to split the slot sync
> > > worker patch (combined result of 0001, 0002, and 0004), and one idea
> > > occurred to me is that we can have the first patch as
> > > synchronize_solts() API and the functionality required to implement
> > > that API then the second patch would be a slot sync worker which uses
> > > that API to synchronize slots and does all the required validations.
> > > Any thoughts?
> >
> > If we shift 'synchronize_slots()' to the first patch but there is no
> > caller of it, we may have a compiler warning for the same. The only
> > way it can be done is if we temporarily add SQL function on standby
> > which uses 'synchronize_slots()'. This SQL function can then be
> > removed in later patches where we actually have a caller for
> > 'synchronize_slots'.
> >
> 
> Can such a SQL function say pg_synchronize_slots() which can sync all
> slots that have a failover flag set be useful in general apart from
> just writing tests for this new API? I am thinking maybe users want
> more control over when to sync the slots and write their bgworker or
> simply do it just before shutdown once (sort of planned switchover) or
> at some other pre-defined times.

Big +1 for having this kind of function in users' hands (as the standby's slots
may be lagging behind during a switchover, for example). As for the name, I think
it would make sense to add "replication" or "repl", something like
pg_sync_replication_slots()? (that would be aligned with
pg_create_logical_replication_slot() and friends).

> BTW, we also have
> pg_log_standby_snapshot() which otherwise would be done periodically
> by background processes.
> 
> >
> > 1) Re-arranged the patches:
> > 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are
> > separated out in v75-001 as those are independent changes.
> 
> Bertrand, Sawada-San, and others, do you see a problem with such a
> split? Can we go ahead with v75_0001 separately after fixing the open
> comments?

I think that makes sense, especially if we're also creating a user-callable
function to sync the slot(s) at will.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2024-02-02 Thread jian he
On Fri, Feb 2, 2024 at 5:48 PM Alvaro Herrera  wrote:
>
> If you want the server to send this message when the JSON word is not in
> quotes, I'm afraid that's not possible, due to the funny nature of the
> FORMAT keyword when the JSON keyword appears after it.  But why do you
> care?  If you use the patch, then you no longer need to have the "not
> recognized" error messages anymore, because the JSON format is indeed
> a recognized one.
>

"JSON word is not in quotes" is my intention.

Now it seems when people implement any custom format for COPY,
if the format_name is a keyword then we need single quotes.

Thanks for clarifying!




Re: Emitting JSON to file using COPY TO

2024-02-02 Thread Alvaro Herrera
On 2024-Feb-02, jian he wrote:

> copy (select 1) to stdout with  (format json);
> ERROR:  syntax error at or near "format"
> LINE 1: copy (select 1) to stdout with  (format json);
>  ^
> 
> json is a keyword. Is it possible to escape it?
> make `copy (select 1) to stdout with  (format json)` error message the same as
> `copy (select 1) to stdout with  (format json1)`

Sure, you can use 
 copy (select 1) to stdout with  (format "json");
and then you get
ERROR:  COPY format "json" not recognized

is that what you meant?

If you want the server to send this message when the JSON word is not in
quotes, I'm afraid that's not possible, due to the funny nature of the
FORMAT keyword when the JSON keyword appears after it.  But why do you
care?  If you use the patch, then you no longer need to have the "not
recognized" error messages anymore, because the JSON format is indeed
a recognized one.

Maybe I didn't understand your question.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Change GUC hashtable to use simplehash?

2024-02-02 Thread John Naylor
On Tue, Jan 30, 2024 at 7:51 PM Ants Aasma  wrote:
>
> It didn't calculate the same result because the if (mask) condition
> was incorrect. Changed it to if (chunk & 0xFF) and removed the right
> shift from the mask.

Yes, you're quite right.

> It seems to be half a nanosecond faster, but as I
> don't have a machine set up for microbenchmarking it's quite close to
> measurement noise.

With my "throughput-ush" test, they look good:

pgbench -n -T 20 -f bench_cstr_aligned.sql -M prepared | grep latency

master:
latency average = 490.722 ms

(Ants Aasma) v17 0001:
latency average = 385.263 ms

v17 0001+0002:
latency average = 339.506 ms

> I didn't post the harness as it's currently so messy to be near
> useless to others. But if you'd like to play around,  I can tidy it up
> a bit and post it.

I'd be curious, thanks.
From 77d848a83930abe2badd8c0f1ade79c88c27b455 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 30 Jan 2024 16:14:57 +0700
Subject: [PATCH v17 2/2] Shorten dependency chain for computing hash mask

---
 src/include/common/hashfn_unstable.h | 30 
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 8ee1b99a20..3a74e6e33b 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -176,19 +176,6 @@ fasthash_accum(fasthash_state *hs, const char *k, int len)
 #define haszero64(v) \
 	(((v) - 0x0101010101010101) & ~(v) & 0x8080808080808080)
 
-/*
- * Returns non-zero when first byte in memory order is not NUL
- */
-static inline int
-first_byte_nonzero(uint64 v)
-{
-#ifdef WORDS_BIGENDIAN
-		return v >> 56;
-#else
-		return v & 0xFF;
-#endif
-}
-
 /*
  * all-purpose workhorse for fasthash_accum_cstring
  */
@@ -225,6 +212,7 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	int			remainder;
 	uint64		zero_bytes_le;
 	uint64		chunk;
+	uint64		mask;
 
 	Assert(PointerIsAligned(start, uint64));
 	for (;;)
@@ -257,14 +245,22 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	 * byte within the input word by counting the number of trailing (because
 	 * little-endian) zeros and dividing the result by 8.
 	 */
-	if (first_byte_nonzero(chunk))
+	/* If the first byte in the input is the NUL terminator, we have nothing to do */
+	if ((zero_bytes_le & 0xFF) == 0)
 	{
 		remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
+		/*
+		 * Create a mask for the remaining bytes and
+		 * combine them into the hash. The mask also covers the NUL
+		 * terminator, but that's harmless. The mask could contain 0x80
+		 * in higher bytes where the input is non-zero, but only if the
+		 * input byte is 0x01, so also harmless.
+		 */
+		mask = zero_bytes_le | (zero_bytes_le - 1);
 #ifdef WORDS_BIGENDIAN
-		hs->accum = chunk & ((~0ULL) << (64 - BITS_PER_BYTE*remainder));
-#else
-		hs->accum = chunk & ((~0ULL) >> (64 - BITS_PER_BYTE*remainder));
+		mask = pg_bswap64(mask);
 #endif
+		hs->accum = chunk & mask;
 		fasthash_combine(hs);
 
 		str += remainder;
-- 
2.43.0

From 5617ee1450316ad4f494c9378ae1c53dbb095b89 Mon Sep 17 00:00:00 2001
From: Ants Aasma 
Date: Mon, 29 Jan 2024 15:16:02 +0200
Subject: [PATCH v17 1/2] Speed up last iteration of aligned fasthash

We know the length of the string so we can mask out end of the
string with a shift. Without this the aligned version was slower
than unaligned on small strings.
---
 src/include/common/hashfn_unstable.h | 31 
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 3d927e1fb1..8ee1b99a20 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -176,6 +176,19 @@ fasthash_accum(fasthash_state *hs, const char *k, int len)
 #define haszero64(v) \
 	(((v) - 0x0101010101010101) & ~(v) & 0x8080808080808080)
 
+/*
+ * Returns non-zero when first byte in memory order is not NUL
+ */
+static inline int
+first_byte_nonzero(uint64 v)
+{
+#ifdef WORDS_BIGENDIAN
+		return v >> 56;
+#else
+		return v & 0xFF;
+#endif
+}
+
 /*
  * all-purpose workhorse for fasthash_accum_cstring
  */
@@ -211,11 +224,12 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	const char *const start = str;
 	int			remainder;
 	uint64		zero_bytes_le;
+	uint64		chunk;
 
 	Assert(PointerIsAligned(start, uint64));
 	for (;;)
 	{
-		uint64		chunk = *(uint64 *) str;
+		chunk = *(uint64 *) str;
 
 		/*
 		 * With little-endian representation, we can use this calculation,
@@ -243,9 +257,18 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	 * byte within the input word by counting the number of trailing (because
 	 * little-endian) zeros and dividing the result by 8.
 	 */
-	remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
-	fasthash_accum(hs, str, remainder);
-	str += remainder;
+	

Re: Check lateral references within PHVs for memoize cache keys

2024-02-02 Thread Richard Guo
On Mon, Dec 25, 2023 at 3:01 PM Richard Guo  wrote:

> On Thu, Jul 13, 2023 at 3:12 PM Richard Guo 
> wrote:
>
>> So I'm wondering if it'd be better that we move all this logic of
>> computing additional lateral references within PHVs to get_memoize_path,
>> where we can examine only PHVs that are evaluated at innerrel.  And
>> considering that these lateral refs are only used by Memoize, it seems
>> more sensible to compute them there.  But I'm a little worried that
>> doing this would make get_memoize_path too expensive.
>>
>> Please see v4 patch for this change.
>>
>
> I'd like to add that not checking PHVs for lateral references can lead
> to performance regressions with Memoize node.
>

The v4 patch does not apply any more.  I've rebased it on master.
Nothing else has changed.

Thanks
Richard


v5-0001-Check-lateral-refs-within-PHVs-for-memoize-cache-keys.patch
Description: Binary data


Re: Checking MINIMUM_VERSION_FOR_WAL_SUMMARIES

2024-02-02 Thread Artur Zakirov
On Fri, 2 Feb 2024 at 09:41, Nazir Bilal Yavuz  wrote:
> You seem right, nice catch. Also, this change makes the check in
>
> snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
>  basedir,
>  PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
>  "pg_xlog" : "pg_wal");
>
> redundant. PQserverVersion(conn) will always be higher than
> MINIMUM_VERSION_FOR_PG_WAL.

Thank you both for the comments. Indeed, that part now looks redundant.
I've attached a patch to remove checking MINIMUM_VERSION_FOR_PG_WAL.

-- 
Artur
From cc8e636ad47b9dcc8779934e58f351ca43067b05 Mon Sep 17 00:00:00 2001
From: Artur Zakirov 
Date: Fri, 2 Feb 2024 10:06:42 +0100
Subject: [PATCH v2] Fix checking MINIMUM_VERSION_FOR_WAL_SUMMARIES for
 creating pg_wal/summaries directory

---
 src/bin/pg_basebackup/pg_basebackup.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 77489af518..3a9940097c 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -700,14 +700,12 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
 		/*
 		 * For newer server versions, likewise create pg_wal/summaries
 		 */
-		if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+		if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
 		{
 			char		summarydir[MAXPGPATH];
 
 			snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
-	 basedir,
-	 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-	 "pg_xlog" : "pg_wal");
+	 basedir, "pg_wal");
 
 			if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
 errno != EEXIST)
-- 
2.40.1



Re: ResourceOwner refactoring

2024-02-02 Thread Alexander Lakhin

Hello Heikki,

08.11.2023 14:37, Heikki Linnakangas wrote:


Fixed these, and pushed. Thanks everyone for reviewing!



Please try the following script:
mkdir /tmp/50m
sudo mount -t tmpfs -o size=50M tmpfs /tmp/50m
export PGDATA=/tmp/50m/tmpdb

initdb
pg_ctl -l server.log start

cat << 'EOF' | psql
CREATE TEMP TABLE t (a name, b name, c name, d name);
INSERT INTO t SELECT 'a', 'b', 'c', 'd' FROM generate_series(1, 1000) g;

COPY t TO '/tmp/t.data';
SELECT 'COPY t FROM ''/tmp/t.data''' FROM generate_series(1, 100)
\gexec
EOF

which produces an unexpected error, a warning, and an assertion failure,
starting from b8bff07da:
..
COPY 1000
ERROR:  could not extend file "base/5/t2_16386" with FileFallocate(): No space 
left on device
HINT:  Check free disk space.
CONTEXT:  COPY t, line 1000
ERROR:  ResourceOwnerRemember called but array was full
CONTEXT:  COPY t, line 1000
WARNING:  local buffer refcount leak: [-499] (rel=base/5/t2_16386, 
blockNum=1483, flags=0x200, refcount=0 1)
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
connection to server was lost
2024-02-02 08:29:24.434 UTC [2052831] LOG:  server process (PID 2052839) was 
terminated by signal 6: Aborted

server.log contains:
TRAP: failed Assert("RefCountErrors == 0"), File: "localbuf.c", Line: 804, PID: 
2052839

Best regards,
Alexander




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-02 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Fri, 2 Feb 2024 17:04:28 +0900,
  Michael Paquier  wrote:

> One idea I was considering is whether we should use a special value in
> the "format" DefElem, say "custom:$my_custom_format" where it would be
> possible to bypass the format check when processing options and find
> the routines after processing all the options.  I'm not wedded to
> that, but attaching the routines to the state data is IMO the correct
> thing, because this has nothing to do with CopyFormatOptions.

Thanks for sharing your idea.
Let's discuss how to support custom options after we
complete the current performance changes.

>> I'm OK with the approach. But how about adding the extra
>> callbacks to Copy{From,To}StateData not
>> Copy{From,To}Routines like CopyToStateData::data_dest_cb and
>> CopyFromStateData::data_source_cb? They are only needed for
>> "text" and "csv". So we don't need to add them to
>> Copy{From,To}Routines to keep required callback minimum.
> 
> And set them in cstate while we are in the Start routine, right?

I imagined that it's done around the following part:

@@ -1418,6 +1579,9 @@ BeginCopyFrom(ParseState *pstate,
/* Extract options from the statement node tree */
ProcessCopyOptions(pstate, &cstate->opts, true /* is_from */ , options);
 
+   /* Set format routine */
+   cstate->routine = CopyFromGetRoutine(cstate->opts);
+
/* Process the target relation */
cstate->rel = rel;
 

Example1:

/* Set format routine */
cstate->routine = CopyFromGetRoutine(cstate->opts);
if (!cstate->opts.binary)
{
	if (cstate->opts.csv_mode)
		cstate->copy_read_attributes = CopyReadAttributesCSV;
	else
		cstate->copy_read_attributes = CopyReadAttributesText;
}

Example2:

static void
CopyFromSetRoutine(CopyFromState cstate)
{
	if (cstate->opts.csv_mode)
	{
		cstate->routine = &CopyFromRoutineCSV;
		cstate->copy_read_attributes = CopyReadAttributesCSV;
	}
	else if (cstate->opts.binary)
		cstate->routine = &CopyFromRoutineBinary;
	else
	{
		cstate->routine = &CopyFromRoutineText;
		cstate->copy_read_attributes = CopyReadAttributesText;
	}
}

BeginCopyFrom()
{
	/* Set format routine */
	CopyFromSetRoutine(cstate);
}


But I don't object to your original approach. If we have the
extra callbacks in Copy{From,To}Routines, I just won't use
them for my custom format extension.

>> What is the better next action for us? Do you want to
>> complete the WIP v11 patch set by yourself (and commit it)?
>> Or should I take over it?
> 
> I was planning to work on that, but wanted to be sure how you felt
> about the problem with text and csv first.

OK.
My opinion is as above. I have an idea of how to implement it,
but it's not a strong preference. You can choose whichever you like.


Thanks,
-- 
kou




Re: Checking MINIMUM_VERSION_FOR_WAL_SUMMARIES

2024-02-02 Thread Yugo NAGATA
On Fri, 2 Feb 2024 01:11:27 +0100
Artur Zakirov  wrote:

> Hi hackers,
> 
> while reading the source code of the new incremental backup functionality
> I noticed that the following condition may be unintentional:
> 
> /*
>  * For newer server versions, likewise create pg_wal/summaries
>  */
> if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
> {
> ...
> 
> if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
> errno != EEXIST)
> pg_fatal("could not create directory \"%s\": %m", summarydir);
> }
> 
> This is from src/bin/pg_basebackup/pg_basebackup.c.
> 
> Is the condition correct? Shouldn't it be ">="? Otherwise the function
> will create "/summaries" only for older PostgreSQL versions.
> 
> I've attached a patch to fix it in case this is a typo.

I also think it should be ">=".
Also, if so, I don't think the check of MINIMUM_VERSION_FOR_PG_WAL
in the block is required, because the directory name is always pg_wal
in the new versions.

Regards,
Yugo Nagata

> 
> -- 
> Artur


-- 
Yugo NAGATA 




Re: Checking MINIMUM_VERSION_FOR_WAL_SUMMARIES

2024-02-02 Thread Nazir Bilal Yavuz
Hi,

On Fri, 2 Feb 2024 at 03:11, Artur Zakirov  wrote:
>
> Hi hackers,
>
> while reading the source code of the new incremental backup functionality
> I noticed that the following condition may be unintentional:
>
> /*
>  * For newer server versions, likewise create pg_wal/summaries
>  */
> if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
> {
> ...
>
> if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
> errno != EEXIST)
> pg_fatal("could not create directory \"%s\": %m", summarydir);
> }
>
> This is from src/bin/pg_basebackup/pg_basebackup.c.
>
> Is the condition correct? Shouldn't it be ">="? Otherwise the function
> will create "/summaries" only for older PostgreSQL versions.

You seem right, nice catch. Also, this change makes the check in

snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
 basedir,
 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
 "pg_xlog" : "pg_wal");

redundant. PQserverVersion(conn) will always be higher than
MINIMUM_VERSION_FOR_PG_WAL.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Statistics Import and Export

2024-02-02 Thread Corey Huinker
On Mon, Jan 22, 2024 at 1:09 AM Peter Smith  wrote:

> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ==
> [1] https://commitfest.postgresql.org/46/4538/
> [2]
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4538
>
> Kind Regards,
> Peter Smith.
>

Attached is v4 of the statistics export/import patch.

This version has been refactored to match the design feedback received
previously.

The system views are gone. These were mostly there to serve as a baseline
for what an export query would look like. That role is temporarily
reassigned to pg_export_stats.c, but hopefully they will be integrated into
pg_dump in the next version. The regression test also contains the version
of each query suitable for the current server version.

The export format is far closer to the raw format of pg_statistic and
pg_statistic_ext_data, respectively. This format involves exporting oid
values for types, collations, operators, and attributes - values which are
specific to the server they were created on. To make sense of those values,
a subset of the columns of pg_type, pg_attribute, pg_collation, and
pg_operator are exported as well, which allows pg_import_rel_stats() and
pg_import_ext_stats() to reconstitute the data structure as it existed on
the old server, and adapt it to the modern structure and local schema
objects.

pg_import_rel_stats matches up local columns with the exported stats by
column name, not attnum. This allows for stats to be imported when columns
have been dropped, added, or reordered.
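
As a rough illustration of that name-based matching (not code from the patch; the helper name is invented), the import path can resolve each exported column name against the destination relation and simply skip names that no longer exist:

#include "postgres.h"
#include "access/attnum.h"
#include "utils/lsyscache.h"

/*
 * Hypothetical helper: resolve an exported column name to its attnum on the
 * destination relation.  Columns that were dropped or renamed come back as
 * InvalidAttrNumber and their statistics are simply skipped, which is what
 * makes the import robust against added, dropped, or reordered columns.
 */
static AttrNumber
match_exported_column(Oid destrelid, const char *exported_attname)
{
	AttrNumber	attnum = get_attnum(destrelid, exported_attname);

	if (attnum == InvalidAttrNumber)
		elog(DEBUG1, "relation %u has no column \"%s\", skipping its statistics",
			 destrelid, exported_attname);

	return attnum;
}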

pg_import_ext_stats can also handle column reordering, though it currently
would get confused by changes in expressions that maintain the same result
data type. I'm not yet brave enough to handle importing nodetrees, nor do I
think it's wise to try. I think we'd be better off validating that the
destination extended stats object is identical in structure, and failing
the import of that one object if it isn't perfect.

Export formats go back to v10.


Re: Emitting JSON to file using COPY TO

2024-02-02 Thread jian he
On Wed, Jan 31, 2024 at 9:26 PM Alvaro Herrera  wrote:
>
> On 2024-Jan-23, jian he wrote:
>
> > > +   | FORMAT_LA copy_generic_opt_arg
> > > +   {
> > > +   $$ = makeDefElem("format", $2, @1);
> > > +   }
> > > ;
> > >
> > > I think it's not necessary. "format" option is already handled in
> > > copy_generic_opt_elem.
> >
> > test it, I found out this part is necessary.
> > because a query with WITH like `copy (select 1)  to stdout with
> > (format json, force_array false); ` will fail.
>
> Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
> (see base_yylex there).  I'm not really sure but I think it might be
> better to make it "| FORMAT_LA JSON" instead of invoking the whole
> copy_generic_opt_arg syntax.  Not because of performance, but just
> because it's much clearer what's going on.
>

Sorry to bother you.
I haven't applied any patch; this is just on master.
I don't know much about gram.y.

copy (select 1)  to stdout with  (format json1);
ERROR:  COPY format "json1" not recognized
LINE 1: copy (select 1)  to stdout with  (format json1);
  ^
copy (select 1) to stdout with  (format json);
ERROR:  syntax error at or near "format"
LINE 1: copy (select 1) to stdout with  (format json);
 ^

json is a keyword. Is it possible to escape it, so that
`copy (select 1) to stdout with (format json)` produces the same error message as
`copy (select 1) to stdout with (format json1)`?




Re: Improve eviction algorithm in ReorderBuffer

2024-02-02 Thread Masahiko Sawada
On Fri, Feb 2, 2024 at 1:59 PM Shubham Khanna
 wrote:
>
> On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > >
> > > > > > > The individual transactions shouldn't cross
> > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your 
> > > > > > > proposal to
> > > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 
> > > > > > > 5% >=
> > > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > > impression
> > > > > > > that you are saying to maintain them in a separate transaction 
> > > > > > > list
> > > > > > > which doesn't seems to be the case.
> > > > > >
> > > > > > I wanted to mean that there are three lists in total: the first one
> > > > > > maintain the transactions consuming more than 10% of
> > > > > > logical_decoding_work_mem,
> > > > > >
> > > > >
> > > > > How can we have multiple transactions in the list consuming more than
> > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > before any xact reaches logical_decoding_work_mem?
> > > >
> > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > > consuming more than 6.4MB are added to the list. So for example, it's
> > > > possible that the list has three transactions each of which are
> > > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > > still 30MB (less than logical_decoding_work_mem).
> > > >
> > >
> > > Thanks for the clarification. I misunderstood the list to have
> > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > thing to note is that maintaining these lists by default can also have
> > > some overhead unless the list of open transactions crosses a certain
> > > threshold.
> > >
> >
> > On further analysis, I realized that the approach discussed here might
> > not be the way to go. The idea of dividing transactions into several
> > subgroups is to divide a large number of entries into multiple
> > sub-groups so we can reduce the complexity to search for the
> > particular entry. Since we assume that there are no big differences in
> > entries' sizes within a sub-group, we can pick the entry to evict in
> > O(1). However, what we really need to avoid here is that we end up
> > increasing the number of times to evict entries because serializing an
> > entry to disk is more costly than searching an entry in memory in
> > general.
> >
> > I think that it's no problem in a large-entries subgroup but when it
> > comes to the smallest-entries subgroup, like for entries consuming
> > less than 5% of the limit, it could end up evicting many entries. For
> > example, there would be a huge difference between serializing 1 entry
> > consuming 5% of the memory limit and serializing 5000 entries
> > consuming 0.001% of the memory limit. Even if we can select 5000
> > entries quickly, I think the latter would be slower in total. The more
> > subgroups we create, the more complex the algorithm gets and the more
> > overhead it could cause. So I think we need to search for the largest
> > entry in order to minimize the number of evictions anyway.
> >
> > Looking for data structures and algorithms, I think binaryheap with
> > some improvements could be promising. I mentioned before why we cannot
> > use the current binaryheap[1]. The missing pieces are efficient ways
> > to remove the arbitrary entry and to update the arbitrary entry's key.
> > The current binaryheap provides binaryheap_remove_node(), which is
> > O(log n), but it requires the entry's position in the binaryheap. We
> > can know the entry's position just after binaryheap_add_unordered()
> > but it might be changed after heapify. Searching the node's position
> > is O(n). So the improvement idea is to add a hash table to the
> > binaryheap so that it can track the positions for each entry so that
> > we can remove the arbitrary entry in O(log n) and also update the
> > arbitrary entry's key in O(log n). This is known as the indexed
> > priority queue. I've attached the patch for that (0001 and 0002).
> >
> > That way, in terms of reorderbuffer, we can update and remove the
> > transaction's memory usage in O(log n) (in worst case and O(1) in
> > average) and then pick the largest transaction in O(1). Since we might
> > need to call ReorderBufferSerializeTXN() even in non-streaming case,
> > we need to maintain the binaryheap anyway. I've 
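
For illustration, here is a minimal C sketch of the indexed-heap mechanism described above (not the attached 0001/0002 patches; for simplicity each entry stores its own position instead of using a hash table, which achieves the same O(log n) update and removal):

#include <stddef.h>

typedef struct HeapEntry
{
	size_t		size;			/* key: memory consumed by the transaction */
	int			heap_index;		/* current position in heap->entries[] */
} HeapEntry;

typedef struct IndexedHeap
{
	HeapEntry **entries;		/* max-heap ordered by size */
	int			nentries;
} IndexedHeap;

static void
heap_swap(IndexedHeap *heap, int a, int b)
{
	HeapEntry  *tmp = heap->entries[a];

	heap->entries[a] = heap->entries[b];
	heap->entries[b] = tmp;
	heap->entries[a]->heap_index = a;
	heap->entries[b]->heap_index = b;
}

static void
sift_up(IndexedHeap *heap, int i)
{
	while (i > 0 && heap->entries[i]->size > heap->entries[(i - 1) / 2]->size)
	{
		heap_swap(heap, i, (i - 1) / 2);
		i = (i - 1) / 2;
	}
}

static void
sift_down(IndexedHeap *heap, int i)
{
	for (;;)
	{
		int			largest = i;
		int			left = 2 * i + 1;
		int			right = 2 * i + 2;

		if (left < heap->nentries &&
			heap->entries[left]->size > heap->entries[largest]->size)
			largest = left;
		if (right < heap->nentries &&
			heap->entries[right]->size > heap->entries[largest]->size)
			largest = right;
		if (largest == i)
			break;
		heap_swap(heap, i, largest);
		i = largest;
	}
}

/* O(log n): adjust an entry's key after its memory usage changed */
static void
heap_update_key(IndexedHeap *heap, HeapEntry *entry, size_t newsize)
{
	entry->size = newsize;
	sift_up(heap, entry->heap_index);
	sift_down(heap, entry->heap_index);
}

/* O(log n): remove an arbitrary entry, e.g. when its transaction ends */
static void
heap_remove(IndexedHeap *heap, HeapEntry *entry)
{
	int			i = entry->heap_index;

	heap->nentries--;
	if (i < heap->nentries)
	{
		heap->entries[i] = heap->entries[heap->nentries];
		heap->entries[i]->heap_index = i;
		sift_up(heap, i);
		sift_down(heap, i);
	}
}

The largest transaction is always entries[0], so picking the eviction victim stays O(1).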

Re: Synchronizing slots from primary to standby

2024-02-02 Thread Amit Kapila
On Fri, Feb 2, 2024 at 10:53 AM Bertrand Drouvot
 wrote:
>
> On Thu, Feb 01, 2024 at 04:12:43PM +0530, Amit Kapila wrote:
> > On Thu, Jan 25, 2024 at 11:26 AM Bertrand Drouvot
> >
> > I again thought on this point and feel that even if we start to sync
> > say physical slots their purpose would also be to allow
> > failover/switchover, otherwise, there is no use of syncing the slots.
>
> Yeah, I think this is a good point.
>
> > So, by that theory, we can just go for naming it as
> > sync_failover_slots or simply sync_slots with values 'off' and 'on'.
> > Now, if these are used for switchover then there is an argument that
> > adding 'failover' in the GUC name could be confusing but I feel
> > 'failover' is used widely enough that it shouldn't be a problem for
> > users to understand, otherwise, we can go with simple name like
> > sync_slots as well.
> >
>
> I agree, and "on"/"off" looks like enough to me now. As for the GUC name, I
> have the feeling that "replication" should be part of it, and think that
> sync_replication_slots is fine. The reason is that "sync_slots" could be
> confusing if, in the future, other kinds of "slots" (other than replication
> ones) are added to the engine.
>

+1 for sync_replication_slots with values as 'on'/'off'.

-- 
With Regards,
Amit Kapila.




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-02 Thread Michael Paquier
On Fri, Feb 02, 2024 at 04:33:19PM +0900, Sutou Kouhei wrote:
> Hi,
> 
> In 
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Fri, 2 Feb 2024 15:21:31 +0900,
>   Michael Paquier  wrote:
> 
> > I have done a review of v10, see v11 attached which is still WIP, with
> > the patches for COPY TO and COPY FROM merged together.  Note that I'm
> > thinking to merge them into a single commit.
> 
> OK. I don't have a strong opinion for commit unit.
> 
> > @@ -74,11 +75,11 @@ typedef struct CopyFormatOptions
> >  boolconvert_selectively;/* do selective binary conversion? 
> > */
> >  CopyOnErrorChoice on_error; /* what to do when error happened */
> >  List   *convert_select; /* list of column names (can be NIL) */
> > +constCopyToRoutine *to_routine;/* callback routines for 
> > COPY TO */
> >  } CopyFormatOptions;
> > 
> > Adding the routines to the structure for the format options is in my
> > opinion incorrect.  The elements of this structure are first processed
> > in the option deparsing path, and then we need to use the options to
> > guess which routines we need.
> 
> This was discussed with Sawada-san a bit before. [1][2]
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CAD21AoBmNiWwrspuedgAPgbAqsn7e7NoZYF6gNnYBf%2BgXEk9Mg%40mail.gmail.com#bfd19262d261c67058fdb8d64e6a723c
> [2] 
> https://www.postgresql.org/message-id/flat/20240130.144531.1257430878438173740.kou%40clear-code.com#fc55392d77f400fc74e42686fe7e348a
> 
> I kept the routines in CopyFormatOptions for custom option
> processing. But I should have not cared about it in this
> patch set because this patch set doesn't include custom
> option processing.

One idea I was considering is whether we should use a special value in
the "format" DefElem, say "custom:$my_custom_format" where it would be
possible to bypass the format check when processing options and find
the routines after processing all the options.  I'm not wedded to
that, but attaching the routines to the state data is IMO the correct
thing, because this has nothing to do with CopyFormatOptions.
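
As a rough sketch of that idea (not from any posted patch; the helper name is invented), the option-processing code could defer the routine lookup like this:

#include "postgres.h"

/*
 * Hypothetical helper: if the FORMAT value uses the "custom:" prefix,
 * return the custom format name so the caller can look up the matching
 * routines after all options have been processed; return NULL for the
 * built-in formats, which keep going through the existing checks.
 */
static char *
extract_custom_format_name(const char *fmt)
{
	const char *prefix = "custom:";

	if (strncmp(fmt, prefix, strlen(prefix)) == 0)
		return pstrdup(fmt + strlen(prefix));

	return NULL;
}

ProcessCopyOptions() would call something like this when it sees the format
option; a NULL result falls through to the existing text/csv/binary handling,
and a non-NULL name is resolved to a routine set once all options are known.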

> So I'm OK that we move the routines to
> Copy{From,To}StateData.

Okay.

>> copyapi.h needs more documentation, like what is expected for
>> extension developers when using these, what are the arguments, etc.  I
>> have added what I had in mind for now.
> 
> Thanks! I'm not good at writing documentation in English...

No worries.

> I'm OK with the approach. But how about adding the extra
> callbacks to Copy{From,To}StateData not
> Copy{From,To}Routines like CopyToStateData::data_dest_cb and
> CopyFromStateData::data_source_cb? They are only needed for
> "text" and "csv". So we don't need to add them to
> Copy{From,To}Routines to keep required callback minimum.

And set them in cstate while we are in the Start routine, right?  Hmm.
Why not?  That would get rid of the multiple layers v11 has, which
is my pain point, and we have many fields in cstate that are already
used on a per-format basis.

> What is the better next action for us? Do you want to
> complete the WIP v11 patch set by yourself (and commit it)?
> Or should I take over it?

I was planning to work on that, but wanted to be sure how you felt
about the problem with text and csv first.
--
Michael

