Re: Add common function ReplicationOriginName.

2022-11-07 Thread Aleksander Alekseev
Hi Tom,

> I looked at this and am inclined to reject it. [...]

OK, thanks. Then we are done with this thread. I closed the
corresponding CF entry.

-- 
Best regards,
Aleksander Alekseev




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-07 Thread Peter Smith
Here are my review comments for v42-0001

==

1. General.

Please take the time to run all new code comments through a
grammar/spelling checker (e.g. simply cut/paste them into MS Word or
Grammarly or any other tool of your choice as a quick double-check)
*before* posting the patches. Too many of my review comments are about
code comments, and it takes a long time to keep cycling through
reporting/fixing/confirming comments for every patch version, whereas
it would probably take hardly any time to make the same
spelling/grammar corrections up-front.


==

.../replication/logical/applyparallelworker.c

2. ParallelApplyLockids

This seems like a bogus name. The code uses it to mean the subset of
lockED ids, not the list of all the lock ids.

OTOH, having another list of ALL lock-ids might be useful (for
detecting unique ids) if you are able to maintain such a list safely.

~~~

3. parallel_apply_can_start

+
+ if (switching_to_serialize)
+ return false;

This should have an explanatory comment.

~~~

4. parallel_apply_start_worker

+ /* Check if the transaction in that worker has been finished. */
+ xact_state = parallel_apply_get_xact_state(tmp_winfo->shared);
+ if (xact_state == PARALLEL_TRANS_FINISHED)

"has been finished." -> "has finished."

~~~

5.

+ /*
+ * Set the xact_state flag in the leader instead of the
+ * parallel apply worker to avoid the race condition where the leader has
+ * already started waiting for the parallel apply worker to finish
+ * processing the transaction while the child process has not yet
+ * processed the first STREAM_START and has not set the
+ * xact_state to true.
+ */
+ SpinLockAcquire(&winfo->shared->mutex);
+ winfo->shared->xact_state = PARALLEL_TRANS_UNKNOWN;
+ winfo->shared->xid = xid;
+ winfo->shared->fileset_valid = false;
+ winfo->shared->partial_sent_message = false;
+ SpinLockRelease(&winfo->shared->mutex);

This code comment is stale, because xact_state is no longer a "flag",
nor does "set the xact_state to true." make sense anymore.

~~~

6. parallel_apply_free_worker

+ /*
+ * Don't free the worker if the transaction in the worker is still in
+ * progress. This could happend as we don't wait for transaction rollback
+ * to finish.
+ */
+ if (parallel_apply_get_xact_state(winfo->shared) < PARALLEL_TRANS_FINISHED)
+ return;

6a.
typo "happend"

~

6b.
Saying "< PARALLEL_TRANS_FINISHED" seems kind of risky because not it
is assuming a specific ordering of those enums which has never been
mentioned before. I think it will be safer to say "!=
PARALLEL_TRANS_FINISHED" instead. Alternatively, if the enum order is
important then it must be documented with the typedef so that nobody
changes it.
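
For illustration only (the typedef name and the middle state are my
guesses, not taken from the patch), documenting the ordering could look
roughly like this:

    /*
     * Transaction states, in the order a transaction moves through them.
     * Code compares these values with '<', so do not reorder the members.
     */
    typedef enum ParallelTransState
    {
        PARALLEL_TRANS_UNKNOWN,     /* must stay first: not started yet */
        PARALLEL_TRANS_STARTED,     /* assumed intermediate state */
        PARALLEL_TRANS_FINISHED     /* must stay last: '<' comparisons rely on it */
    } ParallelTransState;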

~~~

7.

+ ParallelApplyWorkersList = list_delete_ptr(ParallelApplyWorkersList,
+winfo);

Unnecessary wrapping

~~~

8.

+ /*
+ * Resend the pending message to parallel apply worker to cleanup the
+ * queue. Note that parallel apply worker will just ignore this message
+ * as it has already handled this message while applying spooled
+ * messages.
+ */
+ result = shm_mq_send(winfo->mq_handle, strlen(winfo->pending_msg),
+ winfo->pending_msg, false, true);

If I understand this logic it seems a bit hacky. From the comment, it
seems you are resending a message that you know/expect to be ignored
simply to make it disappear. (??). Isn't there some other way to clear
the pending message without requiring a bogus send?

~~~

9. parallel_apply_spooled_messages

+
+static void
+parallel_apply_spooled_messages(void)

Missing function comment

~~~

10.

+parallel_apply_spooled_messages(void)
+{
+ bool fileset_valid = false;
+
+ /*
+ * Check if changes has been serialized to disk. if so, read and
+ * apply them.
+ */
+ SpinLockAcquire(&MyParallelShared->mutex);
+ fileset_valid = MyParallelShared->fileset_valid;
+ SpinLockRelease(&MyParallelShared->mutex);

The variable assignment in the declaration seems unnecessary.

~~~

11.

+ /*
+ * Check if changes has been serialized to disk. if so, read and
+ * apply them.
+ */
+ SpinLockAcquire(&MyParallelShared->mutex);
+ fileset_valid = MyParallelShared->fileset_valid;
+ SpinLockRelease(&MyParallelShared->mutex);

"has been" -> "have been"

~~~

12.

+ apply_spooled_messages(&MyParallelShared->fileset,
+MyParallelShared->xid,
+InvalidXLogRecPtr);
+ parallel_apply_set_fileset(MyParallelShared, false);

parallel_apply_set_fileset() is a confusing function name. IMO this
logic would be better split into 2 smaller functions:
- parallel_apply_set_fileset_valid()
- parallel_apply_set_fileset_invalid()
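
Something along these lines (a sketch only; the shared-struct and field
names are taken from the quoted hunks, so treat them as assumptions):

    static void
    parallel_apply_set_fileset_valid(ParallelApplyWorkerShared *shared)
    {
        SpinLockAcquire(&shared->mutex);
        shared->fileset_valid = true;
        SpinLockRelease(&shared->mutex);
    }

    static void
    parallel_apply_set_fileset_invalid(ParallelApplyWorkerShared *shared)
    {
        SpinLockAcquire(&shared->mutex);
        shared->fileset_valid = false;
        SpinLockRelease(&shared->mutex);
    }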

~~~

13. parallel_apply_get_unique_id

+/*
+ * Returns the unique id among all parallel apply workers in the subscriber.
+ */
+static uint16
+parallel_apply_get_unique_id()

The meaning of that comment and the purpose of this function are not
entirely clear... e.g. I had to read the code to figure out what the
comment is describing.

~~~

14.

The function seems to be written in some way that scans all known ids

Re: Allow single table VACUUM in transaction block

2022-11-07 Thread Simon Riggs
On Sun, 6 Nov 2022 at 20:40, Peter Geoghegan  wrote:
>
> On Sun, Nov 6, 2022 at 11:14 AM Tom Lane  wrote:
> > In general, I do not believe in encouraging users to run VACUUM
> > manually in the first place.  We would be far better served by
> > spending our effort to improve autovacuum's shortcomings.
>
> I couldn't agree more. A lot of problems seem related to the idea that
> VACUUM is just a command that the DBA periodically runs to get a
> predictable fixed result, a little like CREATE INDEX. That conceptual
> model isn't exactly wrong; it just makes it much harder to apply any
> kind of context about the needs of the table over time. There is a
> natural cycle to how VACUUM (really autovacuum) is run, and the
> details matter.
>
> There is a significant amount of relevant context that we can't really
> use right now. That wouldn't be true if VACUUM only ran within an
> autovacuum worker (by definition). The VACUUM command itself would
> still be available, and support the same user interface, more or less.
> Under the hood the VACUUM command would work by enqueueing a VACUUM
> job, to be performed asynchronously by an autovacuum worker. Perhaps
> the initial enqueue operation could be transactional, fixing Simon's 
> complaint.

Ah, I see you got to this idea first!

Yes, what we need is for the "VACUUM command" to not fail in a script.
Not sure anyone cares where the work takes place.

Enqueuing a request for autovacuum to do that work, then blocking
until it is complete would do the job.

> "No more VACUUMs outside of autovacuum" would enable more advanced
> autovacuum.c scheduling, allowing us to apply a lot more context about
> the costs and benefits, without having to treat manual VACUUM as an
> independent thing. We could coalesce together redundant VACUUM jobs,
> suspend and resume VACUUM operations, and have more strategies to deal
> with problems as they emerge.

+1, but clearly this would not make temp table VACUUMs work.

> > I'd like to see some sort of direct attack on its inability to deal
> > with temp tables, for instance.  (Force the owning backend to
> > do it?  Temporarily change the access rules so that the data
> > moves to shared buffers?  Dunno, but we sure haven't tried hard.)

This was a $DIRECT attack on making temp tables work! ;-)

Temp tables are actually easier, since we don't need any of the
concurrency features we get with lazy vacuum. So the answer is to
always run a VACUUM FULL on temp tables, since this skips any issues
with indexes etc.

We would need to check a few things first, maybe something like
this (borrowed heavily from COPY):

    InvalidateCatalogSnapshot();
    if (!ThereAreNoPriorRegisteredSnapshots() || !ThereAreNoReadyPortals())
        ereport(WARNING,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("vacuum of temporary table ignored because of prior transaction activity")));
    CheckTableNotInUse(rel, "VACUUM");

> This is a good example of the kind of thing I have in mind. Perhaps it
> could work by killing the backend that owns the temp relation when
> things truly get out of hand? I think that that would be a perfectly
> reasonable trade-off.

+1

> Another related idea: better behavior in the event of a manually
> issued VACUUM (now just an enqueued autovacuum) that cannot do useful
> work due to the presence of a long running snapshot. The VACUUM
> doesn't have to dutifully report "success" when there is no practical
> sense in which it was successful. There could be a back and forth
> conversation between autovacuum.c and vacuumlazy.c that makes sure
> that something useful happens sooner or later. The passage of time
> really matters here.

Regrettably, neither vacuum nor autovacuum waits for xmin to change;
perhaps it should.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Error for WITH options on partitioned tables

2022-11-07 Thread Karina Litskevich
Hi David,

> I am not very clear about why `build_reloptions` is removed in patch
> `v2-0002-better-error-message-for-setting-parameters-for-p.patch`, if
> you can help explain would be great.

"build_reloptions" parses "reloptions" and takes for it a list of allowed
options defined by the 5th argument "relopt_elems" and the 6th argument
"num_relopt_elems", which are NULL and 0 in the removed call. If "validate"
is false, it ignores options, which are not in the list, while parsing. If
"validate" is true, it "elog"s ERROR when it meets option, which is not in
the
allowed list.

As in the deleted call "build_reloptions" is called with an empty list of
allowed options, it does nothing (returns NULL) when "validate" is false,
and
"elog"s ERROR when "validate" is true and "reloptions" is not empty. That is
what the deleted comment above the deleted call about. This call is there
only
for validation. So as I wanted to make a specific error message for the
case of
partitioned tables, I added the validation in "partitioned_table_reloptions"
and saw no reason to call "build_reloptions" any more because it would just
return NULL in other cases.
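
To make the case concrete, this is roughly the situation being discussed
(the exact error text depends on the server version, so take it as an
illustration only):

    CREATE TABLE measurements (logdate date, value int)
        PARTITION BY RANGE (logdate)
        WITH (fillfactor = 70);
    -- currently fails with an "unrecognized parameter" style error;
    -- the patch aims to say instead that storage parameters are not
    -- supported on partitioned tables.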

> This generic error message is reported by
> `errdetail_relkind_not_supported`, and there is a similar error message
> for partitioned tables. Anyone knows if this can be an option for
> reporting this `fillfactor` parameter not supported error.

This error is reported by "ATSimplePermissions" and, as we can see in the
beginning of this function, it makes no difference between regular relations
and partitioned tables now. To make it report errors for partitioned tables
we
should add new "alter table target-type flag" and add it to a mask of each
"AlterTableType" if partitioned table is an allowed target for it (see that
huge switch-case in function "ATPrepCmd"). Then
"partitioned_table_reloptions"
may become useless and we also should check weather some other functions
become
useless. Maybe that is the right way, but it looks much harder than the
existing solutions, so I believe, before anyone began going this way, it's
better to know whether there are any pitfalls there.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/


[PATCH] Const'ify the arguments of ilist.c/ilist.h functions

2022-11-07 Thread Aleksander Alekseev
Hi hackers,

During the [1] discussion it was suggested to constify the arguments
of ilist.c/ilist.h functions. Bharath (cc'ed) pointed out that it's
better to start a new thread in order to attract more hackers that may
be interested in this change, so I started one.

The patch is attached. Here are the reasons why we may want to do this:

"""
Const qualifiers ensure that we don't do something stupid in the function
implementation. Additionally they clarify the interface. As an example:

void
slist_delete(slist_head *head, const slist_node *node)

Here one can instantly tell that node->next is not going to be set to NULL.
Finally, const qualifiers potentially allow the compiler to do more
optimizations. This being said no benchmarking was done for this patch.
"""

Additionally Bharath pointed out that there are other pieces of code
that we may want to change in a similar fashion,
proclist.h/proclist_types.h as one example. I didn't do this yet
because I would like to know the community opinion first on whether we
should do this at all.

Thoughts?

[1]: 
https://www.postgresql.org/message-id/flat/caaphdvrtvxr+fxex0vbvicfkdgxa3twdgw9ofewnxcjmmwl...@mail.gmail.com

-- 
Best regards,
Aleksander Alekseev


v1-0001-Constify-the-arguments-of-ilist.c-h-functions.patch
Description: Binary data


Re: [patch] Adding an assertion to report too long hash table name

2022-11-07 Thread Zhang Mingli
Hi,


On Sep 29, 2022, 09:38 +0800, Xiaoran Wang , wrote:
> Hi,
> The max size for the shmem hash table name is SHMEM_INDEX_KEYSIZE - 1, but
> when the caller uses a longer hash table name, it doesn't report any error;
> instead it just uses the first SHMEM_INDEX_KEYSIZE - 1 chars as the hash
> table name.
> I created some shmem hash tables with the same prefix which was longer than
> SHMEM_INDEX_KEYSIZE - 1, and the sizes of those hash tables were the same,
> then only one hash table was created. But I thought those hash tables were
> created successfully.
> I know this is a corner case, but it's difficult to figure out when you run
> into it. So I added an assertion to prevent it.
>
> Thanks,
> Xiaoran
It seems Postgres doesn't currently have a case where strlen(name) >= SHMEM_INDEX_KEYSIZE (48).
The max length of a name I found is 29:
```
ShmemInitHash("Shared Buffer Lookup Table"
```
But it will help other databases built on Postgres, or future Postgres versions,
in case someone forgets to update SHMEM_INDEX_KEYSIZE
when a new shmem hash table is added with a name longer than the current SHMEM_INDEX_KEYSIZE.
And we don't have such an assertion now.
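For archive readers, the proposed guard amounts to something like this (a
sketch; the exact placement inside the shmem index code is up to the patch):
```
/* Names longer than the index key size are silently truncated today,
 * which can make two different tables collide on the same entry. */
Assert(strlen(name) < SHMEM_INDEX_KEYSIZE);
```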
So, +1 for the patch.

Regards,
Zhang Mingli


Re: Adding doubly linked list type which stores the number of items in the list

2022-11-07 Thread Aleksander Alekseev
Hi Bharath,

> Thanks for the patch. IMO, this can be discussed in a separate thread
> to get more thoughts from the hackers.

OK, I started a new thread [1], thanks.

Regarding the improvement of the code coverage I realized that this
may be a good patch for a newcomer. I know several people who may be
interested in starting to contribute to PostgreSQL. Maybe I'll be able
to find a volunteer.

[1]: 
https://www.postgresql.org/message-id/flat/CAJ7c6TM2=08mnkd9ajg8vey9hd+g4l7+nvh30uint3kshgr...@mail.gmail.com

-- 
Best regards,
Aleksander Alekseev




Use proc instead of MyProc in ProcArrayGroupClearXid()/TransactionGroupUpdateXidStatus()

2022-11-07 Thread rajesh singarapu
Hi,

In both TransactionGroupUpdateXidStatus and ProcArrayGroupClearXid the
global MyProc is used. For consistency, it is replaced with a function-local variable.


thanks
Rajesh


v1-0001-Use-proc-instead-of-MyProc-in-ProcArrayGroupClear.patch
Description: Binary data


Re: Adding doubly linked list type which stores the number of items in the list

2022-11-07 Thread Bharath Rupireddy
On Mon, Nov 7, 2022 at 2:37 PM Aleksander Alekseev
 wrote:
>
> Regarding the improvement of the code coverage I realized that this
> may be a good patch for a newcomer. I know several people who may be
> interested in starting to contribute to PostgreSQL. Maybe I'll be able
> to find a volunteer.

Hm. Or adding a ToDo item here https://wiki.postgresql.org/wiki/Todo
might also help?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Adding doubly linked list type which stores the number of items in the list

2022-11-07 Thread Aleksander Alekseev
Hi Bharath,

> > Regarding the improvement of the code coverage I realized that this
> > may be a good patch for a newcomer. I know several people who may be
> > interested in starting to contribute to PostgreSQL. Maybe I'll be able
> > to find a volunteer.
>
> Hm. Or adding a ToDo item here https://wiki.postgresql.org/wiki/Todo
> might also help?

Good point. Will it be better to use the "Miscellaneous Other" section
for this or create a new "Code coverage" section?

-- 
Best regards,
Aleksander Alekseev




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-07 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 10:06 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wednesday, November 2, 2022 10:50 AM Masahiko Sawada 
>  wrote:
> >
> > On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila 
> > wrote:
> > > >
> > > > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada
> >  wrote:
> > > > >
> > > > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila 
> > wrote:
> > > > > >
> > > > > > About your point that having different partition structures for
> > > > > > publisher and subscriber, I don't know how common it will be once we
> > > > > > have DDL replication. Also, the default value of
> > > > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > > > that this is a quite common case.
> > > > >
> > > > > So how can we consider these concurrent issues that could happen only
> > > > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > > > the problem or can we have a safeguard against these conflicts?
> > > > >
> > > >
> > > > Yeah, right now the strategy is to disallow parallel apply for such
> > > > cases as you can see in *0003* patch.
> > >
> > > Tightening the restrictions could work in some cases but there might
> > > still be corner cases and it could reduce the usability. I'm not really
> > > sure that we can ensure such a deadlock won't happen with the current
> > > restrictions. I think we need something safeguard just in case. For
> > > example, if the leader apply worker is waiting for a lock acquired by
> > > its parallel worker, it cancels the parallel worker's transaction,
> > > commits its transaction, and restarts logical replication. Or the
> > > leader can log the deadlock to let the user know.
> > >
> >
> > As another direction, we could make the parallel apply feature robust
> > if we can detect deadlocks that happen among the leader worker and
> > parallel workers. I'd like to summarize the idea discussed off-list
> > (with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
> > that when the leader worker or parallel worker needs to wait for
> > something (eg. transaction completion, messages) we use lmgr
> > functionality so that we can create wait-for edges and detect
> > deadlocks in lmgr.
> >
> > For example, a scenario where a deadlock occurs is the following:
> >
> > [Publisher]
> > create table tab1(a int);
> > create publication pub for table tab1;
> >
> > [Subscriber]
> > create table tab1(a int primary key);
> > create subscription sub connection 'port=1 dbname=postgres'
> > publication pub with (streaming = parallel);
> >
> > TX1:
> > BEGIN;
> > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
> > Tx2:
> > BEGIN;
> > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- 
> > streamed
> > COMMIT;
> > COMMIT;
> >
> > Suppose a parallel apply worker (PA-1) is executing TX-1 and the
> > leader apply worker (LA) is executing TX-2 concurrently on the
> > subscriber. Now, LA is waiting for PA-1 because of the unique key of
> > tab1 while PA-1 is waiting for LA to send further messages. There is a
> > deadlock between PA-1 and LA but lmgr cannot detect it.
> >
> > One idea to resolve this issue is that we have LA acquire a session
> > lock on a shared object (by LockSharedObjectForSession()) and have
> > PA-1 wait on the lock before trying to receive messages. IOW,  LA
> > acquires the lock before sending STREAM_STOP and releases it if
> > already acquired before sending STREAM_START, STREAM_PREPARE and
> > STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
> > processing STREAM_STOP and then release immediately after acquiring
> > it. That way, when PA-1 is waiting for LA, we can have a wait-edge
> > from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:
> >
> > LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
> > object) -> LA
> >
> > We would need the shared objects per parallel apply worker.
> >
> > After detecting a deadlock, we can restart logical replication with
> > temporarily disabling the parallel apply, which is done by 0005 patch.
> >
> > Another scenario is similar to the previous case but TX-1 and TX-2 are
> > executed by two parallel apply workers (PA-1 and PA-2 respectively).
> > In this scenario, PA-2 is waiting for PA-1 to complete its transaction
> > while PA-1 is waiting for subsequent input from LA. Also, LA is
> > waiting for PA-2 to complete its transaction in order to preserve the
> > commit order. There is a deadlock among three processes but it cannot
> > be detected in lmgr because the fact that LA is waiting for PA-2 to
> > complete its transaction doesn't appear in lmgr (see
> > parallel_apply_wait_for_xact_finish()). To fix it, we can use
> > XactLockTableWait() instead.
> >
> > However, since XactLockTableWait() considers PREPARED TRANSACTION as
> > still in progress, probably we need a similar trick as above in case
> > where a transact

Re: Pluggable toaster

2022-11-07 Thread Aleksander Alekseev
Hi Nikita,

> Pluggable TOAST is provided as an interface to allow developers to plug
> in custom TOAST mechanics. It does not determine whether it would be a
> universal Toaster or one for a single data type, but syntax for a universal
> Toaster is out of scope for this patchset.

If I understand correctly, what is going to happen - the same
interface TsrRoutine is going to be used for doing two things:

1) Implementing type-aware TOASTers as a special case for the default
TOAST algorithm when EXTERNAL storage strategy is used.
2) Implementing universal TOASTers from scratch that have nothing to
do with the default TOAST algorithm.

Assuming this is the case, using the same interface for doing two very
different things doesn't strike me as a great design decision. While
working on v24 you may want to rethink this.

Personally I believe that Pluggable TOASTers should support only case
#2. If there is a need of reusing parts of the default TOASTer,
corresponding pieces of code should be declared as non-static and
called from the pluggable TOASTers directly.

Alternatively we could have separate interfaces for case #1 and case
#2 but this IMO becomes rather complicated.

> I'm currently working on a revision of Pluggable TOAST that would make
> dropping a Toaster possible if there is no data TOASTed with it, along with
> several other major changes. It will be available in this commitfest (I hope
> so) or, if I don't make it in time, the following one.

Looking forward to v24!

This is a major change so I hope there will be more feedback from
other people on the mailing list.

-- 
Best regards,
Aleksander Alekseev




Re: New docs chapter on Transaction Management and related changes

2022-11-07 Thread Simon Riggs
On Mon, 7 Nov 2022 at 07:43, Bruce Momjian  wrote:
>
> On Fri, Nov  4, 2022 at 04:17:28PM +0100, Laurenz Albe wrote:
> > On Sat, 2022-10-15 at 21:08 -0400, Bruce Momjian wrote:
> > > I therefore merged all three paragraphs into
> > > one and tried to make the text saner;  release_savepoint.sgml diff
> > > attached, URL content updated.
> >
> > I wanted to have a look at this, but I am confused.  The original patch
> > was much bigger.  Is this just an incremental patch?  If yes, it would
> > be nice to have a "grand total" patch, so that I can read it all
> > in one go.
>
> Yeah, I said:
>
> Yes, I didn't like global either --- I like your wording.  I added your
> other changes too, with slight rewording.  Merged patch to be posted in
> a later email.
>
> but that was unclear.  Let me post one now, and Simon posted one too.

What I've posted is the merged patch, i.e. your latest patch, plus
changes to RELEASE SAVEPOINT from you on Oct 16, plus changes based on
the later comments from Robert and I.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Error for WITH options on partitioned tables

2022-11-07 Thread Simon Riggs
On Mon, 7 Nov 2022 at 08:55, Karina Litskevich
 wrote:
>
> Hi David,
>
> > I am not very clear about why `build_reloptions` is removed in patch
> > `v2-0002-better-error-message-for-setting-parameters-for-p.patch`, if
> > you can help explain would be great.
>
> "build_reloptions" parses "reloptions" and takes for it a list of allowed
> options defined by the 5th argument "relopt_elems" and the 6th argument
> "num_relopt_elems", which are NULL and 0 in the removed call. If "validate"
> is false, it ignores options, which are not in the list, while parsing. If
> "validate" is true, it "elog"s ERROR when it meets option, which is not in the
> allowed list.

Karina's changes make sense to me, so +1.

This is a minor patch, so I will set this as Ready For Committer.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: thinko in basic_archive.c

2022-11-07 Thread Bharath Rupireddy
On Sun, Nov 6, 2022 at 3:17 AM Nathan Bossart  wrote:
>
> On Fri, Oct 21, 2022 at 09:30:16PM +0530, Bharath Rupireddy wrote:
> > On Fri, Oct 21, 2022 at 10:43 AM Kyotaro Horiguchi
> >  wrote:
> >> Thanks, but we don't need to wipe out the all bytes. Just putting \0
> >> at the beginning of the buffer is sufficient.
> >
> > Nah, that's not a clean way IMO.
>
> Why not?  This is a commonly-used technique.  I see over 80 existing uses
> in PostgreSQL.  Plus, your shutdown callback only checks for the first
> byte, anyway.
>
> +   if (tempfilepath[0] == '\0')
> +   return;

The tempfile name can vary in size for the simple reason that a pid can have
a varying number of digits - for instance, if the tempfile name is 'foo1234'
(pid being 1234), it becomes '\0oo1234' if we just reset the first char to
'\0'; and say pid wraparound occurred, now it becomes 'bar5674' (pid being
567).

BTW, I couldn't find the 80 existing instances, can you please let me
know your search keyword?

> As noted upthread [0], I think we should be cautious to only remove the
> temporary file if we know we created it.  I still feel that trying to add
> this cleanup logic to basic_archive is probably more trouble than it's
> worth, but if there is a safe and effective way to do so, I won't object.
>
> [0] https://postgr.es/m/20221015211026.GA1821022%40nathanxps13

So, IIUC, your point here is what if the copy_file fails to create the
temp file when it already exists. With the name collision being a rare
scenario, given the pid and timestamp variables, I'm not sure if
copy_file can ever fail because the temp file already exists (with
errno EEXIST). However, if we want to be extra-cautious, checking if
temp file exists with file_exists() before calling copy_file() might
help avoid such cases. If we don't want an extra system call (via
file_exists()) to check for the temp file's existence, we can think of
sending a flag to copy_file(src, dst, &is_dst_file_created) and use
is_dst_file_created in the shutdown callback. Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Postgres auto vacuum - Disable

2022-11-07 Thread Karthik Jagadish (kjagadis)
Hi,

We have an NMS application in Cisco and use Postgres as the database.

We have a query related to disabling autovacuum. We have the below configuration
in postgres.conf where autovacuum=on is commented out.

[image]

But when checked in the database we notice that it's showing as on:

[image]

What would this mean? Does it mean that autovacuum is not disabled? Appreciate
a response.

Regards,
Karthik


RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-07 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

The following are my comments. I want to consider the patch more, but I'm
sending these once for now.

===
worker.c

01. typedef enum TransApplyAction

```
/*
 * What action to take for the transaction.
 *
 * TRANS_LEADER_APPLY means that we are in the leader apply worker and changes
 * of the transaction are applied directly in the worker.
 *
 * TRANS_LEADER_SERIALIZE means that we are in the leader apply worker or table
 * sync worker. Changes are written to temporary files and then applied when
 * the final commit arrives.
 *
 * TRANS_LEADER_SEND_TO_PARALLEL means that we are in the leader apply worker
 * and need to send the changes to the parallel apply worker.
 *
 * TRANS_PARALLEL_APPLY means that we are in the parallel apply worker and
 * changes of the transaction are applied directly in the worker.
 */
```

TRANS_LEADER_PARTIAL_SERIALIZE should be listed here as well.

02. handle_streamed_transaction()

```
+   StringInfoData  origin_msg;
...
+   origin_msg = *s;
...
+   /* Write the change to the current file */
+   stream_write_change(action,
+                       apply_action == TRANS_LEADER_SERIALIZE ?
+                       s : &origin_msg);
```

I'm not sure why origin_msg is needed. Can we remove the conditional operator?


03. apply_handle_stream_start()

```
+ * XXX We can avoid sending pairs of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one transaction at a
+ * time. However, it is not clear whether any optimization is worthwhile
+ * because these messages are sent only when the logical_decoding_work_mem
+ * threshold is exceeded.
```

This comment should be modified because PA must acquire and release locks at 
that time.


04. apply_handle_stream_prepare()

```
+   /*
+    * After sending the data to the parallel apply worker, wait for
+    * that worker to finish. This is necessary to maintain commit
+    * order which avoids failures due to transaction dependencies
+    * and deadlocks.
+    */
+   parallel_apply_wait_for_xact_finish(winfo->shared);
```

This seems not to be correct: LA may not send data but instead spill changes to a file.

05. apply_handle_stream_commit()

```
+   if (apply_action == TRANS_LEADER_PARTIAL_SERIALIZE)
+       stream_cleanup_files(MyLogicalRepWorker->subid, xid);
```

I'm not sure whether the stream files should be removed by LA or PAs. Could you
tell me the reason why you chose LA?

===
applyparallelworker.c

05. parallel_apply_can_start()

```
+   if (switching_to_serialize)
+   return false;
```

Could you add a comment like:
Don't start a new parallel apply worker if the leader apply worker has been 
spilling changes to the disk temporarily.

06. parallel_apply_start_worker()

```
+   /*
+    * Set the xact_state flag in the leader instead of the
+    * parallel apply worker to avoid the race condition where the leader
+    * has already started waiting for the parallel apply worker to finish
+    * processing the transaction while the child process has not yet
+    * processed the first STREAM_START and has not set the
+    * xact_state to true.
+    */
```

I thinkg the word "flag" should be used for boolean, so the comment should be 
modified.
(There are so many such code-comments, all of them should be modified.)


07. parallel_apply_get_unique_id()

```
+/*
+ * Returns the unique id among all parallel apply workers in the subscriber.
+ */
+static uint16
+parallel_apply_get_unique_id()
```

I think this function is inefficient: its computational complexity increases
linearly with the number of PAs. I think the Bitmapset data structure could
be used.
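
Just to sketch the idea (not code from the patch; the id-assignment policy
here is my assumption):

```
/* Track ids handed out so far; a bit test replaces scanning the worker list. */
static Bitmapset *pa_used_ids = NULL;

static uint16
pa_assign_unique_id(void)
{
	int			id = 1;

	while (bms_is_member(id, pa_used_ids))
		id++;
	pa_used_ids = bms_add_member(pa_used_ids, id);
	return (uint16) id;
}

static void
pa_release_unique_id(uint16 id)
{
	pa_used_ids = bms_del_member(pa_used_ids, id);
}
```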

08. parallel_apply_send_data()

```
#define CHANGES_THRESHOLD   1000
#define SHM_SEND_TIMEOUT_MS 1
```

I think the timeout may be too long. Could you tell me the background about it?


09. parallel_apply_send_data()

```
/*
 * Close the stream file if not in a streaming block; the file will
 * be reopened later.
 */
if (!stream_apply_worker)
    serialize_stream_stop(winfo->shared->xid);
```

a.
IIUC the timings when LA tries to send data but stream_apply_worker is NULL are:
* apply_handle_stream_prepare, 
* apply_handle_stream_start, 
* apply_handle_stream_abort, and
* apply_handle_stream_commit.
And at that time the state of TransApplyAction may be
TRANS_LEADER_SEND_TO_PARALLEL. When should the file be closed?

b.
Even if this is needed, I think the name of the called function should be 

Re: Postgres auto vacuum - Disable

2022-11-07 Thread Dave Page
Hi

On Mon, 7 Nov 2022 at 11:42, Karthik Jagadish (kjagadis) 
wrote:

> Hi,
>
>
>
> We have a NMS application in cisco and using postgres as a database.
>
>
>
> We have query related to disabling auto vacuum. We have below
> configuration in postgres.conf where the autovacuum=on is commented out.
>
>
>
> [image]
>
>
>
> But when checked in database we notice that it’s showing as on
>
>
>
> [image]
>
>
>
> What would this mean? Does it mean that autovacuum is not disabled?
> Appreciate a response.
>

Right. The default is for it to be enabled, so commenting out the option
does nothing. You would need to set it explicitly to off.
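
A quick way to confirm which value is actually in effect and where it comes
from (standard catalog query, nothing version-specific here):

    SHOW autovacuum;
    SELECT name, setting, boot_val, source, sourcefile
    FROM pg_settings
    WHERE name = 'autovacuum';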

BUT... you almost certainly don't want to do that. Cases where it should be
disabled are *extremely* rare. Make sure you *really* know what you're
letting yourself in for by disabling autovacuum, and don't rely on 10+ year
old performance tuning advice from random places on the internet, if that's
what you're doing.

-- 
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com


Re: pg_upgrade test failure

2022-11-07 Thread Thomas Munro
So [1] on its own didn't fix this.  My next guess is that the attached
might help.

Hmm.  Following Michael's clue that this might involve log files and
pg_ctl, I noticed one thing: pg_ctl implements
wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and
our kill emulation does CallNamedPipe().  If the server is in the
process of exiting and the kernel is cleaning up all the handles we
didn't close, is there any reason to expect the signal pipe to be
closed after the log file?

[1] 
https://www.postgresql.org/message-id/flat/20221025213055.GA8537%40telsasoft.com#9030de6c4c5e544d2b057b793a5b42af
From 9cbfd5c1fc1327443152d89ca7f5346bfeda3f52 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 7 Nov 2022 17:43:42 +1300
Subject: [PATCH] Refactor rmtree() to use get_dirent_type().

Switch to get_dirent_type() instead of lstat() while traversing a
directory graph, to see if that fixes intermittent ENOTEMPTY failures
while cleaning up after the pg_upgrade tests.

The idea is that the offending directory entries must be
STATUS_DELETE_PENDING.  With this change, rmtree() should not skip them,
and try to unlink again.  That should either wait a short time for them
to go away (because other processes close the handle) or log a message
to tell us the path of the problem file if not so we can dig further.

Discussion: https://postgre.es/m/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de
---
 src/common/rmtree.c | 94 ++---
 1 file changed, 37 insertions(+), 57 deletions(-)

diff --git a/src/common/rmtree.c b/src/common/rmtree.c
index 221d0e20a7..736e34892c 100644
--- a/src/common/rmtree.c
+++ b/src/common/rmtree.c
@@ -20,13 +20,16 @@
 #include 
 #include 
 
+#include "common/file_utils.h"
+
 #ifndef FRONTEND
 #define pg_log_warning(...) elog(WARNING, __VA_ARGS__)
+#define LOG_LEVEL WARNING
 #else
 #include "common/logging.h"
+#define LOG_LEVEL PG_LOG_WARNING
 #endif
 
-
 /*
  *	rmtree
  *
@@ -41,82 +44,59 @@
 bool
 rmtree(const char *path, bool rmtopdir)
 {
-	bool		result = true;
 	char		pathbuf[MAXPGPATH];
-	char	  **filenames;
-	char	  **filename;
-	struct stat statbuf;
-
-	/*
-	 * we copy all the names out of the directory before we start modifying
-	 * it.
-	 */
-	filenames = pgfnames(path);
+	DIR		   *dir;
+	struct dirent *de;
+	bool		result = true;
 
-	if (filenames == NULL)
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		pg_log_warning("could not open directory \"%s\": %m", path);
 		return false;
+	}
 
-	/* now we have the names we can start removing things */
-	for (filename = filenames; *filename; filename++)
+	while (errno = 0, (de = readdir(dir)))
 	{
-		snprintf(pathbuf, MAXPGPATH, "%s/%s", path, *filename);
-
-		/*
-		 * It's ok if the file is not there anymore; we were just about to
-		 * delete it anyway.
-		 *
-		 * This is not an academic possibility. One scenario where this
-		 * happens is when bgwriter has a pending unlink request for a file in
-		 * a database that's being dropped. In dropdb(), we call
-		 * ForgetDatabaseSyncRequests() to flush out any such pending unlink
-		 * requests, but because that's asynchronous, it's not guaranteed that
-		 * the bgwriter receives the message in time.
-		 */
-		if (lstat(pathbuf, &statbuf) != 0)
-		{
-			if (errno != ENOENT)
-			{
-pg_log_warning("could not stat file or directory \"%s\": %m",
-			   pathbuf);
-result = false;
-			}
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
 			continue;
-		}
-
-		if (S_ISDIR(statbuf.st_mode))
-		{
-			/* call ourselves recursively for a directory */
-			if (!rmtree(pathbuf, true))
-			{
-/* we already reported the error */
-result = false;
-			}
-		}
-		else
+		snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
+		switch (get_dirent_type(pathbuf, de, false, LOG_LEVEL))
 		{
-			if (unlink(pathbuf) != 0)
-			{
-if (errno != ENOENT)
+			case PGFILETYPE_ERROR:
+/* already logged, press on */
+break;
+			case PGFILETYPE_DIR:
+if (!rmtree(pathbuf, true))
+	result = false;
+break;
+			default:
+if (unlink(pathbuf) != 0 && errno != ENOENT)
 {
-	pg_log_warning("could not remove file or directory \"%s\": %m",
-   pathbuf);
+	pg_log_warning("could not unlink file \"%s\": %m", pathbuf);
 	result = false;
 }
-			}
+break;
 		}
 	}
 
+	if (errno != 0)
+	{
+		pg_log_warning("could not read directory \"%s\": %m", path);
+		result = false;
+	}
+
+	closedir(dir);
+
 	if (rmtopdir)
 	{
 		if (rmdir(path) != 0)
 		{
-			pg_log_warning("could not remove file or directory \"%s\": %m",
-		   path);
+			pg_log_warning("could not remove directory \"%s\": %m", path);
 			result = false;
 		}
 	}
 
-	pgfnames_cleanup(filenames);
-
 	return result;
 }
-- 
2.30.2



Re: Collation version tracking for macOS

2022-11-07 Thread Peter Eisentraut

On 02.11.22 00:57, Thomas Munro wrote:

> 3.  Library availability.  This is a problem for downstream
> communities to solve.  For example, the people who build Windows
> installers might want to start bundling the ICU versions from their
> earlier releases, the people involved with each Linux/BSD distro would
> hopefully figure out a good way to publish the packages from older OS
> releases in one repo, and the people running managed systems probably
> do their own packaging anyway, they'll figure it out.  I realise that
> you are involved in packaging and I am not, so we probably have
> different perspectives: I get to say "and here, magic happens!" :-)


I made a Homebrew repository for ICU versions 50 through 72: 
https://github.com/petere/homebrew-icu


All of these packages build and pass their self-tests on my machine.  So 
from that experience, I think maintaining a repository of ICU versions, 
and being able to install more than one for testing this feature, is 
feasible.


Now I have started building PostgreSQL against these, to get some 
baseline of what is supported and actually works.  The results are a bit 
mixed so far, more to come later.


The installation instructions currently say that the minimum required 
version of ICU is 4.2.  That was the one that shipped with RHEL 6.  I 
think we have de-supported RHEL 6 and could increase that.  The version 
in RHEL 7 is 50.


(My repository happens to start at 50 because the new versioning system 
started at 49, but 49 doesn't appear to be tagged at the icu github site.)


Note: Recent versions of libxml2 link against icu.  This isn't a 
problem, thanks to the symbol versioning, but if you get libxml2 via 
pkg-config, you might get LDFLAGS for an icu version other than the one 
you wanted.






Re: Postgres auto vacuum - Disable

2022-11-07 Thread Karthik Jagadish (kjagadis)
Hi,

Thanks for the response.

I have a follow-up question: the vacuum process is waiting and not doing
its job. When we grep for waiting processes we see the below output. Whenever
we see this, we notice that the vacuum is not happening and the system is
running out of space.

[root@zpah0031 ~]# ps -ef | grep 'waiting'
postgres  8833 62646  0 Jul28 ?00:00:00 postgres: postgres cgms [local] 
VACUUM waiting
postgres 18437 62646  0 Jul27 ?00:00:00 postgres: postgres cgms [local] 
VACUUM waiting


What could be the reason as to why the vacuum is not happening? Is it because 
some lock is present in the table/db or any other reason?

Regards,
Karthik

From: Dave Page 
Date: Monday, 7 November 2022 at 5:17 PM
To: Karthik Jagadish (kjagadis) 
Cc: pgsql-hack...@postgresql.org , Chandruganth 
Ayyavoo Selvam (chaayyav) , Prasanna Satyanarayanan 
(prassaty) , Jaganbabu M (jmunusam) , 
Joel Mariadasan (jomariad) 
Subject: Re: Postgres auto vacuum - Disable
Hi

On Mon, 7 Nov 2022 at 11:42, Karthik Jagadish (kjagadis) 
mailto:kjaga...@cisco.com>> wrote:
Hi,

We have a NMS application in cisco and using postgres as a database.

We have query related to disabling auto vacuum. We have below configuration in 
postgres.conf where the autovacuum=on is commented out.

[image]

But when checked in database we notice that it’s showing as on

[image]

What would this mean? Does it mean that autovacuum is not disabled? Appreciate 
a response.

Right. The default is for it to be enabled, so commenting out the option does 
nothing. You would need to set it explicitly to off.

BUT... you almost certainly don't want to do that. Cases where it should be 
disabled are *extremely* rare. Make sure you *really* know what you're letting 
yourself in for by disabling autovacuum, and don't rely on 10+ year old 
performance tuning advice from random places on the internet, if that's what 
you're doing.

--
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com


Re: refactor ownercheck and aclcheck functions

2022-11-07 Thread Antonin Houska
Peter Eisentraut  wrote:

> These patches take the dozens of mostly-duplicate pg_foo_ownercheck() and
> pg_foo_aclcheck() functions and replace (most of) them by common functions
> that are driven by the ObjectProperty table.  All the required information is
> already in that table.
> 
> This is similar to the consolidation of the drop-by-OID functions that we did
> a while ago (b1d32d3e3230f00b5baba08f75b4f665c7d6dac6).

I've reviewed this patch, as it's related to my patch [1] (In particular, it
reduces the size of my patch a little bit). I like the idea to reduce the
amount of (almost) copy & pasted code. I haven't found any problem in your
patch that would be worth mentioning, except that the 0001 part does not apply
to the current master branch.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Postgres auto vacuum - Disable

2022-11-07 Thread Laurenz Albe
On Mon, 2022-11-07 at 12:12 +, Karthik Jagadish (kjagadis) wrote:
> I have follow-up question where the vacuum process is waiting and not doing 
> it’s job.
> When we grep on waiting process we see below output. Whenever we see this we 
> notice
> that the vacuum is not happening and the system is running out of space.
>  
> [root@zpah0031 ~]# ps -ef | grep 'waiting'
> postgres  8833 62646  0 Jul28 ?        00:00:00 postgres: postgres cgms 
> [local] VACUUM waiting
> postgres 18437 62646  0 Jul27 ?        00:00:00 postgres: postgres cgms 
> [local] VACUUM waiting
>  
>  
> What could be the reason as to why the vacuum is not happening? Is it because 
> some lock is
> present in the table/db or any other reason?

Look in "pg_stat_activity".  I didn't check, but I'm sure it's the intentional 
break
configured with "autovacuum_vacuum_cost_delay".  Reduce that parameter for more
autovacuum speed.

Yours,
Laurenz Albe




Re: collect_corrupt_items_vacuum.patch

2022-11-07 Thread Nikita Malakhov
Hi hackers!

Daniel is busy with other tasks. I've found this topic, and this problem
seems to be relevant for v15 too.
Please correct me if I am wrong. I've checked another discussion related to
pg_visibility [1].
According to that discussion: if using the latest completed xid is not right
for checking visibility, then it should be the oldest running transaction
xid? So it must be another function that is used for these calculations,
not GetOldestNonRemovableTransactionId, which uses ComputeXidHorizons.

[1]
https://www.postgresql.org/message-id/flat/c0610352-8433-ab4b-986d-0e803c628efe%40postgrespro.ru

On Wed, Oct 12, 2022 at 8:15 AM Michael Paquier  wrote:

> On Wed, Jul 27, 2022 at 09:47:19PM -0400, Robert Haas wrote:
> > On Wed, Jul 27, 2022 at 5:56 PM Tom Lane  wrote:
> > > Maybe we need a different function for pg_visibility to call?
> > > If we want ComputeXidHorizons to serve both these purposes, then it
> > > has to always deliver exactly the right answer, which seems like
> > > a definition that will be hard and expensive to achieve.
> >
> > Yeah, I was thinking along similar lines.
> >
> > I'm also kind of wondering why these calculations use
> > latestCompletedXid. Is that something we do solely to reduce locking?
> > The XIDs of running transactions matter, and their snapshots matter,
> > and the XIDs that could start running in the future matter, but I
> > don't know why it matters what the latest completed XID is.
>
> Daniel, it seems to me that this thread is waiting for some input from
> you, based on the remarks of Tom and Robert.  Are you planning to do
> so?  This is marked as a bug fix, so I have moved this item to the
> next CF for now.
> --
> Michael
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: Postgres auto vacuum - Disable

2022-11-07 Thread Julien Rouhaud
Hi,

On Mon, Nov 07, 2022 at 02:22:56PM +0100, Laurenz Albe wrote:
> On Mon, 2022-11-07 at 12:12 +, Karthik Jagadish (kjagadis) wrote:
> > I have follow-up question where the vacuum process is waiting and not doing 
> > it’s job.
> > When we grep on waiting process we see below output. Whenever we see this 
> > we notice
> > that the vacuum is not happening and the system is running out of space.
> >  
> > [root@zpah0031 ~]# ps -ef | grep 'waiting'
> > postgres  8833 62646  0 Jul28 ?        00:00:00 postgres: postgres cgms 
> > [local] VACUUM waiting
> > postgres 18437 62646  0 Jul27 ?        00:00:00 postgres: postgres cgms 
> > [local] VACUUM waiting
> >  
> >  
> > What could be the reason as to why the vacuum is not happening? Is it 
> > because some lock is
> > present in the table/db or any other reason?
>
> Look in "pg_stat_activity".  I didn't check, but I'm sure it's the 
> intentional break
> configured with "autovacuum_vacuum_cost_delay".  Reduce that parameter for 
> more
> autovacuum speed.

Really?  An autovacuum worker would be displayed as "autovacuum worker"; these
look like plain backends to me, where an interactive VACUUM has been issued
and is waiting on a heavyweight lock.
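
If that's the case, something like the following should show who is holding
the lock those VACUUMs are waiting on (assuming PostgreSQL 9.6 or later for
pg_blocking_pids()):

    SELECT pid, pg_blocking_pids(pid) AS blocked_by,
           wait_event_type, wait_event, state, query
    FROM pg_stat_activity
    WHERE query ILIKE 'vacuum%';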




Re: Skipping schema changes in publication

2022-11-07 Thread vignesh C
On Fri, 4 Nov 2022 at 08:19, Ian Lawrence Barwick  wrote:
>
> Hi
>
> cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
> currently underway, this would be an excellent time to update the patch.
>
> [1] http://cfbot.cputube.org/patch_40_3646.log

Here is an updated patch which is rebased on top of HEAD.

Regards,
Vignesh
From 22d17bd8eeb2c35870b80cbc41c23e89e2c8f981 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Sun, 6 Nov 2022 06:34:44 +0530
Subject: [PATCH v9 1/2] Add RESET clause to Alter Publication which will reset
 the publication with default values.

This patch adds a new RESET clause to ALTER PUBLICATION which will reset
the publication to the default state which includes resetting the publication
parameters, setting ALL TABLES flag to false and dropping the relations and
schemas that are associated with the publication.
Usage:
ALTER PUBLICATION pub1 RESET;
---
 doc/src/sgml/ref/alter_publication.sgml   |  25 +-
 src/backend/commands/publicationcmds.c| 105 --
 src/backend/parser/gram.y |   9 ++
 src/bin/psql/tab-complete.c   |   2 +-
 src/include/nodes/parsenodes.h|   3 +-
 src/test/regress/expected/publication.out | 101 +
 src/test/regress/sql/publication.sql  |  50 +++
 7 files changed, 285 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml
index c84b11f47a..2f0509ec43 100644
--- a/doc/src/sgml/ref/alter_publication.sgml
+++ b/doc/src/sgml/ref/alter_publication.sgml
@@ -27,6 +27,7 @@ ALTER PUBLICATION name DROP name SET ( publication_parameter [= value] [, ... ] )
 ALTER PUBLICATION name OWNER TO { new_owner | CURRENT_ROLE | CURRENT_USER | SESSION_USER }
 ALTER PUBLICATION name RENAME TO new_name
+ALTER PUBLICATION name RESET
 
 where publication_object is one of:
 
@@ -67,14 +68,26 @@ ALTER PUBLICATION name RENAME TO 
 
   
-   The remaining variants change the owner and the name of the publication.
+   The OWNER clause will change the owner of the publication.
+  
+
+  
+   The RENAME clause will change the name of the publication.
+  
+
+  
+   The RESET clause will reset the publication to the
+   default state which includes resetting the publication parameters, setting
+   ALL TABLES flag to false and
+   dropping all relations and schemas that are associated with the publication.
   
 
   
You must own the publication to use ALTER PUBLICATION.
Adding a table to a publication additionally requires owning that table.
-   The ADD TABLES IN SCHEMA and
-   SET TABLES IN SCHEMA to a publication requires the
+   The ADD TABLES IN SCHEMA,
+   SET TABLES IN SCHEMA to a publication and
+   RESET of publication requires the
invoking user to be a superuser.  To alter the owner, you must also be a
direct or indirect member of the new owning role. The new owner must have
CREATE privilege on the database.  Also, the new owner
@@ -210,6 +223,12 @@ ALTER PUBLICATION sales_publication ADD TABLES IN SCHEMA marketing, sales;
production_publication:
 
 ALTER PUBLICATION production_publication ADD TABLE users, departments, TABLES IN SCHEMA production;
+
+
+  
+   Reset the publication production_publication:
+
+ALTER PUBLICATION production_publication RESET;
 
  
 
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index a8b75eb1be..3cff1635bf 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -54,6 +54,14 @@
 #include "utils/varlena.h"
 
 
+/* CREATE PUBLICATION default values for flags and publication parameters */
+#define PUB_DEFAULT_ACTION_INSERT true
+#define PUB_DEFAULT_ACTION_UPDATE true
+#define PUB_DEFAULT_ACTION_DELETE true
+#define PUB_DEFAULT_ACTION_TRUNCATE true
+#define PUB_DEFAULT_VIA_ROOT false
+#define PUB_DEFAULT_ALL_TABLES false
+
 /*
  * Information used to validate the columns in the row filter expression. See
  * contain_invalid_rfcolumn_walker for details.
@@ -92,11 +100,11 @@ parse_publication_options(ParseState *pstate,
 	*publish_via_partition_root_given = false;
 
 	/* defaults */
-	pubactions->pubinsert = true;
-	pubactions->pubupdate = true;
-	pubactions->pubdelete = true;
-	pubactions->pubtruncate = true;
-	*publish_via_partition_root = false;
+	pubactions->pubinsert = PUB_DEFAULT_ACTION_INSERT;
+	pubactions->pubupdate = PUB_DEFAULT_ACTION_UPDATE;
+	pubactions->pubdelete = PUB_DEFAULT_ACTION_DELETE;
+	pubactions->pubtruncate = PUB_DEFAULT_ACTION_TRUNCATE;
+	*publish_via_partition_root = PUB_DEFAULT_VIA_ROOT;
 
 	/* Parse options */
 	foreach(lc, options)
@@ -1078,6 +1086,91 @@ InvalidatePublicationRels(List *relids)
 		CacheInvalidateRelcacheAll();
 }
 
+/*
+ * Reset the publication.
+ *
+ * Reset the publication parameters, setting ALL TABLES flag to false and drop
+ * all relations and schemas that are associated with the publication.
+ */
+static void
+AlterPublicationR

Re: Code checks for App Devs, using new options for transaction behavior

2022-11-07 Thread Simon Riggs
On Wed, 2 Nov 2022 at 07:40, Simon Riggs  wrote:

> > What will be the behavior if someone declares a savepoint with this
> > name ("_internal_nested_xact").  Will this interfere with this new
> > functionality?
>
> Clearly! I guess you are saying we should disallow them.
>
> > Have we tested scenarios like that?
>
> No, but that can be done.

More tests as requested, plus minor code rework, plus comment updates.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


002_nested_xacts.v10.patch
Description: Binary data


001_psql_parse_only.v1.patch
Description: Binary data


003_rollback_on_commit.v1.patch
Description: Binary data


Re: Postgres auto vacuum - Disable

2022-11-07 Thread Karthik Jagadish (kjagadis)
Hi Again,

Is there any difference in the way vacuum is handled in Postgres 9.6 and
Postgres 12.9? We are noticing the below issue of waiting processes only after
upgrading to Postgres 12.5.

$ ps -ef | grep 'waiting'
postgres  8833 62646  0 Jul28 ?00:00:00 postgres: postgres cgms [local] 
VACUUM waiting
postgres 18437 62646  0 Jul27 ?00:00:00 postgres: postgres cgms [local] 
VACUUM waiting

Regards,
Karthik

From: Julien Rouhaud 
Date: Monday, 7 November 2022 at 7:06 PM
To: Laurenz Albe 
Cc: Karthik Jagadish (kjagadis) , Dave Page 
, pgsql-hack...@postgresql.org 
, Chandruganth Ayyavoo Selvam (chaayyav) 
, Prasanna Satyanarayanan (prassaty) , 
Jaganbabu M (jmunusam) , Joel Mariadasan (jomariad) 

Subject: Re: Postgres auto vacuum - Disable
Hi,

On Mon, Nov 07, 2022 at 02:22:56PM +0100, Laurenz Albe wrote:
> On Mon, 2022-11-07 at 12:12 +, Karthik Jagadish (kjagadis) wrote:
> > I have follow-up question where the vacuum process is waiting and not doing 
> > it’s job.
> > When we grep on waiting process we see below output. Whenever we see this 
> > we notice
> > that the vacuum is not happening and the system is running out of space.
> >
> > [root@zpah0031 ~]# ps -ef | grep 'waiting'
> > postgres  8833 62646  0 Jul28 ?00:00:00 postgres: postgres cgms 
> > [local] VACUUM waiting
> > postgres 18437 62646  0 Jul27 ?00:00:00 postgres: postgres cgms 
> > [local] VACUUM waiting
> >
> >
> > What could be the reason as to why the vacuum is not happening? Is it 
> > because some lock is
> > present in the table/db or any other reason?
>
> Look in "pg_stat_activity".  I didn't check, but I'm sure it's the 
> intentional break
> configured with "autovacuum_vacuum_cost_delay".  Reduce that parameter for 
> more
> autovacuum speed.

Really?  An autovacuum should be displayed as "autovacuum worker", this looks
like plain backends to me, where an interactive VACUUM has been issued and is
waiting on a heavyweight lock.


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-11-07 Thread Ronan Dunklau
Le samedi 5 novembre 2022, 09:51:23 CET Pavel Luzanov a écrit :
> While playing with the patch I found a situation where the performance
> may be degraded compared to previous versions.
> 
> The test case below.
> If you create a proper index for the query (a,c), version 16 wins. On my
> notebook, the query runs ~50% faster.
> But if there is no index (a,c), but only (a,b), in previous versions the
> planner uses it, but with this patch a full table scan is selected.

Hello,

In your exact use case, the combo incremental-sort + Index scan is evaluated 
to cost more than doing a full sort + seqscan. 

If you try for example to create an index on (b, a) and group by b, you will 
get the expected behaviour:

ro=# create index on t (b, a);
CREATE INDEX
ro=# explain  select b, array_agg(c order by c) from t group by b;
   QUERY PLAN   
 
-
 GroupAggregate  (cost=10.64..120926.80 rows=9970 width=36)
   Group Key: b
   ->  Incremental Sort  (cost=10.64..115802.17 rows=100 width=6)
 Sort Key: b, c
 Presorted Key: b
 ->  Index Scan using t_b_a_idx on t  (cost=0.42..47604.12 
rows=100 width=6)
(6 rows)

I think we can trace that back to incremental sort being pessimistic about 
its performance. If you try the same query, but with set enable_seqscan = off, 
you will get a full sort over an index scan:

   QUERY PLAN   
 
-
 GroupAggregate  (cost=154944.94..162446.19 rows=100 width=34)
   Group Key: a
   ->  Sort  (cost=154944.94..157444.94 rows=100 width=4)
 Sort Key: a, c
 ->  Index Scan using t_a_b_idx on t  (cost=0.42..41612.60 
rows=100 width=4)
(5 rows)


This probably comes from the overly pessimistic assumption that the number of 
tuples per group will be 1.5 times what we would otherwise estimate:

/*
 * Estimate average cost of sorting of one group where presorted 
keys are
 * equal.  Incremental sort is sensitive to distribution of tuples 
to the
 * groups, where we're relying on quite rough assumptions.  Thus, 
we're
 * pessimistic about incremental sort performance and increase its 
average
 * group size by half.
 */

I can't see why an incremental sort could be more expensive than a full sort 
using the same presorted path. It looks to me like in that case we should 
always prefer an incremental sort. Maybe we should bound the incremental sort's 
cost to make sure it is never more expensive than the full sort?
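
Just to make that concrete, something like the following is what I have in 
mind, purely as a sketch (the fullsort_* variables don't exist today; they 
would have to be computed beforehand from the same presorted input, e.g. with 
the helper that already costs a plain sort):

	/*
	 * Hypothetical clamp at the end of cost_incremental_sort(): never
	 * report the incremental sort as costlier than sorting the whole
	 * input at once.  fullsort_startup_cost / fullsort_run_cost are
	 * made-up names for the plain-sort figures computed beforehand.
	 */
	if (startup_cost + run_cost > fullsort_startup_cost + fullsort_run_cost)
	{
		startup_cost = fullsort_startup_cost;
		run_cost = fullsort_run_cost;
	}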

Also, prior to this commit I don't think it made a real difference, because in 
the worst case we would just have missed an incremental sort, which we didn't 
have beforehand. But with this patch, we may actually replace a "hidden" 
incremental sort, which was previously done in the agg code path, with a full sort. 

Best regards,

--
Ronan Dunklau






Re: WIN32 pg_import_system_collations

2022-11-07 Thread Peter Eisentraut

On 04.11.22 23:08, Juan José Santamaría Flecha wrote:

Ok, I can definitely improve the comments for that function.

Also consider describing in the commit message what you are doing in
more detail, including some of the things that have been discussed in
this thread.

Going through the thread for the commit message, I think that maybe the 
collation naming remarks were not properly addressed. In the current 
version the collations retain their native name, but an alias is created 
for those whose shape lets us assume a POSIX equivalent exists.


This looks pretty good to me.  The refactoring of the non-Windows parts 
makes sense.  The Windows parts look reasonable on manual inspection, 
but again, I don't have access to Windows here, so someone else should 
also look it over.


A small style issue: Change return (TRUE) to return TRUE.

The code

+   if (strlen(localebuf) == 5 && localebuf[2] == '-')

might be too specific.  At least on some POSIX systems, I have seen 
locales with a three-letter language name.  Maybe you should look with 
strchr() and not be too strict about the exact position.
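
Something along these lines, only as a sketch (the variable names around the 
quoted line are mine, not from the patch):

	char	   *hyphen = strchr(localebuf, '-');
	bool		looks_mappable;

	/*
	 * Accept "en-US" as well as three-letter languages such as "chr-US",
	 * rather than requiring the hyphen at offset 2.
	 */
	looks_mappable = (hyphen != NULL &&
					  hyphen > localebuf &&
					  hyphen[1] != '\0');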


For the test patch, why is a separate test for non-UTF8 needed on 
Windows?  Does the UTF8 one not work?


+   version() !~ 'Visual C\+\+'

This probably won't work for MinGW.





Re: Improve logging when using Huge Pages

2022-11-07 Thread Andres Freund
Hi,

On 2022-11-06 14:04:29 +0700, John Naylor wrote:
> I think the best thing to do is change huge_pages='on' to log a WARNING and
> fallback to regular pages if the mapping fails. That way, both dev and prod
> can keep the same settings, since 'on' will have both visibility and
> robustness. I don't see a good reason to refuse to start -- seems like an
> anti-pattern.

How would 'on' still have robustness if it doesn't actually do anything other
than cause a WARNING? The use of huge pages can have very substantial effects
on memory usage and performance. And it's easy for huge_pages to just fail:
another program that started earlier could have used up the huge pages, or some
config variable was changed to increase shared memory usage...

I am strongly opposed to making 'on' fall back to not using huge pages. That's
what 'try' is for.

I know of people that scripted cluster start so that they start with 'on' and
change the system setting of the number of huge pages according to the error
message.

Greetings,

Andres Freund




Re: Draft back-branch release notes are up

2022-11-07 Thread Jonathan S. Katz

On 11/6/22 11:14 AM, Tom Lane wrote:

Justin Pryzby  writes:

+  Fix planner failure with extended statistics on partitioned tables
+  (Richard Guo, Justin Pryzby)



This can also happen with inheritance tables.



+  Add missing guards for NULL connection pointer



Maybe should be NULL or NULL ?


Done, thanks.


Hopefully not too late, but I noticed

> Fix CREATE DATABASE to allow its oid parameter to exceed 231

which should be 2^31 according to 2c6d4365

Thanks,

Jonathan





Re: Draft back-branch release notes are up

2022-11-07 Thread Jonathan S. Katz

On 11/7/22 10:30 AM, Jonathan S. Katz wrote:

On 11/6/22 11:14 AM, Tom Lane wrote:

Justin Pryzby  writes:
+  Fix planner failure with extended statistics on partitioned 
tables

+  (Richard Guo, Justin Pryzby)



This can also happen with inheritance tables.



+  Add missing guards for NULL connection pointer



Maybe should be NULL or NULL ?


Done, thanks.


Hopefully not too late, but I noticed

 > Fix CREATE DATABASE to allow its oid parameter to exceed 231

which should be 2^31 according to 2c6d4365


Oh ignore, I had a copy-and-paste error into my plaintext editor.

Thanks,

Jonathan





2022-11-10 release announcement draft

2022-11-07 Thread Jonathan S. Katz

Hi,

Attached is a draft of the release announcement for the 2022-11-10 release.

Please provide feedback no later than 2022-11-10 0:00 AoE[1].

Thanks,

Jonathan

[1] https://en.wikipedia.org/wiki/Anywhere_on_Earth
The PostgreSQL Global Development Group has released an update to all supported
versions of PostgreSQL, including 15.1, 14.6, 13.9, 12.13, 11.18, and 10.23.
This release fixes 25 bugs reported over the last several months.

This is the **final release of PostgreSQL 10**. PostgreSQL 10 will no longer
receive
[security and bug fixes](https://www.postgresql.org/support/versioning/).
If you are running PostgreSQL 10 in a production environment, we suggest that
you make plans to upgrade.

For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

Bug Fixes and Improvements
--

This update fixes over 25 bugs that were reported in the last several months.
The issues listed below affect PostgreSQL 15. Some of these issues may also
affect other supported versions of PostgreSQL.

Included in this release:

* Fix for updatable views for `INSERT` statements that include multi-row
`VALUES` clauses with a `DEFAULT` set.
* Disallow rules named `_RETURN` that are not `ON SELECT` rules.
* Disallow use of `MERGE` on a partitioned table that has foreign-table
partitions.
* Fix for construction of per-partition foreign key constraints while doing
`ALTER TABLE ... ATTACH PARTITION`, where previously incorrect or duplicate
constraints could be built.
* Fix for a planner failure with extended statistics on partitioned or inherited
tables.
* Fix bugs in logical decoding that could lead to memory leaks when replay
starts from a point between the beginning of a transaction and the beginning of
its subtransaction.
* Fix issues with slow shutdown of replication workers by allowing interrupts
in more places.
* Disallow logical replication into foreign-table partitions.
* Prevent crash in replication workers after a SQL or PL/pgSQL function syntax
error.
* `psql -c` now exists with a nonzero status if the query is canceled.
* Allow cross-platform tablespace relocation in `pg_basebackup`.
* Fix pg_dump to include comments attached to some `CHECK` constraints.

This release also updates time zone data files to use tzdata release 2022f. This
includes DST law changes in Chile, Fiji, Iran, Jordan, Mexico, Palestine, and
Syria, plus historical corrections for Chile, Crimea, Iran, and Mexico.

There are several other changes to note in the tzdata 2022f release that may
change the display for pre-1970 timestamps. For a detailed explanation, please
review the [release notes](https://www.postgresql.org/docs/release/).

For the full list of changes available, please review the
[release notes](https://www.postgresql.org/docs/release/).

PostgreSQL 10 is EOL


PostgreSQL 10.23 is the final release of PostgreSQL 10. If you are running
PostgreSQL 10 in a production environment, we suggest that you make plans to
upgrade to a newer, supported version of PostgreSQL. Please see our
[versioning policy](https://www.postgresql.org/support/versioning/) for more
information.

Updating


All PostgreSQL update releases are cumulative. As with other minor releases,
users are not required to dump and reload their database or use `pg_upgrade` in
order to apply this update release; you may simply shutdown PostgreSQL and
update its binaries.

Users who have skipped one or more update releases may need to run additional,
post-update steps; please see the release notes for earlier versions for
details.

For more details, please see the
[release notes](https://www.postgresql.org/docs/release/).

Links
-
* [Download](https://www.postgresql.org/download/)
* [Release Notes](https://www.postgresql.org/docs/release/)
* [Security](https://www.postgresql.org/support/security/)
* [Versioning Policy](https://www.postgresql.org/support/versioning/)
* [Follow @postgresql on Twitter](https://twitter.com/postgresql)




Re: Improve logging when using Huge Pages

2022-11-07 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-06 14:04:29 +0700, John Naylor wrote:
>> I think the best thing to do is change huge_pages='on' to log a WARNING and
>> fallback to regular pages if the mapping fails.

> I am strongly opposed to making 'on' fall back to not using huge pages. That's
> what 'try' is for.

Agreed --- changing "on" to be exactly like "try" isn't an improvement.
If you want "try" semantics, choose "try".

regards, tom lane




Re: 2022-11-10 release announcement draft

2022-11-07 Thread Erikjan Rijkers

Op 07-11-2022 om 16:51 schreef Jonathan S. Katz:

Hi,

Attached is a draft of the release announcement for the 2022-11-10 release.

Please provide feedback no later than 2022-11-10 0:00 AoE[1].


'now exists'  should be (I think)
'now exits'


Erik



Thanks,

Jonathan

[1] https://en.wikipedia.org/wiki/Anywhere_on_Earth





Re: PL/pgSQL cursors should get generated portal names by default

2022-11-07 Thread Jan Wieck

On 11/4/22 19:46, Tom Lane wrote:

Jan Wieck  writes:
I need to do some testing on this. I seem to recall that the naming was 
originally done because a reference cursor is basically a named cursor 
that can be handed around between functions and even the top SQL level 
of the application. For the latter to work the application needs to know 
the name of the portal.


Right.  With this patch, it'd be necessary to hand back the actual
portal name (by returning the refcursor value), or else manually
set the refcursor value before OPEN to preserve the previous behavior.
But as far as I saw, all our documentation examples show handing back
the portal name, so I'm hoping most people do it like that already.


I was mostly concerned that we may unintentionally break underdocumented 
behavior that was originally implemented on purpose. As long as everyone 
is aware that this is breaking backwards compatibility in the way it 
does, that's fine.




I am currently down with Covid and have trouble focusing. But I hope to 
get to it some time next week.


Get well soon!


Thanks, Jan





Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-07 Thread Andres Freund
Hi,

On 2022-11-02 00:28:46 +1300, David Rowley wrote:
> Part of the work that Thomas mentions in [1], regarding Direct I/O,
> diff --git a/src/backend/utils/mmgr/alignedalloc.c 
> b/src/backend/utils/mmgr/alignedalloc.c
> new file mode 100644
> index 00..e581772758
> --- /dev/null
> +++ b/src/backend/utils/mmgr/alignedalloc.c
> @@ -0,0 +1,93 @@
> +/*-
> + *
> + * alignedalloc.c
> + * Allocator functions to implement palloc_aligned

FWIW, to me this is really more part of mcxt.c than its own
allocator... Particularly because MemoryContextAllocAligned() et al are
implemented there.


> +void *
> +AlignedAllocRealloc(void *pointer, Size size)

I doubt there's ever a need to realloc such a pointer? Perhaps we could just
elog(ERROR)?


> +/*
> + * MemoryContextAllocAligned
> + *   Allocate 'size' bytes of memory in 'context' aligned to 
> 'alignto'
> + *   bytes.
> + *
> + * 'flags' may be 0 or set the same as MemoryContextAllocExtended().
> + * 'alignto' must be a power of 2.
> + */
> +void *
> +MemoryContextAllocAligned(MemoryContext context,
> +   Size size, Size alignto, int 
> flags)
> +{
> + Sizealloc_size;
> + void   *unaligned;
> + void   *aligned;
> +
> + /* wouldn't make much sense to waste that much space */
> + Assert(alignto < (128 * 1024 * 1024));
> +
> + /* ensure alignto is a power of 2 */
> + Assert((alignto & (alignto - 1)) == 0);

Hm, not that I can see a case for ever not using a power of two
alignment... There's not really a "need" for the restriction, right? Perhaps
we should note that?


> + /*
> +  * We implement aligned pointers by simply allocating enough memory for
> +  * the requested size plus the alignment and an additional MemoryChunk.
> +  * This additional MemoryChunk is required for operations such as pfree
> +  * when used on the pointer returned by this function.  We use this
> +  * "redirection" MemoryChunk in order to find the pointer to the memory
> +  * that was returned by the MemoryContextAllocExtended call below. We do
> +  * that by "borrowing" the block offset field and instead of using that 
> to
> +  * find the offset into the owning block, we use it to find the original
> +  * allocated address.
> +  *
> +  * Here we must allocate enough extra memory so that we can still align
> +  * the pointer returned by MemoryContextAllocExtended and also have 
> enough
> +  * space for the redirection MemoryChunk.
> +  */
> + alloc_size = size + alignto + sizeof(MemoryChunk);
> +
> + /* perform the actual allocation */
> + unaligned = MemoryContextAllocExtended(context, alloc_size, flags);

Should we handle the case where we get a suitably aligned pointer from
MemoryContextAllocExtended() differently?
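
E.g. something along these lines (a sketch only; it simply accepts the padding 
bytes we already over-allocated as wasted):

	if (((uintptr_t) unaligned & (alignto - 1)) == 0)
		return unaligned;	/* already suitably aligned, no redirection
							 * chunk needed; pfree() works as usual */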


> + /* XXX: should we adjust valgrind state here? */

Probably still need to do this... Kinda hard to get right without the code
getting exercised. Wonder if there's some minimal case we could actually use
it for?

Thanks,

Andres




Re: 2022-11-10 release announcement draft

2022-11-07 Thread Jonathan S. Katz

On 11/7/22 10:59 AM, Erikjan Rijkers wrote:

Op 07-11-2022 om 16:51 schreef Jonathan S. Katz:

Hi,

Attached is a draft of the release announcement for the 2022-11-10 
release.


Please provide feedback no later than 2022-11-10 0:00 AoE[1].


'now exists'  should be (I think)
'now exits'


Correct -- I have made that fix -- thanks!

Jonathan





Re: PL/pgSQL cursors should get generated portal names by default

2022-11-07 Thread Pavel Stehule
Dne po 7. 11. 2022 17:10 uživatel Jan Wieck  napsal:

> On 11/4/22 19:46, Tom Lane wrote:
> > Jan Wieck  writes:
> >> I need to do some testing on this. I seem to recall that the naming was
> >> originally done because a reference cursor is basically a named cursor
> >> that can be handed around between functions and even the top SQL level
> >> of the application. For the latter to work the application needs to
> know
> >> the name of the portal.
> >
> > Right.  With this patch, it'd be necessary to hand back the actual
> > portal name (by returning the refcursor value), or else manually
> > set the refcursor value before OPEN to preserve the previous behavior.
> > But as far as I saw, all our documentation examples show handing back
> > the portal name, so I'm hoping most people do it like that already.
>
> I was mostly concerned that we may unintentionally break underdocumented
> behavior that was originally implemented on purpose. As long as everyone
> is aware that this is breaking backwards compatibility in the way it
> does, that's fine.
>

In this case I find the current behavior a little bit unfortunate. It breaks any
recursive call, and it can break variable shadowing, so I prefer the change. The
possibility of a compatibility break is clear, but there is a possibility of an
easy fix, and I think I can detect some possibly incompatible usage in
plpgsql_check.

Any dependency on the current behavior is probably limited to pretty old
applications that don't use refcursors.

Regards

Pavel


> >
> >> I am currently down with Covid and have trouble focusing. But I hope to
> >> get to it some time next week.
> >
> > Get well soon!
>
> Thanks, Jan
>
>


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-11-07 Thread Pavel Luzanov

Hi,

On 07.11.2022 17:53, Ronan Dunklau wrote:

In your exact use case, the combo incremental-sort + Index scan is evaluated
to cost more than doing a full sort + seqscan.



I think we can trace that back to incremental sort being pessimistic about
it's performance. If you try the same query, but with set enable_seqscan = off,
you will get a full sort over an index scan:

                                QUERY PLAN
--------------------------------------------------------------------------
  GroupAggregate  (cost=154944.94..162446.19 rows=100 width=34)
Group Key: a
->  Sort  (cost=154944.94..157444.94 rows=100 width=4)
  Sort Key: a, c
  ->  Index Scan using t_a_b_idx on t  (cost=0.42..41612.60
rows=100 width=4)
(5 rows)


You are right. By disabling seq scan, we can get this plan. But compare 
it with the plan in v15:


postgres@db(15.0)=# explain
select a, array_agg(c order by c) from t group by a;
                                QUERY PLAN
--------------------------------------------------------------------------
 GroupAggregate  (cost=0.42..46667.56 rows=100 width=34)
   Group Key: a
   ->  Index Scan using t_a_b_idx on t  (cost=0.42..41666.31 
rows=100 width=4)

(3 rows)

The total plan cost in v16 is ~4 times higher, while the index scan cost 
remains the same.



I can't see why an incrementalsort could be more expensive than a full sort,
using the same presorted path.


The only reason I can see is the number of buffers to read. In the plan 
with incremental sort we read the whole index, ~19 buffers.
And the plan with seq scan only required ~5000 (I think due to buffer 
ring optimization).


Perhaps this behavior is preferable, especially when many concurrent 
queries are running: the less of the buffer cache we occupy, the better. But in 
single-user mode this query is slower.


--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company





allow segment size to be set to < 1GiB

2022-11-07 Thread Andres Freund
Hi,

I was recently reminded of my previous desire to allow setting the segment
size to less than 1GB. It's pretty painful to test large amount of segments
with a segment size of 1GB, certainly our regression test don't cover anything
with multiple segments.

This likely wouldn't have detected the issue fixed in 0e758ae89a2, but it make
it easier to validate that the fix doesn't break anything badly.

In the attached patch I renamed --with-segsize= to --with-segsize-mb= /
-Dsegsize= to -Dsegsize_mb=, to avoid somebody building with --with-segsize=2
or such suddenly ending up with an incompatible build.

Greetings,

Andres Freund
>From e8f67e3c27ff553f6a3a26947cac8df91136e8f5 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 7 Nov 2022 09:05:35 -0800
Subject: [PATCH] allow-smaller-segsize

---
 doc/src/sgml/installation.sgml |  6 +++---
 doc/src/sgml/storage.sgml  |  2 +-
 .cirrus.yml|  1 +
 configure  | 27 ++-
 configure.ac   | 12 ++--
 meson.build|  6 --
 meson_options.txt  |  4 ++--
 7 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 319c7e69660..20a98c3887c 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1458,13 +1458,13 @@ build-postgresql:
   
 
   
-   --with-segsize=SEGSIZE
+   --with-segsize-mb=SEGSIZE

 
- Set the segment size, in gigabytes.  Large tables are
+ Set the segment size, in megabytes.  Large tables are
  divided into multiple operating-system files, each of size equal
  to the segment size.  This avoids problems with file size limits
- that exist on many platforms.  The default segment size, 1 gigabyte,
+ that exist on many platforms.  The default segment size, 1024 megabytes,
  is safe on all supported platforms.  If your operating system has
  largefile support (which most do, nowadays), you can use
  a larger segment size.  This can be helpful to reduce the number of
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index e5b9f3f1ffa..03e0bb0ecbf 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -236,7 +236,7 @@ When a table or index exceeds 1 GB, it is divided into gigabyte-sized
 filenode; subsequent segments are named filenode.1, filenode.2, etc.
 This arrangement avoids problems on platforms that have file size limitations.
 (Actually, 1 GB is just the default segment size.  The segment size can be
-adjusted using the configuration option --with-segsize
+adjusted using the configuration option --with-segsize-mb
 when building PostgreSQL.)
 In principle, free space map and visibility map forks could require multiple
 segments as well, though this is unlikely to happen in practice.
diff --git a/.cirrus.yml b/.cirrus.yml
index 9f2282471a9..b0543f20175 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -377,6 +377,7 @@ task:
   -Dextra_lib_dirs=${brewpath}/lib \
   -Dcassert=true \
   -Dssl=openssl -Duuid=e2fs -Ddtrace=auto \
+  -Dsegsize_mb=1 \
   -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
   build
 
diff --git a/configure b/configure
index 3966368b8d9..fb55792f51b 100755
--- a/configure
+++ b/configure
@@ -843,7 +843,7 @@ enable_coverage
 enable_dtrace
 enable_tap_tests
 with_blocksize
-with_segsize
+with_segsize_mb
 with_wal_blocksize
 with_CC
 with_llvm
@@ -1553,7 +1553,8 @@ Optional Packages:
   --with-pgport=PORTNUM   set default port number [5432]
   --with-blocksize=BLOCKSIZE
   set table block size in kB [8]
-  --with-segsize=SEGSIZE  set table segment size in GB [1]
+  --with-segsize-mb=SEGSIZE
+  set table segment size in MB [1024]
   --with-wal-blocksize=BLOCKSIZE
   set WAL block size in kB [8]
   --with-CC=CMD   set compiler (deprecated)
@@ -3740,32 +3741,32 @@ $as_echo_n "checking for segment size... " >&6; }
 
 
 
-# Check whether --with-segsize was given.
-if test "${with_segsize+set}" = set; then :
-  withval=$with_segsize;
+# Check whether --with-segsize-mb was given.
+if test "${with_segsize_mb+set}" = set; then :
+  withval=$with_segsize_mb;
   case $withval in
 yes)
-  as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
+  as_fn_error $? "argument required for --with-segsize-mb option" "$LINENO" 5
   ;;
 no)
-  as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
+  as_fn_error $? "argument required for --with-segsize-mb option" "$LINENO" 5
   ;;
 *)
-  segsize=$withval
+  segsize_mb=$withval
   ;;
   esac
 
 else
-  segsize=1
+  segsize_mb=1024
 fi
 
 
 # this expression is set up to avoid unnecessary integer overflow
 # blocksize is already guaranteed to be a factor of 1024
-RELSEG_SIZE=`e

Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-11-07 Thread Ronan Dunklau
Le lundi 7 novembre 2022, 17:58:50 CET Pavel Luzanov a écrit :
> > I can't see why an incrementalsort could be more expensive than a full
> > sort, using the same presorted path.
> 
> The only reason I can see is the number of buffers to read. In the plan
> with incremental sort we read the whole index, ~19 buffers.
> And the plan with seq scan only required ~5000 (I think due to buffer
> ring optimization).

What I meant here is that even with seqscans disabled, the planner still 
chooses a full sort over a partial sort. The underlying index is the same; it 
is just a matter of choosing a Sort node over an IncrementalSort node. This, I 
think, is wrong: I can't see how it could be worse to use an incremental sort 
in that case. 

It makes sense to prefer a SeqScan over an IndexScan if you are going to sort 
the whole table anyway. But in that case we shouldn't. What happened before is 
that some sort of incremental sort was always chosen, because it was hidden as 
an implementation detail of the agg node. But now it has to compete on a cost 
basis with the full sort, and that costing is wrong in that case. 

Maybe the original costing code for incremental sort was a bit too 
pessimistic. 

--
Ronan Dunklau






Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-11-07 Thread Robert Haas
On Mon, Oct 31, 2022 at 11:49 PM Amit Kapila  wrote:
> I am fine with any of those. Would you like to commit or do you prefer
> me to take care of this?

Sorry for not responding to this sooner. I think it's too late to do
anything about this for the current round of releases at this point,
but I am fine if you want to take care of it after that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: allow segment size to be set to < 1GiB

2022-11-07 Thread Tom Lane
Andres Freund  writes:
> In the attached patch I renamed --with-segsize= to --with-segsize-mb= /
> -Dsegsize= to -Dsegsize_mb=, to avoid somebody building with --with-segsize=2
> or such suddenly ending up with an incompatible build.

For the purpose of exercising these code paths with the standard
regression tests, even a megabyte seems large -- we don't create
very many test tables that are that big.  How about instead
allowing the segment size to be set in pages?

regards, tom lane




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-11-07 Thread Maciek Sakrejda
On Thu, Nov 3, 2022 at 10:00 AM Melanie Plageman
 wrote:
>
> I decided not to call it pg_statio because all of the other stats views
> have an underscore after stat and I thought it was an opportunity to be
> consistent with them.

Oh, got it. Makes sense.

> > I'm reviewing the rendered docs now, and I noticed sentences like this
> > are a bit hard to scan: they force the reader to parse a big list of
> > backend types before even getting to the meat of what this is talking
> > about. Should we maybe reword this so that the backend list comes at
> > the end of the sentence? Or maybe even use a list (e.g., like in the
> > "state" column description in pg_stat_activity)?
>
> Good idea with the bullet points.
> For the lengthy lists, I've added bullet point lists to the docs for
> several of the columns. It is quite long now but, hopefully, clearer?
> Let me know if you think it improves the readability.

Hmm, I should have tried this before suggesting it. I think the lists
break up the flow of the column description too much. What do you
think about the attached (on top of your patches--attaching it as a
.diff to hopefully not confuse cfbot)? I kept the lists for backend
types but inlined the others as a middle ground. I also added a few
omitted periods and reworded "read plus extended" to avoid starting
the sentence with a (lowercase) varname (I think in general it's fine
to do that, but the more complicated sentence structure here makes it
easier to follow if the sentence starts with a capital).

Alternately, what do you think about pulling equivalencies to existing
views out of the main column descriptions, and adding them after the
main table as a sort of footnote? Most view docs don't have anything
like that, but pg_stat_replication does and it might be a good pattern
to follow.

Thoughts?

> > Also, is this (in the middle of the table) the right place for this
> > column? I would have expected to see it before or after all the actual
> > I/O op cells.
>
> I put it after read, write, and extend columns because it applies to
> them. It doesn't apply to files_synced. For reused and evicted, I didn't
> think bytes reused and evicted made sense. Also, when we add non-block
> oriented IO, reused and evicted won't be used but op_bytes will be. So I
> thought it made more sense to place it after the operations it applies
> to.

Got it, makes sense.

> > + io_contexts. When a Buffer Access
> > + Strategy reuses a buffer in the strategy ring, it must evict its
> > + contents, incrementing reused. When a Buffer
> > + Access Strategy adds a new shared buffer to the strategy ring
> > + and this shared buffer is occupied, the Buffer Access
> > + Strategy must evict the contents of the shared buffer,
> > + incrementing evicted.
> >
> > I think the parallel phrasing here makes this a little hard to follow.
> > Specifically, I think "must evict its contents" for the strategy case
> > sounds like a bad thing, but in fact this is a totally normal thing
> > that happens as part of strategy access, no? The idea is you probably
> > won't need that buffer again, so it's fine to evict it. I'm not sure
> > how to reword, but I think the current phrasing is misleading.
>
> I had trouble rephrasing this. I changed a few words. I see what you
> mean. It is worth noting that reusing strategy buffers when there are
> buffers on the freelist may not be the best behavior, so I wouldn't
> necessarily consider "reused" a good thing. However, I'm not sure how
> much the user could really do about this. I would at least like this
> phrasing to be clear (evicted is for shared buffers, reused is for
> strategy buffers), so, perhaps this section requires more work.

Oh, I see. I think the updated wording works better. Although I think
we can drop the quotes around "Buffer Access Strategy" here. They're
useful when defining the term originally, but after that I think it's
clearer to use the term unquoted.

Just to understand this better myself, though: can you clarify when
"reused" is not a normal, expected part of the strategy execution? I
was under the impression that a ring buffer is used because each page
is needed only "once" (i.e., for one set of operations) for the
command using the strategy ring buffer. Naively, in that situation, it
seems better to reuse a no-longer-needed buffer than to claim another
buffer from the freelist (where other commands may eventually make
better use of it).

> > + again. A high number of repossessions is a sign of contention for the
> > + blocks operated on by the strategy operation.
> >
> > This (and in general the repossession description) makes sense, but
> > I'm not sure what to do with the information. Maybe Andres is right
> > that we could skip this in the first version?
>
> I've removed repossessed and rejected in attached v37. I am a bit sad
> about this because I don't see a good way forward and I think those
> could be useful for users.

I can see that, but I think as long as we're not doing 

Re: psql: Add command to use extended query protocol

2022-11-07 Thread Corey Huinker
>
>
>
> what about introduction new syntax for psql variables that should be
> passed as bind variables.
>

I thought about basically reserving the \$[0-9]+ space as bind variables,
but it is possible, though unlikely, that users have been naming their
variables like that.

It's unclear from your example if that's what you meant, or if you wanted
actual named variables ($name, $timestamp_before, $x).

Actual named variables might cause problems with CREATE FUNCTION AS ...
$body$ ... $body$; as well as the need to deduplicate them.

So while it is less seamless, I do like the \bind x y z \g idea because it
requires no changes in variable interpolation, and the list can be
terminated with a slash command or ;

To your point about forcing extended query protocol even when no parameters
are present, that would be SELECT 1 \bind \g

It hasn't been discussed, but the question of how to handle output
parameters seems fairly straightforward: the value of the bind variable is
the name of the psql variable to be set a la \gset.


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-11-07 Thread Pavel Luzanov

On 07.11.2022 20:30, Ronan Dunklau wrote:

What I meant here is that disabling seqscans, the planner still chooses a full
sort over a partial sort. The underlying index is the same, it is just a
matter of choosing a Sort node over an IncrementalSort node. This, I think, is
wrong: I can't see how it could be worse to use an incrementalsort in that
case.


I finally get your point. And I agree with you.


Maybe the original costing code for incremental sort was a bit too
pessimistic.


In this query, incremental sorting lost just a little bit in cost: 
164468.95 vs 162504.23.


                                QUERY PLAN
--------------------------------------------------------------------------
 GroupAggregate  (cost=155002.98..162504.23 rows=100 width=34) (actual 
time=296.591..568.270 rows=100 loops=1)

   Group Key: a
   ->  Sort  (cost=155002.98..157502.98 rows=100 width=4) (actual 
time=293.810..454.170 rows=100 loops=1)

 Sort Key: a, c
 Sort Method: external merge  Disk: 15560kB
 ->  Index Scan using t_a_b_idx on t (cost=0.42..41670.64 
rows=100 width=4) (actual time=0.021..156.441 rows=100 loops=1)

 Settings: enable_seqscan = 'off'
 Planning Time: 0.074 ms
 Execution Time: 569.957 ms
(9 rows)

set enable_sort=off;
SET
                                QUERY PLAN
--------------------------------------------------------------------------
 GroupAggregate  (cost=1457.58..164468.95 rows=100 width=34) (actual 
time=6.623..408.833 rows=100 loops=1)

   Group Key: a
   ->  Incremental Sort  (cost=1457.58..159467.70 rows=100 width=4) 
(actual time=2.652..298.530 rows=100 loops=1)

 Sort Key: a, c
 Presorted Key: a
 Full-sort Groups: 100  Sort Method: quicksort  Average Memory: 
27kB  Peak Memory: 27kB
 Pre-sorted Groups: 100  Sort Method: quicksort  Average 
Memory: 697kB  Peak Memory: 697kB
 ->  Index Scan using t_a_b_idx on t (cost=0.42..41670.64 
rows=100 width=4) (actual time=0.011..155.260 rows=100 loops=1)

 Settings: enable_seqscan = 'off', enable_sort = 'off'
 Planning Time: 0.044 ms
 Execution Time: 408.867 ms

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company





Re: [PATCH] pgbench: add multiconnect option

2022-11-07 Thread Fabien COELHO


Hello Ian,


cfbot reports the patch no longer applies.  As CommitFest 2022-11 is
currently underway, this would be an excellent time to update the patch.


Attached a v5 which is just a rebase.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 40e6a50a7f..a3ae7cc9be 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -29,7 +29,7 @@ PostgreSQL documentation
   
pgbench
option
-   dbname
+   dbname or conninfo
   
  
 
@@ -169,6 +169,9 @@ pgbench  options  d
 not specified, the environment variable
 PGDATABASE is used. If that is not set, the
 user name specified for the connection is used.
+Alternatively, the dbname can be
+a standard connection information string.
+Several connections can be provided.

   
  
@@ -918,6 +921,21 @@ pgbench  options  d
 
 
 
+ 
+  --connection-policy=policy
+  
+   
+Set the connection policy when multiple connections are available.
+Default is round-robin provided (ro).
+Possible values are:
+first (f),
+random (ra),
+round-robin (ro),
+working (w).
+   
+  
+ 
+
  
   -h hostname
   --host=hostname
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b208d74767..02f8278b34 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -301,13 +301,39 @@ uint32		max_tries = 1;
 bool		failures_detailed = false;	/* whether to group failures in
 		 * reports or logs by basic types */
 
+char	   *logfile_prefix = NULL;
+
+/* main connection definition */
 const char *pghost = NULL;
 const char *pgport = NULL;
 const char *username = NULL;
-const char *dbName = NULL;
-char	   *logfile_prefix = NULL;
 const char *progname;
 
+/* multi connections */
+typedef enum mc_policy_t
+{
+	MC_UNKNOWN = 0,
+	MC_FIRST,
+	MC_RANDOM,
+	MC_ROUND_ROBIN,
+	MC_WORKING
+} mc_policy_t;
+
+/* connection info list */
+typedef struct connection_t
+{
+	const char *connection;		/* conninfo or dbname */
+	int			errors;			/* number of connection errors */
+} connection_t;
+
+static int			n_connections = 0;
+static connection_t	   *connections = NULL;
+static mc_policy_t	mc_policy = MC_ROUND_ROBIN;
+
+/* last used connection */
+// FIXME per thread?
+static int current_connection = 0;
+
 #define WSEP '@'/* weight separator */
 
 volatile bool timer_exceeded = false;	/* flag from signal handler */
@@ -871,7 +897,7 @@ usage(void)
 {
 	printf("%s is a benchmarking tool for PostgreSQL.\n\n"
 		   "Usage:\n"
-		   "  %s [OPTION]... [DBNAME]\n"
+		   "  %s [OPTION]... [DBNAME or CONNINFO ...]\n"
 		   "\nInitialization options:\n"
 		   "  -i, --initialize invokes initialization mode\n"
 		   "  -I, --init-steps=[" ALL_INIT_STEPS "]+ (default \"" DEFAULT_INIT_STEPS "\")\n"
@@ -929,6 +955,7 @@ usage(void)
 		   "  -h, --host=HOSTNAME  database server host or socket directory\n"
 		   "  -p, --port=PORT  database server port number\n"
 		   "  -U, --username=USERNAME  connect as specified database user\n"
+		   "  --connection-policy=Sset multiple connection policy (\"first\", \"rand\", \"round-robin\", \"working\")\n"
 		   "  -V, --versionoutput version information, then exit\n"
 		   "  -?, --help   show this help, then exit\n"
 		   "\n"
@@ -1535,13 +1562,89 @@ tryExecuteStatement(PGconn *con, const char *sql)
 	PQclear(res);
 }
 
+/* store a new connection information string */
+static void
+push_connection(const char *c)
+{
+	connections = pg_realloc(connections, sizeof(connection_t) * (n_connections+1));
+	connections[n_connections].connection = pg_strdup(c);
+	connections[n_connections].errors = 0;
+	n_connections++;
+}
+
+/* switch connection */
+static int
+next_connection(int *pci)
+{
+	int ci;
+
+	ci = ((*pci) + 1) % n_connections;
+	*pci = ci;
+
+	return ci;
+}
+
+/* return the connection index to use for next attempt */
+static int
+choose_connection(int *pci)
+{
+	int ci;
+
+	switch (mc_policy)
+	{
+		case MC_FIRST:
+			ci = 0;
+			break;
+		case MC_RANDOM:
+			// FIXME should use a prng state ; not thread safe ;
+			ci = (int) getrand(&base_random_sequence, 0, n_connections-1);
+			*pci = ci;
+			break;
+		case MC_ROUND_ROBIN:
+			ci = next_connection(pci);
+			break;
+		case MC_WORKING:
+			ci = *pci;
+			break;
+		default:
+			pg_fatal("unexpected multi connection policy: %d", mc_policy);
+			exit(1);
+	}
+
+	return ci;
+}
+
+/* return multi-connection policy based on its name or shortest prefix */
+static mc_policy_t
+get_connection_policy(const char *s)
+{
+	if (s == NULL || *s == '\0' || strcmp(s, "first") == 0 || strcmp(s, "f") == 0)
+		return MC_FIRST;
+	else if (strcmp(s, "random") == 0 || strcmp(s, "ra") == 0)
+		return MC_RANDOM;
+	else if (strcmp(s, "round-robin") == 0 || strcmp(s, "ro") == 0)
+		return MC_ROUND_ROBIN;
+	else if (strcmp(s, "working") == 0 |

Re: psql: Add command to use extended query protocol

2022-11-07 Thread Tom Lane
Corey Huinker  writes:
> I thought about basically reserving the \$[0-9]+ space as bind variables,
> but it is possible, though unlikely, that users have been naming their
> variables like that.

Don't we already reserve that syntax as Params?  Not sure whether there
would be any conflicts versus Params, but these are definitely not legal
as SQL identifiers.

regards, tom lane




Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-07 Thread Simon Riggs
On Tue, 1 Nov 2022 at 08:55, Julien Tachoires  wrote:
>
> > > 4. Test results with transaction 1
> > >
> > > TPS vs number of sub-transaction
> > >
> > > nsubx  HEAD  patched
> > > 
> > >   0   441109  439474
> > >   8   439045  438103
> > >  16   439123  436993
> > >  24   436269  434194
> > >  32   439707  437429
> > >  40   439997  437220
> > >  48   439388  437422
> > >  56   439409  437210
> > >  64   439748  437366
> > >  72    92869  434448
> > >  80    66577  434100
> > >  88    61243  434255
> > >  96    57016  434419
> > > 104    52132  434917
> > > 112    49181  433755
> > > 120    46581  434044
> > > 128    44067  434268
> > >
> > > Perf profiling on HEAD with 80 sub-transactions:
> > > Overhead  Symbol
> > >   51.26%  [.] LWLockAttemptLock
> > >   24.59%  [.] LWLockRelease
> > >0.36%  [.] base_yyparse
> > >0.35%  [.] PinBuffer
> > >0.34%  [.] AllocSetAlloc
> > >0.33%  [.] hash_search_with_hash_value
> > >0.22%  [.] LWLockAcquire
> > >0.20%  [.] UnpinBuffer
> > >0.15%  [.] SimpleLruReadPage_ReadOnly
> > >0.15%  [.] _bt_compare
> > >
> > > Perf profiling on patched with 80 sub-transactions:
> > > Overhead  Symbol
> > >   2.64%  [.] AllocSetAlloc
> > >   2.09%  [.] base_yyparse
> > >   1.76%  [.] hash_search_with_hash_value
> > >   1.62%  [.] LWLockAttemptLock
> > >   1.26%  [.] MemoryContextAllocZeroAligned
> > >   0.93%  [.] _bt_compare
> > >   0.92%  [.] expression_tree_walker_impl.part.4
> > >   0.84%  [.] SearchCatCache1
> > >   0.79%  [.] palloc
> > >   0.64%  [.] core_yylex
> > >
> > > 5. Test results with transaction 2
> > >
> > > nsubx  HEAD  patched
> > > 
> > >   0  440145  443816
> > >   8  438867  443081
> > >  16  438634  441786
> > >  24  436406  440187
> > >  32  439203  442447
> > >  40  439819  443574
> > >  48  439314  442941
> > >  56  439801  443736
> > >  64  439074  441970
> > >  72  439833  444132
> > >  80  148737  439941
> > >  88  413714  443343
> > >  96  251098  442021
> > > 104   70190  443488
> > > 112  405507  438866
> > > 120  177827  443202
> > > 128  399431  441842
> > >
> > > From the performance point of view, this patch clearly fixes the
> > > dramatic TPS collapse shown in these tests.
> >
> > I think these are really promising results.  Although the perf result
> > shows that the bottleneck on the SLRU is no more there with the patch,
> > I think it would be nice to see the wait event as well.
>
> Please find attached samples returned by the following query when
> testing transaction 1 with 80 subxacts:
> SELECT wait_event_type, wait_event, locktype, mode, database,
> relation, COUNT(*) from pg_stat_activity AS psa JOIN pg_locks AS pl ON
> (psa.pid = pl.pid) GROUP BY 1, 2, 3, 4, 5, 6 ORDER BY 7 DESC;

These results are compelling, thank you.

Setting this to Ready for Committer.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: archive modules

2022-11-07 Thread Nathan Bossart
On Mon, Nov 07, 2022 at 03:20:31PM +0900, Michael Paquier wrote:
> On Sat, Nov 05, 2022 at 02:08:58PM -0700, Nathan Bossart wrote:
>> Perhaps we could eventually move the archive_command functionality to a
>> contrib module (i.e., "shell_archive") so that users must always set
>> archive_library.  But until then, I suspect it's better to treat modules
>> and commands as two separate interfaces to ease migration from older major
>> versions (even though archive_command is now essentially a built-in archive
>> module).
> 
> I agree that this is a fine long-term goal, removing all traces of the
> archive_command from the backend core code.  This is actually an
> argument in favor of having no traces of XLogArchiveCommand in
> pgarch.c, no? ;p

Indeed.

> I am not sure how long we should wait before being able to do that,
> perhaps a couple of years at least?  I'd like to think the sooner the
> better (like v17?) but we are usually conservative, and the removal of
> the exclusive backup mode took 5~6 years if I recall correctly..

Yeah, I imagine we'd need to mark it as deprecated-and-to-be-removed for
several years first.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add tracking of backend memory allocated to pg_stat_activity

2022-11-07 Thread Melanie Plageman
On Fri, Nov 04, 2022 at 08:56:13AM -0400, Reid Thompson wrote:
> From a8de5d29c0c6f10962181926a49ad4fec1e52bd1 Mon Sep 17 00:00:00 2001
> From: Reid Thompson 
> Date: Thu, 11 Aug 2022 12:01:25 -0400
> Subject: [PATCH] Add tracking of backend memory allocated to pg_stat_activity
> 
> This new field displays the current bytes of memory allocated to the
> backend process. It is updated as memory for the process is
> malloc'd/free'd. Memory allocated to items on the freelist is included in
> the displayed value.  Dynamic shared memory allocations are included
> only in the value displayed for the backend that created them, they are
> not included in the value for backends that are attached to them to
> avoid double counting. On occasion, orphaned memory segments may be
> cleaned up on postmaster startup. This may result in decreasing the sum
> without a prior increment. We limit the floor of backend_mem_allocated
> to zero. Updated pg_stat_activity documentation for the new column.
> ---
>  doc/src/sgml/monitoring.sgml|  12 +++
>  src/backend/catalog/system_views.sql|   1 +
>  src/backend/storage/ipc/dsm_impl.c  |  81 +++
>  src/backend/utils/activity/backend_status.c | 105 
>  src/backend/utils/adt/pgstatfuncs.c |   4 +-
>  src/backend/utils/mmgr/aset.c   |  18 
>  src/backend/utils/mmgr/generation.c |  15 +++
>  src/backend/utils/mmgr/slab.c   |  21 
>  src/include/catalog/pg_proc.dat |   6 +-
>  src/include/utils/backend_status.h  |   7 +-
>  src/test/regress/expected/rules.out |   9 +-
>  11 files changed, 270 insertions(+), 9 deletions(-)
> 
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index e5d622d514..972805b85a 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -947,6 +947,18 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
> 11:34   0:00 postgres: ser
>   
>  
>  
> +
> + 
> +  backend_allocated_bytes bigint
> + 
> + 
> +  The byte count of memory allocated to this backend. Dynamic shared 
> memory
> +  allocations are included only in the value displayed for the backend 
> that
> +  created them, they are not included in the value for backends that are
> +  attached to them to avoid double counting.
> + 
> +
> +

It doesn't seem like you need the backend_ prefix in the view since that
is implied by it being in pg_stat_activity.

For the wording on the description, I find "memory allocated to this
backend" a bit confusing. Perhaps you could reword it to make clear you
mean that the number represents the balance of allocations by this
backend. Something like:

Memory currently allocated to this backend in bytes. This is the
balance of bytes allocated and freed by this backend.

I would also link to the system administration function pg_size_pretty()
so users know how to easily convert the value.

If you end up removing shared memory as Andres suggests in [1], I would link
pg_shmem_allocations view here and point out that shared memory allocations can
be viewed there instead (and why).

You could instead add dynamic shared memory allocation to the
pg_shmem_allocations view as suggested as follow-on work by the commit which
introduced it, ed10f32e3.

> +/* 
> + * pgstat_report_backend_allocated_bytes_increase() -
> + *
> + * Called to report increase in memory allocated for this backend
> + * 
> + */

It seems like you could combine the
pgstat_report_backend_allocated_bytes_decrease/increase() by either using a
signed integer to represent the allocation/deallocation or passing in a
"direction" that is just a positive or negative multiplier enum.

Especially if you don't use the write barriers, I think you could
simplify the logic in the two functions.

If you do combine them, you might shorten the name to
pgstat_report_backend_allocation() or pgstat_report_allocation().
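
For example (a sketch only; the name and the signed delta follow my suggestion 
above, and the clamp-to-zero handling the commit message describes for 
decreases is omitted):

void
pgstat_report_backend_allocation(int64 delta)
{
	volatile PgBackendStatus *beentry = MyBEEntry;

	if (!beentry || !pgstat_track_activities)
	{
		/* account for memory before pgstats is initialized */
		backend_allocated_bytes += delta;
		return;
	}

	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
	beentry->backend_allocated_bytes += delta;
	PGSTAT_END_WRITE_ACTIVITY(beentry);
}

The original increase variant from the patch is quoted below for comparison.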

> +void
> +pgstat_report_backend_allocated_bytes_increase(uint64 allocation)
> +{
> + volatile PgBackendStatus *beentry = MyBEEntry;
> +
> + if (!beentry || !pgstat_track_activities)
> + {
> + /*
> +  * Account for memory before pgstats is initialized. This will 
> be
> +  * migrated to pgstats on initialization.
> +  */
> + backend_allocated_bytes += allocation;
> +
> + return;
> + }
> +
> + /*
> +  * Update my status entry, following the protocol of bumping
> +  * st_changecount before and after.  We use a volatile pointer here to
> +  * ensure the compiler doesn't try to get cute.
> +  */
> + PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
> + beentry->backend_allocated_bytes += allocation;
> + PGSTAT_END_WRITE_ACTIVITY(beentry);
> +}
> +
> +/* 
> + * pgstat_report_backend_allocated_bytes_decrease() -
> + *
> + * Called to r

Re: thinko in basic_archive.c

2022-11-07 Thread Nathan Bossart
On Mon, Nov 07, 2022 at 04:53:35PM +0530, Bharath Rupireddy wrote:
> The tempfile name can vary in size for the simple reason that a pid
> can be of varying digits - for instance, tempfile name is 'foo1234'
> (pid being 1234) and it becomes '\'\0\'oo1234' if we just reset the
> first char to '\0' and say pid wraparound occurred, now it becomes
> 'bar5674' (pid being 567).

The call to snprintf() should take care of adding a terminating null byte
in the right place.

> BTW, I couldn't find the 80 existing instances, can you please let me
> know your search keyword?

grep "\[0\] = '\\\0'" src -r

> So, IIUC, your point here is what if the copy_file fails to create the
> temp file when it already exists. With the name collision being a rare
> scenario, given the pid and timestamp variables, I'm not sure if
> copy_file can ever fail because the temp file already exists (with
> errno EEXIST). However, if we want to be extra-cautious, checking if
> temp file exists with file_exists() before calling copy_file() might
> help avoid such cases. If we don't want to have extra system call (via
> file_exists()) to check the temp file existence, we can think of
> sending a flag to copy_file(src, dst, &is_dst_file_created) and use
> is_dst_file_created in the shutdown callback. Thoughts?

Presently, if copy_file() encounters a pre-existing file, it should ERROR,
which will be caught in the sigsetjmp() block in basic_archive_file().  The
shutdown callback shouldn't run in this scenario.

I think this cleanup logic should run in both the shutdown callback and the
sigsetjmp() block, but it should only take action (i.e., deleting the
leftover temporary file) if the ERROR or shutdown occurs after creating the
file in copy_file() and before renaming the temporary file to its final
destination.
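
Roughly something like this, just as a sketch (the flag and path variables are 
hypothetical, not names that exist in basic_archive.c today):

static void
cleanup_archive_temp_file(void)
{
	if (temp_file_created && !temp_file_renamed)
	{
		if (unlink(temp_file_path) != 0 && errno != ENOENT)
			ereport(WARNING,
					(errcode_for_file_access(),
					 errmsg("could not remove file \"%s\": %m",
							temp_file_path)));
		temp_file_created = false;
	}
}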

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: PL/pgSQL cursors should get generated portal names by default

2022-11-07 Thread Jan Wieck
My comments were in no way meant as an argument for or against the 
change itself. Only to clearly document the side effect it will have.



Regards, Jan


On 11/7/22 11:57, Kirk Wolak wrote:



On Mon, Nov 7, 2022 at 11:10 AM Jan Wieck  wrote:


On 11/4/22 19:46, Tom Lane wrote:
 > Jan Wieck  writes:
 >> I need to do some testing on this. I seem to recall that the
naming was
 >> originally done because a reference cursor is basically a named
cursor
 >> that can be handed around between functions and even the top SQL
level
 >> of the application. For the latter to work the application needs
to know
 >> the name of the portal.
 >
 > Right.  With this patch, it'd be necessary to hand back the actual
 > portal name (by returning the refcursor value), or else manually
 > set the refcursor value before OPEN to preserve the previous
behavior.
 > But as far as I saw, all our documentation examples show handing back
 > the portal name, so I'm hoping most people do it like that already.

I was mostly concerned that we may unintentionally break
underdocumented
behavior that was originally implemented on purpose. As long as
everyone
is aware that this is breaking backwards compatibility in the way it
does, that's fine.


I respect the concern, and applied some deeper thinking to it...

Here is the logic I am applying to this compatibility issue and what may 
break.

[FWIW, my motto is to be wrong out  loud, as you learn faster]

At first pass, I thought "Well, since this does not break a refcursor, 
which is the obvious use case for RETURNING/PASSING, we are fine!"


But in trying to DEFEND this case, I have come up with an example of code 
(that makes some SENSE, but would break):


CREATE FUNCTION test() RETURNS refcursor LANGUAGE plpgsql AS $$
DECLARE
     cur_this cursor FOR SELECT 1;
     ref_cur refcursor;
BEGIN
     OPEN cur_this;
     ref_cur := 'cur_this';  -- Using the NAME of the cursor as the 
portal name: Should do:  ref_cur := cur_this; -- Only works after OPEN

     RETURN ref_cur;
END;
$$;

As noted in the comments.  If the code were:
   ref_cur := 'cur_this';  -- Now you can't just use ref_cur := cur_this;
   OPEN cur_this;
   RETURN ref_cur;
Then it would break now...  And even the CORRECT syntax would break, 
since the cursor was not opened, so "cur_this" is null.


Now, I have NO IDEA if someone would actually do this.  It is almost 
pathological.  The use case would be a complex cursor with parameters,

and they changed the code to return a refcursor!
This was the ONLY use case I could think of that wasn't HACKY!

HACKY use cases involve a child routine setting:  local_ref_cursor := 
'cur_this'; in order to access a cursor that was NOT passed to the child.
FWIW, I tested this, and it works, and I can FETCH in the child routine, 
and it affects the parent's LOOP as it should... WOW.  I would be HAPPY

to break such horrible code, it has to be a security concern at some level.

Personally (and my 2 cents really shouldn't matter much), I think this 
should still be fixed.
Because I believe this small use case is rare, it will break 
immediately, and the fix is trivial (just initialize cur_this := 
'cur_this'  in this example),
and the fix removes the Orthogonal Behavior Tom pointed out, which led 
me to reporting this.


I think I have exhausted examples of how this impacts a VALID 
refcursor implementation.  I believe any other such versions are 
variations of this!
And maybe we document that if a refcursor of a cursor is to be returned, 
that the refcursor is ASSIGNED after the OPEN of the cursor, and it is 
done without the quotes, as:

   ref_cursor := cur_this;  -- assign the name after opening.

Thanks!








Re: New docs chapter on Transaction Management and related changes

2022-11-07 Thread Laurenz Albe
On Sat, 2022-11-05 at 10:08 +, Simon Riggs wrote:
> Agreed; new compilation patch attached, including mine and then
> Robert's suggested rewordings.

Thanks.  There is clearly a lot of useful information in this.

Some comments:

> --- a/doc/src/sgml/func.sgml
> +++ b/doc/src/sgml/func.sgml
> @@ -24673,7 +24673,10 @@ SELECT collation for ('foo' COLLATE "de_DE");
> 
>  Returns the current transaction's ID.  It will assign a new one if 
> the
>  current transaction does not have one already (because it has not
> -performed any database updates).
> +performed any database updates);  see <xref
> +linkend="transaction-id"/> for details.  If executed in a
> +subtransaction this will return the top-level xid;  see <xref
> +linkend="subxacts"/> for details.
> 
>

I would use a comma after "subtransaction", and I think it would be better to 
write
"transaction ID" instead of "xid".

> @@ -24690,6 +24693,7 @@ SELECT collation for ('foo' COLLATE "de_DE");
>  ID is assigned yet.  (It's best to use this variant if the 
> transaction
>  might otherwise be read-only, to avoid unnecessary consumption of an
>  XID.)
> +If executed in a subtransaction this will return the top-level xid.
> 
>

Same as above.

> @@ -24733,6 +24737,8 @@ SELECT collation for ('foo' COLLATE "de_DE");
> 
>  Returns a current snapshot, a data structure
>  showing which transaction IDs are now in-progress.
> +Only top-level xids are included in the snapshot; subxids are not
> +shown;  see  for details.
> 
>

Again, I would avoid "xid" and "subxid", or at least use "transaction ID (xid)"
and similar.

> --- a/doc/src/sgml/ref/release_savepoint.sgml
> +++ b/doc/src/sgml/ref/release_savepoint.sgml
> @@ -34,23 +34,16 @@ RELEASE [ SAVEPOINT ] 
> savepoint_name
>Description
>  
>
> -   RELEASE SAVEPOINT destroys a savepoint previously 
> defined
> -   in the current transaction.
> +   RELEASE SAVEPOINT will subcommit the subtransaction
> +   established by the named savepoint, if one exists. This will release
> +   any resources held by the subtransaction. If there were any
> +   subtransactions of the named savepoint, these will also be subcommitted.
>
> 
>

"Subtransactions of the named savepoint" is somewhat confusing; how about
"subtransactions of the subtransaction established by the named savepoint"?

If that is too long and explicit, perhaps "subtransactions of that 
subtransaction".

> @@ -78,7 +71,7 @@ RELEASE [ SAVEPOINT ] 
> savepoint_name
> 
>
> It is not possible to release a savepoint when the transaction is in
> -   an aborted state.
> +   an aborted state, to do that use .
>
> 
>

I think the following is more English:
"It is not possible ... state; to do that, use ."

> --- a/doc/src/sgml/ref/rollback.sgml
> +++ b/doc/src/sgml/ref/rollback.sgml
> @@ -56,11 +56,14 @@ ROLLBACK [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
>  AND CHAIN
>  
>   
> -  If AND CHAIN is specified, a new transaction is
> +  If AND CHAIN is specified, a new unaborted 
> transaction is
>immediately started with the same transaction characteristics (see 
> linkend="sql-set-transaction"/>) as the just finished one.  Otherwise,
>no new transaction is started.

I don't think that is an improvement.  "Unaborted" is an un-word.  A new 
transaction
is always "unaborted", isn't it?

> --- a/doc/src/sgml/wal.sgml
> +++ b/doc/src/sgml/wal.sgml
> @@ -909,4 +910,36 @@
> seem to be a problem in practice.
>
>   
> +
> + 
> +
> +  Two-Phase Transactions
> +
> +  
> +   PostgreSQL supports a two-phase commit (2PC)
[...]
> +   pg_twophase directory. Currently-prepared
> +   transactions can be inspected using <link
> linkend="view-pg-prepared-xacts">pg_prepared_xacts</link>.
> +  
> + 
> +
>  

I don't like "currently-prepared".  How about:
"Transaction that are currently prepared can be inspected..."

This is clearly interesting information, but I don't think the WAL chapter is
the right place for it.  "pg_twophase" is already mentioned in "storage.sgml",
and details about when exactly a prepared transaction is persisted may exceed
the level of detail needed by the end user.

I'd look for that information in the reference page for PREPARE TRANSACTION; 
perhaps
that would be a better place.  Or, even better, the new "xact.sgml" chapter.
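Wherever it ends up, a short example would probably help readers; something
like this sketch, which assumes max_prepared_transactions is set to a non-zero
value (the GID is arbitrary):

    BEGIN;
    -- ... some work ...
    PREPARE TRANSACTION 'demo_gid';
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
    COMMIT PREPARED 'demo_gid';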

> --- /dev/null
> +++ b/doc/src/sgml/xact.sgml

+  Transaction Management

+   The word transaction is often abbreviated as "xact".

Should use  here.

> +   Transactions and Identifiers

> +   
> +Once a transaction writes to the database, it is assigned a
> +non-virtual TransactionId (or xid),
> +e.g., 278394. Xids are assigned sequentially
> +using a global counter used by all databases within the
> +PostgreSQL cluster. This property is used by
> +the transaction system to 

Re: Improve logging when using Huge Pages

2022-11-07 Thread Thomas Munro
On Tue, Nov 8, 2022 at 4:56 AM Tom Lane  wrote:
> Andres Freund  writes:
> > On 2022-11-06 14:04:29 +0700, John Naylor wrote:
> >> I think the best thing to do is change huge_pages='on' to log a WARNING and
> >> fallback to regular pages if the mapping fails.
>
> > I am strongly opposed to making 'on' fall back to not using huge pages. 
> > That's
> > what 'try' is for.
>
> Agreed --- changing "on" to be exactly like "try" isn't an improvement.
> If you want "try" semantics, choose "try".

Agreed, but how can we make the people who want a log message happy,
without upsetting the people who don't want new log messages?  Hence
my suggestion of a new level.  How about try_verbose?




Re: Large Pages and Super Pages for PostgreSQL

2022-11-07 Thread Thomas Munro
On Sun, Jan 16, 2022 at 8:32 PM Thomas Munro  wrote:
> On Sun, Jan 16, 2022 at 6:03 PM DEVOPS_WwIT  wrote:
> > Solaris and FreeBSD supports large/super pages, and can be used
> > automatically by applications.
> >
> > It seems Postgres can't use large/super pages on the Solaris and FreeBSD
> > OSes (I think it also can't use them on HP-UX and AIX); could anyone
> > take a look?
>
> 3.  FreeBSD: FreeBSD does transparently migrate PostgreSQL memory to
> "super" pages quite well in my experience, but there is also a new
> facility in FreeBSD 13 to ask for specific page sizes explicitly.  I
> wrote a quick and dirty patch to enable PostgreSQL's huge_pages and
> huge_page_size settings to work with that interface, but I haven't yet
> got as far as testing it very hard or proposing it...  but here it is,
> if you like experimental code[2].

I was reminded to rebase that and tidy it up a bit by recent
discussion of page table magic in other threads.  Documentation of
these interfaces is sparse, to put it mildly (I may try to improve that
myself), but basically the terminology is "super" for pages subject to
promotion/demotion, and "large" when explicitly managed.  Not
proposing for commit right now as I need to learn more about all this
and there are some policy decisions lurking in here (eg synchronous
defrag vs nowait depending on flags), but the patch may be useful for
experimentation.  For example, it allows huge_page_size=1GB if your
system can handle that.
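With the patch applied, exercising it should just be a matter of the existing
GUCs plus a restart; a sketch, assuming the OS actually has 1GB pages
available:

    ALTER SYSTEM SET huge_pages = 'on';
    ALTER SYSTEM SET huge_page_size = '1GB';
    -- then restart the server for these to take effect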
From b76dc0e5a472824aaa87de4f6c1f6db26c810c7f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 3 Nov 2022 10:06:24 +1300
Subject: [PATCH] Support huge_pages and huge_page_size on FreeBSD.

FreeBSD often uses huge (super) pages due to automatic promotion, but it
may be useful to be able to request that explicitly, and to be able to
experiment with different page sizes by explicit request.

Discussion: https://postgr.es/m/3043b674-46d6-a8e9-3811-5a3007c8dceb%40ww-it.cn
---
 doc/src/sgml/config.sgml  |  6 ++--
 src/backend/port/sysv_shmem.c | 58 ++-
 2 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 559eb898a9..cd65916bb5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1678,9 +1678,9 @@ include_dir 'conf.d'

 

-At present, this setting is supported only on Linux and Windows. The
+At present, this setting is supported only on Linux, FreeBSD and Windows. The
 setting is ignored on other systems when set to
-try.  On Linux, it is only supported when
+try.  On Linux and FreeBSD, it is only supported when
 shared_memory_type is set to mmap
 (the default).

@@ -1742,7 +1742,7 @@ include_dir 'conf.d'
 about usage and support, see .


-Non-default settings are currently supported only on Linux.
+Non-default settings are currently supported only on Linux and FreeBSD.

   
  
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 97ce7b7c49..0bb8c20c8d 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -576,8 +576,9 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 bool
 check_huge_page_size(int *newval, void **extra, GucSource source)
 {
-#if !(defined(MAP_HUGE_MASK) && defined(MAP_HUGE_SHIFT))
-	/* Recent enough Linux only, for now.  See GetHugePageSize(). */
+#if !(defined(MAP_HUGE_MASK) && defined(MAP_HUGE_SHIFT)) && \
+	!defined(SHM_LARGEPAGE_ALLOC_DEFAULT)
+	/* Recent enough Linux and FreeBSD only, for now. */
 	if (*newval != 0)
 	{
 		GUC_check_errdetail("huge_page_size must be 0 on this platform.");
@@ -601,12 +602,9 @@ CreateAnonymousSegment(Size *size)
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
-#ifndef MAP_HUGETLB
-	/* PGSharedMemoryCreate should have dealt with this case */
-	Assert(huge_pages != HUGE_PAGES_ON);
-#else
 	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
 	{
+#ifdef MAP_HUGETLB
 		/*
 		 * Round up the request size to a suitable large value.
 		 */
@@ -624,8 +622,52 @@ CreateAnonymousSegment(Size *size)
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
  allocsize);
-	}
 #endif
+#ifdef SHM_LARGEPAGE_ALLOC_DEFAULT
+		int			nsizes;
+		size_t	   *page_sizes;
+		size_t		page_size;
+		int			page_size_index = -1;
+
+		/*
+		 * Find the matching page size index, or if huge_page_size wasn't set,
+		 * then skip the smallest size and take the next one after that.
+		 */
+		nsizes = getpagesizes(NULL, 0);
+		page_sizes = palloc(nsizes * sizeof(*page_sizes));
+		getpagesizes(page_sizes, nsizes);
+		for (int i = 0; i < nsizes; ++i)
+		{
+			if (huge_page_size * 1024 == page_sizes[i] ||
+(huge_page_size == 0 && i > 0))
+			{
+page_size = page_sizes[i];
+page_size_index = i;
+if (

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-07 Thread David Christensen
Hi Justin et al,

Enclosed is v5 of this patch which now passes the CirrusCI checks for
all supported OSes. I went ahead and reworked the test a bit so it's a
little more amenable to the OS-agnostic approach for testing.

Best,

David


v5-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: new option to allow pg_rewind to run without full_page_writes

2022-11-07 Thread Jérémie Grauer

Hello,

First, thank you for reviewing.

ZFS writes files in increments of its configured recordsize for the 
current filesystem dataset.


So with a recordsize configured to be a multiple of 8K, you can't get 
torn pages on writes; that's why full_page_writes can be safely 
deactivated on ZFS (the usual advice is to configure ZFS with a 
recordsize of 8K for postgres, but on some workloads, it can actually be 
beneficial to go to a higher multiple of 8K).


On 06/11/2022 03:38, Andres Freund wrote:

Hi,

On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:

Currently pg_rewind refuses to run if full_page_writes is off. This is to
prevent it to run into a torn page during operation.

This is usually a good call, but some file systems like ZFS are naturally
immune to torn page (maybe btrfs too, but I don't know for sure for this
one).


Note that this isn't about torn pages in case of crashes, but about reading
pages while they're being written to.
Like I wrote above, ZFS will prevent torn pages on writes, like 
full_page_writes does.


Right now, that definitely allows for torn reads, because of the way
pg_read_binary_file() is implemented.  We only ensure a 4k read size from the
view of our code, which obviously can lead to torn 8k page reads, no matter
what the filesystem guarantees.

Also, for reasons I don't understand we use C streaming IO or
pg_read_binary_file(), so you'd also need to ensure that the buffer size used
by the stream implementation can't cause the reads to happen in smaller
chunks.  Afaict we really shouldn't use file streams here, then we'd at least
have control over that aspect.


Does ZFS actually guarantee that there never can be short reads? As soon as
they are possible, full page writes are needed.

I may be missing something here: how does full_page_writes prevent
short _reads_ ?

Presumably, if we do something like read the first 4K of a file, then 
change the file, then read the next 4K, the second 4K may be a torn 
read. But I fail to see how full_page_writes prevents this since it only 
acts on writes.

This isn't an fundamental issue - we could have a version of
pg_read_binary_file() for relation data that prevents the page being written
out concurrently by locking the buffer page. In addition it could often avoid
needing to read the page from the OS / disk, if present in shared buffers
(perhaps minus cases where we haven't flushed the WAL yet, but we could also
flush the WAL in those).
I agree, but this would need a differen patch, which may be beyond my 

skills.

Greetings,

Andres Freund

Anyway, ZFS will act like full_page_writes is always active, so isn't 
the proposed modification to pg_rewind valid?


You'll find attached a second version of the patch, which is cleaner 
(removed double negation).


Regards,
Jérémie Grauer

From 309649dfa2eea9d696e43271057da7e643d771b4 Mon Sep 17 00:00:00 2001
From: Jeremie Grauer 
Date: Mon, 7 Nov 2022 23:50:43 +0100
Subject: [PATCH] adds the option --no-ensure-full-page-writes to pg_rewind

---
 doc/src/sgml/ref/pg_rewind.sgml  | 27 +++-
 src/bin/pg_rewind/libpq_source.c | 22 ++--
 src/bin/pg_rewind/pg_rewind.c| 43 +++-
 src/bin/pg_rewind/pg_rewind.h|  1 +
 4 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 69d6924b3a..64ee600c2b 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -99,7 +99,13 @@ PostgreSQL documentation
in postgresql.conf or data checksums enabled when
the cluster was initialized with initdb.  Neither of these
are currently on by default.  
-   must also be set to on, but is enabled by default.
+   should also be set to on (which is the default) when the
+   underlying file system can write data in an increment of less than the page size.
+   Some file systems (like ZFS when configured with a recordsize as a multiple of
+   the page size) don't need this parameter since they will never write less than a
+   complete page to disk. If you are running such a system, you can add the option
+   --no-enforce-full-page-writes so that pg_rewind will run
+   even when  is set to off.
   
 
   
@@ -283,6 +289,25 @@ PostgreSQL documentation
   
  
 
+ 
+  --no-ensure-full-page-writes
+  
+   
+pg_rewind by default requires that 
+is set to on. This ensures that pg_rewind do
+not run into torn pages while running. This option is necessary most of
+the time since very few filesystems enforce the flush of full page to disk.
+One such system is ZFS where the page is always flushed to disk in its
+entirety. Other storage system may also have the same warranty. This option
+is here to allow pg_rewind to run on such systems without enforcing that
+ is on. It's important to know what
+you are doing before setting this setting because in case your syst

Re: new option to allow pg_rewind to run without full_page_writes

2022-11-07 Thread Thomas Munro
On Tue, Nov 8, 2022 at 12:07 PM Jérémie Grauer
 wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> >> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> >> prevent it to run into a torn page during operation.
> >>
> >> This is usually a good call, but some file systems like ZFS are naturally
> >> immune to torn page (maybe btrfs too, but I don't know for sure for this
> >> one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.

> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Just to spell out the distinction Andres was making, and maybe try to
answer a couple of questions if I can, there are two completely
different phenomena here:

1.  Generally full_page_writes is for handling a lack of atomic writes
on power loss, but ZFS already does that itself by virtue of its COW
design and data-logging in certain cases.

2.  Here we are using full_page_writes to handle lack of atomicity
when there are concurrent reads and writes to the same file from
different threads.  Basically, by turning on full_page_writes we say
that we don't trust any block that might have been written to during
the copying.  Again, ZFS already handles that for itself: it uses
range locking in the read and write paths (see zfs_rangelock_enter()
in zfs_write() etc), BUT that's only going to work if the actual
pread()/pwrite() system calls that reach ZFS are aligned with
PostgreSQL's pages.

Every now and then a discussion breaks out about WTF POSIX actually
requires WRT concurrent read/write, but it's trivial to show  that the
most popular Linux filesystem exposes randomly mashed-up data from old
and new versions of even small writes if you read while a write is
concurrently in progress[1], while many others don't.  That's what the
2nd thing is protecting against.  I think it must be possible to show
that breaking on ZFS too, *if* the file regions arriving into system
calls are NOT correctly aligned.  As Andres points out, 
buffered IO streams create a risk there: we have no idea what system
calls are reaching ZFS, so it doesn't seem safe to turn off full page
writes unless you also fix that.

> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are needed.
>
> I may be missing something here: how does full_page_writes prevent
> short _reads_ ?

I don't know, but I think the paranoid approach would be that if you
get a short read, you go back and pread() at least that whole page, so
all your system calls are fully aligned.  Then I think you'd be safe?
Because zfs_read() does:

/*
 * Lock the range against changes.
 */
zfs_locked_range_t *lr = zfs_rangelock_enter(&zp->z_rangelock,
zfs_uio_offset(uio), zfs_uio_resid(uio), RL_READER);

So it should be possible to make a safe version of this patch, by
teaching the file-reading code to require BLCKSZ integrity for all
reads.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2B19bZKidSiWmMsDmgUVe%3D_rr0m57LfR%2BnAbWprVDd_cw%40mail.gmail.com




Re: [DOCS] Stats views and functions not in order?

2022-11-07 Thread Peter Smith
On Mon, Nov 7, 2022 at 5:50 AM Tom Lane  wrote:
>
> Peter Smith  writes:
> > Sorry, I forgot the attachments in the previous post. PSA.
>
> I spent a bit of time looking at this.  I agree that a lot of the
> current ordering choices here look like they were made with the
> advice of a dartboard, and there's a number of things that are
> pretty blatantly just sloppy merging (like the out-of-order
> wait-event items).  However, I'm not a big fan of "alphabetical
> order at all costs", because that frequently leads to ordering
> decisions that are not a lot better than random from a semantic
> standpoint.  For example, I resist the idea that it's sensible
> to put pg_stat_all_indexes before pg_stat_all_tables.
> I'm unconvinced that putting pg_stat_sys_tables and
> pg_stat_user_tables far away from pg_stat_all_tables is great,
> either.
>

Thanks for taking the time to look at my patch. The "at all costs"
approach was not the intention - I was just trying only to apply some
sane ordering where I did not recognize a reason for the current
order.

> So ... how do we proceed?
>

To proceed with the existing patches I need some guidance on exactly
which of the changes can be considered improvements versus which ones
are maybe just trading one 'random' order for another.

How about below?

Table 28.1. Dynamic Statistics Views -- Alphabetical order would be a
small improvement here, right?

Table 28.2. Collected Statistics Views -- Leave this one unchanged
(per your comments above).

Table 28.12 Wait Events of type LWLock -- Seems a clear case of bad
merging. Alphabetical order is surely needed here, right?

Table 28.34 Additional Statistic Functions -- Alphabetical order would
be a small improvement here, right?

Table 28.35 Per-Backend Statistics Functions --  Alphabetical order
would be a small improvement here, right?

> One thing I'm unhappy about that you didn't address is that
> the subsection ordering in "28.4. Progress Reporting" could
> hardly have been invented even with a dartboard.  Perhaps it
> reflects development order, but that's a poor excuse.
> I'd be inclined to alphabetize by SQL command name, but maybe
> leave Base Backup to the end since it's not a SQL command.
>

Yes, I had previously only looked at the content of section 28.2
because I didn't want to get carried away by changing too much until
there was some support for doing the first part.

Now PSA a separate patch for fixing section "28.4. Progress Reporting"
order as suggested.

-
Kind Regards,
Peter Smith.
Fujitsu Australia.


v4-0001-Re-order-sections-of-28.4.-Progress-Reporting.patch
Description: Binary data


Re: new option to allow pg_rewind to run without full_page_writes

2022-11-07 Thread Andres Freund
Hi,

On 2022-11-08 00:07:09 +0100, Jérémie Grauer wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > Hi,
> >
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> > > Currently pg_rewind refuses to run if full_page_writes is off. This is to
> > > prevent it to run into a torn page during operation.
> > >
> > > This is usually a good call, but some file systems like ZFS are naturally
> > > immune to torn page (maybe btrfs too, but I don't know for sure for this
> > > one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.

> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Unfortunately not relevant for pg_rewind due to the issues mentioned
subsequently.


> > Right now, that definitely allows for torn reads, because of the way
> > pg_read_binary_file() is implemented.  We only ensure a 4k read size from 
> > the
> > view of our code, which obviously can lead to torn 8k page reads, no matter
> > what the filesystem guarantees.
> >
> > Also, for reasons I don't understand we use C streaming IO or
> > pg_read_binary_file(), so you'd also need to ensure that the buffer size 
> > used
> > by the stream implementation can't cause the reads to happen in smaller
> > chunks.  Afaict we really shouldn't use file streams here, then we'd at 
> > least
> > have control over that aspect.
> >
> >
> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are needed.
>
> I may be missing something here: how does full_page_writes prevent
> short _reads_ ?

Yes.


> Presumably, if we do something like read the first 4K of a file, then change
> the file, then read the next 4K, the second 4K may be a torn read.

Correct.


> But I fail to see how full_page_writes prevents this since it only act on 
> writes

It ensures the damage is later repaired during WAL replay. Which can only
happen if the WAL contains the necessary information to do so - the full page
writes.


I suspect to avoid the need for this we'd need to atomically read all the
pages involved in a WAL record (presumably by locking the pages against
IO). That'd then safely allow skipping replay of WAL records based on the LSN.

A slightly easier thing would be to force-enable full page writes just for the
duration of a rewind, similar to what we do during base backups. But that'd
still require a bunch more work than done here.


> > This isn't an fundamental issue - we could have a version of
> > pg_read_binary_file() for relation data that prevents the page being written
> > out concurrently by locking the buffer page. In addition it could often 
> > avoid
> > needing to read the page from the OS / disk, if present in shared buffers
> > (perhaps minus cases where we haven't flushed the WAL yet, but we could also
> > flush the WAL in those).
> > I agree, but this would need a differen patch, which may be beyond my
> skills.
> > Greetings,
> >
> > Andres Freund
> Anyway, ZFS will act like full_page_writes is always active, so isn't the
> proposed modification to pg_rewind valid?

No. This really isn't about the crash safety aspects of full page writes, so
the fact that ZFS is used is just not really relevant.

Regards,

Andres




Re: Allow single table VACUUM in transaction block

2022-11-07 Thread Peter Geoghegan
On Mon, Nov 7, 2022 at 12:20 AM Simon Riggs
 wrote:
> > Another related idea: better behavior in the event of a manually
> > issued VACUUM (now just an enqueued autovacuum) that cannot do useful
> > work due to the presence of a long running snapshot. The VACUUM
> > doesn't have to dutifully report "success" when there is no practical
> > sense in which it was successful. There could be a back and forth
> > conversation between autovacuum.c and vacuumlazy.c that makes sure
> > that something useful happens sooner or later. The passage of time
> > really matters here.
>
> Regrettably, neither vacuum nor autovacuum waits for xmin to change;
> perhaps it should.

Yes, it's very primitive right now. In fact I recently discovered that
just using the reloption version (not the GUC version) of
autovacuum_freeze_max_age in a totally straightforward way is all it
takes to utterly confuse autovacuum.c:

https://postgr.es/m/CAH2-Wz=DJAokY_GhKJchgpa8k9t_H_OVOvfPEn97jGNr9W=d...@mail.gmail.com

It's easy to convince autovacuum.c to launch antiwraparound
autovacuums that reliably have no chance of advancing relfrozenxid to
a degree that satisfies autovacuum.c. It will launch antiwraparound
autovacuums again and again, never realizing that VACUUM doesn't
really care about what it expects (at least not with the reloption in
use). Clearly that's just broken. It also suggests a more general
design problem, at least in my mind.

--
Peter Geoghegan




Re: Allow file inclusion in pg_hba and pg_ident files

2022-11-07 Thread Michael Paquier
On Mon, Nov 07, 2022 at 03:07:15PM +0900, Michael Paquier wrote:
> Attached is a set of three patches:
> - 0001 changes tokenize_inc_file() to use AbsoluteConfigLocation().
> AbsoluteConfigLocation() uses a static buffer and a MAXPGPATH, but
> we'd rather change it to use a palloc()+strcpy() instead and remove
> the static restriction?  What do you think?  The same applies for the
> case where we use DataDir, actually, and it seems like there is no
> point in this path-length restriction in this code path.
> - 0002 invents the interface to open auth files and check for their
> depths, simplifying the main patch a bit as there is no need to track
> the depth level here and there anymore.
> - 0003 is the rebased patch, simplified after the other changes.  The
> bulk of the patch is in its TAP test.

CF bot unhappy as I have messed up with rules.out.  Rebased.  I have
removed the restriction on MAXPGPATH in AbsoluteConfigLocation() in
0001, while on it.  The absolute paths built on GUC or ident
inclusions are the same.
--
Michael
From f056ccbd604d63185763f774c5105f3208919306 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 8 Nov 2022 09:51:20 +0900
Subject: [PATCH v17 1/3] Expand the use of AbsoluteConfigLocation() in hba.c

The logic in charge of expanding an include file for database and user
names used the same code as AbsoluteConfigLocation() when building the
configuration file to include, so simplify this code.  While on it,
remove the restriction to MAXPGPATH, and switch to the same method as
what tokenize_inc_file() used.
---
 src/backend/libpq/hba.c| 17 ++---
 src/backend/utils/misc/conffiles.c | 12 
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index e9fc0af7c9..a9f87ab5bf 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -41,6 +41,7 @@
 #include "storage/fd.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/conffiles.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -466,21 +467,7 @@ tokenize_inc_file(List *tokens,
 	ListCell   *inc_line;
 	MemoryContext linecxt;
 
-	if (is_absolute_path(inc_filename))
-	{
-		/* absolute path is taken as-is */
-		inc_fullname = pstrdup(inc_filename);
-	}
-	else
-	{
-		/* relative path is relative to dir of calling file */
-		inc_fullname = (char *) palloc(strlen(outer_filename) + 1 +
-	   strlen(inc_filename) + 1);
-		strcpy(inc_fullname, outer_filename);
-		get_parent_directory(inc_fullname);
-		join_path_components(inc_fullname, inc_fullname, inc_filename);
-		canonicalize_path(inc_fullname);
-	}
+	inc_fullname = AbsoluteConfigLocation(inc_filename, outer_filename);
 
 	inc_file = AllocateFile(inc_fullname, "r");
 	if (inc_file == NULL)
diff --git a/src/backend/utils/misc/conffiles.c b/src/backend/utils/misc/conffiles.c
index 4a99a1961e..35e2a3790b 100644
--- a/src/backend/utils/misc/conffiles.c
+++ b/src/backend/utils/misc/conffiles.c
@@ -35,15 +35,17 @@
 char *
 AbsoluteConfigLocation(const char *location, const char *calling_file)
 {
-	char		abs_path[MAXPGPATH];
-
 	if (is_absolute_path(location))
 		return pstrdup(location);
 	else
 	{
+		char	   *abs_path;
+
 		if (calling_file != NULL)
 		{
-			strlcpy(abs_path, calling_file, sizeof(abs_path));
+			abs_path = (char *) palloc0(strlen(calling_file) + 1 +
+		strlen(location) + 1);
+			strcpy(abs_path, calling_file);
 			get_parent_directory(abs_path);
 			join_path_components(abs_path, abs_path, location);
 			canonicalize_path(abs_path);
@@ -51,10 +53,12 @@ AbsoluteConfigLocation(const char *location, const char *calling_file)
 		else
 		{
 			Assert(DataDir);
+			abs_path = (char *) palloc0(strlen(DataDir) + 1 +
+		strlen(location) + 1);
 			join_path_components(abs_path, DataDir, location);
 			canonicalize_path(abs_path);
 		}
-		return pstrdup(abs_path);
+		return abs_path;
 	}
 }
 
-- 
2.38.1

From e201139be17e525c266240acf762c2b6a9fe6436 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 7 Nov 2022 13:35:44 +0900
Subject: [PATCH v17 2/3] Invent open_auth_file() in hba.c, to refactor auth
 file opening

This adds a check on the recursion depth when including auth files,
something that has never been done when processing '@' files for
database and user name lists in pg_hba.conf.
---
 src/include/libpq/hba.h  |   4 +-
 src/backend/libpq/hba.c  | 100 ++-
 src/backend/utils/adt/hbafuncs.c |  18 ++
 3 files changed, 79 insertions(+), 43 deletions(-)

diff --git a/src/include/libpq/hba.h b/src/include/libpq/hba.h
index 7ad227d34a..a84a5f0961 100644
--- a/src/include/libpq/hba.h
+++ b/src/include/libpq/hba.h
@@ -177,7 +177,9 @@ extern int	check_usermap(const char *usermap_name,
 extern HbaLine *parse_hba_line(TokenizedAuthLine *tok_line, int elevel);
 extern IdentLine *parse_ident_line(TokenizedAuthLine *tok_line, int elevel);
 extern bool

Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2022-11-07 Thread Jacob Champion
On Thu, Nov 3, 2022 at 4:39 PM Jacob Champion  wrote:
> There is an additional test failure with LibreSSL, which doesn't appear
> to honor the SSL_CERT_FILE environment variable. This isn't a problem in
> production -- if you're using LibreSSL, you'd presumably understand that
> you can't use that envvar -- but it makes testing difficult, because I
> don't yet know a way to tell LibreSSL to use a different set of roots
> for the duration of a test. Has anyone dealt with this before?

Fixed in v3, with a large hammer (configure-time checks). Hopefully
I've missed a simpler solution.

> > If there are no valuable use cases for weaker checks, then we could go
> > even further than my 0002 and just reject any weaker sslmodes
> > outright. That'd be nice.

Done. sslrootcert=system now prevents you from explicitly setting a
weaker sslmode, to try to cement it as a Do What I Mean sort of
feature. If you need something weird then you can still jump through
the hoops by setting sslrootcert to a real file, same as today.

The macOS/OpenSSL 3.0.0 failure is still unfixed.

Thanks,
--Jacob
diff --git a/configure b/configure
index 3966368b8d..56166f1097 100755
--- a/configure
+++ b/configure
@@ -2095,6 +2095,56 @@ $as_echo "$ac_res" >&6; }
 
 } # ac_fn_c_check_func
 
+# ac_fn_c_check_decl LINENO SYMBOL VAR INCLUDES
+# -
+# Tests whether SYMBOL is declared in INCLUDES, setting cache variable VAR
+# accordingly.
+ac_fn_c_check_decl ()
+{
+  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
+  # Initialize each $ac_[]_AC_LANG_ABBREV[]_decl_warn_flag once.
+  as_decl_name=`echo $2|sed 's/ *(.*//'`
+  as_decl_use=`echo $2|sed -e 's/(/((/' -e 's/)/) 0&/' -e 's/,/) 0& (/g'`
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether $as_decl_name is 
declared" >&5
+$as_echo_n "checking whether $as_decl_name is declared... " >&6; }
+if eval \${$3+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_save_werror_flag=$ac_c_werror_flag
+  ac_c_werror_flag="$ac_c_decl_warn_flag$ac_c_werror_flag"
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+$4
+int
+main ()
+{
+#ifndef $as_decl_name
+#ifdef __cplusplus
+  (void) $as_decl_use;
+#else
+  (void) $as_decl_name;
+#endif
+#endif
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  eval "$3=yes"
+else
+  eval "$3=no"
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+  ac_c_werror_flag=$ac_save_werror_flag
+fi
+eval ac_res=\$$3
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_res" >&5
+$as_echo "$ac_res" >&6; }
+  eval $as_lineno_stack; ${as_lineno_stack:+:} unset as_lineno
+
+} # ac_fn_c_check_decl
+
 # ac_fn_c_check_type LINENO TYPE VAR INCLUDES
 # ---
 # Tests whether TYPE exists after having included INCLUDES, setting cache
@@ -2388,56 +2438,6 @@ rm -f conftest.val
   as_fn_set_status $ac_retval
 
 } # ac_fn_c_compute_int
-
-# ac_fn_c_check_decl LINENO SYMBOL VAR INCLUDES
-# -
-# Tests whether SYMBOL is declared in INCLUDES, setting cache variable VAR
-# accordingly.
-ac_fn_c_check_decl ()
-{
-  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
-  # Initialize each $ac_[]_AC_LANG_ABBREV[]_decl_warn_flag once.
-  as_decl_name=`echo $2|sed 's/ *(.*//'`
-  as_decl_use=`echo $2|sed -e 's/(/((/' -e 's/)/) 0&/' -e 's/,/) 0& (/g'`
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether $as_decl_name is 
declared" >&5
-$as_echo_n "checking whether $as_decl_name is declared... " >&6; }
-if eval \${$3+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  ac_save_werror_flag=$ac_c_werror_flag
-  ac_c_werror_flag="$ac_c_decl_warn_flag$ac_c_werror_flag"
-  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-$4
-int
-main ()
-{
-#ifndef $as_decl_name
-#ifdef __cplusplus
-  (void) $as_decl_use;
-#else
-  (void) $as_decl_name;
-#endif
-#endif
-
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_compile "$LINENO"; then :
-  eval "$3=yes"
-else
-  eval "$3=no"
-fi
-rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-  ac_c_werror_flag=$ac_save_werror_flag
-fi
-eval ac_res=\$$3
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_res" >&5
-$as_echo "$ac_res" >&6; }
-  eval $as_lineno_stack; ${as_lineno_stack:+:} unset as_lineno
-
-} # ac_fn_c_check_decl
 cat >config.log <<_ACEOF
 This file contains any messages produced by compilers while
 running configure, to aid debugging if configure makes a mistake.
@@ -13035,6 +13035,107 @@ _ACEOF
 fi
 done
 
+  # Let tests differentiate between vanilla OpenSSL and LibreSSL.
+  # The Clang compiler raises a warning for an undeclared identifier that 
matches
+# a compiler builtin function.  All extant Clang versions are affected, as of
+# Clang 3.6.0.  Test a builtin known to every version.  This problem affects 
the
+# C and Objective C language

Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-11-07 Thread David Rowley
On Tue, 8 Nov 2022 at 03:53, Ronan Dunklau  wrote:
> I can't see why an incrementalsort could be more expensive than a full sort,
> using the same presorted path. It looks to me that in that case we should
> always prefer an incrementalsort. Maybe we should bound incremental sorts cost
> to make sure they are never more expensive than the full sort ?

The only thing that I could think of that would cause incremental sort
to be more expensive than sort is the tuplesort_reset() calls that are
performed between sorts. However, I see cost_incremental_sort()
accounts for those already with:

run_cost += 2.0 * cpu_tuple_cost * input_groups;

Also, I see at the top of incremental_sort.sql there's a comment claiming:

-- When we have to sort the entire table, incremental sort will
-- be slower than plain sort, so it should not be used.

I'm just unable to verify that's true by doing the following:

$ echo "select * from (select * from tenk1 order by four) t order by
four, ten;" > bench.sql

$ pgbench -n -f bench.sql -T 60 -M prepared regression | grep -E "^tps"
tps = 102.136151 (without initial connection time)

$ # disable sort so that the test performs Sort -> Incremental Sort rather
$ # than Sort -> Sort
$ psql -c "alter system set enable_sort=0;" regression
$ psql -c "select pg_reload_conf();" regression

$ pgbench -n -f bench.sql -T 60 -M prepared regression | grep -E "^tps"
tps = 112.378761 (without initial connection time)

When I disable sort, the plan changes to use Incremental Sort and
execution becomes faster, not slower like the comment claims it will.
Perhaps this was true during the development of Incremental sort and
then something was changed to speed things up.  I do recall reviewing
that patch many years ago and hinting about the invention of
tuplesort_reset(). I don't recall, but I assume the patch must have
been creating a new tuplesort each group before that.
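For anyone who wants to eyeball the plans rather than the tps numbers, the
same query under EXPLAIN shows the two shapes (a sketch, run against the
regression database):

    -- enable_sort = on:  the outer node is a plain Sort over the subquery's Sort
    -- enable_sort = off: the outer node becomes Incremental Sort (Presorted Key: four)
    EXPLAIN (COSTS OFF)
    SELECT * FROM (SELECT * FROM tenk1 ORDER BY four) t ORDER BY four, ten;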

Also, I was looking at add_paths_to_grouping_rel() and I saw that if
presorted_keys > 0 that we'll create both a Sort and Incremental Sort
path.  If we assume Incremental Sort is always better when it can be
done, then it seems useless to create the Sort path when Incremental
Sort is possible.  When working on making Incremental Sorts work for
window functions I did things that way. Maybe we should just make
add_paths_to_grouping_rel() work the same way.

Regarding the 1.5 factor in cost_incremental_sort(), I assume this is
for skewed groups. Imagine there's 1 huge group and 99 tiny ones.
However, even if that were the case, I imagine the performance would
still be around the same performance as the non-incremental variant of
sort.

I've been playing around with the attached patch which does:

1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path
when we can add an Incremental Sort path instead. This removes quite a
few redundant lines of code.
2. Removes the * 1.5 fuzz-factor in cost_incremental_sort()
3. Does various other code tidy stuff in cost_incremental_sort().
4. Removes the test from incremental_sort.sql that was ensuring the
inferior Sort -> Sort plan was being used instead of the superior Sort
-> Incremental Sort plan.

I'm not really that 100% confident in the removal of the * 1.5 thing.
I wonder if there's some reason we're not considering that might cause
a performance regression if we're to remove it.

David
diff --git a/src/backend/optimizer/path/costsize.c 
b/src/backend/optimizer/path/costsize.c
index 4c6b1d1f55..9f4c011dac 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1959,8 +1959,8 @@ cost_incremental_sort(Path *path,
  double input_tuples, int width, Cost 
comparison_cost, int sort_mem,
  double limit_tuples)
 {
-   Coststartup_cost = 0,
-   run_cost = 0,
+   Coststartup_cost,
+   run_cost,
input_run_cost = input_total_cost - 
input_startup_cost;
double  group_tuples,
input_groups;
@@ -1969,10 +1969,9 @@ cost_incremental_sort(Path *path,
group_input_run_cost;
List   *presortedExprs = NIL;
ListCell   *l;
-   int i = 0;
boolunknown_varno = false;
 
-   Assert(presorted_keys != 0);
+   Assert(presorted_keys > 0 && presorted_keys < list_length(pathkeys));
 
/*
 * We want to be sure the cost of a sort is never estimated as zero, 
even
@@ -2025,8 +2024,7 @@ cost_incremental_sort(Path *path,
/* expression not containing any Vars with "varno 0" */
presortedExprs = lappend(presortedExprs, member->em_expr);
 
-   i++;
-   if (i >= presorted_keys)
+   if (foreach_current_index(l) + 1 >= presorted_keys)
break;

Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-07 Thread David Rowley
Thanks for having a look.

On Tue, 8 Nov 2022 at 05:24, Andres Freund  wrote:
> FWIW, to me this is really more part of mcxt.c than its own
> allocator... Particularly because MemoryContextAllocAligned() et al are
> implemented there.

I'm on the fence about this one. I thought it was nice that we have a
file per consumed MemoryContextMethodID. The thing that caused me to
add alignedalloc.c was the various other comments in the declaration
of mcxt_methods[] that mention where to find each of the methods being
assigned in that array (e.g /* aset.c */).  I can change it back if
you feel strongly. I just don't.

> I doubt there's ever a need to realloc such a pointer? Perhaps we could just
> elog(ERROR)?

Are you maybe locked into just thinking about your current planned use
case that we want to allocate BLCKSZ bytes in each case? It does not
seem impossible to me that someone will want something more than an
8-byte alignment and also might want to enlarge the allocation at some
point. I thought it might be more dangerous not to implement repalloc.
It might not be clear to someone using palloc_aligned() that there's
no code path that can call repalloc on the returned pointer.

> Hm, not that I can see a case for ever not using a power of two
> alignment... There's not really a "need" for the restriction, right? Perhaps
> we should note that?

TYPEALIGN() will not work correctly unless the alignment is a power of
2. We could modify it to, but that would require doing some modular
maths instead of bitmasking. That seems pretty horrible when the macro
is given a value that's not constant at compile time as we'd end up
with a (slow) divide in the code path.  I think the restriction is a
good idea. I imagine there will never be any need to align to anything
that's not a power of 2.

> Should we handle the case where we get a suitably aligned pointer from
> MemoryContextAllocExtended() differently?

Maybe it would be worth the extra check. I'm trying to imagine future
use cases.  Maybe if someone wanted to ensure that we're aligned to
CPU cache line boundaries then the chances of the pointer already
being aligned to 64 bytes is decent enough.  The problem is that
it's too late to save any memory; it just saves a bit of boxing and
unboxing of the redirect headers.

> > + /* XXX: should we adjust valgrind state here? */
>
> Probably still need to do this... Kinda hard to get right without the code
> getting exercised.

Yeah, that comment kept catching my eye. I agree. That should be
handled correctly. I'll work on that.

> Wonder if there's some minimal case we could actually use
> it for?

Is there anything we could align to CPU cacheline size that would
speed something up?

David




Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-07 Thread David Rowley
On Fri, 4 Nov 2022 at 04:20, Maxim Orlov  wrote:
> I've done a quick look and the patch is looks good to me.
> Let's add tests for these functions, should we? If you think this is an 
> overkill, feel free to trim tests for your taste.

Thanks for doing that.  I'm keen to wait a bit and see if we can come
up with a core user of this before adding a test module.  However, if
we keep the repalloc() implementation, then I wonder if it might be
worth having a test module for that. I see you're not testing it in
the one you've written.  Andres has suggested I remove the repalloc
stuff I added but see my reply to that. I think we should keep it
based on the fact that someone using palloc_aligned might have no idea
if some other code path can call repalloc() on the returned pointer.

David




Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-07 Thread John Naylor
On Tue, Nov 8, 2022 at 8:57 AM David Rowley  wrote:
> Is there anything we could align to CPU cacheline size that would
> speed something up?

InitCatCache() already has this, which could benefit from simpler notation.

/*
 * Allocate a new cache structure, aligning to a cacheline boundary
 *
 * Note: we rely on zeroing to initialize all the dlist headers correctly
 */
sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
cp = (CatCache *) CACHELINEALIGN(palloc0(sz));

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Suppressing useless wakeups in walreceiver

2022-11-07 Thread Thomas Munro
On Sun, Nov 6, 2022 at 12:01 PM Nathan Bossart  wrote:
> Here is a new version of the patch that addresses this feedback.

This looks pretty good to me.  Thanks for picking it up!  I can live
with the change to use a global variable; it seems we couldn't quite
decide what the scope was for moving stuff into a struct (ie various
pre-existing stuff representing the walreceiver's state), but I'm OK
with considering that as a pure refactoring project for a separate
thread.  That array should be "static", though.




Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-07 Thread Andres Freund
Hi,

On 2022-11-08 15:01:18 +1300, David Rowley wrote:
> Andres has suggested I remove the repalloc stuff I added but see my reply to
> that.

I'm fine with keeping it, I just couldn't really think of cases that have
strict alignment requirements but also requires resizing.

Greetings,

Andres Freund




Re: allow segment size to be set to < 1GiB

2022-11-07 Thread Andres Freund
Hi,

On 2022-11-07 12:52:25 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > In the attached patch I renamed --with-segsize= to --with-segsize-mb= /
> > -Dsegsize= to -Dsegsize_mb=, to avoid somebody building with 
> > --with-segsize=2
> > or such suddenly ending up with an incompatible build.
> 
> For the purpose of exercising these code paths with the standard
> regression tests, even a megabyte seems large -- we don't create
> very many test tables that are that big.

Good point.


> How about instead allowing the segment size to be set in pages?

In addition or instead of --with-segsize/-Dsegsize? Just offering the number
of pages seems like a not great UI.

I guess we could add support for units or such? But that seems messy as well.

Greetings,

Andres Freund




Re: In-placre persistance change of a relation

2022-11-07 Thread Kyotaro Horiguchi
At Fri, 4 Nov 2022 09:32:52 +0900, Ian Lawrence Barwick  
wrote in 
> cfbot reports the patch no longer applies. As CommitFest 2022-11 is
> currently underway, this would be an excellent time to update the patch.

Indeed, thanks!  I'll do that in a few days.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: allow segment size to be set to < 1GiB

2022-11-07 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-07 12:52:25 -0500, Tom Lane wrote:
>> How about instead allowing the segment size to be set in pages?

> In addition or instead of --with-segsize/-Dsegsize?

In addition to.  What I meant by "instead" was to replace
your proposal of --with-segsize-mb.

> Just offering the number of pages seems like a not great UI.

Well, it's a developer/debug focused API.  I think regular users
would only care for the existing --with-segsize = so-many-GB API.
But for testing, I think --with-segsize-pages = so-many-pages
is actually a pretty good UI.

regards, tom lane




Re: psql: Add command to use extended query protocol

2022-11-07 Thread Corey Huinker
On Mon, Nov 7, 2022 at 4:12 PM Tom Lane  wrote:

> Corey Huinker  writes:
> > I thought about basically reserving the \$[0-9]+ space as bind variables,
> > but it is possible, though unlikely, that users have been naming their
> > variables like that.
>
> Don't we already reserve that syntax as Params?  Not sure whether there
> would be any conflicts versus Params, but these are definitely not legal
> as SQL identifiers.
>
> regards, tom lane
>

I think Pavel was hinting at something like:

\set $1 foo
\set $2 123
UPDATE mytable SET value = $1 WHERE id = $2;

Which wouldn't step on anything, because I tested it, and \set $1 foo
already returns 'Invalid variable name "$1"'.

So far, there seem to be two possible variations on how to go about this:

1. Have special variables or a variable namespace that are known to be bind
variables. So long as one of them is defined, queries are sent using
extended query protocol.
2. Bind parameters one-time-use, applied strictly to the query currently in
the buffer in positional order, and once that query is run their
association with being binds is gone.

Each has its merits, I guess it comes down to how much we expect users to
want to re-use some or all the bind params of the previous query.
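For what it's worth, variation 2 could look something like the sketch below;
the \bind meta-command name and its exact syntax are only placeholders here,
not necessarily what the patch implements:

    UPDATE mytable SET value = $1 WHERE id = $2 \bind 'foo' 123 \g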


RE: Logical replication timeout problem

2022-11-07 Thread wangw.f...@fujitsu.com
On Fri, Nov 4, 2022 at 18:13 PM Fabrice Chapuis  wrote:
> Hello Wang,
> 
> I tested the draft patch in my lab for Postgres 14.4, the refresh of the
> materialized view ran without generating the timeout on the worker.
> Do you plan to propose this patch at the next commit fest.

Thanks for your confirmation!
I will add this thread to the commit fest soon.

The following is the problem analysis and fix approach:
I think the problem is that when a transaction contains a DDL that generates lots
of temporary data due to rewrite rules, this temporary data will not be
processed by the pgoutput plugin. Therefore, the previous fix (f95d53e) for
DML had no impact on this case.

To fix this, I think we need to try to send the keepalive messages after each
change is processed by the walsender, not in the pgoutput plugin.

Attach the patch.

Regards,
Wang wei


v1-0001-Fix-the-logical-replication-timeout-during-proces.patch
Description:  v1-0001-Fix-the-logical-replication-timeout-during-proces.patch


Re: Allow single table VACUUM in transaction block

2022-11-07 Thread Simon Riggs
On Mon, 7 Nov 2022 at 08:20, Simon Riggs  wrote:

> Temp tables are actually easier, since we don't need any of the
> concurrency features we get with lazy vacuum. So the answer is to
> always run a VACUUM FULL on temp tables since this skips any issues
with indexes etc.

So I see 3 options for what to do next

1. Force the FULL option for all tables, when executed in a
transaction block. This gets round the reasonable objections to
running a concurrent vacuum in a shared xact block. As Justin points
out, CLUSTER is already supported, which uses the same code.

2. Force the FULL option for temp tables, when executed in a
transaction block. In a later patch, queue up an autovacuum run for
regular tables.

3. Return with feedback this patch. (But then what happens with temp tables?)

Thoughts?

--
Simon Riggshttp://www.EnterpriseDB.com/




Re: New docs chapter on Transaction Management and related changes

2022-11-07 Thread Laurenz Albe
On Mon, 2022-11-07 at 23:04 +0100, Laurenz Albe wrote:
> On Sat, 2022-11-05 at 10:08 +, Simon Riggs wrote:
> > Agreed; new compilation patch attached, including mine and then
> > Robert's suggested rewordings.
> 
> Thanks.  There is clearly a lot of useful information in this.
> 
> Some comments: [...]

I thought some more about the patch, and I don't like the title
"Transaction Management" for the new chapter.  I'd expect some more
from a chapter titled "Internals" / "Transaction Management".

In reality, the new chapter is about transaction IDs.  So perhaps the
name should reflect that, so that it does not mislead the reader.

Yours,
Laurenz Albe




Re: New docs chapter on Transaction Management and related changes

2022-11-07 Thread Bruce Momjian
On Mon, Nov  7, 2022 at 10:58:05AM +, Simon Riggs wrote:
> What I've posted is the merged patch, i.e. your latest patch, plus
> changes to RELEASE SAVEPOINT from you on Oct 16, plus changes based on
> the later comments from Robert and I.

Thanks.  I have two changes to your patch.  First, I agree "destroy" is
the wrong word for this, but I don't think "subcommit" is good, for
three reasons:

1. Release merges the non-aborted changes into the previous transaction
_and_ frees their resources --- "subcommit" doesn't have both meanings,
which I think means if we need a single word, we should use "release"
and later define what that means.

2. The "subcommit" concept doesn't closely match the user-visible
behavior, even though we use subtransactions to accomplish this. Release
is more of a rollup/merge into the previously-active
transaction/savepoint.

3. "subcommit" is an implementation detail that I don't think we should
expose to users in the manual pages.

I adjusted the first paragraph of RELEASE SAVEPOINT to highlight the
above issues.  My original patch had similar wording.

The first attachment shows my changes to your patch, and the second
attachment is my full patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson

diff --git a/doc/src/sgml/ref/release_savepoint.sgml b/doc/src/sgml/ref/release_savepoint.sgml
index 00b57ffcbb..34299f5618 100644
--- a/doc/src/sgml/ref/release_savepoint.sgml
+++ b/doc/src/sgml/ref/release_savepoint.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
   RELEASE SAVEPOINT
-  subcommit a previously defined savepoint
+  release a previously defined savepoint
  
 
  
@@ -34,16 +34,13 @@ RELEASE [ SAVEPOINT ] savepoint_name
   Description
 
   
-   RELEASE SAVEPOINT will subcommit the subtransaction
-   established by the named savepoint, if one exists. This will release
-   any resources held by the subtransaction. If there were any
-   subtransactions of the named savepoint, these will also be subcommitted.
-  
-
-  
-   Changes made after RELEASE SAVEPOINT will be in
-   the same transaction, and have the same transaction id, as changes
-   made before the named savepoint was created.
+   RELEASE SAVEPOINT releases the named savepoint and
+   all active savepoints that were created after the named savepoint,
+   and frees their resources.  All changes made since the creation of the
+   savepoint, excluding rolled back savepoints changes, are merged into
+   the transaction or savepoint that was active when the named savepoint
+   was created.  Changes made after RELEASE SAVEPOINT
+   will also be part of this active transaction or savepoint.
   
  
 
@@ -55,7 +52,7 @@ RELEASE [ SAVEPOINT ] savepoint_name
 savepoint_name
 
  
-  The name of the savepoint to subcommit.
+  The name of the savepoint to release.
  
 

diff --git a/doc/src/sgml/ref/rollback.sgml b/doc/src/sgml/ref/rollback.sgml
index 02b118fc04..e0e7c61560 100644
--- a/doc/src/sgml/ref/rollback.sgml
+++ b/doc/src/sgml/ref/rollback.sgml
@@ -61,9 +61,6 @@ ROLLBACK [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
   linkend="sql-set-transaction"/>) as the just finished one.  Otherwise,
   no new transaction is started.
  
- 
-  The SQL Standard describes this as a chained transaction.
- 
 

   
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 559eb898a9..6df0092f4c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7225,12 +7225,14 @@ local0.*/var/log/postgresql
 
 
  %v
- Virtual transaction ID (backendID/localXID)
+ Virtual transaction ID (backendID/localXID);  see
+ 
  no
 
 
  %x
- Transaction ID (0 if none is assigned)
+ Transaction ID (0 if none is assigned);  see
+ 
  no
 
 
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index b030b36002..fdffba4442 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4992,7 +4992,8 @@ WHERE ...
 xmin and xmax.  Transaction identifiers are 32-bit quantities.
 In some contexts, a 64-bit variant xid8 is used.  Unlike
 xid values, xid8 values increase strictly
-monotonically and cannot be reused in the lifetime of a database cluster.
+monotonically and cannot be reused in the lifetime of a database
+cluster.  See  for more details.

 

diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index de450cd661..0d6be9a2fa 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -104,6 +104,7 @@
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6e0425cb3d..153587c4b8 100644
--- a/doc/src/sgml/func.sgml
+++ b/d

Re: New docs chapter on Transaction Management and related changes

2022-11-07 Thread Bruce Momjian
On Tue, Nov  8, 2022 at 04:37:59AM +0100, Laurenz Albe wrote:
> On Mon, 2022-11-07 at 23:04 +0100, Laurenz Albe wrote:
> > On Sat, 2022-11-05 at 10:08 +, Simon Riggs wrote:
> > > Agreed; new compilation patch attached, including mine and then
> > > Robert's suggested rewordings.
> > 
> > Thanks.  There is clearly a lot of useful information in this.
> > 
> > Some comments: [...]
> 
> I thought some more about the patch, and I don't like the title
> "Transaction Management" for the new chapter.  I'd expect some more
> from a chapter titled "Internals" / "Transaction Management".
> 
> In reality, the new chapter is about transaction IDs.  So perhaps the
> name should reflect that, so that it does not mislead the reader.

I renamed it to "Transaction Processing" since we also cover locking and
subtransactions.  How is that?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-07 Thread Hayato Kuroda (Fujitsu)
> Fair point. I think if the user wants, she can join with
> pg_stat_subscription based on PID and find the corresponding
> subscription. However, if we want to identify everything via pg_locks
> then I think we should also mention classid or database id as field1.
> So, it would look like: field1: (pg_subscription's oid or current db
> id); field2: OID of subscription in pg_subscription; field3: local or
> remote xid; field4: 0/1 to differentiate between remote and local xid.

Sorry, I missed the discussion related to LOCKTAG.
+1 for adding a new tag like LOCKTAG_PARALLEL_APPLY, and
I would prefer field1 to be the database id because that is more useful for
reporting a lock in DescribeLockTag().

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-07 Thread David Rowley
On Tue, 8 Nov 2022 at 14:57, David Rowley  wrote:
> On Tue, 8 Nov 2022 at 05:24, Andres Freund  wrote:
> > Should we handle the case where we get a suitably aligned pointer from
> > MemoryContextAllocExtended() differently?
>
> Maybe it would be worth the extra check. I'm trying to imagine future
> use cases.  Maybe if someone wanted to ensure that we're aligned to
> CPU cache line boundaries then the chances of the pointer already
> being aligned to 64 bytes are decent enough.  The problem is that
> it's too late to save any memory; it just saves a bit of boxing and
> unboxing of the redirect headers.

Thinking about that a bit more: if we keep the repalloc support then we
can't do this.  If we happen to get the right alignment by chance during
palloc_aligned and therefore skip the redirection MemoryChunk, we have no
way to ensure we keep the alignment after a repalloc.
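
A hedged sketch of the case in question, assuming the patch's general shape
(the variable names 'context', 'alloc_size', 'flags' and 'alignto' are
placeholders, not code from the patch):

	/*
	 * Detect the "aligned by chance" case Andres mentions.  Even when this
	 * is true, the redirection MemoryChunk cannot be skipped, because
	 * repalloc() would otherwise have no way to re-establish the alignment.
	 */
	void	   *unaligned = MemoryContextAllocExtended(context, alloc_size, flags);
	bool		aligned_by_chance;

	/* alignto must be a power of 2 for this bit test to be valid */
	aligned_by_chance = (((uintptr_t) unaligned & ((uintptr_t) alignto - 1)) == 0);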

David




Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-07 Thread David Rowley
On Tue, 8 Nov 2022 at 15:17, John Naylor  wrote:
>
>
> On Tue, Nov 8, 2022 at 8:57 AM David Rowley  wrote:
> > Is there anything we could align to CPU cacheline size that would
> > speed something up?
>
> InitCatCache() already has this, which could benefit from simpler notation.

Thanks. I wasn't aware. I'll convert that to use palloc_aligned in the patch.
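
For reference, the InitCatCache() notation John mentions currently
over-allocates and rounds the pointer up by hand; the conversion could look
roughly like this (the palloc0_aligned() signature is assumed from the patch
under discussion, not a committed API):

	/* existing hand-rolled cache-line alignment in InitCatCache() */
	sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
	cp = (CatCache *) CACHELINEALIGN(palloc0(sz));

	/* possible conversion, assuming palloc0_aligned(size, alignto, flags) */
	cp = (CatCache *) palloc0_aligned(sizeof(CatCache), PG_CACHE_LINE_SIZE, 0);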

David




Re: psql: Add command to use extended query protocol

2022-11-07 Thread Pavel Stehule
On Tue, Nov 8, 2022 at 3:47 AM Corey Huinker  wrote:

> On Mon, Nov 7, 2022 at 4:12 PM Tom Lane  wrote:
>
>> Corey Huinker  writes:
>> > I thought about basically reserving the \$[0-9]+ space as bind
>> variables,
>> > but it is possible, though unlikely, that users have been naming their
>> > variables like that.
>>
>> Don't we already reserve that syntax as Params?  Not sure whether there
>> would be any conflicts versus Params, but these are definitely not legal
>> as SQL identifiers.
>>
>> regards, tom lane
>>
>
> I think Pavel was hinting at something like:
>
> \set $1 foo
> \set $2 123
> UPDATE mytable SET value = $1 WHERE id = $2;
>

No, I just proposed a special syntax for using a variable as a bind variable,

like

\set var Ahoj

SELECT $var;

I think there should not be a problem with custom strings, because we are
able to push $x to stored procedures, so it should be safe to use it
elsewhere.

We could also use the syntax @var, which is used by pgAdmin.

Regards

Pavel




> Which wouldn't step on anything, because I tested it, and \set $1 foo
> already returns 'Invalid variable name "$1"'.
>
> So far, there seem to be two possible variations on how to go about this:
>
> 1. Have special variables or a variable namespace that are known to be
> bind variables. So long as one of them is defined, queries are sent using
> extended query protocol.
> 2. Make bind parameters one-time-use, applied strictly to the query currently
> in the buffer in positional order; once that query is run, their
> association as bind parameters is gone.
>
> Each has its merits; I guess it comes down to how much we expect users to
> want to re-use some or all of the bind params of the previous query.
>
>


Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-07 Thread Michael Paquier
On Tue, Nov 01, 2022 at 08:32:48AM +0530, Bharath Rupireddy wrote:
> I'm attaching the v9 patch from upthread here again for further review
> and to make CF bot happy.

So, I have looked at that, and at the end concluded that Andres'
suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
in the long run.  Thomas has mentioned upthread that some of the
comments don't need to be that long, so I have tweaked these to be
minimal, and updated a few more areas.  Note that this has been split
into two commits: one to introduce the new routine in file_utils.c and
a second for the switch in walmethods.c.
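
For context, a hedged reconstruction of what the PGAlignedBlock-based routine
could look like (the name comes from this thread; the committed version may
batch several iovec entries per call and differ in signature, and the usual
headers such as common/file_utils.h and port/pg_iovec.h are assumed):

	ssize_t
	pg_write_zeros(int fd, size_t size)
	{
		static PGAlignedBlock zbuffer;	/* static storage, so all zeros */
		size_t		remaining = size;
		off_t		offset = 0;

		while (remaining > 0)
		{
			struct iovec iov;
			ssize_t		written;

			/* write at most one block's worth of zeros per call */
			iov.iov_base = zbuffer.data;
			iov.iov_len = Min(remaining, BLCKSZ);

			/* pg_pwritev_with_retry() deals with partial writes internally */
			written = pg_pwritev_with_retry(fd, &iov, 1, offset);
			if (written < 0)
				return -1;

			offset += written;
			remaining -= (size_t) written;
		}

		return (ssize_t) offset;
	}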
--
Michael




Re: Tracking last scan time

2022-11-07 Thread Michael Paquier
On Mon, Nov 07, 2022 at 04:54:07PM +0900, Michael Paquier wrote:
> FWIW, all the other areas of pgstatfuncs.c manipulate timestamptz
> fields with a style like the attached.  That's a nit, but still worth it
> for consistency with the surroundings.
> 
> Anyway, it seems to me that a regression test is in order for the state
> before any scan has happened, just after the relation is created, and the
> same problem shows up with last_idx_scan.

Hearing nothing, done this way as of d7744d5.  Thanks for the report,
Robert.  And thanks for the patch, Dave.
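
For reference, the kind of check being added is roughly the following
(assuming the last_seq_scan/last_idx_scan columns this feature adds to
pg_stat_all_tables; the table name is made up):

	-- right after creation, before any scan, both timestamps should be NULL
	CREATE TABLE brand_new_tab (a int PRIMARY KEY);
	SELECT last_seq_scan IS NULL AS seq_is_null,
	       last_idx_scan IS NULL AS idx_is_null
	FROM pg_stat_all_tables
	WHERE relname = 'brand_new_tab';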
--
Michael




RE: Data is copied twice when specifying both child and parent table in publication

2022-11-07 Thread Takamichi Osumi (Fujitsu)
On Monday, October 17, 2022 2:49 PM Wang, Wei/王 威  
wrote:
> Attach the new patch set.
Hi, thank you for posting the new patches.


Here are minor comments on the HEAD_v13-0002.

(1) Suggestion for the document description

+ 
+  If a root partitioned table is published by any subscribed 
publications which
+  set publish_via_partition_root = true, changes on this root 
partitioned table
+  (or on its partitions) will be published using the identity and 
schema of this
+  root partitioned table rather than that of the individual partitions.
+ 
+

I suppose this sentence looks quite similar to the one in the previous 
paragraph and can be adjusted.

IIUC, the main value of the patch is to clarify what happens when
we mix publications with different publish_via_partition_root settings for
one partition hierarchy.
If this is true, how about the sentence below instead of the one above?

"
There can be a case where a subscription combines publications with
different publish_via_partition_root values for the same partition hierarchy
(e.g. subscribing to two publications that publish the root partitioned table
and one of its child tables, respectively).
In this case, the identity and schema of the root partitioned table take
priority.
"

(2) Better documentation alignment

I think we need to wrap publish_via_partition_root in a "literal" tag
in the create_publication.sgml documentation.


Best Regards,
Takamichi Osumi



Re: psql: Add command to use extended query protocol

2022-11-07 Thread David G. Johnston
On Mon, Nov 7, 2022 at 9:02 PM Pavel Stehule 
wrote:

>
>
> On Tue, Nov 8, 2022 at 3:47 AM Corey Huinker  wrote:
>
>> On Mon, Nov 7, 2022 at 4:12 PM Tom Lane  wrote:
>>
>>> Corey Huinker  writes:
>>> > I thought about basically reserving the \$[0-9]+ space as bind
>>> variables,
>>> > but it is possible, though unlikely, that users have been naming their
>>> > variables like that.
>>>
>>> Don't we already reserve that syntax as Params?  Not sure whether there
>>> would be any conflicts versus Params, but these are definitely not legal
>>> as SQL identifiers.
>>>
>>> regards, tom lane
>>>
>>
>> I think Pavel was hinting at something like:
>>
>> \set $1 foo
>> \set $2 123
>> UPDATE mytable SET value = $1 WHERE id = $2;
>>
>
> No, I just proposed a special syntax for using a variable as a bind variable,
>
> like
>
> \set var Ahoj
>
> SELECT $var;
>

Why not extend psql conventions for variable specification?

SELECT :$var$;

Thus:
:var => Ahoj
:'var' => 'Ahoj'
:"var" => "Ahoj"
:$var$ => $n  (n => )

The downside is that it looks like dollar-quoting but isn't actually causing
<$Ahoj$> to be produced.  Instead psql would have to substitute $n at that
location and internally remember that, for this query, $1 is the contents of
var.

I would keep the \gp meta-command to force extended mode regardless of
whether the query itself requires it.

A pset variable to control the default seems reasonable as well.  The
implication would be that if you set that pset variable there is no way to
have individual commands use simple query mode directly.
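
For context, whichever psql-level syntax is chosen, it ultimately has to drive
ordinary extended-protocol parameter binding, which libpq already exposes; a
minimal standalone sketch of that (connection string and table name are
placeholders, and this is not psql's code):

	#include <stdio.h>
	#include <stdlib.h>
	#include <libpq-fe.h>

	int
	main(void)
	{
		PGconn	   *conn = PQconnectdb("dbname=postgres");	/* placeholder DSN */
		const char *params[2] = {"foo", "123"};
		PGresult   *res;

		if (PQstatus(conn) != CONNECTION_OK)
		{
			fprintf(stderr, "%s", PQerrorMessage(conn));
			exit(1);
		}

		/* Parse/Bind/Execute with $1 and $2 supplied as text parameters */
		res = PQexecParams(conn,
						   "UPDATE mytable SET value = $1 WHERE id = $2",
						   2,		/* number of parameters */
						   NULL,	/* let the server infer parameter types */
						   params,
						   NULL,	/* text params: lengths not needed */
						   NULL,	/* all parameters in text format */
						   0);		/* text results */

		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "%s", PQerrorMessage(conn));

		PQclear(res);
		PQfinish(conn);
		return 0;
	}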

David J.


Re: psql: Add command to use extended query protocol

2022-11-07 Thread Pavel Stehule
On Tue, Nov 8, 2022 at 5:21 AM David G. Johnston <david.g.johns...@gmail.com> wrote:

> On Mon, Nov 7, 2022 at 9:02 PM Pavel Stehule 
> wrote:
>
>>
>>
>> On Tue, Nov 8, 2022 at 3:47 AM Corey Huinker  wrote:
>>
>>> On Mon, Nov 7, 2022 at 4:12 PM Tom Lane  wrote:
>>>
 Corey Huinker  writes:
 > I thought about basically reserving the \$[0-9]+ space as bind
 variables,
 > but it is possible, though unlikely, that users have been naming their
 > variables like that.

 Don't we already reserve that syntax as Params?  Not sure whether there
 would be any conflicts versus Params, but these are definitely not legal
 as SQL identifiers.

 regards, tom lane

>>>
>>> I think Pavel was hinting at something like:
>>>
>>> \set $1 foo
>>> \set $2 123
>>> UPDATE mytable SET value = $1 WHERE id = $2;
>>>
>>
>> No, I just proposed a special syntax for using a variable as a bind variable,
>>
>> like
>>
>> \set var Ahoj
>>
>> SELECT $var;
>>
>
> Why not extend psql conventions for variable specification?
>
> SELECT :$var$;
>
> Thus:
> :var => Ahoj
> :'var' => 'Ahoj'
> :"var" => "Ahoj"
> :$var$ => $n  (n => )
>
> The downside is it looks like dollar-quoting but isn't actually causing
> <$Ahoj$> to be produced.  Instead psql would have to substitute $n at that
> location and internally remember that for this query $1 is the contents of
> var.
>
> I would keep the \gp meta-command to force extended mode regardless of
> whether the query itself requires it.
>
> A pset variable to control the default seems reasonable as well.  The
> implication would be that if you set that pset variable there is no way to
> have individual commands use simple query mode directly.
>

:$var$ looks a little bit scary, and there can be a risk of collision with a
custom string separator,

but :$var could be OK?

There is no necessity to show symmetry.






>
> David J.
>


RE: logical replication restrictions

2022-11-07 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Do you have enough time to handle the issue? Our discussion has been suspended
for two months...

If you cannot allocate time to discuss this problem because of other
important tasks or events,
we would like to take over the thread and modify your patch.

We plan to start addressing the comments and reported bugs if you
have not responded by the end of this week.
I look forward to hearing from you.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: allow segment size to be set to < 1GiB

2022-11-07 Thread Bharath Rupireddy
On Tue, Nov 8, 2022 at 8:06 AM Tom Lane  wrote:
>
> Andres Freund  writes:
> > On 2022-11-07 12:52:25 -0500, Tom Lane wrote:
> >> How about instead allowing the segment size to be set in pages?
>
> > In addition or instead of --with-segsize/-Dsegsize?
>
> In addition to.  What I meant by "instead" was to replace
> your proposal of --with-segsize-mb.
>
> > Just offering the number of pages seems like a not great UI.
>
> Well, it's a developer/debug focused API.  I think regular users
> would only care for the existing --with-segsize = so-many-GB API.
> But for testing, I think --with-segsize-pages = so-many-pages
> is actually a pretty good UI.

Perhaps --with-segsize-blocks is a better name here, as we use "block"
rather than "page" in --with-blocksize and --with-wal-blocksize.

If this option is for dev/debug purposes only, do we want to put a
mechanism to disallow it in release builds or something like that,
just in case? Or at least, add a note in the documentation?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [patch] Have psql's \d+ indicate foreign partitions

2022-11-07 Thread Michael Paquier
On Mon, Nov 07, 2022 at 01:43:22AM -0500, Tom Lane wrote:
> WFM.

Okay, applied as bd95816, then.
--
Michael




Re: generic plans and "initial" pruning

2022-11-07 Thread Amit Langote
On Thu, Oct 27, 2022 at 11:41 AM Amit Langote  wrote:
> On Mon, Oct 17, 2022 at 6:29 PM Amit Langote  wrote:
> > On Wed, Oct 12, 2022 at 4:36 PM Amit Langote  
> > wrote:
> > > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote  
> > > wrote:
> > > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas  
> > > > wrote:
> > > > > 0001 adds es_part_prune_result but does not use it, so maybe the
> > > > > introduction of that field should be deferred until it's needed for
> > > > > something.
> > > >
> > > > Oops, looks like a mistake when breaking the patch.  Will move that bit 
> > > > to 0002.
> > >
> > > Fixed that and also noticed that I had defined PartitionPruneResult in
> > > the wrong header (execnodes.h).  That led to PartitionPruneResult
> > > nodes not being able to be written and read, because
> > > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read*
> > > routines for the nodes defined in execnodes.h.  I moved its definition
> > > to plannodes.h, even though it is not actually the planner that
> > > instantiates those; no other include/nodes header sounds better.
> > >
> > > One more thing I realized is that Bitmapsets added to the List
> > > PartitionPruneResult.valid_subplan_offs_list are not actually
> > > read/write-able.  That's a problem that I also faced in [1], so I
> > > proposed a patch there to make Bitmapset a read/write-able Node and
> > > mark (only) the Bitmapsets that are added into read/write-able node
> > > trees with the corresponding NodeTag.  I'm including that patch here
> > > as well (0002) for the main patch to work (pass
> > > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense
> > > to discuss it in its own thread?
> >
> > Had second thoughts on the use of List of Bitmapsets for this, such
> > that the make-Bitmapset-Nodes patch is no longer needed.
> >
> > I had defined PartitionPruneResult such that it stood for the results
> > of pruning for all PartitionPruneInfos contained in
> > PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that
> > can use partition pruning in a given plan).  So, it had a List of
> > Bitmapset.  I think it's perhaps better for PartitionPruneResult to
> > cover only one PartitionPruneInfo and thus need only a Bitmapset and
> > not a List thereof, which I have implemented in the attached updated
> > patch 0002.  So, instead of needing to pass around a
> > PartitionPruneResult with each PlannedStmt, this now passes a List of
> > PartitionPruneResult with an entry for each in
> > PlannedStmt.partPruneInfos.
>
> Rebased over 3b2db22fe.

Updated 0002 to cope with AssertArg() being removed from the tree.
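
For readers following along, the node shape described in the quoted text above
maps to roughly the following (a hedged sketch reconstructed from this message,
not necessarily the committed definition):

	/*
	 * One PartitionPruneResult per PartitionPruneInfo: the executor receives
	 * a List of these, parallel to PlannedStmt.partPruneInfos, each carrying
	 * a single Bitmapset rather than a List of Bitmapsets.
	 */
	typedef struct PartitionPruneResult
	{
		NodeTag		type;

		/* offsets of subplans that survived "initial" pruning */
		Bitmapset  *valid_subplan_offs;
	} PartitionPruneResult;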

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v24-0002-Optimize-AcquireExecutorLocks-by-locking-only-un.patch
Description: Binary data


v24-0001-Move-PartitioPruneInfo-out-of-plan-nodes-into-Pl.patch
Description: Binary data


Re: Use proc instead of MyProc in ProcArrayGroupClearXid()/TransactionGroupUpdateXidStatus()

2022-11-07 Thread Bharath Rupireddy
On Mon, Nov 7, 2022 at 3:17 PM rajesh singarapu  wrote:
>
> Hi,
>
> In both TransactionGroupUpdateXidStatus and ProcArrayGroupClearXid the
> global MyProc is used.  For consistency, I replaced it with a function-local
> variable.

if (nextproc != MyProc)
PGSemaphoreUnlock(nextproc->sem);

The intention of this wake-up code in the two functions is to prevent the
leader process from waking itself up. Only the leader gets to execute
this code; the followers don't hit it at all, as they
return from the first loop in those functions.

All the callers of ProcArrayGroupClearXid() get MyProc as their proc
and pass it down. And using the passed down function parameter proc
makes the function look consistent.

And, in TransactionGroupUpdateXidStatus() proc is initialized with
MyProc and using it instead of MyProc in the wake up loop also makes
the code consistent.
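
Concretely, the change being discussed amounts to something like this
(illustrative sketch only, not a committed diff):

	/* in the wake-up loop, 'proc' is the function-local variable/parameter */
	if (nextproc != proc)		/* instead of: if (nextproc != MyProc) */
		PGSemaphoreUnlock(nextproc->sem);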

While the existing use of MyProc does no harm, +1 for replacing it with
the local variable proc in both functions, for consistency.

Another thing I noticed is an extra assertion in ProcArrayGroupClearXid(),
Assert(TransactionIdIsValid(proc->xid)); the caller already has the same
assertion, so I think we can also remove it.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Use proc instead of MyProc in ProcArrayGroupClearXid()/TransactionGroupUpdateXidStatus()

2022-11-07 Thread Amit Kapila
On Mon, Nov 7, 2022 at 3:17 PM rajesh singarapu  wrote:
>
> In both TransactionGroupUpdateXidStatus and ProcArrayGroupClearXid the
> global MyProc is used.  For consistency, I replaced it with a function-local
> variable.
>

In ProcArrayGroupClearXid(), currently, we always pass MyProc as proc,
so the change you suggest will work, but if in the future someone calls
it with a different proc, then it won't. The change in
TransactionGroupUpdateXidStatus() looks good, but if we don't want to
change ProcArrayGroupClearXid() then I am not sure there is much value
in making the change in TransactionGroupUpdateXidStatus().

-- 
With Regards,
Amit Kapila.




Re: Reviving lost replication slots

2022-11-07 Thread sirisha chamarthi
Hi Amit,

Thanks for your comments!

On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila  wrote:

> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>  wrote:
> >
> > A replication slot can be lost when a subscriber is not able to catch up
> > with the load on the primary and the WAL required to catch up exceeds
> > max_slot_wal_keep_size. When this happens, the target has to be reseeded
> > (pg_dump) from scratch, and this can take a long time. I am investigating
> > the options for reviving a lost slot.
> >
>
> Why does one set max_slot_wal_keep_size in the first place if they
> care about the WAL more than that?

Disk full is a typical case where we can't wait for the logical slots to
catch up before truncating the log.

> If you have a case where you want to
> handle this case for some particular slot (where you are okay with the
> invalidation of other slots exceeding max_slot_wal_keep_size) then the
> other possibility could be to have a similar variable at the slot
> level but not sure if that is a good idea because you haven't
> presented any such case.
>
IIUC, the ability to fetch WAL from the archive as a fallback mechanism should
automatically take care of all the lost slots. Do you see a need to handle a
specific slot? If the idea is not to download the WAL files into the pg_wal
directory, they can be placed in a slot-specific folder
(data/pg_replslot//) until they are needed for decoding, and then be
removed.
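
For reference, slots invalidated this way are the ones the existing
pg_replication_slots view reports with wal_status = 'lost':

	SELECT slot_name, slot_type, wal_status, safe_wal_size
	FROM pg_replication_slots
	WHERE wal_status = 'lost';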


>
> --
> With Regards,
> Amit Kapila.
>


Re: [patch] Have psql's \d+ indicate foreign partitions

2022-11-07 Thread Ian Lawrence Barwick
2022年11月8日(火) 14:49 Michael Paquier :
>
> On Mon, Nov 07, 2022 at 01:43:22AM -0500, Tom Lane wrote:
> > WFM.
>
> Okay, applied as bd95816, then.

Thanks!

CF entry updated accordingly.

Regards

Ian Barwick




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-11-07 Thread Pavel Luzanov



On 08.11.2022 04:31, David Rowley wrote:

I've been playing around with the attached patch which does:

1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path
when we can add an Incremental Sort path instead. This removes quite a
few redundant lines of code.
2. Removes the * 1.5 fuzz-factor in cost_incremental_sort()
3. Does various other code tidy stuff in cost_incremental_sort().
4. Removes the test from incremental_sort.sql that was ensuring the
inferior Sort -> Sort plan was being used instead of the superior Sort
-> Incremental Sort plan.


I can confirm that with this patch, the plan with incremental sorting 
beats the others.


Here are the test results with my previous example.

Script:

create table t (a text, b text, c text);
insert into t (a,b,c) select x,y,x from generate_series(1,100) as x, 
generate_series(1,1) y;

create index on t (a);
vacuum analyze t;
reset all;

explain (settings, analyze)
select a, array_agg(c order by c) from t group by a;

\echo set enable_incremental_sort=off;
set enable_incremental_sort=off;

explain (settings, analyze)
select a, array_agg(c order by c) from t group by a;

\echo set enable_seqscan=off;
set enable_seqscan=off;

explain (settings, analyze)
select a, array_agg(c order by c) from t group by a;

Script output:

CREATE TABLE
INSERT 0 100
CREATE INDEX
VACUUM
RESET
QUERY PLAN
-
 GroupAggregate  (cost=957.60..113221.24 rows=100 width=34) (actual 
time=6.088..381.777 rows=100 loops=1)

   Group Key: a
   ->  Incremental Sort  (cost=957.60..108219.99 rows=100 width=4) 
(actual time=2.387..272.332 rows=100 loops=1)

 Sort Key: a, c
 Presorted Key: a
 Full-sort Groups: 100  Sort Method: quicksort  Average Memory: 
27kB  Peak Memory: 27kB
 Pre-sorted Groups: 100  Sort Method: quicksort  Average 
Memory: 697kB  Peak Memory: 697kB
 ->  Index Scan using t_a_idx on t (cost=0.42..29279.42 
rows=100 width=4) (actual time=0.024..128.083 rows=100 loops=1)

 Planning Time: 0.070 ms
 Execution Time: 381.815 ms
(10 rows)

set enable_incremental_sort=off;
SET
   QUERY PLAN

 GroupAggregate  (cost=128728.34..136229.59 rows=100 width=34) (actual 
time=234.044..495.537 rows=100 loops=1)

   Group Key: a
   ->  Sort  (cost=128728.34..131228.34 rows=100 width=4) (actual 
time=231.172..383.393 rows=100 loops=1)

 Sort Key: a, c
 Sort Method: external merge  Disk: 15600kB
 ->  Seq Scan on t  (cost=0.00..15396.00 rows=100 width=4) 
(actual time=0.005..78.189 rows=100 loops=1)

 Settings: enable_incremental_sort = 'off'
 Planning Time: 0.041 ms
 Execution Time: 497.230 ms
(9 rows)

set enable_seqscan=off;
SET
QUERY PLAN
-
 GroupAggregate  (cost=142611.77..150113.02 rows=100 width=34) (actual 
time=262.250..527.260 rows=100 loops=1)

   Group Key: a
   ->  Sort  (cost=142611.77..145111.77 rows=100 width=4) (actual 
time=259.551..417.154 rows=100 loops=1)

 Sort Key: a, c
 Sort Method: external merge  Disk: 15560kB
 ->  Index Scan using t_a_idx on t (cost=0.42..29279.42 
rows=100 width=4) (actual time=0.012..121.995 rows=100 loops=1)

 Settings: enable_incremental_sort = 'off', enable_seqscan = 'off'
 Planning Time: 0.041 ms
 Execution Time: 528.950 ms
(9 rows)

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company





Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-11-07 Thread Richard Guo
On Tue, Nov 8, 2022 at 9:31 AM David Rowley  wrote:

> I've been playing around with the attached patch which does:
>
> 1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path
> when we can add an Incremental Sort path instead. This removes quite a
> few redundant lines of code.


For unsorted paths, the original logic here is to explicitly add a Sort
path only for the cheapest-total path.  This patch changes that and may
add a Sort path for paths other than the cheapest-total path.  I
think this may introduce some unnecessary path candidates.

I think it's good that this patch removes redundant code.  ISTM we can
do the same when we finalize the partially aggregated paths from
partially_grouped_rel.

Thanks
Richard

