RE: [Proposal] Add foreign-server health checks infrastructure

2021-11-23 Thread kuroda.hay...@fujitsu.com
Dear Kato-san,

I found the missing space, and I added a test.
I'm very happy if you review this.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v03_0001_add_checking_infrastracture.patch
Description: v03_0001_add_checking_infrastracture.patch


Re: Post-CVE Wishlist

2021-11-23 Thread Peter Eisentraut

On 23.11.21 23:41, Heikki Linnakangas wrote:

On 23/11/2021 23:44, Robert Haas wrote:

On Tue, Nov 23, 2021 at 2:18 PM Tom Lane  wrote:

Jacob Champion  writes:

= Implicit TLS =


Aside from security, one small benefit of skipping the STARTTLS-style 
negotiation is that you avoid one round-trip to the server.


Also, you could make use of existing TLS-aware proxy infrastructure 
without having to hack in PostgreSQL protocol support.  There is 
definitely demand for that.






Re: Mop-up from Test::More version change patch

2021-11-23 Thread Peter Eisentraut

On 23.11.21 18:03, Tom Lane wrote:

0002 is written to apply to v14 and earlier, and what it wants
to do is back-patch the effects of 405f32fc4, so that the
minimum Test::More version is 0.98 in all branches.  The thought
here is that (1) somebody is likely to want to back-patch a
test involving Test::More::subtest before too long; (2) we have
zero coverage for older Test::More versions anyway, now that
all buildfarm members have been updated to work with HEAD.


The backpatching argument is a little off-target, I think.  The purpose 
of subtests is so that wrappers like command_fails_like() count only as 
one test on the outside, instead of however many checks it runs 
internally.  There is no use in using the subtest feature in a top-level 
test one might add, say in response to a bugfix.  So the only reason 
this might become relevant is if someone were to backpatch new test 
infrastructure, which seems rare and in most cases would probably 
require additional portability and backpatching care.  And even then, 
you still need to adjust the test count at the top of the file 
individually in each branch, because the total number of tests will 
probably be different.


In my mind, the subtests feature is useful during new development, where 
you write a bunch of new tests and want to set the total test count by 
eyeballing what you just wrote instead of having to run test cycles and 
react to the complaints.  But it won't be useful for patching tests into 
existing files.


In summary, I would leave it alone.




Re: Python 3.11 vs. Postgres

2021-11-23 Thread Peter Eisentraut

On 24.11.21 04:07, Tom Lane wrote:

According to [1], we need to stop including Python's <eval.h>.
I've not checked whether this creates any backwards-compatibility
issues.

regards, tom lane

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2023272


See attached patch.  The minimum Python version for this change is 2.4,
which is the oldest version supported by PG10, so we can backpatch this
to all live branches.

From 069c0ee626aa7cabe3098feec97dd161f8c81376 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 10 Nov 2021 09:48:18 +0100
Subject: [PATCH] Remove unneeded Python includes

Including <compile.h> and <eval.h> has not been necessary since Python
2.4, since they are included via <Python.h>.  Moreover, <eval.h> is
being removed in Python 3.11.  So remove these includes.
---
 src/pl/plpython/plpython.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/pl/plpython/plpython.h b/src/pl/plpython/plpython.h
index 994457b37d..ae1275afe2 100644
--- a/src/pl/plpython/plpython.h
+++ b/src/pl/plpython/plpython.h
@@ -94,9 +94,6 @@
 #undef TEXTDOMAIN
 #define TEXTDOMAIN PG_TEXTDOMAIN("plpython")
 
-#include <compile.h>
-#include <eval.h>
-
 /* put back our *printf macros ... this must match src/include/port.h */
 #ifdef vsnprintf
 #undef vsnprintf
-- 
2.33.1



Re: Sequence's value can be rollback after a crashed recovery.

2021-11-23 Thread Laurenz Albe
On Tue, 2021-11-23 at 16:41 -0500, Tom Lane wrote:
> I wrote:
> > I wonder though if we shouldn't try to improve the existing text.
> > The phrasing "never rolled back" seems like it's too easily
> > misinterpreted.  Maybe rewrite the  block like
> > ...
> 
> A bit of polishing later, maybe like the attached.

That looks good to me.

Yours,
Laurenz Albe





[BUG]Missing REPLICA IDENTITY check when DROP NOT NULL

2021-11-23 Thread tanghy.f...@fujitsu.com
Hi, 

I think I found a problem related to replica identity. According to PG doc at 
[1], replica identity includes only columns marked NOT NULL. 
But in fact users can accidentally break this rule as follows:

create table tbl (a int not null unique);
alter table tbl replica identity using INDEX tbl_a_key;
alter table tbl alter column a drop not null;
insert into tbl values (null);

As a result, some operations on the newly added null value will cause unexpected
failures, as below:

postgres=# delete from tbl;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

In the log, I can also see an assertion failure when deleting null value:
TRAP: FailedAssertion("!nulls[i]", File: "heapam.c", Line: 8439, PID: 274656)

To solve the above problem, I think it's better to add a check when executing
ALTER COLUMN DROP NOT NULL, and report an error if the column is part of the
replica identity.
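For illustration, a minimal sketch of such a check, assuming it would live in
ATExecDropNotNull() in tablecmds.c (this is a sketch only, not the attached
patch; rel, attnum and colName are assumed to be in scope as in that function):

    List       *indexoidlist = RelationGetIndexList(rel);
    ListCell   *indexoidscan;

    foreach(indexoidscan, indexoidlist)
    {
        Oid         indexoid = lfirst_oid(indexoidscan);
        HeapTuple   indexTuple;
        Form_pg_index indexStruct;

        indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
        if (!HeapTupleIsValid(indexTuple))
            elog(ERROR, "cache lookup failed for index %u", indexoid);
        indexStruct = (Form_pg_index) GETSTRUCT(indexTuple);

        /* Is this index the replica identity, and does it use the column? */
        if (indexStruct->indisreplident)
        {
            for (int i = 0; i < indexStruct->indnkeyatts; i++)
            {
                if (indexStruct->indkey.values[i] == attnum)
                    ereport(ERROR,
                            (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
                             errmsg("column \"%s\" is in index used as replica identity",
                                    colName)));
            }
        }
        ReleaseSysCache(indexTuple);
    }
    list_free(indexoidlist);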

Attaching a patch that disallows DROP NOT NULL on a column if it's in a REPLICA
IDENTITY index. Also added a test for it.
Thanks Hou for helping me write/review this patch.

By the way, replica identity was introduced in PG9.4, so this problem exists in
all supported versions.

[1] https://www.postgresql.org/docs/current/sql-altertable.html

Regards,
Tang


0001-Disallow-DROP-NOT-NULL-for-column-in-the-REPLICA-IDE.patch
Description:  0001-Disallow-DROP-NOT-NULL-for-column-in-the-REPLICA-IDE.patch


Re: parallel vacuum comments

2021-11-23 Thread Masahiko Sawada
On Wed, Nov 24, 2021 at 1:34 PM Amit Kapila  wrote:
>
> On Wed, Nov 24, 2021 at 7:43 AM Masahiko Sawada  wrote:
> >
> > On Mon, Nov 22, 2021 at 1:48 PM Amit Kapila  wrote:
> > >
> > > On Fri, Nov 19, 2021 at 7:55 AM houzj.f...@fujitsu.com
> > >  wrote:
> > >
> > > >
> > > > 4)
> > > >
> > > > Just a personal suggestion for the parallel related function name. 
> > > > Since Andres
> > > > wanted a uniform naming pattern. Maybe we can rename the following 
> > > > functions:
> > > >
> > > > end|begin_parallel_vacuum => parallel_vacuum_end|begin
> > > > perform_parallel_index_bulkdel|cleanup => 
> > > > parallel_vacuum_index_bulkdel|cleanup
> > > >
> > > > So that all the parallel related functions' name is like 
> > > > parallel_vacuum_xxx.
> > > >
> > >
> > > BTW, do we really need functions
> > > perform_parallel_index_bulkdel|cleanup? Both do some minimal
> > > assignments and then call parallel_vacuum_all_indexes() and there is
> > > just one caller of each. Isn't it better to just do those assignments
> > > in the caller and directly call parallel_vacuum_all_indexes()?
> >
> > The reasons why I declared these two functions are: (1) the fields of
> > ParallelVacuumState are not exposed and (2) bulk-deletion and cleanup
> > require different arguments (estimated_count is required only by
> > cleanup).  So if we expose the fields of ParallelVacuumState, the
> > caller can do those assignments and directly call
> > parallel_vacuum_all_indexes(). But I'm not sure it's good if those
> > assignments are the caller's responsibility.
> >
>
> Okay, that makes sense. However, I am still not very comfortable with
> the function naming suggested by Hou-San, do you have any thoughts on
> that?

I personally don't disagree with the names starting with
"parallel_vacuum_*". Alternative ideas would be names starting with
"vac_*" like other functions declared in vacuum.h, or to distinguish
from them names starting with "pvac_*".

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: pg_get_publication_tables() output duplicate relid

2021-11-23 Thread Amit Langote
On Tue, Nov 23, 2021 at 12:21 PM Amit Kapila  wrote:
> Isn't it better to document this case and explain what
> users can expect instead of trying to design a solution around this?

I thought about the problems you've described and it looks like I
hadn't considered many of the details which complicate implementing a
solution for this.

So yeah, documenting the ATTACH issue as a limitation sounds like the
best course for now.  I might word it as follows and add it under
Notes at https://www.postgresql.org/docs/current/sql-createpublication.html:

"ATTACHing a table into a partition tree whose root is published using
a publication with publish_via_partition_root set to true does not
result in the table's existing contents being replicated."

I'm not sure there's a clean enough workaround to this that we can add
to the paragraph.

Does that make sense?

> Even if we do so the streaming after attach partition problem as
> discussed above should be fixed.

I agree.  I have reproduced the problem though haven't managed to pin
down the cause yet.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: parallel vacuum comments

2021-11-23 Thread Masahiko Sawada
On Wed, Nov 24, 2021 at 1:28 PM Amit Kapila  wrote:
>
> On Wed, Nov 24, 2021 at 7:07 AM Masahiko Sawada  wrote:
> >
> > On Mon, Nov 22, 2021 at 6:35 PM Amit Kapila  wrote:
> > >
> > > On Tue, Nov 16, 2021 at 11:23 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > I've incorporated these comments and attached an updated patch.
> > > >
> > >
> > > Review comments:
> > > 
> > > 1.
> > > index_can_participate_parallel_vacuum()
> > > {
> > > ..
> > > + /*
> > > + * Not safe, if the index supports parallel cleanup conditionally,
> > > + * but we have already processed the index (for bulkdelete).  See the
> > > + * comments for option VACUUM_OPTION_PARALLEL_COND_CLEANUP to know
> > > + * when indexes support parallel cleanup conditionally.
> > > + */
> > > + if (num_index_scans > 0 &&
> > > + ((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
> > > ..
> > > }
> > >
> > > IIRC, we do this to avoid the need to invoke worker when parallel
> > > cleanup doesn't need to scan the index which means the work required
> > > to be performed by a worker would be minimal. If so, maybe we can
> > > write that in comments here or with
> > > VACUUM_OPTION_PARALLEL_COND_CLEANUP.
> >
> > Right. Will add the comments.
> >
> > > If the above understanding is correct then is it correct to check
> > > num_index_scans here? AFAICS, this value is incremented in
> > > parallel_vacuum_all_indexes irrespective of whether it is invoked for
> > > bulk delete or clean up. OTOH, previously, this was done based on
> > > first_time variable which was in turn set based on
> > > vacrel->num_index_scans and that is incremented in
> > > lazy_vacuum_all_indexes(both in serial and parallel case).
> >
> > You're right. That's wrong to increment num_index_scans also when
> > vacuumcleanup. It should be incremented only when bulkdelete. Perhaps,
> > the caller (i.e., the table AM) should pass num_index_scans to parallel
> > vacuum code? I initially thought that ParallelVacuumState can have
> > num_index_scans and increment it only when parallel bulk-deletion. But
> > if we do that we will end up having the same thing in two places:
> > ParallelVacuumState and LVRelState. It would be clearer if we maintain
> > num_index_scan in LVRelState and pass it to parallel index vacuum when
> > calling to parallel index bulk-deletion or cleanup.
> >
>
> That sounds reasonable.
>
> > On the other hand,
> > the downside would be that there is a possibility that a table AM
> > passes the wrong num_index_scans.
> >
>
> If that happens then also there will be no problem as such because it
> will do some work via worker where it would have been done by the
> leader itself. I think it is better to have one source of information
> for this as we need to mainly consider whether bulkdelete has been
> already performed or not, it doesn't matter whether that is performed
> by leader or worker.

Agreed.

>
> >
> > >
> > > 2. The structure ParallelVacuumState contains both PVIndVacStatus and
> > > PVIndStats. Considering PVIndVacStatus is already present in
> > > PVIndStats, does ParallelVacuumState need to have both?
> >
> > "PVIndVacStatus status” of ParallelVacuumState is used by the worker
> > in the error callback function,
> > parallel_index_vacuum_error_callback(), in order to know the status of
> > the index vacuum that the worker is working on. I think that without
> > PVIndVacStatus, the worker needs to have the index of the PVIndStats
> > array in order to get the status by like
> > errinfo->indstats[idx]->status. Do you prefer to do that?
> >
>
> As mentioned in my another email to which you agreed that we need to
> re-design this callback and do it separately, I think it is better to
> consider it separately. So, we can probably remove this parameter from
> the main patch as of now.

Yes, I'll remove it from the next version patch. With that, since
parallel vacuum workers don't set errcontext we will need to do
something for that in a separate patch.

>
> > > 3. Why ParallelVacuumCtl is declared in vacuum.h? It appears to be
> > > only used in one function begin_parallel_vacuum, can't we just declare
> > > in vacuumparallel.c?
> >
> > ParallelVacuumCtl is a struct to begin the parallel vacuum and
> > therefore is expected to be passed by table AM. If we declare it in
> > vacuumparallel.c a table AM (e.g., vacuumlazy.c) cannot use it, no?
> >
> > > As it is only required for one function and it is
> > > not that the number of individual parameters will be too huge, can't
> > > we do without having that structure.
> >
> > Yes, we can do that without having that structure. I was a bit
> > concerned that there are already 7 arguments.
> >
>
> Yeah, it is better to have fewer arguments but I don't think this number
> is big enough to worry about. Also, I am not sure about the table AM point
> as there is no clear example in front of us which tells how any other
> table AM might use it and whether this structure is generic enough. So
> I think it might be better to use arguments for this and if we later
> find some generic use then we can replace it with structure.

Re: Synchronizing slots from primary to standby

2021-11-23 Thread Masahiko Sawada
On Sun, Oct 31, 2021 at 7:08 PM Peter Eisentraut
 wrote:
>
> I want to reactivate $subject.  I took Petr Jelinek's patch from [0],
> rebased it, added a bit of testing.  It basically works, but as
> mentioned in [0], there are various issues to work out.

Thank you for working on this feature!

>
> The idea is that the standby runs a background worker to periodically
> fetch replication slot information from the primary.  On failover, a
> logical subscriber would then ideally find up-to-date replication slots
> on the new publisher and can just continue normally.
>
>   I have also found that this breaks some seemingly unrelated tests in
> the recovery test suite.  I have disabled these here.  I'm not sure if
> the patch actually breaks anything or if these are just differences in
> timing or implementation dependencies.

I haven't looked at the patch deeply, but regarding 007_sync_rep.pl,
the tests seem to fail since they rely on the order of the walsender
array in shared memory. Since the background worker for synchronizing
replication slots periodically connects to the walsender on the primary
and disconnects, it breaks the assumption about that order.
Regarding 010_logical_decoding_timelines.pl, I guess that the patch
breaks the test because the background worker for synchronizing slots
on the replica periodically advances the replica's slot. I think we
need to have a way to disable the slot synchronization or to specify
the slot name to sync with the primary. I'm not sure we already
discussed this topic but I think we need it at least for testing
purposes.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Rename dead_tuples to dead_items in vacuumlazy.c

2021-11-23 Thread Dilip Kumar
On Wed, Nov 24, 2021 at 11:16 AM Peter Geoghegan  wrote:
>
> Attached patch performs polishing within vacuumlazy.c, as follow-up
> work to the refactoring work in Postgres 14. This mainly consists of
> changing references of dead tuples to dead items, which reflects the
> fact that VACUUM no longer deals with TIDs that might point to
> remaining heap tuples with storage -- the TIDs in the array must now
> strictly point to LP_DEAD stub line pointers that remain in the heap,
> following pruning.
>
> I've also simplified header comments, and comments above the main
> entry point functions. These comments made much more sense back when
> lazy_scan_heap() was simpler, but wasn't yet broken up into smaller,
> better-scoped functions.
>
> If there are no objections, I'll move on this soon. It's mostly just
> mechanical changes.

-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_ITEMS 5
+#define PROGRESS_VACUUM_NUM_DEAD_ITEMS 6

Wouldn't it be more logical to change these to DEAD_TIDS instead of DEAD_ITEMS?

+ /* Sorted list of TIDs to delete from indexes */
+ ItemPointerData dead[FLEXIBLE_ARRAY_MEMBER];

Instead of just dead, why not deadtid or deaditem?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: row filtering for logical replication

2021-11-23 Thread Amit Kapila
On Wed, Nov 24, 2021 at 6:51 AM houzj.f...@fujitsu.com
 wrote:
>
> On Tues, Nov 23, 2021 6:16 PM Amit Kapila  wrote:
> > On Tue, Nov 23, 2021 at 1:29 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Tues, Nov 23, 2021 2:27 PM vignesh C  wrote:
> > > > On Thu, Nov 18, 2021 at 7:04 AM Peter Smith 
> > > > wrote:
> > > > >
> > > > > PSA new set of v40* patches.
> > > >
> > > > Few comments:
> > > > 1) When a table is added to the publication, replica identity is 
> > > > checked. But
> > > > while modifying the publish action to include delete/update, replica 
> > > > identity is
> > > > not checked for the existing tables. I felt it should be checked for 
> > > > the existing
> > > > tables too.
> > >
> > > In addition to this, I think we might also need some check to prevent 
> > > user from
> > > changing the REPLICA IDENTITY index which is used in the filter 
> > > expression.
> > >
> > > I was thinking whether it is possible to do the check related to REPLICA
> > > IDENTITY in CheckCmdReplicaIdentity() or in GetRelationPublicationActions().
> > > If we
> > > move the REPLICA IDENTITY check to this function, it would be consistent 
> > > with
> > > the existing behavior about the check related to REPLICA IDENTITY(see the
> > > comments in CheckCmdReplicaIdentity) and seems can cover all the cases
> > > mentioned above.
> > >
> >
> > Yeah, adding the replica identity check in CheckCmdReplicaIdentity()
> > would cover all the above cases but I think that would put a premium
> > on each update/delete operation. I think traversing the expression
> > tree (it could be multiple traversals if the relation is part of
> > multiple publications) during each update/delete would be costly.
> > Don't you think so?
>
> Yes, I agreed that traversing the expression every time would be costly.
>
> I thought maybe we can cache the columns used in the row filter, or cache
> only a flag (can_update|delete) in the relcache. I think every operation
> that affects the row filter or replica identity will invalidate the
> relcache, and the cost of the check seems acceptable with the cache.
>

I think if we can cache this information especially as a bool flag
then that should probably be better.

-- 
With Regards,
Amit Kapila.




Rename dead_tuples to dead_items in vacuumlazy.c

2021-11-23 Thread Peter Geoghegan
Attached patch performs polishing within vacuumlazy.c, as follow-up
work to the refactoring work in Postgres 14. This mainly consists of
changing references of dead tuples to dead items, which reflects the
fact that VACUUM no longer deals with TIDs that might point to
remaining heap tuples with storage -- the TIDs in the array must now
strictly point to LP_DEAD stub line pointers that remain in the heap,
following pruning.

I've also simplified header comments, and comments above the main
entry point functions. These comments made much more sense back when
lazy_scan_heap() was simpler, but wasn't yet broken up into smaller,
better-scoped functions.

If there are no objections, I'll move on this soon. It's mostly just
mechanical changes.

-- 
Peter Geoghegan


v1-0001-vacuumlazy.c-Rename-dead_tuples-to-dead_items.patch
Description: Binary data


Re: Teach pg_receivewal to use lz4 compression

2021-11-23 Thread Michael Paquier
On Mon, Nov 22, 2021 at 09:02:47AM -0500, Robert Haas wrote:
> On Mon, Nov 22, 2021 at 12:46 AM Jeevan Ladhe
>  wrote:
>> Fair enough. But, still I have a doubt in mind what benefit would that
>> really bring to us here, because we are immediately also freeing the
>> lz4buf without using it anywhere.
> 
> Yeah, I'm also doubtful about that. If we're freeing the compression
> context, we shouldn't need to guarantee that it's in any particular
> state before doing so. Why would any critical cleanup be part of
> LZ4F_compressEnd() rather than LZ4F_freeCompressionContext()? The
> point of LZ4F_compressEnd() is to make sure all of the output bytes
> get written, and it would be stupid to force people to write the
> output bytes even when they've decided that they no longer care about
> them due to some error.

Hmm.  I have double-checked all that, and I agree that we could just
skip LZ4F_compressEnd() in this error code path.  From what I can see
in the upstream code, what we have now is not broken either, but the
compressEnd() call does some work that's not needed here.
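
To make that concrete, the error path under discussion could simply look
like this (a sketch only; variable names are assumed, not taken from the
actual patch):

    /*
     * On a write failure, just release the LZ4 resources.  Per the above,
     * LZ4F_compressEnd() only flushes the remaining output bytes and is
     * not required before freeing the context.
     */
    (void) LZ4F_freeCompressionContext(ctx);
    pg_free(lz4buf);
    return false;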
--
Michael


signature.asc
Description: PGP signature


Re: Post-CVE Wishlist

2021-11-23 Thread Michael Paquier
On Tue, Nov 23, 2021 at 02:18:30PM -0500, Tom Lane wrote:
> Jacob Champion  writes:
>> = Client-Side Auth Selection = 
>> The second request is for the client to stop fully trusting the server
>> during the authentication phase. If I tell libpq to use a client
>> certificate, for example, I don't think the server should be allowed to
>> extract a plaintext password from my environment (at least not without
>> my explicit opt-in).
> 
> Yeah.  I don't recall whether it's been discussed in public or not,
> but it certainly seems like libpq should be able to be configured so
> that (for example) it will never send a cleartext password.  It's not
> clear to me what extent of configurability would be useful, and I
> don't want to overdesign it --- but that much at least would be a
> good thing.

I recall this part being discussed in public, but I cannot put my
finger on the exact thread.  I think that this was around when we
discussed the open items of 10 or 11 for things around channel binding
and how libpq was sensitive to downgrade attacks, which would mean
around 2016 or 2017.  I also recall reading (writing?) a patch that
introduced a new connection parameter that takes in input a
comma-separated list of keywords to allow the user to choose a set of
auth methods accepted, failing if the server is willing to use a
method that does not match with what the user has put in his list.
Perhaps this last part has never reached -hackers though :)

Anyway, the closest thing I can put my finger on now is that:
https://www.postgresql.org/message-id/c5cb08f4cce46ff661ad287fadaa1...@postgrespro.ru
--
Michael


signature.asc
Description: PGP signature


RE: [Proposal] Add foreign-server health checks infrastructure

2021-11-23 Thread kuroda.hay...@fujitsu.com
Dear Kato-san,

Thank you for reviewing!

> Thank you for sending the patches!
> I confirmed that they can be compiled and tested successfully on CentOS
> 8.

Thanks!

> + {
> +     {"remote_servers_connection_check_interval", PGC_USERSET, CONN_AUTH_SETTINGS,
> +         gettext_noop("Sets the time interval between checks for disconnection of remote servers."),
> +         NULL,
> +         GUC_UNIT_MS
> +     },
> +     &remote_servers_connection_check_interval,
> +     0, 0, INT_MAX,
> + },
> 
> If you don't use check_hook, assign_hook and show_hook, you should
> explicitly write "NULL, NULL, NULL", as show below.

Yeah I forgot the line. Fixed.
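
For reference, with no hooks the entry would then end with three explicit
NULLs, along these lines (a sketch following the guc.c conventions):

    {
        {"remote_servers_connection_check_interval", PGC_USERSET, CONN_AUTH_SETTINGS,
            gettext_noop("Sets the time interval between checks for disconnection of remote servers."),
            NULL,
            GUC_UNIT_MS
        },
        &remote_servers_connection_check_interval,
        0, 0, INT_MAX,
        NULL, NULL, NULL
    },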

> + ereport(ERROR,
> +         errcode(ERRCODE_CONNECTION_FAILURE),
> +         errmsg("Postgres foreign server %s might be down.",
> +                server->servername));
> 
> According to [1], error messages should start with a lowercase letter
> and not use a period.
> Also, along with the rest of the code, it is a good idea to enclose the
> server name in double quotes.

I checked the Postgres error-reporting policy and fixed the messages to
follow it. What do you think?
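
For example, following that policy the message would look something like
this (a sketch, not necessarily the exact wording in the attached patch):

    ereport(ERROR,
            errcode(ERRCODE_CONNECTION_FAILURE),
            errmsg("foreign server \"%s\" might be down",
                   server->servername));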

Attached are the latest versions.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v02_add_checking_infrastracture.patch
Description: v02_add_checking_infrastracture.patch


v02_add_helth_check_for_postgres_fdw.patch
Description: v02_add_helth_check_for_postgres_fdw.patch


Re: Mop-up from Test::More version change patch

2021-11-23 Thread Noah Misch
On Tue, Nov 23, 2021 at 12:03:05PM -0500, Tom Lane wrote:
> Attached are a couple of patches I propose in the wake of commit
> 405f32fc4 (Require version 0.98 of Test::More for TAP tests).
> 
> 0001 responds to the failure we saw on buildfarm member wrasse [1]
> where, despite configure having carefully checked for Test::More
> being >= 0.98, actual tests failed with
> Test::More version 0.98 required--this is only version 0.92 at 
> /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/test/perl/PostgreSQL/Test/Utils.pm
>  line 63.
> The reason is that wrasse was choosing "prove" from a different
> Perl installation than "perl", as a result of its configuration
> having set PERL to a nondefault place but doing nothing about PROVE.
> 
> We already installed a couple of mitigations for that:
> (a) as of c4fe3199a, configure checks "prove" not "perl" for
> appropriate module versions;
> (b) Noah has modified wrasse's configuration to set PROVE.
> But I'm of the opinion that (b) should not be necessary.
> If you set PERL then it's highly likely that you want to use
> "prove" from the same installation.

My regular development system is a counterexample.  It uses system Perl, but
it has a newer "prove" from CPAN:

$ grep -E '(PERL|PROVE)' config.status
S["PROVE"]="/home/nm/sw/cpan/bin/prove"
S["PERL"]="/usr/bin/perl"

The patch sends it back to using the system "prove":

S["PROVE"]="/usr/bin/prove"
S["PERL"]="/usr/bin/perl"

I could, of course, override that.  But with so little evidence about systems
helped by the proposed change, I'm now -1.0 on it.

> 0002 is written to apply to v14 and earlier, and what it wants
> to do is back-patch the effects of 405f32fc4, so that the
> minimum Test::More version is 0.98 in all branches.  The thought
> here is that (1) somebody is likely to want to back-patch a
> test involving Test::More::subtest before too long; (2) we have
> zero coverage for older Test::More versions anyway, now that
> all buildfarm members have been updated to work with HEAD.

wrasse v10..v14 is testing older Test::More, so coverage persists.  However, I
am okay with this change.




Re: parallel vacuum comments

2021-11-23 Thread Amit Kapila
On Wed, Nov 24, 2021 at 7:43 AM Masahiko Sawada  wrote:
>
> On Mon, Nov 22, 2021 at 1:48 PM Amit Kapila  wrote:
> >
> > On Fri, Nov 19, 2021 at 7:55 AM houzj.f...@fujitsu.com
> >  wrote:
> >
> > >
> > > 4)
> > >
> > > Just a personal suggestion for the parallel related function name. Since 
> > > Andres
> > > wanted a uniform naming pattern. Maybe we can rename the following 
> > > functions:
> > >
> > > end|begin_parallel_vacuum => parallel_vacuum_end|begin
> > > perform_parallel_index_bulkdel|cleanup => 
> > > parallel_vacuum_index_bulkdel|cleanup
> > >
> > > So that all the parallel related functions' name is like 
> > > parallel_vacuum_xxx.
> > >
> >
> > BTW, do we really need functions
> > perform_parallel_index_bulkdel|cleanup? Both do some minimal
> > assignments and then call parallel_vacuum_all_indexes() and there is
> > just one caller of each. Isn't it better to just do those assignments
> > in the caller and directly call parallel_vacuum_all_indexes()?
>
> The reasons why I declared these two functions are: (1) the fields of
> ParallelVacuumState are not exposed and (2) bulk-deletion and cleanup
> require different arguments (estimated_count is required only by
> cleanup).  So if we expose the fields of ParallelVacuumState, the
> caller can do those assignments and directly call
> parallel_vacuum_all_indexes(). But I'm not sure it's good if those
> assignments are the caller's responsibility.
>

Okay, that makes sense. However, I am still not very comfortable with
the function naming suggested by Hou-San, do you have any thoughts on
that?

-- 
With Regards,
Amit Kapila.




RE: Implementing Incremental View Maintenance

2021-11-23 Thread r.takahash...@fujitsu.com
Hi Nagata-san,


> Ok. I'll fix _copyIntoClause() and _equalIntoClause() as well as 
> _readIntoClause()
> and _outIntoClause().

OK.

> > ivm=# create table t (c1 int, c2 int);
> > CREATE TABLE
> > ivm=# create incremental materialized view ivm_t as select distinct c1 from 
> > t;
> > NOTICE:  created index "ivm_t_index" on materialized view "ivm_t"
> > SELECT 0
> >
> > Then I executed pg_dump.
> >
> > In the dump, the following SQLs appear.
> >
> > CREATE INCREMENTAL MATERIALIZED VIEW public.ivm_t AS
> >  SELECT DISTINCT t.c1
> >FROM public.t
> >   WITH NO DATA;
> >
> > ALTER TABLE ONLY public.ivm_t
> > ADD CONSTRAINT ivm_t_index UNIQUE (c1);
> >
> > If I execute psql with the result of pg_dump, following error occurs.
> >
> > ERROR:  ALTER action ADD CONSTRAINT cannot be performed on relation
> "ivm_t"
> > DETAIL:  This operation is not supported for materialized views.
> 
> Good catch! It was my mistake creating unique constraints on IMMV even
> though they cannot be defined via SQL. I'll fix it to use unique indexes
> instead of constraints.

I checked the same procedure on the v24 patch,
but the following error occurs instead of the original one.

ERROR:  relation "ivm_t_index" already exists

Regards,
Ryohei Takahashi




Re: parallel vacuum comments

2021-11-23 Thread Amit Kapila
On Wed, Nov 24, 2021 at 7:07 AM Masahiko Sawada  wrote:
>
> On Mon, Nov 22, 2021 at 6:35 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 16, 2021 at 11:23 AM Masahiko Sawada  
> > wrote:
> > >
> > > I've incorporated these comments and attached an updated patch.
> > >
> >
> > Review comments:
> > 
> > 1.
> > index_can_participate_parallel_vacuum()
> > {
> > ..
> > + /*
> > + * Not safe, if the index supports parallel cleanup conditionally,
> > + * but we have already processed the index (for bulkdelete).  See the
> > + * comments for option VACUUM_OPTION_PARALLEL_COND_CLEANUP to know
> > + * when indexes support parallel cleanup conditionally.
> > + */
> > + if (num_index_scans > 0 &&
> > + ((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
> > ..
> > }
> >
> > IIRC, we do this to avoid the need to invoke worker when parallel
> > cleanup doesn't need to scan the index which means the work required
> > to be performed by a worker would be minimal. If so, maybe we can
> > write that in comments here or with
> > VACUUM_OPTION_PARALLEL_COND_CLEANUP.
>
> Right. Will add the comments.
>
> > If the above understanding is correct then is it correct to check
> > num_index_scans here? AFAICS, this value is incremented in
> > parallel_vacuum_all_indexes irrespective of whether it is invoked for
> > bulk delete or clean up. OTOH, previously, this was done based on
> > first_time variable which was in turn set based on
> > vacrel->num_index_scans and that is incremented in
> > lazy_vacuum_all_indexes(both in serial and parallel case).
>
> You're right. That's wrong to increment num_index_scans also when
> vacuumcleanup. It should be incremented only when bulkdelete. Perhaps,
> the caller (i.e., the table AM) should pass num_index_scans to parallel
> vacuum code? I initially thought that ParallelVacuumState can have
> num_index_scans and increment it only when parallel bulk-deletion. But
> if we do that we will end up having the same thing in two places:
> ParallelVacuumState and LVRelState. It would be clearer if we maintain
> num_index_scan in LVRelState and pass it to parallel index vacuum when
> calling to parallel index bulk-deletion or cleanup.
>

That sounds reasonable.

> On the other hand,
> the downside would be that there is a possibility that a table AM
> passes the wrong num_index_scans.
>

If that happens then also there will be no problem as such because it
will do some work via worker where it would have been done by the
leader itself. I think it is better to have one source of information
for this as we need to mainly consider whether bulkdelete has been
already performed or not, it doesn't matter whether that is performed
by leader or worker.

>
> >
> > 2. The structure ParallelVacuumState contains both PVIndVacStatus and
> > PVIndStats. Considering PVIndVacStatus is already present in
> > PVIndStats, does ParallelVacuumState need to have both?
>
> "PVIndVacStatus status” of ParallelVacuumState is used by the worker
> in the error callback function,
> parallel_index_vacuum_error_callback(), in order to know the status of
> the index vacuum that the worker is working on. I think that without
> PVIndVacStatus, the worker needs to have the index of the PVIndStats
> array in order to get the status by like
> errinfo->indstats[idx]->status. Do you prefer to do that?
>

As mentioned in my another email to which you agreed that we need to
re-design this callback and do it separately, I think it is better to
consider it separately. So, we can probably remove this parameter from
the main patch as of now.

> > 3. Why ParallelVacuumCtl is declared in vacuum.h? It appears to be
> > only used in one function begin_parallel_vacuum, can't we just declare
> > in vacuumparallel.c?
>
> ParallelVacuumCtl is a struct to begin the parallel vacuum and
> therefore is expected to be passed by table AM. If we declare it in
> vacuumparallel.c a table AM (e.g., vacuumlazy.c) cannot use it, no?
>
> > As it is only required for one function and it is
> > not that the number of individual parameters will be too huge, can't
> > we do without having that structure.
>
> Yes, we can do that without having that structure. I was a bit
> concerned that there are already 7 arguments.
>

Yeah, it is better to have fewer arguments but I don't think this number is
big enough to worry about. Also, I am not sure about the table AM point as
there is no clear example in front of us which tells how any other
table AM might use it and whether this structure is generic enough. So
I think it might be better to use arguments for this and if we later
find some generic use then we can replace it with structure.

-- 
With Regards,
Amit Kapila.




RE: Implementing Incremental View Maintenance

2021-11-23 Thread r.takahash...@fujitsu.com
Hi Nagata-san,


Sorry for late reply.


> However, even if we create triggers recursively on the parents or children,
> we would still need more consideration. This is because we will have to
> convert the format of tuples of the modified table to the format of the
> table specified in the view for cases where the parent and some children
> have different formats.
> 
> I think supporting partitioned tables can be left for the next release.

OK. I understand.
In the v24 patch, creating an IMMV on partitions or partitioned tables is
prohibited. That is OK, but it should be documented.

Perhaps the following statement describes this.
If so, I think the definition of "simple base table" is ambiguous for some
users.

+ IMMVs must be based on simple base tables. It's not supported to
+ create them on top of views or materialized views.


> DEPENDENCY_IMMV was added to clear that a certain trigger is related to IMMV.
> We dropped the IVM trigger and its dependencies from IMMV when REFRESH ... 
> WITH NO DATA
> is executed. Without the new deptype, we may accidentally delete a dependency 
> created
> with an intention other than the IVM trigger.

OK. I understand.

> I think it is harder than you expected. When an IMMV is switched to a normal
> materialized view, we need to drop hidden columns (__ivm_count__ etc.), and
> in the opposite case, we need to create them again. The former (IMMV->IVM)
> might be easier, but for the latter (IVM->IMMV) I wonder if we would need to
> re-create the IMMV.

OK. I understand.


Regards,
Ryohei Takahashi




Re: Some RELKIND macro refactoring

2021-11-23 Thread Michael Paquier
On Mon, Nov 22, 2021 at 11:21:52AM +0100, Peter Eisentraut wrote:
> On 19.11.21 08:31, Michael Paquier wrote:
>> Regarding 0001, I find the existing code a bit more self-documenting
>> if we keep those checks flagInhAttrs() and guessConstraintInheritance().
>> So I would rather leave these.
> 
> In that case, the existing check in guessConstraintInheritance() seems
> wrong, because it doesn't check for RELKIND_MATVIEW.  Should we fix that?
> It's dead code either way, but if the code isn't exercised, then these
> kinds of inconsistencies come about.

Yeah, this one could be added.  Perhaps that comes down to one's taste
at the end, but I would add it.

> Maybe
> 
> else
> {
> Assert(RELKIND_HAS_STORAGE(rel->rd_rel->relkind));
> RelationCreateStorage(rel->rd_node, relpersistence);
> }
> 
> create_storage is set earlier based on RELKIND_HAS_STORAGE(), so this would
> be consistent.

Sounds fine by me.  Perhaps you should apply the same style in
RelationGetNumberOfBlocksInFork(), then?
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-11-23 Thread Dilip Kumar
On Tue, Nov 23, 2021 at 10:29 PM John Naylor
 wrote:
>
> Hi,
>
> I've looked over this patch set and email thread a couple times, and I don't 
> see anything amiss, but I'm also not terribly familiar with the subsystems 
> this part of the code relies on. I haven't yet tried to stress test with a 
> large database, but it seems like a good idea to do so.

Thanks, John for looking into the patches.  Yeah, that makes sense,
next week I will try to test with a large database and maybe with
multiple tablespaces as well to see how this behaves.

> I have a couple comments and questions:
>
> 0006:
>
> + * XXX We can optimize RelationMapOidToFilenodeForDatabase API
> + * so that instead of reading the relmap file every time, it can
> + * save it in a temporary variable and use it for subsequent
> + * calls.  Then later reset it once we're done or at the
> + * transaction end.
>
> Do we really need to consider optimizing here? Only a handful of relations 
> will be found in the relmap, right?

You are right, it is actually not required; I will remove this comment.

>
> + * Once we start copying files from the source database, we need to be able
> + * to clean 'em up if we fail.  Use an ENSURE block to make sure this
> + * happens.  (This is not a 100% solution, because of the possibility of
> + * failure during transaction commit after we leave this routine, but it
> + * should handle most scenarios.)
>
> This comment in master started with
>
> - * Once we start copying subdirectories, we need to be able to clean 'em
>
> Is the distinction important enough to change this comment? Also, is "most 
> scenarios" still true with the patch? I haven't read into how ENSURE works.

Actually, it is like a PG_TRY()/PG_CATCH() block with the extra assurance
of cleanup on shmem_exit as well.  In the cleanup function, we go through
all the tablespaces and remove the new DB-related directory which we are
trying to create.  And you are right, we actually don't need to change the
comments.
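
A minimal sketch of that pattern, using the macros from storage/ipc.h (the
callback and argument names here are illustrative):

    PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
                            PointerGetDatum(&fparms));
    {
        /* copy the database files from the source directory ... */
    }
    PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
                                PointerGetDatum(&fparms));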

> Same with this comment change, seems fine the way it was:

Correct.

> - * Use an ENSURE block to make sure we remove the debris if the copy fails
> - * (eg, due to out-of-disk-space).  This is not a 100% solution, because
> - * of the possibility of failure during transaction commit, but it should
> - * handle most scenarios.
> + * Use an ENSURE block to make sure we remove the debris if the copy fails.
> + * This is not a 100% solution, because of the possibility of failure
> + * during transaction commit, but it should handle most scenarios.
>
> And do we need additional tests? Maybe we don't, but it seems good to make 
> sure.
>
> I haven't looked at 0007, and I have no opinion on which approach is better.

Okay, I like approach 6 mainly for two reasons: 1) it does not directly
scan the raw file to identify which files to copy, which seems cleaner to
me; 2) with 0007, if we directly scan the directory, we don't know the
relation OID, so there is no way to acquire the relation lock before
acquiring the buffer lock.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: VS2022: Support Visual Studio 2022 on Windows

2021-11-23 Thread Michael Paquier
On Sun, Nov 21, 2021 at 10:41:17AM +0900, Michael Paquier wrote:
> I'll double-check your patch later, but that looks rather good to me.
> Will try to apply and back-patch, and it would be better to check the
> version numbers assigned in the patch, as well.

I have spent a couple of hours on that today, and applied it down to
10 so that all branches benefit from it.  There was a hidden problem
in 10 and 11, where we need to be careful to use VC2012Project as the
base in MSBuildProject.pm.

Thanks, Hans!
--
Michael


signature.asc
Description: PGP signature


Re: Warning in geqo_main.c from clang 13

2021-11-23 Thread Tom Lane
Thomas Munro  writes:
> Clang 13 on my machine and peripatus (but not Apple clang 13 on eg
> sifika, I'm still confused about Apple's versioning but I think that's
> really llvm 12-based) warns:
> geqo_main.c:86:8: warning: variable 'edge_failures' set but not used
> [-Wunused-but-set-variable]
> int edge_failures = 0;

Yeah, I noticed that a week or two ago, but didn't see a simple fix.

> Here's one way to silence it.

I'm kind of inclined to just drop the edge_failures recording/logging
altogether, rather than make that rats-nest of #ifdefs even worse.
It's not like anyone has cared about that number in the last decade
or two.
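
That would amount to something like this (a sketch of the suggestion, not a
tested patch):

    /* Discard the edge-failure count instead of recording it. */
    (void) gimme_tour(root, edge_table, kid->string, pool->string_length);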

regards, tom lane




Re: Skipping logical replication transactions on subscriber side

2021-11-23 Thread Amit Kapila
On Wed, Nov 24, 2021 at 7:51 AM Masahiko Sawada  wrote:
>
> On Thu, Nov 18, 2021 at 12:52 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tuesday, November 16, 2021 2:31 PM Masahiko Sawada 
> >  wrote:
> > > Right. I've fixed this issue and attached an updated patch.
> > >
> > Hi,
> >
> > I have few comments for the testcases.
> >
> > 1)
> >
> > +my $appname = 'tap_sub';
> > +$node_subscriber->safe_psql(
> > +'postgres',
> > +"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr 
> > application_name=$appname' PUBLICATION tap_pub WITH (streaming = off, 
> > two_phase = on);");
> > +my $appname_streaming = 'tap_sub_streaming';
> > +$node_subscriber->safe_psql(
> > +'postgres',
> > +"CREATE SUBSCRIPTION tap_sub_streaming CONNECTION '$publisher_connstr 
> > application_name=$appname_streaming' PUBLICATION tap_pub_streaming WITH 
> > (streaming = on, two_phase = on);");
> > +
> >
> > I think we can remove the 'application_name=$appname', so that the command
> > could be shorter.
>
> But we wait for the subscription to catch up by using
> wait_for_catchup() with application_name, no?
>

Yeah, but you can directly use the subscription name in
wait_for_catchup because we internally use that as
fallback_application_name. If application_name is not specified in the
connection string as suggested by Hou-San then
fallback_application_name will be considered. Both ways are okay and I
see we use both ways in the tests but it seems there are more places
where we use the method Hou-San is suggesting in subscription tests.

-- 
With Regards,
Amit Kapila.




Python 3.11 vs. Postgres

2021-11-23 Thread Tom Lane
According to [1], we need to stop including Python's <eval.h>.
I've not checked whether this creates any backwards-compatibility
issues.

regards, tom lane

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2023272




Re: parallel vacuum comments

2021-11-23 Thread Amit Kapila
On Wed, Nov 24, 2021 at 7:48 AM Masahiko Sawada  wrote:
>
> On Fri, Nov 19, 2021 at 11:25 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > 3)
> >
> > +   /*
> > +* Reset all index status back to invalid (while checking that we 
> > have
> > +* processed all indexes).
> > +*/
> > +   for (int i = 0; i < pvs->nindexes; i++)
> > +   {
> > +   PVIndStats *stats = &(pvs->indstats[i]);
> > +
> > +   Assert(stats->status == INDVAC_STATUS_COMPLETED);
> > +   stats->status = INDVAC_STATUS_INITIAL;
> > +   }
> >
> > Would it be safer if we report an error if any index's status is not
> > INDVAC_STATUS_COMPLETED ?
>
> Agreed. It'd be safer since, even if some indexes were left unprocessed
> due to a bug, vacuum would error out rather than continue (and cause
> index corruption).
>

I think if we want to report an error in this case, we should use elog
as this is an internal error.
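
Something like the following, replacing the Assert in the quoted loop (a
sketch; names taken from the quoted code):

    if (stats->status != INDVAC_STATUS_COMPLETED)
        elog(ERROR, "parallel index vacuum for index %d is not completed", i);
    stats->status = INDVAC_STATUS_INITIAL;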

-- 
With Regards,
Amit Kapila.




Warning in geqo_main.c from clang 13

2021-11-23 Thread Thomas Munro
Hi,

Clang 13 on my machine and peripatus (but not Apple clang 13 on eg
sifika, I'm still confused about Apple's versioning but I think that's
really llvm 12-based) warns:

geqo_main.c:86:8: warning: variable 'edge_failures' set but not used
[-Wunused-but-set-variable]
int edge_failures = 0;

Here's one way to silence it.
diff --git a/src/backend/optimizer/geqo/geqo_main.c b/src/backend/optimizer/geqo/geqo_main.c
index 09d9e7d4dd..b18eccfae4 100644
--- a/src/backend/optimizer/geqo/geqo_main.c
+++ b/src/backend/optimizer/geqo/geqo_main.c
@@ -83,8 +83,10 @@ geqo(PlannerInfo *root, int number_of_rels, List *initial_rels)
 
 #if defined(ERX)
 	Edge	   *edge_table;		/* list of edges */
+#if defined(GEQO_DEBUG)
 	int			edge_failures = 0;
 #endif
+#endif
 #if defined(CX) || defined(PX) || defined(OX1) || defined(OX2)
 	City	   *city_table;		/* list of cities */
 #endif
@@ -187,7 +189,10 @@ geqo(PlannerInfo *root, int number_of_rels, List *initial_rels)
 		kid = momma;
 
 		/* are there any edge failures ? */
-		edge_failures += gimme_tour(root, edge_table, kid->string, pool->string_length);
+#if defined(GEQO_DEBUG)
+		edge_failures +=
+#endif
+			gimme_tour(root, edge_table, kid->string, pool->string_length);
 #elif defined(PMX)
 		/* PARTIALLY MATCHED CROSSOVER */
 		pmx(root, momma->string, daddy->string, kid->string, pool->string_length);


Re: Skipping logical replication transactions on subscriber side

2021-11-23 Thread Masahiko Sawada
On Thu, Nov 18, 2021 at 12:52 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, November 16, 2021 2:31 PM Masahiko Sawada  
> wrote:
> > Right. I've fixed this issue and attached an updated patch.
> >
> Hi,
>
> I have few comments for the testcases.
>
> 1)
>
> +my $appname = 'tap_sub';
> +$node_subscriber->safe_psql(
> +'postgres',
> +"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr 
> application_name=$appname' PUBLICATION tap_pub WITH (streaming = off, 
> two_phase = on);");
> +my $appname_streaming = 'tap_sub_streaming';
> +$node_subscriber->safe_psql(
> +'postgres',
> +"CREATE SUBSCRIPTION tap_sub_streaming CONNECTION '$publisher_connstr 
> application_name=$appname_streaming' PUBLICATION tap_pub_streaming WITH 
> (streaming = on, two_phase = on);");
> +
>
> I think we can remove the 'application_name=$appname', so that the command
> could be shorter.

But we wait for the subscription to catch up by using
wait_for_catchup() with application_name, no?

>
> 2)
> +...(streaming = on, two_phase = on);");
> Besides, is there some reason to set two_phase to on? If so,
> it might be better to add some comments about it.
>

Yes, two_phase = on is required by the tests for the skip-transaction
patch. Will remove it.

>
> 3)
> +CREATE PUBLICATION tap_pub_streaming FOR TABLE test_tab_streaming;
> +]);
> +
>
> It seems there are no tests using the table test_tab_streaming. I guess this
> table is used to test the streaming-change error; maybe we can add some
> tests for it?

Oops, similarly this is also required by the skip transaction tests.
Will remove it.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: parallel vacuum comments

2021-11-23 Thread Masahiko Sawada
On Fri, Nov 19, 2021 at 11:25 AM houzj.f...@fujitsu.com
 wrote:
>
> On Tues, Nov 16, 2021 1:53 PM Masahiko Sawada  wrote:
> > I've incorporated these comments and attached an updated patch.
>
> Thanks for updating the patch.
> I read the latest patch and have few comments.

Thank you for the comments! For the comments (2) and (4), I replied in
a separate email to answer your and Amit's comments.

>
> 1)
> +/*
> + * lazy_vacuum_one_index() -- vacuum index relation.
> ...
> +IndexBulkDeleteResult *
> +vacuum_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
>
>
> + * vac_cleanup_one_index() -- do post-vacuum cleanup for index relation.
> ...
> +IndexBulkDeleteResult *
> +cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
>
> The above function names seem different from the name mentioned in the 
> function
> header.

Will fix both.

>
> 3)
>
> +   /*
> +* Reset all index status back to invalid (while checking that we have
> +* processed all indexes).
> +*/
> +   for (int i = 0; i < pvs->nindexes; i++)
> +   {
> +   PVIndStats *stats = &(pvs->indstats[i]);
> +
> +   Assert(stats->status == INDVAC_STATUS_COMPLETED);
> +   stats->status = INDVAC_STATUS_INITIAL;
> +   }
>
> Would it be safer if we report an error if any index's status is not
> INDVAC_STATUS_COMPLETED ?

Agreed. It'd be safer since, even if some indexes were left unprocessed
due to a bug, vacuum would error out rather than continue (and cause
index corruption).

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Mop-up from Test::More version change patch

2021-11-23 Thread Tom Lane
Andrew Dunstan  writes:
> On 11/23/21 12:03, Tom Lane wrote:
>> If you set PERL then it's highly likely that you want to use
>> "prove" from the same installation.  So 0001 modifies configure
>> to first check for an executable "prove" in the same directory
>> as $PERL.  If that's not what you want then you should override
>> it by setting PROVE explicitly.

> Do we really have much of an issue left to solve given c4fe3199a? It
> feels a bit like a solution in search of a problem.

We don't absolutely have to do this, agreed.  But I think the
current behavior fails to satisfy the POLA.  As I remarked in
the other thread, I'm worried about somebody wasting time trying
to identify why their TAP test isn't behaving the way it does
when they invoke the code under the perl they think they're using.

regards, tom lane




Re: parallel vacuum comments

2021-11-23 Thread Masahiko Sawada
On Mon, Nov 22, 2021 at 1:48 PM Amit Kapila  wrote:
>
> On Fri, Nov 19, 2021 at 7:55 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tues, Nov 16, 2021 1:53 PM Masahiko Sawada  wrote:
> > > I've incorporated these comments and attached an updated patch.
> >
> >
> > 2)
> >  static void vacuum_error_callback(void *arg);
> >
> > I noticed the patch changed the parallel worker's error callback function to
> > parallel_index_vacuum_error_callback(). The error message in new callback
> > function seems a little different from the old one, was it intentional ?
> >
>
> One more point related to this is that it seems a new callback will be
> invoked only by parallel workers, so the context displayed during
> parallel vacuum will be different based on if the error happens during
> processing by leader or worker. I think if done correctly this would
> be an improvement over what we have now but isn't it better to do this
> change as a separate patch?

Agreed.

>
> >
> > 4)
> >
> > Just a personal suggestion for the parallel related function name. Since 
> > Andres
> > wanted a uniform naming pattern. Maybe we can rename the following functions:
> >
> > end|begin_parallel_vacuum => parallel_vacuum_end|begin
> > perform_parallel_index_bulkdel|cleanup => 
> > parallel_vacuum_index_bulkdel|cleanup
> >
> > So that all the parallel related functions' name is like 
> > parallel_vacuum_xxx.
> >
>
> BTW, do we really need functions
> perform_parallel_index_bulkdel|cleanup? Both do some minimal
> assignments and then call parallel_vacuum_all_indexes() and there is
> just one caller of each. Isn't it better to just do those assignments
> in the caller and directly call parallel_vacuum_all_indexes()?

The reasons why I declared these two functions are: (1) the fields of
ParallelVacuumState are not exposed and (2) bulk-deletion and cleanup
require different arguments (estimated_count is required only by
cleanup).  So if we expose the fields of ParallelVacuumState, the
caller can do those assignments and directly call
parallel_vacuum_all_indexes(). But I'm not sure it's good if those
assignments are the caller's responsibility.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Mop-up from Test::More version change patch

2021-11-23 Thread Andrew Dunstan


On 11/23/21 12:03, Tom Lane wrote:
> [ moving thread to -hackers for a bit more visibility ]
>
> Attached are a couple of patches I propose in the wake of commit
> 405f32fc4 (Require version 0.98 of Test::More for TAP tests).
>
> 0001 responds to the failure we saw on buildfarm member wrasse [1]
> where, despite configure having carefully checked for Test::More
> being >= 0.98, actual tests failed with
> Test::More version 0.98 required--this is only version 0.92 at 
> /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/test/perl/PostgreSQL/Test/Utils.pm
>  line 63.
> The reason is that wrasse was choosing "prove" from a different
> Perl installation than "perl", as a result of its configuration
> having set PERL to a nondefault place but doing nothing about PROVE.
>
> We already installed a couple of mitigations for that:
> (a) as of c4fe3199a, configure checks "prove" not "perl" for
> appropriate module versions;
> (b) Noah has modified wrasse's configuration to set PROVE.
> But I'm of the opinion that (b) should not be necessary.
> If you set PERL then it's highly likely that you want to use
> "prove" from the same installation.  So 0001 modifies configure
> to first check for an executable "prove" in the same directory
> as $PERL.  If that's not what you want then you should override
> it by setting PROVE explicitly.
>
> Since this is mainly meant to prevent an easy-to-make error in
> setting up buildfarm configurations, we should back-patch it.


Do we really have much of an issue left to solve given c4fe3199a? It
feels a bit like a solution in search of a problem.



>
> 0002 is written to apply to v14 and earlier, and what it wants
> to do is back-patch the effects of 405f32fc4, so that the
> minimum Test::More version is 0.98 in all branches.  The thought
> here is that (1) somebody is likely to want to back-patch a
> test involving Test::More::subtest before too long; (2) we have
> zero coverage for older Test::More versions anyway, now that
> all buildfarm members have been updated to work with HEAD.
>

This one seems like a good idea.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: parallel vacuum comments

2021-11-23 Thread Masahiko Sawada
On Mon, Nov 22, 2021 at 6:35 PM Amit Kapila  wrote:
>
> On Tue, Nov 16, 2021 at 11:23 AM Masahiko Sawada  
> wrote:
> >
> > I've incorporated these comments and attached an updated patch.
> >
>
> Review comments:
> 
> 1.
> index_can_participate_parallel_vacuum()
> {
> ..
> + /*
> + * Not safe, if the index supports parallel cleanup conditionally,
> + * but we have already processed the index (for bulkdelete).  See the
> + * comments for option VACUUM_OPTION_PARALLEL_COND_CLEANUP to know
> + * when indexes support parallel cleanup conditionally.
> + */
> + if (num_index_scans > 0 &&
> + ((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
> ..
> }
>
> IIRC, we do this to avoid the need to invoke worker when parallel
> cleanup doesn't need to scan the index which means the work required
> to be performed by a worker would be minimal. If so, maybe we can
> write that in comments here or with
> VACUUM_OPTION_PARALLEL_COND_CLEANUP.

Right. Will add the comments.

> If the above understanding is correct then is it correct to check
> num_index_scans here? AFAICS, this value is incremented in
> parallel_vacuum_all_indexes irrespective of whether it is invoked for
> bulk delete or clean up. OTOH, previously, this was done based on
> first_time variable which was in turn set based on
> vacrel->num_index_scans and that is incremented in
> lazy_vacuum_all_indexes(both in serial and parallel case).

You're right. It's wrong to increment num_index_scans also for
vacuumcleanup; it should be incremented only for bulkdelete. Perhaps
the caller (i.e., the table AM) should pass num_index_scans to the
parallel vacuum code? I initially thought that ParallelVacuumState
could have num_index_scans and increment it only during parallel
bulk-deletion, but then we would end up having the same thing in two
places: ParallelVacuumState and LVRelState. It would be clearer to
maintain num_index_scans in LVRelState and pass it to parallel index
vacuum when calling parallel index bulk-deletion or cleanup. On the
other hand, the downside is the possibility that a table AM passes a
wrong num_index_scans. There is also a valid argument that, since a
table AM capable of parallel index vacuum is expected to outsource
index bulkdelete/cleanup to parallel index vacuum for the whole vacuum
operation, it would be better to have ParallelVacuumState maintain
num_index_scans itself.

>
> 2. The structure ParallelVacuumState contains both PVIndVacStatus and
> PVIndStats. Considering PVIndVacStatus is already present in
> PVIndStats, does ParallelVacuumState need to have both?

"PVIndVacStatus status” of ParallelVacuumState is used by the worker
in the error callback function,
parallel_index_vacuum_error_callback(), in order to know the status of
the index vacuum that the worker is working on. I think that without
PVIndVacStatus, the worker needs to have the index of the PVIndStats
array in order to get the status by like
errinfo->indstats[idx]->status. Do you prefer to do that?

> 3. Why ParallelVacuumCtl is declared in vacuum.h? It appears to be
> only used in one function begin_parallel_vacuum, can't we just declare
> in vacuumparallel.c?

ParallelVacuumCtl is a struct to begin the parallel vacuum and
therefore is expected to be passed by table AM. If we declare it in
vacuumparallel.c a table AM (e.g., vacuumlazy.c) cannot use it, no?

> As it is only required for one function and it is
> not that the number of individual parameters will be too huge, can't
> we do without having that structure.

Yes, we can do that without having that structure. I was a bit
concerned that there are already 7 arguments.


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2021-11-23 Thread Andres Freund
Hi,

On 2021-11-23 17:01:20 -0800, Peter Geoghegan wrote:
> > On reason for my doubt is the following:
> >
> > We can set all-visible on a page without a FPW image (well, as long as hint
> > bits aren't logged). There's a significant difference between needing to WAL
> > log FPIs for every heap page or not, and it's not that rare for data to live
> > shorter than autovacuum_freeze_max_age or that limit never being reached.
> 
> This sounds like an objection to one specific heuristic, and not an
> objection to the general idea.

I understood you to propose that we do not have separate frozen and
all-visible states. Which I think will be problematic, because of scenarios
like the above.


> The only essential part is "opportunistic freezing during vacuum, when the
> cost is clearly very low, and the benefit is probably high". And so it now
> seems you were making a far more limited statement than I first believed.

I'm on board with freezing when we already dirty out the page, and when doing
so doesn't cause an additional FPI. And I don't think I've argued against that
in the past.


> These all-visible (but not all-frozen) heap pages could be considered
> "tenured", since they have survived at least one full VACUUM cycle
> without being unset. So why not also freeze them based on the
> assumption that they'll probably stay that way forever?

Because it's a potentially massive increase in write volume? E.g. if you have
an insert-only workload, and you discard old data by dropping old partitions,
this will often add yet another rewrite, despite your data likely never
getting old enough to need to be frozen.

Given that we often immediately need to start another vacuum just when one
finished, because the vacuum took long enough to reach thresholds of vacuuming
again, I don't think the (auto-)vacuum count is a good proxy.

Maybe you meant this as a more limited concept, i.e. only doing so when the
percentage of all-visible but not all-frozen pages is small?


We could perhaps do better if we had information about the system-wide rate
of XID throughput and how often / how long past vacuums of a table took.


Greetings,

Andres Freund




RE: row filtering for logical replication

2021-11-23 Thread houzj.f...@fujitsu.com
On Tues, Nov 23, 2021 6:16 PM Amit Kapila  wrote:
> On Tue, Nov 23, 2021 at 1:29 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tues, Nov 23, 2021 2:27 PM vignesh C  wrote:
> > > On Thu, Nov 18, 2021 at 7:04 AM Peter Smith 
> > > wrote:
> > > >
> > > > PSA new set of v40* patches.
> > >
> > > Few comments:
> > > 1) When a table is added to the publication, replica identity is checked. 
> > > But
> > > while modifying the publish action to include delete/update, replica 
> > > identity is
> > > not checked for the existing tables. I felt it should be checked for the 
> > > existing
> > > tables too.
> >
> > In addition to this, I think we might also need some check to prevent user 
> > from
> > changing the REPLICA IDENTITY index which is used in the filter expression.
> >
> > I was thinking is it possible do the check related to REPLICA IDENTITY in
> > function CheckCmdReplicaIdentity() or In GetRelationPublicationActions(). 
> > If we
> > move the REPLICA IDENTITY check to this function, it would be consistent 
> > with
> > the existing behavior about the check related to REPLICA IDENTITY(see the
> > comments in CheckCmdReplicaIdentity) and seems can cover all the cases
> > mentioned above.
> >
> 
> Yeah, adding the replica identity check in CheckCmdReplicaIdentity()
> would cover all the above cases but I think that would put a premium
> on each update/delete operation. I think traversing the expression
> tree (it could be multiple traversals if the relation is part of
> multiple publications) during each update/delete would be costly.
> Don't you think so?

Yes, I agree that traversing the expression every time would be costly.

I thought maybe we could cache the columns used in the row filter, or cache
only a flag (can_update|can_delete), in the relcache. I think every operation
that affects the row filter or the replica identity will invalidate the
relcache, so the cost of the check seems acceptable with the cache.
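
To make the caching idea a bit more concrete, here is a rough sketch (not
from the patch; the rf_* fields are hypothetical additions) of how such
flags could piggyback on the PublicationActions struct that
GetRelationPublicationActions() already caches in the relcache:

typedef struct PublicationActions
{
	bool		pubinsert;
	bool		pubupdate;
	bool		pubdelete;
	bool		pubtruncate;

	/*
	 * Hypothetical additions: true if every column referenced by a row
	 * filter of any publication of this relation is covered by the
	 * replica identity.  Recomputed whenever the relcache entry is
	 * rebuilt, so ALTER PUBLICATION, ALTER REPLICA IDENTITY, DROP INDEX
	 * etc. refresh them via the usual invalidation.
	 */
	bool		rf_valid_for_update;
	bool		rf_valid_for_delete;
} PublicationActions;

CheckCmdReplicaIdentity() could then test the precomputed flag instead of
walking the filter expressions on every UPDATE/DELETE.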

The reason I thought it might be better to do the check in
CheckCmdReplicaIdentity is that otherwise we might need to add duplicate
check code in a couple of places; for example, we would need to check the
replica identity on:

[ALTER REPLICA IDENTITY |
DROP INDEX |
ALTER PUBLICATION ADD TABLE |
ALTER PUBLICATION SET (pubaction)]

Best regards,
Hou zj


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2021-11-23 Thread Peter Geoghegan
On Mon, Nov 22, 2021 at 9:49 PM Andres Freund  wrote:
> > For example, we can definitely afford to wait a few more milliseconds
> > to get a cleanup lock just once
>
> We currently have no infrastructure to wait for an lwlock or pincount for a
> limited time. And at least for the former it'd not be easy to add. It may be
> worth adding that at some point, but I'm doubtful this is sufficient reason
> for nontrivial new infrastructure in very performance sensitive areas.

It was a hypothetical example. To be more practical about it: it seems
likely that we won't really benefit from waiting some amount of time
(not forever) for a cleanup lock in non-aggressive VACUUM, once we
have some of the relfrozenxid stuff we've talked about in place. In a
world where we're smarter about advancing relfrozenxid in
non-aggressive VACUUMs, the choice between waiting for a cleanup lock,
and not waiting (but also not advancing relfrozenxid at all) matters
less -- it's no longer a binary choice.

It's no longer a binary choice because we will have done away with the
current rigid way in which our new relfrozenxid for the relation is
either FreezeLimit, or nothing at all. So far we've only talked about
the case where we can update relfrozenxid with a value that happens to
be much newer than FreezeLimit. If we can do that, that's great. But
what about setting relfrozenxid to an *older* value than FreezeLimit
instead (in a non-aggressive VACUUM)? That's also pretty good! There
is still a decent chance that the final "suboptimal" relfrozenxid that
we determine can be safely set in pg_class at the end of our VACUUM
will still be far more recent than the preexisting relfrozenxid.
Especially with larger tables.

Advancing relfrozenxid should be thought of as a totally independent
thing to freezing tuples, at least in vacuumlazy.c itself. That's
kinda the case today, even, but *explicitly* decoupling advancing
relfrozenxid from actually freezing tuples seems like a good high
level goal for this project.

Remember, FreezeLimit is derived from vacuum_freeze_min_age in the
obvious way: OldestXmin for the VACUUM, minus vacuum_freeze_min_age
GUC/reloption setting. I'm pretty sure that this means that making
autovacuum freeze tuples more aggressively (by reducing
vacuum_freeze_min_age) could have the perverse effect of making
non-aggressive VACUUMs less likely to advance relfrozenxid -- which is
exactly backwards. This effect could easily be missed, even by expert
users, since there is no convenient instrumentation that shows how and
when relfrozenxid is advanced.

> > All of the autovacuums against the accounts table look similar to this
> > one -- you don't see anything about relfrozenxid being advanced
> > (because it isn't).

>> Does that really make
> > sense, though?
>
> Does what make really sense?

Well, my accounts table example wasn't a particularly good one (it was
a conveniently available example). I am now sure that you got the
point I was trying to make here already, based on what you go on to
say about non-aggressive VACUUMs optionally *not* skipping
all-visible-not-all-frozen heap pages in the hopes of advancing
relfrozenxid earlier (more on that idea below, in my response).

On reflection, the simplest way of expressing the same idea is what I
just said about decoupling (decoupling advancing relfrozenxid from
freezing).

> I think pgbench_accounts is just a really poor showcase. Most importantly
> there's no even slightly longer running transactions that hold down the xid
> horizon. But in real workloads thats incredibly common IME.  It's also quite
> uncommon in real workloads to huge tables in which all records are
> updated. It's more common to have value ranges that are nearly static, and a
> more heavily changing range.

I agree.

> I think the most interesting cases where using the "measured" horizon will be
> advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
> modified tables, including completely static ones, too. Using the "measured"
> horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
> tables, because we'll be able to set a much more recent relfrozenxid.

That's probably true in practice -- but who knows these days, with the
autovacuum_vacuum_insert_scale_factor stuff? Either way I see no
reason to emphasize that case in the design itself. The "decoupling"
concept now seems like the key design-level concept -- everything else
follows naturally from that.

> This is becoming more common with the increased use of partitioning.

Also with bulk loading. There could easily be a tiny number of
distinct XIDs that are close together in time, for many many rows --
practically one XID, or even exactly one XID.

> No, not quite. We treat anti-wraparound vacuums as an emergency (including
> logging messages, not cancelling). But the only mechanism we have against
> anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
> not really a "real" 

Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2021-11-23 Thread Mark Dilger



> On Nov 23, 2021, at 4:51 PM, Bossart, Nathan  wrote:
> 
> This is a good point.  Right now, you'd have to manually inspect the
> infomask field to determine the exact form of the invalid combination.
> My only worry with this is that we'd want to make sure these checks
> stayed consistent with the definition of HEAP_XMAX_IS_LOCKED_ONLY in
> htup_details.h.  I'm guessing HEAP_XMAX_IS_LOCKED_ONLY is unlikely to
> change all that often, though.

I expect that your patch will contain some addition to the amcheck (or 
pg_amcheck) tests, so if we ever allow some currently disallowed bit 
combination, we'd be reminded of the need to update amcheck.  So I'm not too 
worried about that aspect of this.

I prefer not to have to get a page (or full file) from a customer when the 
check reports corruption.  Even assuming they are comfortable giving that, 
which they may not be, it is an extra round trip to the customer asking for 
stuff.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2021-11-23 Thread Bossart, Nathan
On 11/23/21, 4:36 PM, "Mark Dilger"  wrote:
> I have to wonder if, when corruption is reported for conditions like this:
>
> +   if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
> +   HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask))
>
> if the first thing we're going to want to know is which branch of the 
> HEAP_XMAX_IS_LOCKED_ONLY macro evaluated true?  Should we split this check to 
> do each branch of the macro separately, such as:
>
> if (ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED)
> {
>     if (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)
>         ... report something ...
>     else if ((ctx->tuphdr->t_infomask & (HEAP_XMAX_IS_MULTI | HEAP_LOCK_MASK)) == HEAP_XMAX_EXCL_LOCK)
>         ... report a different thing ...
> }

This is a good point.  Right now, you'd have to manually inspect the
infomask field to determine the exact form of the invalid combination.
My only worry with this is that we'd want to make sure these checks
stayed consistent with the definition of HEAP_XMAX_IS_LOCKED_ONLY in
htup_details.h.  I'm guessing HEAP_XMAX_IS_LOCKED_ONLY is unlikely to
change all that often, though.
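
For reference, the definition in htup_details.h has exactly the two
branches discussed; quoting it from the tree (worth re-checking against
whichever branch the patch ends up targeting):

#define HEAP_XMAX_IS_LOCKED_ONLY(infomask) \
	(((infomask) & HEAP_XMAX_LOCK_ONLY) || \
	 (((infomask) & (HEAP_XMAX_IS_MULTI | HEAP_LOCK_MASK)) == HEAP_XMAX_EXCL_LOCK))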

Nathan



Re: Windows build warnings

2021-11-23 Thread Greg Nancarrow
On Wed, Nov 24, 2021 at 1:41 AM Alvaro Herrera  wrote:
>
> On 2021-Nov-23, Juan José Santamaría Flecha wrote:
>
> > On Tue, Nov 23, 2021 at 2:11 PM Daniel Gustafsson  wrote:
>
> > > It's supported in clang as well per the documentation [0] in at least some
> > > configurations or distributions:
>
> > [[maybe_unused]] is also recognized from Visual Studio 2017 onwards [1].
> >
> > [1] https://docs.microsoft.com/en-us/cpp/cpp/attributes?view=msvc-170
>
> Right ... the problem, as I understand, is that the syntax for
> [[maybe_unused]] is different from what we can do with the current
> pg_attribute_unused -- [[maybe_unused]] goes before the variable name.
> We would need to define pg_attribute_unused macro (maybe have it take
> the variable name and initializer value as arguments?), and also
> define PG_USED_FOR_ASSERTS_ONLY in the same style.
>

Isn't "[[maybe_unused]]" only supported for MS C++ (not C)?
I'm using Visual Studio 17, and I get nothing but a syntax error if
trying to use it in C code, whereas it works if I rename the same
source file to have a ".cpp" extension (but even then I need to use
the "/std:c++17" compiler flag)

Regards,
Greg Nancarrow
Fujitsu Australia




Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2021-11-23 Thread Mark Dilger



> On Nov 23, 2021, at 4:26 PM, Bossart, Nathan  wrote:
> 
> I've finally gotten started on this, and I've attached a work-in-
> progress patch to gather some early feedback.  I'm specifically
> wondering if there are other places it'd be good to check for these
> unsupported combinations and whether we should use the
> HEAP_XMAX_IS_LOCKED_ONLY macro or just look for the
> HEAP_XMAX_LOCK_ONLY bit. 

I have to wonder if, when corruption is reported for conditions like this:

+   if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
+   HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask))

if the first thing we're going to want to know is which branch of the 
HEAP_XMAX_IS_LOCKED_ONLY macro evaluated true?  Should we split this check to 
do each branch of the macro separately, such as:

if (ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED)
{
    if (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)
        ... report something ...
    else if ((ctx->tuphdr->t_infomask & (HEAP_XMAX_IS_MULTI | HEAP_LOCK_MASK)) == HEAP_XMAX_EXCL_LOCK)
        ... report a different thing ...
}

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2021-11-23 Thread Bossart, Nathan
The archives seem unhappy with my attempt to revive this old thread,
so here is a link to it in case anyone is looking for more context:


https://www.postgresql.org/message-id/flat/3476708e-7919-20cb-ca45-6603470565f7%40amazon.com

Nathan



Re: Sequence's value can be rollback after a crashed recovery.

2021-11-23 Thread Andy Fan
>
>
> You could say that that's the same rookie error as relying on the
> persistence of any other uncommitted DML ...
>

IIUC, this is not exactly the same as uncommitted DML. Any uncommitted
DML is rolled back for sure. But for a sequence, the xmax is not changed
during the sequence's value update by design, and we don't maintain
multiple versions for sequences, so a sequence can't be cleanly rolled
back. The fact is that a dirty data page flush can persist the change
whether the txn is committed or aborted. The example below shows the
difference:

SELECT nextval('s'); -- 1
begin;
SELECT nextval('s');  \watch 0.1  -- run for a while, so that many
                                  -- checkpoints or data flushes happen
-- crash here

If we run nextval('s') on the recovered system, we probably will _not_
get 2 (assuming cachesize=1), unlike with uncommitted DML.


-- 
Best Regards
Andy Fan


Re: Failed transaction statistics to measure the logical replication progress

2021-11-23 Thread Masahiko Sawada
On Tue, Nov 23, 2021 at 3:21 PM Greg Nancarrow  wrote:
>
> On Sat, Nov 20, 2021 at 1:11 AM Masahiko Sawada  wrote:
> >
> > I'm concerned that these new names will introduce confusion; if we
> > have last_error_relid, last_error_command, last_error_message,
> > last_error_time, and last_error_xid, I think users might think that
> > first_error_time is the timestamp at which an error occurred for the
> > first time in the subscription worker.
>
> You mean you think users might think "first_error_time" is the
> timestamp at which the last_error first occurred (rather than the
> timestamp of the first of any type of error that occurred) on that
> worker?

I felt that "first_error_time" is the timestamp of the first of any
type of error that occurred on the worker.

>
> > ... Also, I'm not sure
> > last_error_count is not clear to me (it looks like showing something
> > count but the only "last" one?).
>
> It's the number of times that the last_error has occurred.
> Unless it's some kind of transient error, that might get resolved
> without intervention, logical replication will get stuck in a loop
> retrying and the last error will occur again and again, hence the
> count of how many times that has happened.
> Maybe there's not much benefit in counting different errors prior to
> the last error?

The name "last_error_count" is somewhat clear to me now. I had felt
that since the last error refers to *one* error that occurred last and
it’s odd there is the count of it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Failed transaction statistics to measure the logical replication progress

2021-11-23 Thread Masahiko Sawada
On Sat, Nov 20, 2021 at 3:25 PM Amit Kapila  wrote:
>
> On Fri, Nov 19, 2021 at 7:41 PM Masahiko Sawada  wrote:
> >
> > On Thu, Nov 18, 2021 at 12:26 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Nov 17, 2021 at 7:12 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Wednesday, November 17, 2021 10:00 PM Amit Kapila 
> > > >  wrote:
> > > > >
> > > > > Can you please tell us why you think the names in your proposed patch 
> > > > > are
> > > > > better than the existing names proposed in Sawada-San's patch? Is it 
> > > > > because
> > > > > those fields always contain the information of the last or latest 
> > > > > error that
> > > > > occurred in the corresponding subscription worker?
> > > > This is one reason.
> > > >
> > > > Another big reason comes from the final alignment when we list up all 
> > > > columns of both patches.
> > > > The patches in this thread is trying to introduce a column that 
> > > > indicates
> > > > cumulative count of error to show all error counts that the worker got 
> > > > in the past.
> > > >
> > >
> > > Okay, I see your point and it makes sense to rename columns after
> > > these other stats. I am not able to come up with any better names than
> > > what is being used here. Sawada-San, do you agree with this, or do let
> > > us know if you have any better ideas?
> > >
> >
> > I'm concerned that these new names will introduce confusion; if we
> > have last_error_relid, last_error_command, last_error_message,
> > last_error_time, and last_error_xid, I think users might think that
> > first_error_time is the timestamp at which an error occurred for the
> > first time in the subscription worker.
> >
>
> Isn't to some extent that confusion already exists because of
> last_error_time column?
>
> > Also, I'm not sure
> > last_error_count is not clear to me (it looks like showing something
> > count but the only "last" one?).
> >
>
> I feel if all the error related columns have "last_error_" as a prefix
> then it should not be that confusing?
>
> > An alternative idea would be to add
> > total_error_count by this patch, resulting in having both error_count
> > and total_error_count. Regarding commit_count and abort_count, I
> > personally think xact_commit and xact_rollback would be better since
> > they’re more consistent with pg_stat_database view, although we might
> > already have discussed that.
> >
>
> Even if we decide to change the column names to
> xact_commit/xact_rollback, I think with additional non-error columns
> it will be clear to add 'error' in column names corresponding to error
> columns, and last_error_* seems to be consistent with what we have in
> pg_stat_archiver (last_failed_wal, last_failed_time).

Okay, I agree that last_error_* columns will be consistent.

> Your point
> related to first_error_time has merit and I don't have a better answer
> for it. I think it is just a convenience column and we are not sure
> whether that will be required in practice so maybe we can drop that
> column and come back to it later once we get some field feedback on
> this view?

+1

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2021-11-23 Thread Justin Pryzby
On Mon, Nov 22, 2021 at 07:17:01PM +, Bossart, Nathan wrote:
> In an attempt to get this patch set off the ground again, I took a
> look at the first 5 patches.

> I haven't looked at the following patches too much, but I'm getting
> the idea that they might address a lot of the feedback above and that
> the first bunch of patches are more like staging patches that add the
> abilities without changing the behavior.  I wonder if just going
> straight to the end goal behavior might simplify the patch set a bit.
> I can't say I feel too strongly about this, but I figure I'd at least
> share my thoughts.

Thanks for looking.

The patches are separate since the early patches are the most necessary, least
disputable parts, to allow the possibility of (say) changing pg_ls_tmpdir()
without changing other functions, since pg_ls_tmpdir was the original
motivation behind this whole thread.

In a recent thread, Bharath Rupireddy added pg_ls functions for the logical
dirs, but expressed a preference not to add the metadata columns.  I still
think that at least "isdir" should be added to all the "ls" functions, since
it's easy to SELECT the columns you want, and a bit of a pain to write the
corresponding LATERAL query: 'dir' AS dir, pg_ls_dir(dir) AS ls,
pg_stat_file(ls) AS st.  I think it would be strange if pg_ls_tmpdir() were to
return a different set of columns than the other functions, even though admins
or extensions might have created dirs or other files in those directories.

Tom pointed out that we don't have a working lstat() for windows, so then it
seems like we're not yet ready to show file "types" (we'd show the type of the
link target, which is sometimes what's wanted, but not usually what "ls" would
show), nor ready to implement recurse.

As before:

On Tue, Apr 06, 2021 at 11:01:31AM -0500, Justin Pryzby wrote:
> The first handful of patches address the original issue, and I think could be
> "ready":
> 
> $ git log --oneline origin..pg-ls-dir-new |tac
> ... Document historic behavior of links to directories..
> ... Add tests on pg_ls_dir before changing it
> ... Add pg_ls_dir_metadata to list a dir with file metadata..
> ... pg_ls_tmpdir to show directories and "isdir" argument..
> ... pg_ls_*dir to show directories and "isdir" column..
> 
> These others are optional:
> ... pg_ls_logdir to ignore error if initial/top dir is missing..
> ... pg_ls_*dir to return all the metadata from pg_stat_file..
> 
> ..and these maybe requires more work for lstat on windows:
> ... pg_stat_file and pg_ls_dir_* to use lstat()..
> ... pg_ls_*/pg_stat_file to show file *type*..
> ... Preserve pg_stat_file() isdir..
> ... Add recursion option in pg_ls_dir_files..

rebased on 1922d7c6e1a74178bd2f1d5aa5a6ab921b3fcd34

-- 
Justin
>From 6bdb81194999cfe6b0ba4d850e4f31d022655a5b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 16 Mar 2020 14:12:55 -0500
Subject: [PATCH v31 01/11] Document historic behavior of links to
 directories..

Backpatch to 9.5: pg_stat_file
---
 doc/src/sgml/func.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 74d3087a72..d36479d86d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27410,7 +27410,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory.
+indicating if it is a directory (or a symbolic link to a directory).


 This function is restricted to superusers by default, but other users
-- 
2.17.0

>From d6068e34df8dd9ca8db1959010f1b41933732775 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 17 Mar 2020 13:16:24 -0500
Subject: [PATCH v31 02/11] Add tests on pg_ls_dir before changing it

---
 src/test/regress/expected/misc_functions.out | 24 
 src/test/regress/sql/misc_functions.sql  |  8 +++
 2 files changed, 32 insertions(+)

diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 1013d17f87..830de507e7 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -243,6 +243,30 @@ select count(*) > 0 from
  t
 (1 row)
 
+select * from (select pg_ls_dir('.', false, true) as name) as ls where ls.name='.'; -- include_dot_dirs=true
+ name 
+--
+ .
+(1 row)
+
+select * from (select pg_ls_dir('.', false, false) as name) as ls where ls.name='.'; -- include_dot_dirs=false
+ name 
+--
+(0 rows)
+
+select pg_ls_dir('does not exist', true, false); -- ok with missingok=true
+ pg_ls_dir 
+---
+(0 rows)
+
+select pg_ls_dir('does not exist'); -- fails with missingok=false
+ERROR:  could not open directory "does not exist": No 

Re: Building postgresql armv7 on emulated x86_64

2021-11-23 Thread Marek Kulik
What is weird is that this issue is only present in Fedora Rawhide;
older versions of Fedora are not affected. I couldn't pinpoint which
package update caused the issue. I ran a regression search over gcc and
the packages related to it, with no luck.

It seems to be an issue related to a bug in gcc. Here is related topic:
https://www.spinics.net/lists/fedora-devel/msg294295.html

On Mon, Nov 22, 2021 at 5:54 PM Tom Lane  wrote:

> > Marek Kulik  writes:
> >> I cannot compile postgresql for armv7. I tested a bunch of versions of
> gcc
> >> and all have the same issue.
>
> FWIW, I just tried a build with --enable-dtrace on up-to-date Fedora 35
> aarch64, and that worked fine.  So this definitely seems like a problem
> in the toolchain not in Postgres.
>
> regards, tom lane
>
>


Re: Post-CVE Wishlist

2021-11-23 Thread Heikki Linnakangas

On 23/11/2021 23:44, Robert Haas wrote:

On Tue, Nov 23, 2021 at 2:18 PM Tom Lane  wrote:

Jacob Champion  writes:

= Implicit TLS =


Aside from security, one small benefit of skipping the Starttls-style 
negotiation is that you avoid one round-trip to the server.



I think this idea is a nonstarter.  It breaks backwards compatibility
at the protocol level in order to fix entirely-hypothetical bugs.
Nobody is going to like that tradeoff.  Moreover, how shall the
server know whether an incoming connection is expected to use TLS
or not, if no prior communication is allowed?  "TLS is the only
supported case" is even more of a nonstarter than a protocol change.


I am not persuaded by this argument. Suppose we added a server option
like ssl_port which causes us to listen on an additional port and, on
that port, everything, from the first byte on this connection, is
encrypted using SSL. Then we have to also add a matching libpq option
which the client must set to tell the server that this is the
expectation. For people using URL connection strings, that might be as
simple as specifying postgresqls://whatever rather than
postgresql://whatever. With such a design, the feature is entirely
ignorable for those who don't wish to use it, but anyone who wants it
can get it relatively easily. I don't know how we'd ever deprecate the
current system, but we don't necessarily need to do so. LDAP allows
either kind of scheme -- you can either connect to a regular LDAP
port, usually port 389, and negotiate security, or you can connect to
a different port, usually 636, and use SSL from the start. I don't see
why that couldn't be a model for us, and I suspect that it would get
decent uptake.
One intriguing option is to allow starting the TLS handshake without the 
SSLRequest packet. On the same port. A TLS handshake and PostgreSQL 
SSLRequest/StartupMessage/CancelRequest can be distinguished by looking 
at the first few bytes. The server can peek at them, and if it looks 
like a TLS handshake, start TLS (or fail if it's not supported), 
otherwise proceed with the libpq protocol.
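
A minimal sketch of such a peek, assuming a plain POSIX socket API
(illustration only, not a patch): a TLS record starts with the content
type byte 0x16 (handshake), whereas any libpq first packet (SSLRequest,
StartupMessage, CancelRequest) starts with a four-byte big-endian length
whose first byte is 0x00 in practice.

#include <sys/socket.h>
#include <stdbool.h>

static bool
looks_like_tls_handshake(int sock)
{
	unsigned char first;

	/* MSG_PEEK leaves the byte in the buffer for the real read later */
	if (recv(sock, &first, 1, MSG_PEEK) != 1)
		return false;		/* EOF or error: let the normal path handle it */

	return first == 0x16;	/* TLS "handshake" record content type */
}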


This would make the transition smooth: we can add server support for 
this now, with an option in libpq to start using it. After some years, 
we can make it the default in libpq, but still fall back to the old 
method if it fails. After a long enough transition period, we can drop 
the old mechanism.


All that said, I'm not sure how serious I am about this. I think it 
would work, and it wouldn't even be very complicated, but it feels 
hacky, and that's not a good thing with anything security related. And 
the starttls-style negotiation isn't that bad, really. I'm inclined to 
do nothing I guess. Thoughts?



Now that being said, https://www.openldap.org/faq/data/cache/605.html
claims that ldaps (encrypt from the first byte) is deprecated in favor
of STARTTLS (encrypt by negotiation). It's interesting that Jacob is
proposing to introduce as a new and better option the thing they've
decided they don't like. I guess my question is - is either one truly
better, or is this just a vi vs. emacs type debate where different
people have different preferences? I'm really not sure.


Huh, that's surprising, I though the trend is towards implicit TLS. As 
another data point, RFC 8314 recommends implicit TLS for POP, IMAP and SMTP.


- Heikki




Re: Support for NSS as a libpq TLS backend

2021-11-23 Thread Joshua Brindle
On Tue, Nov 23, 2021 at 9:12 AM Daniel Gustafsson  wrote:
>
> > On 17 Nov 2021, at 19:42, Joshua Brindle  
> > wrote:
> > On Tue, Nov 16, 2021 at 1:26 PM Joshua Brindle
> >  wrote:
>
> >> I think there it a typo in the docs here that prevents them from
> >> building (this diff seems to fix it):
>
> Ah yes, thanks, I had noticed that one but forgot to send out a new version to
> make the CFBot green.
>
> > After a bit more testing, the server is up and running with an nss
> > database but before configuring the client database I tried connecting
> > and got a segfault:
>
> Interesting.  I'm unable to reproduce this crash, can you show the sequence of
> commands which led to this?

It no longer happens with v49, since the null deref of the pr_fd has
been fixed.

I'll continue testing now, so far it's looking better.

Did the build issue with --with-llvm get fixed in this update also? I
haven't tried building with it yet.

> > It looks like the ssl connection falls through to attempt a non-ssl
> > connection but at some point conn->ssl_in_use gets set to true,
> > despite pr_fd and nss_context being null.
>
> pgtls_close missed setting ssl_in_use to false, fixed in the attached.  I've
> also added some assertions to the connection setup for debugging this.
>
> > This patch fixes the segfault but I suspect is not the correct fix,
> > due to the error when connecting saying "Success":
>
> Right, without an SSL enabled FD we should never get here.
>

Thank you.




Re: Sequence's value can be rollback after a crashed recovery.

2021-11-23 Thread Bossart, Nathan
On 11/23/21, 1:41 PM, "Tom Lane"  wrote:
> I wrote:
>> I wonder though if we shouldn't try to improve the existing text.
>> The phrasing "never rolled back" seems like it's too easily
>> misinterpreted.  Maybe rewrite the <caution> block like
>> ...
>
> A bit of polishing later, maybe like the attached.

The doc updates look good to me.  Yesterday I suggested possibly
adding a way to ensure that nextval() called in an uncommitted
transaction was persistent, but I think we'd have to also ensure that
synchronous replication waits for those records, too.  Anyway, I don't
think it is unreasonable to require the transaction to be committed to
avoid duplicates from nextval().

Nathan



Re: Post-CVE Wishlist

2021-11-23 Thread Tom Lane
Robert Haas  writes:
> I am not persuaded by this argument. Suppose we added a server option
> like ssl_port which causes us to listen on an additional port and, on
> that port, everything, from the first byte on this connection, is
> encrypted using SSL.

Right, a separate port number (much akin to http 80 vs https 443) is
pretty much the only way this could be managed.  That's messy enough
that I don't see anyone wanting to do it for purely-hypothetical
benefits.  If we'd done it that way from the start, it'd be fine;
but there's way too much established practice now.

> Now that being said, https://www.openldap.org/faq/data/cache/605.html
> claims that ldaps (encrypt from the first byte) is deprecated in favor
> of STARTTLS (encrypt by negotiation). It's interesting that Jacob is
> proposing to introduce as a new and better option the thing they've
> decided they don't like.

Indeed, that is interesting.  I wonder if we can find the discussions
that led to that decision.

regards, tom lane




Re: Post-CVE Wishlist

2021-11-23 Thread Robert Haas
On Tue, Nov 23, 2021 at 2:18 PM Tom Lane  wrote:
> Jacob Champion  writes:
> > = Implicit TLS =
>
> I think this idea is a nonstarter.  It breaks backwards compatibility
> at the protocol level in order to fix entirely-hypothetical bugs.
> Nobody is going to like that tradeoff.  Moreover, how shall the
> server know whether an incoming connection is expected to use TLS
> or not, if no prior communication is allowed?  "TLS is the only
> supported case" is even more of a nonstarter than a protocol change.

I am not persuaded by this argument. Suppose we added a server option
like ssl_port which causes us to listen on an additional port and, on
that port, everything, from the first byte on this connection, is
encrypted using SSL. Then we have to also add a matching libpq option
which the client must set to tell the server that this is the
expectation. For people using URL connection strings, that might be as
simple as specifying postgresqls://whatever rather than
postgresql://whatever. With such a design, the feature is entirely
ignorable for those who don't wish to use it, but anyone who wants it
can get it relatively easily. I don't know how we'd ever deprecate the
current system, but we don't necessarily need to do so. LDAP allows
either kind of scheme -- you can either connect to a regular LDAP
port, usually port 389, and negotiate security, or you can connect to
a different port, usually 636, and use SSL from the start. I don't see
why that couldn't be a model for us, and I suspect that it would get
decent uptake.

Now that being said, https://www.openldap.org/faq/data/cache/605.html
claims that ldaps (encrypt from the first byte) is deprecated in favor
of STARTTLS (encrypt by negotiation). It's interesting that Jacob is
proposing to introduce as a new and better option the thing they've
decided they don't like. I guess my question is - is either one truly
better, or is this just a vi vs. emacs type debate where different
people have different preferences? I'm really not sure.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade parallelism

2021-11-23 Thread Tom Lane
Jacob Champion  writes:
> Right. What I'm worried about is, if disk space or write performance on
> the new cluster is a concern, then having a copy-mode upgrade silently
> use copy-on-write could be a problem if the DBA needs copy mode to
> actually copy.

Particularly for the cross-filesystem case, where it would not be
unreasonable to expect that one could dismount or destroy the old FS
immediately afterward.  I don't know if recent kernels try to make
that safe/transparent.

regards, tom lane




Re: Sequence's value can be rollback after a crashed recovery.

2021-11-23 Thread Tom Lane
I wrote:
> I wonder though if we shouldn't try to improve the existing text.
> The phrasing "never rolled back" seems like it's too easily
> misinterpreted.  Maybe rewrite the  block like
> ...

A bit of polishing later, maybe like the attached.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 74d3087a72..9e5ce3163a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17645,11 +17645,11 @@ SELECT setval('myseq', 42, false);Next nextval

 To avoid blocking concurrent transactions that obtain numbers from
-the same sequence, a nextval operation is never
-rolled back; that is, once a value has been fetched it is considered
-used and will not be returned again.  This is true even if the
-surrounding transaction later aborts, or if the calling query ends
-up not using the value.  For example an INSERT with
+the same sequence, the value obtained by nextval
+is not reclaimed for re-use if the calling transaction later aborts.
+This means that transaction aborts or database crashes can result in
+gaps in the sequence of assigned values.  That can happen without a
+transaction abort, too.  For example an INSERT with
 an ON CONFLICT clause will compute the to-be-inserted
 tuple, including doing any required nextval
 calls, before detecting any conflict that would cause it to follow
@@ -17661,8 +17661,22 @@ SELECT setval('myseq', 42, false);Next nextval
 

-Likewise, any sequence state changes made by setval
-are not undone if the transaction rolls back.
+Likewise, sequence state changes made by setval
+are immediately visible to other transactions, and are not undone if
+the calling transaction rolls back.
+   
+
+   
+If the database cluster crashes before committing a transaction
+containing a nextval
+or setval call, the sequence state change might
+not have made its way to persistent storage, so that it is uncertain
+whether the sequence will have its original or updated state after the
+cluster restarts.  This is harmless for usage of the sequence within
+the database, since other effects of uncommitted transactions will not
+be visible either.  However, if you wish to use a sequence value for
+persistent outside-the-database purposes, make sure that the
+nextval call has been committed before doing so.

   
 


Re: pg_upgrade parallelism

2021-11-23 Thread Jacob Champion
On Tue, 2021-11-23 at 13:51 -0600, Justin Pryzby wrote:
> 
> I guess you're concerned for someone who wants to be able to run pg_upgrade 
> and
> preserve the ability to start the old cluster in addition to the new.

Right. What I'm worried about is, if disk space or write performance on
the new cluster is a concern, then having a copy-mode upgrade silently
use copy-on-write could be a problem if the DBA needs copy mode to
actually copy.

--Jacob


Re: Sequence's value can be rollback after a crashed recovery.

2021-11-23 Thread Tom Lane
Tomas Vondra  writes:
> I see Tom speculated we may not flush WAL if a transaction only does
> nextval() in that other thread, but I don't think that's true. AFAICS if
> the nextval() call writes stuff to WAL, the RecordTransactionCommit will
> have wrote_xlog=true and a valid XID. And so it'll do the usual
> XLogFlush() etc.

Yeah.  I didn't look at the code during last year's discussion, but now
I have, and I see that if nextval_internal() decides it needs to make
a WAL entry, it is careful to ensure the xact has an XID:

/*
 * If something needs to be WAL logged, acquire an xid, so this
 * transaction's commit will trigger a WAL flush and wait for syncrep.
 * It's sufficient to ensure the toplevel transaction has an xid, no need
 * to assign xids subxacts, that'll already trigger an appropriate wait.
 * (Have to do that here, so we're outside the critical section)
 */
if (logit && RelationNeedsWAL(seqrel))
GetTopTransactionId();

So that eliminates my worry that RecordTransactionCommit would decide
it doesn't need to do anything.  If there was a WAL record, we will
flush it at commit (and not before).

As you say, this is exactly as durable as any other DML operation.
So I don't feel a need to change the code behavior.

The problematic situation seems to be where an application gets
a nextval() result and uses it for some persistent outside-the-DB
state, without having made sure that the nextval() was committed.
You could say that that's the same rookie error as relying on the
persistence of any other uncommitted DML ... except that at [1]
we say

To avoid blocking concurrent transactions that obtain numbers from the
same sequence, a nextval operation is never rolled back; that is, once
a value has been fetched it is considered used and will not be
returned again. This is true even if the surrounding transaction later
aborts, or if the calling query ends up not using the value.

It's not so unreasonable to read that as promising persistence over
crashes as well as xact aborts.  So I think we need to improve the docs
here.  A minimal fix would be to leave the existing text alone and add a
separate para to the <caution> block, along the lines of

However, the above statements do not apply if the database cluster
crashes before committing the transaction containing the nextval
operation.  In that case the sequence advance might not have made its
way to persistent storage, so that it is uncertain whether the same
value can be returned again after the cluster restarts.  If you wish
to use a nextval result for persistent outside-the-database purposes,
make sure that the nextval has been committed before doing so.

I wonder though if we shouldn't try to improve the existing text.
The phrasing "never rolled back" seems like it's too easily
misinterpreted.  Maybe rewrite the <caution> block like

To avoid blocking concurrent transactions that obtain numbers from the
same sequence, the value obtained by nextval is not reclaimed for
re-use if the calling transaction later aborts.  This means that
transaction aborts or database crashes can result in gaps in the
sequence of assigned values.  That can happen without a transaction
abort, too.
-- this text is unchanged: --
For example an INSERT with an ON CONFLICT clause will compute the
to-be-inserted tuple, including doing any required nextval calls,
before detecting any conflict that would cause it to follow the ON
CONFLICT rule instead. Such cases will leave unused “holes” in the
sequence of assigned values. Thus, PostgreSQL sequence objects cannot
be used to obtain “gapless” sequences.

Likewise, any sequence state changes made by setval are not undone if
the transaction rolls back.
-- end unchanged text --

If the database cluster crashes before committing the transaction
containing a nextval operation, the sequence advance might not yet
have made its way to persistent storage, so that it is uncertain
whether the same value can be returned again after the cluster
restarts.  This is harmless for usage of the nextval value within
that transaction, since its other effects will not be visible either.
However, if you wish to use a nextval result for persistent
outside-the-database purposes, make sure that the nextval operation
has been committed before doing so.
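
To illustrate the recommended pattern from the client side, here is a
minimal libpq sketch (error handling elided; the sequence name "s" is
just an example):

#include <libpq-fe.h>
#include <stdio.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* params from environment */
	PGresult   *res;
	char		value[32];

	res = PQexec(conn, "BEGIN");
	PQclear(res);

	res = PQexec(conn, "SELECT nextval('s')");
	snprintf(value, sizeof(value), "%s", PQgetvalue(res, 0, 0));
	PQclear(res);

	/* only after COMMIT returns OK has the advance reached the WAL */
	res = PQexec(conn, "COMMIT");
	if (PQresultStatus(res) == PGRES_COMMAND_OK)
		printf("safe for outside-the-database use: %s\n", value);
	PQclear(res);

	PQfinish(conn);
	return 0;
}

(A single-statement PQexec of nextval() behaves the same way, since the
implicit transaction commits before PQexec returns.)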

Thoughts?

regards, tom lane

[1] https://www.postgresql.org/docs/current/functions-sequence.html




Re: Sequence's value can be rollback after a crashed recovery.

2021-11-23 Thread Jeremy Schneider
On 11/23/21 05:49, Andy Fan wrote:
> 
> > I think at this thread[1], which claimed to get this issue even after
> > commit, I haven't tried it myself though.
> >
> > [1]
> 
> https://www.postgresql.org/message-id/flat/ea6485e3-98d0-24a7-094c-87f9d5f9b18f%40amazon.com#4cfe7217c829419b769339465e8c2915
> 
> 
> >
> 
> I did try, and I haven't been able to reproduce that behavior (on
> master, at least).
> 
> 
> I agree with this,  the commit would flush the xlog and persist the change. 

On that older thread, there were exact reproductions in the first email
from Vini - two of them - available here:

https://gist.github.com/vinnix/2fe148e3c42e11269bac5fcc5c78a8d1

Nathan helped me realize a mistake I made here.

The second reproduction involved having psql run nextval() inside of an
explicit transaction. I had assumed that the transaction would be
committed when psql closed the session without error. This is because in
Oracle SQLPlus (my original RDBMS background), the "exitcommit" setting
has a default value giving this behavior.

This was a silly mistake on my part. When PostgreSQL psql closes the
connection with an open transaction, it turns out that the PostgreSQL
server will abort the transaction rather than committing it. (Oracle
DBAs beware!)

Nonetheless, Vini's first reproduction did not make this same mistake.
It involved 10 psql sessions in parallel using implicit transactions,
suspending I/O (via the linux device mapper), and killing PG while the
I/O is suspended.

Given my mistake on the second repro, I want to look a little closer at
this first reproduction and revisit whether it's actually demonstrating
a corner case where one could claim that durability isn't being handled
correctly - that "COMMIT" is returning successfully to the application,
and yet the sequence numbers are being repeated. Maybe there's something
involving the linux I/O path coming into play here.

-Jeremy


-- 
http://about.me/jeremy_schneider




Re: Avoiding smgrimmedsync() during nbtree index builds

2021-11-23 Thread Melanie Plageman
On Fri, Nov 19, 2021 at 3:11 PM Melanie Plageman
 wrote:
>
> On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
>  wrote:
> >
> > So, I've written a patch which avoids doing the immediate fsync for
> > index builds either by using shared buffers or by queueing sync requests
> > for the checkpointer. If a checkpoint starts during the index build and
> > the backend is not using shared buffers for the index build, it will
> > need to do the fsync.
>
> I've attached a rebased version of the patch (old patch doesn't apply).
>
> With the patch applied (compiled at O2), creating twenty empty tables in
> a transaction with a text column and an index on another column (like in
> the attached SQL [make a test_idx schema first]) results in a fairly
> consistent 15-30% speedup on my laptop (timings still in tens of ms -
> avg 50 ms to avg 65 ms so run variation affects the % a lot).
> Reducing the number of fsync calls from 40 to 1 was what likely causes
> this difference.

Correction for the above: I haven't worked on mac in a while and didn't
realize that wal_sync_method=fsync was not enough to ensure that all
buffered data would actually be flushed to disk on mac (which was
required for my test).

Setting wal_sync_method to fsync_writethrough with my small test I see
over a 5-6X improvement in time taken - from 1 second average to 0.2
seconds average. And running Andres' "createlots.sql" test, I see around
a 16x improvement - from around 11 minutes to around 40 seconds. I ran
it on a laptop running macos and other than wal_sync_method, I only
changed shared_buffers (to 1GB).

- Melanie




Re: prevent immature WAL streaming

2021-11-23 Thread Alvaro Herrera
On 2021-Nov-23, Tom Lane wrote:

> We're *still* not out of the woods with 026_overwrite_contrecord.pl,
> as we are continuing to see occasional "mismatching overwritten LSN"
> failures, further down in the test where it tries to start up the
> standby:

Augh.

> Looking at adjacent successful runs, it seems that the exact point
> where the "missing contrecord" starts varies substantially, even after
> our previous fix to disable autovacuum in this test.  How could that be?

Well, there is intentionally some variability.  Maybe not as much as one
would wish, but I expect that that should explain why that point is not
always the same.

> It's probably for the best though, because I think this is exposing
> an actual bug that we would not have seen if the start point were
> completely consistent.  I have not dug into the code, but it looks to
> me like if the "consistent recovery state" is reached exactly at a
> page boundary (0/1FFE000 in all these cases), then the standby expects
> that to be what the OVERWRITE_CONTRECORD record will point at.  But
> actually it points to the first WAL record on that page, resulting
> in a bogus failure.

So what is happening is that we set state->overwrittenRecPtr to the LSN
of page start, ignoring the page header.  Is that the LSN of the first
record in a page?  I'll see if I can reproduce the problem.
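
If that theory is right, the fix might be as simple as normalizing a
page-start pointer to the first record's LSN before the comparison.
Roughly, using the xlog header constants (an untested sketch, just to
show the shape of it):

/* If ptr sits exactly on a page boundary, the first record actually
 * begins after the page header (a long header at a segment boundary,
 * a short one otherwise). */
static XLogRecPtr
skip_page_header(XLogRecPtr ptr)
{
	if (ptr % XLOG_BLCKSZ == 0)
	{
		if (ptr % wal_segment_size == 0)
			ptr += SizeOfXLogLongPHD;
		else
			ptr += SizeOfXLogShortPHD;
	}
	return ptr;
}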

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La persona que no quería pecar / estaba obligada a sentarse
 en duras y empinadas sillas/ desprovistas, por cierto
 de blandos atenuantes"  (Patricio Vogel)




Re: pg_upgrade parallelism

2021-11-23 Thread Justin Pryzby
On Tue, Nov 23, 2021 at 06:54:03PM +, Jacob Champion wrote:
> On Wed, 2021-11-17 at 14:34 -0600, Justin Pryzby wrote:
> > On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> > > 
> > > - why we read()/write() at all? is not a faster way of copying the file?
> > >   i'm asking that because i don't actually know.
> > 
> > No portable way.  Linux has this:
> > https://man7.org/linux/man-pages/man2/copy_file_range.2.html
> > 
> > But I just read:
> > 
> > >   First support for cross-filesystem copies was introduced in Linux
> > >   5.3.  Older kernels will return -EXDEV when cross-filesystem
> > >   copies are attempted.
> > 
> > To me that sounds like it may not be worth it, at least not quite yet.
> > But it would be good to test.

I realized that pg_upgrade doesn't copy between filesystems - it copies from
$tablespace/PG13/NNN to $tablespace/PG14/NNN.  So that's no issue.

And I did a bit of testing with this last weekend, and saw no performance
benefit from a larger buffersize, nor from copy_file_range, nor from libc stdio
(fopen/fread/fwrite/fclose).
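
For what it's worth, a copy loop built on copy_file_range(2) might look
roughly like this (a hypothetical sketch, not the pg_upgrade code or the
exact test I ran; Linux-only, needs _GNU_SOURCE):

#define _GNU_SOURCE
#include <unistd.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Copy "len" bytes between two open fds.  Returns false so the caller
 * can fall back to plain read()/write() when the syscall is missing
 * (pre-4.5 kernels: ENOSYS) or refuses a cross-filesystem copy
 * (pre-5.3 kernels: EXDEV).
 */
static bool
copy_fd_range(int src_fd, int dst_fd, size_t len)
{
	while (len > 0)
	{
		ssize_t		nbytes = copy_file_range(src_fd, NULL, dst_fd, NULL,
											 len, 0);

		if (nbytes < 0)
			return false;	/* caller inspects errno (ENOSYS/EXDEV) */
		if (nbytes == 0)
			break;			/* unexpected EOF */
		len -= nbytes;
	}
	return true;
}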

> I think a downside of copy_file_range() is that filesystems might
> perform a reflink under us, and to me that seems like something that
> needs to be opted into via clone mode.

You're referring to this:

|   copy_file_range()  gives  filesystems an opportunity to implement "copy
|   acceleration" techniques, such as the use of reflinks (i.e., two or more
|   i-nodes that share pointers to the same copy-on-write disk blocks) or
|   server-side-copy (in the case of NFS).

I don't see why that's an issue though?  It's COW, not a hardlink.  It'd be the
same as if the filesystem implemented deduplication, right?  Postgres shouldn't
notice nor care.

I guess you're concerned for someone who wants to be able to run pg_upgrade and
preserve the ability to start the old cluster in addition to the new.  But
that'd work fine on a COW filesystem, right?

> (https://lwn.net/Articles/846403/ is also good reading on some sharp
> edges, though I doubt many of them apply to our use case.)

Yea, it doesn't seem the issues are relevant, other than to indicate that the
syscall is still evolving, which supports my initial conclusion.

-- 
Justin




Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?

2021-11-23 Thread Thomas Munro
Here's a patch for Linux and also FreeBSD.  The latter OS decided to
turn on ASLR by default recently, causing my workstation to fail like
this quite reliably, which reminded me to follow up with this.  It
disables ASLR in pg_ctl and pg_regress, which is enough for check and
check-world, but doesn't help you if you run the server directly
(unlike the different hack done for macOS).

For whatever random reason the failures are rarer on Linux (could be
my imagination, but I think they might be clustered, I didn't look
into the recipe for the randomness), but even without reproducing a
failure it's clear to see using pmap that this has the right effect.
I didn't bother with a check for the existence of ADDR_NO_RANDOMIZE
because it's since 2.6.12 which is definitely ancient enough.
From e615dc88c89142a1b7ecf7ba6f6f01ea59c61be5 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 17 Nov 2021 20:25:19 +1300
Subject: [PATCH] Make EXEC_BACKEND work reliably on Linux and FreeBSD.

Try to disable ASLR when building in EXEC_BACKEND mode (developer only),
to avoid random memory mapping failures.

Suggested-by: Andres Freund 
Discussion: https://postgr.es/m/20210806032944.m4tz7j2w47mant26%40alap3.anarazel.de
---
 configure |  2 +-
 configure.ac  |  1 +
 src/bin/pg_ctl/pg_ctl.c   |  4 
 src/common/exec.c | 32 
 src/include/pg_config.h.in|  3 +++
 src/include/port.h| 11 +++
 src/test/regress/pg_regress.c |  4 
 7 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/configure b/configure
index ded80be880..9b7033d7dd 100755
--- a/configure
+++ b/configure
@@ -13427,7 +13427,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h
 fi
 
 
-for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h
+for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/personality.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
diff --git a/configure.ac b/configure.ac
index 775e5c4436..9602b04e36 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1376,6 +1376,7 @@ AC_CHECK_HEADERS(m4_normalize([
 	sys/epoll.h
 	sys/event.h
 	sys/ipc.h
+	sys/personality.h
 	sys/prctl.h
 	sys/procctl.h
 	sys/pstat.h
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 7fbbe7022e..eadb5c2297 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -451,6 +451,10 @@ start_postmaster(void)
 	fflush(stdout);
 	fflush(stderr);
 
+#ifdef EXEC_BACKEND
+	pg_disable_aslr();
+#endif
+
 	pm_pid = fork();
 	if (pm_pid < 0)
 	{
diff --git a/src/common/exec.c b/src/common/exec.c
index 81b810d4cf..e5d9174333 100644
--- a/src/common/exec.c
+++ b/src/common/exec.c
@@ -25,6 +25,14 @@
 #include 
 #include 
 
+#ifdef EXEC_BACKEND
+#if defined(HAVE_SYS_PERSONALITY_H)
+#include 
+#elif defined(HAVE_SYS_PROCCTL_H)
+#include 
+#endif
+#endif
+
 /*
  * Hacky solution to allow expressing both frontend and backend error reports
  * in one macro call.  First argument of log_error is an errcode() call of
@@ -470,6 +478,30 @@ set_pglocale_pgservice(const char *argv0, const char *app)
 	}
 }
 
+#ifdef EXEC_BACKEND
+/*
+ * For the benefit of PostgreSQL developers testing EXEC_BACKEND code paths on
+ * Unix systems (ie paths normally exercised only on Windows), provide a way to
+ * disable ASLR so that we avoid memory mapping failures in developer-only builds, if we
+ * know how on this platform.  (See also the macOS-specific hack in
+ * sysv_shmem.c.)
+ */
+int
+pg_disable_aslr(void)
+{
+#if defined(HAVE_SYS_PERSONALITY_H)
+	return personality(ADDR_NO_RANDOMIZE);
+#elif defined(HAVE_SYS_PROCCTL_H) && defined(PROC_ASLR_FORCE_DISABLE)
+	int			data = PROC_ASLR_FORCE_DISABLE;
+
+	return procctl(P_PID, 0, PROC_ASLR_CTL, &data);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+#endif
+
 #ifdef WIN32
 
 /*
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index ca3592465e..a3d658644a 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -617,6 +617,9 @@
 /* Define to 1 if you have the  header file. */
 #undef HAVE_SYS_IPC_H
 
+/* Define to 1 if you have the  header file. */
+#undef HAVE_SYS_PERSONALITY_H
+
 /* Define to 1 if you have the  header file. */
 #undef HAVE_SYS_PRCTL_H
 
diff --git a/src/include/port.h b/src/include/port.h
index 49b4d38131..94b0f23837 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -42,6 +42,12 @@ typedef SOCKET 

Re: Post-CVE Wishlist

2021-11-23 Thread Tom Lane
Jacob Champion  writes:
> = Implicit TLS =

I think this idea is a nonstarter.  It breaks backwards compatibility
at the protocol level in order to fix entirely-hypothetical bugs.
Nobody is going to like that tradeoff.  Moreover, how shall the
server know whether an incoming connection is expected to use TLS
or not, if no prior communication is allowed?  "TLS is the only
supported case" is even more of a nonstarter than a protocol change.

> = Client-Side Auth Selection =

> The second request is for the client to stop fully trusting the server
> during the authentication phase. If I tell libpq to use a client
> certificate, for example, I don't think the server should be allowed to
> extract a plaintext password from my environment (at least not without
> my explicit opt-in).

Yeah.  I don't recall whether it's been discussed in public or not,
but it certainly seems like libpq should be able to be configured so
that (for example) it will never send a cleartext password.  It's not
clear to me what extent of configurability would be useful, and I
don't want to overdesign it --- but that much at least would be a
good thing.

> ... Implementations could range from a simple "does the server's
> auth method match the single one I expect" to a full SASL mechanism
> negotiation.

Again, any sort of protocol change seems like a nonstarter from a
cost/benefit standpoint, even before you get to the question of
whether a downgrade attack could defeat it.  I'm envisioning just
having local configuration (probably in the connection string) that
tells libpq to fail the connection upon seeing certain auth requests.
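To illustrate the shape of such a check (not a concrete proposal), here is
a sketch of an allowlist test at the point where libpq dispatches on an
incoming AuthenticationRequest ('R') message.  The AUTH_REQ_* codes are the
real values from pqcomm.h; the mask-building connection option is invented:

#include <stdbool.h>
#include <stdint.h>

#define AUTH_REQ_OK        0    /* authentication complete */
#define AUTH_REQ_PASSWORD  3    /* cleartext password */
#define AUTH_REQ_MD5       5    /* md5 challenge */
#define AUTH_REQ_SASL      10   /* SASL/SCRAM negotiation */

/* one bit per AUTH_REQ_* code */
static bool
auth_method_allowed(uint32_t areq, uint32_t allowed_mask)
{
    if (areq == AUTH_REQ_OK)
        return true;            /* completing authentication is always fine */
    if (areq > 31)
        return false;
    return (allowed_mask & (UINT32_C(1) << areq)) != 0;
}

/*
 * A conninfo setting like the (invented) allowed_auth_methods=scram would
 * map to allowed_mask = 1 << AUTH_REQ_SASL, so a server (or MITM)
 * demanding AUTH_REQ_PASSWORD fails the connection instead of extracting
 * a cleartext password.
 */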

regards, tom lane




Re: prevent immature WAL streaming

2021-11-23 Thread Tom Lane
We're *still* not out of the woods with 026_overwrite_contrecord.pl,
as we are continuing to see occasional "mismatching overwritten LSN"
failures, further down in the test where it tries to start up the
standby:

  sysname   |    branch     |      snapshot       |     stage     |                                                   l
------------+---------------+---------------------+---------------+-------------------------------------------------------------------------------------------------------
 spurfowl   | REL_13_STABLE | 2021-10-18 03:56:26 | recoveryCheck | 2021-10-18 00:08:09.324 EDT [2455:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 sidewinder | HEAD          | 2021-10-19 04:32:36 | recoveryCheck | 2021-10-19 06:46:23.168 CEST [26393:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 francolin  | REL9_6_STABLE | 2021-10-26 01:41:39 | recoveryCheck | 2021-10-26 01:48:05.646 UTC [3417202][][1/0:0] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-05 00:20:03 | recoveryCheck | 2021-11-05 02:58:12.146 CET [61848fb3.28d157:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 lapwing    | REL_11_STABLE | 2021-11-05 17:24:49 | recoveryCheck | 2021-11-05 17:39:29.741 UTC [9831:6] FATAL:  mismatching overwritten LSN 0/1FFE014 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-10 02:51:12 | recoveryCheck | 2021-11-10 04:03:33.576 CET [73561:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-16 15:20:03 | recoveryCheck | 2021-11-16 18:16:47.875 CET [6193e77f.35b87f:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-17 03:45:36 | recoveryCheck | 2021-11-17 04:57:04.359 CET [32089:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 spurfowl   | REL_10_STABLE | 2021-11-22 22:21:03 | recoveryCheck | 2021-11-22 17:29:35.520 EST [16011:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
(9 rows)

Looking at adjacent successful runs, it seems that the exact point
where the "missing contrecord" starts varies substantially, even after
our previous fix to disable autovacuum in this test.  How could that be?

It's probably for the best though, because I think this is exposing
an actual bug that we would not have seen if the start point were
completely consistent.  I have not dug into the code, but it looks to
me like if the "consistent recovery state" is reached exactly at a
page boundary (0/1FFE000 in all these cases), then the standby expects
that to be what the OVERWRITE_CONTRECORD record will point at.  But
actually it points to the first WAL record on that page, resulting
in a bogus failure.

regards, tom lane




Re: pg_upgrade parallelism

2021-11-23 Thread Jacob Champion
On Wed, 2021-11-17 at 14:34 -0600, Justin Pryzby wrote:
> On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> > 
> > - why we read()/write() at all? is not a faster way of copying the file?
> >   i'm asking that because i don't actually know.
> 
> No portable way.  Linux has this:
> https://man7.org/linux/man-pages/man2/copy_file_range.2.html
> 
> But I just read:
> 
> >   First support for cross-filesystem copies was introduced in Linux
> >   5.3.  Older kernels will return -EXDEV when cross-filesystem
> >   copies are attempted.
> 
> To me that sounds like it may not be worth it, at least not quite yet.
> But it would be good to test.

I think a downside of copy_file_range() is that filesystems might
perform a reflink under us, and to me that seems like something that
needs to be opted into via clone mode.

(https://lwn.net/Articles/846403/ is also good reading on some sharp
edges, though I doubt many of them apply to our use case.)

--Jacob


Re: xlog.c: removing ReadRecPtr and EndRecPtr

2021-11-23 Thread Robert Haas
On Wed, Nov 17, 2021 at 5:01 PM Robert Haas  wrote:
> In rescanLatestTimeLine(), the problem is IMHO probably serious enough
> to justify a separate commit with back-patching.

On second thought, I think it's better not to back-patch this fix. It
turns out that, while it's easy enough to back-patch the proposed
patch, it doesn't make the test case pass in pre-v14 versions. The
test case itself requires some changes to work at all because of the
perl module renaming, but that's not a big deal. The issue is that, in
the back-branches, when starting up the server without any local WAL,
rescanLatestTimeLine() is checking with not only the wrong LSN but
also with the wrong TLI. That got fixed in master by commit
4a92a1c3d1c361ffb031ed05bf65b801241d7cdd even though, rather
unfortunately, the commit message does not say so. So to back-patch
this commit we would need to also back-patch much of that commit. But
that commit depends on the other commits that reduced use of
ThisTimeLineID. Untangling that seems like an undue amount of work and
risk for a corner-case bug fix that was discovered in the lab rather
than in the field and which won't matter anyway if you do things
correctly. So now I'm intending to commit to just to master only.

Attached please find the test case not for commit as
v2-0001-dubious-test-case.patch; the fix, for commit and now with a
proper commit message as
v2-0002-Fix-corner-case-failure-to-detect-improper-timeli.patch; and
the back-ported test case for v14 as v14-0001-dubious-test-case.patch
in case anyone wants to play around with that. (with apologies for
using the idea of a version number in two different and conflicting
senses)

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-dubious-test-case.patch
Description: Binary data


v2-0002-Fix-corner-case-failure-to-detect-improper-timeli.patch
Description: Binary data


v14-0001-dubious-test-case.patch
Description: Binary data


Post-CVE Wishlist

2021-11-23 Thread Jacob Champion
Hello all,

Now that the MITM CVEs are published [1], I wanted to share my wishlist
of things that would have made those attacks difficult/impossible to
pull off.

= Implicit TLS =

The frontend/backend protocol uses a STARTTLS-style negotiation, which
has had a fair number of implementation vulnerabilities in the SMTP
space [2] and is where I got the idea for the attacks to begin with.
There is a large amount of plaintext traffic on the wire that happens
before either party trusts the other, and there is plenty of code that
touches that untrusted traffic.

An implicit TLS flow would put the TLS handshake at the very beginning
of the connection, before any of the frontend/backend protocol is
spoken on the wire. (You all have daily experience with this flow, by
way of HTTPS.) So a verify-full client would know that the server is
the expected recipient, and that the connection is encrypted, before
ever sending a single byte of Postgres-specific communication, and
before receiving potential error messages. Similarly, a
clientcert=verify-* server would have already partially authenticated
the client before ever receiving a startup packet, and it can trust
that any messages sent during the handshake are received without
tampering.
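As a rough illustration of what "handshake first" means on the client side
(using OpenSSL directly; the function and its structure are invented for
this sketch, and error handling is elided):

#include <openssl/ssl.h>

/* "sock" is an already-connected TCP socket. */
static SSL *
implicit_tls_connect(int sock, const char *host)
{
    SSL_CTX    *ctx = SSL_CTX_new(TLS_client_method());
    SSL        *ssl;

    SSL_CTX_set_default_verify_paths(ctx);
    SSL_CTX_set_verify(ctx, SSL_VERIFY_PEER, NULL);    /* verify-full spirit */

    ssl = SSL_new(ctx);
    SSL_set1_host(ssl, host);      /* hostname checked during the handshake */
    SSL_set_fd(ssl, sock);

    if (SSL_connect(ssl) != 1)     /* TLS first: no cleartext negotiation */
        return NULL;

    /* Only now would the PostgreSQL startup packet go out, via SSL_write(). */
    return ssl;
}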

This has downsides. Backwards compatibility tends to reintroduce
downgrade attacks, which defeats the goal of implicit TLS. So DBAs
would probably have to either perform a hard cutover to the new system
(with all-new clients ready to go), or else temporarily introduce a new
port for the implicit TLS mode (similar to HTTP/HTTPS separation) while
older clients are migrated.

And obviously this doesn't help you if you're already using a different
form of encryption (GSSAPI). I'm ignorant of the state of the art for
implicit encryption using that method. But for DBAs who are already
deploying TLS as their encryption layer and want to force its use for
every client, I think this would be a big step forward.

= Client-Side Auth Selection =

The second request is for the client to stop fully trusting the server
during the authentication phase. If I tell libpq to use a client
certificate, for example, I don't think the server should be allowed to
extract a plaintext password from my environment (at least not without
my explicit opt-in).

Auth negotiation has been proposed a couple of times on the list (see
for example [3]), and it's already been narrowly applied for SCRAM's
channel binding, since allowing a server to sidestep the client
authentication defeats the purpose. But I'd like to see this applied
generally, so that if the server sends an authentication request that
my client hasn't been explicitly configured to use, the connection
fails. Implementations could range from a simple "does the server's
auth method match the single one I expect" to a full SASL mechanism
negotiation.

WDYT? Are these worth pursuing in the near future?

--Jacob

[1] 
https://www.postgresql.org/about/news/postgresql-141-135-129-1114-1019-and-9624-released-2349/
[2] 
https://www.feistyduck.com/bulletproof-tls-newsletter/issue_80_vulnerabilities_show_fragility_of_starttls
[3] 
https://www.postgresql.org/message-id/flat/CAB7nPqS-aFg0iM3AQOJwKDv_0WkAedRjs1W2X8EixSz%2BsKBXCQ%40mail.gmail.com








Mop-up from Test::More version change patch

2021-11-23 Thread Tom Lane
[ moving thread to -hackers for a bit more visibility ]

Attached are a couple of patches I propose in the wake of commit
405f32fc4 (Require version 0.98 of Test::More for TAP tests).

0001 responds to the failure we saw on buildfarm member wrasse [1]
where, despite configure having carefully checked for Test::More
being >= 0.98, actual tests failed with
Test::More version 0.98 required--this is only version 0.92 at 
/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/test/perl/PostgreSQL/Test/Utils.pm
 line 63.
The reason is that wrasse was choosing "prove" from a different
Perl installation than "perl", as a result of its configuration
having set PERL to a nondefault place but doing nothing about PROVE.

We already installed a couple of mitigations for that:
(a) as of c4fe3199a, configure checks "prove" not "perl" for
appropriate module versions;
(b) Noah has modified wrasse's configuration to set PROVE.
But I'm of the opinion that (b) should not be necessary.
If you set PERL then it's highly likely that you want to use
"prove" from the same installation.  So 0001 modifies configure
to first check for an executable "prove" in the same directory
as $PERL.  If that's not what you want then you should override
it by setting PROVE explicitly.

Since this is mainly meant to prevent an easy-to-make error in
setting up buildfarm configurations, we should back-patch it.

0002 is written to apply to v14 and earlier, and what it wants
to do is back-patch the effects of 405f32fc4, so that the
minimum Test::More version is 0.98 in all branches.  The thought
here is that (1) somebody is likely to want to back-patch a
test involving Test::More::subtest before too long; (2) we have
zero coverage for older Test::More versions anyway, now that
all buildfarm members have been updated to work with HEAD.

Any objections?

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse=2021-11-20%2022%3A58%3A17

diff --git a/configure b/configure
index 896b781473..a86a31fb42 100755
--- a/configure
+++ b/configure
@@ -19411,7 +19411,17 @@ fi
 #
 if test "$enable_tap_tests" = yes; then
   # Make sure we know where prove is.
+  # Prefer a prove located in the same directory as $PERL,
+  # otherwise search $PATH.
   if test -z "$PROVE"; then
+if test -x "`dirname "$PERL"`/prove"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for prove" >&5
+$as_echo_n "checking for prove... " >&6; }
+  PROVE="`dirname "$PERL"`/prove"
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $PROVE" >&5
+$as_echo "$PROVE" >&6; }
+else
+  if test -z "$PROVE"; then
   for ac_prog in prove
 do
   # Extract the first word of "$ac_prog", so it can be a program name with args.
@@ -19465,8 +19475,10 @@ $as_echo_n "checking for PROVE... " >&6; }
 $as_echo "$PROVE" >&6; }
 fi
 
-  if test -z "$PROVE"; then
-as_fn_error $? "prove not found" "$LINENO" 5
+  if test -z "$PROVE"; then
+as_fn_error $? "prove not found" "$LINENO" 5
+  fi
+fi
   fi
   # Check for necessary Perl modules.  You might think we should use
   # AX_PROG_PERL_MODULES here, but prove might be part of a different Perl
diff --git a/configure.ac b/configure.ac
index b50130b323..98cd5ae12e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2378,9 +2378,19 @@ PGAC_PATH_PROGS(DBTOEPUB, dbtoepub)
 #
 if test "$enable_tap_tests" = yes; then
   # Make sure we know where prove is.
-  PGAC_PATH_PROGS(PROVE, prove)
+  # Prefer a prove located in the same directory as $PERL,
+  # otherwise search $PATH.
   if test -z "$PROVE"; then
-AC_MSG_ERROR([prove not found])
+if test -x "`dirname "$PERL"`/prove"; then
+  AC_MSG_CHECKING(for prove)
+  PROVE="`dirname "$PERL"`/prove"
+  AC_MSG_RESULT([$PROVE])
+else
+  PGAC_PATH_PROGS(PROVE, prove)
+  if test -z "$PROVE"; then
+AC_MSG_ERROR([prove not found])
+  fi
+fi
   fi
   # Check for necessary Perl modules.  You might think we should use
   # AX_PROG_PERL_MODULES here, but prove might be part of a different Perl
diff --git a/config/check_modules.pl b/config/check_modules.pl
index 399ac5e3b6..cc0a7ab0e7 100644
--- a/config/check_modules.pl
+++ b/config/check_modules.pl
@@ -11,7 +11,7 @@ use IPC::Run 0.79;
 
 # Test::More and Time::HiRes are supposed to be part of core Perl,
 # but some distros omit them in a minimal installation.
-use Test::More 0.87;
+use Test::More 0.98;
 use Time::HiRes 1.52;
 
 # While here, we might as well report exactly what versions we found.
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 10bf31755e..f607ee3a4b 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -58,9 +58,8 @@ use File::Temp ();
 use IPC::Run;
 use SimpleTee;
 
-# specify a recent enough version of Test::More to support the
-# done_testing() function
-use Test::More 0.87;
+# We need a version of Test::More recent enough to support subtests
+use Test::More 0.98;
 
 

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-11-23 Thread John Naylor
Hi,

I've looked over this patch set and email thread a couple times, and I
don't see anything amiss, but I'm also not terribly familiar with the
subsystems this part of the code relies on. I haven't yet tried to stress
test with a large database, but it seems like a good idea to do so.

I have a couple comments and questions:

0006:

+ * XXX We can optimize RelationMapOidToFilenodeForDatabase API
+ * so that instead of reading the relmap file every time, it can
+ * save it in a temporary variable and use it for subsequent
+ * calls.  Then later reset it once we're done or at the
+ * transaction end.

Do we really need to consider optimizing here? Only a handful of relations
will be found in the relmap, right?

+ * Once we start copying files from the source database, we need to be able
+ * to clean 'em up if we fail.  Use an ENSURE block to make sure this
+ * happens.  (This is not a 100% solution, because of the possibility of
+ * failure during transaction commit after we leave this routine, but it
+ * should handle most scenarios.)

This comment in master started with

- * Once we start copying subdirectories, we need to be able to clean 'em

Is the distinction important enough to change this comment? Also, is "most
scenarios" still true with the patch? I haven't read into how ENSURE works.

Same with this comment change, seems fine the way it was:

- * Use an ENSURE block to make sure we remove the debris if the copy fails
- * (eg, due to out-of-disk-space).  This is not a 100% solution, because
- * of the possibility of failure during transaction commit, but it should
- * handle most scenarios.
+ * Use an ENSURE block to make sure we remove the debris if the copy fails.
+ * This is not a 100% solution, because of the possibility of failure
+ * during transaction commit, but it should handle most scenarios.

And do we need additional tests? Maybe we don't, but it seems good to make
sure.

I haven't looked at 0007, and I have no opinion on which approach is better.

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: row filtering for logical replication

2021-11-23 Thread vignesh C
On Tue, Nov 23, 2021 at 4:58 PM Ajin Cherian  wrote:
>
> Attaching a new patchset v41 which includes changes by both Peter and myself.

Few comments on v41-0002 patch:
1) Tab completion should be handled for completion of "WITH(" in
"create publication pub1 for table t1 where (c1 > 10)":
@@ -2757,10 +2765,13 @@ psql_completion(const char *text, int start, int end)
else if (Matches("CREATE", "PUBLICATION", MatchAny, "FOR",
"ALL", "TABLES"))
COMPLETE_WITH("IN SCHEMA", "WITH (");
else if (Matches("CREATE", "PUBLICATION", MatchAny, "FOR",
"TABLE", MatchAny))
-   COMPLETE_WITH("WITH (");
+   COMPLETE_WITH("WHERE (", "WITH (");
/* Complete "CREATE PUBLICATION  FOR TABLE" with ", ..." */
else if (Matches("CREATE", "PUBLICATION", MatchAny, "FOR", "TABLE"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables, NULL);
+   /* "CREATE PUBLICATION  FOR TABLE  WHERE (" -
complete with table attributes */
+   else if (HeadMatches("CREATE", "PUBLICATION", MatchAny) &&
TailMatches("WHERE", "("))
+   COMPLETE_WITH_ATTR(prev3_wd, "");

2) Tab completion completes with "WHERE (" in case of "alter
publication pub1 add table t1,":
+   /* ALTER PUBLICATION  SET TABLE  */
+   /* ALTER PUBLICATION  ADD TABLE  */
+   else if (Matches("ALTER", "PUBLICATION", MatchAny, "SET|ADD", "TABLE", MatchAny))
+   COMPLETE_WITH("WHERE (");

Should this be changed to:
+   /* ALTER PUBLICATION  SET TABLE  */
+   /* ALTER PUBLICATION  ADD TABLE  */
+   else if (Matches("ALTER", "PUBLICATION", MatchAny, "SET|ADD", "TABLE", MatchAny) && !ends_with(prev_wd, ','))
+   COMPLETE_WITH("WHERE (");

Regards,
Vignesh




Re: Reduce function call costs on ELF platforms

2021-11-23 Thread Peter Eisentraut

On 22.11.21 23:32, Tom Lane wrote:

The easier approach for this class of issues is to use the linker option
-Bsymbolic.

I don't recall details, but we've previously rejected the idea of
trying to use -Bsymbolic widely; apparently it has undesirable
side-effects on some platforms.  See commit message for e3d77ea6b
(hopefully there's some detail in the email thread [1]).  It sounds
like you're not actually proposing that, but I thought it would be
a good idea to note the hazard here.


Also, IIRC, -Bsymbolic was once frowned upon by packaging policies, 
since it prevents use of LD_PRELOAD.  I'm not sure what the current 
thinking there is, however.





Re: [RFC] ASOF Join

2021-11-23 Thread Chapman Flack
On 11/23/21 10:41, Isaac Morland wrote:
> Umm, it's definitely negative:
> 
> odyssey=> select '1 month -31 days +12:00:00'::interval < '0 months'::interval;
> --
>  t

Well, what you've shown here is that it's "negative" according to
an arbitrary total ordering imposed in interval_cmp_value for the purpose
of making it indexable in a btree ...

> It's just that due to the complexities of our calendar/time systems, adding
> it to a timestamp can move the timestamp in either direction:

... and this is just another way of saying that said arbitrary choice of
btree ordering can't be used to tell you whether the interval is
semantically positive or negative. (Of course, for a great many intervals,
the two answers will be the same, but they're still answers to different
questions.)

> I'm working on a patch to add abs(interval) so I noticed this. There are
> lots of oddities, including lots of intervals which compare equal to 0 but
> which can change a timestamp when added to it, but as presently designed,
> this particular interval compares as negative.

It's no use—it's oddities all the way down. You can shove them off to one
side of the desk or the other depending on your intentions of the moment,
but they're always there. If you want to put intervals in a btree, you
can define a total ordering where all days are 24 hours and all months
are 30 days, and then there are no oddities in your btree, they're just
everywhere else. Or you can compare your unknown interval to a known
one like '0 months' and say you know whether it's "negative", you just
don't know whether it moves a real date forward or back. Or you can see
what it does to a real date, but not know whether it would precede or
follow some other interval in a btree.

Regards,
-Chap




Re: Granting SET and ALTER SYSTE privileges for GUCs

2021-11-23 Thread Robert Haas
On Mon, Nov 22, 2021 at 7:21 PM Mark Dilger
 wrote:
> There is a new information_schema.guc_privileges view, not present in v2.

It's my impression that information_schema is a child of the SQL
standard, and that inventions specific to PG go in pg_catalog.

Also, I think the user-facing name for GUCs is "settings".

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: A test for replay of regression tests

2021-11-23 Thread Andrew Dunstan


On 11/23/21 04:07, Thomas Munro wrote:
> On Wed, Oct 6, 2021 at 7:10 PM Thomas Munro  wrote:
>> I wonder if for Windows we'd want to switch to real symlinks, since,
>> as far as I know from some light reading, reparse points are converted
>> to absolute paths on creation, so the pgdata directory would be less
>> portable than it would be on POSIX systems.
> I looked into that a bit, and it seems that the newer "real" symlinks
> can't be created without admin privileges, unlike junction points, so
> that wouldn't help.  I still need to figure out what this
> junction-based patch set is doing wrong on Windows though trial-by-CI.
> In the meantime, here is a rebase over recent changes to the Perl
> testing APIs.


More exactly, you need the "Create Symbolic Links" privilege.



I haven't yet worked out a way of granting that from the command line,
but if it's working the buildfarm client (as of git tip) will use it on
windows for the workdirs space saving feature.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Feature Proposal: Connection Pool Optimization - Change the Connection User

2021-11-23 Thread Todd Hubers
Hi Jacob and Daniel,

Thanks for your feedback.

>@Daniel - I think that's conflating session_user and current_user; SET ROLE
is not a login event. This is by design and discussed in the documentation.

Agreed, I am using those terms loosely. I have updated option 4 in the
proposal document. I have crossed it out. Option 5 is more suitable "SET
SESSION AUTHORIZATION" for further consideration.

>@Daniel - but it's important to remember that we need to cover the
functionality in terms of *tests* first, performance benchmarking is
another concern.

For implementation absolutely, but not for a basic feasibility prototype. A
quick non-secure non-reliable prototype is probably an important first-step
to confirming which options work best for the stated goals. Importantly, if
the improvement is only 5% (whatever that might mean), then the project is
probably not worth starting. But I do expect that a benchmark will prove
benefits that justify the resources to build the feature(s).

>@Jacob - A more modern approach might be to attach the authentication to
the packet itself (e.g. cryptographically, with a MAC), if the goal is to
enable per-statement authentication anyway. In theory that turns the
middleware into a message passer instead of a confusable deputy. But it
requires more complicated setup between the client and server.

I did consider this, but I ruled it out. I have now added it to the
proposal document, and included two Issues. Please review and let me know
whether I might be mistaken.

>@Jacob - Having protocol-level tests for bytes on the wire would not only
help proposals like this but also get coverage for a huge number of edge
cases. Magnus has added src/test/protocol for the server, written in Perl,
in his PROXY proposal. And I've added a protocol suite for both the client
and server, written in Python/pytest, in my OAuth proof of concept. I think
something is badly needed in this area.

Thanks for highlighting this emerging work. I have noted this in the
proposal in the Next Steps section.

--Todd

Note: Here is the proposal document link again -
https://docs.google.com/document/d/1u6mVKEHfKtR80UrMLNYrp5D6cCSW1_arcTaZ9HcAKlw/edit#

On Tue, 23 Nov 2021 at 12:12, Jacob Champion  wrote:

> On Sat, 2021-11-20 at 16:16 -0500, Tom Lane wrote:
> > One more point is that the proposed business about
> >
> > * ImpersonateDatabaseUser will either succeed silently (0-RTT), or
> >   fail. Upon failure, no further commands will be processed until
> >   ImpersonateDatabaseUser succeeds.
> >
> > seems to require adding a huge amount of complication on the server side,
> > and complication in the protocol spec itself, to save a rather minimal
> > amount of complication in the middleware.  Why can't we just say that
> > a failed "impersonate" command leaves the session in the same state
> > as before, and it's up to the pooler to do something about it?  We are
> > in any case trusting the pooler not to send commands from user A to
> > a session logged in as user B.
>
> When combined with the 0-RTT goal, I think a silent ignore would just
> invite more security problems. Todd is effectively proposing packet
> pipelining, so the pipeline has to fail shut.
>
> A more modern approach might be to attach the authentication to the
> packet itself (e.g. cryptographically, with a MAC), if the goal is to
> enable per-statement authentication anyway. In theory that turns the
> middleware into a message passer instead of a confusable deputy. But it
> requires more complicated setup between the client and server.
>
> > PS: I wonder how we test such a feature meaningfully without
> > incorporating a pooler right into the Postgres tree.  I don't
> > want to do that, for sure.
>
> Having protocol-level tests for bytes on the wire would not only help
> proposals like this but also get coverage for a huge number of edge
> cases. Magnus has added src/test/protocol for the server, written in
> Perl, in his PROXY proposal. And I've added a protocol suite for both
> the client and server, written in Python/pytest, in my OAuth proof of
> concept. I think something is badly needed in this area.
>
> --Jacob
>


-- 
--
Todd Hubers


Re: [RFC] ASOF Join

2021-11-23 Thread Isaac Morland
On Tue, 23 Nov 2021 at 09:44, Chapman Flack  wrote:

> On 11/23/21 02:29, Ilya Anfimov wrote:
> > (*We
> > strangely don't have an absolute value operator on interval,  but
> > I think you've got the point*).
>
> Although tangential to the topic, that might be because a PG interval
> is a triple of independently-signed months/days/seconds components.
> An interval like '1 month -31 days +12:00:00' is positive or negative
> depending on the absolute date you apply it to, so what its absolute
> value should be isn't clear in isolation.
>

Umm, it's definitely negative:

odyssey=> select '1 month -31 days +12:00:00'::interval < '0 months'::interval;
 ?column?
----------
 t
(1 row)

It's just that due to the complexities of our calendar/time systems, adding
it to a timestamp can move the timestamp in either direction:

odyssey=> select '2021-02-01'::timestamp + '1 month -31 days +12:00:00'::interval;
      ?column?
---------------------
 2021-01-29 12:00:00
(1 row)

odyssey=> select '2021-03-01'::timestamp + '1 month -31 days +12:00:00'::interval;
      ?column?
---------------------
 2021-03-01 12:00:00
(1 row)

I'm working on a patch to add abs(interval) so I noticed this. There are
lots of oddities, including lots of intervals which compare equal to 0 but
which can change a timestamp when added to it, but as presently designed,
this particular interval compares as negative.


Re: ResourceOwner refactoring

2021-11-23 Thread Aleksander Alekseev
Hi Heikki,

> Attached is a newly rebased version. It includes your tweaks, and a few
> more comment and indentation tweaks.

v10-0002 rotted a little so I rebased it. The new patch set passed `make
installcheck-world` locally, but let's see if cfbot has a second opinion.

I'm a bit busy with various things at the moment, so (hopefully) I will
submit the actual code review in the follow-up email.

-- 
Best regards,
Aleksander Alekseev


v11-0001-Move-a-few-ResourceOwnerEnlarge-calls-for-safety.patch
Description: Binary data


v11-0002-Make-resowners-more-easily-extensible.patch
Description: Binary data


v11-0003-Optimize-hash-function.patch
Description: Binary data


Re: Some RELKIND macro refactoring

2021-11-23 Thread Alvaro Herrera
On 2021-Nov-23, Alvaro Herrera wrote:

> On 2021-Nov-22, Peter Eisentraut wrote:
> 
> > Maybe
> > 
> > else
> > {
> > Assert(RELKIND_HAS_STORAGE(rel->rd_rel->relkind));
> > RelationCreateStorage(rel->rd_node, relpersistence);
> > }
> > 
> > create_storage is set earlier based on RELKIND_HAS_STORAGE(), so this would
> > be consistent.
> 
> Hmm, right ... but I think we can make this a little simpler.  How about
> the attached?

This doesn't actually work, so nevermind that.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)




Re: Some RELKIND macro refactoring

2021-11-23 Thread Alvaro Herrera
On 2021-Nov-22, Peter Eisentraut wrote:

> Maybe
> 
> else
> {
> Assert(RELKIND_HAS_STORAGE(rel->rd_rel->relkind));
> RelationCreateStorage(rel->rd_node, relpersistence);
> }
> 
> create_storage is set earlier based on RELKIND_HAS_STORAGE(), so this would
> be consistent.

Hmm, right ... but I think we can make this a little simpler.  How about
the attached?

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index a13861e925..4d62a1960f 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -308,7 +308,6 @@ heap_create(const char *relname,
 			TransactionId *relfrozenxid,
 			MultiXactId *relminmxid)
 {
-	bool		create_storage;
 	Relation	rel;
 
 	/* The caller must have provided an OID for the relation. */
@@ -343,19 +342,6 @@ heap_create(const char *relname,
 	if (!RELKIND_HAS_TABLESPACE(relkind))
 		reltablespace = InvalidOid;
 
-	/*
-	 * Decide whether to create storage. If caller passed a valid relfilenode,
-	 * storage is already created, so don't do it here.  Also don't create it
-	 * for relkinds without physical storage.
-	 */
-	if (!RELKIND_HAS_STORAGE(relkind) || OidIsValid(relfilenode))
-		create_storage = false;
-	else
-	{
-		create_storage = true;
-		relfilenode = relid;
-	}
-
 	/*
 	 * Never allow a pg_class entry to explicitly specify the database's
 	 * default tablespace in reltablespace; force it to zero instead. This
@@ -376,7 +362,7 @@ heap_create(const char *relname,
 	 tupDesc,
 	 relid,
 	 accessmtd,
-	 relfilenode,
+	 relfilenode ? relfilenode : relid,
 	 reltablespace,
 	 shared_relation,
 	 mapped_relation,
@@ -384,20 +370,23 @@ heap_create(const char *relname,
 	 relkind);
 
 	/*
-	 * Have the storage manager create the relation's disk file, if needed.
+	 * If caller gave us a valid relfilenode, storage is already created.
+	 * Otherwise, create one now if the relkind requires it.
 	 *
 	 * For tables, the AM callback creates both the main and the init fork.
 	 * For others, only the main fork is created; the other forks will be
 	 * created on demand.
 	 */
-	if (create_storage)
+	if (!OidIsValid(relfilenode))
 	{
-		if (RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind))
+		if (RELKIND_HAS_TABLE_AM(relkind))
 			table_relation_set_new_filenode(rel, &rel->rd_node,
 			relpersistence,
 			relfrozenxid, relminmxid);
-		else
+		else if (RELKIND_HAS_STORAGE(relkind))
 			RelationCreateStorage(rel->rd_node, relpersistence);
+		else
+			Assert(false);
 	}
 
 	/*


Re: Some RELKIND macro refactoring

2021-11-23 Thread Alvaro Herrera
On 2021-Nov-19, Michael Paquier wrote:

> I think that you should keep this block as it is now.  The part where
> a relkind does not support table AMs and does not require storage
> would get uncovered, and this new code would just attempt to create
> storage, so it seems to me that the existing code is better when it
> comes to prevent mistakes from developers?  You could as well use an
> else if (RELKIND_INDEX || RELKIND_SEQUENCE) for
> RelationCreateStorage(), and fall back to Assert(false) in the else
> branch, to get the same result as the original without impacting the
> level of safety of the original code.

Hmm, I think the added condition and else would make it safe and clear:

+   if (RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind))
+   table_relation_set_new_filenode(rel, &rel->rd_node,
+   relpersistence,
+   relfrozenxid, relminmxid);
+   else if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
+   RelationCreateStorage(rel->rd_node, relpersistence);
+   else
+   Assert(false);

This is the same coding the patch proposed to put in
RelationSetNewRelfilenode, which IMO validates the idea.


In init_fork(), there's a typo:

+* For tables, the AM callback creates both the main and the init fork.
should read:
+* For tables, the AM callback creates both the main and the init forks.


In heap_create_with_catalog, you have

+   if (!relid)
should be
+   if (!OidIsValid(relid))


Overall, LGTM.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Should rename "startup process" to something else?

2021-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Nov-20, Andrew Dunstan wrote:
>> Maybe something along those lines but using a dash/hyphen would work:
>> e.g. wal-replayer

> Yeah, the idea of a dash occurred to me too.

Bad memories of uuid-ossp float up ... can we use underscore?

regards, tom lane




Re: Windows build warnings

2021-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> Right ... the problem, as I understand, is that the syntax for
> [[maybe_unused]] is different from what we can do with the current
> pg_attribute_unused -- [[maybe_unused]] goes before the variable name.
> We would need to define pg_attribute_unused macro (maybe have it take
> the variable name and initializator value as arguments?), and also
> define PG_USED_FOR_ASSERTS_ONLY in the same style.

I've thought all along that PG_USED_FOR_ASSERTS_ONLY was making
unwarranted assumptions about what the underlying syntax would be,
and it seems I was right.  Anyone want to look into what it'd take
to change this?

(It might be an idea to introduce a new macro with a slightly
different name, so we don't have to touch every usage site
right away.)

regards, tom lane




Re: [RFC] ASOF Join

2021-11-23 Thread Chapman Flack
On 11/23/21 02:29, Ilya Anfimov wrote:
> (*We
> strangely don't have an absolute value operator on interval,  but
> I think you've got the point*).

Although tangential to the topic, that might be because a PG interval
is a triple of independently-signed months/days/seconds components.
An interval like '1 month -31 days +12:00:00' is positive or negative
depending on the absolute date you apply it to, so what its absolute
value should be isn't clear in isolation.

That's also why there's no everywhere-defined conversion from PG interval
to XML xs:duration or to java.time.Duration, etc.

Regards,
-Chap




Re: Should rename "startup process" to something else?

2021-11-23 Thread Alvaro Herrera
On 2021-Nov-20, Andrew Dunstan wrote:

> Maybe something along those lines but using a dash/hyphen would work:
> e.g. wal-replayer

Yeah, the idea of a dash occurred to me too.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La grandeza es una experiencia transitoria.  Nunca es consistente.
Depende en gran parte de la imaginación humana creadora de mitos"
(Irulan)




Re: Windows build warnings

2021-11-23 Thread Alvaro Herrera
On 2021-Nov-23, Juan José Santamaría Flecha wrote:

> On Tue, Nov 23, 2021 at 2:11 PM Daniel Gustafsson  wrote:

> > It's supported in clang as well per the documentation [0] in at least some
> > configurations or distributions:

> [[maybe_unused]] is also recognized from Visual Studio 2017 onwards [1].
> 
> [1] https://docs.microsoft.com/en-us/cpp/cpp/attributes?view=msvc-170

Right ... the problem, as I understand, is that the syntax for
[[maybe_unused]] is different from what we can do with the current
pg_attribute_unused -- [[maybe_unused]] goes before the variable name.
We would need to define pg_attribute_unused macro (maybe have it take
the variable name and initializator value as arguments?), and also
define PG_USED_FOR_ASSERTS_ONLY in the same style.
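Something along these lines, purely hypothetical (in particular, whether
MSVC accepts [[maybe_unused]] when compiling C rather than C++ would need
to be checked):

#if defined(_MSC_VER)
#define PG_USED_FOR_ASSERTS_ONLY_DECL(decl) [[maybe_unused]] decl
#elif defined(__GNUC__) || defined(__clang__)
#define PG_USED_FOR_ASSERTS_ONLY_DECL(decl) decl __attribute__((unused))
#else
#define PG_USED_FOR_ASSERTS_ONLY_DECL(decl) decl
#endif

/* usage, instead of "int nblocks PG_USED_FOR_ASSERTS_ONLY;": */
PG_USED_FOR_ASSERTS_ONLY_DECL(static int nblocks) = 0;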

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: Windows build warnings

2021-11-23 Thread Juan José Santamaría Flecha
On Tue, Nov 23, 2021 at 2:11 PM Daniel Gustafsson  wrote:

> > On 22 Nov 2021, at 16:40, Tom Lane  wrote:
>
> > I can't find anything that is providing a non-empty definition of
> > PG_USED_FOR_ASSERTS_ONLY (a/k/a pg_attribute_unused) for anything
> > except GCC.
>
> It's supported in clang as well per the documentation [0] in at least some
> configurations or distributions:
>
> "The [[maybe_unused]] (or __attribute__((unused))) attribute can be
> used to silence such diagnostics when the entity cannot be removed.
> For instance, a local variable may exist solely for use in an
> assert()
> statement, which makes the local variable unused when NDEBUG is
> defined."
>
> [[maybe_unused]] is also recognized from Visual Studio 2017 onwards [1].

[1] https://docs.microsoft.com/en-us/cpp/cpp/attributes?view=msvc-170

Regards,

Juan José Santamaría Flecha


Re: [Patch] ALTER SYSTEM READ ONLY

2021-11-23 Thread Amul Sul
   On Wed, Nov 17, 2021 at 6:20 PM Amul Sul  wrote:
>
>   On Wed, Nov 17, 2021 at 4:07 PM Amul Sul  wrote:
> >
> > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul  wrote:
> > >
> > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas  wrote:
> > > >
> > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul  wrote:
> > > > > Attached is the rebased version of refactoring as well as the
> > > > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > > > 39a3105678a).
> > > >
> > > > I spent a lot of time today studying 0002, and specifically the
> > > > question of whether EndOfLog must be the same as
> > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > > > XLogCtl->replayEndTLI.
> > > >
> > > > The answer to the former question is "no" because, if we don't enter
> > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > > > enter redo, then I think it has to be the same unless something very
> > > > weird happens. EndOfLog gets set like this:
> > > >
> > > > XLogBeginRead(xlogreader, LastRec);
> > > > record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > > > EndOfLog = EndRecPtr;
> > > >
> > > > In every case that exists in our regression tests, EndRecPtr is the
> > > > same before these three lines of code as it is afterward. However, if
> > > > you test with recovery_target=immediate, you can get it to be
> > > > different, because in that case we drop out of the redo loop after
> > > > calling recoveryStopsBefore() rather than after calling
> > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > > > recovery_target_inclusive=off you can likewise get it to be different
> > > > (though I discovered the hard way that recovery_target_inclusive=off
> > > > is ignored when you use recovery_target_name). It seems like a really
> > > > bad thing that neither recovery_target=immediate nor
> > > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > > add some.
> > > >
> > >
> > > recovery/t/003_recovery_targets.pl has test for
> > > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > > can add that for recovery_target_lsn, recovery_target_time, and
> > > recovery_target_xid case only where it affects.
> > >
> > > > Anyway, in effect, these three lines of code have the effect of
> > > > backing up the xlogreader by one record when we stop before rather
> > > > than after a record that we're replaying. What that means is that
> > > > EndOfLog is going to be the end+1 of the last record that we actually
> > > > replayed. There might be one more record that we read but did not
> > > > replay, and that record won't impact the value we end up with in
> > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > > record that we actually replayed. To put that another way, there's no
> > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > > and before we change LastRec. So in the cases where
> > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > > different from EndOfLog if something different happens when we re-read
> > > > the last-replayed WAL record than what happened when we read it the
> > > > first time. That seems unlikely, and would be unfortunate it if it did
> > > > happen. I am inclined to think that it might be better not to reread
> > > > the record at all, though.
> > >
> > > There are two reasons that the record is reread; first, one that you
> > > have just explained where the redo loop drops out due to
> > > recoveryStopsBefore() and another one is that InRecovery is false.
> > >
> > > In the formal case at the end, redo while-loop does read a new record
> > > which in effect updates EndRecPtr and when we breaks the loop, we do
> > > reach the place where we do reread record -- where we do read the
> > > record (i.e. LastRec) before the record that redo loop has read and
> > > which correctly sets EndRecPtr. In the latter case, definitely, we
> > > don't need any adjustment to EndRecPtr.
> > >
> > > So technically one case needs reread but that is also not needed, we
> > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > > be set conditionally something like:
> > >
> > > if (InRecovery)
> > > {
> > > EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> > > EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > > }
> > > else
> > > {
> > > EndOfLog = EndRecPtr;
> > > EndOfLogTLI = replayTLI;
> > > }
> > >
> > > > As far as this patch goes, I think we need
> > > > a solution that doesn't involve fetching EndOfLog from a variable
> > > > that's only sometimes initialized and then not doing anything with it
> > > > except in the cases where it was initialized.
> > > >
> > >
> > > Another reason could be EndOfLog changes further in the following case:
> > >
> > > /*
> > >  * Actually, if WAL ended in an 

Re: Sequence's value can be rollback after a crashed recovery.

2021-11-23 Thread Andy Fan
>
>
> > I think at this thread[1], which claimed to get this issue even after
> > commit, I haven't tried it myself though.
> >
> > [1]
> https://www.postgresql.org/message-id/flat/ea6485e3-98d0-24a7-094c-87f9d5f9b18f%40amazon.com#4cfe7217c829419b769339465e8c2915
> >
>
> I did try, and I haven't been able to reproduce that behavior (on
> master, at least).
>
>
I agree with this; the commit would flush the xlog and persist the change.


> I see Tom speculated we may not flush WAL if a transaction only does
> nextval() in that other thread, but I don't think that's true. AFAICS if
> the nextval() call writes stuff to WAL, the RecordTransactionCommit will
> have wrote_xlog=true and valid XID. And so it'll do the usual usual
> XLogFlush() etc.
>
>
I agree with this as well; otherwise, how could we replicate it to the standby
if the user only runs SELECT nextval('s') in a transaction?

>  I fail to see how that's less durable than any other DML (e.g. we don't
> complain about INSERT not being durable if you don't commit the change).
> If you can show that the sequence goes back after a commit, that'd be an
actual durability issue.

This can't be called a transaction durability issue, but people usually think
the value of a sequence will not roll back, so it may surprise them if that
happens.

-- 
Best Regards
Andy Fan


Re: logical decoding and replication of sequences

2021-11-23 Thread Tomas Vondra




On 11/23/21 02:01, Andres Freund wrote:

Hi,

On 2021-09-25 22:05:43 +0200, Hannu Krosing wrote:

If our aim is just to make sure that all user-visible data in
*transactional* tables is consistent with sequence state then  one
very much simplified approach to this could be to track the results of
nextval() calls in a transaction at COMMIT put the latest sequence
value in WAL (or just track the sequences affected and put the latest
sequence state in WAL at commit which needs extra read of sequence but
protects against race conditions with parallel transactions which get
rolled back later)


I think this is a bad idea. It's architecturally more complicated and prevents
use cases because sequence values aren't guaranteed to be as new as on the
original system. You'd need to track all sequence use somehow *even if there
is no relevant WAL generated* in a transaction. There's simply no evidence of
sequence use in a transaction if that transaction uses a previously logged
sequence value.



Not quite. We already have a cache of all sequences used by a session 
(see seqhashtab in sequence.c), and it's not that hard to extend it to 
per-transaction tracking. That's what the last two versions do, mostly.


But there are various issues with that approach, described in my last 
message(s).



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: removing global variable ThisTimeLineID

2021-11-23 Thread Robert Haas
On Tue, Nov 23, 2021 at 4:36 AM Drouvot, Bertrand  wrote:
> Indeed, I think the logical decoding on standby patch just needs to
> change the Assert in GetWALInsertionTimeLine() to check
> SharedRecoveryState is RECOVERY_STATE_DONE or RECOVERY_STATE_ARCHIVE.
>
> Would that make sense from your point of view to add the check on
> RECOVERY_STATE_ARCHIVE or do you think it would need to be more
> "sophisticated"?

Unless I'm confused, that would just be incorrect.
GetWALInsertionTimeLine() just returns InsertTimeLineID, which won't
be initialized during archive recovery. I discussed this point a bit
on some other thread with Andres. It's possible that all that you need
to do here is determine the replayTLI using e.g. GetXLogReplayRecPtr.
But I don't really see why that should be correct:
read_local_xlog_page() has some kind of wait-and-retry logic and I'm
not clear why we don't need the same thing here. After all, if we're
on the primary and we determine that sendTimeLineHistoric = false and
sendTimeLine = whatever, neither of those values can become incorrect
afterward. But on a standby that can certainly happen, if an upstream
server is promoted. Note that the comment in read_local_xlog_page()
says this:

 * If that happens after our caller determined the TLI but before
 * we actually read the xlog page, we might still try to read from the
 * old (now renamed) segment and fail. There's not much we can do
 * about this, but it can only happen when we're a leaf of a cascading
 * standby whose primary gets promoted while we're decoding, so a
 * one-off ERROR isn't too bad.

That seems to imply that the retry loop here is designed, at least in
part, to protect against exactly this kind of scenario.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Windows build warnings

2021-11-23 Thread Daniel Gustafsson
> On 22 Nov 2021, at 16:40, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> Fair enough.  Looking at where we use PG_USED_FOR_ASSERTS_ONLY (and where it
>> works), these two warnings are the only places where we apply it to a pointer
>> typedef (apart from one place where the variable is indeed used outside of
>> asserts).  Since it clearly works in all other cases, I wonder if something
>> like the below sketch could make MSVC handle the attribute?
> 
> Ugh.  If we're changing the code anyway, I think I prefer Greg's original
> patch; it's at least not randomly inconsistent with everything else.

Totally agree.

> However ... I question your assumption that it works everywhere else.

Right, I wasn't expressing myself well uncaffeinated; I meant that it doesn't
emit a warning for the other instances, which I agree isn't the same thing as
the warning avoidance working.

> I can't find anything that is providing a non-empty definition of
> PG_USED_FOR_ASSERTS_ONLY (a/k/a pg_attribute_unused) for anything
> except GCC.  

It's supported in clang as well per the documentation [0] in at least some
configurations or distributions:

"The [[maybe_unused]] (or __attribute__((unused))) attribute can be
used to silence such diagnostics when the entity cannot be removed.
For instance, a local variable may exist solely for use in an assert()
statement, which makes the local variable unused when NDEBUG is
defined."

> It seems likely to me that MSVC simply fails to produce
> such warnings in most places, but it's woken up and done so here.

Looking at the instances of PG_USED_FOR_ASSERTS_ONLY in the tree, every such
variable is set (although not read) either at instantiation or later in the
codepath *outside* of the Assertion.  The documentation for C4101 [1] isn't
exactly verbose, but the examples therein show it for variables not set at all
so it might be that it simply only warns for pruneheap.c since all other cases
are at least set.  Setting the variables in question to NULL as a test, instead
of marking them unused, the build passes through MSVC without any warnings.  We
might still want to use the original patch in this thread, but it seems that
this might at least hint at explaining why MSVC only emitted warnings for those
two. 

While skimming the variables marked as unused, I noticed that two of them seems
to be marked as such while being used in non-assert codepaths, see the attached
diff.  The one in postgres_fdw.c was added in 8998e3cafa2 with 1ec7fca8592
using the variable; in lsyscache.c it was introduced in 2a6368343ff and then
promptly used in the fix commit 7e041603904 shortly thereafter.  Any objections
to applying that?

--
Daniel Gustafsson   https://vmware.com/

[0] https://clang.llvm.org/docs/AttributeReference.html#maybe-unused-unused
[1] 
https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-3-c4101



unused.diff
Description: Binary data


Re: row filtering for logical replication

2021-11-23 Thread Amit Kapila
On Tue, Nov 23, 2021 at 4:58 PM Ajin Cherian  wrote:
>
> Attaching a new patchset v41 which includes changes by both Peter and myself.
>

In 0003 patch, why is below change required?
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1,4 +1,4 @@
-/*-
+/*
  *
  * pgoutput.c

I suggest at this stage we can combine 0001, 0003, and 0004. Then move
pg_dump and psql (describe.c) related changes to 0002 and make 0002 as
the last patch in the series. This will help review backend changes
first and then we can look at client-side changes.

After above, rearrange the code in pgoutput_row_filter(), so that two
different checks related to 'rfisnull'  (introduced by different
patches) can be combined as if .. else check.

-- 
With Regards,
Amit Kapila.




Re: logical decoding/replication: new functions pg_ls_logicaldir and pg_ls_replslotdir

2021-11-23 Thread Michael Paquier
On Sun, Nov 21, 2021 at 08:45:52AM +0530, Bharath Rupireddy wrote:
> I don't think we need to go far to contrib/test_decoding/, even if we
> add it there we can't test it for the outputs of these functions, so
> I've added the tests in misc_functinos.sql itself.

+SELECT COUNT(*) >= 0 AS OK FROM pg_ls_replslotdir('slot_dir_funcs');
+ ok
+----
+ t
+(1 row)
Creating a slot within the main regression test suite is something we
should avoid as it impacts the portability of the tests (note that we
don't have tests creating slots in src/test/regress/, and we'd require
max_replication_slots > 0 with this version of the patch).  This was
the point I was trying to make upthread about using test_decoding/
where we already have slots.

A second thing I have noticed is the set of OIDs used by the patch
which was incorrect.  On a development branch, we require new features
to use OIDs between 8000-9999 (unused_oids would recommend a random
range of them).  A third thing was that pg_proc.dat had an incorrect
description for pg_ls_replslotdir(), and that it was in need of
indentation.

I have tweaked a bit the tests and the docs, and the result looked
fine at the end.  Hence, applied.
--
Michael


signature.asc
Description: PGP signature


Re: row filtering for logical replication

2021-11-23 Thread Amit Kapila
On Tue, Nov 23, 2021 at 1:29 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tues, Nov 23, 2021 2:27 PM vignesh C  wrote:
> > On Thu, Nov 18, 2021 at 7:04 AM Peter Smith 
> > wrote:
> > >
> > > PSA new set of v40* patches.
> >
> > Few comments:
> > 1) When a table is added to the publication, replica identity is checked.
> > But while modifying the publish action to include delete/update, replica
> > identity is not checked for the existing tables. I felt it should be
> > checked for the existing tables too.
>
> In addition to this, I think we might also need some check to prevent users
> from changing the REPLICA IDENTITY index which is used in the filter
> expression.
>
> I was wondering whether it is possible to do the check related to REPLICA
> IDENTITY in CheckCmdReplicaIdentity() or in GetRelationPublicationActions().
> If we move the REPLICA IDENTITY check to this function, it would be
> consistent with the existing behavior of the REPLICA IDENTITY checks (see
> the comments in CheckCmdReplicaIdentity) and it seems it can cover all the
> cases mentioned above.
>

Yeah, adding the replica identity check in CheckCmdReplicaIdentity()
would cover all the above cases, but I think that would add a cost to
each update/delete operation. Traversing the expression tree (it could
be multiple traversals if the relation is part of multiple publications)
during each update/delete would be costly. Don't you think so?
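
One way out, sketched under the assumption that the outcome can be computed
once and cached with the relation's publication info (the last field below is
invented for illustration; the four existing flags are from utils/rel.h):

typedef struct PublicationActions
{
	bool		pubinsert;
	bool		pubupdate;
	bool		pubdelete;
	bool		pubtruncate;

	/*
	 * Hypothetical cached flag: true if every column referenced by any row
	 * filter is part of the replica identity.  Computed once when the
	 * publication info is built, so CheckCmdReplicaIdentity() tests a
	 * boolean instead of walking expression trees on every update/delete.
	 */
	bool		rowfilter_covered_by_replident;
} PublicationActions;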

-- 
With Regards,
Amit Kapila.




Re: psql - add SHOW_ALL_RESULTS option

2021-11-23 Thread Daniel Gustafsson
> On 8 Oct 2021, at 14:15, Peter Eisentraut  
> wrote:
> 
> On 02.10.21 16:31, Fabien COELHO wrote:
>>> Attached v9 integrates your tests and makes them work.
>> Attached v11 is a rebase.
> 
> This patch still has a few of the problems reported earlier this year.

The patch fails to apply and the thread seems to have taken a nap.  You
mentioned on the "dynamic result sets support in extended query protocol"
thread [0] that you were going to work on this as a prerequisite for that
patch.  Is that still the plan, meaning we should keep this in the Commitfest?

--
Daniel Gustafsson   https://vmware.com/

[0] 
https://www.postgresql.org/message-id/6f038f18-0f2b-5271-a56f-1770577f246c%40enterprisedb.com





Re: A test for replay of regression tests

2021-11-23 Thread Thomas Munro
On Wed, Oct 6, 2021 at 7:10 PM Thomas Munro  wrote:
> I wonder if for Windows we'd want to switch to real symlinks, since,
> as far as I know from some light reading, reparse points are converted
> to absolute paths on creation, so the pgdata directory would be less
> portable than it would be on POSIX systems.

I looked into that a bit, and it seems that the newer "real" symlinks
can't be created without admin privileges, unlike junction points, so
that wouldn't help.  I still need to figure out what this
junction-based patch set is doing wrong on Windows through trial-by-CI.
In the meantime, here is a rebase over recent changes to the Perl
testing APIs.
From 0a76e6570b8f6eeac146a726a82c7e0e2315cdff Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 25 May 2021 20:42:17 +1200
Subject: [PATCH v6 1/4] Allow restricted relative tablespace paths.

Traditionally, tablespace paths provided to LOCATION had to be absolute,
because relative paths risked colliding with PostgreSQL-managed files
and generally blurring the boundary between system-managed and
user-managed data.

This commit adds a very small exception: it allows relative paths to be
used, but only under a new directory "pg_user_files".  This is intended
to allow automated testing of replication in later patches.

Discussion: https://postgr.es/m/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com
---
 src/backend/commands/tablespace.c | 23 ---
 src/bin/initdb/initdb.c   |  1 +
 src/common/string.c   |  9 +
 src/include/common/string.h   |  1 +
 4 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4b96eec9df..adfbde6b29 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,6 +70,7 @@
 #include "commands/tablecmds.h"
 #include "commands/tablespace.h"
 #include "common/file_perm.h"
+#include "common/string.h"
 #include "miscadmin.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
@@ -267,14 +268,16 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
  errmsg("tablespace location cannot contain single quotes")));
 
 	/*
-	 * Allowing relative paths seems risky
+	 * Allowing relative paths seems risky in general, but we'll allow it under
+	 * pg_user_files.
 	 *
 	 * this also helps us ensure that location is not empty or whitespace
 	 */
-	if (!is_absolute_path(location))
+	if (!is_absolute_path(location) &&
+		!pg_str_startswith(location, "pg_user_files/"))
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
- errmsg("tablespace location must be an absolute path")));
+ errmsg("tablespace location must be an absolute path or a relative path under pg_user_files/")));
 
 	/*
 	 * Check that location isn't too long. Remember that we're going to append
@@ -587,6 +590,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 static void
 create_tablespace_directories(const char *location, const Oid tablespaceoid)
 {
+	char		buffer[MAXPGPATH];
 	char	   *linkloc;
 	char	   *location_with_version_dir;
 	struct stat st;
@@ -651,6 +655,19 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
 	if (InRecovery)
 		remove_tablespace_symlink(linkloc);
 
+#ifndef WIN32
+	/*
+	 * If location is relative to pgdata, prefix it with ".." so that we have a
+	 * path relative to the symlink.  (Not needed for Windows, because our
+	 * symlink() wrapper will interpret it as relative to pgdata instead).
+	 */
+	if (!is_absolute_path(location))
+	{
+		snprintf(buffer, sizeof(buffer), "../%s", location);
+		location = buffer;
+	}
+#endif
+
 	/*
 	 * Create the symlink under PGDATA
 	 */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3c61c789e4..5ce1c28174 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -224,6 +224,7 @@ static const char *const subdirs[] = {
 	"pg_tblspc",
 	"pg_stat",
 	"pg_stat_tmp",
+	"pg_user_files",
 	"pg_xact",
 	"pg_logical",
 	"pg_logical/snapshots",
diff --git a/src/common/string.c b/src/common/string.c
index 3aa378c051..7e7e34f1f5 100644
--- a/src/common/string.c
+++ b/src/common/string.c
@@ -24,6 +24,15 @@
 #include "common/string.h"
 
 
+/*
+ * Returns whether the string `str' has the prefix `start'.
+ */
+bool
+pg_str_startswith(const char *str, const char *start)
+{
+	return strncmp(str, start, strlen(start)) == 0;
+}
+
 /*
  * Returns whether the string `str' has the postfix `end'.
  */
diff --git a/src/include/common/string.h b/src/include/common/string.h
index 8eb5271ec8..ddd384f6f9 100644
--- a/src/include/common/string.h
+++ b/src/include/common/string.h
@@ -21,6 +21,7 @@ typedef struct PromptInterruptContext
 } PromptInterruptContext;
 
 /* functions in src/common/string.c */
+extern bool pg_str_startswith(const char *str, const char *start);
 extern bool pg_str_endswith(const char *str, const char *end);
extern int	strtoint(const char *pg_restrict str, char **pg_restrict endptr, int base);

Re: row filtering for logical replication

2021-11-23 Thread Amit Kapila
On Thu, Nov 18, 2021 at 11:02 AM Peter Smith  wrote:
>
> On Mon, Nov 15, 2021 at 9:31 PM Amit Kapila  wrote:
>
> >
> > 4. I think we should add some comments in pgoutput_row_filter() as to
> > why we are caching the row_filter here instead of
> > get_rel_sync_entry()? That has been discussed multiple times so it is
> > better to capture that in comments.
>
> Added comment in v40 [1]
>

I think that, apart from the truncate and error cases, it can also happen
for other operations, because we decide whether to publish a change
(operation) only after calling get_rel_sync_entry() in pgoutput_change().
I think we can reflect that as well in the comment.
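
Perhaps something along these lines for the comment (wording is only a
suggestion):

/*
 * Cache the row filter here rather than in get_rel_sync_entry(), because
 * whether a change is published at all is decided only after that call
 * (e.g. in pgoutput_change()), and on truncate or error paths the filter
 * may never be needed.
 */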

> >
> > 5. Why do you need a separate variable rowfilter_valid to indicate
> > whether a valid row filter exists? Why exprstate is not sufficient?
> > Can you update comments to indicate why we need this variable
> > separately?
>
> I have improved the (existing) comment in v40 [1].
>

One more thing related to this code:
pgoutput_row_filter()
{
..
+ if (!entry->rowfilter_valid)
{
..
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+ tupdesc = CreateTupleDescCopy(tupdesc);
+ entry->scantuple = MakeSingleTupleTableSlot(tupdesc, &TTSOpsHeapTuple);
+ MemoryContextSwitchTo(oldctx);
..
}

Why do we need to initialize scantuple here unless we are sure that
the row filter is going to get associated with this relentry? When
there is no row filter, this allocation is not required.
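
For instance, something along these lines, assuming the filter lookup runs
first (a sketch of the reordering, not a proposed diff):

/* Build the scan slot only once we know a row filter exists. */
if (entry->exprstate != NULL)
{
	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
	tupdesc = CreateTupleDescCopy(tupdesc);
	entry->scantuple = MakeSingleTupleTableSlot(tupdesc, &TTSOpsHeapTuple);
	MemoryContextSwitchTo(oldctx);
}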

-- 
With Regards,
Amit Kapila.




RE: row filtering for logical replication

2021-11-23 Thread houzj.f...@fujitsu.com
On Tues, Nov 23, 2021 2:27 PM vignesh C  wrote:
> On Thu, Nov 18, 2021 at 7:04 AM Peter Smith 
> wrote:
> >
> > PSA new set of v40* patches.
> 
> Few comments:
> 1) When a table is added to the publication, replica identity is checked.
> But while modifying the publish action to include delete/update, replica
> identity is not checked for the existing tables. I felt it should be
> checked for the existing tables too.

In addition to this, I think we might also need some check to prevent users
from changing the REPLICA IDENTITY index which is used in the filter
expression.

I was wondering whether it is possible to do the check related to REPLICA
IDENTITY in CheckCmdReplicaIdentity() or in GetRelationPublicationActions().
If we move the REPLICA IDENTITY check to this function, it would be
consistent with the existing behavior of the REPLICA IDENTITY checks (see the
comments in CheckCmdReplicaIdentity) and it seems it can cover all the cases
mentioned above.

Another comment about v40-0001 patch:


+   char *relname = pstrdup(RelationGetRelationName(rel));
+
    table_close(rel, ShareUpdateExclusiveLock);
+
+   /* Disallow duplicate tables if there are any with row-filters. */
+   if (t->whereClause || list_member_oid(relids_with_rf, myrelid))
+       ereport(ERROR,
+               (errcode(ERRCODE_DUPLICATE_OBJECT),
+                errmsg("conflicting or redundant row-filters for \"%s\"",
+                       relname)));
+   pfree(relname);

Maybe we can do the error check before table_close(), so that we don't need to
invoke pstrdup() and pfree().
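
i.e., something like this reordering (sketch only):

+   /* Disallow duplicate tables if there are any with row-filters. */
+   if (t->whereClause || list_member_oid(relids_with_rf, myrelid))
+       ereport(ERROR,
+               (errcode(ERRCODE_DUPLICATE_OBJECT),
+                errmsg("conflicting or redundant row-filters for \"%s\"",
+                       RelationGetRelationName(rel))));
+
    table_close(rel, ShareUpdateExclusiveLock);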


Best regards,
Hou zj