Re: [Proposal] Add foreign-server health checks infrastructure

2022-01-22 Thread Fujii Masao




On 2022/01/21 15:08, kuroda.hay...@fujitsu.com wrote:

Dear Fujii-san, Zhihong,

I attached the latest version.


Thanks for updating the patch!

+int
+TryDisableRemoteServerCheckingTimeout(void)

When more than one FDW is used, even if one FDW calls this function to 
disable the timeout, its remote-server-check callback can still be called. Is 
this OK?
Please imagine the case where two FDWs are used and both have registered their 
callbacks. Even when one FDW calls TryDisableRemoteServerCheckingTimeout(), if 
the other FDW has not called it yet, the timeout remains enabled. If the 
timeout fires during that period, even the callback registered by the FDW that 
has already called TryDisableRemoteServerCheckingTimeout() would still be 
called.
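
For illustration only, here is one way such a race could be avoided: track a 
per-FDW "enabled" flag next to each registered callback, so the dispatcher 
skips entries whose FDW has already disabled checking even while the shared 
timer is still running. This is a sketch with invented names, not the patch's 
actual API:

#define MAX_CHECK_ENTRIES 32

typedef struct RemoteServerCheckEntry
{
	void		(*callback) (void *arg);	/* FDW-provided check routine */
	void	   *arg;
	bool		enabled;		/* has this FDW disabled its checks? */
} RemoteServerCheckEntry;

static RemoteServerCheckEntry check_entries[MAX_CHECK_ENTRIES];
static int	num_check_entries = 0;

static void
CallCheckingRemoteServersCallbacksSketch(void)
{
	for (int i = 0; i < num_check_entries; i++)
	{
		/* Skip FDWs that have already disabled their checks. */
		if (check_entries[i].enabled)
			check_entries[i].callback(check_entries[i].arg);
	}
}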


+   if (remote_servers_connection_check_interval > 0)
+   {
+       CallCheckingRemoteServersCallbacks();
+       enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT,
+                            remote_servers_connection_check_interval);

LockErrorCleanup() needs to be called before reporting the error, doesn't it?


This can cause an error even while DoingCommandRead == true. Is that safe?



The biggest change is that callbacks are no longer unregistered at the end of 
transactions.
FDW developers must enable or disable the timeout instead, via new APIs.

The timer will be turned on when:
* new GUC is >= 0, and


This can cause the timeout to be enabled even when no remote transaction is 
started?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Parallelize correlated subqueries that execute within each worker

2022-01-22 Thread James Coleman
On Fri, Jan 21, 2022 at 3:20 PM Robert Haas  wrote:
>
> On Fri, Jan 14, 2022 at 2:25 PM James Coleman  wrote:
> > I've been chewing on this a bit, and I was about to go re-read the
> > code and see how easy it'd be to do exactly what you're suggesting in
> > generate_gather_paths() (and verifying it doesn't need to happen in
> > other places). However there's one (I think large) gotcha with that
> > approach (assuming it otherwise makes sense): it means we do
> > unnecessary work. In the current patch series we only need to recheck
> > parallel safety if we're in a situation where we might actually
> > benefit from doing that work (namely when we have a correlated
> > subquery we might otherwise be able to execute in a parallel plan). If
> > we don't track that status we'd have to recheck the full parallel
> > safety of the path for all paths -- even without correlated
> > subqueries.
>
> I don't think there's an intrinsic problem with the idea of making a
> tentative determination about parallel safety and then refining it
> later, but I'm not sure why you think it would be a lot of work to
> figure this out at the point where we generate gather paths. I think
> it's just a matter of testing whether the set of parameters that the
> path needs as input is the empty set. It may be that neither extParam
> nor allParam are precisely that thing, but I think both are very
> close, and it seems to me that there's no theoretical reason why we
> can't know for every path the set of inputs that it requires "from the
> outside."

As I understand it now (not sure I realized this before), you're
suggesting that *even when there are required params* we mark the path
as parallel safe, and then check the params for parallel safety later.
From a purely theoretical perspective that seemed reasonable, so I
took a pass at that approach.

The first, and likely most interesting, thing I discovered was that
the vast majority of what the patch accomplishes it does so not via
the delayed params safety checking but rather via the required outer
relids checks I'd added to generate_useful_gather_paths.
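
For anyone following along, the shape of that relids check is roughly the
following (a sketch of the idea, not the patch itself; the exact test the
patch uses may differ):

	ListCell   *lc;

	foreach(lc, rel->partial_pathlist)
	{
		Path	   *subpath = (Path *) lfirst(lc);
		Relids		required_outer = PATH_REQ_OUTER(subpath);

		/*
		 * If the path needs parameters, only proceed when every required
		 * rel belongs to the current query level; otherwise the parameter
		 * would have to come from outside the Gather.
		 */
		if (required_outer != NULL &&
			!bms_is_subset(required_outer, root->all_baserels))
			continue;

		/* ... consider Gather / Gather Merge on subpath as usual ... */
	}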

For that to happen I did have to mark PARAM_EXEC params as presumed
parallel safe. That means that parallel_safe now doesn't strictly mean
"parallel safe in the current context" but "parallel safe as long as
any required params are provided". That's a real change, but probably
acceptable as long as a project policy decision is made in that
direction.

There are a few concerns I have (and I'm not sure what level they rise to):

1. From what I can tell we don't have access on a path to the set of
params required by that path (I believe this is what Tom was
referencing in his sister reply at this point in the thread). That
means we have to rely on checking that the required outer relids are
provided by the current query level. I'm not quite sure yet whether or
not that guarantees (or if the rest of the path construction logic
guarantees for us) that the params provided by the outer rel are used
in a correlated way that isn't shared across workers. And because we
don't have the param information available we can't add additional
checks (that I can tell) to verify that.
2. Are we excluding any paths (by having one that will always be
invalid winning the cost comparisons in add_partial_path)? I suppose this
danger actually exists in the previous version of the patch as well,
and I don't actually have any examples of this being a problem. Also
maybe this can only be a problem if (1) reveals a bug.
3. The new patch series actually ends up allowing parallelization of
correlated params in a few more places than the original patch series.
From what I can tell all of the cases are in fact safe to execute in
parallel, which, if true, means this is a feature not a concern. The
changed query plans fall into two categories: a.) putting a gather
inside a subplan and b.) correlated param usages in a subquery scan
path on the inner side of a join. I've separated out those specific
changes in a separate patch to make it easier to tell which ones I'm
referencing.

On the other hand this is a dramatically simpler patch series.
Assuming the approach is sound, it should be much easier to maintain than
the previous version.

The final patch in the series is a set of additional checks I could
imagine to try to be more explicit, but at least in the current test
suite there isn't anything at all they affect.

Does this look at least somewhat more like what you'd envisioned
(granting the need to squint hard given the relids checks instead of
directly checking params)?

Thanks,
James Coleman
From 6596bb6909b9dfaadd6e6e834be4032e7903c4e9 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Mon, 30 Nov 2020 11:36:35 -0500
Subject: [PATCH v4 1/4] Allow parallel LATERAL subqueries with LIMIT/OFFSET

The code that determined whether or not a rel should be considered for
parallel query excluded subqueries with LIMIT/OFFSET. That's correct in
the general case: as the comment notes that'd 

Re: Warning in geqo_main.c from clang 13

2022-01-22 Thread Thomas Munro
On Sun, Jan 23, 2022 at 11:34 AM Tom Lane  wrote:
> We're starting to see more buildfarm animals producing this warning,
> so I took another look, and thought of a slightly less invasive way to
> silence it.  I confirmed this works with clang 13.0.0 on Fedora 35.

LGTM.  Tested on bleeding edge clang 14.




Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas

2022-01-22 Thread Tomas Vondra

Hi,

There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea, 
which may cause failures when starting a replica, making it unusable. 
The commit message for 8431e296ea is not very clear about what exactly 
is being done and why, but the root cause is that while processing 
RUNNING_XACTS, the XIDs are sorted like this:


/*
 * Sort the array so that we can add them safely into
 * KnownAssignedXids.
 */
qsort(xids, nxids, sizeof(TransactionId), xidComparator);

where "safely" likely means "not violating the ordering expected by 
KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values 
as plain uint32 values, while KnownAssignedXidsAdd actually calls 
TransactionIdFollowsOrEquals() and compares the logical XIDs :-(


Triggering this is pretty simple - all you need is two transactions with 
XIDs before/after the 4B limit, and then (re)start a replica. The 
replica refuses to start with a message like this:


LOG:  9 KnownAssignedXids (num=4 tail=0 head=4) [0]=32705 [1]=32706
  [2]=32707 [3]=32708
CONTEXT:  WAL redo at 0/6000120 for Standby/RUNNING_XACTS: nextXid
  32715 latestCompletedXid 32714 oldestRunningXid
  4294967001; 8 xacts: 32708 32707 32706 32705 4294967009
  4294967008 4294967007 4294967006
FATAL:  out-of-order XID insertion in KnownAssignedXids

Clearly, we add the 4 "younger" XIDs first (because that's what the XID 
comparator does), but then KnownAssignedXidsAdd thinks there's some sort 
of corruption because logically 4294967006 is older.
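
For anyone who hasn't internalized the circular XID arithmetic, here is a
small standalone illustration of the two orderings involved (this is not
PostgreSQL code, but the "logical" rule mirrors what TransactionIdPrecedes
does): 4294967006 sorts after 32705 as a plain uint32, yet logically
precedes it.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Raw comparison, the ordering xidComparator produces. */
static int
xid_raw_cmp(TransactionId a, TransactionId b)
{
	return (a < b) ? -1 : (a > b) ? 1 : 0;
}

/* Circular "logical" comparison, as TransactionIdPrecedes does. */
static bool
xid_logically_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

int
main(void)
{
	TransactionId old_xid = 4294967006u;	/* assigned before the 4B mark */
	TransactionId new_xid = 32705u;			/* assigned after wraparound */

	printf("raw order puts old_xid last: %d\n",
		   xid_raw_cmp(old_xid, new_xid));				/* prints 1 */
	printf("but old_xid logically precedes new_xid: %d\n",
		   xid_logically_precedes(old_xid, new_xid));	/* prints 1 */
	return 0;
}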


This does not affect replicas in STANDBY_SNAPSHOT_READY state, because 
in that case ProcArrayApplyRecoveryInfo ignores RUNNING_XACTS messages.



The probability of hitting this in practice is proportional to how long 
you leave transactions running. The system where we observed this leaves 
transactions with XIDs open for days, and the age may be ~40M. 
Intuitively, that's ~40M/4B (=1%) probability that at any given time 
there are transactions with contradicting ordering. So most restarts 
worked fine, until one that happened at just the "right" time.


This likely explains why we never got any reports about this - most 
systems probably don't leave transactions running for this long, so the 
probability is much lower. And replica restarts are generally not that 
common events either.


The attached patch fixes this by just sorting the XIDs logically. The 
xidComparator is meant for places that can't do logical ordering. But 
these XIDs come from RUNNING_XACTS, so they actually come from the same 
wraparound epoch (so sorting logically seems perfectly fine).



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From c96f5d64f977bf252451657cfe80d603869a2c1e Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 23 Jan 2022 01:00:29 +0100
Subject: [PATCH] Fix ordering of XIDs in ProcArrayApplyRecoveryInfo

Commit 8431e296ea reworked ProcArrayApplyRecoveryInfo to sort XIDs
before adding them to KnownAssignedXids, but it used xidComparator for
that purpose, which simply compares the values as uint32 values. This
may easily confuse KnownAssignedXidsAdd() which enforces the logical
ordering by calling TransactionIdFollowsOrEquals() etc.

Hitting this issue is fairly easy - you just need two transactions, one
started before the 4B limit (e.g. XID 4294967290), the other sometime
after it (e.g. XID 1000). Logically (4294967290 <= 1000) but when
compared using xidComparator we try to add them in the opposite order.
Which makes KnownAssignedXidsAdd() fail with an error like this:

  ERROR: out-of-order XID insertion in KnownAssignedXids

This however only happens during replica initialization - so you have to
restart (or create) a replica while there are such transactions with
contradictory XID ordering. Long-running transactions naturally increase
the likelihood of hitting this issue. Once the replica gets into this
state, it can't be started (even if the old transactions are terminated
in some way).

Fixed by sorting the XIDs logically - this is fine because we're dealing
with normal XIDs (because it's XIDs assigned to backends) and from the
same wraparound epoch (otherwise the backends could not be running at
the same time on the primary node). So there are no problems with the
triangle inequality, which is why xidComparator compares raw values.

Patch by me. Investigation and root cause analysis by Abhijit Menon-Sen.

This issue is present in all releases since 9.4; however, releases up to
9.6 are already EOL, so backpatch to 10 only.

Reviewed-by: Abhijit Menon-Sen
Reviewed-by: Alvaro Herrera
Backpatch-through: 10
---
 src/backend/storage/ipc/procarray.c |  7 ++-
 src/backend/utils/adt/xid.c | 26 ++
 src/include/utils/builtins.h|  1 +
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/procarray.c 

Re: Parameter for planner estimate of recursive queries

2022-01-22 Thread Kenaniah Cerny
> The factor 10 should not be hardcoded in the planner, but should be
settable, just as cursor_tuple_fraction is.

I feel considerably out of my depth here, but I like the idea of a working
table size multiplier GUC, given the challenges of predicting the number of
iterations (and any adjustments to cardinality per iteration). An
exponential cost function may lead to increasingly pathological outcomes,
especially when estimates for cte_rows are off.

In the EXPLAINs, it looked like the estimates for knows_pkey were off by an
order of magnitude or so. It's possible the planner would have chosen the
Nested Loop plan if knows_pkey had estimated to rows=87 (as the WindowAgg
would have estimated to roughly the same size as the second plan anyways,
even with rel->tuples = 10 * cte_rows).

I also wonder if there's a better default than cte_rows * 10, but
implementing a new GUC sounds like a reasonable solution to this as well.
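
(For illustration, a GUC-based multiplier could end up looking roughly like
the snippet below in the planner's worktable size estimate. The GUC name and
placement are my guesses, not the proposed patch:)

	/* hypothetical GUC replacing the hardcoded factor of 10 */
	double		recursive_worktable_factor = 10.0;

	/*
	 * Sketch: estimate the average worktable size as a configurable multiple
	 * of the non-recursive term's row estimate, never dropping below one row.
	 */
	static double
	estimate_worktable_rows(double cte_rows)
	{
		double		rows = recursive_worktable_factor * cte_rows;

		return (rows < 1.0) ? 1.0 : rows;
	}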

--

Kenaniah

On Sat, Jan 22, 2022 at 1:58 PM Simon Riggs 
wrote:

> On Wed, 27 Oct 2021 at 15:58, Simon Riggs 
> wrote:
>
> > The poor performance is traced to the planner cost estimates for
> > recursive queries. Specifically, the cost of the recursive arm of the
> > query is evaluated based upon both of these hardcoded assumptions:
> >
> > 1. The recursion will last for 10 loops
> > 2. The average size of the worktable will be 10x the size of the
> > initial query (non-recursive term).
> >
> > Taken together these assumptions lead to a very poor estimate of the
> > worktable activity (in this case), which leads to the plan changing as
> > a result.
> >
> > The factor 10 is a reasonably safe assumption and helps avoid worst
> > case behavior in bigger graph queries. However, the factor 10 is way
> > too large for many types of graph query, such as where the path
> > through the data is tight, and/or the query is written to prune bushy
> > graphs, e.g. shortest path queries. The factor 10 should not be
> > hardcoded in the planner, but should be settable, just as
> > cursor_tuple_fraction is.
>
> If you think this should be derived without parameters, then we would
> want a function that starts at 1 for 1 input row and gets much larger
> for larger input. The thinking here is that Graph OLTP is often a
> shortest path between two nodes, whereas for Graph Analytics the
> worktable will get much bigger.
>
> So I'm thinking we use
>
> rel->tuples = min(1, cte_rows * cte_rows/2);
>
> --
> Simon Riggs    http://www.EnterpriseDB.com/
>
>
>
>
>


Re: Warning in geqo_main.c from clang 13

2022-01-22 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> Clang 13 on my machine and peripatus (but not Apple clang 13 on eg
>> sifika, I'm still confused about Apple's versioning but I think that's
>> really llvm 12-based) warns:
>> geqo_main.c:86:8: warning: variable 'edge_failures' set but not used
>> [-Wunused-but-set-variable]
>> Here's one way to silence it.

> I'm kind of inclined to just drop the edge_failures recording/logging
> altogether, rather than make that rats-nest of #ifdefs even worse.
> It's not like anyone has cared about that number in the last decade
> or two.

We're starting to see more buildfarm animals producing this warning,
so I took another look, and thought of a slightly less invasive way to
silence it.  I confirmed this works with clang 13.0.0 on Fedora 35.

regards, tom lane

diff --git a/src/backend/optimizer/geqo/geqo_main.c b/src/backend/optimizer/geqo/geqo_main.c
index 333a26a695..68ed74325c 100644
--- a/src/backend/optimizer/geqo/geqo_main.c
+++ b/src/backend/optimizer/geqo/geqo_main.c
@@ -227,12 +227,17 @@ geqo(PlannerInfo *root, int number_of_rels, List *initial_rels)
 	}
 
 
-#if defined(ERX) && defined(GEQO_DEBUG)
+#if defined(ERX)
+#if defined(GEQO_DEBUG)
 	if (edge_failures != 0)
 		elog(LOG, "[GEQO] failures: %d, average: %d",
 			 edge_failures, (int) number_generations / edge_failures);
 	else
 		elog(LOG, "[GEQO] no edge failures detected");
+#else
+	/* suppress variable-set-but-not-used warnings from some compilers */
+	(void) edge_failures;
+#endif
 #endif
 
 #if defined(CX) && defined(GEQO_DEBUG)


Re: MDAM techniques and Index Skip Scan patch

2022-01-22 Thread Dmitry Dolgov
> On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote:
> Hi,
>
> On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote:
> >
> > FYI, I've attached this thread to the CF item as an informational one,
> > but as there are some patches posted here, folks may get confused. For
> > those who have landed here with no context, I feel obliged to mention
> > that now there are two alternative patch series posted under the same
> > CF item:
> >
> > * the original one lives in [1], waiting for reviews since the last May
> > * an alternative one posted here from Floris
>
> Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed.
> It's nice to have the alternative approach threads linkied in the commit fest,
> but it seems that the cfbot will use the most recent attachments as the only
> patchset, thus leaving the "original" one untested.
>
> I'm not sure of what's the best approach in such situation.  Maybe creating a
> different CF entry for each alternative, and link the other cf entry on the cf
> app using the "Add annotations" or "Links" feature rather than attaching
> threads?

I don't mind having all of the alternatives under the same CF item; only
one being tested seems to be a small limitation of cfbot.




Re: Index Skip Scan (new UniqueKeys)

2022-01-22 Thread Dmitry Dolgov
> On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote:
> Hi,
>
> Here is another take on the patch with a couple of changes:
>
> * I've removed for now UniqueKeys parts. The interaction of skip scan &
>   unique keys patch was actually not that big, so the main difference is
>   that now the structure itself went away, a list of unique expressions
>   is used instead. All the suggestions about how this feature should
>   look like from the planning perspective are still there. On the one
>   hand it will allow to develop both patches independently and avoid
>   confusion for reviewers, on the other UniqueKeys could be easily
>   incorporated back when needed.
>
> * Support for skipping in case of moving backward on demand (scroll
>   cursor) is moved into a separate patch. This is implemented via
>   returning false from IndexSupportsBackwardScan in case if it's a skip
>   scan node, which in turn adds Materialize node on top when needed. The
>   name SupportsBackwardScan was a bit confusing for me, but it seems
>   it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL.
>   Eventually those cases when BackwardScanDirection is used are still
>   handled by amskip. This change didn't affect the test coverage, all
>   the test cases supported in previous patch versions are still there.
>
>   About Materialize node, I guess it could be less performant than a
>   "native" support, but it simplifies the implementation significantly
>   to the point that most parts, which were causing questions before, are
>   now located in the isolated patch. My idea here is to concentrate
>   efforts on the first three patches in this series, and consider the
>   rest of them as an experiment field.
>
> * IndexScan support was also relocated into a separate patch, the first
>   three patches are now only about IndexOnlyScan.
>
> * Last bits of reviews were incorporated and rebased.

While the patch is still waiting for a review, I was motivated by the
thread [1] to think about it from the interface point of view. Consider
an index skip scan as being just like a normal index scan with a set of
underspecified leading search keys. It makes sense to have the same
structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm
not sure if amskip is a good name to represent that, I don't have
anything better yet). But the "underspecified" part is currently indeed
interpreted in a limited way -- as "missing" keys -- and is expressed
only via the prefix size. Another option would be e.g. leading keys
constrained by a range of values, so generally speaking it makes sense
to extend the amount of information provided for skipping.

As a naive approach I've added a new patch into the series, containing
the extra data structure (ScanLooseKeys, doesn't have much meaning yet
except somehow representing keys for skipping) used for index skip scan.
Any thoughts about it?
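
(Purely to make that concrete -- the struct in the patch may look nothing
like this -- a "loose key" could describe more than a bare prefix length,
e.g. optional range bounds on a leading attribute. Invented sketch:)

	typedef struct ScanLooseKeySketch
	{
		AttrNumber	attno;			/* leading index attribute constrained */
		bool		skip_distinct;	/* skip to the next distinct value? */
		bool		have_lower;		/* is lower_bound valid? */
		bool		have_upper;		/* is upper_bound valid? */
		ScanKeyData lower_bound;	/* optional lower range bound */
		ScanKeyData upper_bound;	/* optional upper range bound */
	} ScanLooseKeySketch;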

Besides that, the new patch version contains some cleanup and
addresses comments about leaf page pinning from [1]. The idea
behind the series structure is still the same: the first three patches
contain the essence of the implementation (hoping to help concentrate
review), the rest are more "experimental".

[1]: 
https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com
>From 8b61823e523ceecbb2fd6faa1a229038bb981594 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v40 1/6] Unique expressions

Extend the PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique at the query
level, and allows leveraging this at the path level. At the moment
only distinctClause makes use of this mechanism, which enables potential
use of index skip scan.

Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan; contains a few bits from previous versions
by Jesper Pedersen and Floris Van Nee.
---
 src/backend/nodes/list.c| 31 +
 src/backend/optimizer/path/Makefile |  3 +-
 src/backend/optimizer/path/pathkeys.c   | 62 +
 src/backend/optimizer/path/uniquekeys.c | 92 +
 src/backend/optimizer/plan/planner.c| 36 +-
 src/backend/optimizer/util/pathnode.c   | 32 ++---
 src/include/nodes/pathnodes.h   |  5 ++
 src/include/nodes/pg_list.h |  1 +
 src/include/optimizer/pathnode.h|  1 +
 src/include/optimizer/paths.h   |  9 +++
 10 files changed, 261 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index f843f861ef..a53a50f372 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1653,3 +1653,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
 		return 1;
 	return 0;
 }
+
+/*
+ * Return true iff every 

Re: CREATEROLE and role ownership hierarchies

2022-01-22 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Jan 4, 2022, at 12:47 PM, Joshua Brindle 
> >  wrote:
> > 
> >> I was able to reproduce that using REASSIGN OWNED BY to cause a user to 
> >> own itself.  Is that how you did it, or is there yet another way to get 
> >> into that state?
> > 
> > I did:
> > ALTER ROLE brindle OWNER TO brindle;
> 
> Ok, thanks.  I have rebased, fixed both REASSIGN OWNED BY and ALTER ROLE .. 
> OWNER TO cases, and added regression coverage for them.
> 
> The last patch set to contain significant changes was v2, with v3 just being 
> a rebase.  Relative to those sets:
> 
> 0001 -- rebased.
> 0002 -- rebased; extend AlterRoleOwner_internal to disallow making a role its 
> own immediate owner.
> 0003 -- rebased; extend AlterRoleOwner_internal to disallow cycles in the 
> role ownership graph.
> 0004 -- rebased.
> 0005 -- new; removes the broken pg_auth_members.grantor field.

> Subject: [PATCH v4 1/5] Add tests of the CREATEROLE attribute.

No particular issue with this one.

> Subject: [PATCH v4 2/5] Add owners to roles
> 
> All roles now have owners.  By default, roles belong to the role
> that created them, and initdb-time roles are owned by POSTGRES.

... database superuser, not 'POSTGRES'.

> +++ b/src/backend/catalog/aclchk.c
> @@ -5430,6 +5434,57 @@ pg_statistics_object_ownercheck(Oid stat_oid, Oid 
> roleid)
>   return has_privs_of_role(roleid, ownerId);
>  }
>  
> +/*
> + * Ownership check for a role (specified by OID)
> + */
> +bool
> +pg_role_ownercheck(Oid role_oid, Oid roleid)
> +{
> + HeapTuple   tuple;
> + Form_pg_authid  authform;
> + Oid owner_oid;
> +
> + /* Superusers bypass all permission checking. */
> + if (superuser_arg(roleid))
> + return true;
> +
> + /* Otherwise, look up the owner of the role */
> + tuple = SearchSysCache1(AUTHOID, ObjectIdGetDatum(role_oid));
> + if (!HeapTupleIsValid(tuple))
> + ereport(ERROR,
> + (errcode(ERRCODE_UNDEFINED_OBJECT),
> +  errmsg("role with OID %u does not exist",
> + role_oid)));
> + authform = (Form_pg_authid) GETSTRUCT(tuple);
> + owner_oid = authform->rolowner;
> +
> + /*
> +  * Roles must necessarily have owners.  Even the bootstrap user has an
> +  * owner.  (It owns itself).  Other roles must form a proper tree.
> +  */
> + if (!OidIsValid(owner_oid))
> + ereport(ERROR,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> +  errmsg("role \"%s\" with OID %u has invalid 
> owner",
> + authform->rolname.data, 
> authform->oid)));
> + if (authform->oid != BOOTSTRAP_SUPERUSERID &&
> + authform->rolowner == authform->oid)
> + ereport(ERROR,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> +  errmsg("role \"%s\" with OID %u owns itself",
> + authform->rolname.data, 
> authform->oid)));
> + if (authform->oid == BOOTSTRAP_SUPERUSERID &&
> + authform->rolowner != BOOTSTRAP_SUPERUSERID)
> + ereport(ERROR,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> +  errmsg("role \"%s\" with OID %u owned by role 
> with OID %u",
> + authform->rolname.data, 
> authform->oid,
> + authform->rolowner)));
> + ReleaseSysCache(tuple);
> +
> + return (owner_oid == roleid);
> +}

Do we really need all of these checks on every call of this function..?
Also, there isn't much point in including the role OID twice in the last
error message, is there?  Unless things have gotten quite odd, it's
goint to be the same value both times as we just proved to ourselves
that it is, in fact, the same value (and that it's not the
BOOTSTRAP_SUPERUSERID).

This function also doesn't actually do any kind of checking to see if
the role ownership forms a proper tree, so it seems a bit odd to have
the comment talking about that here where it's doing other checks.
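
(To make that concrete, a version with only the lookup -- leaving any
ownership-graph validation to wherever ownership is actually assigned --
could be about this small. Just a sketch of the suggestion, not a
replacement hunk:)

	bool
	pg_role_ownercheck(Oid role_oid, Oid roleid)
	{
		HeapTuple	tuple;
		Oid			owner_oid;

		/* Superusers bypass all permission checking. */
		if (superuser_arg(roleid))
			return true;

		/* Otherwise, look up the owner of the role. */
		tuple = SearchSysCache1(AUTHOID, ObjectIdGetDatum(role_oid));
		if (!HeapTupleIsValid(tuple))
			ereport(ERROR,
					(errcode(ERRCODE_UNDEFINED_OBJECT),
					 errmsg("role with OID %u does not exist", role_oid)));
		owner_oid = ((Form_pg_authid) GETSTRUCT(tuple))->rolowner;
		ReleaseSysCache(tuple);

		return (owner_oid == roleid);
	}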

> +++ b/src/backend/commands/user.c
> @@ -77,6 +79,9 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
>   Datum   new_record[Natts_pg_authid];
>   boolnew_record_nulls[Natts_pg_authid];
>   Oid roleid;
> + Oid owner_uid;
> + Oid saved_uid;
> + int save_sec_context;

Seems a bit odd to introduce 'uid' into this file, which hasn't got any
such anywhere in it, and I'm not entirely sure that any of these are
actually needed..?

> @@ -108,6 +113,16 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
>   DefElem*dvalidUntil = NULL;
>   

Re: autovacuum prioritization

2022-01-22 Thread Peter Geoghegan
On Thu, Jan 20, 2022 at 4:43 PM Robert Haas  wrote:
> > Since we now have the failsafe, the scheduling algorithm can afford to
> > not give too much special attention to table age until we're maybe
> > over the 1 billion age mark -- or even 1.5 billion+. But once the
> > scheduling stuff starts to give table age special attention, it should
> > probably become the dominant consideration, by far, completely
> > drowning out any signals about bloat. It's kinda never really supposed
> > to get that high, so when we do end up there it is reasonable to fully
> > freak out. Unlike the bloat criteria, the wraparound safety criteria
> > doesn't seem to have much recognizable space between not worrying at
> > all, and freaking out.
>
> I do not agree with all of this. First, on general principle, I think
> sharp edges are bad. If a table had priority 0 for autovacuum 10
> minutes ago, it can't now have priority one million bazillion. If
> you're saying that the priority of wraparound needs to, in the limit,
> become higher than any bloat-based priority, that is reasonable.

I'm definitely saying considerations about wraparound need to swamp
everything else out at the limit. But I'm also making the point that
(at least with the ongoing relfrozenxid/freezing work) the system does
remarkably well at avoiding all aggressive anti-wraparound VACUUMs in
most individual tables, with most workloads. And so having an
aggressive anti-wraparound VACUUM at all now becomes a pretty strong
signal.
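
(As an aside, a toy scoring function with that property -- the age term
contributing almost nothing until relfrozenxid age nears the danger zone,
then steeply dominating any bloat-based score -- might look like the
following. Every name and constant here is invented, purely to illustrate
the shape being debated:)

	static double
	toy_autovacuum_priority(double bloat_fraction, double xid_age,
							double freeze_max_age)
	{
		double		age_frac = xid_age / freeze_max_age;
		double		age_term = 0.0;

		if (age_frac > 0.5)
		{
			/* cubic ramp: negligible at first, steep as age_frac nears 2.0 */
			double		x = (age_frac - 0.5) / 1.5;

			age_term = x * x * x * 1000.0;
		}

		return bloat_fraction + age_term;
	}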

As we discussed on the other thread recently, you're still only going
to get anti-wraparound VACUUMs in a minority of tables with the
patches in place -- for tables that won't ever get an autovacuum for
any other reason. And so having an anti-wraparound probably just
signals that we have such a table, which is totally inconsequential.
But what about when there is an anti-wraparound VACUUM (or a need for
one) on a table whose age is already (say) 2x  the value of
autovacuum_freeze_max_age? That really is an incredibly strong signal
that something is very much amiss. Since the relfrozenxid/freezing
patch series actually makes each VACUUM able to advance relfrozenxid
in a way that's really robust when the system is not under great
pressure, the failure of that strategy becomes a really strong signal.

So it's not that table age signals something that we can generalize
about too much, without context. The context is important. The
relationship between table age and autovacuum_freeze_max_age with the
new strategy from my patch series becomes an important negative
signal, about something that we reasonably expected to be quite stable
not actually being stable.

(Sorry to keep going on about my work, but it really seems relevant.)

> Also, it's worth keeping in mind that waiting longer to freak out is
> not necessarily an advantage. It may well be that the only way the
> problem will ever get resolved is by human intervention - going in and
> fixing whatever dumb thing somebody did - e.g. resolving the pending
> prepared transaction.

In that case we ought to try to alert the user earlier.

> Those are fair concerns. I assumed that if we knew the number of pages
> in the index, which we do, it wouldn't be too hard to make an estimate
> like this ... but you know more about this than I do, so tell me why
> you think that won't work. It's perhaps worth noting that even a
> somewhat poor estimate could be a big improvement over what we have
> now.

I can construct a plausible, totally realistic counter-example that
breaks a heuristic like that, unless it focuses on extremes only, like
no index growth at all since the last VACUUM (which didn't leave
behind any deleted pages). I think that such a model can work well,
but only if it's designed to matter less and less as our uncertainty
grows. It seems as if the uncertainty grows very sharply, once you
begin to generalize past the extremes.

We have to be totally prepared for the model to be wrong, except
perhaps as a way of prioritizing things when there is real urgency,
and we don't have a choice about choosing. All models are wrong, some
are useful.

> The problem that I'm principally concerned about here is the case
> where somebody had a system that was basically OK and then at some
> point, bad things started to happen.

It seems necessary to distinguish between the case where things really
were okay for a time, and the case where they merely appeared to be
okay to somebody whose understanding of the system isn't impossibly
deep and sophisticated. You'd have to be an all-knowing oracle to be
able to tell the difference, because the system itself has no
sophisticated notion of how far it is into debt. There are things that
we can do to address this gap directly (that's what I have been doing
myself), but that can only go so far.

ISTM that the higher the amount of debt that the system is actually
in, the greater the uncertainty about the total amount of debt. In
other words, the advantage of 

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-22 Thread Noah Misch
On Sun, Jan 16, 2022 at 01:02:41PM -0800, Noah Misch wrote:
> My next steps:
> 
> - Report a Debian bug for the sparc64+ext4 zeros problem.

(Not done yet.)

> - Try to falsify the idea that "write only the not-already-written portion of
>   a WAL block" is an effective workaround.  Specifically, modify the test
>   program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k]
>   while the reader process reads offset N.  If the reader sees a zero, that
>   workaround is ineffective.

The reader did not see a zero.  In addition to bytes outside the write being
immune to the zeros bug, the first and last forty bytes of a write were immune
to the zeros bug.

> - Implement the workaround, if I didn't falsify its effectiveness.  If it
>   doesn't hurt performance on x86_64, we can use it unconditionally.
>   Otherwise, limit its use to sparc64 Linux.

Attached.  With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl.  Without
the patch, it failed many times, always within 1.3hr.  For easier review, this
patch uses the new behavior on all platforms.  Before commit and back-patch, I
plan to limit use of the new behavior to sparc Linux.  Future work can
benchmark the new behavior and, if it performs well, make it unconditional in
v15+.  I would expect performance to be unchanged or slightly better, because
the new behavior requests less futile work from the OS.
Author: Noah Misch 
Commit: Noah Misch 

On sparc Linux, write each WAL byte just once.

On sparc64, an ext4 filesystem bug can make readers see zeros if another
process simultaneously wrote the same offsets.  Per buildfarm member
kittiwake, this could make a streaming standby fail to advance,
reporting corrupt WAL.  It could make COMMIT PREPARED fail with "could
not read two-phase state from WAL".  Non-WAL PostgreSQL reads haven't
been affected, likely because those readers and writers have buffering
systems in common.  Fix this by writing each WAL byte just once, instead
of rewriting every byte of a page whenever writing some new byte of that
page.  This problem is not new, but buildfarm failures became frequent
when commits 3cd9c3b921977272e6650a5efbeade4203c4bca2 and
f47ed79cc8a0cfa154dc7f01faaf59822552363f added tests with concurrency.
Back-patch to v10 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e073121..68fe510 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -130,6 +130,19 @@ int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
 #define NUM_XLOGINSERT_LOCKS  8
 
 /*
+ * Should XLogWrite() skip the previously-written prefix of the block when
+ * LogwrtResult falls in the middle of a block, or should it write whole
+ * blocks always?  The latter is traditional PostgreSQL behavior, but the
+ * former avoids a sparc64+ext4 bug extant as of Linux 5.15.5:
+ * https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com
+ */
+#if defined(__sparc__) && defined(__linux__)
+#define SKIP_PREVIOUSLY_WRITTEN 1
+#else
+#define SKIP_PREVIOUSLY_WRITTEN 1  /* FIXME change to 0 before commit */
+#endif
+
+/*
  * Max distance from last checkpoint, before triggering a new xlog-based
  * checkpoint.
  */
@@ -2468,6 +2481,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
boollast_iteration;
boolfinishing_seg;
int curridx;
+   Sizeskip;
int npages;
int startidx;
uint32  startoffset;
@@ -2483,12 +2497,15 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
/*
 * Since successive pages in the xlog cache are consecutively allocated,
 * we can usually gather multiple pages together and issue just one
-* write() call.  npages is the number of pages we have determined can be
-* written together; startidx is the cache block index of the first one,
-* and startoffset is the file offset at which it should go. The latter
-* two variables are only valid when npages > 0, but we must initialize
-* all of them to keep the compiler quiet.
-*/
+* write() call.  skip is the already-written portion of the first page.
+* npages is the number of pages we have determined can be written
+* together; startidx is the cache block index of the first one, and
+* startoffset is the file offset at which the first byte (possibly
+* mid-page) should go. The latter two variables are only valid when
+* npages > 0, but we must initialize all of them to keep the compiler
+* quiet.
+*/
+   skip = 0;
npages = 0;
startidx = 0;
startoffset = 

Re: relcache not invalidated when ADD PRIMARY KEY USING INDEX

2022-01-22 Thread Tom Lane
"Euler Taveira"  writes:
> On Mon, Dec 20, 2021, at 3:45 AM, houzj.f...@fujitsu.com wrote:
>> When reviewing some replica identity related patches, I found that when 
>> adding
>> primary key using an existing unique index on not null columns, the
>> target table's relcache won't be invalidated.

> Good catch.

Indeed.

> It seems you can simplify this by checking indexForm->indisprimary directly, no?

The logic seems fine to me --- it avoids an unnecessary cache flush
if the index was already the pkey.  (Whether we actually reach this
code in such a case, I dunno, but it's not costing much to be smart
if we are.)

> Why did you add new tests for test_decoding? I think the TAP tests alone are
> fine. BTW, this test is similar to publisher3/subscriber3. Isn't it better to
> use the same pub/sub to reduce the test execution time?

I agree, the proposed test cases are expensive overkill.  The repro
shown in the original message is sufficient and much cheaper.
Moreover, we already have some test cases very much like that in
regress/sql/publication.sql, so that's where I put it.

Pushed with some minor cosmetic adjustments.

regards, tom lane




pg_upgrade/test.sh and v9.5

2022-01-22 Thread Justin Pryzby
In fa66b6dee, Michael fixed test.sh to work back to v11, so I suppose nobody is
trying to run it with older versions, as I was endeavoring to do.

With the attached patch, I'm able to test upgrades back to v9.6.

In 9.5, there are regression diffs from CONTEXT lines from non-error messages,
which is a v9.5 change (0426f349e).  The first "make check" fails before even
getting to the upgrade part:

|  NOTICE:  trigger_func(before_ins_stmt) called: action = INSERT, when = 
BEFORE, level = STATEMENT
|- CONTEXT:  SQL statement "INSERT INTO main_table VALUES (NEW.a, NEW.b)"
|- PL/pgSQL function view_trigger() line 17 at SQL statement

I tried a lot of things but couldn't find one that worked.  I just tried this,
which allows the "make check" to pass, but then fails due missing symbols in
libpq during the upgrade phase.  Maybe I'm missing something - Tom must have
tested psql against old versions somehow before de-supporting old versions (but
maybe not like this).

| time make check -C src/bin/pg_upgrade oldsrc=`pwd`/new/95 
oldbindir=`pwd`/new/95/tmp_install/usr/local/pgsql/bin 
with_temp_install="LD_LIBRARY_PATH=`pwd`/new/95/tmp_install/usr/local/pgsql/lib"

I tried installcheck, but then that fails because psql doesn't accept multiple
-c options (it runs the final -c command only).

| EXTRA_REGRESS_OPTS="--bindir `pwd`/new/95/tmp_install/usr/local/pgsql/bin" 
LD_LIBRARY_PATH=`pwd`/new/95/tmp_install/usr/local/pgsql/lib PGHOST=/tmp time 
make installcheck
| ...
| == creating database "regression" ==
| ERROR:  database "regression" does not exist
| STATEMENT:  ALTER DATABASE "regression" SET lc_messages TO 'C';ALTER DATABASE 
"regression" SET lc_monetary TO 'C';ALTER DATABASE "regression" SET lc_numeric 
TO 'C';ALTER DATABASE "regression" SET lc_time TO 'C';ALTER DATABASE 
"regression" SET bytea_output TO 'hex';ALTER DATABASE "regression" SET 
timezone_abbreviations TO 'Default';
| ERROR:  database "regression" does not exist
| command failed: 
"/home/pryzbyj/src/postgres/new/95/tmp_install/usr/local/pgsql/bin/psql" -X -c 
"CREATE DATABASE \"regression\" TEMPLATE=template0" -c "ALTER DATABASE 
\"regression\" SET lc_messages TO 'C';ALTER DATABASE \"regression\" SET 
lc_monetary TO 'C';ALTER DATABASE \"regression\" SET lc_numeric TO 'C';ALTER 
DATABASE \"regression\" SET lc_time TO 'C';ALTER DATABASE \"regression\" SET 
bytea_output TO 'hex';ALTER DATABASE \"regression\" SET timezone_abbreviations 
TO 'Default';" "postgres"

pg_regress was changed to do that recently:

commit f45dc59a38cab1d2af6baaedb79559fe2e9b3781
Author: Tom Lane 
Date:   Wed Oct 20 18:44:37 2021 -0400

Improve pg_regress.c's infrastructure for issuing psql commands.

-- 
Justin
>From ccc113ccaef45b09077faa250e126e7c480324d7 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 2 Jan 2022 15:00:19 -0600
Subject: [PATCH] wip: test.sh: allow pg_upgrade check from v10 and earlier

---
 src/bin/pg_upgrade/test.sh   | 32 +---
 src/bin/pg_upgrade/upgrade_adapt.sql |  8 +++
 2 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ef328b3062f..c0a569ed6e3 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -24,7 +24,12 @@ standard_initdb() {
 	# without increasing test runtime, run these tests with a custom setting.
 	# Also, specify "-A trust" explicitly to suppress initdb's warning.
 	# --allow-group-access and --wal-segsize have been added in v11.
-	"$1" -N --wal-segsize 1 --allow-group-access -A trust
+	if [ "$newsrc" = "$oldsrc" ]; then
+		"$@" -N -A trust --wal-segsize 1 --allow-group-access
+	else
+		"$@" -N -A trust
+	fi
+
 	if [ -n "$TEMP_CONFIG" -a -r "$TEMP_CONFIG" ]
 	then
 		cat "$TEMP_CONFIG" >> "$PGDATA/postgresql.conf"
@@ -82,12 +87,25 @@ mkdir "$temp_root"
 oldsrc=`cd "$oldsrc" && pwd`
 newsrc=`cd ../../.. && pwd`
 
+# See similar logic in pg_upgrade/exec.c
+#oldpgversion0=`"$oldbindir"/pg_ctl --version`
+#oldpgversion=`echo "$oldpgversion0" |awk 'NF>=3 && $3~/^[0-9]+(\.[0-9]+|devel$)/{split($3, a, "."); print int(a[1])}'`
+#[ -n "$oldpgversion" ] || {
+	#echo "Could not determine version of old cluster: $oldpgversion0";
+	#exit 1
+#}
+oldpgversion=`grep '#define PG_VERSION_NUM' "$oldsrc"/src/include/pg_config.h | awk '{print $3}'`
+
 # We need to make pg_regress use psql from the desired installation
 # (likely a temporary one), because otherwise the installcheck run
 # below would try to use psql from the proper installation directory
 # of the target version, which might be outdated or not exist. But
 # don't override anything else that's already in EXTRA_REGRESS_OPTS.
-EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --bindir='$oldbindir'"
+if [ "$oldpgversion" -ge 90500 ]; then
+	EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --bindir='$oldbindir'"
+else
+	EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --psqldir='$bindir'"
+fi
 export EXTRA_REGRESS_OPTS
 
 # While in normal cases 

Re: How to get started with contribution

2022-01-22 Thread Stephen Frost
Greetings,

* vrund shah (vrund3...@gmail.com) wrote:
> Thank you for your valuable guidance.
> I will surely look at the links and if have any queries then I will contact
> you.

On these mailing lists, we prefer that you reply 'in-line', as I'm doing
here, and not use 'top-posting' (as you did in your replies).

* vrund shah (vrund3...@gmail.com) wrote:
> I am already using PostgreSQL for my college purpose and for learning SQL.
> I have learned SQL from udemy courses with instructor Jose Portilla. and I
> am well aware of PostgreSQL and PGAdmin.

Great.  Being familiar with SQL will certainly help.

* vrund shah (vrund3...@gmail.com) wrote:
> This year I am planning to take part in GSOC 2022 in the PostgreSQL
> organization.

Glad to hear that.  Note that while we do intend to submit for GSoC
2022, there's no guarantee that we will be selected.

That said, this is a great way to get started.  If you already have a
project idea in mind, I encourage you to post to this list what that
idea is and ask for feedback.  If you don't have a project idea already
then you could review the project ideas page:

https://wiki.postgresql.org/wiki/GSoC_2022

Note that the current page lists projects from last year and will
continue to be updated between now and the GSoC 2022 organization
submission deadline.  Still, hopefully reviewing the ones there will
give you some thoughts about what you might be interested in working on.

Thanks,

Stephen




Re: Skipping logical replication transactions on subscriber side

2022-01-22 Thread David G. Johnston
On Sat, Jan 22, 2022 at 9:21 AM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Sat, Jan 22, 2022 at 2:41 AM Amit Kapila 
> wrote:
>
>>
>> > Additionally, the description for pg_stat_subscription_workers should
>> describe what happens once the transaction represented by last_error_xid
>> has either been successfully processed or skipped.  Does this "last error"
>> stick around until another error happens (which is hopefully very rare) or
>> does it reset to blanks?
>> >
>>
>> It will be reset only on subscription drop, otherwise, it will stick
>> around until another error happens.
>
>
> I really dislike the user experience this provides, and given it is new in
> v15 (and right now this table seems to exist solely to support this
> feature) changing this seems within the realm of possibility. I have to
> imagine these workers have a sense of local state that would just be "no
> errors, no need to touch pg_stat_subscription_workers at the end of this
> transaction's commit".  It would save a local state of the error_xid and if
> a successfully committed transaction has that xid it would clear the
> error.  The skip code path would also check for and see the matching xid
> value and clear the error.  Even if the local state thing doesn't work, one
> catalog lookup per transaction seems like potentially reasonable overhead
> to incur here.
>
>
It shouldn't even need to be that overhead intensive.  Once an error is
encountered the system stops.  By construction it must be told to redo, at
which point the information about "last error" is no longer relevant and
can be removed (for skipping the user/system will have already done
everything with the xid that is needed before the redo is issued).  In the
steady-state it then is simply empty until a new error arises at which
point it becomes populated again; and stays that way until the system goes
into redo mode as instructed by the user via one of several methods.

David J.


Re: Skipping logical replication transactions on subscriber side

2022-01-22 Thread David G. Johnston
On Sat, Jan 22, 2022 at 2:41 AM Amit Kapila  wrote:

> On Sat, Jan 22, 2022 at 12:41 PM David G. Johnston
>  wrote:
> >
> > On Fri, Jan 21, 2022 at 10:30 PM Amit Kapila 
> wrote:
> >>
> >> On Fri, Jan 21, 2022 at 10:00 PM David G. Johnston
> >>  wrote:
> >> >
> >> > On Fri, Jan 21, 2022 at 4:55 AM Amit Kapila 
> wrote:
> >> >>
>
> >
> > I agree with incorporating "reset" into the paragraph somehow - does not
> have to mention NONE, just that ALTER SUBSCRIPTION SKIP (not a family
> friendly abbreviation...) is what does it.
> >
>
> It is not clear to me what you have in mind here but to me in this
> context saying "Setting NONE resets the transaction
> ID." seems quite reasonable.
>

OK

>
> Yeah, we will error in that situation. The only invalid values are
> system reserved values (1,2).
>

So long as the ALTER command errors when asked to skip those IDs there
isn't any reason for an end-user, who likely doesn't know or care that 1
and 2 are special, to be concerned about them (the only two invalid values)
while reading the docs.


> > Or, why even force them to specify a number instead of just saying SKIP
> and if there is a current error we skip its transaction, otherwise we warn
> them that nothing happened because there is no last error.
> >
>
> The idea is that we might extend this feature to skip specific
> operations on relations or maybe by having other identifiers.


Again, you've already got syntax reserved that lets you add more features
to this command in the future; and removing warnings or errors because new
features make them moot is easy.  Let's document and code what we are
willing to implement today.  A single top-level transaction xid that is
presently blocking the worker from applying any more WAL.

> One idea
> we discussed was to automatically fetch the last error xid but then
> decided it can be done as a later patch.
>

This seems backwards.  The user-friendly approach is to not make them type
in anything at all.  That said, this particular UX seems like it could use
some safety.  Thus I would propose at this time that attempting to set the
skip_option to anything but THE active_error_xid for the named subscription
results in an error.  Once you add new features the user can set the
skip_option to other things without provoking errors.  Again, I consider
this a safety feature since the user now has to accurately match the xid to
the name in the SQL in order to perform a successful skip - and the to-be
affected transaction has to be one that is preventing replication from
moving forward.  I'm not interested in providing a foot-gun where an
arbitrary future transaction can be scheduled to be skipped.  Running the
command twice with the same values should provoke an error since the first
run should be allowed to finish (?).  Also, we handle the situation where
the state of the worker changes between when the user saw the error and
wrote down the xid to skip and the actual execution of the alter command.
Maybe not highly anticipated scenarios but this is an easy win to deal with
them.


> > Additionally, the description for pg_stat_subscription_workers should
> describe what happens once the transaction represented by last_error_xid
> has either been successfully processed or skipped.  Does this "last error"
> stick around until another error happens (which is hopefully very rare) or
> does it reset to blanks?
> >
>
> It will be reset only on subscription drop, otherwise, it will stick
> around until another error happens.


I really dislike the user experience this provides, and given it is new in
v15 (and right now this table seems to exist solely to support this
feature) changing this seems within the realm of possibility. I have to
imagine these workers have a sense of local state that would just be "no
errors, no need to touch pg_stat_subscription_workers at the end of this
transaction's commit".  It would save a local state of the error_xid and if
a successfully committed transaction has that xid it would clear the
error.  The skip code path would also check for and see the matching xid
value and clear the error.  Even if the local state thing doesn't work, one
catalog lookup per transaction seems like potentially reasonable overhead
to incur here.

David J.


Re: fairywren is generating bogus BASE_BACKUP commands

2022-01-22 Thread Andrew Dunstan


On 1/21/22 18:04, Andres Freund wrote:
> Hi,
>
> On 2022-01-21 17:42:45 -0500, Tom Lane wrote:
>> Andrew Dunstan  writes:
>>> c.f. src/bin/pg_verifybackup/t/003_corruption.pl which says:
>>>     my $source_ts_prefix = $source_ts_path;
>>>     $source_ts_prefix =~ s!(^[A-Z]:/[^/]*)/.*!$1!;
>>>     ...
>>>     # See https://www.msys2.org/wiki/Porting/#filesystem-namespaces
>>>     local $ENV{MSYS2_ARG_CONV_EXCL} = $source_ts_prefix;
>>> Probably in this case just setting it to 'server:' would do the trick.
>> The point I was trying to make is that if we have to jump through
>> that sort of hoop in the test scripts, then real users are going
>> to have to jump through it as well, and they won't like that
>> (and we will get bug reports about it).  It'd be better to design
>> the option syntax to avoid such requirements.
> Normal users aren't going to invoke a "native" basebackup from inside msys. I
> assume the translation happens because an "msys world" perl invokes
> a "native" pg_basebackup via msys system(), right?  If pg_basebackup instead 
> is
> "normally" invoked from a windows terminal, or anything else "native" windows,
> the problem won't exist, no?
>
> As we're building a "native" postgres in this case, none of our tools should
> internally have such translations happening. So I don't think it'll be a huge
> issue for users themselves?


All true. This is purely an issue for our testing regime, and not for
end users.


>
> Not that I think that there are all that many users of mingw built postgres on
> windows... I think it's mostly msvc built postgres in that world?
>

The vast majority use the installer which is built with MSVC. Very few
in my experience build their own.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Document atthasmissing default optimization avoids verification table scan

2022-01-22 Thread David G. Johnston
On Saturday, January 22, 2022, James Coleman  wrote:

> On Sat, Jan 22, 2022 at 12:35 AM David G. Johnston
>  wrote:
> >
> > On Fri, Jan 21, 2022 at 5:14 PM James Coleman  wrote:
> >>
> >>
> >> > Really?  That's horrid, because that's directly useful advice.
> >>
> >> Remedied, but rewritten a bit to better fit with the new style/goal of
> >> that tip).
> >>
> >> Version 3 is attached.
> >>
> >
> > Coming back to this after a respite I think the tip needs to be moved
> just like everything else.  For much the same reason (though this may only
> be a personal bias), I know what SQL Commands do the various things that
> DDL encompasses (especially the basics like adding a column) and so the DDL
> section is really just a tutorial-like chapter that I will generally forget
> about because I will go straight to the official source which is the SQL
> Command Reference.  My future self would want the tip to show up there.  If
> we put the tip after the existing paragraph that starts: "Adding a column
> with a volatile DEFAULT or changing the type of an existing column..." the
> need to specify an example function in the tip goes away - though maybe it
> should be moved to the notes paragraph instead: "with a volatile DEFAULT
> (e.g., clock_timestamp()) or  changing the type of an existing column..."
>
> In my mind that actually might be a reason to keep it that way. I
> expect someone who's somewhat experienced to know there are things
> (like table rewrites and scans) you need to consider and therefore go
> to the ALTER TABLE page and read the details. But for someone newer
> the tutorial page needs to introduce them to the idea that those
> gotchas exist.
>
>
Readers of the DDL page are given a hint of the issues and directed to
additional, arguably mandatory, reading.  They need not worry about the
nuances during their learning phase and can instead defer that reading
until they actually need to alter a (large) table.  But expecting them
to read the command reference page is reasonable, and it is IMO the more
probable place they will look when they start doing stuff in earnest.  For
the inexperienced reader, breaking this up in this manner based upon depth
of detail feels right to me.

David J.


Re: row filtering for logical replication

2022-01-22 Thread Alvaro Herrera
On 2022-Jan-22, Amit Kapila wrote:

> CREATE TABLE parent (a int primary key, b int not null, c varchar)
> PARTITION BY RANGE(a);
> CREATE TABLE child PARTITION OF parent FOR VALUES FROM (0) TO (250);
> CREATE UNIQUE INDEX b_index on child(b);
> ALTER TABLE child REPLICA IDENTITY using INDEX b_index;
> 
> In this, the parent table's replica identity is the primary
> key(default) and the child table's replica identity is the b_index.

Why is the partition's replica identity different from its parent's?
Does that even make sense?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La verdad no siempre es bonita, pero el hambre de ella sí"




Re: Proposal: allow database-specific role memberships

2022-01-22 Thread Julien Rouhaud
Hi,

On Sat, Jan 22, 2022 at 05:28:05AM -0800, Kenaniah Cerny wrote:
> Thank you so much for the backtrace!
> 
> This latest patch should address this by moving the dumpRoleMembership call
> to before the pointer is closed.

Thanks!  The cfbot turned green since:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3374




Re: Proposal: allow database-specific role memberships

2022-01-22 Thread Kenaniah Cerny
Thank you so much for the backtrace!

This latest patch should address this by moving the dumpRoleMembership call to
before the pointer is closed.
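
In outline, the reordering amounts to something like this inside
dumpDatabases() (a sketch of the intent, not the exact hunk):

	/*
	 * Emit database-specific role memberships while OPF still points at the
	 * per-database output, and only then close it before invoking pg_dump.
	 */
	if (server_version >= 15)
		dumpRoleMembership(conn, dbid);		/* writes to OPF */

	if (filename)
		fclose(OPF);						/* reopened after pg_dump runs */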

On Sat, Jan 22, 2022 at 1:11 AM Julien Rouhaud  wrote:

> Hi,
>
> On Fri, Jan 21, 2022 at 07:01:21PM -0800, Kenaniah Cerny wrote:
> > Thanks for the feedback.
> >
> > I have attached an alternate version of the v5 patch that incorporates
> the
> > suggested changes to the documentation and DRYs up some of the acl.c code
> > for comparison. As for the databaseOid / InvalidOid parameter, I'm open
> to
> > any suggestions for how to make that even cleaner, but am currently at a
> > loss as to how that would look.
> >
> > CI is showing a failure to run pg_dump on just the Linux - Debian
> Bullseye
> > job (https://cirrus-ci.com/task/5265343722553344). Does anyone have any
> > ideas as to where I should look in order to debug that?
>
> Did you try to reproduce it on some GNU/Linux system?  FTR I did, and I get
> a segfault in pg_dumpall:
>
> (gdb) bt
> #0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
> #1  0x7f329e7e40cf in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
> #2  0x7f329e7987a2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
> #3  0x7f329e783449 in __GI_abort () at abort.c:79
> #4  0x7f329e7d85d8 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f329e90b6aa "%s\n") at ../sysdeps/posix/libc_fatal.c:155
> #5  0x7f329e7edcfa in malloc_printerr (str=str@entry=0x7f329e9092c3 "free(): invalid pointer") at malloc.c:5536
> #6  0x7f329e7ef504 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4327
> #7  0x7f329e7f1f81 in __GI___libc_free (mem=<optimized out>) at malloc.c:3279
> #8  0x7f329e7dbec5 in __GI__IO_free_backup_area (fp=fp@entry=0x561775f126c0) at genops.c:190
> #9  0x7f329e7db6af in _IO_new_file_overflow (f=0x561775f126c0, ch=-1) at fileops.c:758
> #10 0x7f329e7da7be in _IO_new_file_xsputn (n=2, data=<optimized out>, f=<optimized out>) at /usr/src/debug/sys-libs/glibc-2.34-r4/glibc-2.34/libio/libioP.h:947
> #11 _IO_new_file_xsputn (f=0x561775f126c0, data=<optimized out>, n=2) at fileops.c:1197
> #12 0x7f329e7cfd32 in __GI__IO_fwrite (buf=0x7ffc90bb0ac0, size=1, count=2, fp=0x561775f126c0) at /usr/src/debug/sys-libs/glibc-2.34-r4/glibc-2.34/libio/libioP.h:947
> #13 0x56177483c758 in flushbuffer (target=0x7ffc90bb0a90) at snprintf.c:310
> #14 0x56177483c4e8 in pg_vfprintf (stream=0x561775f126c0, fmt=0x561774840dec "\n\n", args=0x7ffc90bb0f00) at snprintf.c:259
> #15 0x56177483c5ce in pg_fprintf (stream=0x561775f126c0, fmt=0x561774840dec "\n\n") at snprintf.c:270
> #16 0x561774831893 in dumpRoleMembership (conn=0x561775f09600, databaseId=0x561775f152d2 "1") at pg_dumpall.c:991
> #17 0x561774832426 in dumpDatabases (conn=0x561775f09600) at pg_dumpall.c:1332
> #18 0x56177483049e in main (argc=3, argv=0x7ffc90bb1658) at pg_dumpall.c:596
>
> I didn't look in detail, but:
>
> @@ -1323,6 +1327,10 @@ dumpDatabases(PGconn *conn)
> exit_nicely(1);
> }
>
> +   /* Dump database-specific roles if server is running 15.0 or later */
> +   if (server_version >= 15)
> +   dumpRoleMembership(conn, dbid);
> +
>
> Isn't that trying to print to OPF after the possible fclose(OPF) a bit
> before, and before it's reopened?
>
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 1e65c426b28e..a0beec6136e6 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1642,11 +1642,10 @@ SCRAM-SHA-256$iteration count:
   
 
   
-   Because user identities are cluster-wide,
-   pg_auth_members
-   is shared across all databases of a cluster: there is only one
-   copy of pg_auth_members per cluster, not
-   one per database.
+   User identities are cluster-wide, but role memberships can be either
+   cluster-wide or database-specific (as specified by the value of the
+   dbid column).  The pg_auth_members
+   catalog is shared across all databases of a cluster.
   
 
   
@@ -1703,6 +1702,17 @@ SCRAM-SHA-256$iteration count:
roleid to others
   
  
+
+  
+  
+   dbid oid
+   (references pg_database.oid)
+  
+  
+   ID of the database that this membership is constrained to; zero if membership is cluster-wide
+  
+ 
+
 

   
diff --git a/doc/src/sgml/ref/grant.sgml b/doc/src/sgml/ref/grant.sgml
index a897712de2e5..98bcfed5f507 100644
--- a/doc/src/sgml/ref/grant.sgml
+++ b/doc/src/sgml/ref/grant.sgml
@@ -93,6 +93,7 @@ GRANT { USAGE | ALL [ PRIVILEGES ] }
 [ GRANTED BY role_specification ]
 
 GRANT role_name [, ...] TO role_specification [, ...]
+[ IN DATABASE database_name | IN CURRENT DATABASE ]
 [ WITH ADMIN OPTION ]
 [ GRANTED BY role_specification ]
 
@@ -243,7 +244,23 @@ GRANT role_name [, ...] TO GRANT command grants membership
in a role to one or more other roles.  
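
For anyone skimming the patch, the grammar added above would allow statements
along these lines (role and database names are hypothetical):

    GRANT reporting_role TO alice IN DATABASE sales_db;
    GRANT auditor_role TO bob IN CURRENT DATABASE WITH ADMIN OPTION;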

Re: Document atthasmissing default optimization avoids verification table scan

2022-01-22 Thread James Coleman
On Sat, Jan 22, 2022 at 12:35 AM David G. Johnston wrote:
>
> On Fri, Jan 21, 2022 at 5:14 PM James Coleman  wrote:
>>
>>
>> > Really?  That's horrid, because that's directly useful advice.
>>
>> Remedied, but rewritten a bit to better fit with the new style/goal of
>> that tip.
>>
>> Version 3 is attached.
>>
>
> Coming back to this after a respite I think the tip needs to be moved just 
> like everything else.  For much the same reason (though this may only be a 
> personal bias), I know what SQL Commands do the various things that DDL 
> encompasses (especially the basics like adding a column) and so the DDL 
> section is really just a tutorial-like chapter that I will generally forget 
> about because I will go straight to the official source which is the SQL 
> Command Reference.  My future self would want the tip to show up there.  If 
> we put the tip after the existing paragraph that starts: "Adding a column 
> with a volatile DEFAULT or changing the type of an existing column..." the 
> need to specify an example function in the tip goes away - though maybe it 
> should be moved to the notes paragraph instead: "with a volatile DEFAULT 
> (e.g., clock_timestamp()) or  changing the type of an existing column..."

In my mind that actually might be a reason to keep it that way. I
expect someone who's somewhat experienced to know there are things
(like table rewrites and scans) you need to consider and therefore go
to the ALTER TABLE page and read the details. But for someone newer
the tutorial page needs to introduce them to the idea that those
gotchas exist.
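
(As a concrete illustration of what the tip is pointing at: after a fast
column addition, the stored default is visible in the catalog -- the table
name here is made up --

    SELECT attname, atthasmissing, attmissingval
    FROM pg_attribute
    WHERE attrelid = 'big_table'::regclass AND attnum > 0;

whereas a volatile default such as clock_timestamp() forces the full table
rewrite the quoted paragraph warns about.)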

Thoughts?
James Coleman




Re: Logical replication timeout problem

2022-01-22 Thread Amit Kapila
On Fri, Jan 21, 2022 at 10:45 PM Fabrice Chapuis wrote:
>
> I kept your patch 0001 and added these two calls in the function
> WalSndUpdateProgress, without modifying WalSndKeepaliveIfNecessary; it works
> too.
> What do you think of this patch?
>

I think this will also work. Here, the point was just to check what the
exact problem is and what a possible approach to solve it looks like; the
actual patch might be different from these ideas. So, let me try to summarize
the problem and the possible approach so that others can also share their
opinions.

Here, the problem is that we don't send keep-alive messages for a long time
while processing large transactions during logical replication when we don't
send any data of those transactions (say, because the table modified in the
transaction is not published). We do try to send the keepalive if necessary
at the end of the transaction (via WalSndWriteData()), but by that time the
subscriber side can time out and exit.
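
(The subscriber-side timeout in play here is its wal_receiver_timeout; for
reference it can be checked on the subscriber with:

    SHOW wal_receiver_timeout;   -- default 1min
)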

Now, one idea to solve this problem could be that whenever we skip sending a
change we update the plugin progress via OutputPluginUpdateProgress() (for
the walsender, this invokes WalSndUpdateProgress), and there we try to
process replies and send a keepalive if necessary, just as we do when we send
data via OutputPluginWrite() (for the walsender, this invokes
WalSndWriteData). I don't know whether it is a good idea to invoke such a
mechanism for every change we skip, or whether we should do it only after
skipping some threshold of consecutive changes; I think the latter would be
preferred. Also, we might want to introduce a new parameter send_keep_alive
to this API for flexibility: we don't need to invoke this mechanism while we
are actually sending data, since in that case we just update the progress via
this API before sending.
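
To make that concrete, a rough sketch of the skip path in an output plugin
such as pgoutput could look like this (my illustration only, not the actual
patch; the threshold constant and function name are made up):

    #include "postgres.h"
    #include "replication/logical.h"

    #define SKIPPED_CHANGES_THRESHOLD 100

    static void
    update_progress_on_skipped_change(LogicalDecodingContext *ctx)
    {
        static int skipped_changes = 0;

        if (++skipped_changes < SKIPPED_CHANGES_THRESHOLD)
            return;
        skipped_changes = 0;

        /*
         * For the walsender this ends up in WalSndUpdateProgress(), which
         * could then process replies and send a keepalive if necessary --
         * possibly controlled by the new send_keep_alive flag mentioned
         * above.
         */
        OutputPluginUpdateProgress(ctx);
    }

The plugin would call this from its change callback whenever it decides not
to send anything for a change.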

Thoughts?

Note: I have added Simon and Petr J. to this thread as they introduced
the API OutputPluginUpdateProgress in commit 024711bb54 and know this
part of code/design well but ideas suggestions from everyone are
welcome.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2022-01-22 Thread Amit Kapila
On Sat, Jan 22, 2022 at 12:41 PM David G. Johnston wrote:
>
> On Fri, Jan 21, 2022 at 10:30 PM Amit Kapila  wrote:
>>
>> On Fri, Jan 21, 2022 at 10:00 PM David G. Johnston wrote:
>> >
>> > On Fri, Jan 21, 2022 at 4:55 AM Amit Kapila wrote:
>> >>
>> >> Apart from this, I have changed a few comments and ran pgindent. Do
>> >> let me know what you think of the changes?
>> >
>> >
>> > The paragraph describing ALTER SUBSCRIPTION SKIP seems unnecessarily 
>> > repetitive.  Consider:
>> > """
>> > Skips applying all changes of the specified remote transaction, whose 
>> > value should be obtained from pg_stat_subscription_workers.last_error_xid.
>> >
>>
>> Here, you can also say that the value can be found from server logs as well.
>
>
> subscriber's server logs, right?
>

Right.

>  I would agree that adding that for completeness is warranted.
>
>>
>> > Then change the subskipxid column description to be:
>> > """
>> > ID of the transaction whose changes are to be skipped.  It is 0 when there
>> > are no pending skips.  This is set by issuing ALTER SUBSCRIPTION SKIP and
>> > resets back to 0 when the identified transaction passes through the
>> > subscription stream and is successfully ignored.
>> > """
>> >
>>
>> Users can manually reset it by specifying NONE, so that should be
>> covered in the above text, otherwise, looks good.
>
>
> I agree with incorporating "reset" into the paragraph somehow - does not have 
> to mention NONE, just that ALTER SUBSCRIPTION SKIP (not a family friendly 
> abbreviation...) is what does it.
>

It is not clear to me what you have in mind here, but in this context saying
"Setting NONE resets the transaction ID." seems quite reasonable to me.

>>
>> > I don't understand why/how ", if a valid transaction ID;" comes into play 
>> > (how would we know whether it is valid, or if we do ALTER SUBSCRIPTION 
>> > SKIP should prohibit the invalid value from being chosen).
>> >
>>
>> What do you mean by invalid value here? Is it a value less than
>> FirstNormalTransactionId or a value belonging to a non-error transaction?
>> For the former, we already have a check in the patch, and for the latter we
>> can't identify it with any certainty because the error stats are collected
>> by the stats collector.
>
>
> The original proposal qualifies the non-zero transaction id in subskipxid as 
> being a "valid transaction ID" and that invalid ones (which is how 
> "otherwise" is interpreted given the "valid" qualification preceding it) are 
> shown as 0.  As an end-user that makes me wonder what it means for a 
> transaction ID to be invalid.  My point is that dropping the mention of 
> "valid transaction ID" avoids that and lets the reader operate with an 
> understanding that things should "just work".  If I see a non-zero in the 
> column I have a pending skip and if I see zero I do not.  My wording assumes 
> it is that simple.  If it isn't I would need some clarity as to why it is not 
> in order to write something I could read and understand from my inexperienced 
> user-centric point-of-view.
>
> I get that I may provide a transaction ID that is invalid in the sense that
> the system could never see it (or at least not for a long while) - say we
> error on transaction 102 and I typo it as 1002 or 101.  But I would expect
> either an error where I make the typo or the numbers 1002 or 101 to appear
> in the table.  I would not expect my 101 typo to result in a 0 appearing in
> the table (and if it does today, I argue that is a POLA violation).  Thus,
> "if a valid transaction ID" from the original text just doesn't make sense
> to me.
>
> In typical usage it would seem strange to allow a skip to be recorded if 
> there is no existing error in the subscription.  Should we (do we, haven't 
> read the code) warn in that situation?
>

Yeah, we will error out in that situation. The only invalid values are the
system-reserved values (1 and 2).

> Or, why even force them to specify a number instead of just saying SKIP: if
> there is a current error we skip its transaction; otherwise we warn them
> that nothing happened because there is no last error.
>

The idea is that we might extend this feature to skip specific operations on
relations, or maybe to accept other identifiers. One idea we discussed was to
automatically fetch the last error xid, but we decided that can be done as a
later patch.

> Additionally, the description for pg_stat_subscription_workers should 
> describe what happens once the transaction represented by last_error_xid has 
> either been successfully processed or skipped.  Does this "last error" stick 
> around until another error happens (which is hopefully very rare) or does it 
> reset to blanks?
>

It will be reset only when the subscription is dropped; otherwise, it will
stick around until another error happens.

>  Seems like it should reset, which really makes this more of an 
> "active_error" instead of a "last_error".  This system is linear, we are 
> stuck until this error is 

Re: Proposal: allow database-specific role memberships

2022-01-22 Thread Julien Rouhaud
Hi,

On Fri, Jan 21, 2022 at 07:01:21PM -0800, Kenaniah Cerny wrote:
> Thanks for the feedback.
> 
> I have attached an alternate version of the v5 patch that incorporates the
> suggested changes to the documentation and DRYs up some of the acl.c code
> for comparison. As for the databaseOid / InvalidOid parameter, I'm open to
> any suggestions for how to make that even cleaner, but am currently at a
> loss as to how that would look.
> 
> CI is showing a failure to run pg_dump on just the Linux - Debian Bullseye
> job (https://cirrus-ci.com/task/5265343722553344). Does anyone have any
> ideas as to where I should look in order to debug that?

Did you try to reproduce it on some GNU/Linux system?  FTR I did, and I get a
segfault in pg_dumpall:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x7f329e7e40cf in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x7f329e7987a2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x7f329e783449 in __GI_abort () at abort.c:79
#4  0x7f329e7d85d8 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f329e90b6aa "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#5  0x7f329e7edcfa in malloc_printerr (str=str@entry=0x7f329e9092c3 "free(): invalid pointer") at malloc.c:5536
#6  0x7f329e7ef504 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4327
#7  0x7f329e7f1f81 in __GI___libc_free (mem=<optimized out>) at malloc.c:3279
#8  0x7f329e7dbec5 in __GI__IO_free_backup_area (fp=fp@entry=0x561775f126c0) at genops.c:190
#9  0x7f329e7db6af in _IO_new_file_overflow (f=0x561775f126c0, ch=-1) at fileops.c:758
#10 0x7f329e7da7be in _IO_new_file_xsputn (n=2, data=<optimized out>, f=<optimized out>) at /usr/src/debug/sys-libs/glibc-2.34-r4/glibc-2.34/libio/libioP.h:947
#11 _IO_new_file_xsputn (f=0x561775f126c0, data=<optimized out>, n=2) at fileops.c:1197
#12 0x7f329e7cfd32 in __GI__IO_fwrite (buf=0x7ffc90bb0ac0, size=1, count=2, fp=0x561775f126c0) at /usr/src/debug/sys-libs/glibc-2.34-r4/glibc-2.34/libio/libioP.h:947
#13 0x56177483c758 in flushbuffer (target=0x7ffc90bb0a90) at snprintf.c:310
#14 0x56177483c4e8 in pg_vfprintf (stream=0x561775f126c0, fmt=0x561774840dec "\n\n", args=0x7ffc90bb0f00) at snprintf.c:259
#15 0x56177483c5ce in pg_fprintf (stream=0x561775f126c0, fmt=0x561774840dec "\n\n") at snprintf.c:270
#16 0x561774831893 in dumpRoleMembership (conn=0x561775f09600, databaseId=0x561775f152d2 "1") at pg_dumpall.c:991
#17 0x561774832426 in dumpDatabases (conn=0x561775f09600) at pg_dumpall.c:1332
#18 0x56177483049e in main (argc=3, argv=0x7ffc90bb1658) at pg_dumpall.c:596

I didn't look in detail, but:

@@ -1323,6 +1327,10 @@ dumpDatabases(PGconn *conn)
exit_nicely(1);
}

+   /* Dump database-specific roles if server is running 15.0 or later */
+   if (server_version >= 15)
+   dumpRoleMembership(conn, dbid);
+

Isn't that trying to print to OPF after the possible fclose(OPF) a bit
before, and before it's reopened?