Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-26 Thread Michael Paquier
On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote:
> Looks reasonable to me.

0001, to move pg_pwritev_with_retry() to a new home, seems fine, so
applied.

Regarding 0002, using pg_pwrite_zeros() as a routine name, as
suggested by Thomas, sounds good to me.  However, I am not really a
fan of its dependency with PGAlignedXLogBlock, because it should be
able to work with any buffers of any sizes, as long as the input
buffer is aligned, shouldn't it?  For example, what about
PGAlignedBlock?  So, should we make this more extensible?  My guess
would be the addition of the block size and the block pointer to the 
arguments of pg_pwrite_zeros(), in combination with a check to make
sure that the input buffer is MAXALIGN()'d (with an Assert() rather
than just an elog/pg_log_error?).
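For illustration, the reworked routine could take a shape like the
following.  This is a minimal standalone sketch of the idea, assuming a
POSIX pwrite() loop; the real routine would go through
pg_pwritev_with_retry(), and the names here are illustrative, not the
committed API:

#include <assert.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define MAXIMUM_ALIGNOF 8		/* stand-in for MAXALIGN()'s basis */

static ssize_t
pg_pwrite_zeros_sketch(int fd, size_t total, off_t offset,
					   char *zbuf, size_t zbuf_size)
{
	size_t		remaining = total;

	/* the caller-supplied buffer must be MAXALIGN()'d, per the proposal */
	assert(((uintptr_t) zbuf % MAXIMUM_ALIGNOF) == 0);
	memset(zbuf, 0, zbuf_size);

	while (remaining > 0)
	{
		size_t		thislen = remaining < zbuf_size ? remaining : zbuf_size;
		ssize_t		written = pwrite(fd, zbuf, thislen, offset);

		if (written < 0)
			return -1;			/* caller inspects errno */
		remaining -= (size_t) written;
		offset += written;
	}
	return (ssize_t) total;
}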
--
Michael




Re: Allow file inclusion in pg_hba and pg_ident files

2022-10-26 Thread Julien Rouhaud
Hi,

On Thu, Oct 27, 2022 at 12:08:31PM +0900, Michael Paquier wrote:
>
> Putting things afresh, there are two different things here (sorry I
> need to see that typed ;p):
> 1) How do we want to check reliably the loading of the HBA and ident
> files on errors?

I guess you meant the failure to load HBA / ident files containing invalid
data?

> EXEC_BACKEND would reload an entire new thing for
> each connection, hence we need some loops to go through that.
> 2) How to check the contents of pg_hba_file_rules and
> pg_ident_file_mappings?
>
> There is a dependency between 1) and 2) once we try to check for error
> patterns in pg_hba_file_rules, because connections would just not
> happen.  This is not the case for pg_ident_file_mappings though, so we
> could still test for buggy patterns in pg_ident.conf (or any of its
> included parts) with some expected content of
> pg_ident_file_mappings.error after a successful connection.
>
> Hmm.  And what if we just gave up on the checks for error patterns in
> pg_hba_file_rules?

We discussed this problem in the past [1], and my understanding was that
detecting a -DEXEC_BACKEND/Win32 build and skipping those tests in that case
would be an acceptable solution to make sure there's at least some coverage.
The proposed patch adds such an approach, making sure that the failure is due
to an invalid HBA file.  If you changed your mind I can remove that part,
but again I'd like to be sure of exactly what you want before starting to
rewrite stuff.

> One thing that we could do for this part is to
> include all the buggy patterns we want to check at once in pg_hba.conf
> in its included portions, then scan for all the logs produced after
> attempting to start a server as the loading of pg_hba.conf would
> produce one LOG line with its CONTEXT for each buggy entry.  The patch
> checks for error patterns with generate_log_err_rows(), but it looks
> like it would make the part 3 of the new test cleaner and easier to
> maintain in the long-term.

I'm not sure what you mean here.  The patch does check for all the errors
looking at LOG lines and CONTEXT lines, but to make the regexp easier it
doesn't try to make sure that each CONTEXT line is immediately following the
expected LOG line.

That's why the errors are divided into 2 steps: a first step with a single
error using some inclusion, so we can validate that the CONTEXT line is
entirely correct (wrt. line number and such), and then every possible error
pattern, where we assume that the CONTEXT lines still follow their LOG entry
if they're found.  The test also knows which errors add a CONTEXT line and
which don't.  And that's done twice, for HBA and ident.

The part 3 is just concatenating everything back, for HBA and ident.  So
long-term maintenance shouldn't get any harder as there won't be any need for
more steps.  We can just keep appending stuff in the 2nd step and all the tests
should run as expected.

[1] https://www.postgresql.org/message-id/YtZLeJPlMutL9heh%40paquier.xyz




Adding doubly linked list type which stores the number of items in the list

2022-10-26 Thread David Rowley
As part of the AIO work [1], there are quite a number of dlist_heads for
which a counter is used to keep track of how many items are in the
list.  We also have a few places in master which do the same thing.

In order to tidy this up and to help ensure that the count variable
does not get out of sync with the items which are stored in the list,
how about we introduce "dclist" which maintains the count for us?

I've attached a patch which does this.  The majority of the functions
for the new type are just wrappers around the equivalent dlist
function.

dclist provides all of the functionality that dlist does except
there's no dclist_delete() function.  dlist_delete() can be done by
just knowing the element to delete and not the list that the element
belongs to.  With dclist, that's not possible as we must also subtract
1 from the count variable and obviously we need the dclist_head for
that.
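To make the shape concrete, here is a sketch of the wrapper idea distilled
from the description above (the actual patch provides many more wrappers;
names follow its conventions):

#include "postgres.h"
#include "lib/ilist.h"

/* a counted dlist: the underlying list plus the number of items stored */
typedef struct dclist_head
{
	dlist_head	dlist;		/* the wrapped doubly linked list */
	uint32		count;		/* # of items currently in the list */
} dclist_head;

static inline void
dclist_push_tail(dclist_head *head, dlist_node *node)
{
	dlist_push_tail(&head->dlist, node);
	head->count++;
}

/* deletion must name the list so the counter can stay in sync */
static inline void
dclist_delete_from(dclist_head *head, dlist_node *node)
{
	Assert(head->count > 0);
	dlist_delete(node);
	head->count--;
}

static inline uint32
dclist_count(const dclist_head *head)
{
	return head->count;
}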

I'll add this to the November commitfest.

David

[1] 
https://www.postgresql.org/message-id/flat/20210223100344.llw5an2akleng...@alap3.anarazel.de
From a24babfe52c7ab946f192f4872dc480ce18c8c0f Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Thu, 27 Oct 2022 16:01:34 +1300
Subject: [PATCH v1] Add doubly linked count list implementation

We have various requirements when using a dlist_head to keep track of the
number of items in the list.  This, traditionally, has been done by
maintaining a counter variable in the calling code.  Here we tidy this up
by adding "dclist", which is very similar to dlists but also keeps track
of the number of items stored in the list.

Callers may use the new dclist_count() function when they need to know how
many items are stored. Obtaining the count is an O(1) operation.

For simplicity reasons, dclists and dlists both use dlist_node as their
node type and dlist_iter/dlist_mutable_iter as their iterator type.
dclists have all of the same functionality as dlists except there is no
function named dclist_delete().  To remove an item from a list
dclist_delete_from() must be used.  This requires knowing which dclist the
given item is stored in.

Additionally, here we also convert some dlists which keep track of the
number of items stored to make them use dclists instead.
---
 .../replication/logical/reorderbuffer.c   |  58 ++--
 src/backend/replication/logical/snapbuild.c   |   2 +-
 src/backend/utils/activity/pgstat_xact.c  |  34 +-
 src/backend/utils/adt/ri_triggers.c   |  13 +-
 src/include/lib/ilist.h   | 313 +-
 src/include/replication/reorderbuffer.h   |   6 +-
 src/include/utils/pgstat_internal.h   |   3 +-
 src/tools/pgindent/typedefs.list  |   1 +
 8 files changed, 351 insertions(+), 79 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c 
b/src/backend/replication/logical/reorderbuffer.c
index c29894041e..603c861461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,8 +349,6 @@ ReorderBufferAllocate(void)
buffer->by_txn_last_xid = InvalidTransactionId;
buffer->by_txn_last_txn = NULL;
 
-   buffer->catchange_ntxns = 0;
-
buffer->outbuf = NULL;
buffer->outbufsize = 0;
buffer->size = 0;
@@ -368,7 +366,7 @@ ReorderBufferAllocate(void)
 
	dlist_init(&buffer->toplevel_by_lsn);
	dlist_init(&buffer->txns_by_base_snapshot_lsn);
-   dlist_init(&buffer->catchange_txns);
+   dclist_init(&buffer->catchange_txns);
 
/*
 * Ensure there's no stale data from prior uses of this slot, in case 
some
@@ -413,7 +411,7 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 
	dlist_init(&txn->changes);
	dlist_init(&txn->tuplecids);
-   dlist_init(&txn->subtxns);
+   dclist_init(&txn->subtxns);
 
/* InvalidCommandId is not zero, so set it explicitly */
txn->command_id = InvalidCommandId;
@@ -1066,14 +1064,13 @@ ReorderBufferAssignChild(ReorderBuffer *rb, 
TransactionId xid,
 
subtxn->txn_flags |= RBTXN_IS_SUBXACT;
subtxn->toplevel_xid = xid;
-   Assert(subtxn->nsubtxns == 0);
+   Assert(dclist_count(&subtxn->subtxns) == 0);
 
/* set the reference to top-level transaction */
subtxn->toptxn = txn;
 
/* add to subtransaction list */
-   dlist_push_tail(&txn->subtxns, &subtxn->node);
-   txn->nsubtxns++;
+   dclist_push_tail(&txn->subtxns, &subtxn->node);
 
/* Possibly transfer the subtxn's snapshot to its top-level txn. */
ReorderBufferTransferSnapToParent(txn, subtxn);
@@ -1241,7 +1238,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, 
ReorderBufferTXN *txn,
if (txn->nentries > 0)
nr_txns++;
 
-   dlist_foreach(cur_txn_i, &txn->subtxns)
+   dclist_foreach(cur_txn_i, &txn->subtxns)
{
ReorderBufferTXN *cur_txn;
 
@@ -1308,7 +1305,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, 
ReorderBufferTXN *txn,
}
 
/* add subtransactions if they contain changes */
-   dlist_foreach(cur_txn_i, 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread John Naylor
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada 
wrote:
>
> True. I'm going to start with 6 bytes and will consider reducing it to
> 5 bytes.

Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As
discussed, I will attempt the size class decoupling after v8 and see how it
goes.

> Encoding the kind in a pointer tag could be tricky given DSA

If it turns out to be unworkable, that's life. If it's just tricky, that
can certainly be put off for future work. I hope to at least test it out
with local memory.

> support so currently I'm thinking to pack the node kind and node
> capacity classes to uint8.

That won't work if we need 128 for capacity, as that leaves no bits left. I want
the capacity to be a number we can directly compare with the count (we
won't ever need to store 256 because that node will never grow). Also,
further to my last message, we need to access the kind quickly, without
more cycles.
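For illustration only, the base-type layout argued for here might look like
the following; the field widths are my assumptions, not the patch's actual
definition:

typedef struct rt_node
{
	uint8		kind;		/* node type, read directly, no unpacking */
	uint8		capacity;	/* fanout of this size class (<= 128) */
	uint16		count;		/* number of children currently stored */
} rt_node;

The point is that capacity is a plain number directly comparable with
count, and extracting kind costs no extra cycles.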

> I've made some progress on investigating DSA support. I've written
> draft patch for that and regression tests passed. I'll share it as a
> separate patch for discussion with v8 radix tree patch.

Great!

> While implementing DSA support, I realized that we may not need to use
> pointer tagging to distinguish between backend-local address or
> dsa_pointer. In order to get a backend-local address from dsa_pointer,
> we need to pass dsa_area like:

I was not clear -- when I see how much code changes to accommodate DSA
pointers, I imagine I will pretty much know the places that would be
affected by tagging the pointer with the node kind.

Speaking of tests, there is currently no Meson support, but tests pass
because this library is not used anywhere in the backend yet, and
apparently the CI Meson builds don't know to run the regression test? That
will need to be done too. However, it's okay to keep the benchmarking
module in autoconf, since it won't be committed.

> > +static inline void
> > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > + uint8 *dst_chunks, rt_node **dst_children, int count)
> > +{
> > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > +}
> >
> > gcc generates better code with something like this (but not hard-coded)
at the top:
> >
> > if (count > 4)
> > pg_unreachable();
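Putting the quoted pieces together, the suggested hint would look roughly
like this sketch (the bound of 4 stands in for the node's capacity and, as
noted, would not be hard-coded):

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
						  uint8 *dst_chunks, rt_node **dst_children,
						  int count)
{
	/* tell the compiler the copies are small and bounded */
	if (count > 4)
		pg_unreachable();

	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}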

Actually it just now occurred to me there's a bigger issue here: *We* know
this code can only get here iff count==4, so why doesn't the compiler know
that? I believe it boils down to

static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {

In the assembly, I see it checks if there is room in the node by doing a
runtime lookup in this array, which is not constant. This might not be
important just yet, because I want to base the check on the proposed node
capacity instead, but I mention it as a reminder to us to make sure we take
all opportunities for the compiler to propagate constants.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Allow file inclusion in pg_hba and pg_ident files

2022-10-26 Thread Michael Paquier
On Wed, Oct 26, 2022 at 11:32:14PM +0800, Julien Rouhaud wrote:
> Have you already done a rebase while working on the patch or are you intending
to take care of it, or should I?  Let's not both do the work :)

Spoiler alert: I have not done a rebase yet ;)
--
Michael


signature.asc
Description: PGP signature


Re: Allow file inclusion in pg_hba and pg_ident files

2022-10-26 Thread Michael Paquier
On Wed, Oct 26, 2022 at 11:32:14PM +0800, Julien Rouhaud wrote:
> I don't mind taking care of that, but before doing so I'd like to have some
> feedback on whether you're ok with my approach (per my initial email about it
> at [1]) or if you had some different
> ideas on how to do it.

Putting things afresh, there are two different things here (sorry I
need to see that typed ;p):
1) How do we want to check reliably the loading of the HBA and ident
files on errors?  EXEC_BACKEND would reload an entire new thing for
each connection, hence we need some loops to go through that.
2) How to check the contents of pg_hba_file_rules and
pg_ident_file_mappings?

There is a dependency between 1) and 2) once we try to check for error
patterns in pg_hba_file_rules, because connections would just not
happen.  This is not the case for pg_ident_file_mappings though, so we
could still test for buggy patterns in pg_ident.conf (or any of its 
included parts) with some expected content of
pg_ident_file_mappings.error after a successful connection.

Hmm.  And what if we just gave up on the checks for error patterns in
pg_hba_file_rules?  One thing that we could do for this part is to
include all the buggy patterns we want to check at once in pg_hba.conf
in its included portions, then scan for all the logs produced after
attempting to start a server as the loading of pg_hba.conf would
produce one LOG line with its CONTEXT for each buggy entry.  The patch
checks for error patterns with generate_log_err_rows(), but it looks
like it would make the part 3 of the new test cleaner and easier to
maintain in the long-term.
--
Michael


signature.asc
Description: PGP signature


Re: GUC values - recommended way to declare the C variables?

2022-10-26 Thread Michael Paquier
On Wed, Oct 26, 2022 at 09:49:34PM -0500, Justin Pryzby wrote:
> In v4, Peter posted a 2-patch series with my patch as 001.
> But I pointed out that it's better to fix the initialization of the
> compile-time GUCs rather than exclude them from the check.
> Then Peter submitted v5 which does that, and isn't built on top of my
> patch.

Okidoki, thanks for the clarification.
--
Michael


signature.asc
Description: PGP signature


Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant

2022-10-26 Thread Richard Guo
On Wed, Oct 26, 2022 at 4:25 PM David Rowley  wrote:

> One other thought I had about the duplicate "Limit" node in the final
> plan was that we could make the limit clause an Expr like
> LEAST(, 1).  That way we could ensure we get at
> most 1 row, but perhaps less if the expression given in the LIMIT
> clause evaluated to 0. This will still work correctly when the
> existing limit evaluates to NULL. I'm still just not that keen on this
> idea as it means still having to either edit the parse's limitCount or
> store the limit details in a new field in PlannerInfo and use that
> when making the final LimitPath. However, I'm still not sure doing
> this is worth the extra complexity.


I find the duplicate "Limit" node less concerning after realizing that
it may appear in other queries, such as

explain (analyze, timing off, costs off)
select * from (select * from (select * from generate_series(1,100)i limit
10) limit 5) limit 1;
  QUERY PLAN
--
 Limit (actual rows=1 loops=1)
   ->  Limit (actual rows=1 loops=1)
 ->  Limit (actual rows=1 loops=1)
   ->  Function Scan on generate_series i (actual rows=1
loops=1)

The situation is different in that the Limit node is actually atop a
SubqueryScan which is removed afterwards, but the final plan still
appears as a Limit node atop another Limit node.

So I wonder maybe we can just live with it, or resolve it in a separate
patch.

Thanks
Richard


Re: GUC values - recommended way to declare the C variables?

2022-10-26 Thread Peter Smith
On Thu, Oct 27, 2022 at 1:33 PM Michael Paquier  wrote:
>
> On Wed, Oct 26, 2022 at 09:14:37PM -0500, Justin Pryzby wrote:
> > It seems like you're reviewing the previous version of the patch, rather
> > than the one attached to the message you responded to (which doesn't
> > have anything to do with GUC_DEFAULT_COMPILE).
>
> It does not seem so as things stand, I have been looking at
> v5-0001-GUC-C-variable-sanity-check.patch as posted by Peter here:
> https://www.postgresql.org/message-id/cahut+puchjyxitgdtovhvdnjpbivllr49gwvs+8vwnfom4h...@mail.gmail.com
>
> In combination with a two-patch set as posted by you here:
> 0001-add-DYNAMIC_DEFAULT-for-settings-which-vary-by-.-con.patch
> 0002-WIP-test-guc-default-values.patch
> https://www.postgresql.org/message-id/20221024220544.gj16...@telsasoft.com
>
> These are the latest patch versions posted on their respective thread
> I am aware of, and based on the latest updates of each thread it still
> looked like there was a dependency between both.  So, is that the case
> or not?  If not, sorry if I misunderstood things.

No. My v5 is no longer dependent on the other patch.

>
> > I don't know what you meant by "make the CF bot happy" (?)
>
> It is in my opinion confusing to see that the v5 posted on this
> thread, which was marked as ready for committer as of
> https://commitfest.postgresql.org/40/3934/, seems to rely on a facility
> that it makes no use of.  Hence it looks to me that this patch has
> been posted as-is to allow the CF bot to pass (I would have posted
> that as an isolated two-patch set with the first patch introducing the
> flag if need be).

Yeah, my v4 was posted along with the other GUC flag patch as a
prerequisite to make the cfbot happy. This is no longer the case - v5
is a single independent patch. Sorry for the "ready for committer"
status being confusing. At that time I thought it was.

>
> Anyway, per my previous comments in my last message of this thread as
> of https://www.postgresql.org/message-id/y1nnwftrnl3it...@paquier.xyz,
> I don't see a need for DYNAMIC_DEFAULT from the other thread, nor do I
> see a need for a style like that:
> +/* GUC variable */
> +bool   update_process_title =
> +#ifdef WIN32
> +   false;
> +#else
> +   true;
> +#endif
>
> I think that it would be cleaner to use the same approach as
> checkpoint_flush_after and similar GUCs with a centralized definition,
> rather than spreading such style in two places for each GUC that this
> patch touches (aka its declaration and its default value in
> guc_tables.c).  In any case, the patch of this thread still needs some
> adjustments IMO.

OK, I can make that adjustment if it is preferred. I think it is the
same as what I already suggested a while ago [1] ("But probably it is
no problem to just add #defines...")

--
[1] https://www.postgresql.org/message-id/1113448.1665717297%40sss.pgh.pa.us

Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: GUC values - recommended way to declare the C variables?

2022-10-26 Thread Justin Pryzby
On Thu, Oct 27, 2022 at 11:33:48AM +0900, Michael Paquier wrote:
> On Wed, Oct 26, 2022 at 09:14:37PM -0500, Justin Pryzby wrote:
> > It seems like you're reviewing the previous version of the patch, rather
> > than the one attached to the message you responded to (which doesn't
> > have anything to do with GUC_DEFAULT_COMPILE).
> 
> It does not seem so as things stand, I have been looking at
> v5-0001-GUC-C-variable-sanity-check.patch as posted by Peter here:
> https://www.postgresql.org/message-id/cahut+puchjyxitgdtovhvdnjpbivllr49gwvs+8vwnfom4h...@mail.gmail.com

This thread is about consistency of the global variables with what's set
by the GUC infrastructure.

In v4, Peter posted a 2-patch series with my patch as 001.
But I pointed out that it's better to fix the initialization of the
compile-time GUCs rather than exclude them from the check.
Then Peter submitted v5 which does that, and isn't built on top of my
patch.

> In combination with a two-patch set as posted by you here:
> 0001-add-DYNAMIC_DEFAULT-for-settings-which-vary-by-.-con.patch
> 0002-WIP-test-guc-default-values.patch
> https://www.postgresql.org/message-id/20221024220544.gj16...@telsasoft.com

That's a separate thread regarding consistency of the default values
(annotations) shown in postgresql.conf.  (I'm not sure whether or not my
patch adding GUC flags is an agreed way forward, although they might
turn out to be useful for other purposes).

-- 
Justin




Re: generic plans and "initial" pruning

2022-10-26 Thread Amit Langote
On Mon, Oct 17, 2022 at 6:29 PM Amit Langote  wrote:
> On Wed, Oct 12, 2022 at 4:36 PM Amit Langote  wrote:
> > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote  
> > wrote:
> > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas  wrote:
> > > > 0001 adds es_part_prune_result but does not use it, so maybe the
> > > > introduction of that field should be deferred until it's needed for
> > > > something.
> > >
> > > Oops, looks like a mistake when breaking the patch.  Will move that bit 
> > > to 0002.
> >
> > Fixed that and also noticed that I had defined PartitionPruneResult in
> > the wrong header (execnodes.h).  That led to PartitionPruneResult
> > nodes not being able to be written and read, because
> > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read*
> > routines for the nodes defined in execnodes.h.  I moved its definition
> > to plannodes.h, even though it is not actually the planner that
> > instantiates those; no other include/nodes header sounds better.
> >
> > One more thing I realized is that Bitmapsets added to the List
> > PartitionPruneResult.valid_subplan_offs_list are not actually
> > read/write-able.  That's a problem that I also faced in [1], so I
> > proposed a patch there to make Bitmapset a read/write-able Node and
> > mark (only) the Bitmapsets that are added into read/write-able node
> > trees with the corresponding NodeTag.  I'm including that patch here
> > as well (0002) for the main patch to work (pass
> > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense
> > to discuss it in its own thread?
>
> Had second thoughts on the use of List of Bitmapsets for this, such
> that the make-Bitmapset-Nodes patch is no longer needed.
>
> I had defined PartitionPruneResult such that it stood for the results
> of pruning for all PartitionPruneInfos contained in
> PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that
> can use partition pruning in a given plan).  So, it had a List of
> Bitmapset.  I think it's perhaps better for PartitionPruneResult to
> cover only one PartitionPruneInfo and thus need only a Bitmapset and
> not a List thereof, which I have implemented in the attached updated
> patch 0002.  So, instead of needing to pass around a
> PartitionPruneResult with each PlannedStmt, this now passes a List of
> PartitionPruneResult with an entry for each in
> PlannedStmt.partPruneInfos.
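(For reference, a rough sketch of the reworked node as described above; the
field name is illustrative, not necessarily the patch's:

typedef struct PartitionPruneResult
{
	NodeTag		type;
	Bitmapset  *valid_subplan_offs; /* offsets of surviving subplans */
} PartitionPruneResult;

One such node corresponds to one entry in PlannedStmt.partPruneInfos.)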

Rebased over 3b2db22fe.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v23-0001-Move-PartitioPruneInfo-out-of-plan-nodes-into-Pl.patch
Description: Binary data


v23-0002-Optimize-AcquireExecutorLocks-by-locking-only-un.patch
Description: Binary data


RE: Perform streaming logical transactions by background workers and parallel apply

2022-10-26 Thread shiy.f...@fujitsu.com
On Wed, Oct 26, 2022 7:19 PM Amit Kapila  wrote:
> 
> On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada
>  wrote:
> >
> > On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com
> >  wrote:
> >
> > I've started to review this patch. I tested v40-0001 patch and have
> > one question:
> >
> > IIUC even when most of the changes in the transaction are filtered out
> > in pgoutput (eg., by relation filter or row filter), the walsender
> > sends STREAM_START. This means that the subscriber could end up
> > launching parallel apply workers also for almost empty (and streamed)
> > transactions. For example, I created three subscriptions each of which
> > subscribes to a different table. When I loaded a large amount of data
> > into one table, all three (leader) apply workers received START_STREAM
> > and launched their parallel apply workers.
> >
> 
> The apply workers will be launched just the first time; after that we
> maintain a pool so that we don't need to restart them.
> 
> > However, two of them
> > finished without applying any data. I think this behaviour looks
> > problematic since it wastes workers and rather decreases the apply
> > performance if the changes are not large. Is it worth considering a
> > way to delay launching a parallel apply worker until we find out the
> > amount of changes is actually large?
> >
> 
> I think even if changes are less there may not be much difference
> because we have observed that the performance improvement comes from
> not writing to file.
> 
> > For example, the leader worker
> > writes the streamed changes to files as usual and launches a parallel
> > worker if the amount of changes exceeds a threshold or the leader
> > receives the second segment. After that, the leader worker switches to
> > send the streamed changes to parallel workers via shm_mq instead of
> > files.
> >
> 
> I think writing to a file won't be a good idea as that can hamper the
> performance benefit in some cases, and I'm not sure it is worth it.
> 

I tried to test some cases where only a small part of the transaction, or an
empty transaction, is sent to the subscriber, to see if using streaming
parallel apply brings any performance degradation.

The test was performed ten times, and the average was taken.
The results are as follows. The details and the script of the test are attached.

10% of rows are sent
--------------------
HEAD    24.4595
patched 18.4545

5% of rows are sent
-------------------
HEAD    21.244
patched 17.9655

0% of rows are sent
-------------------
HEAD    18.0605
patched 17.893


It shows that when only 5% or 10% of rows are sent to the subscriber, using
parallel apply takes less time than HEAD, and even if all rows are filtered
out there's no performance degradation.


Regards
Shi yu


Re: GUC values - recommended way to declare the C variables?

2022-10-26 Thread Michael Paquier
On Wed, Oct 26, 2022 at 09:14:37PM -0500, Justin Pryzby wrote:
> It seems like you're reviewing the previous version of the patch, rather
> than the one attached to the message you responded to (which doesn't
> have anything to do with GUC_DEFAULT_COMPILE).

It does not seem so as things stand, I have been looking at
v5-0001-GUC-C-variable-sanity-check.patch as posted by Peter here:
https://www.postgresql.org/message-id/cahut+puchjyxitgdtovhvdnjpbivllr49gwvs+8vwnfom4h...@mail.gmail.com

In combination with a two-patch set as posted by you here:
0001-add-DYNAMIC_DEFAULT-for-settings-which-vary-by-.-con.patch
0002-WIP-test-guc-default-values.patch
https://www.postgresql.org/message-id/20221024220544.gj16...@telsasoft.com

These are the latest patch versions posted on their respective thread
I am aware of, and based on the latest updates of each thread it still
looked like there was a dependency between both.  So, is that the case
or not?  If not, sorry if I misunderstood things.

> I don't know what you meant by "make the CF bot happy" (?)

It is in my opinion confusing to see that the v5 posted on this
thread, which was marked as ready for committer as of
https://commitfest.postgresql.org/40/3934/, seems to rely on a facility
that it makes no use of.  Hence it looks to me that this patch has
been posted as-is to allow the CF bot to pass (I would have posted
that as an isolated two-patch set with the first patch introducing the
flag if need be).

Anyway, per my previous comments in my last message of this thread as
of https://www.postgresql.org/message-id/y1nnwftrnl3it...@paquier.xyz,
I don't see a need for DYNAMIC_DEFAULT from the other thread, nor do I
see a need for a style like that:
+/* GUC variable */
+bool   update_process_title =
+#ifdef WIN32
+   false;
+#else
+   true;
+#endif  

I think that it would be cleaner to use the same approach as
checkpoint_flush_after and similar GUCs with a centralized definition,
rather than spreading such style in two places for each GUC that this
patch touches (aka its declaration and its default value in
guc_tables.c).  In any case, the patch of this thread still needs some
adjustments IMO.
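To sketch the centralized style being referred to, using
update_process_title as the example (the macro name is illustrative):

/* in the header that owns the variable */
#ifdef WIN32
#define DEFAULT_UPDATE_PROCESS_TITLE false
#else
#define DEFAULT_UPDATE_PROCESS_TITLE true
#endif

/* at the declaration site */
bool		update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;

/* ... and guc_tables.c uses DEFAULT_UPDATE_PROCESS_TITLE as the boot
 * value, so the default lives in exactly one place. */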
--
Michael




Re: GUC values - recommended way to declare the C variables?

2022-10-26 Thread Justin Pryzby
On Thu, Oct 27, 2022 at 11:06:56AM +0900, Michael Paquier wrote:
> On Wed, Oct 26, 2022 at 06:31:56PM +1100, Peter Smith wrote:
> > I re-checked all the GUC C vars which your patch flags as
> > GUC_DEFAULT_COMPILE. For some of them, where it was not any trouble, I
> > made the C var assignment use the same preprocessor rules as used by
> > guc_tables. For others (mostly the string ones) I left the GUC C var
> > untouched because the sanity checker function already has a rule not
> > to complain about int GUC C vars which are 0 or string GUC vars which
> > are NULL.
> 
> I see.  So you have on this thread an independent patch to make the CF
> bot happy, which still depends on the patch posted in [1] to bypass the
> changes for variables whose boot values are compilation-dependent.

It seems like you're reviewing the previous version of the patch, rather
than the one attached to the message you responded to (which doesn't
have anything to do with GUC_DEFAULT_COMPILE).

I don't know what you meant by "make the CF bot happy" (?)

-- 
Justin




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread Masahiko Sawada
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
 wrote:
>
> On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada  
> wrote:
>
> > I've attached updated PoC patches for discussion and cfbot. From the
> > previous version, I mainly changed the following things:
> >

Thank you for the comments!

> > * Separate treatment of inner and leaf nodes
>
> Overall, this looks much better!
>
> > * Pack both the node kind and node count to an uint16 value.
>
> For this, I did mention a bitfield earlier as something we "could" do, but it 
> wasn't clear we should. After looking again at the node types, I must not 
> have thought through this at all. Storing one byte instead of four for the 
> full enum is a good step, but saving one more byte usually doesn't buy 
> anything because of padding, with a few exceptions like this example:
>
> node4:   4 +  4   +  4*8 =   40
> node4:   5 +  4+(7)   +  4*8 =   48 bytes
>
> Even there, I'd rather not spend the extra cycles to access the members. And 
> with my idea of decoupling size classes from kind, the variable-sized kinds 
> will require another byte to store "capacity". Then, even if the kind gets 
> encoded in a pointer tag, we'll still have 5 bytes in the base type. So I 
> think we should assume 5 bytes from the start. (Might be 6 temporarily if I 
> work on size decoupling first).

True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support so currently I'm thinking to pack the node kind and node
capacity classes to uint8.

>
> (Side note, if you have occasion to use bitfields again in the future, C99 
> has syntactic support for them, so no need to write your own shifting/masking 
> code).

Thanks!

>
> > I've not done SIMD part seriously yet. But overall the performance
> > seems good so far. If we agree with the current approach, I think we
> > can proceed with the verification of decoupling node sizes from node
> > kind. And I'll investigate DSA support.
>
> Sounds good. I have some additional comments about v7, and after these are 
> addressed, we can proceed independently with the above two items. Seeing the 
> DSA work will also inform me how invasive pointer tagging will be. There will 
> still be some performance tuning and cosmetic work, but it's getting closer.
>

I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.

While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:

node = dsa_get_address(tree->dsa, node_dp);

As shown above, the dsa area used by the shared radix tree is stored
in radix_tree struct, so we can know whether the radix tree is shared
or not by checking (tree->dsa == NULL). That is, if it's shared we use
a pointer to radix tree node as dsa_pointer, and if not we use a
pointer as a backend-local pointer. We don't need to encode something
in a pointer.
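A sketch of that resolution step, with illustrative names: the tree
remembers its dsa_area, so a stored node pointer can be resolved without
any tag bits:

static inline rt_node *
rt_pointer_to_node(radix_tree *tree, uintptr_t stored)
{
	if (tree->dsa != NULL)		/* shared tree: value is a dsa_pointer */
		return (rt_node *) dsa_get_address(tree->dsa, (dsa_pointer) stored);

	return (rt_node *) stored;	/* local tree: plain pointer */
}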

> -
> 0001:
>
> +#ifndef USE_NO_SIMD
> +#include "port/pg_bitutils.h"
> +#endif
>
> Leftover from an earlier version?
>
> +static inline int vector8_find(const Vector8 v, const uint8 c);
> +static inline int vector8_find_ge(const Vector8 v, const uint8 c);
>
> Leftovers, causing compiler warnings. (Also see new variable shadow warning)

Will fix.

>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
> +
> + return r;
> +#endif
>
> As I mentioned a couple versions ago, this style is really awkward, and 
> potential non-SIMD callers will be better off writing their own byte-wise 
> loop rather than using this API. Especially since the "min" function exists 
> only as a workaround for lack of unsigned comparison in (at least) SSE2. 
> There is one existing function in this file with that idiom for non-assert 
> code (for completeness), but even there, inputs of current interest to us use 
> the uint64 algorithm.

Agreed. Will remove non-SIMD code.

>
> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << 
> sizeof(Vector8));
>
> Hmm?

It's my outdated memo, will remove.

>
> +/*
> + * Return index of the first element in chunks in the given node that is 
> greater
> + * than or equal to 'key'.  Return -1 if there is no such element.
> + */
> +static inline int
> +node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
>
> The caller must now have logic for inserting at the end:
>
> + int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
> + int16 count = NODE_GET_COUNT(n32);
> +
> + if (insertpos < 0)
> + insertpos = count; /* insert to 

Re: GUC values - recommended way to declare the C variables?

2022-10-26 Thread Michael Paquier
On Wed, Oct 26, 2022 at 06:31:56PM +1100, Peter Smith wrote:
> I re-checked all the GUC C vars which your patch flags as
> GUC_DEFAULT_COMPILE. For some of them, where it was not any trouble, I
> made the C var assignment use the same preprocessor rules as used by
> guc_tables. For others (mostly the string ones) I left the GUC C var
> untouched because the sanity checker function already has a rule not
> to complain about int GUC C vars which are 0 or string GUC vars which
> are NULL.

I see.  So you have on this thread an independent patch to make the CF
bot happy, which still depends on the patch posted in [1] to bypass the
changes for variables whose boot values are compilation-dependent.

Is it right to believe that the only requirement here is
GUC_DEFAULT_COMPILE but not GUC_DEFAULT_INITDB?  The former is much
more intuitive than the latter.  Still, I see an inconsistency in
what you are doing here.

sanity_check_GUC_C_var() would need to skip all the GUCs marked as
GUC_DEFAULT_COMPILE, meaning that one could still be "fooled by a
mismatched value" in these cases.  We are talking about a limited set
of them, but it seems to me that we have no need for this flag at all
once the default GUC values are set with a #define'd value, no?
checkpoint_flush_after, bgwriter_flush_after, port and
effective_io_concurrency do that, which is why
v5-0001-GUC-C-variable-sanity-check.patch does its stuff only for
maintenance_io_concurrency, update_process_title, assert_enabled and
syslog_facility.  I think that it would be simpler to have a default
for these last four with a centralized definition, meaning that we
would not need a GUC_DEFAULT_COMPILE at all, while the validation
could be done for basically all the GUCs with default values
assigned.  In short, this patch has no need to depend on what's posted
in [1]. 

[1]: https://www.postgresql.org/message-id/20221024220544.gj16...@telsasoft.com
--
Michael




Re: Some regression tests for the pg_control_*() functions

2022-10-26 Thread Michael Paquier
On Wed, Oct 26, 2022 at 01:41:12PM +0530, Bharath Rupireddy wrote:
> We will have bigger problems when a backend corrupts the pg_control
> file, no? The bigger problems could be that the server won't come up
> or it behaves abnormally or some other.

Possibly, yes.

> Can't the CRC check detect any of the above corruptions? Do we have
> any evidence of backend corrupting the pg_control file or any of the
> above variables while running regression tests?

It could be possible that the backend writes an incorrect data
combination through its APIs, where the CRC is correct but the data is
not (say a TLI of 0, as one example).

> If the concern is backend corrupting the pg_control file and CRC check
> can't detect it, then the extra checks (as proposed in the patch) must
> be placed within the core (perhaps before writing/after reading the
> pg_control file), not in regression tests for sure.

Well, that depends on the level of protection you want.  Now there are
things in place already when it comes to recovery or at startup.
Anyway, the recent experience with the 56-bit relfilenode thread is
really that we don't check the execution of these functions at all,
and that's the actual minimal requirement, so I have applied a patch
based on count(*) > 0 for now to cover that.  I am not sure if any of
the checks for the control file fields are valuable, perhaps some
are..
--
Michael




Re: confused with name in the pic

2022-10-26 Thread David G. Johnston
On Wed, Oct 26, 2022 at 2:13 AM jack...@gmail.com  wrote:

> typedef struct A_Expr
> {
>     pg_node_attr(custom_read_write)
>
>     NodeTag     type;
>     A_Expr_Kind kind;       /* see above */
>     List       *name;       /* possibly-qualified name of operator */
>     Node       *lexpr;      /* left argument, or NULL if none */
>     Node       *rexpr;      /* right argument, or NULL if none */
>     int         location;   /* token location, or -1 if unknown */
> } A_Expr;
>
> I run a sql like select a,b from t3 where a > 1; and I get the parseTree
> for selectStmt:
>
> [image: custom visualization of the parse tree]
>
> why is the name '-' but not '>'?

Given the general lack of interest in this so far I'd suggest you put
together a minimal test case that includes a simple print-to-log command
patched into HEAD showing the problematic parse tree in its internal form.

Posting an image of a custom visualization (apparently produced by custom
code that may itself be buggy) against an unspecified server, with no
indication of how any of this works, doesn't give enough detail, and there
is little reason to think that such an obvious bug could exist.  I do agree
that your expectation seems quite sound, though I do not have the faintest
idea how to go about reproducing your result even in the minimal way
described above (I know enough to know it is possible, just not where to
patch).
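A minimal version of that print-to-log idea, assuming it is patched into
exec_simple_query() right after parsing (placement and variable name are
illustrative):

elog(LOG, "raw parse tree: %s", nodeToString(parsetree_list));

That would show the raw A_Expr contents server-side, with no custom
visualization in the way.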

David J.


Re: Documentation for building with meson

2022-10-26 Thread samay sharma
Hi,

On Wed, Oct 19, 2022 at 7:43 PM Justin Pryzby  wrote:

> On Wed, Oct 19, 2022 at 11:35:10AM -0700, samay sharma wrote:
> > Creating a new thread focussed on adding docs for building Postgres with
> > meson. This is a spinoff from the original thread [1] and I've attempted
> to
> > address all the feedback provided there in the attached patch.
> >
> > Please let me know your thoughts.
>
> It's easier to review rendered documentation.
> I made a rendered copy available here:
>
> https://api.cirrus-ci.com/v1/artifact/task/6084297781149696/html_docs/html_docs/install-meson.html


Thanks for your review. Attached v2 of the patch here.


>
>
> +  Flex and Bison
> +  are needed to build PostgreSQL using
> +  meson. Be sure to get
> +  Flex 2.5.31 or later and
> +  Bison 1.875 or later from your package
> manager.
> +  Other lex and
> yacc
> +  programs cannot be used.
>
> These versions need to be updated, see also: 57bab3330:
>   - b086a47a270 "Bump minimum version of Bison to 2.3"
>   - 8b878bffa8d "Bump minimum version of Flex to 2.5.35"
>

Changed


>
> + will be enabled automatically if the required packages are found.
>
> should refer to files/libraries/headers/dependencies rather than
> "packages" ?
>

Changed to dependencies


>
> + default is false that is to use
> Readline.
>
> "that is to use" should be parenthesized or separate with commas, like
> | default is false, that is to use Readline.
>
> zlib is mentioned twice, the first being "strongly recommended".
> Is that intended?  Also, basebackup can use zlib, too.
>

Yes, the first is in the requirements section where we just list packages
required / recommended. The other mention is in the list of configure
options. This is similar to how the documentation looks today for make /
autoconf. Added pg_basebackup as a use case too.


>
> + If you have the binaries for certain by programs required to
> build
>
> remove "by" ?
>

Done


>
> + Postgres (with or without optional flags) stored at non-standard
> + paths, you could specify them manually to meson configure. The
> complete
> + list of programs for whom this is supported can be found by
> running
>
> for *which
>
> Actually, I suggest to say:
> |If a program required to build Postgres (with or without optional flags)
> |is stored in a non-standard path, ...
>

Liked this framing better. Changed.

>
> + a build with a different value of these options.
>
> .. with different values ..
>

Done

>
> + the server, it is recommended to use atleast the
> --buildtype=debug
>
> at least
>
Done

>
> +and it's options in the meson documentation.
>
> its
>
Done

>
> Maybe other things should have  ?
>
>  Git
>  Libxml2
>  libxslt
>  visual studio
>  DTrace
>  ninja
>
> +  Flex and Bison
>
> Maybe these should use  ?
>

I'm unsure of the right protocol for this. I tried to follow the precedent
set in the make / autoconf part of the documentation, which uses
 at certain places and  at others. Is there a
reference or guidance on which to use where or is it mostly a case by case
decision?


> +  be installed with pip.
>
> Should be  ?
>

Changed.

>
> This part is redundant with prior text:
> " To use this option, you will need an implementation of the Gettext API. "
>

Modified.

>
> + Enabls use of the Zlib library
>
> typo: Enabls
>

Fixed.

>
> + This option is set to true by default and setting it to false will
>
> change "and" to ";" for spinlocks and atomics?
>

Done

>
> + Debug info is generated but the result is not optimized.
>
> Maybe say the "code" is not optimized ?
>

Changed

>
> + the tests can slow down the server significantly
>
> remove "can"
>

Done.


>
> + You can override the amount of parallel processes used with
>
> s/amount/number/
>

Done

>
> + If you'd like to build with a backend other that ninja
>
> other *than
>

Fixed.

>
> +  the GNU C library then you will additionally
>
> library comma
>

Added

>
> +argument. If no srcdir is given Meson will deduce
> the
>
> given comma
>

Added

>
> + It should be noted that after the initial configure step
>
> step comma
>

Added

>
> +After the installation you can free disk space by removing the built
>
> installation comma
>

Added

>
> +To learn more about running the tests and how to interpret the results
> +you can refer to the documentation for interpreting test results.
>
> interpret the results comma
> "for interpreting test results" seems redundant.
>

Changed.

>
> + ninja install should work for most cases but if you'd like to use more
> options
>
> cases comma
>

Added

>
> Starting with "Developer Options", this intermixes postgres
> project-specific options like cassert and tap-tests with meson's stuff
> like buildtype and werror.  IMO there's too much detail about meson's
> options, which I think is better left to that project's own
> documentation, and postgres docs should include only a brief 

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-10-26 Thread Andres Freund
Hi,

On 2022-10-24 14:38:52 -0400, Melanie Plageman wrote:
> > - "repossession" is a very unintuitive name for me. If we want something 
> > like
> >   it, can't we just name it reuse_failed or such?
>
> Repossession could be called eviction_failed or reuse_failed.
> Do we think we will ever want to use it to count buffers we released
> in other IOContexts (thus making the name eviction_failed better than
> reuse_failed)?

I've a somewhat radical proposal: Let's just not count any of this in the
initial version. I think we want something, but clearly it's one of the harder
aspects of this patch. Let's get the rest in, and then work on this is in
isolation.


> Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call
> it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What
> if, in the future, we want to track other IO done using data in local
> memory?

Fair point. However, I think 'tmp' or 'temp' would be worse, because there's
other sources of temporary files that would be worth counting, consider
e.g. tuplestore temporary files. 'temptable' isn't good because it's not just
tables. 'temprel'? On balance I think local is better, but not sure.


> Also, what if we want to track other IO done using data from shared memory
> that is not in shared buffers? Would IOCONTEXT_SB and IOCONTEXT_TEMP be
> better? Should IOContext literally describe the context of the IO being done
> and there be a separate column which indicates the source of the data for
> the IO?  Like wal_buffer, local_buffer, shared_buffer? Then if it is not
> block-oriented, it could be shared_mem, local_mem, or bypass?

Hm. I don't think we'd need _buffer for WAL or such, because there's nothing
else.


> If we had another dimension to the matrix "data_src" which, with
> block-oriented IO is equivalent to "buffer type", this could help with
> some of the clarity problems.
>
> We could remove the "reused" column and that becomes:
>
> IOCONTEXT | DATA_SRC        | IOOP
> ----------+-----------------+------
> strategy  | strategy_buffer | EVICT

> Having data_src and iocontext simplifies the meaning of all io
> operations involving a strategy. Some operations are done on shared
> buffers and some on existing strategy buffers and this would be more
> clear without the addition of special columns for strategies.

-1, I think this just blows up the complexity further, without providing much
benefit. But:

Perhaps a somewhat similar idea could be used to address the concerns in the
preceding paragraphs. How about the following set of columns:

backend_type:
object: relation, temp_relation[, WAL, tempfiles, ...]
iocontext: buffer_pool, bulkread, bulkwrite, vacuum[, bypass]
read:
written:
extended:
bytes_conversion:
evicted:
reused:
files_synced:
stats_reset:

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-10-26 Thread Melanie Plageman
okay, so I realized v35 had an issue where I wasn't counting strategy
evictions correctly. fixed in attached v36. This made me wonder if there
is actually a way to add a test for evictions (in strategy and shared
contexts) that is not flakey.

On Sun, Oct 23, 2022 at 6:48 PM Maciek Sakrejda  wrote:
>
> On Thu, Oct 20, 2022 at 10:31 AM Andres Freund  wrote:
> > - "repossession" is a very unintuitive name for me. If we want something 
> > like
> >   it, can't we just name it reuse_failed or such?
>
> +1, I think "repossessed" is awkward. I think "reuse_failed" works,
> but no strong opinions on an alternate name.

Also, re: repossessed, I can change it to reuse_failed but I do think it
is important to give users a way to distinguish between bulkread
rejections of dirty buffers and strategies failing to reuse buffers due
to concurrent pinning (since the reaction to these two scenarios would
likely be different).

If we added another column called something like "claim_failed" which
counts buffers which we failed to reuse because of concurrent pinning or
usage, we could recommend use of this column together with
"reuse_failed" to determine the cause of the failed reuses for a
bulkread. We could also use "claim_failed" in IOContext shared to
provide information on shared buffer contention.

- Melanie
From 0d5fc7da60f6b02259b8dd1d2eab25967cb9a95a Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v36 3/5] Aggregate IO operation stats per BackendType

Stats on IOOps for all IOContexts for a backend are tracked locally. Add
functionality for backends to flush these stats to shared memory and
accumulate them with those from all other backends, exited and live.
Also add reset and snapshot functions used by cumulative stats system
for management of these statistics.

The aggregated stats in shared memory could be extended in the future
with per-backend stats -- useful for per connection IO statistics and
monitoring.

Some BackendTypes will not flush their pending statistics at regular
intervals and explicitly call pgstat_flush_io_ops() during the course of
normal operations to flush their backend-local IO operation statistics
to shared memory in a timely manner.

Because not all BackendType, IOOp, IOContext combinations are valid, the
validity of the stats is checked before flushing pending stats and
before reading in the existing stats file to shared memory.

Author: Melanie Plageman 
Reviewed-by: Andres Freund 
Reviewed-by: Justin Pryzby 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Maciek Sakrejda 
Reviewed-by: Lukas Fittl 
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml  |   2 +
 src/backend/utils/activity/pgstat.c   |  35 
 src/backend/utils/activity/pgstat_bgwriter.c  |   7 +-
 .../utils/activity/pgstat_checkpointer.c  |   7 +-
 src/backend/utils/activity/pgstat_io_ops.c| 164 ++
 src/backend/utils/activity/pgstat_relation.c  |  15 +-
 src/backend/utils/activity/pgstat_shmem.c |   4 +
 src/backend/utils/activity/pgstat_wal.c   |   4 +-
 src/backend/utils/adt/pgstatfuncs.c   |   4 +-
 src/include/miscadmin.h   |   2 +
 src/include/pgstat.h  |  88 ++
 src/include/utils/pgstat_internal.h   |  36 
 src/tools/pgindent/typedefs.list  |   3 +
 13 files changed, 365 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..698f274341 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5390,6 +5390,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 the pg_stat_bgwriter
 view, archiver to reset all the counters shown in
 the pg_stat_archiver view,
+io to reset all the counters shown in the
+pg_stat_io view,
 wal to reset all the counters shown in the
 pg_stat_wal view or
 recovery_prefetch to reset all the counters shown
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1b97597f17..4becee9a6c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.snapshot_cb = pgstat_checkpointer_snapshot_cb,
 	},
 
+	[PGSTAT_KIND_IOOPS] = {
+		.name = "io_ops",
+
+		.fixed_amount = true,
+
+		.reset_all_cb = pgstat_io_ops_reset_all_cb,
+		.snapshot_cb = pgstat_io_ops_snapshot_cb,
+	},
+
 	[PGSTAT_KIND_SLRU] = {
 		.name = "slru",
 
@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
 
 	/* Don't expend a clock check if nothing to do */
 	if (dlist_is_empty(&pgStatPending) &&
+		!have_ioopstats &&
 		!have_slrustats &&
 		!pgstat_have_pending_wal())
 	{
@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
 	/* 

Re: Allow file inclusion in pg_hba and pg_ident files

2022-10-26 Thread Julien Rouhaud
On Wed, Oct 26, 2022 at 03:56:07PM +0900, Michael Paquier wrote:
>
> So, I have spent a good portion of today looking at what you have
> here, applying 0001 and 0003 while fixing, rebasing and testing the
> whole, discarding 0002 (we could do more for the line number and
> source file in terms of the LOGs reported for a regexec failure).

Thanks!

> Now remains 0004, which is the core of the proposal, and while it
> needs a rebase,

Have you already done a rebase while working on the patch or are you intending
to take care of it, or should I?  Let's not both do the work :)

> - The TAP test, which is half the size of the patch in line number.
> Could it be possible to make it more edible, introducing a basic
> infrastructure to check a set of rules in pg_hba.conf and
> pg_ident.conf without the inclusion logic?  Checks for error patterns
> (that I agree we strongly lack tests for) look like something we'd
> want to tackle independently of the inclusion logic, and it should be
> built on top of a basic test infra, at least.

I don't mind taking care of that, but before doing so I'd like to have some
feedback on whether you're ok with my approach (per my initial email about it
at [1]) or if you had some different
ideas on how to do it.

[1] 
https://www.postgresql.org/message-id/20220730080936.atyxodvwlmf2wnoc@jrouhaud




Re: Stack overflow issue

2022-10-26 Thread Egor Chindyaskin

24.08.2022 20:58, Tom Lane writes:

Nice work!  I wonder if you can make the regex crashes reachable by
reducing the value of max_stack_depth enough that it's hit before
reaching the "regular expression is too complex" limit.

regards, tom lane
Hello everyone! It's been a while since Alexander Lakhin and I published a
list of functions that are prone to stack overflow. We are back to tell you
more about such places.
During our analysis we concluded that some functions can be crashed without
changing any parameters, while others can be crashed only after changing
certain settings.


The first function crashes without any changes:

# CheckAttributeType

(n=6; printf "create domain dint as int; create domain dinta0 as 
dint[];"; for ((i=1;i<=$n;i++)); do printf "create domain dinta$i as 
dinta$(( $i - 1 ))[]; "; done; ) | psql

psql -c "create table t(f1 dinta6[]);"

Some of the others crash if we change "max_locks_per_transaction" 
parameter:


# findDependentObjects

max_locks_per_transaction = 200

(n=1; printf "create table t (i int); create view v0 as select * 
from t;"; for ((i=1;i<$n;i++)); do printf "create view v$i as select * 
from v$(( $i - 1 )); "; done; ) | psql

psql -c "drop table t"

# ATExecDropColumn

max_locks_per_transaction = 300

(n=5; printf "create table t0 (a int, b int); "; for 
((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 
))); "; done; printf "alter table t0 drop b;" ) | psql


# ATExecDropConstraint

max_locks_per_transaction = 300

(n=5; printf "create table t0 (a int, b int, constraint bc check (b 
> 0));"; for ((i=1;i<=$n;i++)); do printf "create table t$i() 
inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop 
constraint bc;" ) | psql


# ATExecAddColumn

max_locks_per_transaction = 200

(n=5; printf "create table t0 (a int, b int);"; for 
((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 
))); "; done; printf "alter table t0 add column c int;" ) | psql


# ATExecAlterConstrRecurse

max_locks_per_transaction = 300

(n=5;
printf "create table t(a int primary key); create table pt (a int 
primary key, foreign key(a) references t) partition by range (a);";
printf "create table pt0 partition of pt for values from (0) to (10) 
partition by range (a);";
for ((i=1;i<=$n;i++)); do printf "create table pt$i partition of pt$(( 
$i - 1 )) for values from ($i) to (10) partition by range (a); "; done;
printf "alter table pt alter constraint pt_a_fkey deferrable initially 
deferred;"

) | psql

This is where the fun begins. According to Tom Lane, a decrease in
max_stack_depth could lead to new crashes, but it turned out that Alexander
was able to find new crashes precisely by increasing this parameter. Also,
we had ulimit -s set to the default value of 8MB.


# eval_const_expressions_mutator

max_stack_depth = '7000kB'

(n=1; printf "select 'a' "; for ((i=1;i<$n;i++)); do printf " 
collate \"C\" "; done; ) | psql


If, like me, you didn't get a crash when Alexander shared his find,
your cluster was probably built with the -Og optimization flag. While
trying to break this function, we concluded that the reachable
recursion depth depends on the optimization level (-O0 vs. -Og): with
optimization, each function frame on the stack is smaller, so the
limit is reached more slowly and the system withstands more recursion.
Consequently, this query will only fail if you have a cluster built
with the -O0 optimization flag.


The crash of the next function depends not only on the optimization
flag but also on a number of other things. While researching, we
noticed that postgres enforces a margin of roughly 400kB between
max_stack_depth and ulimit -s. We figured we could approach the
max_stack_depth limit and then hit the OS limit as well. So Alexander
wrote a recursive SQL function that eats up the stack within
max_stack_depth, including a query that eats up the remaining ~400kB.
This combination causes a crash.


# executeBoolItem

max_stack_depth = '7600kB'

create function infinite_recurse(i int) returns int as $$
begin
  raise notice 'Level %', i;
  begin
    perform jsonb_path_query('{"a":[1]}'::jsonb,
      ('$.a[*] ? (' || repeat('!(', 4800) || '@ == @' ||
       repeat(')', 4800) || ')')::jsonpath);
  exception
    when others then
      raise notice 'jsonb_path_query error at level %, %', i, sqlerrm;
  end;
  begin
    select infinite_recurse(i + 1) into i;
  exception
    when others then
      raise notice 'Max stack depth reached at level %, %', i, sqlerrm;
  end;
  return i;
end;
$$ language plpgsql;

select infinite_recurse(1);

To sum it all up, we have not yet decided on a general approach to such
functions. Some functions are definitely subject to stack overflow, and
some are definitely not; this can be seen from the code where the
recurse flag is passed, or a function that checks the stack is
Re: Reducing duplicativeness of EquivalenceClass-derived clauses

2022-10-26 Thread Tom Lane
Richard Guo  writes:
> On Wed, Oct 26, 2022 at 6:09 AM Tom Lane  wrote:
>> The only thing that I think might be controversial here is that
>> I dropped the check for matching operator OID.  To preserve that,
>> we'd have needed to use get_commutator() in the reverse-match cases,
>> which it seemed to me would be a completely unjustified expenditure
>> of cycles.  The operators we select for freshly-generated clauses
>> will certainly always match those of previously-generated clauses.
>> Maybe there's a chance that they'd not match those of ec_sources
>> clauses (that is, the user-written clauses we started from), but
>> if they don't and that actually makes any difference then surely
>> we are talking about a buggy opclass definition.

> The operator is chosen according to the data types of the two given EC
> members.  Since we are dealing with the same pair of EC members, I think
> the operator is always the same one.  So it also seems fine to drop the
> check for the operator.  I wonder if we could even add an assertion that
> the operator matches when we've found a RestrictInfo in ec_derives.

Yeah, I considered that --- even if somehow an ec_sources entry isn't
an exact match, ec_derives ought to be.  However, it still didn't seem
worth a get_commutator() call.  We'd basically be expending cycles to
check that select_equality_operator yields the same result with the same
inputs as it did before, and that doesn't seem terribly interesting to
check.  I'm also not sure what's the point of allowing divergence
from the requested operator in some but not all paths.

I added a bit of instrumentation to count how many times we need to build
new join clauses in create_join_clause.  In the current core regression
tests, I see this change reducing the number of new join clauses built
here from 9673 to 5142 (out of 26652 calls).  So not quite 50% savings,
but pretty close to it.  That should mean that this change is about
a wash just in terms of the code it touches directly: each iteration
of the search loops is nearly twice as expensive as before, but we'll
only need to do about half as many.  So whatever we save downstream
is pure gravy.
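
To make the shape of the change concrete, the matching logic amounts to
the sketch below.  The helper name is invented for illustration; the
real code scans ec_sources and ec_derives in place and also requires a
parent_ec match:

static RestrictInfo *
find_matching_clause(List *clauses, EquivalenceMember *leftem,
					 EquivalenceMember *rightem)
{
	ListCell   *lc;

	foreach(lc, clauses)
	{
		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

		if (rinfo->left_em == leftem && rinfo->right_em == rightem)
			return rinfo;		/* exact match, as before */
		if (rinfo->left_em == rightem && rinfo->right_em == leftem)
			return rinfo;		/* commuted match, no get_commutator() call */
	}
	return NULL;
}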

regards, tom lane




Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-26 Thread Amit Kapila
On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada  wrote:
>
> On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com
>  wrote:
>
> I've started to review this patch. I tested v40-0001 patch and have
> one question:
>
> IIUC even when most of the changes in the transaction are filtered out
> in pgoutput (eg., by relation filter or row filter), the walsender
> sends STREAM_START. This means that the subscriber could end up
> launching parallel apply workers also for almost empty (and streamed)
> transactions. For example, I created three subscriptions each of which
> subscribes to a different table. When I loaded a large amount of data
> into one table, all three (leader) apply workers received START_STREAM
> and launched their parallel apply workers.
>

The apply workers will be launched just the first time; after that we
maintain a pool so that we don't need to restart them.

> However, two of them
> finished without applying any data. I think this behaviour looks
> problematic since it wastes workers and rather decreases the apply
> performance if the changes are not large. Is it worth considering a
> way to delay launching a parallel apply worker until we find out the
> amount of changes is actually large?
>

I think even if the changes are small there may not be much difference,
because we have observed that the performance improvement comes from
not writing to a file.

> For example, the leader worker
> writes the streamed changes to files as usual and launches a parallel
> worker if the amount of changes exceeds a threshold or the leader
> receives the second segment. After that, the leader worker switches to
> send the streamed changes to parallel workers via shm_mq instead of
> files.
>

I think writing to a file won't be a good idea, as that can hamper the
performance benefit in some cases, and I am not sure it is worth it.

-- 
With Regards,
Amit Kapila.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread John Naylor
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada 
wrote:

> I've attached updated PoC patches for discussion and cfbot. From the
> previous version, I mainly changed the following things:
>
> * Separate treatment of inner and leaf nodes

Overall, this looks much better!

> * Pack both the node kind and node count to an uint16 value.

For this, I did mention a bitfield earlier as something we "could" do, but
it wasn't clear we should. After looking again at the node types, I must
not have thought through this at all. Storing one byte instead of four for
the full enum is a good step, but saving one more byte usually doesn't buy
anything because of padding, with a few exceptions like this example:

node4:   4 +  4      +  4*8 =  40 bytes
node4:   5 +  4+(7)  +  4*8 =  48 bytes

Even there, I'd rather not spend the extra cycles to access the members.
And with my idea of decoupling size classes from kind, the variable-sized
kinds will require another byte to store "capacity". Then, even if the kind
gets encoded in a pointer tag, we'll still have 5 bytes in the base type.
So I think we should assume 5 bytes from the start. (Might be 6 temporarily
if I work on size decoupling first).

(Side note, if you have occasion to use bitfields again in the future, C99
has syntactic support for them, so no need to write your own
shifting/masking code).
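
For instance, a hypothetical declaration could look like this (field
names and widths here are made up purely for illustration):

typedef struct rt_node_header
{
	uint16		kind:3,			/* node kind */
				count:13;		/* number of children */
} rt_node_header;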

> I've not done SIMD part seriously yet. But overall the performance
> seems good so far. If we agree with the current approach, I think we
> can proceed with the verification of decoupling node sizes from node
> kind. And I'll investigate DSA support.

Sounds good. I have some additional comments about v7, and after these are
addressed, we can proceed independently with the above two items. Seeing
the DSA work will also inform me how invasive pointer tagging will be.
There will still be some performance tuning and cosmetic work, but it's
getting closer.

-
0001:

+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif

Leftover from an earlier version?

+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);

Leftovers, causing compiler warnings. (Also see new variable shadow warning)

+#else /* USE_NO_SIMD */
+	Vector8 r = 0;
+	uint8 *rp = (uint8 *) &r;
+
+	for (Size i = 0; i < sizeof(Vector8); i++)
+		rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+	return r;
+#endif

As I mentioned a couple versions ago, this style is really awkward, and
potential non-SIMD callers will be better off writing their own byte-wise
loop rather than using this API. Especially since the "min" function exists
only as a workaround for lack of unsigned comparison in (at least) SSE2.
There is one existing function in this file with that idiom for non-assert
code (for completeness), but even there, inputs of current interest to us
use the uint64 algorithm.

0002:

+ /* XXX: should not to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) <<
sizeof(Vector8));

Hmm?

+/*
+ * Return index of the first element in chunks in the given node that is
greater
+ * than or equal to 'key'.  Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)

The caller must now have logic for inserting at the end:

+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */

It would be a bit more clear if node_*_search_ge() always returns the
position we need (see the prototype for example). In fact, these functions
are probably better named node*_get_insertpos().
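
In other words, something like this sketch (assuming the chunks[] array
and count macro visible in the patch):

static inline int
node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
{
	int			count = NODE_GET_COUNT(node);
	int			pos;

	/* return the slot 'chunk' belongs in; equals count when appending */
	for (pos = 0; pos < count; pos++)
	{
		if (node->chunks[pos] >= chunk)
			break;
	}

	return pos;
}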

+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */

We want all the node-growing code to be pushed down to the bottom so that
all branches of the hot path are close together. This provides better
locality for the CPU frontend. Looking at the assembly, the above doesn't
have the desired effect, so we need to write like this (also see prototype):

if (unlikely( ! has-free-slot))
  grow-node;
else
{
  ...;
  break;
}
/* FALLTHROUGH */

+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+   rt_node    *child;
+
+   if (NODE_IS_LEAF(node))
+ break;
+
+   if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+   Assert(child);
+
+   parent = node;
+   node = child;
+   shift -= RT_NODE_SPAN;
+ }

Note that if we have to call rt_node_add_new_child(), each successive loop
iteration must search it and find nothing there (the prototype had a
separate function to handle this). Maybe it's not that critical yet, but
something to keep in mind as we proceed. Maybe a comment about it to remind
us.
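
Perhaps as simple as:

/*
 * Note: if rt_node_add_new_child() extended the tree above, the new
 * child is empty, so each deeper iteration's rt_node_search_inner() is
 * bound to find nothing and extend the tree again.  A dedicated
 * "extend" path (as in the prototype) would avoid those doomed
 * searches.
 */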

+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, 

Re: Reducing duplicativeness of EquivalenceClass-derived clauses

2022-10-26 Thread Richard Guo
On Wed, Oct 26, 2022 at 6:09 AM Tom Lane  wrote:

> While fooling with my longstanding outer-join variables changes
> (I am making progress on that, honest), I happened to notice that
> equivclass.c is leaving some money on the table by generating
> redundant RestrictInfo clauses.  It already attempts to not generate
> the same clause twice, which can save a nontrivial amount of work
> because we cache selectivity estimates and so on per-RestrictInfo.
> I realized though that it will build and return a clause like
> "a.x = b.y" even if it already has "b.y = a.x".  This is just
> wasteful.  It's always been the case that equivclass.c will
> produce clauses that are ordered according to its own whims.
> Consumers that need the operands in a specific order, such as
> index scans or hash joins, are required to commute the clause
> to be the way they want it while building the finished plan.
> Therefore, it shouldn't matter which order of the operands we
> return, and giving back the commutator clause if available could
> potentially save as much as half of the selectivity-estimation
> work we do with these clauses subsequently.
>
> Hence, PFA a patch that adjusts create_join_clause() to notice
> commuted as well as exact matches among the EquivalenceClass's
> existing clauses.  This results in a number of changes visible in
> regression test cases, but they're all clearly inconsequential.


I think there is no problem with this idea, given the operands of
EC-derived clauses are commutative, and it seems no one would actually
rely on the order of the operands.  I can see hashjoin/mergejoin would
commute hash/merge joinclauses if needed with get_switched_clauses().


>
> The only thing that I think might be controversial here is that
> I dropped the check for matching operator OID.  To preserve that,
> we'd have needed to use get_commutator() in the reverse-match cases,
> which it seemed to me would be a completely unjustified expenditure
> of cycles.  The operators we select for freshly-generated clauses
> will certainly always match those of previously-generated clauses.
> Maybe there's a chance that they'd not match those of ec_sources
> clauses (that is, the user-written clauses we started from), but
> if they don't and that actually makes any difference then surely
> we are talking about a buggy opclass definition.


The operator is chosen according to the data types of the two given EC
members.  Since we are dealing with the same pair of EC members, I think
the operator is always the same one.  So it also seems fine to drop the
check for the operator.  I wonder if we could even add an assertion that
the operator matches when we've found a RestrictInfo in ec_derives.

Thanks
Richard


Re: Have nodeSort.c use datum sorts single-value byref types

2022-10-26 Thread David Rowley
On Thu, 29 Sept 2022 at 18:12, David Rowley  wrote:
> In the attached patch, I've added a function named
> tuplesort_getdatum_nocopy() which is the same as tuplesort_getdatum()
> only without the datumCopy(). I opted for the new function rather than
> a new parameter in the existing function just to reduce branching and
> additional needless overhead.

Per what was said over on [1], I've adjusted the patch to just add a
'copy' parameter to tuplesort_getdatum() instead of adding the
tuplesort_getdatum_nocopy() function.

I also adjusted some code in heapam_index_validate_scan() to pass
copy=false to tuplesort_getdatum().  The datum in question here is a
TID type, so this really only saves a datumCopy() / pfree on 32-bit
systems. I wasn't too interested in speeding 32-bit systems up with
this, it was more a case of being able to remove the #ifndef
USE_FLOAT8_BYVAL / pfree code.

I think this is a fairly trivial patch, so if nobody objects, I plan
to push it in the next few days.

David

[1] https://www.postgresql.org/message-id/65629.1664460603%40sss.pgh.pa.us
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a3414a76e8..41f1ca65d0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1879,17 +1879,13 @@ heapam_index_validate_scan(Relation heapRelation,
}
 
 		tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-											  &ts_val, &ts_isnull, NULL);
+											  false, &ts_val, &ts_isnull,
+											  NULL);
Assert(tuplesort_empty || !ts_isnull);
if (!tuplesort_empty)
{
 			itemptr_decode(&decoded, DatumGetInt64(ts_val));
 			indexcursor = &decoded;
-
-			/* If int8 is pass-by-ref, free (encoded) TID Datum memory */
-#ifndef USE_FLOAT8_BYVAL
-			pfree(DatumGetPointer(ts_val));
-#endif
}
else
{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index bcf3b1950b..986f761e87 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -879,7 +879,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 */
 
 	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-							  true, newVal, isNull, &newAbbrevVal))
+							  true, false, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -900,24 +900,33 @@ process_ordered_aggregate_single(AggState *aggstate,
 										   pertrans->aggCollation,
 										   oldVal, *newVal)))
 		{
-			/* equal to prior, so forget this one */
-			if (!pertrans->inputtypeByVal && !*isNull)
-				pfree(DatumGetPointer(*newVal));
+			MemoryContextSwitchTo(oldContext);
+			continue;
 		}
 		else
 		{
 			advance_transition_function(aggstate, pertrans, pergroupstate);
-			/* forget the old value, if any */
-			if (!oldIsNull && !pertrans->inputtypeByVal)
-				pfree(DatumGetPointer(oldVal));
-			/* and remember the new one for subsequent equality checks */
-			oldVal = *newVal;
+
+			MemoryContextSwitchTo(oldContext);
+
+			/*
+			 * Forget the old value, if any, and remember the new one for
+			 * subsequent equality checks.
+			 */
+			if (!pertrans->inputtypeByVal)
+			{
+				if (!oldIsNull)
+					pfree(DatumGetPointer(oldVal));
+				if (!*isNull)
+					oldVal = datumCopy(*newVal, pertrans->inputtypeByVal,
+									   pertrans->inputtypeLen);
+			}
+			else
+				oldVal = *newVal;
 			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;

Re: replacing role-level NOINHERIT with a grant-level option

2022-10-26 Thread Pavel Luzanov

Hello,


Thanks for reviewing. Committed.


Let me return to this topic.

After looking at the changes in this patch, I have a suggestion.

The inheritance option for role memberships is important information
for knowing whether role privileges will be available automatically or
whether a switch with the SET ROLE command is required. However, this
information cannot be obtained with psql commands, specifically \du or
\dg.


Previously, you could see the inherit attribute of the role (its absence
is shown by \du). Now you have to look in the pg_auth_members system
catalog.


My suggestion is to add information about pg_auth_members.inherit_option 
to the output of \du (\dg).


If so, we can also add information about pg_auth_members.admin_option. 
Right now this information is not available in psql command output either.


I thought about how exactly to represent these options in the \du
output, but did not settle on a single solution. I considered the
following choices:


1.
Add a \du+ command and, for each membership in the role, show the values
of the two options. I haven't written a patch yet, but you can imagine
the changes like this:


CREATE ROLE alice LOGIN IN ROLE pg_read_all_data;

\du+ alice
 List of roles
 Role name | Attributes |               Member of
-----------+------------+----------------------------------------
 alice     |            | {pg_read_all_data(admin=f inherit=t)}

It looks long, but for \du+ it's not a problem.

2.
I assume that the default values for these options will rarely change. 
In that case, we can do without \du+ and output only the changed values 
directly in the \du command.


GRANT pg_read_all_data TO alice WITH INHERIT FALSE;

2a.
\du alice
 List of roles
 Role name | Attributes |           Member of
-----------+------------+--------------------------------
 alice     |            | {pg_read_all_data(inherit=f)}

2b.
Similar to GRANT OPTION, we can use symbols instead of long text 
(inherit=f) for options. For example, for the ADMIN OPTION we can use 
"*" (again similar to GRANT OPTION), and for the missing INHERIT option 
something else, such as "-":


GRANT pg_read_all_data TO alice WITH ADMIN TRUE;
GRANT pg_write_all_data TO alice WITH INHERIT FALSE;

\du alice
 List of roles
 Role name | Attributes |                Member of
-----------+------------+------------------------------------------
 alice     |            | {pg_read_all_data*-,pg_write_all_data-}

But I think choices 2a and 2b are too complicated to understand,
especially because the two options have different default values.
Moreover, the default value of the INHERIT option depends on the value
of the role's INHERIT attribute.


So I like the first choice with \du+ better.

But perhaps there are other choices as well.

If it's interesting, I'm ready to open a new thread (the commitfest 
entry for this topic is now closed) and prepare a patch.


--
Pavel Luzanov





Re: Suppressing useless wakeups in walreceiver

2022-10-26 Thread Bharath Rupireddy
On Sun, Oct 16, 2022 at 9:29 AM Nathan Bossart  wrote:
>
> Here's a different take.  Instead of creating structs and altering function
> signatures, we can just make the wake-up times file-global, and we can skip
> the changes related to reducing the number of calls to
> GetCurrentTimestamp() for now.  This results in a far less invasive patch.
> (I still think reducing the number of calls to GetCurrentTimestamp() is
> worthwhile, but I'm beginning to agree with Bharath that it should be
> handled separately.)
>
> Thoughts?

Thanks. I'm wondering if we need to extend a similar approach to the
logical replication workers, in LogicalRepApplyLoop() in
logical/worker.c?

A comment on the patch:
Isn't it better to track the soonest wakeup time, i.e., the minimum of
the wakeup[] array (updating it whenever an array element is updated in
WalRcvComputeNextWakeup()), instead of recomputing it every time by
looping over NUM_WALRCV_WAKEUPS entries? I think this will save us some
CPU cycles, because the for loop in which the code below is placed runs
for the walreceiver's entire life cycle. We could wrap wakeup[] and the
minimum inside a structure for better code organization, or just add the
minimum as another static variable alongside wakeup[].

+		/* Find the soonest wakeup time, to limit our nap. */
+		nextWakeup = PG_INT64_MAX;
+		for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+			nextWakeup = Min(wakeup[i], nextWakeup);
+		nap = Max(0, (nextWakeup - now + 999) / 1000);
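
As a rough sketch of what I mean (the function and variable names here
are only illustrative):

static TimestampTz wakeup[NUM_WALRCV_WAKEUPS];
static TimestampTz minWakeup = PG_INT64_MAX;

/* Cache the minimum, refreshing it only when a slot changes. */
static void
WalRcvUpdateWakeup(int reason, TimestampTz value)
{
	wakeup[reason] = value;

	minWakeup = PG_INT64_MAX;
	for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
		minWakeup = Min(wakeup[i], minWakeup);
}

The main loop could then compute the nap time directly from minWakeup.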

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




confused with name in the pic

2022-10-26 Thread jack...@gmail.com
typedef struct A_Expr
{
	pg_node_attr(custom_read_write)

	NodeTag		type;
	A_Expr_Kind kind;			/* see above */
	List	   *name;			/* possibly-qualified name of operator */
	Node	   *lexpr;			/* left argument, or NULL if none */
	Node	   *rexpr;			/* right argument, or NULL if none */
	int			location;		/* token location, or -1 if unknown */
} A_Expr;

I ran a SQL statement like "select a, b from t3 where a > 1;" and got
the parse tree for the SelectStmt shown in the attached picture. Why is
the name '-' and not '>'?

jack...@gmail.com




1.png
Description: Binary data


Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant

2022-10-26 Thread David Rowley
On Fri, 14 Oct 2022 at 15:15, Richard Guo  wrote:
> The v3 patch looks good to me.

Thank you for looking at that.

One other thought I had about the duplicate "Limit" node in the final
plan was that we could make the limit clause an Expr like
LEAST(<original limit>, 1).  That way we could ensure we get at most 1
row, but perhaps fewer if the expression given in the LIMIT clause
evaluated to 0. This will still work correctly when the existing limit
evaluates to NULL. I'm still just not that keen on this idea as it
means still having to either edit the parse tree's limitCount or store
the limit details in a new field in PlannerInfo and use that when
making the final LimitPath. However, I'm still not sure doing this is
worth the extra complexity.
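
For the record, building such an expression should be straightforward;
a rough sketch (the helper name is invented, and limitCount is assumed
to be the int8 limit expression from the parse tree):

static Node *
make_least_one(Node *limitCount)
{
	MinMaxExpr *least = makeNode(MinMaxExpr);

	least->minmaxtype = INT8OID;
	least->minmaxcollid = InvalidOid;
	least->inputcollid = InvalidOid;
	least->op = IS_LEAST;
	least->args = list_make2(limitCount,
							 makeConst(INT8OID, -1, InvalidOid,
									   sizeof(int64), Int64GetDatum(1),
									   false, FLOAT8PASSBYVAL));
	least->location = -1;

	return (Node *) least;
}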

If nobody else has any thoughts on this or the patch in general, then
I plan to push it in the next few days.

David




Re: Some regression tests for the pg_control_*() functions

2022-10-26 Thread Bharath Rupireddy
On Wed, Oct 26, 2022 at 12:48 PM Michael Paquier  wrote:
>
> On Wed, Oct 26, 2022 at 10:13:29AM +0530, Bharath Rupireddy wrote:
> > +1 for improving the test coverage. Is there a strong reason to
> > validate individual output columns rather than select count(*) > 0
> > from pg_control_(); sort of tests? If the intention is to validate
> > the pg_controlfile contents, we have pg_controldata to look at and
> > pg_control_() functions doing crc checks.
>
> And it could be possible that the control file finishes by writing
> some incorrect data due to a bug in the backend.

We will have bigger problems if a backend corrupts the pg_control
file, no? For instance, the server might not come up at all, or it
might behave abnormally.

> Adding a count(*) or
> similar to get the number of fields of the function is basically the
> same as checking its execution, still I'd like to think that having a
> minimum set of checks would be kind of nice on top of that. Among all
> the ones I wrote in the patch upthread, the following ones would be in
> my minimalistic list:
> - timeline_id > 0
> - timeline_id >= prev_timeline_id
> - checkpoint_lsn >= redo_lsn
> - data_page_checksum_version >= 0
> - Perhaps the various fields of pg_control_init() using their
> lower-bound values.
> - Perhaps pg_control_version and/or catalog_version_no > NN

Can't the CRC check detect any of the above corruptions? Do we have
any evidence of a backend corrupting the pg_control file or any of the
above variables while running regression tests?

If the concern is a backend corrupting the pg_control file in a way the
CRC check can't detect, then the extra checks (as proposed in the patch)
must be placed within the core (perhaps before writing / after reading
the pg_control file), certainly not in regression tests.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: GUC values - recommended way to declare the C variables?

2022-10-26 Thread Peter Smith
Thanks for the feedback. PSA the v5 patch.

On Wed, Oct 26, 2022 at 7:04 AM Justin Pryzby  wrote:
>
> +#ifdef USE_ASSERT_CHECKING
> +   sanity_check_GUC_C_var(hentry->gucvar);
> +#endif
>
> => You can conditionally define that as an empty function so #ifdefs
> aren't needed in the caller:
>
> void sanity_check_GUC_C_var()
> {
> #ifdef USE_ASSERT_CHECKING
> ...
> #endif
> }
>

Fixed as suggested.
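
The function now has roughly this shape, so it compiles down to an
empty body in non-assert builds and callers need no #ifdef (sketch
only, checks elided):

static void
sanity_check_GUC_C_var(struct config_generic *gucvar)
{
#ifdef USE_ASSERT_CHECKING
	/* ... the actual sanity checks on gucvar ... */
#endif
}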

> But actually, I don't think you should use my patch.  You needed to
> exclude update_process_title:
>
> src/backend/utils/misc/ps_status.c:bool update_process_title = true;
> ...
> src/backend/utils/misc/guc_tables.c-#ifdef WIN32
> src/backend/utils/misc/guc_tables.c-false,
> src/backend/utils/misc/guc_tables.c-#else
> src/backend/utils/misc/guc_tables.c-true,
> src/backend/utils/misc/guc_tables.c-#endif
> src/backend/utils/misc/guc_tables.c-NULL, NULL, NULL
>
> My patch would also exclude the 16 other GUCs with compile-time defaults
> from your check.  It'd be better not to exclude them; I think the right
> solution is to change the C variable initialization to a compile-time
> constant:
>
> #ifdef WIN32
> bool update_process_title = false;
> #else
> bool update_process_title = true;
> #endif
>
> Or something more indirect like:
>
> #ifdef WIN32
> #define DEFAULT_PROCESS_TITLE false
> #else
> #define DEFAULT_PROCESS_TITLE true
> #endif
>
> bool update_process_title = DEFAULT_PROCESS_TITLE;
>
> I suspect there's not many GUCs that would need to change - this might
> be the only one.  If this GUC were defined in the inverse (bool
> skip_process_title), it wouldn't need special help, either.
>

I re-checked all the GUC C vars which your patch flags as
GUC_DEFAULT_COMPILE. For some of them, where it was no trouble, I made
the C var assignment use the same preprocessor rules as used by
guc_tables. For the others (mostly the string ones) I left the GUC C
var untouched, because the sanity checker function already has a rule
not to complain about int GUC C vars which are 0 or string GUC vars
which are NULL.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v5-0001-GUC-C-variable-sanity-check.patch
Description: Binary data


Re: Some regression tests for the pg_control_*() functions

2022-10-26 Thread Michael Paquier
On Wed, Oct 26, 2022 at 10:13:29AM +0530, Bharath Rupireddy wrote:
> +1 for improving the test coverage. Is there a strong reason to
> validate individual output columns rather than select count(*) > 0
> from pg_control_(); sort of tests? If the intention is to validate
> the pg_controlfile contents, we have pg_controldata to look at and
> pg_control_() functions doing crc checks.

And it could be possible that the control file finishes by writing
some incorrect data due to a bug in the backend.  Adding a count(*) or
similar to get the number of fields of the function is basically the
same as checking its execution, still I'd like to think that having a
minimum set of checks would be kind of nice on top of that.  Among all
the ones I wrote in the patch upthread, the following ones would be in
my minimalistic list:
- timeline_id > 0
- timeline_id >= prev_timeline_id
- checkpoint_lsn >= redo_lsn
- data_page_checksum_version >= 0
- Perhaps the various fields of pg_control_init() using their
lower-bound values.
- Perhaps pg_control_version and/or catalog_version_no > NN

> If this isn't enough, we
> can have the pg_control_validate() function to do all the necessary
> checks and simplify the tests, no?

There is no function like that.  Perhaps you mean to introduce
something like that at the C level, but that does not seem necessary
to me as long as SQL is able to do the job for the most meaningful
parts.
--
Michael


signature.asc
Description: PGP signature


Re: Allow file inclusion in pg_hba and pg_ident files

2022-10-26 Thread Michael Paquier
On Wed, Oct 26, 2022 at 11:19:48AM +0800, Julien Rouhaud wrote:
> That wouldn't be overdoing anymore if we remove the line number / filename 
> from
> the fill_*_line prototypes right?

So, I have spent a good portion of today looking at what you have
here, applying 0001 and 0003 while fixing, rebasing and testing the
whole, discarding 0002 (we could do more for the line number and
source file in terms of the LOGs reported for a regexec failure). 

Now remains 0004, which is the core of the proposal, and while it
needs a rebase, I have not been able to spend much time looking at its
internals.  In order to help with the follow-up steps, I have spotted
a few areas that could be done first:
- The diffs in guc.h/c for the introduction of GetDirConfFiles() to
get a set of files in a directory (got to think a bit more about
ParseConfigFp() when it comes to the keywords, but that comes into play
with the parsing logic, which is different).
- The TAP test, which is half the size of the patch in line number.
Could it be possible to make it more edible, introducing a basic
infrastructure to check a set of rules in pg_hba.conf and
pg_ident.conf without the inclusion logic?  Checks for error patterns
(that I agree we strongly lack tests for) look like something we'd
want to tackle independently of the inclusion logic, and it should be
built on top of a basic test infra, at least.
--
Michael


signature.asc
Description: PGP signature


Re: Move backup-related code to xlogbackup.c/.h

2022-10-26 Thread Bharath Rupireddy
On Mon, Oct 24, 2022 at 1:00 PM Michael Paquier  wrote:
>
> On Wed, Oct 19, 2022 at 09:07:04PM +0530, Bharath Rupireddy wrote:
> > XLogBackupResetRunning() seemed better. +1 for above function names.
>
> I see what you are doing here.  XLogCtl would still live in xlog.c,
> but we want to have functions that are able to manipulate some of its
> fields.

Right.

> I am not sure I like that much, because it introduces a
> circular dependency between xlog.c and xlogbackup.c.  As of HEAD,
> xlog.c calls build_backup_content() from xlogbackup.c, which is fine
> as xlog.c is kind of a central piece that feeds on the insert and
> recovery pieces.  However your patch makes some code paths of
> xlogbackup.c call routines from xlog.c, and I don't think that we
> should do that.

If you're talking about header file dependency, there's already a
header file dependency between them - xlog.c includes xlogbackup.h for
build_backup_content() and xlogbackup.c includes xlog.h for
wal_segment_size. And I think the same kind of dependency exists
between xlog.c and xlogrecovery.c.

Please note that we're trying to reduce the size of xlog.c in addition
to centralizing backup-related code.

> > I'm okay either way.
> >
> > Please see the attached v8 patch set.
>
> Among all that, CleanupBackupHistory() is different, still it has a
> dependency with some of the archiving pieces..

Is there a problem with that? This function is used solely by backup
functions, and it happens to use one of the archiving utility
functions. Please see the other archiving utility functions being used
elsewhere in the code, not only in xlog.c - for instance,
KeepFileRestoredFromArchive() and XLogArchiveNotify().

I'm attaching the v9 patch set herewith after rebasing. Please review
it further.
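
To illustrate the direction, the first hunk in the attached patch
replaces the inlined lock/increment/release sequence with a thin
wrapper along these lines (a sketch; the body stays in xlog.c, where
XLogCtl remains visible):

void
XLogBackupSetRunning(void)
{
	WALInsertLockAcquireExclusive();
	XLogCtl->Insert.runningBackups++;
	WALInsertLockRelease();
}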

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 385f42ebf6b85fd4a0ad6211428f4dd92e53b269 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 26 Oct 2022 05:15:37 +
Subject: [PATCH v9] Add functions for xlogbackup.c to call back into xlog.c

---
 src/backend/access/transam/xlog.c  | 175 -
 src/include/access/xlog.h  |   1 +
 src/include/access/xlog_internal.h |   8 ++
 3 files changed, 131 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..34260a2434 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8311,9 +8311,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 	 * runningBackups, to ensure adequate interlocking against
 	 * XLogInsertRecord().
 	 */
-	WALInsertLockAcquireExclusive();
-	XLogCtl->Insert.runningBackups++;
-	WALInsertLockRelease();
+	XLogBackupSetRunning();
 
 	/*
 	 * Ensure we decrement runningBackups if we fail below. NB -- for this to
@@ -8383,12 +8381,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 			 * to restore starting from the checkpoint is precisely the REDO
 			 * pointer.
 			 */
-			LWLockAcquire(ControlFileLock, LW_SHARED);
-			state->checkpointloc = ControlFile->checkPoint;
-			state->startpoint = ControlFile->checkPointCopy.redo;
-			state->starttli = ControlFile->checkPointCopy.ThisTimeLineID;
-			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
-			LWLockRelease(ControlFileLock);
+			GetCheckpointLocation(&state->checkpointloc, &state->startpoint,
+								  &state->starttli, &checkpointfpw);
 
 			if (backup_started_in_recovery)
 			{
@@ -8399,9 +8393,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
  * (i.e., since last restartpoint used as backup starting
  * checkpoint) contain full-page writes.
  */
-			SpinLockAcquire(&XLogCtl->info_lck);
-			recptr = XLogCtl->lastFpwDisableRecPtr;
-			SpinLockRelease(&XLogCtl->info_lck);
+			recptr = XLogGetLastFPWDisableRecptr();
 
 if (!checkpointfpw || state->startpoint <= recptr)
 	ereport(ERROR,
@@ -8434,13 +8426,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 			 * taking a checkpoint right after another is not that expensive
 			 * either because only few buffers have been dirtied yet.
 			 */
-			WALInsertLockAcquireExclusive();
-			if (XLogCtl->Insert.lastBackupStart < state->startpoint)
-			{
-				XLogCtl->Insert.lastBackupStart = state->startpoint;
-				gotUniqueStartpoint = true;
-			}
-			WALInsertLockRelease();
+			gotUniqueStartpoint = XLogBackupSetLastStart(state->startpoint);
 		} while (!gotUniqueStartpoint);
 
 		/*
@@ -8549,6 +8535,15 @@ get_backup_status(void)
 	return sessionBackupState;
 }
 
+/*
+ * Utility routine to reset the session-level status of a backup running.
+ */
+void
+reset_backup_status(void)
+{
+	sessionBackupState = SESSION_BACKUP_NONE;
+}
+
 /*
  * do_pg_backup_stop
  *
@@ -8590,33 +8585,16 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
  errhint("wal_level must be set to \"replica\" or