Re: Logical replication timeout problem

2022-04-05 Thread Amit Kapila
On Wed, Apr 6, 2022 at 11:09 AM wangw.f...@fujitsu.com wrote:
>
> On Fri, Apr 1, 2022 at 12:09 AM Amit Kapila  wrote:
> > On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira  wrote:
> > >
> > > It seems I didn't make myself clear. The callbacks I'm referring to are the
> > > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call inside a
> > > *_cb_wrapper() function, we have something like:
> > >
> > > if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
> > > NewUpdateProgress(ctx, false);
> > >
> > > The NewUpdateProgress function would contain logic similar to the
> > > update_progress() from the proposed patch. (A different function name here
> > > just to avoid confusion.)
> > >
> > > The output plugin is responsible for setting ctx->progress with the callback
> > > flags (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) for which
> > > we would like to run NewUpdateProgress.
> > >
> >
> > This sounds like a conflicting approach to what we currently do.
> > Currently, OutputPluginUpdateProgress() is called from the xact
> > related pgoutput functions like pgoutput_commit_txn(),
> > pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we
> > follow what you are saying then for some of the APIs like
> > pgoutput_change/_message/_truncate, we need to set the parameter to
> > invoke NewUpdateProgress() which will internally call
> > OutputPluginUpdateProgress(), and for the remaining APIs, we will call
> > it in the corresponding pgoutput_* function. I feel if we want to make it
> > more generic than the current patch, it is better to directly call
> > what you are referring to here as NewUpdateProgress() in all remaining
> > APIs like pgoutput_change/_truncate, etc.
> Thanks for your comments.
>
> Following your suggestion, I have improved the patch to make it more generic.
> Attached is the new patch.
>

 typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct LogicalDecodingContext *lr,
                                                          XLogRecPtr Ptr,
                                                          TransactionId xid,
-                                                         bool skipped_xact
+                                                         bool skipped_xact,
+                                                         bool last_write

In this approach, I don't think we need the additional parameter
last_write. Let's do the work related to keepalives without a
parameter - do you see any problem with that?

Also, let's try to evaluate how it impacts the lag functionality for large
transactions.

-- 
With Regards,
Amit Kapila.




RE: Logical replication timeout problem

2022-04-05 Thread wangw.f...@fujitsu.com
On Fri, Apr 1, 2022 at 12:09 AM Amit Kapila  wrote:
> On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira  wrote:
> >
> > On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
> >
> > This is exactly our initial analysis and we have tried a patch on
> > these lines and it has a noticeable overhead. See [1]. Calling this
> > for each change or each skipped change can bring noticeable overhead
> > that is why we decided to call it after a certain threshold (100) of
> > skipped changes. Now, surely as mentioned in my previous reply we can
> > make it generic such that instead of calling this (update_progress
> > function as in the patch) for skipped cases, we call it always. Will
> > that make it better?
> >
> > That's what I have in mind but using a different approach.
> >
> > > The functions CreateInitDecodingContext and CreateDecodingContext receive
> > > the update_progress function as a parameter. These functions are called in
> > > 2 places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT) and
> > > (b) SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> > > WalSndUpdateProgress as a progress function. Case (b) does not have one
> > > because it is not required -- local decoding/communication. There is no
> > > custom update progress routine for each output plugin, which leads me to
> > > the question: couldn't we encapsulate the update progress call into the
> > > callback functions?
> > >
> >
> > Sorry, I don't get your point. What exactly do you mean by this?
> > AFAIS, currently we call this output plugin API in pgoutput functions
> > only, do you intend to get it invoked from a different place?
> >
> > It seems I didn't make myself clear. The callbacks I'm referring to are the
> > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call inside a
> > *_cb_wrapper() function, we have something like:
> >
> > if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
> > NewUpdateProgress(ctx, false);
> >
> > The NewUpdateProgress function would contain logic similar to the
> > update_progress() from the proposed patch. (A different function name here
> > just to avoid confusion.)
> >
> > The output plugin is responsible for setting ctx->progress with the callback
> > flags (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) for which
> > we would like to run NewUpdateProgress.
> >
> 
> This sounds like a conflicting approach to what we currently do.
> Currently, OutputPluginUpdateProgress() is called from the xact
> related pgoutput functions like pgoutput_commit_txn(),
> pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we
> follow what you are saying then for some of the APIs like
> pgoutput_change/_message/_truncate, we need to set the parameter to
> invoke NewUpdateProgress() which will internally call
> OutputPluginUpdateProgress(), and for the remaining APIs, we will call
> it in the corresponding pgoutput_* function. I feel if we want to make it
> more generic than the current patch, it is better to directly call
> what you are referring to here as NewUpdateProgress() in all remaining
> APIs like pgoutput_change/_truncate, etc.
Thanks for your comments.

Following your suggestion, I have improved the patch to make it more generic.
Attached is the new patch.

Regards,
Wang wei


v11-0001-Fix-the-logical-replication-timeout-during-large.patch
Description:  v11-0001-Fix-the-logical-replication-timeout-during-large.patch


Re: API stability

2022-04-05 Thread Kyotaro Horiguchi
me> I'd like to do that. Let me see.

At Tue, 5 Apr 2022 10:29:03 -0400, Robert Haas  wrote in 
> On Tue, Apr 5, 2022 at 10:17 AM Tom Lane  wrote:
> > What I think you need to do is:
> >
> > 1. In the back branches, revert delayChkpt to its previous type and
> > semantics.  Squeeze a separate delayChkptEnd bool in somewhere
> > (you can't change the struct size either ...).
> >
> > 2. In HEAD, rename the field to something like delayChkptFlags,
> > to ensure that any code touching it has to be inspected and updated.
> 
> Well, we can do it that way, I suppose.

The change is easy on HEAD, but would it be better to use uint8 instead of
int for delayChkptFlags?

In the back branches, on gcc/Linux/x86-64,
14's PGPROC is 880 bytes and has gaps:

- 6 bytes after statusFlags
- 4 bytes after syncRepState
- 2 bytes after subxidStatus
- 3 bytes after procArrayGroupMember
- 3 bytes after clogGroupMember
- 3 bytes after fpVXIDLock

It seems that we can place the new variable in the first gap above,
since those two members are not bonded together, or at least are less
tightly bonded than the other candidates.

13's PGPROC is 856 bytes and has a 7 bytes gap after delayChkpt.

Versions earlier than 13 have delayChkpt in PGXACT (12 bytes).  It is
tightly packed and doesn't have room for a new member.  Can we add
the new flag to PGPROC instead of PGXACT?
  
12 and 11's PGPROC is 848 bytes and has gaps:
- 4 bytes after syncRepState
- 3 bytes after procArrayGroupMember
- 3 bytes after clogGroupMember
- 4 bytes after clogGroupMemberPage
- 3 bytes after fpVXIDLock


10's PGPROC is 816 bytes and has gaps:
- 4 bytes after cvWaitLink
- 4 bytes after syncRepState
- 3 bytes after procArrayGroupMember
- 3 bytes after fpVXIDLock

So if we don't want to move any member in PGPROC, we do:

14: after statusFlags.
13: after delayChkpt.
12-10: after syncRepState (and before syncRepLinks).

If we allow some members to shift, the new flag can be placed in a
saner place.

14: after delayChkpt ((uint8)statusFlags moves forward by 1 byte)
13: after delayChkpt (no member moves)
12-10: after subxids ((bool)procArrayGroupMember moves forward by 1 byte)

I continue working on the last direction above.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: zero char is returned as space

2022-04-05 Thread Konstantin Izmailov
Tom,
thank you very much! It makes sense now.

K

On Tue, Apr 5, 2022 at 10:08 PM Tom Lane  wrote:

> Konstantin Izmailov  writes:
> > could you help me understand if this is an intended behaviour, or I'm
> > incorrectly querying a "char" field?
>
> We do not support '\0' as an element of a string value.  You didn't
> show how you're trying to insert this value, but I suspect that
> Postgres saw it as an empty string which it then space-padded to
> length 1 because that's what char(1) does.
>
> Don't use a string field to store an integer.  What with the need
> for a length header, you wouldn't be saving any space compared to
> "smallint" even if there weren't any semantic issues.
>
> regards, tom lane
>


Perform streaming logical transactions by background workers and parallel apply

2022-04-05 Thread Amit Kapila
In this email, I would like to discuss allowing streaming logical
transactions (large in-progress transactions) by background workers
and parallel apply in general. The goal of this work is to improve the
performance of the apply work in logical replication.

Currently, for large transactions, the publisher sends the data in
multiple streams (changes divided into chunks depending upon
logical_decoding_work_mem), and then on the subscriber side, the apply
worker writes the changes into temporary files and, once it receives
the commit, reads from the file and applies the entire transaction. To
improve the performance of such transactions, we can instead allow
them to be applied via background workers. There could be multiple
ways to achieve this:

Approach-1: Assign a new bgworker (if available) as soon as the xact's
first stream arrives, and have the main apply worker send changes to this
new worker via shared memory. We keep this worker assigned till the
transaction commit arrives and also wait for the worker to finish at
commit. This preserves commit ordering and avoids writing to and
reading from a file in most cases. We still need to spill if there is no
worker available. We also need to have the background worker finish its
stream_stop before proceeding, to avoid deadlocks, because T-1's
current stream of changes can update rows in a conflicting order with
T-2's next stream of changes.

Approach-2: Assign another worker to spill the changes and only allow
them to be applied at commit time by the same or another worker. Now, to
preserve the commit order, we need to wait at commit so that the
assigned workers can finish. This won't avoid spilling to
disk and reading back at commit time, but it can help in receiving and
processing more data than we do currently. I am not sure if this
can win over Approach-1 because we still need to write to and read from
the file, and we probably need to use a shared memory queue to send the
data to other background workers to process it.

We need to change error handling to allow the above parallelization.
The current model for apply is such that if any error occurs while
applying, we simply report the error in the server logs and the apply
worker exits. On restart, it will again get the transaction
data which previously failed and will try to apply it again. Now,
in the new approach (say Approach-1), we need to ensure that all the
active workers that are applying in-progress transactions also
exit before the main apply worker exits, to allow rollback of currently
applied transactions and re-apply them as we get the data again. This
is required to avoid losing transactions if any later transaction got
committed and updated the replication origin, as in such cases the
earlier transactions won't be resent. This won't be much different
from what we do now, where, say, two transactions, t-1 and t-2, have
multiple streams overlapped. Now, if the error happens before one of
those is completed via commit or rollback, all the data needs to be
resent by the server and processed again by the apply worker.

The next step in this area is to parallelize the apply of all possible
transactions. I think the main things we need to care about to allow
this are:
1. Transaction dependency: We can't simply allow dependent
transactions to run in parallel as that can lead to inconsistency.
Say, if we insert a row in the first transaction and update it in the
second transaction and allow both transactions to apply in parallel,
the insert may occur later and the update will fail.
2. Deadlocks: These can happen because the transactions will now be
applied in parallel. Say transaction T-1 updates row-2 and row-3 and
transaction T-2 updates row-3 and row-2; if we allow them in parallel then
there is a chance of deadlock, whereas there is no such risk in serial
execution where the commit order is preserved.

We can solve both problems if we allow only independent xacts to be
parallelized. The transactions would be considered dependent if they
operate on the same set of rows from the same table. Apart from
this, there could be other cases where determining transaction
dependency won't be straightforward, so we can disallow those
transactions from participating in parallel apply. Those are the cases
where functions are used in the table definition expressions. We can
think of identifying safe functions, like all built-in functions and
any immutable functions (and probably stable functions).  We need to
check safety for cases such as (a) trigger functions, (b) column
default value expressions (as those can call functions), (c)
constraint expressions, (d) foreign keys, (e) operations on
partitioned tables (especially those performed via the
publish_via_partition_root option), as we need to check for expressions
on all partitions.

Transactions that operate on the same set of tables and perform a
truncate can lead to deadlock, so we need to consider such
transactions as dependent.

The basic idea is 

Re: zero char is returned as space

2022-04-05 Thread Tom Lane
Konstantin Izmailov  writes:
> could you help me understand if this is an intended behaviour, or I'm
> incorrectly querying a "char" field?

We do not support '\0' as an element of a string value.  You didn't
show how you're trying to insert this value, but I suspect that
Postgres saw it as an empty string which it then space-padded to
length 1 because that's what char(1) does.

Don't use a string field to store an integer.  What with the need
for a length header, you wouldn't be saving any space compared to
"smallint" even if there weren't any semantic issues.

regards, tom lane




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-04-05 Thread Jeff Davis
On Mon, 2022-04-04 at 09:15 +0530, Bharath Rupireddy wrote:
> My intention is to return the overall undecoded WAL record [5] i.e.
> the data starting from XLogReadRecord's output [6] till length
> XLogRecGetTotalLen(xlogreader);. Please see [7], where Andres agreed
> to have this function, I also mentioned a possible use-case there.

The current patch does not actually do this: it's returning a pointer
into the DecodedXLogRecord struct, which doesn't have the raw bytes of
the WAL record.

Returning the raw bytes of the record is not entirely trivial: it seems
we have to look in the decoded record and find a pointer into either
readBuf or readRecordBuf, depending on whether the record crosses a
boundary or not. If we find a good way to do this I'm fine keeping the
function, but if not, we can leave it for v16.

> pg_get_wal_record_info returns the main data of the WAL record
> (xl_heap_delete, xl_heap_insert, xl_heap_multi_insert, xl_heap_update
> and so on).

We also discussed just removing the main data from the output here.
It's not terribly useful, and could be arbitrarily large. Similar to
how we leave out the backup block data and images.

> As identified in [1], the SQL version of the stats function is way slower in
> normal cases, hence it was agreed (by Andres, Kyotaro-san and myself)
> to have a C function for stats.

Now I agree. We should also have an equivalent of "pg_waldump
--stats=record", though.

> Yes, that's already part of the description column (much like
> pg_waldump does) and the users can still do it with GROUP BY and
> HAVING clauses, see [4].

I still think an extra column for the results of rm_identify() would
make sense. Not critical, but seems useful.

> As mentioned in [1], read_local_xlog_page_no_wait is required because
> the functions can still wait in read_local_xlog_page for WAL while
> finding the first valid record after the given input LSN (the use case is
> simple - just provide an input LSN close to the server's current flush LSN,
> maybe off by 3 or 4 bytes).

Did you look into using XLogReadAhead() rather than XLogReadRecord()?

> It's pretty much clear to the users with till_end_of_wal functions
> instead of cooking many things into the same functions with default
> values for input LSNs as NULL, which also requires the functions to be
> "CALLED ON NULL INPUT" types, which isn't good. This was also suggested
> by Andres, see [2], and I agree with it.

OK, it's a matter of taste I suppose. I don't have a strong opinion.

> > 2. Rename pg_get_wal_record_info -> pg_get_wal_record
> > 
> > 3. Rename pg_get_wal_records_info -> pg_get_wal_records
> 
> As these functions aren't returning the WAL record data, but info
> about it (decoded data), I would like to retain the function names
> as-is.

The name pg_get_wal_records_info bothers me slightly, but I don't have
a better suggestion.


One other thought: functions like pg_logical_emit_message() return an
LSN, but if you feed that into pg_walinspect you get the *next* record.
That makes sense because pg_logical_emit_message() returns the result
of XLogInsertRecord(), which is the end of the last inserted record.
But it can be slightly annoying/confusing. I don't have any particular
suggestion, but maybe it's worth a mention in the docs or something?

Regards,
Jeff Davis






zero char is returned as space

2022-04-05 Thread Konstantin Izmailov
Hi,
could you help me understand whether this is intended behaviour, or whether I'm
incorrectly querying a "char" field? I have a simple table with a column
declared as:
  c_tinyint char NOT NULL

The column contains tiny integers in the range 0-10. When I query the column
from my app using libpq, values 1-10 are returned correctly as 0x1-0x10. But
the value zero is returned as 0x20 (expected 0x0).
pgAdmin displays the result of the query
SELECT c_tinyint, ascii(c_tinyint) FROM tbl
as shown below:
| c_tinyint   | ascii |
| character(1) | integer |
| |  0 |
| |  1 |
| |  2 |
| |  3 |
| |  4 |

Thank you
K


Re: Logical replication row filtering and TOAST

2022-04-05 Thread Amit Kapila
On Wed, Apr 6, 2022 at 7:21 AM Ajin Cherian  wrote:
>
> On Tue, Apr 5, 2022 at 8:32 PM Amit Kapila  wrote:
>
> >
> > How about something like the attached?
> >
>
> LGTM.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2022-04-05 Thread Amit Kapila
On Wed, Apr 6, 2022 at 9:25 AM Masahiko Sawada  wrote:
>
> On Wed, Apr 6, 2022 at 12:21 PM Noah Misch  wrote:
>
> Right. I've attached an updated patch.
>

Thanks, this looks good to me as well. Noah, would you like to commit it?

-- 
With Regards,
Amit Kapila.




Re: Skip partition tuple routing with constant partition key

2022-04-05 Thread Tom Lane
Amit Langote  writes:
> On Sun, Apr 3, 2022 at 10:31 PM Greg Stark  wrote:
>> Is this a problem with the patch or its tests?
>> [18:14:20.798] Test Summary Report
>> [18:14:20.798] ---
>> [18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)

> Hmm, make check-world passes for me after rebasing the patch (v10) to
> the latest HEAD (clean), nor do I see a failure on cfbot:
> http://cfbot.cputube.org/amit-langote.html

013_partition.pl has been failing regularly in the buildfarm,
most recently here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-31%2000%3A49%3A45

I don't think there's room to blame any uncommitted patches
for that.  Somebody broke it a short time before here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19

regards, tom lane




Re: Handle infinite recursion in logical replication setup

2022-04-05 Thread Amit Kapila
On Tue, Apr 5, 2022 at 7:06 PM Ashutosh Bapat
 wrote:
>
> On Tue, Apr 5, 2022 at 6:21 AM Peter Smith  wrote:
> >
> > Below are some other name ideas. Maybe they are not improvements, but
> > it might help other people to come up with something better.
> >
> > subscribe_publocal_only = true/false
> > origin = pub_only/any
> > adjacent_data_only = true/false
> > direct_subscriptions_only = true/false
> > ...
>
> FWIW, The subscriber wants "changes originated on publisher". From
> that angle origin = publisher/any looks attractive. It also leaves
> open the possibility that the subscriber may ask changes from a set of
> origins or even non-publisher origins.
>

So, how are you imagining extending it for multiple origins? I think
we can have multiple origins' data on a node when there are multiple
subscriptions pointing to different nodes or the same node. The origin names
are internally generated names but are present in
pg_replication_origin. So, are we going to ask the user to check that
table and specify it as an option? But note that the user may not be
able to immediately recognize which origin's data is useful for her.
Surely, with some help or advice in the docs, she may be able to identify
it, but that doesn't seem so straightforward. The other point is that
such an option won't be applicable for the initial sync, as we can't
differentiate between data from a local or remote node.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2022-04-05 Thread Masahiko Sawada
On Wed, Apr 6, 2022 at 12:21 PM Noah Misch  wrote:
>
> On Tue, Apr 05, 2022 at 04:41:28PM +0900, Masahiko Sawada wrote:
> > On Tue, Apr 5, 2022 at 4:08 PM Noah Misch  wrote:
> > > On Tue, Apr 05, 2022 at 03:05:10PM +0900, Masahiko Sawada wrote:
> > > > I've attached an updated patch. The patch includes a regression test
> > > > to detect the new violation as we discussed. I've confirmed that
> > > > Cirrus CI tests pass. Please confirm on AIX and review the patch.
> > >
> > > When the context of a "git grep skiplsn" match involves several struct fields
> > > in struct order, please change to the new order.  In other words, do for all
> > > "git grep skiplsn" matches what the v2 patch does in GetSubscription().  The
> > > v2 patch does not do this for catalogs.sgml, but it ought to.  I didn't check
> > > all the other "git grep" matches; please do so.
> >
> > Oops, I missed many places. I checked all "git grep" matches and fixed them.
>
> > --- a/src/backend/catalog/system_views.sql
> > +++ b/src/backend/catalog/system_views.sql
> > @@ -1285,8 +1285,8 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
> >
> >  -- All columns of pg_subscription except subconninfo are publicly readable.
> >  REVOKE ALL ON pg_subscription FROM public;
> > -GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
> > -  substream, subtwophasestate, subdisableonerr, subskiplsn, subslotname,
> > +GRANT SELECT (oid, subdbid, subname, subskiplsn, subowner, subenabled,
> > +  subbinary, substream, subtwophasestate, subdisableonerr, subslotname,
> >subsynccommit, subpublications)
>
> subskiplsn comes before subname.

Right. I've attached an updated patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


make_subskiplsn_aligned_v4.patch
Description: Binary data


Re: Skip partition tuple routing with constant partition key

2022-04-05 Thread Amit Langote
On Sun, Apr 3, 2022 at 10:31 PM Greg Stark  wrote:
> Is this a problem with the patch or its tests?
>
> [18:14:20.798] # poll_query_until timed out executing this query:
> [18:14:20.798] # SELECT count(1) = 0 FROM pg_subscription_rel WHERE
> srsubstate NOT IN ('r', 's');
> [18:14:20.798] # expecting this output:
> [18:14:20.798] # t
> [18:14:20.798] # last actual query output:
> [18:14:20.798] # f
> [18:14:20.798] # with stderr:
> [18:14:20.798] # Tests were run but no plan was declared and
> done_testing() was not seen.
> [18:14:20.798] # Looks like your test exited with 60 just after 31.
> [18:14:20.798] [18:12:21] t/013_partition.pl .
> [18:14:20.798] Dubious, test returned 60 (wstat 15360, 0x3c00)
> ...
> [18:14:20.798] Test Summary Report
> [18:14:20.798] ---
> [18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)
> [18:14:20.798] Non-zero exit status: 60
> [18:14:20.798] Parse errors: No plan found in TAP output
> [18:14:20.798] Files=32, Tests=328, 527 wallclock secs ( 0.16 usr 0.09
> sys + 99.81 cusr 87.08 csys = 187.14 CPU)
> [18:14:20.798] Result: FAIL

Hmm, make check-world passes for me after rebasing the patch (v10) to
the latest HEAD (clean), nor do I see a failure on cfbot:

http://cfbot.cputube.org/amit-langote.html

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: shared-memory based stats collector - v69

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 8:14 PM Andres Freund  wrote:

>
> On 2022-04-05 20:00:50 -0700, David G. Johnston wrote:
>
>  * Statistics are loaded from the filesystem during startup (by the startup
>  * process), unless preceded by a crash, in which case all stats are
>  * discarded. They are written out by the checkpointer process just before
>  * shutting down, except when shutting down in immediate mode.
>
>
Cool.  I was on the fence about the level of detail here, but mostly
excluded mentioning the checkpointer 'cause I didn't want to research the
correct answer tonight.

>
> >  * Each cumulative statistic producing system must construct a PgStat_Kind
> >  * datum in this file. The details are described elsewhere, but of
> >  * particular importance is that each kind is classified as having either a
> >  * fixed number of objects that it tracks, or a variable number.
> >  *
> >  * During normal operations, the different consumers of the API will have
> >  * their accesses managed by the API; the protocol used is determined based
> >  * upon whether the statistical kind is fixed-numbered or variable-numbered.
> >  * Readers of variable-numbered statistics will have the option to locally
> >  * cache the data, while writers may have their updates locally queued
> >  * and applied in a batch, thus favoring speed over freshness.
> >  * The fixed-numbered statistics are faster to process and thus forgo
> >  * these mechanisms in favor of a light-weight lock.
>
> This feels a bit jumbled.


I had that inkling as well.  First draft and I needed to stop at some
point.  It didn't seem bad or wrong at least.

> Of course something using an API will be managed by
> the API. I don't know what "protocol" really means?
>
>
Procedure, process, and algorithm are synonyms.  Procedure probably makes more
sense here since it is a procedural language we are using.  I thought of
algorithm while writing this, but it carried too much technical baggage for
me (compression, encryption, etc.) to fit in well here.

>
> >  * Additionally, whether due to unclean shutdown or user request,
> >  * statistics can be reset - meaning that their stored numeric values are
> >  * returned to zero, and any non-numeric data that may be tracked (say a
> >  * timestamp) is cleared.
>
> I think this is basically covered in the above?
>
>
Yes and no.  The first paragraph says they are forced to be reset due to a
system error.  This paragraph basically says that resetting this kind of
statistic is an acceptable, and even expected, thing to do, and in fact
can also be done intentionally and not only due to system error.  I am
pondering whether to mention this dynamic first and/or blend it in better -
but the minor repetition in the different contexts seems ok.

David J.


Re: SQL/JSON: JSON_TABLE

2022-04-05 Thread Tom Lane
Andres Freund  writes:
> The new jsonb_sqljson test is, on my machine, the slowest test in the main
> regression tests:
> ...
> Any chance the slowness isn't required slowness?

In general, there's been a serious bump in the buildfarm cycle
time in the last month --- for example, on my admittedly slow
animal prairiedog, the cycle time excluding the "make" phase
(which is really variable because ccache) has gone from about
4:26 a month ago to 5:25 in its last run.

I don't want to worry about this before feature freeze, but after that
we should take a hard look at cutting out unnecessary test cycles.

regards, tom lane




Re: shared-memory based stats collector - v70

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-05 23:11:07 -0400, Greg Stark wrote:
> I've never tried to review a 24-patch series before. It's kind of
> intimidating... Is there a good place to start to get a good idea of
> the most important changes?

It was more at some point :). And believe me, I find this whole project
intimidating and exhausting. The stats collector is entangled in a lot of
places, and there was a lot of preparatory work to get to this point.

Most of the commits aren't really interesting; I broke them out to make the
"main commit" a bit smaller, because it's exhausting to look at a *huge*
single commit. I wish I could have broken it down more, but I didn't find a
good way.

The interesting commit is
v70-0010-pgstat-store-statistics-in-shared-memory.patch
which actually replaces the stats collector by storing stats in shared
memory. It contains a, now hopefully decent, overview of how things work at
the top of pgstat.c.

Greetings,

Andres Freund




Re: Skipping logical replication transactions on subscriber side

2022-04-05 Thread Noah Misch
On Tue, Apr 05, 2022 at 04:41:28PM +0900, Masahiko Sawada wrote:
> On Tue, Apr 5, 2022 at 4:08 PM Noah Misch  wrote:
> > On Tue, Apr 05, 2022 at 03:05:10PM +0900, Masahiko Sawada wrote:
> > > I've attached an updated patch. The patch includes a regression test
> > > to detect the new violation as we discussed. I've confirmed that
> > > Cirrus CI tests pass. Please confirm on AIX and review the patch.
> >
> > When the context of a "git grep skiplsn" match involves several struct fields
> > in struct order, please change to the new order.  In other words, do for all
> > "git grep skiplsn" matches what the v2 patch does in GetSubscription().  The
> > v2 patch does not do this for catalogs.sgml, but it ought to.  I didn't check
> > all the other "git grep" matches; please do so.
> 
> Oops, I missed many places. I checked all "git grep" matches and fixed them.

> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1285,8 +1285,8 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
>  
>  -- All columns of pg_subscription except subconninfo are publicly readable.
>  REVOKE ALL ON pg_subscription FROM public;
> -GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
> -  substream, subtwophasestate, subdisableonerr, subskiplsn, subslotname,
> +GRANT SELECT (oid, subdbid, subname, subskiplsn, subowner, subenabled,
> +  subbinary, substream, subtwophasestate, subdisableonerr, subslotname,
>subsynccommit, subpublications)

subskiplsn comes before subname.  Other than that, this looks done.  I
recommend committing it with that change.




Re: shared-memory based stats collector - v70

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 8:11 PM Greg Stark  wrote:

> I've never tried to review a 24-patch series before. It's kind of
> intimidating... Is there a good place to start to get a good idea of
> the most important changes?
>

It isn't as bad as the number makes it sound - I just used "git am" to
apply the patches to a branch and skimmed each commit separately.  Most of
them are tests or other minor pieces.  The remaining few cover different
aspects of the major commit and you can choose them based upon your
experience and time.

David J.


Re: shared-memory based stats collector - v69

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-05 20:00:50 -0700, David G. Johnston wrote:
> On Tue, Apr 5, 2022 at 4:16 PM Andres Freund  wrote:
> > On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> > > On Tue, Apr 5, 2022 at 2:23 PM Andres Freund  wrote:
> > > > I guess I should add a paragraph about snapshots / fetch consistency.
> > > >
> > >
> > > I apparently confused/combined the two concepts just now so that would
> > help.
> >
> > Will add.

I at least tried...


> On a slightly different track, I took the time to write-up a "Purpose"
> section for pgstat.c :
> 
> It may possibly be duplicating some things written elsewhere as I didn't go
> looking for similar prior art yet, I just wanted to get thoughts down.

There's very very little prior documentation in this area.


> This is the kind of preliminary framing I've been constructing in my own
> mind as I try to absorb this patch.  I haven't formed an opinion whether
> the actual user-facing documentation should cover some or all of this
> instead of the preamble to pgstat.c (which could just point to the docs for
> prerequisite reading).

>  * The PgStat namespace defines an API that facilitates concurrent access
>  * to a shared memory region where cumulative statistical data is saved.
>  * At shutdown, one of the running system workers will initiate the writing
>  * of the data to file. Then, during startup (following a clean shutdown) the
>  * Postmaster process will early on ensure that the file is loaded into memory.

I added something roughly along those lines in the version I just sent, based
on a suggestion by Melanie over IM:

 * Statistics are loaded from the filesystem during startup (by the startup
 * process), unless preceded by a crash, in which case all stats are
 * discarded. They are written out by the checkpointer process just before
 * shutting down, except when shutting down in immediate mode.



>  * Each cumulative statistic producing system must construct a PgStat_Kind
>  * datum in this file. The details are described elsewhere, but of
>  * particular importance is that each kind is classified as having either a
>  * fixed number of objects that it tracks, or a variable number.
>  *
>  * During normal operations, the different consumers of the API will have their
>  * access managed by the API; the protocol used is determined based upon whether
>  * the statistical kind is fixed-numbered or variable-numbered.
>  * Readers of variable-numbered statistics will have the option to locally
>  * cache the data, while writers may have their updates locally queued
>  * and applied in a batch. Thus favoring speed over freshness.
>  * The fixed-numbered statistics are faster to process and thus forgo
>  * these mechanisms in favor of a light-weight lock.

This feels a bit jumbled. Of course something using an API will be managed by
the API. I don't know what protocol reallly means?


> Additionally, both due to unclean shutdown or user request,
>  * statistics can be reset - meaning that their stored numeric values are
>  * returned to zero, and any non-numeric data that may be tracked (say a
>  * timestamp) is cleared.

I think this is basically covered in the above?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-05 Thread Greg Stark
I've never tried to review a 24-patch series before. It's kind of
intimidating... Is there a good place to start to get a good idea of
the most important changes?




Re: shared-memory based stats collector - v70

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund  wrote:

>
> Here comes v70:
>
> I think this is basically ready, minus a few comment adjustments here and
> there. Unless somebody protests I'm planning to start pushing things
> tomorrow morning.
>
>
Nothing I've come across, given my area of experience, gives me pause.  I'm
mostly going to focus on docs and comments at this point - to try and help
the next person in my position (and end-users) have an easier go at
on-boarding.  Toward that end, I did just add a "Purpose" section writeup
to the v69 thread.

David J.


Re: shared-memory based stats collector - v69

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 4:16 PM Andres Freund  wrote:

> On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> > On Tue, Apr 5, 2022 at 2:23 PM Andres Freund  wrote:
>
> >
> > > I guess I should add a paragraph about snapshots / fetch consistency.
> > >
> >
> > I apparently confused/combined the two concepts just now so that would
> help.
>
> Will add.
>
>
Thank you.

On a slightly different track, I took the time to write-up a "Purpose"
section for pgstat.c :

It may possibly be duplicating some things written elsewhere as I didn't go
looking for similar prior art yet, I just wanted to get thoughts down.
This is the kind of preliminary framing I've been constructing in my own
mind as I try to absorb this patch.  I haven't formed an opinion whether
the actual user-facing documentation should cover some or all of this
instead of the preamble to pgstat.c (which could just point to the docs for
prerequisite reading).

David J.

 * Purpose:

 * The PgStat namespace defines an API that facilitates concurrent access
 * to a shared memory region where cumulative statistical data is saved.
 * At shutdown, one of the running system workers will initiate the writing
 * of the data to file. Then, during startup (following a clean shutdown) the
 * Postmaster process will early on ensure that the file is loaded into memory.
 *
 * Each cumulative statistic producing system must construct a PgStat_Kind
 * datum in this file. The details are described elsewhere, but of
 * particular importance is that each kind is classified as having either a
 * fixed number of objects that it tracks, or a variable number.
 *
 * During normal operations, the different consumers of the API will have their
 * access managed by the API; the protocol used is determined based upon whether
 * the statistical kind is fixed-numbered or variable-numbered.
 * Readers of variable-numbered statistics will have the option to locally
 * cache the data, while writers may have their updates locally queued
 * and applied in a batch. Thus favoring speed over freshness.
 * The fixed-numbered statistics are faster to process and thus forgo
 * these mechanisms in favor of a light-weight lock.
 *
 * Cumulative in this context means that processes must, for numeric data, send
 * a delta (or change) value via the API which will then be added to the
 * stored value in memory. The system does not track individual changes, only
 * their net effect. Additionally, both due to unclean shutdown or user request,
 * statistics can be reset - meaning that their stored numeric values are
 * returned to zero, and any non-numeric data that may be tracked (say a
 * timestamp) is cleared.


Re: [PATCH] pg_stat_toast

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-06 00:08:13 +0200, Gunnar "Nick" Bluth wrote:
> AFAICT, Andres' work is more about the structure (e.g.
> 13619598f1080d7923454634a2570ca1bc0f2fec). Or I've missed something...?

That was just the prep work... I'm about to send a slightly further polished
version, but here's the patchset from yesterday:
https://www.postgresql.org/message-id/20220405030506.lfdhbu5zf4tzdpux%40alap3.anarazel.de

> The attached v11 incorporates the latest changes in the area, btw.
> 
> 
> Anyway, my (undisputed up to now!) understanding still is that only
> backends _looking_ at these stats (so, e.g., accessing the pg_stat_toast
> view) actually read the data. So, the 10-15% more space used for pg_stat
> only affect the stats collector and _some few_ backends.

It's not so simple. The stats collector constantly writes these stats out to
disk. And disk bandwidth / space is of course a shared resource.


Greetings,

Andres Freund




Re: SQL/JSON: JSON_TABLE

2022-04-05 Thread Andres Freund
Hi,

On 2022-03-27 16:53:57 -0400, Andrew Dunstan wrote:
> I'm therefore going to commit this series

The new jsonb_sqljson test is, on my machine, the slowest test in the main
regression tests:

4639 ms jsonb_sqljson
2401 ms btree_index
2166 ms stats_ext
2027 ms alter_table
1616 ms triggers
1498 ms brin
1489 ms join_hash
1367 ms foreign_key
1345 ms tuplesort
1202 ms plpgsql

Any chance the slowness isn't required slowness?

Greetings,

Andres Freund




Re: psql - add SHOW_ALL_RESULTS option - pg_regress output

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-04 23:32:50 +0200, Peter Eisentraut wrote:
> This has been committed.

It's somewhat annoying that this made pg_regress even more verbose than before:

== removing existing temp instance==
== creating temporary instance==
== initializing database system   ==
== starting postmaster==
running on port 51696 with PID 2203449
== creating database "regression" ==
CREATE DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
ALTER DATABASE
== running regression test queries==

Unfortunately it appears that neither can CREATE DATABASE set GUCs, nor can
ALTER DATABASE set multiple GUCs in one statement.

Perhaps we can just set SHOW_ALL_RESULTS off for that psql command?
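One sketch of that (the exact pg_regress plumbing would differ; the ALTER shown is just one of the setup statements):

```sql
-- psql input for the "creating database" phase, with per-statement
-- result output suppressed:
\set SHOW_ALL_RESULTS off
ALTER DATABASE regression SET lc_messages TO 'C';
```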

- Andres




Re: Logical replication row filtering and TOAST

2022-04-05 Thread Ajin Cherian
On Tue, Apr 5, 2022 at 8:32 PM Amit Kapila  wrote:

>
> How about something like the attached?
>

LGTM.

regards,
Ajin Cherian
Fujitsu Australia




Re: API stability

2022-04-05 Thread Kyotaro Horiguchi
At Tue, 5 Apr 2022 10:01:56 -0400, Robert Haas  wrote in 
> I think as the person who committed that patch I'm on the hook to fix
> this if nobody else would like to do it, but let me ask whether
> Kyotaro Horiguchi would like to propose a patch, since the original
> patch did, and/or whether you would like to propose a patch, as the
> person reporting the issue.

I'd like to do that. Let me see.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: How to generate a WAL record spanning multiple WAL files?

2022-04-05 Thread Andy Fan
On Wed, Apr 6, 2022 at 12:41 AM Robert Haas  wrote:

> On Tue, Apr 5, 2022 at 10:10 AM Andy Fan  wrote:
> >> > I wanted to have a WAL record spanning multiple WAL files of size, say
> >> > 16MB. I'm wondering if the Full Page Images (FPIs) of a TOAST table
> >> > would help here. Please let me know if there's any way to generate
> >> > such large WAL records.
> >>
> >> It's easier to use pg_logical_emit_message().
> >
> > Not sure I understand the question correctly here.  What if I use the
> below code
> > where the len might be very large?  like 64MB.
> >
> >  XLogBeginInsert();
> > XLogRegisterData((char *)_append, sizeof(xl_cstore_append));
> > XLogRegisterData((char *)data, len);
> >
> > XLogInsert(..);
>
> Well, that's how to do it from C. And pg_logical_emit_message() is how
> to do it from SQL.
>
>
OK, Thanks for your confirmation!
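For the archives, the SQL route can be sketched as follows (requires wal_level = logical; the prefix and size are arbitrary examples):

```sql
-- One ~64MB message payload in a single WAL record, which therefore
-- spans multiple 16MB WAL segment files.
SELECT pg_logical_emit_message(false,          -- non-transactional
                               'test-prefix',  -- arbitrary prefix
                               repeat('x', 64 * 1024 * 1024));
```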


-- 
Best Regards
Andy Fan


Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-05 Thread Thom Brown
On Wed, 6 Apr 2022 at 01:42, Thom Brown  wrote:
>
> On Tue, 5 Apr 2022 at 16:02, Robert Haas  wrote:
> >
> > On Tue, Apr 5, 2022 at 10:58 AM Magnus Hagander  wrote:
> > >> Makes sense. I will do this soon if nobody objects.
> > >>
> > >> I'm mildly uncomfortable with the phrase "WAL records generated over
> > >> the delay period" because it seems a bit imprecise, but I'm not sure
> > >> what would be better and I think the meaning is clear.
> > >
> > > Maybe "during" instead of "over"? But I'm not sure that's the part you're 
> > > referring to?
> >
> > Yeah, something like that, maybe.
>
> I share your discomfort with the wording.  How about:
>
> WAL records must be kept on standby until they are ready to be applied.
> Therefore, longer delays will result in a greater accumulation of WAL files,
> increasing disk space requirements for the standby's pg_wal
> directory.

*must be kept on the standby
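As an aside, the accumulation described there can be observed on the standby with something like this (a sketch; pg_ls_waldir() needs suitable privileges):

```sql
-- Total size of WAL currently retained in the standby's pg_wal directory
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```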

-- 
Thom




Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2022-04-05 Thread Tatsuo Ishii
>> I would suggest to reorder the last chunk to:
>> 
>>... retried retries failures serfail dlfail
>> 
>> because I intend to add connection failures handling at some point,
>> and it would make more sense to add the corresponding count at the end
>> with other fails.
> 
> Ok, I have adjusted the patch. V2 patch attached.

Patch pushed. Thanks.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Race condition in server-crash testing

2022-04-05 Thread Tom Lane
Andres Freund  writes:
> On 2022-04-04 00:50:27 -0400, Tom Lane wrote:
>> It's hard to be totally sure, but I think what happened is that
>> gaur hit the in-hindsight-obvious race condition in this code:
>> we managed to execute a successful iteration of poll_query_until
>> before the postmaster had noticed its dead child and commenced
>> the restart.  The test lines after these are not prepared to see
>> failure-to-connect.
>> It's not obvious to me how to remove this race condition.
>> Thoughts?

> Maybe we can use pump_until() with the psql that's not getting killed? With a
> non-matching regex? That'd only return once the backend was killed by
> postmaster, afaics?

Good idea.  What I actually did was to borrow the recently-fixed code
in 013_crash_restart.pl that checks for psql's "connection lost"
report.

regards, tom lane




Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-05 Thread Thom Brown
On Tue, 5 Apr 2022 at 16:02, Robert Haas  wrote:
>
> On Tue, Apr 5, 2022 at 10:58 AM Magnus Hagander  wrote:
> >> Makes sense. I will do this soon if nobody objects.
> >>
> >> I'm mildly uncomfortable with the phrase "WAL records generated over
> >> the delay period" because it seems a bit imprecise, but I'm not sure
> >> what would be better and I think the meaning is clear.
> >
> > Maybe "during" instead of "over"? But I'm not sure that's the part you're 
> > referring to?
>
> Yeah, something like that, maybe.

I share your discomfort with the wording.  How about:

WAL records must be kept on standby until they are ready to be applied.
Therefore, longer delays will result in a greater accumulation of WAL files,
increasing disk space requirements for the standby's pg_wal
directory.
-- 
Thom




Re: shared-memory based stats collector - v69

2022-04-05 Thread Andres Freund
On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> On Tue, Apr 5, 2022 at 2:23 PM Andres Freund  wrote:
> 
> >
> > On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
> >
> > >, but rather add to the shared queue
> >
> > Queue? Maybe you mean the hashtable?
> >
> 
> Queue implemented by a list...?  Anyway, I think I mean this:

> /*
>  * List of PgStat_EntryRefs with unflushed pending stats.
>  *
>  * Newly pending entries should only ever be added to the end of the list,
>  * otherwise pgstat_flush_pending_entries() might not see them immediately.
>  */
> static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);

That's not in shared memory, but backend local...


> > >, and so the cache is effectively read-only.  It is also
> > transaction-scoped
> > >based upon the GUC and the nature of stats vis-a-vis transactions.
> >
> > No, that's not right. I think you might be thinking of
> > pgStatLocal.snapshot.stats?
> >
> >
> Probably...
> 
> 
> > I guess I should add a paragraph about snapshots / fetch consistency.
> >
> 
> I apparently confused/combined the two concepts just now so that would help.

Will add.


> >
> > > Even before I added the read-only and transaction-scoped I got a bit hung
> > > up on reading:
> > > "The shared hashtable only needs to be accessed when no prior reference
> > to
> > > the shared hashtable exists."
> >
> > > Thinking in terms of key seems to make more sense than value in this
> > > sentence - even if there is a one-to-one correspondence.
> >
> > Maybe "prior reference to the shared hashtable exists for the key"?
> >
> 
> I specifically dislike having two mentions of the "shared hashtable" in the
> same sentence, so I tried to phrase the second half in terms of the local
> hashtable.

You left two mentions of "shared hashtable" in the sentence prior though
:). I'll try to rephrase. But it's not the end of the world if this isn't the most elegant
prose...


> > Was that comma after e.g. intentional?
> >
> 
> It is.  That is the style I was taught, and that we seem to adhere to in
> user-facing documentation.  Source code is a mixed bag with no enforcement,
> but while we are here...

Looks a bit odd to me. But I guess I'll add it then...


> > >   *
> > > @@ -19,19 +19,21 @@
> > >   *
> > >   * All variable-numbered stats are addressed by PgStat_HashKey while
> > running.
> > >   * It is not possible to have statistics for an object that cannot be
> > > - * addressed that way at runtime. A wider identifier can be used when
> > > + * addressed that way at runtime. A alternate identifier can be used
> > when
> > >   * serializing to disk (used for replication slot stats).
> >
> > Not sure this improves things.
> >
> >
> It just seems odd that width is being mentioned when the actual struct is a
> combination of three subcomponents.  I do feel I'd need to understand
> exactly what replication slot stats are doing uniquely here, though, to
> make any point beyond that.

There's no real numeric identifier for replication slot stats. So I'm using
the "index" used in slot.c while running. But that can change during
start/stop.

Greetings,

Andres Freund




Re: Atomic rename feature for Windows.

2022-04-05 Thread Victor Spirin

Hi

Updated patch: we use the posix semantic features in Windows build 17763 
and up.
We found an issue with this feature on Windows Server 2016 without 
updates (Windows 1607 Build 14393)


Victor Spirin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

05.07.2021 16:53, Victor Spirin wrote:

Hi

I used the SetFileInformationByHandle function with the 
FILE_RENAME_FLAG_POSIX_SEMANTICS flag for the file rename function.


1) The _WIN32_WINNT variable needs to be increased to 0x0A00 (Windows 
10).  Fixed conflict with #undef CHECKSUM_TYPE_NONE


2) The SetFileInformationByHandle function works correctly only on 
Windows 10 and higher.


The app must have a manifest to check the Windows version using the 
IsWindows10OrGreater() function. I added a manifest to all postgres 
projects and disabled the GenerateManifest option on windows projects.


This patch related to this post: 
https://www.postgresql.org/message-id/CAEepm%3D0FV-k%2B%3Dd9z08cW%3DZXoR1%3Dkw9wdpkP6WAuOrKJdz-8ujg%40mail.gmail.com


diff --git a/src/include/common/checksum_helper.h 
b/src/include/common/checksum_helper.h
index 23816cac21..59e819f9ca 100644
--- a/src/include/common/checksum_helper.h
+++ b/src/include/common/checksum_helper.h
@@ -26,6 +26,13 @@
  * MD5 here because any new that does need a cryptographically strong checksum
  * should use something better.
  */
+
+ /*
+ * CHECKSUM_TYPE_NONE is defined in winioctl.h when _WIN32_WINNT >= _WIN32_WINNT_WIN10
+ */
+#ifdef CHECKSUM_TYPE_NONE
+#undef CHECKSUM_TYPE_NONE
+#endif
 typedef enum pg_checksum_type
 {
CHECKSUM_TYPE_NONE,
diff --git a/src/include/port/win32.h b/src/include/port/win32.h
index c6213c77c3..c48a6ce3a1 100644
--- a/src/include/port/win32.h
+++ b/src/include/port/win32.h
@@ -12,12 +12,13 @@
 /*
  * Make sure _WIN32_WINNT has the minimum required value.
  * Leave a higher value in place. When building with at least Visual
- * Studio 2015 the minimum requirement is Windows Vista (0x0600) to
- * get support for GetLocaleInfoEx() with locales. For everything else
+ * Studio 2015 the minimum requirement is Windows 10 (0x0A00) to get support
+ * for SetFileInformationByHandle.
+ * The minimum requirement is Windows Vista (0x0600) to get support for
+ * GetLocaleInfoEx() with locales.
+ * For everything else
  * the minimum version is Windows XP (0x0501).
  */
 #if defined(_MSC_VER) && _MSC_VER >= 1900
-#define MIN_WINNT 0x0600
+#define MIN_WINNT 0x0A00
 #else
 #define MIN_WINNT 0x0501
 #endif
diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 7ce042e75d..199b285a8f 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -39,6 +39,200 @@
 #endif
 #endif
 
+
+#if defined(WIN32) && !defined(__CYGWIN__) && defined(_WIN32_WINNT_WIN10) && _WIN32_WINNT >= _WIN32_WINNT_WIN10
+
+#include 
+
+/*
+ * Checks Windows version using RtlGetVersion
+ * Version 1809 (Build 17763) is required for SetFileInformationByHandle
+ * function with FILE_RENAME_FLAG_POSIX_SEMANTICS flag
+*/
+typedef NTSYSAPI(NTAPI* PFN_RTLGETVERSION)
+(OUT PRTL_OSVERSIONINFOEXW lpVersionInformation);
+
+static int windowsPosixSemanticsUsable = -1;
+
+static bool
+is_windows_posix_semantics_usable()
+{
+   HMODULE ntdll;
+
+   PFN_RTLGETVERSION _RtlGetVersion = NULL;
+   OSVERSIONINFOEXW info;
+
+   if (windowsPosixSemanticsUsable >= 0)
+   return (windowsPosixSemanticsUsable > 0);
+
+   ntdll = LoadLibraryEx("ntdll.dll", NULL, 0);
+   if (ntdll == NULL)
+   {
+   DWORD   err = GetLastError();
+
+   _dosmaperr(err);
+   return false;
+   }
+
+   _RtlGetVersion = (PFN_RTLGETVERSION)(pg_funcptr_t)
+   GetProcAddress(ntdll, "RtlGetVersion");
+   if (_RtlGetVersion == NULL)
+   {
+   DWORD   err = GetLastError();
+
+   FreeLibrary(ntdll);
+   _dosmaperr(err);
+   return false;
+   }
+   info.dwOSVersionInfoSize = sizeof(OSVERSIONINFOEXW);
+   if (!NT_SUCCESS(_RtlGetVersion((PRTL_OSVERSIONINFOEXW) &info)))
+   {
+   DWORD   err = GetLastError();
+
+   FreeLibrary(ntdll);
+   _dosmaperr(err);
+   return false;
+   }
+
+   if (info.dwMajorVersion >= 10 && info.dwBuildNumber >= 17763)
+   windowsPosixSemanticsUsable = 1;
+   else
+   windowsPosixSemanticsUsable = 0;
+   FreeLibrary(ntdll);
+
+   return (windowsPosixSemanticsUsable > 0);
+}
+
+typedef struct _FILE_RENAME_INFO_EXT {
+   FILE_RENAME_INFO fri;
+   WCHAR extra_space[MAX_PATH];
+} FILE_RENAME_INFO_EXT;
+
+/*
+ * pgrename_windows_posix_semantics  - uses SetFileInformationByHandle function
+ * with FILE_RENAME_FLAG_POSIX_SEMANTICS flag for atomic rename file
+ * working only on Windows 10 (1809) or later; _WIN32_WINNT must be >= _WIN32_WINNT_WIN10
+ */
+static int
+pgrename_windows_posix_semantics(const char *from, const char *to)
+{
+   int err;
+

Re: MERGE bug report

2022-04-05 Thread Zhihong Yu
On Tue, Apr 5, 2022 at 3:35 PM Zhihong Yu  wrote:

>
>
> On Tue, Apr 5, 2022 at 3:18 PM Joe Wildish  wrote:
>
>> Hello Hackers,
>>
>> Reporting a bug with the new MERGE statement. Tested against
>> 75edb919613ee835e7680e40137e494c7856bcf9.
>>
>> psql output as follows:
>>
>> ...
>> psql:merge.sql:33: ERROR:  variable not found in subplan target lists
>> ROLLBACK
>> [local] joe@joe=# \errverbose
>> ERROR:  XX000: variable not found in subplan target lists
>> LOCATION:  fix_join_expr_mutator, setrefs.c:2800
>>
>> Stack trace:
>>
>> fix_join_expr_mutator setrefs.c:2800
>> expression_tree_mutator nodeFuncs.c:3348
>> fix_join_expr_mutator setrefs.c:2853
>> expression_tree_mutator nodeFuncs.c:2992
>> fix_join_expr_mutator setrefs.c:2853
>> expression_tree_mutator nodeFuncs.c:3348
>> fix_join_expr_mutator setrefs.c:2853
>> fix_join_expr setrefs.c:2753
>> set_plan_refs setrefs.c:1085
>> set_plan_references setrefs.c:315
>> standard_planner planner.c:498
>> planner planner.c:277
>> pg_plan_query postgres.c:883
>> pg_plan_queries postgres.c:975
>> exec_simple_query postgres.c:1169
>> PostgresMain postgres.c:4520
>> BackendRun postmaster.c:4593
>> BackendStartup postmaster.c:4321
>> ServerLoop postmaster.c:1801
>> PostmasterMain postmaster.c:1473
>> main main.c:202
>> __libc_start_main 0x7fc4ccc0b1e2
>> _start 0x0048804e
>>
>> Reproducer script:
>>
>> BEGIN;
>> DROP TABLE IF EXISTS item, incoming, source CASCADE;
>>
>> CREATE TABLE item
>>   (order_id    INTEGER NOT NULL,
>>    item_id     INTEGER NOT NULL,
>>    quantity    INTEGER NOT NULL,
>>    price       NUMERIC NOT NULL,
>>    CONSTRAINT pk_item PRIMARY KEY (order_id, item_id));
>>
>> INSERT INTO item VALUES (100, 1, 4, 100.00), (100, 2, 9, 199.00);
>>
>> CREATE TABLE incoming (order_id, item_id, quantity, price)
>>   AS (VALUES (100, 1, 4, 100.00), (100, 3, 1, 200.00));
>>
>> CREATE TABLE source (order_id, item_id, quantity, price) AS
>>   (SELECT order_id, item_id, incoming.quantity, incoming.price
>>  FROM item LEFT JOIN incoming USING (order_id, item_id));
>>
>> MERGE INTO item a
>> USING source b
>>ON (a.order_id, a.item_id) =
>>   (b.order_id, b.item_id)
>>  WHEN NOT MATCHED
>>  THEN INSERT (order_id, item_id, quantity, price)
>>   VALUES (order_id, item_id, quantity, price)
>>  WHEN MATCHED
>>   AND a.* IS DISTINCT FROM b.*
>>  THEN UPDATE SET (quantity, price) = (b.quantity, b.price)
>>  WHEN MATCHED
>>   AND (b.quantity IS NULL AND b.price IS NULL)
>>  THEN DELETE;
>> COMMIT;
>>
>> It seems related to the use of a.* and b.*
>>
>> Sorry I can't be more specific. Error manifests when planning occurs and
>> that is well outside of my code base knowledge.
>>
>> Hope this helps.
>>
>> Cheers,
>> -Joe
>>
> Hi,
> It seems all the calls to fix_join_expr_mutator() are within setrefs.c
>
> I haven't found where in nodeFuncs.c fix_join_expr_mutator is called.
>
> I am on commit 75edb919613ee835e7680e40137e494c7856bcf9 .
>

Pardon - I typed too fast:

The call to fix_join_expr_mutator() is on this line (3348):

resultlist = lappend(resultlist,
 mutator((Node *) lfirst(temp),
 context));


Re: MERGE bug report

2022-04-05 Thread Zhihong Yu
On Tue, Apr 5, 2022 at 3:18 PM Joe Wildish  wrote:

> Hello Hackers,
>
> Reporting a bug with the new MERGE statement. Tested against
> 75edb919613ee835e7680e40137e494c7856bcf9.
>
> psql output as follows:
>
> ...
> psql:merge.sql:33: ERROR:  variable not found in subplan target lists
> ROLLBACK
> [local] joe@joe=# \errverbose
> ERROR:  XX000: variable not found in subplan target lists
> LOCATION:  fix_join_expr_mutator, setrefs.c:2800
>
> Stack trace:
>
> fix_join_expr_mutator setrefs.c:2800
> expression_tree_mutator nodeFuncs.c:3348
> fix_join_expr_mutator setrefs.c:2853
> expression_tree_mutator nodeFuncs.c:2992
> fix_join_expr_mutator setrefs.c:2853
> expression_tree_mutator nodeFuncs.c:3348
> fix_join_expr_mutator setrefs.c:2853
> fix_join_expr setrefs.c:2753
> set_plan_refs setrefs.c:1085
> set_plan_references setrefs.c:315
> standard_planner planner.c:498
> planner planner.c:277
> pg_plan_query postgres.c:883
> pg_plan_queries postgres.c:975
> exec_simple_query postgres.c:1169
> PostgresMain postgres.c:4520
> BackendRun postmaster.c:4593
> BackendStartup postmaster.c:4321
> ServerLoop postmaster.c:1801
> PostmasterMain postmaster.c:1473
> main main.c:202
> __libc_start_main 0x7fc4ccc0b1e2
> _start 0x0048804e
>
> Reproducer script:
>
> BEGIN;
> DROP TABLE IF EXISTS item, incoming, source CASCADE;
>
> CREATE TABLE item
>   (order_id    INTEGER NOT NULL,
>    item_id     INTEGER NOT NULL,
>    quantity    INTEGER NOT NULL,
>    price       NUMERIC NOT NULL,
>    CONSTRAINT pk_item PRIMARY KEY (order_id, item_id));
>
> INSERT INTO item VALUES (100, 1, 4, 100.00), (100, 2, 9, 199.00);
>
> CREATE TABLE incoming (order_id, item_id, quantity, price)
>   AS (VALUES (100, 1, 4, 100.00), (100, 3, 1, 200.00));
>
> CREATE TABLE source (order_id, item_id, quantity, price) AS
>   (SELECT order_id, item_id, incoming.quantity, incoming.price
>  FROM item LEFT JOIN incoming USING (order_id, item_id));
>
> MERGE INTO item a
> USING source b
>ON (a.order_id, a.item_id) =
>   (b.order_id, b.item_id)
>  WHEN NOT MATCHED
>  THEN INSERT (order_id, item_id, quantity, price)
>   VALUES (order_id, item_id, quantity, price)
>  WHEN MATCHED
>   AND a.* IS DISTINCT FROM b.*
>  THEN UPDATE SET (quantity, price) = (b.quantity, b.price)
>  WHEN MATCHED
>   AND (b.quantity IS NULL AND b.price IS NULL)
>  THEN DELETE;
> COMMIT;
>
> It seems related to the use of a.* and b.*
>
> Sorry I can't be more specific. Error manifests when planning occurs and
> that is well outside of my code base knowledge.
>
> Hope this helps.
>
> Cheers,
> -Joe
>
Hi,
It seems all the calls to fix_join_expr_mutator() are within setrefs.c

I haven't found where in nodeFuncs.c fix_join_expr_mutator is called.

I am on commit 75edb919613ee835e7680e40137e494c7856bcf9 .


Re: should vacuum's first heap pass be read-only?

2022-04-05 Thread Peter Geoghegan
On Tue, Apr 5, 2022 at 2:53 PM Robert Haas  wrote:
> On Tue, Apr 5, 2022 at 4:30 PM Peter Geoghegan  wrote:
> > On Tue, Apr 5, 2022 at 1:10 PM Robert Haas  wrote:
> > > I had assumed that this would not be the case, because if the page is
> > > being accessed by the workload, it can be pruned - and probably frozen
> > > too, if we wanted to write code for that and spend the cycles on it -
> > > and if it isn't, pruning and freezing probably aren't needed.
> >
> > [ a lot of things ]
>
> I don't understand what any of this has to do with the point I was raising 
> here.

Why do you assume that we'll ever have an accurate idea of how many
LP_DEAD items there are, before we've looked? And if we're wrong about
that, persistently, why should anything else we think about it really
matter? This is an inherently dynamic and cyclic process. Statistics
don't really work here. That was how my remarks were related to yours.
That should be in scope -- getting better information about what work
we need to do by blurring the boundaries between deciding what to do,
and executing that plan.

On a long enough timeline the LP_DEAD items in heap pages are bound to
become the dominant concern in almost any interesting case for the
conveyor belt, for the obvious reason: you can't do anything about
LP_DEAD items without also doing every other piece of processing
involving those same heap pages. So in that sense, yes, they will be
the dominant problem at times, for sure.

On the other hand it seems very hard to imagine an interesting
scenario in which LP_DEAD items are the dominant problem from the
earliest stage of processing by VACUUM. But even if it was somehow
possible, would it matter? That would mean that there'd be occasional
instances of the conveyor belt being ineffective -- hardly the end of
the world. What has it cost us to keep it as an option that wasn't
used? I don't think we'd have to do any extra work, other than
in-memory bookkeeping.

-- 
Peter Geoghegan




MERGE bug report

2022-04-05 Thread Joe Wildish
Hello Hackers,

Reporting a bug with the new MERGE statement. Tested against 
75edb919613ee835e7680e40137e494c7856bcf9.

psql output as follows:

...
psql:merge.sql:33: ERROR:  variable not found in subplan target lists
ROLLBACK
[local] joe@joe=# \errverbose
ERROR:  XX000: variable not found in subplan target lists
LOCATION:  fix_join_expr_mutator, setrefs.c:2800

Stack trace:

fix_join_expr_mutator setrefs.c:2800
expression_tree_mutator nodeFuncs.c:3348
fix_join_expr_mutator setrefs.c:2853
expression_tree_mutator nodeFuncs.c:2992
fix_join_expr_mutator setrefs.c:2853
expression_tree_mutator nodeFuncs.c:3348
fix_join_expr_mutator setrefs.c:2853
fix_join_expr setrefs.c:2753
set_plan_refs setrefs.c:1085
set_plan_references setrefs.c:315
standard_planner planner.c:498
planner planner.c:277
pg_plan_query postgres.c:883
pg_plan_queries postgres.c:975
exec_simple_query postgres.c:1169
PostgresMain postgres.c:4520
BackendRun postmaster.c:4593
BackendStartup postmaster.c:4321
ServerLoop postmaster.c:1801
PostmasterMain postmaster.c:1473
main main.c:202
__libc_start_main 0x7fc4ccc0b1e2
_start 0x0048804e

Reproducer script:

BEGIN;
DROP TABLE IF EXISTS item, incoming, source CASCADE;

CREATE TABLE item
  (order_id    INTEGER NOT NULL,
   item_id     INTEGER NOT NULL,
   quantity    INTEGER NOT NULL,
   price       NUMERIC NOT NULL,
   CONSTRAINT pk_item PRIMARY KEY (order_id, item_id));

INSERT INTO item VALUES (100, 1, 4, 100.00), (100, 2, 9, 199.00);

CREATE TABLE incoming (order_id, item_id, quantity, price)
  AS (VALUES (100, 1, 4, 100.00), (100, 3, 1, 200.00));

CREATE TABLE source (order_id, item_id, quantity, price) AS
  (SELECT order_id, item_id, incoming.quantity, incoming.price
 FROM item LEFT JOIN incoming USING (order_id, item_id));

MERGE INTO item a
USING source b
   ON (a.order_id, a.item_id) =
  (b.order_id, b.item_id)
 WHEN NOT MATCHED
 THEN INSERT (order_id, item_id, quantity, price)
  VALUES (order_id, item_id, quantity, price)
 WHEN MATCHED
  AND a.* IS DISTINCT FROM b.*
 THEN UPDATE SET (quantity, price) = (b.quantity, b.price)
 WHEN MATCHED
  AND (b.quantity IS NULL AND b.price IS NULL)
 THEN DELETE;
COMMIT;

It seems related to the use of a.* and b.*

Sorry I can't be more specific. Error manifests when planning occurs and that 
is well outside of my code base knowledge.

Hope this helps.

Cheers,
-Joe




Re: [PATCH] pg_stat_toast

2022-04-05 Thread Gunnar "Nick" Bluth
Am 05.04.22 um 18:17 schrieb Robert Haas:
> On Thu, Mar 31, 2022 at 9:16 AM Gunnar "Nick" Bluth
>  wrote:
>> That was meant to say "v10", sorry!
> 
> Hi,

Hi Robert,

and thx for looking at this.

> From my point of view, at least, it would be preferable if you'd stop
> changing the subject line every time you post a new version.

Terribly sorry, I believed I was doing the right thing! I have removed the
"suffix" now for good.

> Based on the test results in
> http://postgr.es/m/42bfa680-7998-e7dc-b50e-480cdd986...@pro-open.de
> and the comments from Andres in
> https://www.postgresql.org/message-id/20211212234113.6rhmqxi5uzgipwx2%40alap3.anarazel.de
> my judgement would be that, as things stand today, this patch has no
> chance of being accepted, due to overhead. Now, Andres is currently
> working on an overhaul of the statistics collector and perhaps that
> would reduce the overhead of something like this to an acceptable
> level. If it does, that would be great news; I just don't know whether
> that's the case.

AFAICT, Andres' work is more about the structure (e.g.
13619598f1080d7923454634a2570ca1bc0f2fec). Or I've missed something...?

The attached v11 incorporates the latest changes in the area, btw.


Anyway, my (undisputed up to now!) understanding still is that only
backends _looking_ at these stats (so, e.g., accessing the pg_stat_toast
view) actually read the data. So, the 10-15% more space used for pg_stat
only affects the stats collector and a _few_ backends.

And those 10-15% were gathered with 10,000 tables containing *only*
TOASTable attributes. So the actual percentage would probably go down
quite a bit once you add some INTs or such.
Back then, I was curious myself about the impact and just ran a few
synthetic tests quickly hacked together. I'll happily go ahead and run
some tests on real-world schemas if that helps clarify matters!

> As far as the statistics themselves are concerned, I am somewhat
> skeptical about whether it's really worth adding code for this.
> According to the documentation, the purpose of the patch is to allow
> you to assess choice of storage and compression method settings for a
> column and is not intended to be enabled permanently. However, it

TBH, the wording there is probably a bit over-cautious. I very much
respect Andres and thus his reservations, and I know how careful the
project is about regressions of any kind (see below for some elaboration
on the latter).
I toned that wording down a bit for v11.

> seems to me that you could assess that pretty easily without this
> patch: just create a couple of different tables with different
> settings, load up the same data via COPY into each one, and see what
> happens. Now you might answer that with the patch you would get more
> detailed and accurate statistics, and I think that's true, but it
> doesn't really look like the additional level of detail would be
> critical to have in order to make a proper assessment. You might also
> say that creating multiple copies of the table and loading the data
> multiple times would be expensive, and that's also true, but you don't
> really need to load it all. A representative sample of 1GB or so would
> probably suffice in most cases, and that doesn't seem likely to be a
> huge load on the system.

At the end of the day, one could argue like you did there for almost all
(non-attribute) stats. "Why track function execution times? Just set up
a benchmark and call the function a million times and you'll know how long
it takes on average!" "Why track I/O timings? Run a benchmark on your
system and ..." and so on.

I maintain a couple of DBs that house TBs of TOASTable data (mainly XML
containing encrypted payloads). In just a couple of columns per cluster.
I'm completely clueless if TOAST compression makes a difference there.
Or externalization.
And I'm not allowed to copy that data anywhere outside production
without unrolling a metric ton of red tape.
Guess why I started writing this patch ;-)
*I* would certainly leave the option on, just to get an idea of what's
happening...

> Also, as we add more compression options, it's going to be hard to
> assess this sort of thing without trying stuff anyway. For example if
> you can set the lz4 compression level, you're not going to know which
> level is actually going to work best without trying out a bunch of
> them and seeing what happens. If we allow access to other sorts of
> compression parameters like zstd's "long" option, similarly, if you
> really care, you're going to have to try it.

Funny that you mention it. When writing the first version, I was
thinking about the LZ4 patch authors and was wondering how they
tested/benchmarked all of it and why they didn't implement something
like this patch for their tests ;-)

Yes, you're gonna try it. And you're gonna measure it. Somehow.
Externally, as things are now.

With pg_stat_toast, you'd get the byte-by-byte and - maybe even more
important - ms-by-ms comparison of the different compression methods.

Re: should vacuum's first heap pass be read-only?

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 4:30 PM Peter Geoghegan  wrote:
> On Tue, Apr 5, 2022 at 1:10 PM Robert Haas  wrote:
> > I had assumed that this would not be the case, because if the page is
> > being accessed by the workload, it can be pruned - and probably frozen
> > too, if we wanted to write code for that and spend the cycles on it -
> > and if it isn't, pruning and freezing probably aren't needed.
>
> [ a lot of things ]

I don't understand what any of this has to do with the point I was raising here.

> > > But, these same LP_DEAD-heavy tables *also* have a very decent
> > > chance of benefiting from a better index vacuuming strategy, something
> > > *also* enabled by the conveyor belt design. So overall, in either 
> > > scenario,
> > > VACUUM concentrates on problems that are particular to a given table
> > > and workload, without being hindered by implementation-level
> > > restrictions.
> >
> > Well this is what I'm not sure about. We need to demonstrate that
> > there are at least some workloads where retiring the LP_DEAD line
> > pointers doesn't become the dominant concern.
>
> It will eventually become the dominant concern. But that could take a
> while, compared to the growth in indexes.
>
> An LP_DEAD line pointer stub in a heap page is 4 bytes. The smallest
> possible B-Tree index tuple is 20 bytes on mainstream platforms (16
> bytes + 4 byte line pointer). Granted deduplication makes this less
> true, but that's far from guaranteed to help. Also, many tables have
> way more than one index.
>
> Of course it isn't nearly as simple as comparing the bytes of bloat in
> each case. More generally, I don't claim that it's easy to
> characterize which factor is more important, even in the abstract,
> even under ideal conditions -- it's very hard. But I'm sure that there
> are routinely very large differences among indexes and the heap
> structure.

Yeah, I think we need to better understand how this works out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: shared-memory based stats collector - v69

2022-04-05 Thread David G. Johnston
On Tue, Apr 5, 2022 at 2:23 PM Andres Freund  wrote:

>
> On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
>
> >, but rather add to the shared queue
>
> Queue? Maybe you mean the hashtable?
>

Queue implemented by a list...?  Anyway, I think I mean this:

/*
 * List of PgStat_EntryRefs with unflushed pending stats.
 *
 * Newly pending entries should only ever be added to the end of the list,
 * otherwise pgstat_flush_pending_entries() might not see them immediately.
 */
static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);


>
>
> >, and so the cache is effectively read-only.  It is also
> transaction-scoped
> >based upon the GUC and the nature of stats vis-a-vis transactions.
>
> No, that's not right. I think you might be thinking of
> pgStatLocal.snapshot.stats?
>
>
Probably...


> I guess I should add a paragraph about snapshots / fetch consistency.
>

I apparently confused/combined the two concepts just now so that would help.

>
> > Even before I added the read-only and transaction-scoped I got a bit hung
> > up on reading:
> > "The shared hashtable only needs to be accessed when no prior reference
> to
> > the shared hashtable exists."
>
> > Thinking in terms of key seems to make more sense than value in this
> > sentence - even if there is a one-to-one correspondence.
>
> Maybe "prior reference to the shared hashtable exists for the key"?
>

I specifically dislike having two mentions of the "shared hashtable" in the
same sentence, so I tried to phrase the second half in terms of the local
hashtable.


> > I am wondering why there are no mentions to the header files in this
> > architecture, only the .c files.
>
> Hm, I guess, but I'm not sure it'd add a lot? It's really just intended to
> give a starting point (and it can't be worse than explanation of the
> current
> system).
>

No need to try to come up with something.  More curious if there was a
general reason to avoid it before I looked to see if I felt anything in
them seemed worth including from my perspective.

>
>
> > diff --git a/src/backend/utils/activity/pgstat.c
> b/src/backend/utils/activity/pgstat.c
> > index bfbfe53deb..504f952c0e 100644
> > --- a/src/backend/utils/activity/pgstat.c
> > +++ b/src/backend/utils/activity/pgstat.c
> > @@ -4,9 +4,9 @@
> >   *
> >   *
> >   * PgStat_KindInfo describes the different types of statistics handled.
> Some
> > - * kinds of statistics are collected for fixed number of objects
> > - * (e.g. checkpointer statistics). Other kinds are statistics are
> collected
> > - * for variable-numbered objects (e.g. relations).
> > + * kinds of statistics are collected for a fixed number of objects
> > + * (e.g., checkpointer statistics). Other kinds of statistics are
> collected
>
> Was that comma after e.g. intentional?
>

It is.  That is the style I was taught, and that we seem to adhere to in
user-facing documentation.  Source code is a mixed bag with no enforcement,
but while we are here...


> > + * for a varying number of objects (e.g., relations).
> >   * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
>

status-quo works for me too, and matches up with the desired labelling we
are using here.



> >   *
> > @@ -19,19 +19,21 @@
> >   *
> >   * All variable-numbered stats are addressed by PgStat_HashKey while
> running.
> >   * It is not possible to have statistics for an object that cannot be
> > - * addressed that way at runtime. A wider identifier can be used when
> > + * addressed that way at runtime. A alternate identifier can be used
> when
> >   * serializing to disk (used for replication slot stats).
>
> Not sure this improves things.
>
>
It just seems odd that width is being mentioned when the actual struct is a
combination of three subcomponents.  I do feel I'd need to understand
exactly what replication slot stats are doing uniquely here, though, to
make any point beyond that.


>
> > - * Each statistics kind is handled in a dedicated file:
> > + * Each statistics kind is handled in a dedicated file, though their
> structs
> > + * are defined here for lack of better ideas.
>
> -0.5
>
>
Status-quo works for me.  Food for thought for other reviewers though.

David J.


Re: shared-memory based stats collector - v69

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
> On Mon, Apr 4, 2022 at 8:05 PM Andres Freund  wrote:
> 
> > - added an architecture overview comment to the top of pgstat.c - not sure
> > if
> >   it makes sense to anybody but me (and perhaps Horiguchi-san)?
> >
> >
> I took a look at this, diff attached.

Thanks!


> Some typos and minor style stuff,
> plus trying to bring a bit more detail to the caching mechanism.  I may
> have gotten it wrong in adding more detail though.
> 
> + * read-only, backend-local, transaction-scoped, hashtable
> (pgStatEntryRefHash)
> + * in front of the shared hashtable, containing references
> (PgStat_EntryRef)
> + * to shared hashtable entries. The shared hashtable thus only needs to be
> + * accessed when the PgStat_HashKey is not present in the backend-local
> hashtable,
> + * or if stats_fetch_consistency = 'none'.
> 
> I'm under the impression, but didn't try to confirm, that the pending
> updates don't use the caching mechanism

They do.


>, but rather add to the shared queue

Queue? Maybe you mean the hashtable?


>, and so the cache is effectively read-only.  It is also transaction-scoped
>based upon the GUC and the nature of stats vis-a-vis transactions.

No, that's not right. I think you might be thinking of
pgStatLocal.snapshot.stats?

I guess I should add a paragraph about snapshots / fetch consistency.


> Even before I added the read-only and transaction-scoped I got a bit hung
> up on reading:
> "The shared hashtable only needs to be accessed when no prior reference to
> the shared hashtable exists."

> Thinking in terms of key seems to make more sense than value in this
> sentence - even if there is a one-to-one correspondence.

Maybe "prior reference to the shared hashtable exists for the key"?


> I am wondering why there are no mentions to the header files in this
> architecture, only the .c files.

Hm, I guess, but I'm not sure it'd add a lot? It's really just intended to
give a starting point (and it can't be worse than explanation of the current
system).


> diff --git a/src/backend/utils/activity/pgstat.c 
> b/src/backend/utils/activity/pgstat.c
> index bfbfe53deb..504f952c0e 100644
> --- a/src/backend/utils/activity/pgstat.c
> +++ b/src/backend/utils/activity/pgstat.c
> @@ -4,9 +4,9 @@
>   *
>   *
>   * PgStat_KindInfo describes the different types of statistics handled. Some
> - * kinds of statistics are collected for fixed number of objects
> - * (e.g. checkpointer statistics). Other kinds are statistics are collected
> - * for variable-numbered objects (e.g. relations).
> + * kinds of statistics are collected for a fixed number of objects
> + * (e.g., checkpointer statistics). Other kinds of statistics are collected

Was that comma after e.g. intentional?

Applied the rest.


> + * for a varying number of objects (e.g., relations).
>   * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
>   *
> @@ -19,19 +19,21 @@
>   *
>   * All variable-numbered stats are addressed by PgStat_HashKey while running.
>   * It is not possible to have statistics for an object that cannot be
> - * addressed that way at runtime. A wider identifier can be used when
> + * addressed that way at runtime. A alternate identifier can be used when
>   * serializing to disk (used for replication slot stats).

Not sure this improves things.



>   * The names for structs stored in shared memory are prefixed with
>   * PgStatShared instead of PgStat.
> @@ -53,15 +55,16 @@
>   * entry in pgstat_kind_infos, see PgStat_KindInfo for details.
>   *
>   *
> - * To keep things manageable stats handling is split across several
> + * To keep things manageable, stats handling is split across several

Done.


>   * files. Infrastructure pieces are in:
> - * - pgstat.c - this file, to tie it all together
> + * - pgstat.c - this file, which ties everything together

I liked that :)


> - * Each statistics kind is handled in a dedicated file:
> + * Each statistics kind is handled in a dedicated file, though their structs
> + * are defined here for lack of better ideas.

-0.5

Greetings,

Andres Freund




Re: [PATCH] Add extra statistics to explain for Nested Loop

2022-04-05 Thread Greg Stark
This is not passing regression tests due to some details of the plan
output - marking Waiting on Author:

diff -w -U3 c:/cirrus/src/test/regress/expected/partition_prune.out
c:/cirrus/src/test/recovery/tmp_check/results/partition_prune.out
--- c:/cirrus/src/test/regress/expected/partition_prune.out 2022-04-05
17:00:25.433576100 +
+++ c:/cirrus/src/test/recovery/tmp_check/results/partition_prune.out
2022-04-05 17:18:30.092203500 +
@@ -2251,10 +2251,7 @@
  Workers Planned: 2
  Workers Launched: N
  ->  Parallel Seq Scan on public.lprt_b (actual rows=N loops=N)
-   Loop Min Rows: N  Max Rows: N  Total Rows: N
Output: lprt_b.b
-   Worker 0:  actual rows=N loops=N
-   Worker 1:  actual rows=N loops=N
->  Materialize (actual rows=N loops=N)
  Loop Min Rows: N  Max Rows: N  Total Rows: N
  Output: lprt_a.a
@@ -2263,10 +2260,8 @@
Workers Planned: 1
Workers Launched: N
->  Parallel Seq Scan on public.lprt_a (actual rows=N loops=N)
- Loop Min Rows: N  Max Rows: N  Total Rows: N
  Output: lprt_a.a
- Worker 0:  actual rows=N loops=N
-(24 rows)
+(19 rows)

 drop table lprt_b;
 delete from lprt_a where a = 1;




Re: shared-memory based stats collector - v69

2022-04-05 Thread David G. Johnston
On Mon, Apr 4, 2022 at 8:05 PM Andres Freund  wrote:

> - added an architecture overview comment to the top of pgstat.c - not sure
> if
>   it makes sense to anybody but me (and perhaps Horiguchi-san)?
>
>
I took a look at this, diff attached.  Some typos and minor style stuff,
plus trying to bring a bit more detail to the caching mechanism.  I may
have gotten it wrong in adding more detail though.

+ * read-only, backend-local, transaction-scoped, hashtable
(pgStatEntryRefHash)
+ * in front of the shared hashtable, containing references
(PgStat_EntryRef)
+ * to shared hashtable entries. The shared hashtable thus only needs to be
+ * accessed when the PgStat_HashKey is not present in the backend-local
hashtable,
+ * or if stats_fetch_consistency = 'none'.

I'm under the impression, but didn't try to confirm, that the pending
updates don't use the caching mechanism, but rather add to the shared
queue, and so the cache is effectively read-only.  It is also
transaction-scoped based upon the GUC and the nature of stats vis-a-vis
transactions.

Even before I added the read-only and transaction-scoped I got a bit hung
up on reading:
"The shared hashtable only needs to be accessed when no prior reference to
the shared hashtable exists."

Thinking in terms of key seems to make more sense than value in this
sentence - even if there is a one-to-one correspondence.

The code comment about having per-kind definitions in pgstat.c being
annoying is probably sufficient but it does seem like a valid comment to
leave in the architecture as well.  Having them in both places seems OK.

I am wondering why there are no mentions to the header files in this
architecture, only the .c files.

David J.


pgstat-architecture.diff
Description: Binary data


Re: shared-memory based stats collector - v66

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-02 01:16:48 -0700, Andres Freund wrote:
> I just noticed that the code doesn't appear to actually work like that right
> now. Whenever the timeout is reached, pgstat_report_stat() is called with
> force = true.
> 
> And even if the backend is busy running queries, once there's contention, the
> next invocation of pgstat_report_stat() will return the timeout relative to
> pending_since, which then will trigger a force report via a very short timeout
> soon.
> 
> It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
> (with a slightly different name) from pgstat_report_stat() when blocked
> (limiting the max reporting delay for an idle connection) and to continue
> calling pgstat_report_stat(force = true).  But to only trigger force
> "internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.
> 
> I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
> connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
> to PGSTAT_MAX_INTERVAL when blocked) on busy connections.
> 
> Makes sense?

I tried to come up with a workload producing a *lot* of stats (multiple
function calls within a transaction, multiple transactions pipelined) and ran
it with 1000 clients (on a machine with 2 x (10 cores / 20 threads)). To
reduce overhead I set
  default_transaction_isolation=repeatable read
  track_activities=false
MVCC Snapshot acquisition is the clear bottleneck otherwise, followed by
pgstat_report_activity() (which, as confusing as it may sound, is independent
of this patch).

I do see a *small* amount of contention if I lower PGSTAT_MIN_INTERVAL to
1ms. Too small to ever be captured in pg_stat_activity.wait_event, but just
about visible in a profiler.


Which leads me to conclude we can simplify the logic significantly. Here's my
current comment explaining the logic:

 * Unless called with 'force', pending stats updates are flushed once
 * per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
 * block on lock acquisition, except if stats updates have been pending for
 * longer than PGSTAT_MAX_INTERVAL (60000ms).
 *
 * Whenever pending stats updates remain at the end of pgstat_report_stat() a
 * suggested idle timeout is returned. Currently this is always
 * PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
 * a timeout after which to call pgstat_report_stat(true), but are not
 * required to do so.

Comments?

Greetings,

Andres Freund




Re: should vacuum's first heap pass be read-only?

2022-04-05 Thread Peter Geoghegan
On Tue, Apr 5, 2022 at 1:10 PM Robert Haas  wrote:
> I had assumed that this would not be the case, because if the page is
> being accessed by the workload, it can be pruned - and probably frozen
> too, if we wanted to write code for that and spend the cycles on it -
> and if it isn't, pruning and freezing probably aren't needed.

VACUUM has a top-down structure, and so seems to me to be the natural
place to think about the high level needs of the table as a whole,
especially over time.

I don't think we actually need to scan the pages that we left some
LP_DEAD items in previous VACUUM operations. It seems possible to
freeze newly appended pages quite often, without needlessly revisiting
the pages from previous batches (even those with LP_DEAD items left
behind). Maybe we need to rethink the definition of "VACUUM operation"
a little to do that, but it seems relatively tractable.

As I said upthread recently, I am excited about the potential of
"locking in" a set of scanned_pages using a local/private version of
the visibility map (a copy from just after OldestXmin is initially
established), that VACUUM can completely work off of. Especially if
combined with the conveyor belt, which could make VACUUM operations
suspendable and resumable.

I don't see any reason why it wouldn't be possible to "lock in" an
initial scanned_pages, and then use that data structure (which could
be persisted) to avoid revisiting the pages that we know we already
visited (and left LP_DEAD items in). We could "resume the VACUUM
operation that was suspended earlier" a bit later (not have several
technically unrelated VACUUM operations together in close succession).
The later rounds of processing could even use new cutoffs for both
pruning and freezing, despite being from "the same VACUUM operation".
They could have an "expanded rel_pages" that covers the newly appended
pages that we want to quickly freeze tuples on.

AFAICT the only thing that we need to do to make this safe is to carry
forward our original vacrel->NewRelfrozenXid (which can never be later
than our original vacrel->OldestXmin). Under this architecture, we
don't really "skip index vacuuming" at all. Rather, we redefine VACUUM
operations in a way that makes the final rel_pages provisional, at
least when run in autovacuum.

VACUUM itself can notice that it might be a good idea to "expand
rel_pages" and expand the scope of the work it ultimately does, based
on the observed characteristics of the table. No heap pages get repeat
processing per "VACUUM operation" (relative to the current definition
of the term). Some indexes will get "extra, earlier index vacuuming",
which we've already said is the right way to think about all this (we
should think of it as extra index vacuuming, not less index
vacuuming).

> > But, these same LP_DEAD-heavy tables *also* have a very decent
> > chance of benefiting from a better index vacuuming strategy, something
> > *also* enabled by the conveyor belt design. So overall, in either scenario,
> > VACUUM concentrates on problems that are particular to a given table
> > and workload, without being hindered by implementation-level
> > restrictions.
>
> Well this is what I'm not sure about. We need to demonstrate that
> there are at least some workloads where retiring the LP_DEAD line
> pointers doesn't become the dominant concern.

It will eventually become the dominant concern. But that could take a
while, compared to the growth in indexes.

An LP_DEAD line pointer stub in a heap page is 4 bytes. The smallest
possible B-Tree index tuple is 20 bytes on mainstream platforms (16
bytes + 4 byte line pointer). Granted deduplication makes this less
true, but that's far from guaranteed to help. Also, many tables have
way more than one index.

Of course it isn't nearly as simple as comparing the bytes of bloat in
each case. More generally, I don't claim that it's easy to
characterize which factor is more important, even in the abstract,
even under ideal conditions -- it's very hard. But I'm sure that there
are routinely very large differences among indexes and the heap
structure.

-- 
Peter Geoghegan




Re: ExecRTCheckPerms() and many prunable partitions

2022-04-05 Thread David Rowley
On Wed, 6 Apr 2022 at 02:27, Greg Stark  wrote:
>
> This is failing regression tests. I don't understand how this patch
> could be affecting this test though. Perhaps it's a problem with the
> json patches that were committed recently -- but they don't seem to be
> causing other patches to fail.

I think this will just be related to the useprefix =
list_length(es->rtable) > 1; in show_plan_tlist().  There's likely not
much point in keeping the RTE for the view anymore. IIRC it was just
there to check permissions. Amit has now added another way of doing
those.

David




Re: should vacuum's first heap pass be read-only?

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 3:22 PM Peter Geoghegan  wrote:
> It's not just an enabler of more frequent index vacuuming (for those
> indexes that need it the most), though. It's also an enabler of more
> frequent lazy_scan_prune processing (in particular setting hint bits
> and freezing), which is probably even more likely to benefit from the
> decoupling you'd be enabling. I can imagine this having great value in
> a world where autovacuum scheduling eagerly keeps up with inserts into
> an append-mostly table, largely avoiding repeating dirtying within
> lazy_scan_prune, with dynamic feedback. You just need to put off the
> work of index/heap vacuuming to be able to do that kind of thing.

I had assumed that this would not be the case, because if the page is
being accessed by the workload, it can be pruned - and probably frozen
too, if we wanted to write code for that and spend the cycles on it -
and if it isn't, pruning and freezing probably aren't needed.

> But, these same LP_DEAD-heavy tables *also* have a very decent
> chance of benefiting from a better index vacuuming strategy, something
> *also* enabled by the conveyor belt design. So overall, in either scenario,
> VACUUM concentrates on problems that are particular to a given table
> and workload, without being hindered by implementation-level
> restrictions.

Well this is what I'm not sure about. We need to demonstrate that
there are at least some workloads where retiring the LP_DEAD line
pointers doesn't become the dominant concern.

Also, just to be clear, I'm not arguing against my own idea. I'm
trying to figure out what the first, smallest unit of work that
produces a committable patch with a meaningful benefit is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-04-05 Thread Nathan Bossart
On Tue, Apr 05, 2022 at 11:25:36AM -0400, Stephen Frost wrote:
> Please find attached an updated patch + commit message.  Mostly, I just
> went through and did a bit more in terms of updating the documentation
> and improving the comments (there were some places that were still
> worrying about the chance of a 'stray' backup_label file existing, which
> isn't possible any longer), along with some additional testing and
> review.  This is looking pretty good to me, but other thoughts are
> certainly welcome.  Otherwise, I'm hoping to commit this tomorrow.

LGTM!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-05 Thread Robert Haas
On Fri, Apr 1, 2022 at 11:12 AM Matthias van de Meent
 wrote:
> Here's a new 0001 to keep CFBot happy.

This seems like it would conflict with the proposal from
http://postgr.es/m/ca+tgmoad8wmn6i1mmuo+4znege3hd57ys8uv8uzm7cneqy3...@mail.gmail.com
which I still hope to advance in some form at an appropriate time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Temporary file access API

2022-04-05 Thread Robert Haas
On Tue, Mar 8, 2022 at 6:12 AM Antonin Houska  wrote:
> Thanks for your comments, the initial version is attached here.

I've been meaning to look at this thread for some time but have not
found enough time to do that until just now. And now I have
questions...

1. Supposing we accepted this, how widely do you think that we could
adopt it? I see that this patch set adapts a few places to use it and
that's nice, but I have a feeling there's a lot more places that are
making use of system calls directly, or through some other
abstraction, than just this. I'm not sure that we actually want to use
something like this everywhere, but what would be the rule for
deciding where to use it and where not to use
it? If the plan for this facility is just to adapt these two
particular places to use it, that doesn't feel like enough to be
worthwhile.

2. What about frontend code? Most frontend code won't examine data
files directly, but at least pg_controldata, pg_checksums, and
pg_resetwal are exceptions.

3. How would we extend this to support encryption? Add an extra
argument to BufFileStreamInit(V)FD passing down the encryption
details?

There are some smaller things about the patch with which I'm not 100%
comfortable, but I'd like to start by understanding the big picture.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: should vacuum's first heap pass be read-only?

2022-04-05 Thread Peter Geoghegan
On Tue, Apr 5, 2022 at 5:44 AM Robert Haas  wrote:
> The whole idea of decoupling table and index vacuum
> supposes that there are situations in which it's worth performing the
> first heap pass where we gather the dead line pointers but where it's
> not necessary to follow that up as quickly as possible with a second
> heap pass to mark dead line pointers unused. I think Peter and I are
> in agreement that there are situations in which some indexes need to
> be vacuumed much more often than others -- but that doesn't matter if
> the heap needs to be vacuumed more frequently than anything else,
> because you can't do that without first vacuuming all the indexes.

It's not just an enabler of more frequent index vacuuming (for those
indexes that need it the most), though. It's also an enabler of more
frequent lazy_scan_prune processing (in particular setting hint bits
and freezing), which is probably even more likely to benefit from the
decoupling you'd be enabling. I can imagine this having great value in
a world where autovacuum scheduling eagerly keeps up with inserts into
an append-mostly table, largely avoiding repeated dirtying within
lazy_scan_prune, with dynamic feedback. You just need to put off the
work of index/heap vacuuming to be able to do that kind of thing.

Postgres 14 split the WAL record previously shared by both pruning and
vacuuming (called XLOG_HEAP2_CLEAN) into two separate WAL records
(called XLOG_HEAP2_PRUNE and XLOG_HEAP2_VACUUM). That made it easier
to spot the fact that we usually have far fewer of the latter WAL
records during VACUUM by using pg_waldump. Might be worth doing your
own experiments on this.

Other instrumentation changes in 14 also helped here. In particular
the "%u pages from table (%.2f%% of total) had %lld dead item
identifiers removed" line that was added to autovacuum's log output
made it easy to spot how little heap vacuuming might really be needed.
With some tables it is roughly the opposite way around (as much or
even more heap vacuuming than pruning/freezing) -- you'll tend to see
that in tables where opportunistic pruning leaves behind a lot of
LP_DEAD stubs that only VACUUM can make LP_UNUSED.

But these same LP_DEAD-heavy tables *also* have a very decent
chance of benefiting from a better index vacuuming strategy, something
*also* enabled by the conveyor belt design. So overall, in either scenario,
VACUUM concentrates on problems that are particular to a given table
and workload, without being hindered by implementation-level
restrictions.

--
Peter Geoghegan




Re: Mark all GUC variable as PGDLLIMPORT

2022-04-05 Thread Tom Lane
Robert Haas  writes:
> Do you care whether your commit
> or mine goes in first?

I do not.  If they're not independent, at least one of us has messed up.

I have family commitments on Saturday, so if I don't get mine in
on Friday it'll likely not happen before Sunday.

regards, tom lane




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-05 Thread Tom Lane
Robert Haas  writes:
> On Tue, Apr 5, 2022 at 10:32 AM Tom Lane  wrote:
>> My point is that we want that to happen in HEAD, but it's not okay
>> for it to happen in a minor release of a stable branch.

> I understand, but I am not sure that I agree. I think that if an
> extension stops compiling against a back-branch, someone will notice
> the next time they try to compile it and will fix it. Maybe that's not
> amazing, but I don't think it's a huge deal either.

Well, perhaps it's not the end of the world, but it's still a large
PITA for the maintainer of such an extension.  They can't "just fix it"
because some percentage of their userbase will still need to compile
against older minor releases.  Nor have you provided any way to handle
that requirement via conditional compilation.

regards, tom lane




Re: Add index scan progress to pg_stat_progress_vacuum

2022-04-05 Thread Imseih (AWS), Sami
>Can't the progress data trivially be inferred by the fact that the worker
>completed?

Yes, at some point this idea was experimented with in
0004-Expose-progress-for-the-vacuuming-indexes-cleanup-ph.patch.
That patch did the calculation in system_views.sql.

However, the view is complex, and there could be edge cases in which
inferring the values leads to wrong values being reported.

Regards,

Sami Imseih
Amazon Web Services



Re: shared-memory based stats collector - v68

2022-04-05 Thread David G. Johnston
On Tuesday, April 5, 2022, Andres Freund  wrote:

> Hi,
>
> On 2022-04-05 08:49:36 -0700, David G. Johnston wrote:
> > On Mon, Apr 4, 2022 at 7:36 PM Andres Freund  wrote:
> >
> > >
> > > I think all this is going to achieve is to make code more complicated.
> > > There is a *single* non-assert use of accessed_across_databases and now
> > > a single assertion involving it.
> > >
> > > What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
> > >
> >
> > So, I decided to see what this would look like; the results are attached,
> > portions of it also inlined below.
>
> > I'll admit this does introduce a terminology problem - but IMO these
> > words are much more meaningful to the reader and code than the existing
> > booleans.  I'm hopeful we can bikeshed something agreeable as I'm
> > strongly in favor of making this change.
>
> Sorry, I just don't agree. I'm happy to try to make it look better, but
> this isn't it.
>
> Do you think it should be your way strongly enough that you'd not want to
> get it in the current way?
>
>
Not that strongly; I’m good with the code as-is.  It's not pervasive enough
to be hard to understand (I may ponder some code comments though), and the
system it is modeling has some legacy aspects that are more the root of the
problem; I certainly don’t want to touch those here.


>
> OR-ing PGSTAT_CLUSTER = 2 with PGSTAT_OBJECT = 3 yields 3 again. To do
> this kind of thing the different values need to have power-of-two values,
> and then the tests need to be done with &.


Thanks.


>
> Nicely demonstrated by the fact that with the patch applied initdb doesn't
> pass...



Yeah, I compiled but tried to run the tests and learned I still need to
figure out my setup for make check; then I forgot to make install…

It served its purpose at least.



>
> > @@ -1047,8 +1042,8 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
> >   void   *pending_data = entry_ref->pending;
> >
> >   Assert(pending_data != NULL);
> > - /* !fixed_amount stats should be handled explicitly */
> > - Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> > + /* global stats should be handled explicitly : why?*/
> > + Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_DYNAMIC);
>
> The pending data infrastructure doesn't provide a way of dealing with fixed
> amount stats, and there's no PgStat_EntryRef for them (since they're not in
> the hashtable).
>
>
Thanks.

David J.


Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-04-05 Thread Jacob Champion
On Tue, 2022-04-05 at 15:13 +0900, Michael Paquier wrote:
> On Wed, Mar 30, 2022 at 04:02:09PM +, Jacob Champion wrote:
> > Whether that's a problem in the future entirely depends on whether
> > there's some authentication method that considers the empty string a
> > sane and meaningful identity. We might reasonably decide that the
> > answer is "no", but I like being able to make that decision as opposed
> > to delegating it to an existing generic framework.
> 
> My guess on the matter is that an empty authn holds the same meaning
> as NULL because it has no data,

Whether it holds meaning or not depends entirely on the auth method, I
think. Hypothetical example -- a system could accept client
certificates with an empty Subject. What identity that Subject
represents would depend on the organization, but it's distinct from
NULL/unauthenticated because the certificate is still signed by a CA.

(Postgres rejects empty Subjects when using clientname=DN and I'm not
proposing that we change that; I haven't actually checked that
they're RFC-legal. But it's possible that a future auth method could
have a reasonable standard definition for an empty identifier.)

> but I can see your point as well to
> make this distinction.  In order to do that, couldn't you just use
> shm_toc_lookup(noError=true)?  PARALLEL_KEY_SHAREDPORT could be an
> optional entry in the TOC data.

The current patch already handles NULL with a byte of overhead; is
there any advantage to using noError? (It might make things messier
once a second member gets added to the struct.) My concern was directed
at the GUC proposal.

> The name choice is still an issue, as per Andres' point that
> MyProcShared is confusing as it can refer to shared memory.  What we
> want is a structure name for something that's related to MyProc and
> shared across all parallel workers including the leader.  I would
> give up on the "Shared" part, using "Parallel" and "Info" instead.
> Here are some ideas:
> - ProcParallelInfo
> - ProcInfoParallel
> - ParallelProcInfo

I like ParallelProcInfo; it reads nicely. I've used that in v9.

Thanks!
--Jacob
commit b3fb176a4e4f09f7436f2df8c3411b4b51c71906
Author: Jacob Champion 
Date:   Tue Apr 5 10:18:16 2022 -0700

squash! Allow parallel workers to use pg_session_authn_id()

Update name to MyParallelProcInfo.

diff --git a/src/backend/access/transam/parallel.c 
b/src/backend/access/transam/parallel.c
index c88eab0933..27eda766b1 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,7 +76,7 @@
 #define PARALLEL_KEY_REINDEX_STATE		UINT64CONST(0x000C)
 #define PARALLEL_KEY_RELMAPPER_STATE	UINT64CONST(0x000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS	UINT64CONST(0x000E)
-#define PARALLEL_KEY_SHAREDPORT		UINT64CONST(0x000F)
+#define PARALLEL_KEY_PROCINFO			UINT64CONST(0x000F)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -213,7 +213,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
	Size		reindexlen = 0;
	Size		relmapperlen = 0;
	Size		uncommittedenumslen = 0;
-	Size		sharedportlen = 0;
+	Size		procinfolen = 0;
	Size		segsize = 0;
	int			i;
	FixedParallelState *fps;
@@ -274,8 +274,8 @@ InitializeParallelDSM(ParallelContext *pcxt)
	shm_toc_estimate_chunk(&pcxt->estimator, relmapperlen);
	uncommittedenumslen = EstimateUncommittedEnumsSpace();
	shm_toc_estimate_chunk(&pcxt->estimator, uncommittedenumslen);
-	sharedportlen = EstimateSharedPortSpace();
-	shm_toc_estimate_chunk(&pcxt->estimator, sharedportlen);
+	procinfolen = EstimateParallelProcInfoSpace();
+	shm_toc_estimate_chunk(&pcxt->estimator, procinfolen);
	/* If you add more chunks here, you probably need to add keys. */
	shm_toc_estimate_keys(&pcxt->estimator, 12);
 
@@ -356,7 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
	char	   *session_dsm_handle_space;
	char	   *entrypointstate;
	char	   *uncommittedenumsspace;
-	char	   *sharedportspace;
+	char	   *procinfospace;
	Size		lnamelen;
 
/* Serialize shared libraries we have loaded. */
@@ -427,11 +427,11 @@ InitializeParallelDSM(ParallelContext *pcxt)
shm_toc_insert(pcxt->toc, PARALLEL_KEY_UNCOMMITTEDENUMS,
   uncommittedenumsspace);
 
-   /* Serialize our SharedPort. */
-   sharedportspace = shm_toc_allocate(pcxt->toc, sharedportlen);
-   SerializeSharedPort(sharedportlen, sharedportspace);
-   

Re: LogwrtResult contended spinlock

2022-04-05 Thread Alvaro Herrera
Apologies -- I selected the wrong commit to extract the commit message
from.  Here it is again.  I also removed an obsolete /* XXX */ comment.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
From 0b26f90f1b95f8e9b932eb34bbf9c2a50729cf60 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH v9] Make XLogCtl->LogwrtResult accessible with atomics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, access to LogwrtResult is protected by a spinlock.  This
becomes severely contended in some scenarios, such as with a largish
replication flock: walsenders all calling GetFlushRecPtr repeatedly
cause the processor to heat up to the point where eggs can be fried on top.

This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult
with a struct containing a pair of atomically accessed variables. Do so.
In a few places, we can adjust the exact location where the locals are
updated to account for the fact that we no longer need the spinlock.

Author: Álvaro Herrera 
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
---
 src/backend/access/transam/xlog.c | 106 ++
 src/include/port/atomics.h|  29 
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 17a56152f1..50fdfcbc90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -285,16 +285,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogwrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -317,6 +314,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -480,6 +483,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -497,12 +501,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -622,7 +620,7 @@ static ControlFileData *ControlFile = NULL;
 static int	UsableBytesInSegment;
 
 /*
- * Private, possibly out-of-date copy of shared LogwrtResult.
+ * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult.
  * See discussion above.
  */
 static XLogwrtResult LogwrtResult = {0, 0};
@@ -932,8 +930,6 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
-		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(&XLogCtl->info_lck);
 	}
 
@@ -1811,6 +1807,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	 * Now that we have the lock, check if someone initialized the page
 	 * already.
 	 */
+	LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
 	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
 	{
 		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
@@ -1830,17 +1827,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, 

Re: Mark all GUC variable as PGDLLIMPORT

2022-04-05 Thread John Naylor
On Tue, Apr 5, 2022 at 10:06 PM Robert Haas  wrote:
>
> On Tue, Apr 5, 2022 at 10:07 AM Tom Lane  wrote:
> > Yeah, the frontend error message rework in [1].  That has exactly
> > the same constraint that it's likely to break other open patches,
> > so it'd be better to do it after the CF cutoff.  I think that doing
> > that concurrently with Robert's thing shouldn't be too risky, because
> > it only affects frontend code while his patch should touch only backend.
>
> So when *exactly* do we want to land these patches? None of the
> calendar programs I use support "anywhere on earth" as a time zone,
> but I think that feature freeze is 8am on Friday on the East coast of
> the United States.

I understand it to be noon UTC on Friday.

> If I commit the PGDLLIMPORT change within a few
> hours after that time, is that good? Should I try to do it earlier,
> before we technically hit 8am? Should I do it the night before, last
> thing before I go to bed on Thursday? Do you care whether your commit
> or mine goes in first?

For these two patches, I'd say a day or two after feature freeze is a
reasonable goal.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-04-05 Thread Andres Freund
On 2022-04-05 16:42:28 +, Imseih (AWS), Sami wrote:
> >Why isn't the obvious thing to do here to provide a way to associate
> >workers with their leaders in shared memory, but to use the existing
> >progress fields to report progress? Then, when querying progress, the
> >leader and workers progress fields can be combined to show the overall
> >progress?
>
> The original intent was this; however, the workers
> can exit before the command completes and the
> worker progress data will be lost.

Can't the progress data trivially be inferred by the fact that the worker
completed?




Re: shared-memory based stats collector - v68

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-05 08:49:36 -0700, David G. Johnston wrote:
> On Mon, Apr 4, 2022 at 7:36 PM Andres Freund  wrote:
> 
> >
> > I think all this is going to achieve is to make code more complicated.
> > There is a *single* non-assert use of accessed_across_databases and now
> > a single assertion involving it.
> >
> > What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
> >
> 
> So, I decided to see what this would look like; the results are attached,
> portions of it also inlined below.

> I'll admit this does introduce a terminology problem - but IMO these words
> are much more meaningful to the reader and code than the existing
> booleans.  I'm hopeful we can bikeshed something agreeable as I'm strongly
> in favor of making this change.

Sorry, I just don't agree. I'm happy to try to make it look better, but this
isn't it.

Do you think it should be your way strongly enough that you'd not want to get
it in the current way?



> The ability to create defines for subsets nicely resolves the problem that
> CLUSTER and DATABASE (now OBJECT to avoid DATABASE conflict in PgStat_Kind)
> are generally related together - they are now grouped under the DYNAMIC
> label (variable, if you want) while all of the fixed entries get associated
> with GLOBAL.  Thus the majority of usages, since accessed_across_databases
> is rare, end up being either DYNAMIC or GLOBAL.

FWIW, as-is DYNAMIC isn't correct:

> +typedef enum PgStat_KindGroup
> +{
> + PGSTAT_GLOBAL = 1,
> + PGSTAT_CLUSTER,
> + PGSTAT_OBJECT
> +} PgStat_KindGroup;
> +
> +#define PGSTAT_DYNAMIC (PGSTAT_CLUSTER | PGSTAT_OBJECT)

OR-ing PGSTAT_CLUSTER = 2 with PGSTAT_OBJECT = 3 yields 3 again. To do this
kind of thing the different values need to have power-of-two values, and then
the tests need to be done with &.

Nicely demonstrated by the fact that with the patch applied initdb doesn't
pass...


> @@ -909,7 +904,7 @@ pgstat_build_snapshot(void)
>*/
>   if (p->key.dboid != MyDatabaseId &&
>   p->key.dboid != InvalidOid &&
> - !kind_info->accessed_across_databases)
> + kind_info->kind_group == PGSTAT_OBJECT)
>   continue;
>  
>   if (p->dropped)

Imo this is far harder to interpret - !kind_info->accessed_across_databases
tells you why we're skipping in clear code. Your alternative doesn't.


> @@ -938,7 +933,7 @@ pgstat_build_snapshot(void)
>   {
>   const PgStat_KindInfo *kind_info = pgstat_kind_info_for(kind);
>  
> - if (!kind_info->fixed_amount)
> + if (kind_info->kind_group == PGSTAT_DYNAMIC)

These all would have to be kind_info->kind_group & PGSTAT_DYNAMIC, or even
(kind_group & PGSTAT_DYNAMIC) != 0, depending on the case.


> @@ -1047,8 +1042,8 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
>   void   *pending_data = entry_ref->pending;
>  
>   Assert(pending_data != NULL);
> - /* !fixed_amount stats should be handled explicitly */
> - Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> + /* global stats should be handled explicitly : why?*/
> + Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_DYNAMIC);

The pending data infrastructure doesn't provide a way of dealing with fixed
amount stats, and there's no PgStat_EntryRef for them (since they're not in
the hashtable).


Greetings,

Andres Freund




Re: Add index scan progress to pg_stat_progress_vacuum

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 12:42 PM Imseih (AWS), Sami  wrote:
> >Why isn't the obvious thing to do here to provide a way to associate
> >workers with their leaders in shared memory, but to use the existing
> >progress fields to report progress? Then, when querying progress, the
> >leader and workers progress fields can be combined to show the overall
> >progress?
>
> The original intent was this; however, the workers
> can exit before the command completes and the
> worker progress data will be lost.
> This is why the shared memory was introduced.
> This allows the worker progress to persist for the duration
> of the command.

At the beginning of a parallel operation, we allocate a chunk of
dynamic shared memory which persists even after some or all workers
have exited. It's only torn down at the end of the parallel operation.
That seems like the appropriate place to be storing any kind of data
that needs to be propagated between parallel workers. The current
patch uses the main shared memory segment, which seems unacceptable to
me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-04-05 Thread Robert Haas
On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro  wrote:
> > The checkpointer never takes heavyweight locks, so the opportunity
> > you're describing can't arise.
>
>   Hmm, oh, you probably meant the buffer interlocking
> in SyncOneBuffer().  It's true that my most recent patch throws away
> more requests than it could, by doing the level check at the end of
> the loop over all buffers instead of adding some kind of
> DropPendingWritebacks() in the barrier handler.  I guess I could find
> a way to improve that, basically checking the level more often instead
> of at the end, but I don't know if it's worth it; we're still throwing
> out an arbitrary percentage of writeback requests.

Doesn't every backend have its own set of pending writebacks?
BufferAlloc() calls
ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PostgreSQL 15 Release Management Team (RMT) + Feature Freeze

2022-04-05 Thread Jonathan S. Katz

On 3/26/22 11:10 AM, Jonathan S. Katz wrote:

Additionally, the RMT has set the feature freeze date to be April 7, 
2022. This is the last day to commit features for PostgreSQL 15. In 
other words, no new PostgreSQL 15 feature can be committed after April 8 
0:00, 2022 AoE[1].


[1] https://en.wikipedia.org/wiki/Anywhere_on_Earth


A reminder that feature freeze takes effect in two days.

Best wishes on your outstanding patches!

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-05 Thread Peter Geoghegan
On Mon, Apr 4, 2022 at 8:25 PM Peter Geoghegan  wrote:
> Right. The reason I used WARNINGs was because it matches vaguely
> related WARNINGs in vac_update_relstats()'s sibling function,
> vacuum_set_xid_limits().

Okay, pushed the relfrozenxid warning patch.

Thanks
-- 
Peter Geoghegan




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-04-05 Thread Magnus Hagander
On Tue, Apr 5, 2022 at 5:25 PM Stephen Frost  wrote:

> Greetings,
>
> * David Steele (da...@pgmasters.net) wrote:
> > On 4/4/22 11:42 AM, Nathan Bossart wrote:
> > >I noticed a couple of other things that can be removed.  Since we no
> longer
> > >wait on exclusive backup mode during smart shutdown, we can change
> > >connsAllowed (in postmaster.c) to a boolean and remove CAC_SUPERUSER.
> We
> > >can also remove a couple of related notes in the documentation.  I've
> done
> > >all this in the attached patch.
> >
> > These changes look good to me. IMV it is a real bonus how much the state
> > machine has been simplified.
>
> Yeah, agreed.
>

Definitely.



> > I've also run this patch through the pgbackrest regression tests without
> any
> > problems.
>
> Fantastic.
>
> Please find attached an updated patch + commit message.  Mostly, I just
> went through and did a bit more in terms of updating the documentation
> and improving the comments (there were some places that were still
> worrying about the chance of a 'stray' backup_label file existing, which
> isn't possible any longer), along with some additional testing and
> review.  This is looking pretty good to me, but other thoughts are
> certainly welcome.  Otherwise, I'm hoping to commit this tomorrow.
>

+1. LGTM.

I'm not sure I love the renaming of the functions, but I have also yet to
come up with a better idea for how to avoid silent breakage, so go with it.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Add index scan progress to pg_stat_progress_vacuum

2022-04-05 Thread Imseih (AWS), Sami
>I nevertheless think that's not acceptable. The whole premise of the
>progress reporting infrastructure is to be low overhead. It's OK to require
>locking to initialize parallel progress reporting, it's definitely not ok
>to require locking to report progress.

Fair point.

>Why isn't the obvious thing to do here to provide a way to associate
>workers with their leaders in shared memory, but to use the existing
>progress fields to report progress? Then, when querying progress, the
>leader and workers progress fields can be combined to show the overall
>progress?

The original intent was this; however, the workers
can exit before the command completes and the
worker progress data will be lost.
This is why the shared memory was introduced. 
This allows the worker progress to persist for the duration 
of the command.

Regards, 

Sami Imseih
Amazon Web Services.






Re: How to generate a WAL record spanning multiple WAL files?

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 10:10 AM Andy Fan  wrote:
>> > I wanted to have a WAL record spanning multiple WAL files of size, say
>> > 16MB. I'm wondering if the Full Page Images (FPIs) of a TOAST table
>> > would help here. Please let me know if there's any way to generate
>> > such large WAL records.
>>
>> It's easier to use pg_logical_emit_message().
>
> Not sure I understand the question correctly here.  What if I use the below
> code where the len might be very large?  Like 64MB.
>
>  XLogBeginInsert();
> XLogRegisterData((char *)_append, sizeof(xl_cstore_append));
> XLogRegisterData((char *)data, len);
>
> XLogInsert(..);

Well, that's how to do it from C. And pg_logical_emit_message() is how
to do it from SQL.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] pg_stat_toast v10

2022-04-05 Thread Robert Haas
On Thu, Mar 31, 2022 at 9:16 AM Gunnar "Nick" Bluth
 wrote:
> That was meant to say "v10", sorry!

Hi,

From my point of view, at least, it would be preferable if you'd stop
changing the subject line every time you post a new version.

Based on the test results in
http://postgr.es/m/42bfa680-7998-e7dc-b50e-480cdd986...@pro-open.de
and the comments from Andres in
https://www.postgresql.org/message-id/20211212234113.6rhmqxi5uzgipwx2%40alap3.anarazel.de
my judgement would be that, as things stand today, this patch has no
chance of being accepted, due to overhead. Now, Andres is currently
working on an overhaul of the statistics collector and perhaps that
would reduce the overhead of something like this to an acceptable
level. If it does, that would be great news; I just don't know whether
that's the case.

As far as the statistics themselves are concerned, I am somewhat
skeptical about whether it's really worth adding code for this.
According to the documentation, the purpose of the patch is to allow
you to assess choice of storage and compression method settings for a
column and is not intended to be enabled permanently. However, it
seems to me that you could assess that pretty easily without this
patch: just create a couple of different tables with different
settings, load up the same data via COPY into each one, and see what
happens. Now you might answer that with the patch you would get more
detailed and accurate statistics, and I think that's true, but it
doesn't really look like the additional level of detail would be
critical to have in order to make a proper assessment. You might also
say that creating multiple copies of the table and loading the data
multiple times would be expensive, and that's also true, but you don't
really need to load it all. A representative sample of 1GB or so would
probably suffice in most cases, and that doesn't seem likely to be a
huge load on the system.

Also, as we add more compression options, it's going to be hard to
assess this sort of thing without trying stuff anyway. For example if
you can set the lz4 compression level, you're not going to know which
level is actually going to work best without trying out a bunch of
them and seeing what happens. If we allow access to other sorts of
compression parameters like zstd's "long" option, similarly, if you
really care, you're going to have to try it.

So my feeling is that this feels like a lot of machinery and a lot of
worst-case overhead to solve a problem that's really pretty easy to
solve without any new code at all, and therefore I'd be inclined to
reject it. However, it's a well-known fact that sometimes my feelings
about things are pretty stupid, and this might be one of those times.
If so, I hope someone will enlighten me by telling me what I'm
missing.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-04-05 Thread David Steele

On 4/5/22 11:25 AM, Stephen Frost wrote:


Please find attached an updated patch + commit message.  Mostly, I just
went through and did a bit more in terms of updating the documentation
and improving the comments (there were some places that were still
worrying about the chance of a 'stray' backup_label file existing, which
isn't possible any longer), along with some additional testing and
review.  This is looking pretty good to me, but other thoughts are
certainly welcome.  Otherwise, I'm hoping to commit this tomorrow.


I have reviewed the changes and they look good. I also ran the new patch 
through pgbackrest regression with no issues.


Regards,
--
-David
da...@pgmasters.net




How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?

2022-04-05 Thread Bharath Rupireddy
Hi,

I'm wondering if there's a way in core postgres to achieve $subject. In
reality, the sync/async standbys can either be closer/farther (which
means sync/async standbys can receive WAL at different times) to
primary, especially in cloud HA environments with primary in one
Availability Zone(AZ)/Region and standbys in different AZs/Regions.
$subject may not be possible on dev systems (say, for testing some HA
features) unless we can inject a delay in WAL senders before sending
WAL.

How about having two developer-only GUCs, {async,
sync}_wal_sender_delay? When set, the async and sync WAL senders would
delay sending WAL by {async, sync}_wal_sender_delay
milliseconds. Although I can't think of any immediate use, it
will be useful someday IMO, say for features like [1], if it gets in.
With this set of GUCs, one can even add core regression tests for HA
features.

Thoughts?

[1] 
https://www.postgresql.org/message-id/CALj2ACWCj60g6TzYMbEO07ZhnBGbdCveCrD413udqbRM0O59RA%40mail.gmail.com

Regards,
Bharath Rupireddy.




Re: Separate the result of \watch for each query execution (psql)

2022-04-05 Thread Robert Haas
On Mon, Mar 21, 2022 at 9:15 PM Andres Freund  wrote:
> On 2022-02-25 13:23:31 +0900, Noboru Saito wrote:
> > I have created a patch that allows you to turn it on and off in \pset.
>
> The patch unfortunately causes tests to fail:

It doesn't seem like the originally proposed design here will be
accepted. Perhaps a patch implementing what Tom proposed or some
variant of it will get accepted, but the only patch we have is now
more than 1 month old and has not been updated. I'm accordingly
marking this CommitFest entry "Returned with Feedback." If Noboru
Saito decides to update the patch with some new design, he can change
the status and move the patch forward to a future CommitFest.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: shared-memory based stats collector - v68

2022-04-05 Thread David G. Johnston
On Mon, Apr 4, 2022 at 7:36 PM Andres Freund  wrote:

>
> I think all this is going to achieve is to making code more complicated.
> There
> is a *single* non-assert use of accessed_across_databases and now a single
> assertion involving it.
>
> What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
>

So, I decided to see what this would look like; the results are attached,
portions of it also inlined below.

I'll admit this does introduce a terminology problem - but IMO these words
are much more meaningful to the reader and code than the existing
booleans.  I'm hopeful we can bikeshed something agreeable as I'm strongly
in favor of making this change.

The ability to create defines for subsets nicely resolves the problem that
CLUSTER and DATABASE (now OBJECT, to avoid a DATABASE conflict in PgStat_Kind)
generally belong together - they are now grouped under the DYNAMIC
label (variable, if you want) while all of the fixed entries get associated
with GLOBAL.  Thus, since accessed_across_databases is rare, the majority of
usages end up being either DYNAMIC or GLOBAL.  The presence of any other
category should give one pause.  We could add an ALL define if we ever
decide to consolidate the API - but for now it's largely used to ensure
that stats of one type don't get processed by the other.  The fixed_amount
boolean does that well enough, but this just seems much cleaner and more
understandable to me.  Though having made up the terms and model myself,
that isn't too surprising.

The only existing usage of accessed_across_databases is in the negative
form, which translates to excluding objects, but only those from other
actual databases.

@@ -909,7 +904,7 @@ pgstat_build_snapshot(void)
  */
  if (p->key.dboid != MyDatabaseId &&
  p->key.dboid != InvalidOid &&
- !kind_info->accessed_across_databases)
+ kind_info->kind_group == PGSTAT_OBJECT)
  continue;

The only other usage of something other than GLOBAL or DYNAMIC is the
restriction on the behavior of reset_single_counter, which also has to be
an object in the current database (the latter condition being enforced by
the presence of a valid object oid, I presume).  The replacement for this
below is not behavior-preserving, though I believe we agree the proposed
behavior is correct.

@@ -652,7 +647,7 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
objoid)

- Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+ Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_OBJECT);

Everything else is a straight conversion of !fixed_amount to DYNAMIC
(i.e., CLUSTER+OBJECT)

@@ -728,7 +723,7 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid
objoid)

- AssertArg(!kind_info->fixed_amount);
+ AssertArg(kind_info->kind_group == PGSTAT_DYNAMIC);

and fixed_amount to GLOBAL

@@ -825,7 +820,7 @@ pgstat_get_stat_snapshot_timestamp(bool *have_snapshot)
 bool
 pgstat_exists_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 {
- if (pgstat_kind_info_for(kind)->fixed_amount)
+ if (pgstat_kind_info_for(kind)->kind_group == PGSTAT_GLOBAL)
  return true;

  return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;

David J.


rework-using-enums.diff
Description: Binary data


Re: Printing backtrace of postgres processes

2022-04-05 Thread Robert Haas
On Wed, Mar 30, 2022 at 12:03 PM Justin Pryzby  wrote:
> On Wed, Mar 30, 2022 at 11:53:52AM -0400, Greg Stark wrote:
> > Sadly the cfbot is showing a patch conflict again. It's just a trivial
> > conflict in the regression test schedule so I'm not going to update
> > the status but it would be good to rebase it so we continue to get
> > cfbot testing.
>
> Done.  No changes.

+ if (chk_auxiliary_proc)
+ ereport(WARNING,
+ errmsg("PID %d is not a PostgreSQL server process", pid));
+ else
+ ereport(WARNING,
+ errmsg("PID %d is not a PostgreSQL backend process", pid));

This doesn't look right to me. I don't think that PostgreSQL server
processes are one kind of thing and PostgreSQL backend processes are
another kind of thing. I think they're two terms that are pretty
nearly interchangeable, or maybe at best you want to argue that
backend processes are some subset of server processes. I don't see
this sort of thing adding any clarity.

-static void
+void
 set_backtrace(ErrorData *edata, int num_skip)
 {
  StringInfoData errtrace;
@@ -978,7 +978,18 @@ set_backtrace(ErrorData *edata, int num_skip)
 "backtrace generation is not supported by this installation");
 #endif

- edata->backtrace = errtrace.data;
+ if (edata)
+ edata->backtrace = errtrace.data;
+ else
+ {
+ /*
+ * LOG_SERVER_ONLY is used to make sure the backtrace is never
+ * sent to client. We want to avoid messing up the other client
+ * session.
+ */
+ elog(LOG_SERVER_ONLY, "current backtrace:%s", errtrace.data);
+ pfree(errtrace.data);
+ }
 }

This looks like a grotty hack.

- PGPROC*proc = BackendPidGetProc(pid);
-
- /*
- * BackendPidGetProc returns NULL if the pid isn't valid; but by the time
- * we reach kill(), a process for which we get a valid proc here might
- * have terminated on its own.  There's no way to acquire a lock on an
- * arbitrary process to prevent that. But since so far all the callers of
- * this mechanism involve some request for ending the process anyway, that
- * it might end on its own first is not a problem.
- *
- * Note that proc will also be NULL if the pid refers to an auxiliary
- * process or the postmaster (neither of which can be signaled via
- * pg_signal_backend()).
- */
- if (proc == NULL)
- {
- /*
- * This is just a warning so a loop-through-resultset will not abort
- * if one backend terminated on its own during the run.
- */
- ereport(WARNING,
- (errmsg("PID %d is not a PostgreSQL backend process", pid)));
+ PGPROC  *proc;

+ /* Users can only signal valid backend or an auxiliary process. */
+ proc = CheckPostgresProcessId(pid, false, NULL);
+ if (!proc)
  return SIGNAL_BACKEND_ERROR;
- }

Incidentally changing the behavior of pg_signal_backend() doesn't seem
like a great idea. We can do that as a separate commit, after
considering whether documentation changes are needed. But it's not
something that should get folded into a commit on another topic.

+ /*
+ * BackendPidGetProc() and AuxiliaryPidGetProc() return NULL if the pid
+ * isn't valid; but by the time we reach kill(), a process for which we
+ * get a valid proc here might have terminated on its own.  There's no way
+ * to acquire a lock on an arbitrary process to prevent that. But since
+ * this mechanism is usually used to debug a backend or an auxiliary
+ * process running and consuming lots of memory, that it might end on its
+ * own first without logging the requested info is not a problem.
+ */

This comment made a lot more sense where it used to be than it does
where it is now. I think more work is required here than just cutting
and pasting.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: LogwrtResult contended spinlock

2022-04-05 Thread Alvaro Herrera
Here's a v8, where per my previous comment I removed some code that I
believe is no longer necessary.

I've omitted the patch that renames LogwrtResult subvariables into
LogWriteResult/LogWriteFlush; I still think the end result is better
after that one, but it's a pretty trivial change that can be dealt with
separately afterwards.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)
>From fdca5f32573342d950feefbdd4d64da6dd1cfb34 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 22 Mar 2022 16:49:40 +0100
Subject: [PATCH v8] Split LogwrtResult into separate variables

Since the schedule of updating each portion is independent, it's not
very useful to have them both as a single struct.  Before we used
atomics for the corresponding XLogCtl struct, this made sense because we
could use struct assignment, but that's no longer the case.

Suggested by Andres Freund.
---
 src/backend/access/transam/xlog.c | 99 ++-
 src/include/port/atomics.h| 29 +
 src/tools/pgindent/typedefs.list  |  1 +
 3 files changed, 74 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 17a56152f1..f1cb725bed 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -285,16 +285,13 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * LogWrtResult indicates the byte positions we have already written/fsynced.
+ * These structs are similar but are declared separately to indicate their
+ * slightly different functions; in addition, the latter is read and written
+ * using atomic operations.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, each member of which is separately updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -317,6 +314,12 @@ static bool doPageWrites;
  *--
  */
 
+typedef struct XLogwrtAtomic
+{
+	pg_atomic_uint64 Write;		/* last byte + 1 written out */
+	pg_atomic_uint64 Flush;		/* last byte + 1 flushed */
+} XLogwrtAtomic;
+
 typedef struct XLogwrtRqst
 {
 	XLogRecPtr	Write;			/* last byte + 1 to write out */
@@ -480,6 +483,7 @@ typedef struct XLogCtlData
 {
 	XLogCtlInsert Insert;
 
+	XLogwrtAtomic LogwrtResult; /* uses atomics */
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
 	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
@@ -497,12 +501,6 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
-	XLogwrtResult LogwrtResult;
-
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
 	 *
@@ -622,7 +620,7 @@ static ControlFileData *ControlFile = NULL;
 static int	UsableBytesInSegment;
 
 /*
- * Private, possibly out-of-date copy of shared LogwrtResult.
+ * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult.
  * See discussion above.
  */
 static XLogwrtResult LogwrtResult = {0, 0};
@@ -932,8 +930,6 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
-		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(>info_lck);
 	}
 
@@ -1811,6 +1807,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	 * Now that we have the lock, check if someone initialized the page
 	 * already.
 	 */
+	LogwrtResult.Write = pg_atomic_read_u64(>LogwrtResult.Write);
 	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
 	{
 		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
@@ -1830,17 +1827,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 			if (opportunistic)
 break;
 
-			/* Before waiting, get info_lck and 

Re: Speed up transaction completion faster after many relations are accessed in a transaction

2022-04-05 Thread Yura Sokolov
Good day, David.

I'm looking at the patch and there are a few things I don't get.

`GrantLockLocal` allocates a `LOCALLOCKOWNER` and links it into
`locallock->locallockowners`. It links it even when `owner` is
NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`.
Why shouldn't that fail?

`GrantLockLocal` allocates `LOCALLOCKOWNER` in `TopMemoryContext`,
but there is only a single `pfree(locallockowner)` in `LockReassignOwner`.
It looks like there should be more `pfree` calls. Shouldn't there be?

`GrantLockLocal` does `dlist_push_tail`, but wouldn't it be better to
do `dlist_push_head`? Resource owners usually form a stack, so when an
owner searches for itself it is usually the last one added to the list.
Then `dlist_foreach` would find it sooner if it were added at the head.

regards

-

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com






Re: Proposal for internal Numeric to Uint64 conversion function.

2022-04-05 Thread Robert Haas
On Thu, Mar 17, 2022 at 4:02 PM Tom Lane  wrote:
> Pretty much, yeah.  I'm way more interested in cleaning up the code
> we have than in making things prettier for hypothetical future
> call sites.  In particular, the problem with writing an API in a
> vacuum is that you have little evidence that it's actually useful
> as given (e.g., did you handle error cases in a useful way).  If we
> create a numeric_to_int64 that is actually used right away by some
> existing callers, then we've got some evidence that we did it right;
> and then introducing a parallel numeric_to_uint64 is less of a leap
> of faith.

Based on this review and the fact that there's been no new patch since
the original version, I've marked this Returned with Feedback in the
CommitFest.

If Amul decides to update the patch as Tom is describing, he can
reactivate the CommitFest entry at that time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-04-05 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 4/4/22 11:42 AM, Nathan Bossart wrote:
> >I noticed a couple of other things that can be removed.  Since we no longer
> >wait on exclusive backup mode during smart shutdown, we can change
> >connsAllowed (in postmaster.c) to a boolean and remove CAC_SUPERUSER.  We
> >can also remove a couple of related notes in the documentation.  I've done
> >all this in the attached patch.
> 
> These changes look good to me. IMV it is a real bonus how much the state
> machine has been simplified.

Yeah, agreed.

> I've also run this patch through the pgbackrest regression tests without any
> problems.

Fantastic.

Please find attached an updated patch + commit message.  Mostly, I just
went through and did a bit more in terms of updating the documentation
and improving the comments (there were some places that were still
worrying about the chance of a 'stray' backup_label file existing, which
isn't possible any longer), along with some additional testing and
review.  This is looking pretty good to me, but other thoughts are
certainly welcome.  Otherwise, I'm hoping to commit this tomorrow.

Thanks!

Stephen
From 4e53c8252fb1ee32bf9688b553cf1f5876ab234b Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 4 Apr 2022 13:05:13 -0400
Subject: [PATCH] Remove exclusive backup mode

Exclusive-mode backups have been deprecated since 9.6 (when
non-exclusive backups were introduced) due to the issues
they can cause should the system crash while one is running and
generally because non-exclusive provides a much better interface.
Further, exclusive backup mode wasn't really being tested (nor was most
of the related code - like being able to log in just to stop an exclusive
backup, and the bits of the state machine related to that), and having to
possibly deal with an exclusive backup and the backup_label file
existing during pg_basebackup, pg_rewind, etc, added other complexities
that we are better off without.

This patch removes the exclusive backup mode, the various special cases
for dealing with it, and greatly simplifies the online backup code and
documentation.

Authors: David Steele, Nathan Bossart
Reviewed-by: Chapman Flack
Discussion: https://postgr.es/m/ac7339ca-3718-3c93-929f-99e725d11...@pgmasters.net
https://postgr.es/m/CAHg+QDfiM+WU61tF6=npzocmzvhdzck47kneyb0zrulyzv5...@mail.gmail.com
---
 doc/src/sgml/backup.sgml  | 241 +---
 doc/src/sgml/func.sgml| 111 +---
 doc/src/sgml/high-availability.sgml   |   6 +-
 doc/src/sgml/monitoring.sgml  |   4 +-
 doc/src/sgml/ref/pg_ctl-ref.sgml  |   6 +-
 doc/src/sgml/ref/pgupgrade.sgml   |   2 +-
 doc/src/sgml/runtime.sgml |   8 +-
 src/backend/access/transam/xlog.c | 519 +++---
 src/backend/access/transam/xlogfuncs.c| 253 ++---
 src/backend/access/transam/xlogrecovery.c |  20 +-
 src/backend/catalog/system_functions.sql  |  18 +-
 src/backend/postmaster/postmaster.c   |  75 +--
 src/backend/replication/basebackup.c  |  20 +-
 src/backend/utils/init/postinit.c |  18 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_ctl/pg_ctl.c   |  30 -
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   7 +-
 src/include/catalog/pg_control.h  |   4 +-
 src/include/catalog/pg_proc.dat   |  28 +-
 src/include/libpq/libpq-be.h  |   3 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +-
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 24 files changed, 247 insertions(+), 1200 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..fc2ec68758 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,18 +857,9 @@ test ! -f /mnt/server/archivedir/000100A90065  cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
- concurrent backups to be running (both those started using
- the same backup API and those started using
+ Multiple backups are able to be run concurrently (both those
+ started using this backup API and those started using
  ).
 
 
@@ -881,34 +872,30 @@ test ! -f /mnt/server/archivedir/000100A90065  cp pg_wal/0

 
  Connect to the server (it does not matter which database) as a user with
- rights to run pg_start_backup (superuser, or a 

Re: Mark all GUC variable as PGDLLIMPORT

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 10:07 AM Tom Lane  wrote:
> Yeah, the frontend error message rework in [1].  That has exactly
> the same constraint that it's likely to break other open patches,
> so it'd be better to do it after the CF cutoff.  I think that doing
> that concurrently with Robert's thing shouldn't be too risky, because
> it only affects frontend code while his patch should touch only backend.

So when *exactly* do we want to land these patches? None of the
calendar programs I use support "anywhere on earth" as a time zone,
but I think that feature freeze is 8am on Friday on the East coast of
the United States. If I commit the PGDLLIMPORT change within a few
hours after that time, is that good? Should I try to do it earlier,
before we technically hit 8am? Should I do it the night before, last
thing before I go to bed on Thursday? Do you care whether your commit
or mine goes in first?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 10:58 AM Magnus Hagander  wrote:
>> Makes sense. I will do this soon if nobody objects.
>>
>> I'm mildly uncomfortable with the phrase "WAL records generated over
>> the delay period" because it seems a bit imprecise, but I'm not sure
>> what would be better and I think the meaning is clear.
>
> Maybe "during" instead of "over"? But I'm not sure that's the part you're 
> referring to?

Yeah, something like that, maybe.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-05 Thread Magnus Hagander
On Tue, Apr 5, 2022 at 4:54 PM Robert Haas  wrote:

> On Tue, Apr 5, 2022 at 8:41 AM Thom Brown  wrote:
> > I know it's been 8 years, but I still think it would be a useful note
> > to add to the docs.
>

Many points for bringing that one back :)


> Makes sense. I will do this soon if nobody objects.
>
> I'm mildly uncomfortable with the phrase "WAL records generated over
> the delay period" because it seems a bit imprecise, but I'm not sure
> what would be better and I think the meaning is clear.
>

Maybe "during" instead of "over"? But I'm not sure that's the part you're
referring to?


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 8:41 AM Thom Brown  wrote:
> I know it's been 8 years, but I still think it would be a useful note
> to add to the docs.

Makes sense. I will do this soon if nobody objects.

I'm mildly uncomfortable with the phrase "WAL records generated over
the delay period" because it seems a bit imprecise, but I'm not sure
what would be better and I think the meaning is clear.

Also one of us will need to do s/pg_xlog/pg_wal/.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 10:32 AM Tom Lane  wrote:
> Robert Haas  writes:
> > On Tue, Apr 5, 2022 at 10:17 AM Tom Lane  wrote:
> >> Renaming it would constitute an API break, which is if anything worse
> >> than an ABI break.
>
> > I don't think so, because an API break will cause a compilation
> > failure, which an extension author can easily fix.
>
> My point is that we want that to happen in HEAD, but it's not okay
> for it to happen in a minor release of a stable branch.

I understand, but I am not sure that I agree. I think that if an
extension stops compiling against a back-branch, someone will notice
the next time they try to compile it and will fix it. Maybe that's not
amazing, but I don't think it's a huge deal either. On the other hand,
if existing builds that someone has already shipped stop working with
a new server release, that's a much larger issue. The extension
packager can't go back and retroactively add a dependency on the
server version to the already-shipped package. A new package can be
shipped and specify a minimum minor version with which it will work,
but an old package is what it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-05 Thread Tom Lane
Robert Haas  writes:
> On Tue, Apr 5, 2022 at 10:17 AM Tom Lane  wrote:
>> Renaming it would constitute an API break, which is if anything worse
>> than an ABI break.

> I don't think so, because an API break will cause a compilation
> failure, which an extension author can easily fix.

My point is that we want that to happen in HEAD, but it's not okay
for it to happen in a minor release of a stable branch.

regards, tom lane




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 10:17 AM Tom Lane  wrote:
> Renaming it would constitute an API break, which is if anything worse
> than an ABI break.

I don't think so, because an API break will cause a compilation
failure, which an extension author can easily fix.

> While we're complaining at you, let me point out that changing a field's
> content and semantics while not changing its name is a time bomb waiting
> to break any third-party code that looks at (or modifies...) the field.
>
> What I think you need to do is:
>
> 1. In the back branches, revert delayChkpt to its previous type and
> semantics.  Squeeze a separate delayChkptEnd bool in somewhere
> (you can't change the struct size either ...).
>
> 2. In HEAD, rename the field to something like delayChkptFlags,
> to ensure that any code touching it has to be inspected and updated.

Well, we can do it that way, I suppose.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ExecRTCheckPerms() and many prunable partitions

2022-04-05 Thread Greg Stark
This is failing regression tests. I don't understand how this patch
could be affecting this test though. Perhaps it's a problem with the
json patches that were committed recently -- but they don't seem to be
causing other patches to fail.


diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/jsonb_sqljson.out
/tmp/cirrus-ci-build/src/test/regress/results/jsonb_sqljson.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/jsonb_sqljson.out
2022-04-05 12:15:40.590168291 +
+++ /tmp/cirrus-ci-build/src/test/regress/results/jsonb_sqljson.out
2022-04-05 12:20:17.338045137 +
@@ -1159,37 +1159,37 @@
  );
 \sv jsonb_table_view
 CREATE OR REPLACE VIEW public.jsonb_table_view AS
- SELECT "json_table".id,
-"json_table".id2,
-"json_table"."int",
-"json_table".text,
-"json_table"."char(4)",
-"json_table".bool,
-"json_table"."numeric",
-"json_table".domain,
-"json_table".js,
-"json_table".jb,
-"json_table".jst,
-"json_table".jsc,
-"json_table".jsv,
-"json_table".jsb,
-"json_table".jsbq,
-"json_table".aaa,
-"json_table".aaa1,
-"json_table".exists1,
-"json_table".exists2,
-"json_table".exists3,
-"json_table".js2,
-"json_table".jsb2w,
-"json_table".jsb2q,
-"json_table".ia,
-"json_table".ta,
-"json_table".jba,
-"json_table".a1,
-"json_table".b1,
-"json_table".a11,
-"json_table".a21,
-"json_table".a22
+ SELECT id,
+id2,
+"int",
+text,
+"char(4)",
+bool,
+"numeric",
+domain,
+js,
+jb,
+jst,
+jsc,
+jsv,
+jsb,
+jsbq,
+aaa,
+aaa1,
+exists1,
+exists2,
+exists3,
+js2,
+jsb2w,
+jsb2q,
+ia,
+ta,
+jba,
+a1,
+b1,
+a11,
+a21,
+a22
FROM JSON_TABLE(
 'null'::jsonb, '$[*]'
 PASSING


On Wed, 30 Mar 2022 at 23:16, Amit Langote  wrote:
>
> On Fri, Mar 25, 2022 at 4:46 AM David Rowley  wrote:
> > I had a look at the v10-0001 patch. I agree that it seems to be a good
> > idea to separate out the required permission checks rather than having
> > the Bitmapset to index the interesting range table entries.
>
> Thanks David for taking a look at this.
>
> > One thing that I could just not determine from looking at the patch
> > was if there's meant to be just 1 RelPermissionInfo per RTE or per rel
> > Oid.
>
> It's the latter.
>
> > None of the comments helped me understand this
>
> I agree that the comment above the RelPermissionInfo struct definition
> missed mentioning this really important bit.  I've tried fixing that
> as:
>
> @@ -1142,7 +1142,9 @@ typedef struct RangeTblEntry
>   * Per-relation information for permission checking. Added to the 
> query
>   * by the parser when populating the query range table and 
> subsequently
>   * editorialized on by the rewriter and the planner.  There is an 
> entry
> - * each for all RTE_RELATION entries present in the range table.
> + * each for all RTE_RELATION entries present in the range table, 
> though
> + * different RTEs for the same relation share the
> RelPermissionInfo, that
> + * is, there is only one RelPermissionInfo containing a given relid.
>
> > and
> > MergeRelPermissionInfos() seems to exist that appears to try and
> > uniquify this to some extent.  I just see no code that does this
> > process for a single query level. I've provided more detail on this in
> > #5 below.
> >
> > Here's my complete review of v10-0001:
> >
> > 1. ExecCheckPermisssions -> ExecCheckPermissions
>
> Fixed.
>
> > 2. I think you'll want to move the userid field away from below a
> > comment that claims the following fields are for foreign tables only.
> >
> >   /* Information about foreign tables and foreign joins */
> >   Oid serverid; /* identifies server for the table or join */
> > - Oid userid; /* identifies user to check access as */
> > + Oid userid; /* identifies user to check access as; set
> > + * in non-foreign table relations too! */
>
> Actually, I realized that the comment should not have been touched to
> begin with.  Reverted.
>
> > 3. This should use READ_OID_FIELD()
> >
> > READ_INT_FIELD(checkAsUser);
> >
> > and this one:
> >
> > READ_INT_FIELD(relid);
>
> Fixed.
>
> > 4.  This looks wrong:
> >
> > - rel->userid = rte->checkAsUser;
> > + if (rte->rtekind == RTE_RELATION)
> > + {
> > + /* otherrels use the root parent's value. */
> > + rel->userid = parent ? parent->userid :
> > + GetRelPermissionInfo(root->parse->relpermlist,
> > + rte, false)->checkAsUser;
> > + }
> >
> > If 'parent' is false then you'll assign the result of
> > GetRelPermissionInfo (a RelPermissionInfo *) to an Oid.
>
> Hmm, I don't see a problem, because what's being assigned is
> `GetRelPermissionInfo(...)->checkAsUser`.
>
> Anyway, I rewrote the block more verbosely as:
>
> if (rte->rtekind == RTE_RELATION)
> {
> -   /* otherrels use the root parent's value. */
> -   

Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-05 Thread Tom Lane
Robert Haas  writes:
> On Tue, Apr 5, 2022 at 9:02 AM Markus Wanner
>  wrote:
>> And for this specific case: Is it worth reverting this change and
>> applying a fully backwards compatible fix, instead?

> I think it's normally our policy to avoid changing definitions of
> accessible structs in back branches, except that we allow ourselves
> the indulgence of adding new members at the end or in padding space.
> So what would probably be best is if, in the back-branches, we changed
> "delayChkpt" back to a boolean, renamed it to delayChkptStart, and
> added a separate Boolean called delayChkptEnd. Maybe that could be
> added just after statusFlags, where I think it would fall into padding
> space.

Renaming it would constitute an API break, which is if anything worse
than an ABI break.

While we're complaining at you, let me point out that changing a field's
content and semantics while not changing its name is a time bomb waiting
to break any third-party code that looks at (or modifies...) the field.

What I think you need to do is:

1. In the back branches, revert delayChkpt to its previous type and
semantics.  Squeeze a separate delayChkptEnd bool in somewhere
(you can't change the struct size either ...).

2. In HEAD, rename the field to something like delayChkptFlags,
to ensure that any code touching it has to be inspected and updated.

In other words, this is already an API break in HEAD, and that's
fine, but it didn't break it hard enough to draw anyone's attention,
which is not fine.

regards, tom lane




Re: Mingw task for Cirrus CI

2022-04-05 Thread Andrew Dunstan


On 4/4/22 16:41, Andres Freund wrote:
> Hi,
>
> On 2022-03-30 17:26:18 -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2022-03-22 19:00:42 +0300, Melih Mutlu wrote:
>>> Rebased it.
>>> I also removed the temp installation task and
>>> used NoDefaultCurrentDirectoryInExePath env variable instead.
>> Hm. But you're still using a separate build directory, from what I can see?
>> The NoDefaultCurrentDirectoryInExePath thing should only have an effect when
>> not using a separate build directory, no?
> Melih chatted with me about making this work. Turns out it doesn't readily -
> pg_ctl still fails.
>
>
> The reason that NoDefaultCurrentDirectoryInExePath doesn't suffice to get a
> working in-tree build, is that we break it ourselves:
>
> int
> find_my_exec(const char *argv0, char *retpath)
> ...
> #ifdef WIN32
>   /* Win32 checks the current directory first for names without slashes */
>   join_path_components(retpath, cwd, argv0);
>   if (validate_exec(retpath) == 0)
>   return resolve_symlinks(retpath);
> #endif
>
> So even if windows doesn't actually use the current path, and the current
> pg_ctl process isn't the one from the current directory, we *still* return
> that.
>
> Gah.




I notice a few things about the latest file in this thread.

You should set MSYSTEM=UCRT64 in the environment section. Given that,
there should be no need to specify a --host= setting for configure.

If it's not done already, the shebang line in
/ucrt64/bin/core_perl/prove needs to be modified to use /ucrt64/bin/perl.
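A hedged sketch of how that shebang edit might be scripted. To avoid touching the real file, this demonstrates the rewrite on a scratch copy; the /ucrt64/bin/perl target is the one mentioned above.

```shell
# Demonstrate the shebang rewrite on a scratch stand-in for "prove".
# Assumption: line 1 is a perl shebang pointing at the wrong interpreter.
tmp=$(mktemp)
printf '%s\n' '#!/usr/bin/perl' 'print "ok\n";' > "$tmp"

# Replace only the first line with the UCRT64 perl shebang.
sed -i '1s|^#!.*|#!/ucrt64/bin/perl|' "$tmp"

head -n 1 "$tmp"   # prints: #!/ucrt64/bin/perl
rm -f "$tmp"
```

For the real file the same sed invocation would be pointed at /ucrt64/bin/core_perl/prove.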


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: How to generate a WAL record spanning multiple WAL files?

2022-04-05 Thread Andy Fan
Hi,

On Tue, Apr 5, 2022 at 9:46 PM Alvaro Herrera 
wrote:

> On 2022-Apr-05, Bharath Rupireddy wrote:
>
> > Hi,
> >
> > I wanted to have a WAL record spanning multiple WAL files of size, say
> > 16MB. I'm wondering if the Full Page Images (FPIs) of a TOAST table
> > would help here. Please let me know if there's any way to generate
> > such large WAL records.
>
> It's easier to use pg_logical_emit_message().
>
>
Not sure I understand the question correctly here.  What if I use the
code below, where len might be very large, say 64MB?

XLogBeginInsert();
XLogRegisterData((char *) _append, sizeof(xl_cstore_append));
XLogRegisterData((char *) data, len);

XLogInsert(...);

-- 
Best Regards
Andy Fan


Re: Mark all GUC variable as PGDLLIMPORT

2022-04-05 Thread Tom Lane
Andrew Dunstan  writes:
> On 3/30/22 14:37, Robert Haas wrote:
>> @RMT: Andres proposed upthread that we should plan to do this just
>> after feature freeze. Accordingly I propose to commit at least 0002
>> and perhaps 0001 if people want it just after feature freeze. I
>> therefore ask that the RMT either (a) regard this change as not being
>> a feature (and thus not subject to the freeze) or (b) give it a 1-day
>> extension. The reason for committing it just after freeze is to
>> minimize the number of conflicts that it creates for other patches.
>> The reason why that's probably an OK thing to do is that applying
>> PGDLLIMPORT markings is low-risk.

> WFM. I think Tom also has an item he wants to do right at the end of
> feature freeze.

Yeah, the frontend error message rework in [1].  That has exactly
the same constraint that it's likely to break other open patches,
so it'd be better to do it after the CF cutoff.  I think that doing
that concurrently with Robert's thing shouldn't be too risky, because
it only affects frontend code while his patch should touch only backend.

regards, tom lane

[1] https://commitfest.postgresql.org/37/3574/




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-05 Thread Robert Haas
On Tue, Apr 5, 2022 at 9:02 AM Markus Wanner
 wrote:
> And for this specific case: Is it worth reverting this change and
> applying a fully backwards compatible fix, instead?

I think it's normally our policy to avoid changing definitions of
accessible structs in back branches, except that we allow ourselves
the indulgence of adding new members at the end or in padding space.
So what would probably be best is if, in the back-branches, we changed
"delayChkpt" back to a boolean, renamed it to delayChkptStart, and
added a separate Boolean called delayChkptEnd. Maybe that could be
added just after statusFlags, where I think it would fall into padding
space.

I think as the person who committed that patch I'm on the hook to fix
this if nobody else would like to do it, but let me ask whether
Kyotaro Horiguchi would like to propose a patch, since the original
patch did, and/or whether you would like to propose a patch, as the
person reporting the issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: How to generate a WAL record spanning multiple WAL files?

2022-04-05 Thread Alvaro Herrera
On 2022-Apr-05, Bharath Rupireddy wrote:

> Hi,
> 
> I wanted to have a WAL record spanning multiple WAL files of size, say
> 16MB. I'm wondering if the Full Page Images (FPIs) of a TOAST table
> would help here. Please let me know if there's any way to generate
> such large WAL records.

It's easier to use pg_logical_emit_message().

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
http://archives.postgresql.org/message-id/482d1632.8010...@sigaev.ru




Re: How to generate a WAL record spanning multiple WAL files?

2022-04-05 Thread Matthias van de Meent
On Tue, 5 Apr 2022 at 15:13, Bharath Rupireddy
 wrote:
>
> Hi,
>
> I wanted to have a WAL record spanning multiple WAL files of size, say
> 16MB. I'm wondering if the Full Page Images (FPIs) of a TOAST table
> would help here. Please let me know if there's any way to generate
> such large WAL records.

The function pg_logical_emit_message (callable with REPLICATION
permissions from SQL) allows you to emit records of arbitrary length <
2GB - 2B (for now), which should be enough.
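As a concrete (hedged) illustration of that first option: a single SQL call can emit one logical-message WAL record far larger than a 16MB segment. The prefix string here is arbitrary, and per the note above the call needs REPLICATION permissions.

```sql
-- Emit one ~64MB transactional logical message; the resulting WAL
-- record necessarily spans several 16MB WAL segment files.
SELECT pg_logical_emit_message(true,             -- transactional
                               'wal-span-test',  -- arbitrary prefix
                               repeat('x', 64 * 1024 * 1024));
```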

Other than that, you could try to generate 16MB of subtransaction IDs;
the commit record would contain all subxids and thus be at least 16MB
in size.

-Matthias




Re: Handle infinite recursion in logical replication setup

2022-04-05 Thread Ashutosh Bapat
On Tue, Apr 5, 2022 at 6:21 AM Peter Smith  wrote:
>
> Here are my comments for the latest patch v6-0001.
>
> (I will post my v6-0002 review comments separately)
>
> PATCH v6-0001 comments
> ==
>
> 1.1 General - Option name
>
> I still feel like the option name is not ideal. Unfortunately, this is
> important because any name change would impact lots of these patch
> files and docs, struct members etc.
>
> It was originally called "local_only", but I thought that as a
> SUBSCRIPTION option that was confusing because "local" means local to
> what? Really it is local to the publisher, not local to the
> subscriber, so that name seemed misleading.
>
> Then I suggested "publish_local_only". Although that resolved the
> ambiguity problem, other people thought it seemed odd to have the
> "publish" prefix for a subscription-side option.
>
> So now it is changed again to "subscribe_local_only" -- It's getting
> better but still, it is implied that the "local" means local to the
> publisher except there is nothing in the option name really to convey
> that meaning. IMO we here all understand the meaning of this option
> mostly by familiarity with this discussion thread, but I think a user
> coming to this for the first time will still be confused by the name.
>
> Below are some other name ideas. Maybe they are not improvements, but
> it might help other people to come up with something better.
>
> subscribe_publocal_only = true/false
> origin = pub_only/any
> adjacent_data_only = true/false
> direct_subscriptions_only = true/false
> ...

FWIW, the subscriber wants "changes originated on the publisher".  From
that angle, origin = publisher/any looks attractive.  It also leaves
open the possibility that the subscriber may ask for changes from a set
of origins, or even from non-publisher origins.

-- 
Best Wishes,
Ashutosh Bapat




Re: Mark all GUC variable as PGDLLIMPORT

2022-04-05 Thread Andrew Dunstan


On 3/30/22 14:37, Robert Haas wrote:
> On Fri, Feb 18, 2022 at 7:02 PM Andres Freund  wrote:
>> On 2022-02-15 08:06:58 -0800, Andres Freund wrote:
>>> The more I think about it the more I'm convinced that if we want to do this,
>>> we should do it for variables and functions.
>> Btw, if we were to do this, we should just use -fvisibility=hidden everywhere
>> and would see the same set of failures on unixoid systems as on windows. Of
>> course only in in-core extensions, but it'd still be better than nothing.
> Let's be less ambitious for this release, and just get the variables
> marked with PGDLLIMPORT. We seem to have consensus to create parity
> between Windows and non-Windows builds, which means precisely applying
> PGDLLIMPORT to variables marked in header files, and nothing more. The
> merits of -fvisibility=hidden or PGDLLIMPORT on functions are a
> separate question that can be debated on its own merits, but I don't
> want that larger discussion to bog down this effort. Here are updated
> patches for that.
>
> @RMT: Andres proposed upthread that we should plan to do this just
> after feature freeze. Accordingly I propose to commit at least 0002
> and perhaps 0001 if people want it just after feature freeze. I
> therefore ask that the RMT either (a) regard this change as not being
> a feature (and thus not subject to the freeze) or (b) give it a 1-day
> extension. The reason for committing it just after freeze is to
> minimize the number of conflicts that it creates for other patches.
> The reason why that's probably an OK thing to do is that applying
> PGDLLIMPORT markings is low-risk.
>
> Thanks,
>


WFM. I think Tom also has an item he wants to do right at the end of
feature freeze.


The script looks fine, but it needs a copyright notice and a comment at
the top describing what it does.  It seems like something we might need
to run from time to time, since it will be easy to forget to mark new
variables, so we should periodically run it as a check.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





How to generate a WAL record spanning multiple WAL files?

2022-04-05 Thread Bharath Rupireddy
Hi,

I wanted to have a WAL record spanning multiple WAL files of size, say
16MB. I'm wondering if the Full Page Images (FPIs) of a TOAST table
would help here. Please let me know if there's any way to generate
such large WAL records.

Thoughts?

Regards,
Bharath Rupireddy.




A fastpath for TransactionIdIsCurrentTransactionId

2022-04-05 Thread Andy Fan
Hi:

In my recent work, I wanted to check whether the xmin of all the tuples
is the current transaction ID, and I found we can improve the fast path
of TransactionIdIsCurrentTransactionId as below.  Would it be safe?
This would be helpful when there are lots of subtransaction IDs.

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3596a7d7345..e4721a6cb39 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -935,8 +935,12 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
 	 * Likewise, InvalidTransactionId and FrozenTransactionId are certainly
 	 * not my transaction ID, so we can just return "false" immediately for
 	 * any non-normal XID.
+	 *
+	 * And any transaction IDs that precede TransactionXmin are certainly
+	 * not my transaction ID as well.
 	 */
-	if (!TransactionIdIsNormal(xid))
+	if (TransactionIdPrecedes(xid, TransactionXmin))
 		return false;

 	if (TransactionIdEquals(xid, GetTopTransactionIdIfAny()))

-- 
Best Regards
Andy Fan


API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-05 Thread Markus Wanner



On 24.03.22 20:32, Robert Haas wrote:

Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.


This patch changed the delayChkpt field of struct PGPROC from bool to 
int.  Back-porting this change could be considered an API breaking 
change for extensions using this field.


I'm not certain about padding behavior of compilers in general (or 
standards requirements around that), but at least on my machine, it 
seems sizeof(PGPROC) did not change, so padding led to subsequent fields 
still having the same offset.


Nonetheless, the meaning of the field itself changed.  And the 
additional assert now also triggers for the following pseudo-code of the 
extension I'm concerned about:


/*
 * Prevent checkpoints being emitted in between additional
 * information in the logical message and the following
 * prepare record.
 */
MyProc->delayChkpt = true;

LogLogicalMessage(...);

/* Note that this will also reset the delayChkpt flag. */
PrepareTransaction(...);


Now, I'm well aware this is not an official API, it just happens to be 
accessible for extensions.  So I guess the underlying question is:  What 
can extension developers expect?  Which parts are okay to change even in 
stable branches and which can be relied upon to remain stable?


And for this specific case: Is it worth reverting this change and 
applying a fully backwards compatible fix, instead?


Regards

Markus Wanner




Re: Window Function "Run Conditions"

2022-04-05 Thread Andy Fan
On Tue, Apr 5, 2022 at 7:49 PM David Rowley  wrote:

> On Tue, 5 Apr 2022 at 19:38, Andy Fan  wrote:
> > 1. We can do more on PASSTHROUGH, we just bypass the window function
> > currently,  but IIUC we can ignore all of the following tuples in
> current partition
> > once we go into this mode.  patch 0001 shows what I mean.
>
> Yeah, there is more performance to be had than even what you've done
> there.  There's no reason really for spool_tuples() to do
> tuplestore_puttupleslot() when we're not in run mode.
>

Yeah, this is a great idea.

> The attached should give slightly more performance.  I'm unsure if
> there's more that can be done for window aggregates, i.e.
> eval_windowaggregates()
>
> I'll consider the idea about doing all the filtering in
> nodeWindowAgg.c. For now I made find_window_run_conditions() keep the
> qual so that it's still filtered in the subquery level when there is a
> PARTITION BY clause. Probably the best way would be to make
> nodeWindowAgg.c just loop with a for(;;) loop. I'll need to give it
> more thought. I'll do that in the morning.
>
>
I just finished reviewing the planner part and thought about the
multiple-activeWindows cases.  I think passthrough mode is still
needed, but only for the multiple-activeWindow cases, since in
passthrough mode we cannot discard the tuples in the same partition.
Just that a PARTITION BY clause should not be the requirement for
passthrough mode, and we can still do such optimization.  We can
discuss more after your final decision.

And I would suggest the below fastpath for this feature.

@@ -2535,7 +2535,7 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
					 * if it happens to reference a window function.  If so then
					 * it might be useful to use for the WindowAgg's runCondition.
					 */
-					if (check_and_push_window_quals(subquery, rte, rti, clause))
+					if (!subquery->hasWindowFuncs || check_and_push_window_quals(subquery, rte, rti, clause))
					{
						/*
						 * It's not a suitable window run condition qual or it is,

-- 
Best Regards
Andy Fan

