Re: refactoring relation extension and BufferAlloc(), faster COPY
Hi,

On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote:
> On 2023-Feb-21, Heikki Linnakangas wrote:
> > > +static BlockNumber
> > > +BulkExtendSharedRelationBuffered(Relation rel,
> > > +                                 SMgrRelation smgr,
> > > +                                 bool skip_extension_lock,
> > > +                                 char relpersistence,
> > > +                                 ForkNumber fork, ReadBufferMode mode,
> > > +                                 BufferAccessStrategy strategy,
> > > +                                 uint32 *num_pages,
> > > +                                 uint32 num_locked_pages,
> > > +                                 Buffer *buffers)
> >
> > Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> > don't have any concrete suggestions, but could we simplify this somehow?
> > Needs a comment at least.
>
> Yeah, I noticed this too. I think it would be easy enough to add a new
> struct that can be passed as a pointer, which can be stack-allocated by
> the caller, and which holds the input arguments that are common to both
> functions, as is sensible.

I played a fair bit with various options. I ended up not using a struct to
pass most options, but instead went for a flags argument. However, I did use
a struct for passing either relation or smgr.

typedef enum ExtendBufferedFlags {
	EB_SKIP_EXTENSION_LOCK = (1 << 0),
	EB_IN_RECOVERY = (1 << 1),
	EB_CREATE_FORK_IF_NEEDED = (1 << 2),
	EB_LOCK_FIRST = (1 << 3),

	/* internal flags follow */
	EB_RELEASE_PINS = (1 << 4),
} ExtendBufferedFlags;

/*
 * To identify the relation - either relation or smgr + relpersistence has to
 * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
 * to use the same function for both crash recovery and normal operation.
 */
typedef struct ExtendBufferedWhat
{
	Relation	rel;
	struct SMgrRelationData *smgr;
	char		relpersistence;
} ExtendBufferedWhat;

#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
/* requires use of EB_SKIP_EXTENSION_LOCK */
#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})

extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
								ForkNumber forkNum,
								BufferAccessStrategy strategy,
								uint32 flags);
extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
									   ForkNumber fork,
									   BufferAccessStrategy strategy,
									   uint32 flags,
									   uint32 extend_by,
									   Buffer *buffers,
									   uint32 *extended_by);
extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
								  ForkNumber fork,
								  BufferAccessStrategy strategy,
								  uint32 flags,
								  BlockNumber extend_to,
								  ReadBufferMode mode);

As you can see, I removed ReadBufferMode from most of the functions (as
suggested by Heikki earlier). When extending by 1/multiple pages, we only
need to know whether to lock or not.

The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
falling back to reading the page normally if there was a concurrent relation
extension.

The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated,
gnarly, code to do so from vm_extend(), fsm_extend().

I'm not sure about the function naming pattern. I do like 'By' a lot more
than the Bulk prefix I used before.

What do you think?

Greetings,

Andres Freund
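As an illustration of the proposed API, a caller of ExtendBufferedRelBy()
might look roughly like the sketch below. Nothing here is taken from the
patch: the caller context, the block count, and the assumption that "rel" is
an already-opened Relation are all hypothetical; only the function signature
and the EB_REL() macro come from the declarations quoted above.

	Buffer		buffers[16];
	uint32		extended_by = 0;
	BlockNumber first_block;

	/*
	 * Extend rel's main fork by up to 16 blocks in one call.  EB_REL() wraps
	 * the Relation into the ExtendBufferedWhat argument.  The number of
	 * blocks actually added comes back in extended_by, and the returned
	 * buffers are pinned.
	 */
	first_block = ExtendBufferedRelBy(EB_REL(rel), MAIN_FORKNUM,
									  NULL,		/* no strategy */
									  0,		/* no flags */
									  16,
									  buffers,
									  &extended_by);

	/* use buffers[0 .. extended_by - 1], then release the pins as usual */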
Combine pg_walinspect till_end_of_wal functions with others
Hi,

In a recent discussion [1], Michael Paquier asked if we can combine
pg_walinspect till_end_of_wal functions with the other functions
pg_get_wal_records_info and pg_get_wal_stats. The code currently looks much
duplicated and the number of functions that pg_walinspect exposes to the
users is bloated. The point was that the till_end_of_wal functions determine
the end LSN and everything else that they do is the same as their counterpart
functions. Well, the idea then was to keep things simple, not clutter the
APIs, and have better and consistent validation of the user-supplied end_lsn,
at the cost of usability and code redundancy. However, now I tend to agree
with the feedback received.

I'm attaching a patch doing the $subject with the following behavior:

1. If start_lsn is NULL, error out/return NULL.
2. If end_lsn isn't specified, default to NULL, then determine the end_lsn.
3. If end_lsn is specified as NULL, then determine the end_lsn.
4. If end_lsn is specified as non-NULL, then check that it is greater than
   start_lsn; if yes, go ahead and do the job, otherwise error out.

Another idea is to convert the till_end_of_wal flavors to SQL-only functions
and remove the C code from pg_walinspect.c. However, I prefer $subject and
completely removing the till_end_of_wal flavors for better usability in the
long term.

Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACV-WBN%3DEUgUPyYOGitp%2Brn163vMnQd%3DHcWrnKt-uqFYFA%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 156e1d9148af79b905095b627e2db7919e8489ca Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Sun, 26 Feb 2023 13:34:33 +
Subject: [PATCH v1] Reduce pg_walinspect's functions without losing
 functionality

---
 contrib/pg_walinspect/Makefile                |   6 +-
 .../pg_walinspect/expected/oldextversions.out |  64 +++
 .../pg_walinspect/expected/pg_walinspect.out  |  32 +++-
 contrib/pg_walinspect/meson.build             |   4 +-
 .../pg_walinspect/pg_walinspect--1.1--1.2.sql |  78 
 contrib/pg_walinspect/pg_walinspect.c         | 174 +-
 contrib/pg_walinspect/pg_walinspect.control   |   2 +-
 contrib/pg_walinspect/sql/oldextversions.sql  |  27 +++
 contrib/pg_walinspect/sql/pg_walinspect.sql   |  28 ++-
 doc/src/sgml/pgwalinspect.sgml                |  54 ++
 10 files changed, 336 insertions(+), 133 deletions(-)
 create mode 100644 contrib/pg_walinspect/expected/oldextversions.out
 create mode 100644 contrib/pg_walinspect/pg_walinspect--1.1--1.2.sql
 create mode 100644 contrib/pg_walinspect/sql/oldextversions.sql

diff --git a/contrib/pg_walinspect/Makefile b/contrib/pg_walinspect/Makefile
index 7033878a79..9527a0b268 100644
--- a/contrib/pg_walinspect/Makefile
+++ b/contrib/pg_walinspect/Makefile
@@ -7,9 +7,11 @@ OBJS = \
 PGFILEDESC = "pg_walinspect - functions to inspect contents of PostgreSQL Write-Ahead Log"
 EXTENSION = pg_walinspect
-DATA = pg_walinspect--1.0.sql pg_walinspect--1.0--1.1.sql
+DATA = pg_walinspect--1.0.sql \
+	pg_walinspect--1.0--1.1.sql \
+	pg_walinspect--1.1--1.2.sql
-REGRESS = pg_walinspect
+REGRESS = pg_walinspect oldextversions
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_walinspect/walinspect.conf
diff --git a/contrib/pg_walinspect/expected/oldextversions.out b/contrib/pg_walinspect/expected/oldextversions.out
new file mode 100644
index 00..65963c77f4
--- /dev/null
+++ b/contrib/pg_walinspect/expected/oldextversions.out
@@ -0,0 +1,64 @@
+-- test old extension version entry points
+DROP EXTENSION pg_walinspect;
+CREATE EXTENSION pg_walinspect VERSION '1.1';
+-- New function introduced in 1.1
+SELECT pg_get_functiondef('pg_get_wal_fpi_info'::regproc);
+ pg_get_functiondef
+---
+ CREATE OR REPLACE FUNCTION public.pg_get_wal_fpi_info(start_lsn pg_lsn, end_lsn pg_lsn, OUT lsn pg_lsn, OUT reltablespace oid, OUT reldatabase oid, OUT relfilenode oid, OUT relblocknumber bigint, OUT forkname text, OUT fpi bytea)+
+  RETURNS SETOF record+
+  LANGUAGE c +
+  PARALLEL SAFE STRICT
Re: Add LZ4 compression in pg_dump
On Tue, Feb 28, 2023 at 05:58:34PM -0600, Justin Pryzby wrote:
> I found that e9960732a broke writing of empty gzip-compressed data,
> specifically LOs.  pg_dump succeeds, but then the restore fails:

The number of issues you have been reporting here begins to worry me.
How many of them have you found?  Is it right to assume that all of them
have basically 03d02f5 as their oldest origin point?
--
Michael
Re: meson: Add equivalent of configure --disable-rpath option
On 22.02.23 13:56, Peter Eisentraut wrote:
> The configure option --disable-rpath currently has no equivalent in
> meson.  This option is used by packagers, so I think it would be good
> to have it in meson as well.  I came up with the attached patch.

committed
Re: Track Oldest Initialized WAL Buffer Page
On Wed, Mar 1, 2023 at 9:49 AM Nathan Bossart wrote:
>
> On Tue, Feb 28, 2023 at 11:12:29AM +0530, Bharath Rupireddy wrote:
> > On Tue, Feb 28, 2023 at 5:52 AM Nathan Bossart wrote:
> >> It's confusing to me that OldestInitializedPage is set to NewPageBeginPtr.
> >> Doesn't that set it to the beginning of the newest initialized page?
> >
> > Yes, that's the intention, see below. OldestInitializedPage points to
> > the start address of the oldest initialized page whereas the
> > InitializedUpTo points to the end address of the latest initialized
> > page. With this, one can easily track all the WAL between
> > OldestInitializedPage and InitializedUpTo.
>
> This is where I'm confused.  Why would we set the variable for the start
> address of the _oldest_ initialized page to the start address of the
> _newest_ initialized page?  I must be missing something obvious, so sorry
> if this is a silly question.

That's the crux of the patch. Let me clarify it a bit.

Firstly, we try to set OldestInitializedPage at the end of recovery, but
that's conditional, that is, only when the last replayed WAL record spans
partially to the end block.

Secondly, we set OldestInitializedPage while initializing the page for the
first time, so the missed-conditional case above gets covered too.

And, OldestInitializedPage isn't updated for every newly initialized page,
only when the previous OldestInitializedPage is being reused, i.e. the
wal_buffers are full and it wraps around. Please see the comment and the
condition XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx,
which holds true if we're crossing over/wrapping around the previous
OldestInitializedPage.

+        /*
+         * Try updating oldest initialized XLog buffer page.
+         *
+         * Update it if we are initializing an XLog buffer page for the first
+         * time or if XLog buffers are full and we are wrapping around.
+         */
+        if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
+            XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
+        {
+            Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
+
+            XLogCtl->OldestInitializedPage = NewPageBeginPtr;
+        }

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
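For illustration, the invariant the two pointers are meant to provide can be
pictured from a hypothetical reader's perspective. The check below is not
part of the patch: XLogCtl->pages, XLogRecPtrToBufIdx() and InitializedUpTo
are existing xlog.c code, OldestInitializedPage is the patch's new field, and
the surrounding caller (targetPagePtr, concurrency handling) is assumed.

	/*
	 * Hypothetical reader-side check: all WAL in the range
	 * [OldestInitializedPage, InitializedUpTo) still has an up-to-date copy
	 * in wal_buffers, so a page starting at targetPagePtr could be served
	 * from XLogCtl->pages instead of going to disk.
	 */
	if (targetPagePtr >= XLogCtl->OldestInitializedPage &&
		targetPagePtr < XLogCtl->InitializedUpTo)
	{
		int			idx = XLogRecPtrToBufIdx(targetPagePtr);
		char	   *page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;

		/* copy from "page", guarding against concurrent buffer reuse */
	}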
Re: pg_upgrade and logical replication
On Wed, Mar 01, 2023 at 11:51:49AM +0530, Amit Kapila wrote:
> On Tue, Feb 28, 2023 at 10:18 AM Julien Rouhaud wrote:
> >
> > Well, as I mentioned I'm *not* interested in a logical-replication-only
> > scenario.  Logical replication is nice but it will always be less efficient
> > than physical replication, and some workloads also don't really play well
> > with it.  So while it can be a huge asset in some cases I'm for now looking
> > at leveraging logical replication for the purpose of major upgrade only for
> > a physical replication cluster, so the publications and subscriptions are
> > only temporary and trashed after use.
> >
> > That being said I was only saying that if I had to do a major upgrade of a
> > logical replication cluster this is probably how I would try to do it, to
> > minimize downtime, even if there are probably *a lot* difficulties to
> > overcome.
>
> Okay, but it would be better if you list out your detailed steps.  It
> would be useful to support the new mechanism in this area if others
> also find your steps to upgrade useful.

Sure. Here are the overly detailed steps:

1) setup a normal physical replication cluster (pg_basebackup, restoring
   PITR, whatever), let's call the primary node "A" and replica node "B"
2) ensure WAL level is "logical" on the primary node A
3) create a logical replication slot on every (connectable) database (or just
   the one you're interested in if you don't want to preserve everything) on A
4) create a FOR ALL TABLES publication (again for every database or just the
   one you're interested in)
5) wait for replication to be reasonably if not entirely up to date
6) promote the standby node B
7) retrieve the promotion LSN (from the .history file,
   pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()...)
8) call pg_replication_slot_advance() with that LSN for all previously
   created logical replication slots on A
9) create a normal subscription on all wanted databases on the promoted node
10) wait for it to catch up if needed on B
12) stop the node B
13) run pg_upgrade on B, creating the new node C
14) start C, run the global ANALYZE and any sanity check needed (hopefully
    you would have validated that your application is compatible with that
    new version before this point)
15) re-enable the subscription on C.  This is currently not possible without
    losing data; the patch fixes that
16) wait for it to catch up if needed
17) create any missing relation and do the ALTER SUBSCRIPTION ... REFRESH if
    needed
18) trash B
19) create new nodes D, E... as physical replicas from C if needed, possibly
    using a cheaper approach like pg_start_backup() / rsync / pg_stop_backup()
    if needed
20) switchover to C and trash A (or convert it to another replica if you want)
21) trash the publications on C on all databases

As noted, step 15 is currently problematic, and is also problematic in any
variation of that scenario that doesn't require you to entirely recreate the
node C from scratch using logical replication, which is what I want to avoid.

This isn't terribly complicated but it requires one to be really careful if
you don't want to end up with an incorrect node C. This approach is also
currently not entirely ideal, but hopefully logical replication of sequences
and DDL will remove the main sources of downtime when upgrading using logical
replication.

My ultimate goal is to provide some tooling to do that in a much simpler way.
Maybe a new "promote to logical" action that would take care of steps 2 to 9.
Users would therefore only have to do this "promotion to logical", and then run pg_upgrade and create a new physical replication cluster if they want.
Re: Commitfest 2023-03 starting tomorrow!
On Tue, Feb 28, 2023 at 01:45:27PM -0500, Greg Stark wrote:
> So I'm not sure if I'll be CFM this month but I'm assuming I will be
> at this point

Okay, that's OK for me!  Thanks for helping out.

> The next pass would be to grab any patches not marked Ready for
> Committer and if they look like they'll need more than one round of
> feedback and a couple weeks to polish they'll probably get bounced to
> the next commitfest too. It sucks not getting feedback on your patches
> for so long but there are really just sooo many patches and so few
> eyeballs... It would be great if people could do initial reviews of
> these patches before we bounce them because it really is discouraging
> for developers to send patches and not get feedback. But realistically
> it's going to happen to a lot of patches.

I don't have many patches registered this time for the sole reason of being
able to spend more cycles on reviews and see what could make the cut.  So
we'll see how it goes, I guess..

The CF would begin in more or less 5 hours as of the moment of this message:
https://www.timeanddate.com/time/zones/aoe
--
Michael
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada wrote: > > On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada wrote: > > > > On Tue, Feb 28, 2023 at 3:42 PM John Naylor > > wrote: > > > > > > > > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada < sawada.m...@gmail.com> wrote: > > > > > > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor > > > > wrote: > > > > > > > > > > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed. > > > > > > > My idea is to make the bug investigation easier but on > > > > reflection, it seems not the best idea given this purpose. > > > > > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread. > > > > > > Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself. > > > > > > > What do you think > > > > about the attached patch? Please note that it also includes the > > > > changes for minimum memory requirement. > > > > > > Most of the asserts look logical, or at least harmless. > > > > > > - int max_off; /* the maximum offset number */ > > > + OffsetNumber max_off; /* the maximum offset number */ > > > > > > I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose. > > > > Right. I'll separate this change as a separate patch. > > > > > > > > This change, however, defies common sense: > > > > > > +/* > > > + * The minimum amount of memory required by TidStore is 2MB, the current minimum > > > + * valid value for the maintenance_work_mem GUC. This is required to allocate the > > > + * DSA initial segment, 1MB, and some meta data. This number is applied also to > > > + * the local TidStore cases for simplicity. > > > + */ > > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */ > > > > > > + /* Sanity check for the max_bytes */ > > > + if (max_bytes < TIDSTORE_MIN_MEMORY) > > > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", > > > + TIDSTORE_MIN_MEMORY, max_bytes); > > > > > > Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place. > > > > > > This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem. > > > > > > But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem. > > > > Right. > > > > > It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?). 
> >
> > IIUC both don't care about the allocated DSA segment size. Parallel
> > hash accounts actual tuple (+ header) size as used memory but doesn't
> > consider how much DSA segment is allocated behind. Both parallel hash
> > and parallel bitmap scan can work even with work_mem = 64kB, but when
> > checking the total DSA segment size allocated during these operations,
> > it was 1MB.
> >
> > I realized that there is a similar memory limit design issue also on
> > the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> > won't work fine with work_mem = 64kB. Probably we need to reconsider
> > it. FYI 70kB comes from the maximum slab block size for node256.
>
> Currently, we calculate the slab block size large enough to allocate 32
> chunks from there. For node256, the leaf node is 2,088 bytes and the
> slab block size is 66,816 bytes. One idea to fix this issue is to
> decrease it.

I think we're trying to solve the wrong problem here. I need to study this
more, but it seems that code that needs to stay within a memory limit only
needs to track what's been allocated in chunks within a block, since writing
there is what invokes a page fault. If we're not keeping track of each and
every chunk space, for speed, it doesn't follow that we need to keep every
block allocation within the configured limit. I'm guessing we can just ask
the context if the block space has gone *over* the limit, and we can assume
that the last
Re: LOG: invalid record length at : wanted 24, got 0
On Wed, Mar 1, 2023 at 10:51 AM Harinath Kanchu wrote:
>
> Hello,
>
> We are seeing an interesting STANDBY behavior, that’s happening once in 3-4
> days.
>
> The standby suddenly disconnects from the primary, and it throws the error
> “LOG: invalid record length at : wanted 24, got 0”.

Firstly, this isn't an error per se, especially for a standby, as it can
get/retry the same WAL record from other sources. It's a bit hard to say
anything further just by looking at this LOG message; one needs to look at
what's happening around the same time. You mentioned that the connection to
the primary was lost, so you need to dive deep as to why it got lost. If the
connection was lost half-way through fetching the WAL record, the standby may
emit such a LOG message. Secondly, you definitely need to understand why the
connection to the primary keeps getting lost - network disruption, parameter
changes, the primary going down, the standby going down, etc.?

> And then it tries to restore the WAL file from the archive. Due to low write
> activity on primary, the WAL file will be switched and archived only after
> 1 hr.
>
> So, it stuck in a loop of switching the WAL sources from STREAM and ARCHIVE
> without replicating the primary.
>
> Due to this there will be write outage as the standby is synchronous standby.

I understand this problem and there's a proposed patch to help with this -
https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3njkvzt8q...@mail.gmail.com.
It basically allows one to set a timeout for how long the standby restores
from the archive before switching to streaming. Therefore, in your case, the
standby doesn't have to wait for 1 hr to connect to the primary, but it can
connect before that.

> We are using “wal_sync_method” as “fsync” assuming WAL file not getting
> flushed correctly.
>
> But this is happening even after making it as “fsync” instead of “fdatasync”.

I don't think that's a problem, unless wal_sync_method was changed to
something else in between.

> Restarting the STANDBY sometimes fixes this problem, but detecting this
> automatically is a big problem as the postgres standby process will be still
> running fine, but WAL RECEIVER process is up and down continuously due to
> switching of WAL sources.

Yes, after a failure to connect to the primary, the standby switches to the
archive and stays there until it exhausts all the WAL from the archive, and
then switches back to streaming. You can monitor the standby's replication
slot on the primary; if it's inactive, then one needs to jump in. As mentioned
above, there's an in-progress feature that helps in these cases.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: add PROCESS_MAIN to VACUUM
On Thu, Jan 19, 2023 at 11:08:07AM -0800, Nathan Bossart wrote:
> rebased

PROCESS_TOAST has that:

    /* sanity check for PROCESS_TOAST */
    if ((params->options & VACOPT_FULL) != 0 &&
        (params->options & VACOPT_PROCESS_TOAST) == 0)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("PROCESS_TOAST required with VACUUM FULL")));
[...]
-    if (params->options & VACOPT_FULL)
+    if (params->options & VACOPT_FULL &&
+        params->options & VACOPT_PROCESS_MAIN)
     {

Shouldn't we apply the same rule for PROCESS_MAIN?  One of the regression
tests added means that FULL takes priority over PROCESS_MAIN=FALSE, which is
a bit confusing IMO.

@@ -190,6 +190,7 @@ typedef struct VacAttrStats
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x80   /* don't skip any pages */
 #define VACOPT_SKIP_DATABASE_STATS 0x100    /* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x200    /* only vac_update_datfrozenxid() */
+#define VACOPT_PROCESS_MAIN 0x400           /* process main relation */

Perhaps the options had better be reorganized so that PROCESS_MAIN is just
before PROCESS_TOAST?

+-- PROCESS_MAIN option
+VACUUM (PROCESS_MAIN FALSE) vactst;
+VACUUM (PROCESS_MAIN FALSE, PROCESS_TOAST FALSE) vactst;
+VACUUM (PROCESS_MAIN FALSE, FULL) vactst;

Thinking a bit here.  This set of tests does not make sure that the main
relation and/or the toast relation have been actually processed.
pg_stat_user_tables does not track what's happening on the toast relations.
So...  What about adding some tests in 100_vacuumdb.pl that rely on vacuumdb
--verbose and check the logs produced?  We should make sure that the toast
or the main relation are processed, by tracking, for example, logs like
vacuuming "schema.table".  When FULL is involved, we may want to track the
changes on relfilenodes depending on what's wanted.
--
Michael
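For reference, the sanity check being suggested would presumably mirror the
quoted PROCESS_TOAST one. The following is only a sketch of that suggestion,
not code from the patch, and not necessarily the behavior that ends up being
committed:

	/* hypothetical sanity check for PROCESS_MAIN, mirroring PROCESS_TOAST */
	if ((params->options & VACOPT_FULL) != 0 &&
		(params->options & VACOPT_PROCESS_MAIN) == 0)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("PROCESS_MAIN required with VACUUM FULL")));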
Re: pg_upgrade and logical replication
On Tue, Feb 28, 2023 at 10:18 AM Julien Rouhaud wrote: > > On Tue, Feb 28, 2023 at 08:56:37AM +0530, Amit Kapila wrote: > > On Tue, Feb 28, 2023 at 7:55 AM Julien Rouhaud wrote: > > > > > > > > > The scenario I'm interested in is to rely on logical replication only for > > > the > > > upgrade, so the end state (and start state) is to go back to physical > > > replication. In that case, I would just create new physical replica from > > > the > > > pg_upgrade'd server and failover to that node, or rsync the previous > > > publisher > > > node to make it a physical replica. > > > > > > But even if you want to only rely on logical replication, I'm not sure > > > why you > > > would want to keep the publisher node as a publisher node? I think that > > > doing > > > it this way will lead to a longer downtime compared to doing a failover > > > on the > > > pg_upgrade'd node, make it a publisher and then move the former publisher > > > node > > > to a subscriber. > > > > > > > I am not sure if this is usually everyone follows because it sounds > > like a lot of work to me. IIUC, to achieve this, one needs to recreate > > all the publications and subscriptions after changing the roles of > > publisher and subscriber. Can you please write steps to show exactly > > what you have in mind to avoid any misunderstanding? > > Well, as I mentioned I'm *not* interested in a logical-replication-only > scenario. Logical replication is nice but it will always be less efficient > than physical replication, and some workloads also don't really play well with > it. So while it can be a huge asset in some cases I'm for now looking at > leveraging logical replication for the purpose of major upgrade only for a > physical replication cluster, so the publications and subscriptions are only > temporary and trashed after use. > > That being said I was only saying that if I had to do a major upgrade of a > logical replication cluster this is probably how I would try to do it, to > minimize downtime, even if there are probably *a lot* difficulties to > overcome. > Okay, but it would be better if you list out your detailed steps. It would be useful to support the new mechanism in this area if others also find your steps to upgrade useful. -- With Regards, Amit Kapila.
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Mar 1, 2023 at 10:57 AM Masahiko Sawada wrote: > > On Wed, Mar 1, 2023 at 1:55 PM Amit Kapila wrote: > > > > > Won't a malicious user can block the > > replication in other ways as well and let the publisher stall (or > > crash the publisher) even without setting min_send_delay? Basically, > > one needs to either disable the subscription or create a > > constraint-violating row in the table to make that happen. If the > > system is exposed for arbitrarily allowing the creation of a > > subscription then a malicious user can create a subscription similar > > to one existing subscription and block the replication due to > > constraint violations. I don't think it would be so easy to bypass the > > current system that a malicious user will be allowed to create/alter > > subscriptions arbitrarily. > > Right. But a difference is that with min_send_delay, it's just to > create a subscription. > But, currently, only superusers would be allowed to create subscriptions. Even, if we change it and allow it based on some pre-defined role, it won't be allowed to create a subscription arbitrarily. So, not sure, if any malicious user can easily bypass it as you are envisioning it. -- With Regards, Amit Kapila.
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Mar 1, 2023 at 1:55 PM Amit Kapila wrote: > > On Wed, Mar 1, 2023 at 8:18 AM Masahiko Sawada wrote: > > > > On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu) > > wrote: > > > > Thinking of side effects of this feature (no matter where we delay > > applying the changes), on the publisher, vacuum cannot collect garbage > > and WAL cannot be recycled. Is that okay in the first place? The point > > is that the subscription setting affects the publisher. That is, > > min_send_delay is specified on the subscriber but the symptoms that > > could ultimately lead to a server crash appear on the publisher, which > > sounds dangerous to me. > > > > Imagine a service or system like where there is a publication server > > and it's somewhat exposed so that a user (or a subsystem) arbitrarily > > can create a subscriber to replicate a subset of the data. A malicious > > user can have the publisher crash by creating a subscription with, > > say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation > > but it's -1 by default. > > > > By publisher crash, do you mean due to the disk full situation, it can > lead the publisher to stop/panic? Exactly. > Won't a malicious user can block the > replication in other ways as well and let the publisher stall (or > crash the publisher) even without setting min_send_delay? Basically, > one needs to either disable the subscription or create a > constraint-violating row in the table to make that happen. If the > system is exposed for arbitrarily allowing the creation of a > subscription then a malicious user can create a subscription similar > to one existing subscription and block the replication due to > constraint violations. I don't think it would be so easy to bypass the > current system that a malicious user will be allowed to create/alter > subscriptions arbitrarily. Right. But a difference is that with min_send_delay, it's just to create a subscription. > Similarly, if there is a network issue > (unreachable or slow), one will see similar symptoms. I think > retention of data and WAL on publisher do rely on acknowledgment from > subscribers and delay in that due to any reason can lead to the > symptoms you describe above. I think that piling up WAL files due to a slow network is a different story since it's a problem not only on the subscriber side. > We have documented at least one such case > already where during Drop Subscription, if the network is not > reachable then also, a similar problem can happen and users need to be > careful about it [1]. Apart from a bad-use case example I mentioned, in general, piling up WAL files due to the replication slot has many bad effects on the system. I'm concerned that the side effect of this feature (at least of the current design) is too huge compared to the benefit, and afraid that users might end up using this feature without understanding the side effect well. It might be okay if we thoroughly document it but I'm not sure. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
LOG: invalid record length at : wanted 24, got 0
Hello,

We are seeing an interesting STANDBY behavior that’s happening once every
3-4 days.

The standby suddenly disconnects from the primary, and it throws the error
“LOG: invalid record length at : wanted 24, got 0”. And then it tries to
restore the WAL file from the archive. Due to low write activity on the
primary, the WAL file will be switched and archived only after 1 hr.

So, it gets stuck in a loop of switching the WAL sources between STREAM and
ARCHIVE without replicating the primary. Due to this there will be a write
outage, as the standby is a synchronous standby.

We are using “wal_sync_method” as “fsync”, suspecting that the WAL file is
not getting flushed correctly. But this is happening even after making it
“fsync” instead of “fdatasync”.

Restarting the STANDBY sometimes fixes this problem, but detecting this
automatically is a big problem as the postgres standby process will still be
running fine, but the WAL RECEIVER process is up and down continuously due to
the switching of WAL sources.

How can we fix this? Any suggestions regarding this will be appreciated.

Postgres Version: 13.6
OS: RHEL Linux

Thank you,
Best,
Harinath.
Re: Support logical replication of global object commands
On Fri, Feb 17, 2023 at 4:48 AM Amit Kapila wrote:
>
> On Fri, Feb 17, 2023 at 10:58 AM Zheng Li wrote:
> >
> > > > > Actually, I intend something for global objects. But the main thing
> > > > > that is worrying me about this is that we don't have a clean way to
> > > > > untie global object replication from database-specific object
> > > > > replication.
> > > >
> > > > I think ultimately we need a clean and efficient way to publish (and
> > > > subscribe to) any changes in all databases, preferably in one logical
> > > > replication slot.
> > > >
> > >
> > > Agreed. I was thinking currently for logical replication both
> > > walsender and slot are database-specific. So we need a way to
> > > distinguish the WAL for global objects and then avoid filtering based
> > > on the slot's database during decoding.
> >
> > But which WALSender should handle the WAL for global objects if we
> > don't filter by database? Is there any specific problem you see for
> > decoding global objects commands in a database specific WALSender?
> >
>
> I haven't verified but I was concerned about the below check:
> logicalddl_decode
> {
> ...
> +
> +    if (message->dbId != ctx->slot->data.database ||

OK, let's suppose we don't filter by database for global commands when
decoding ddl records, roughly what the following code does:

logicalddl_decode
{
...
    if (message->dbId != ctx->slot->data.database ||
+       message->cmdtype != DCT_GlobalObjectCmd

But this is not enough, we also need the subsequent commit record of the txn
to be decoded in order to replicate the global command. So I think we also
need to make DecodeCommit bypass the filter by database if global object
replication is turned on and we have decoded a global command in the txn.

Regards,
Zane
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Tue, Feb 28, 2023 at 9:21 PM Hayato Kuroda (Fujitsu) wrote:
>
> > 1.
> > + /*
> > + * If we've requested to shut down, exit the process.
> > + *
> > + * Note that WalSndDone() cannot be used here because the delaying
> > + * changes will be sent in the function.
> > + */
> > + if (got_STOPPING)
> > + {
> > + QueryCompletion qc;
> > +
> > + /* Inform the standby that XLOG streaming is done */
> > + SetQueryCompletion(&qc, CMDTAG_COPY, 0);
> > + EndCommand(&qc, DestRemote, false);
> > + pq_flush();
> >
> > Do we really need to do anything except for breaking the loop and let
> > the exit handling happen in the main loop when 'got_STOPPING' is set?
> > AFAICS, this is what we are doing in some other places (See
> > WalSndWaitForWal). Won't that work? It seems that will help us send
> > all the pending WAL.
>
> If we exit the loop after got_STOPPING is set, as you said, the walsender
> will send the delayed changes and then exit. The behavior is the same as the
> case where WalSndDone() is called. But I think it is not suitable for the
> motivation of the feature. If users notice a mistaken operation like
> TRUNCATE, they must shut down the publisher once and then recover from a
> backup or the old subscriber. If the walsender sends all pending changes,
> mistaken operations will also be propagated to the subscriber and the data
> cannot be protected. So currently I want to keep the style.
> FYI - In case of physical replication, received WALs are not applied when
> the secondary is shut down.
>

Fair point but I think the current comment should explain why we are doing
something different here. How about extending the existing comments to
something like: "If we've requested to shut down, exit the process. This is
unlike handling at other places where we allow complete WAL to be sent before
shutdown because we don't want the delayed transactions to be applied
downstream. This will allow one to use the data from downstream in case of
some unwanted operations on the current node."
--
With Regards,
Amit Kapila.
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Mar 1, 2023 at 8:18 AM Masahiko Sawada wrote: > > On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu) > wrote: > > Thinking of side effects of this feature (no matter where we delay > applying the changes), on the publisher, vacuum cannot collect garbage > and WAL cannot be recycled. Is that okay in the first place? The point > is that the subscription setting affects the publisher. That is, > min_send_delay is specified on the subscriber but the symptoms that > could ultimately lead to a server crash appear on the publisher, which > sounds dangerous to me. > > Imagine a service or system like where there is a publication server > and it's somewhat exposed so that a user (or a subsystem) arbitrarily > can create a subscriber to replicate a subset of the data. A malicious > user can have the publisher crash by creating a subscription with, > say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation > but it's -1 by default. > By publisher crash, do you mean due to the disk full situation, it can lead the publisher to stop/panic? Won't a malicious user can block the replication in other ways as well and let the publisher stall (or crash the publisher) even without setting min_send_delay? Basically, one needs to either disable the subscription or create a constraint-violating row in the table to make that happen. If the system is exposed for arbitrarily allowing the creation of a subscription then a malicious user can create a subscription similar to one existing subscription and block the replication due to constraint violations. I don't think it would be so easy to bypass the current system that a malicious user will be allowed to create/alter subscriptions arbitrarily. Similarly, if there is a network issue (unreachable or slow), one will see similar symptoms. I think retention of data and WAL on publisher do rely on acknowledgment from subscribers and delay in that due to any reason can lead to the symptoms you describe above. We have documented at least one such case already where during Drop Subscription, if the network is not reachable then also, a similar problem can happen and users need to be careful about it [1]. [1] - https://www.postgresql.org/docs/devel/logical-replication-subscription.html -- With Regards, Amit Kapila.
Re: Normalization of utility queries in pg_stat_statements
On Mon, Feb 20, 2023 at 11:34:59AM +0900, Michael Paquier wrote:
> With the patches..

Attached is an updated patch set, where I have done more refactoring work for
the regression tests of pg_stat_statements, splitting pg_stat_statements.sql
into the following files:
- user_activity.sql for the role-level resets.
- wal.sql for the WAL generation tracking.
- dml.sql for insert/update/delete/merge and row counts.
- The main file is renamed to select.sql, as it now only covers SELECT
  patterns.

There is no change in the code coverage or the patterns tested. And with
that, I am rather comfortable with the shape of the regression tests moving
forward.

0002 and 0003 are equivalent to the previous 0003 and 0004 in v4, which
switch pg_stat_statements to apply the normalization to utilities that use
Const nodes.
--
Michael

From 23476ee28a8e5bb82859369a8f54e9cd3c45fc32 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Wed, 1 Mar 2023 13:34:11 +0900
Subject: [PATCH v5 1/3] Split more regression tests of pg_stat_statements

This commit expands more the refactoring of the regression tests of
pg_stat_statements, with tests moved out of pg_stat_statements.sql to
separate files, as of:
- select.sql is mainly the former pg_stat_statements.sql, renamed.
- dml.sql for INSERT/UPDATE/DELETE and MERGE
- user_activity, to test role-level checks and stat resets.
- wal, to check the WAL generation.
---
 contrib/pg_stat_statements/Makefile           |   4 +-
 contrib/pg_stat_statements/expected/dml.out   | 148 
 .../expected/pg_stat_statements.out           | 768 --
 .../pg_stat_statements/expected/select.out    | 411 ++
 .../expected/user_activity.out                | 199 +
 contrib/pg_stat_statements/expected/wal.out   |  35 +
 contrib/pg_stat_statements/meson.build        |   5 +-
 contrib/pg_stat_statements/sql/dml.sql        |  78 ++
 .../sql/pg_stat_statements.sql                | 300 ---
 contrib/pg_stat_statements/sql/select.sql     | 146 
 .../pg_stat_statements/sql/user_activity.sql  |  65 ++
 contrib/pg_stat_statements/sql/wal.sql        |  22 +
 12 files changed, 1110 insertions(+), 1071 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/dml.out
 delete mode 100644 contrib/pg_stat_statements/expected/pg_stat_statements.out
 create mode 100644 contrib/pg_stat_statements/expected/select.out
 create mode 100644 contrib/pg_stat_statements/expected/user_activity.out
 create mode 100644 contrib/pg_stat_statements/expected/wal.out
 create mode 100644 contrib/pg_stat_statements/sql/dml.sql
 delete mode 100644 contrib/pg_stat_statements/sql/pg_stat_statements.sql
 create mode 100644 contrib/pg_stat_statements/sql/select.sql
 create mode 100644 contrib/pg_stat_statements/sql/user_activity.sql
 create mode 100644 contrib/pg_stat_statements/sql/wal.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 69fbc6a858..5578a9dd4e 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -17,8 +17,8 @@ PGFILEDESC = "pg_stat_statements - execution statistics of SQL statements"
 LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
-REGRESS = pg_stat_statements cursors utility level_tracking planning \
-	cleanup oldextversions
+REGRESS = select dml cursors utility level_tracking planning \
+	user_activity wal cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/dml.out b/contrib/pg_stat_statements/expected/dml.out
new file mode 100644
index 00..803d993e85
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/dml.out
@@ -0,0 +1,148 @@
+--
+-- DMLs on test table
+--
+SET pg_stat_statements.track_utility = FALSE;
+-- utility "create table" should not be shown
+CREATE TEMP TABLE pgss_dml_tab (a int, b char(20));
+INSERT INTO pgss_dml_tab VALUES(generate_series(1, 10), 'aaa');
+UPDATE pgss_dml_tab SET b = 'bbb' WHERE a > 7;
+DELETE FROM pgss_dml_tab WHERE a > 9;
+-- explicit transaction
+BEGIN;
+UPDATE pgss_dml_tab SET b = '111' WHERE a = 1 ;
+COMMIT;
+BEGIN \;
+UPDATE pgss_dml_tab SET b = '222' WHERE a = 2 \;
+COMMIT ;
+UPDATE pgss_dml_tab SET b = '333' WHERE a = 3 \;
+UPDATE pgss_dml_tab SET b = '444' WHERE a = 4 ;
+BEGIN \;
+UPDATE pgss_dml_tab SET b = '555' WHERE a = 5 \;
+UPDATE pgss_dml_tab SET b = '666' WHERE a = 6 \;
+COMMIT ;
+-- many INSERT values
+INSERT INTO pgss_dml_tab (a, b) VALUES (1, 'a'), (2, 'b'), (3, 'c');
+-- SELECT with constants
+SELECT * FROM pgss_dml_tab WHERE a > 5 ORDER BY a ;
+ a |  b
+---+--
+ 6 | 666
+ 7 | aaa
+ 8 | bbb
+ 9 | bbb
+(4 rows)
+
+SELECT *
+  FROM pgss_dml_tab
+  WHERE a >
Re: stopgap fix for signal handling during restore_command
On Sun, Feb 26, 2023 at 12:12:27PM -0800, Andres Freund wrote: > On 2023-02-26 11:39:00 -0800, Nathan Bossart wrote: >> What precisely did you have in mind? AFAICT you are asking for a wrapper >> around write(). > > Partially I just want something that can easily be searched for, that can have > comments attached to it documenting why what it is doing is safe. > > It'd not be a huge amount of work to have a slow and restricted string > interpolation support, to make it easier to write messages. Converting floats > is probably too hard to do safely, and I'm not sure %m can safely be > supported. But basic things like %d would be pretty simple. > > Basically a loop around the format string that directly writes to stderr using > write(), and only supports a signal safe subset of normal format strings. Got it, thanks. I will try to put something together along these lines, although I don't know if I'll pick up the interpolation support in this thread. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
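To make the shape of such a helper concrete, here is a rough sketch of a
restricted, signal-safe reporting routine of the kind being discussed. It is
not from any patch, and the function names are invented for illustration;
only async-signal-safe calls (write(), strlen()) are used and errno is
preserved.

#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Convert an int to decimal and write it, without printf-family calls. */
static void
sigsafe_write_int(int fd, int value)
{
	char		buf[32];
	int			pos = sizeof(buf);
	unsigned int u = (value < 0) ? 0 - (unsigned int) value : (unsigned int) value;

	do
	{
		buf[--pos] = '0' + (u % 10);
		u /= 10;
	} while (u > 0);

	if (value < 0)
		buf[--pos] = '-';

	(void) write(fd, buf + pos, sizeof(buf) - pos);
}

/* Emit "label: value\n" on stderr using plain write(), preserving errno. */
static void
sigsafe_report_int(const char *label, int value)
{
	int			save_errno = errno;

	(void) write(STDERR_FILENO, label, strlen(label));
	(void) write(STDERR_FILENO, ": ", 2);
	sigsafe_write_int(STDERR_FILENO, value);
	(void) write(STDERR_FILENO, "\n", 1);

	errno = save_errno;
}

A restricted format-string variant would just loop over the format, emitting
literal runs with write() and dispatching %d/%s to helpers like the above.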
Re: Weird failure with latches in curculio on v15
On Sat, Feb 25, 2023 at 11:00:31AM -0800, Andres Freund wrote: > TBH, I think the current archive and restore module APIs aren't useful. I > think it was a mistake to add archive modules without having demonstrated that > one can do something useful with them that the restore_command didn't already > do. If anything, archive modules have made it harder to improve archiving > performance via concurrency. I must respectfully disagree that this work is useless. Besides the performance and security benefits of not shelling out for every WAL file, I've found it very useful to be able to use the standard module framework to develop archive modules. It's relatively easy to make use of GUCs, background workers, compression, etc. Of course, there is room for improvement in areas like concurrency support as you rightly point out, but I don't think that makes the current state worthless. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Track Oldest Initialized WAL Buffer Page
On Tue, Feb 28, 2023 at 11:12:29AM +0530, Bharath Rupireddy wrote: > On Tue, Feb 28, 2023 at 5:52 AM Nathan Bossart > wrote: >> It's confusing to me that OldestInitializedPage is set to NewPageBeginPtr. >> Doesn't that set it to the beginning of the newest initialized page? > > Yes, that's the intention, see below. OldestInitializedPage points to > the start address of the oldest initialized page whereas the > InitializedUpTo points to the end address of the latest initialized > page. With this, one can easily track all the WAL between > OldestInitializedPage and InitializedUpTo. This is where I'm confused. Why would we set the variable for the start address of the _oldest_ initialized page to the start address of the _newest_ initialized page? I must be missing something obvious, so sorry if this is a silly question. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote: > On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart > wrote: >> Why do we only read a page at a time in XLogReadFromBuffersGuts()? What is >> preventing us from copying all the data we need in one go? > > Note that most of the WALRead() callers request a single page of > XLOG_BLCKSZ bytes even if the server has less or more available WAL > pages. It's the streaming replication wal sender that can request less > than XLOG_BLCKSZ bytes and upto MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And, > if we read, say, MAX_SEND_SIZE at once while holding > WALBufMappingLock, that might impact concurrent inserters (at least, I > can say it in theory) - one of the main intentions of this patch is > not to impact inserters much. Perhaps we should test both approaches to see if there is a noticeable difference. It might not be great for concurrent inserts to repeatedly take the lock, either. If there's no real difference, we might be able to simplify the code a bit. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Mar 1, 2023 at 8:06 AM Kyotaro Horiguchi wrote: > > At Tue, 28 Feb 2023 08:35:11 +0530, Amit Kapila > wrote in > > On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi > > wrote: > > > > > > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila > > > wrote in > > > > The one difference w.r.t recovery_min_apply_delay is that here we will > > > > hold locks for the duration of the delay which didn't seem to be a > > > > good idea. This will also probably lead to more bloat as we will keep > > > > transactions open for a long time. Doing it before DecodePrepare won't > > > > > > I don't have a concrete picture but could we tell reorder buffer to > > > retain a PREPAREd transaction until a COMMIT PREPARED comes? > > > > > > > Yeah, we could do that and that is what is the behavior unless the > > user enables 2PC via 'two_phase' subscription option. But, I don't see > > the need to unnecessarily delay the prepare till the commit if a user > > has specified 'two_phase' option. It is quite possible that COMMIT > > PREPARED happens at a much later time frame than the amount of delay > > the user is expecting. > > It looks like the user should decide between potential long locks or > extra delays, and this choice ought to be documented. > Sure, we can do that. However, I think the way this feature works is that we keep standby/subscriber behind the primary/publisher by a certain time period and if there is any unwanted transaction (say Delete * .. without where clause), we can recover it from the receiver side. So, it may not matter much even if we wait at PREPARE to avoid long locks instead of documenting it. > > > If > > > delaying non-prepared transactions until COMMIT is adequate, then the > > > same thing seems to work for prepared transactions. > > > > > > > have such problems. This is the reason that we decide to perform a > > > > delay at the start of the transaction instead at commit/prepare in the > > > > subscriber-side approach. > > > > > > It seems that there are no technical obstacles to do that on the > > > publisher side. The only observable difference would be that > > > relatively large prepared transactions may experience noticeable > > > additional delays. IMHO I don't think it's a good practice > > > protocol-wise to intentionally choke a stream at the receiving end > > > when it has not been flow-controlled on the transmitting end. > > > > > > > But in this proposal, we are not choking/delaying anything on the receiving > > end. > > I didn't say that to the latest patch. I interpreted the quote of > your description as saying that the subscriber-side solution is > effective in solving the long-lock problems, so I replied that that > can be solved with the publisher-side solution and the subscriber-side > solution could cause some unwanted behavior. > > Do you think we have decided to go with the publisher-side solution? > I'm fine if so. > I am fine too unless we discover any major challenges with publisher-side implementation. -- With Regards, Amit Kapila.
Re: Auth extensions, with an LDAP/SCRAM example [was: Proposal: Support custom authentication methods using hooks]
Greetings, * Jacob Champion (jchamp...@timescale.com) wrote: > On Mon, Feb 27, 2023 at 12:43 PM Stephen Frost wrote: > > * Jacob Champion (jchamp...@timescale.com) wrote: > > > This patchset should ideally have required zero client side changes, but > > > because our SCRAM implementation is slightly nonstandard too -- it > > > doesn't embed the username into the SCRAM data -- libpq can't talk to > > > the OpenLDAP/Cyrus SASL implementation. I included a quick-and-bad patch > > > to fix it in libpq; that needs its own conversation. > > > > Indeed it does... as there were specific reasons that what we > > implemented for PG's SCRAM doesn't embed the username into the SCRAM > > data and my recollection is that we don't because SCRAM (or maybe SASL? > > either way though...) requires it to be utf-8 encoded and we support a > > lot of other encoding that aren't that, so it wouldn't work for a lot > > of our users. > > Yes. Interoperable authentication is going to have to solve those > sorts of problems somehow, though. And there are a bunch of levers to > pull: clients aren't required to run their sent usernames through > SASLprep; our existing servers don't mind having it sent, so we could > potentially backport a client change; and then after that it's a game > of balancing compatibility and compliance. A deeper conversation for > sure. We did solve those problems with SCRAM, by not exactly following the unfortunately short-sighted standard. What we have gets us the benefits of SCRAM and generally follows the standard without giving up non-utf8 encodings of roles. If there's a compelling reason to support another authentication method then I'd hope we would look for similar ways to support it without giving up too much. I certainly don't see it making sense to backport some change in this area. > > > I think this is exactly the sort of thing that's too niche to be in core > > > but might be extremely useful for the few people it applies to, so I'm > > > proposing it as an argument in favor of general authn/z extensibility. > > > > If it was able to actually work for our users (and maybe it is if we > > make it optional somehow) and we have users who want it (which certainly > > seems plausible) then I'd say that it's something we would benefit from > > having in core. While it wouldn't solve all the issues with 'ldap' > > auth, if it works with AD's LDAP servers and doesn't require the > > password to be passed from the client to the server (in any of this, be > > the client psql, pgadmin, or the PG server when it goes to talk to the > > LDAP server..) then that would be a fantastic thing and we could > > replace the existing 'ldap' auth with that and be *much* better off for > > it, and so would our users. > > I think the argument you're making here boils down to "if it's not > niche, it should be in core", and I agree with that general sentiment. Good. That certainly seems like a start. > But not all of the prerequisites you stated are met. I see no evidence > that Active Directory supports SCRAM [1], for instance, unless the MS > documentation is just omitting it. That's certainly unfortunate if that's the case. The question then becomes- are a lot of people using SCRAM to connect to OpenLDAP, to the extent that it's a sensible thing for us to support and gives us a more secure option for 'ldap' auth than what exists today where we're passing around cleartext passwords? That seems like it's at least plausible. 
> Even if it were applicable to every single LDAP deployment, I'd still > like users to be able to choose the best authentication available in > their setups without first having to convince the community. They can > maintain the things that make sense for them, like they do with > extensions. And the authors can iterate on better authentication out > of cycle, like extension authors do. Considering how few outside-of-core well maintained extensions for PG exist, I have an extremely hard time buying off on the argument that the authentication system is a smart area for us to encourage approach in. Broadly speaking, I feel confident saying that authentication systems, rather like encryption libraries, should be standardized, well documented, heavily reviewed, and carefully maintained. If the argument is that there are these teams out there who are itching to write amazing new authentication methods for PG who are going to spend time documenting them, reviewing them, and maintaining them, then we should try to get those folks to work with the community to add support for these new methods so that everyone can benefit from them- not encouraging them to be independent projects which have to work through hooks that limit how those authentication methods are able to be integrated. Consider things like pg_stat_ssl and pg_stat_gssapi. What about pg_ident maps? Will each of these end up with their own version of that? And those are questions beyond just the issue around making sure
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu) wrote:
>
> Dear Amit,
>
> > Few comments:
>
> Thank you for reviewing! PSA new version.
> Note that the starting point of delay for 2PC was not changed;
> I think it has been under discussion.
>
> > 1.
> > + /*
> > + * If we've requested to shut down, exit the process.
> > + *
> > + * Note that WalSndDone() cannot be used here because the delaying
> > + * changes will be sent in the function.
> > + */
> > + if (got_STOPPING)
> > + {
> > + QueryCompletion qc;
> > +
> > + /* Inform the standby that XLOG streaming is done */
> > + SetQueryCompletion(&qc, CMDTAG_COPY, 0);
> > + EndCommand(&qc, DestRemote, false);
> > + pq_flush();
> >
> > Do we really need to do anything except for breaking the loop and let
> > the exit handling happen in the main loop when 'got_STOPPING' is set?
> > AFAICS, this is what we are doing in some other places (See
> > WalSndWaitForWal). Won't that work? It seems that will help us send
> > all the pending WAL.
>
> If we exit the loop after got_STOPPING is set, as you said, the walsender
> will send the delayed changes and then exit. The behavior is the same as the
> case where WalSndDone() is called. But I think it is not suitable for the
> motivation of the feature. If users notice a mistaken operation like
> TRUNCATE, they must shut down the publisher once and then recover from a
> backup or the old subscriber. If the walsender sends all pending changes,
> mistaken operations will also be propagated to the subscriber and the data
> cannot be protected. So currently I want to keep the style.
> FYI - In case of physical replication, received WALs are not applied when
> the secondary is shut down.
>
> > 2.
> > + /* Try to flush pending output to the client */
> > + if (pq_flush_if_writable() != 0)
> > + WalSndShutdown();
> >
> > Is there a reason to try flushing here?
>
> IIUC if pq_flush_if_writable() returns non-zero (EOF), it means that there
> is trouble and the walsender fails to send messages to the subscriber.
>
> In Linux, the call chain from pq_flush_if_writable() will finally reach the
> send() system call. And according to the man page[1], it will be triggered
> by some unexpected state or the connection being closed.
>
> Based on the above, I think the returned value should be confirmed.
>
> > Apart from the above, I have made a few changes in the comments in the
> > attached diff patch. If you agree with those then please include them
> > in the next version.
>
> Thanks! I checked and I thought all of them should be included.
>
> Moreover, I used a grammar checker and slightly reworded the commit message.

Thinking of the side effects of this feature (no matter where we delay
applying the changes), on the publisher, vacuum cannot collect garbage and
WAL cannot be recycled. Is that okay in the first place? The point is that
the subscription setting affects the publisher. That is, min_send_delay is
specified on the subscriber but the symptoms that could ultimately lead to a
server crash appear on the publisher, which sounds dangerous to me.

Imagine a service or system where there is a publication server and it's
somewhat exposed so that a user (or a subsystem) arbitrarily can create a
subscriber to replicate a subset of the data. A malicious user can have the
publisher crash by creating a subscription with, say, min_send_delay = 20d.
max_slot_wal_keep_size helps this situation but it's -1 by default.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 28 Feb 2023 08:35:11 +0530, Amit Kapila wrote in > On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi > wrote: > > > > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila > > wrote in > > > The one difference w.r.t recovery_min_apply_delay is that here we will > > > hold locks for the duration of the delay which didn't seem to be a > > > good idea. This will also probably lead to more bloat as we will keep > > > transactions open for a long time. Doing it before DecodePrepare won't > > > > I don't have a concrete picture but could we tell the reorder buffer to > > retain a PREPAREd transaction until a COMMIT PREPARED comes? > > > > Yeah, we could do that and that is the behavior unless the > user enables 2PC via the 'two_phase' subscription option. But, I don't see > the need to unnecessarily delay the prepare till the commit if a user > has specified the 'two_phase' option. It is quite possible that COMMIT > PREPARED happens at a much later time frame than the amount of delay > the user is expecting. It looks like the user should decide between potential long locks and extra delays, and this choice ought to be documented. > > If > > delaying non-prepared transactions until COMMIT is adequate, then the > > same thing seems to work for prepared transactions. > > > > > have such problems. This is the reason that we decided to perform the > > > delay at the start of the transaction instead of at commit/prepare in the > > > subscriber-side approach. > > > > It seems that there are no technical obstacles to doing that on the > > publisher side. The only observable difference would be that > > relatively large prepared transactions may experience noticeable > > additional delays. IMHO I don't think it's a good practice > > protocol-wise to intentionally choke a stream at the receiving end > > when it has not been flow-controlled on the transmitting end. > > > > But in this proposal, we are not choking/delaying anything on the receiving > end. I wasn't saying that about the latest patch. I interpreted the quoted part of your description as saying that the subscriber-side solution is effective in solving the long-lock problems, so I replied that that can also be solved with the publisher-side solution, and that the subscriber-side solution could cause some unwanted behavior. Do you think we have decided to go with the publisher-side solution? I'm fine if so. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Proposal] Add foreign-server health checks infrastructure
Hi Kuroda-san, Thank you for updating the patch! I rethought the pqSocketPoll part. The current interpretation of the arguments seems a little bit confusing because a specific pattern of arguments has a different meaning. What do you think about introducing a new argument like `int forConnCheck`? This seems straightforward and readable. I think it may be better, so fixed. But now we must consider another thing - will we support a combination of conncheck and {read|write} or timeout? Especially regarding timeout, if someone calls pqSocketPoll() with forConnCheck = 1 and end_time = -1, the process may be stuck because it waits until the socket peer closes the connection. Currently forConnCheck can be specified together with another request, but the timeout must be zero. Yes, we need to consider these. I'm wondering whether we need special care for the combination of event and timeout. Surely, if forConnCheck is set and end_time = -1, pqSocketPoll blocks until the connection closes. However, the behavior matches the meaning of the arguments and does not seem confusing (also not an error state). Do we need to restrict this kind of usage on the pqSocketPoll side? I think something like the following might be fine. ``` if (!forRead && !forWrite) { if (!(forConnCheck && PQconnCheckable())) return 0; } ``` While making the patch, I came up with the idea that postgres_fdw_verify_connection_states* returns NULL if PQconnCheck() cannot work on this platform. This can be done by adding PQconnCheckable(). It may be reasonable because the checking is not really done in this environment. What do you think? I agree with you. The following are comments for v33. Please check. 0001: 1. the comment of PQconnCheck +/* + * Check whether the socket peer closed connection or not. + * Check whether the socket peer closed 'the' connection or not? 2. the comment of pqSocketCheck - * or both. Returns >0 if one or more conditions are met, 0 if it timed - * out, -1 if an error occurred. + * or both. Moreover, this function can check the health of socket on some + * limited platforms if end_time is 0. the health of socket -> the health of the connection? if end_time is 0 -> if forConnCheck is specified? 3. the comment of pqSocketPoll - * If neither forRead nor forWrite are set, immediately return a timeout - * condition (without waiting). Return >0 if condition is met, 0 - * if a timeout occurred, -1 if an error or interrupt occurred. + * Moreover, this function can check the health of socket on some limited + * platforms if end_time is 0. the health of socket -> the health of the connection? if end_time is 0 -> if forConnCheck is specified? 4. the code of pqSocketPoll +#if defined(POLLRDHUP) + if (forConnCheck) + input_fd.events |= POLLRDHUP; +#endif I think it is better to use PQconnCheckable() to remove the macro. 5. the code of pqSocketCheck and the definition of pqSocketPoll - result = pqSocketPoll(conn->sock, forRead, forWrite, end_time); + result = pqSocketPoll(conn->sock, forRead, forWrite, forConnCheck, end_time); -pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time) +pqSocketPoll(int sock, int forRead, int forWrite, int forConnCheck, time_t end_time) Should these be divided into two lines? 6. the comment of verify_cached_connections + * This function emits warnings if a disconnection is found, and this returns + * false only when the lastly verified server seems to be disconnected. It seems better to also state the case where this function returns true. 7.
the comment of postgres_fdw_verify_connection_states + * This function emits a warning if a disconnection is found. This returns + * false only when the verified server seems to be disconnected, and reutrns + * NULL if the connection check had not been done. It seems better to write the case where this function returns true. 8. the code of + (errcode(ERRCODE_CONNECTION_FAILURE), +errmsg("%s", str.data), +errdetail("Socket close is detected."), +errhint("Plsease check the health of server."))); Is it better to use "Connection close is detected" rather than "Socket close is detected"? 9. the document of postgres_fdw The document of postgres_fdw_verify_connection_states_* is a little bit old. Could you update it? regards, -- Katsuragi Yuta Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Non-superuser subscription owners
Greetings, * Jeff Davis (pg...@j-davis.com) wrote: > On Tue, 2023-02-28 at 08:37 -0500, Robert Haas wrote: > > The existing SECURITY_RESTRICTED_OPERATION flag basically prevents > > you > > from tinkering with the session state. > > Currently, every time we set that flag we also run all the code as the > table owner. > > You're suggesting using the SECURITY_RESTRICTED_OPERATION flag, along > with the new security flags, but not switching to the table owner, right? I'm having trouble following this too, I have to admit. If we aren't changing who we're running the code as, but are making it so that the code isn't actually able to do anything, then that doesn't strike me as likely to be useful. Surely things like triggers which are used to update another table, or to insert into another table a record of what happened on the table with the trigger, need to be allowed to modify the database. How do we make that possible while the code runs as the invoker and not the table owner, when the table owner is the one who gets to write the code? > > If we also had similar flags > > like DATABASE_READS_PROHIBITED and DATABASE_WRITES_PROHIBITED (or > > just > > a combined DATABASE_ACCESS_PROHIBITED flag) I think that would be > > pretty close to what we need. The idea would be that, when a user > > executes a function or procedure > > Or default expressions, I presume. If we at least agree on this point, > then I think we should try to find a way to treat these other hunks of > code in a secure way (which I think is what Andres was suggesting). This would need to apply to functions in views and functions in RLS too, along with default expressions and everything else that could be defined by one person and run by another. > > owned by a user that they don't trust > > completely, we'd set > > SECURITY_RESTRICTED_OPERATION|DATABASE_READS_PROHIBITED|DATABASE_WRIT > > ES_PROHIBITED. > > It seems like you're saying to basically just keep the user ID the > same, and maybe keep USAGE privileges, but not be able to do anything > else? Might be useful. Kind of like running it as a nobody user but > without the problems you mentioned. Some details to think about, I'm > sure. While there are certainly some use cases where a completely unprivileged user would work, there's certainly an awful lot where it wouldn't. Having that as an option might be interesting for those much more limited use cases, and maybe you could even say "only run functions which are owned by a superuser or X roles", but it's certainly not a general solution to the problem. > > And we could provide a user with a way to express the degree of trust > > they have in some other user or perhaps even some specific function, > > e.g. > > > > SET trusted_roles='alice:read'; > > > > ...could mean that I trust alice to read from the database with my > > permissions, should I happen to run code provided by her in SECURITY > > INVOKER mode. > > I'm not very excited about inventing a new privilege language inside a > GUC, but perhaps a simpler form could be a reasonable mitigation (or at > least a starting place). I'm pretty far down the path of "wow, that looks really difficult to work with", to put it nicely. > > I'm sure there's some details to sort out here, e.g. around security > > related to the trusted_roles GUC itself. But I don't really see a > > fundamental problem.
We can invent arbitrary flags that prohibit > > classes of operations that are of concern, set them by default in > > cases where concern is justified, and then give users who want the > > current behavior some kind of escape hatch that causes those flags to > > not get set after all. Not only does such a solution not seem > > impossible, I can possibly even imagine back-patching it, depending > > on > > exactly what the shape of the final solution is, how important we > > think it is to get a fix out there, and how brave I'm feeling that > > day. > > Unless the trusted roles defaults to '*', then I think it will still > break some things. Defaulting to an option that is "don't break anything" while giving users flexibility to test out other, more secure, options seems like it would be a pretty reasonable way forward, generally. That said.. I don't really think this particular approach ends up being a good direction to go in... > One of my key tests for user-facing proposals is whether the > documentation will be reasonable or not. Most of these proposals to > make SECURITY INVOKER less bad fail that test. and this is certainly a very good point as to why. Thanks, Stephen signature.asc Description: PGP signature
Re: Time delayed LR (WAS Re: logical replication restrictions)
Here are some review comments for v9-0001, but these are only very trivial. == Commit Message 1. Nitpick. The new text is jagged-looking. It should wrap at ~80 chars. ~~~ 2. 2. Another reason is for that parallel streaming, the transaction will be opened immediately by the parallel apply worker. Therefore, if the walsender is delayed in sending the final record of the transaction, the parallel apply worker must wait to receive it with an open transaction. This would result in the locks acquired during the transaction not being released until the min_send_delay has elapsed. ~ The text already said there are "two reasons", and already this is numbered as reason 2. So it doesn't need to keep saying "Another reason" here. "Another reason is for that parallel streaming" --> "For parallel streaming..." == src/backend/replication/walsender.c 3. WalSndDelay + /* die if timeout was reached */ + WalSndCheckTimeOut(); Other nearby comments start uppercase, so this should too. == src/include/replication/walreceiver.h 4. WalRcvStreamOptions @@ -187,6 +187,7 @@ typedef struct * prepare time */ char*origin; /* Only publish data originating from the * specified origin */ + int32 min_send_delay; /* The minimum send delay */ } logical; } proto; } WalRcvStreamOptions; ~ Should that comment mention the units are "(ms)" -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Doc update for pg_stat_statements normalization
On Wed, Mar 01, 2023 at 12:43:40AM +, Imseih (AWS), Sami wrote: > Yes, that makes sense. Okay, thanks. Done that now on HEAD. -- Michael signature.asc Description: PGP signature
Proposal: %T Prompt parameter for psql for current time (like Oracle has)
I cannot get the last email to show up for the commitfest. This is version 2 of the original patch. [1] Thanks Jim! [1] https://postgr.es/m/CACLU5mSRwHr_8z%3DenMj-nXF1tmC7%2BJn5heZQNiKuLyxYUtL2fg%40mail.gmail.com Regards Kirk. From b9db157177bbdeeeb6d35c3623ca9355141419d7 Mon Sep 17 00:00:00 2001 From: Jim Jones Date: Wed, 1 Mar 2023 00:07:55 +0100 Subject: [PATCH v2] Time option added to psql prompt This adds a useful time option to the prompt: %T. Which does not require a wasteful backquoted shell command which is also not compatible between operating systems. The format is simply HH24:MI:SS no other options available by design! Author: Kirk Wolak Reviewed-By: Andrey Borodin Reviewed-By: Nikolay Samokhvalov Thread: https://postgr.es/m/CACLU5mSRwHr_8z%3DenMj-nXF1tmC7%2BJn5heZQNiKuLyxYUtL2fg%40mail.gmail.com --- doc/src/sgml/ref/psql-ref.sgml | 9 + src/bin/psql/prompt.c | 10 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index dc6528dc11..04ab9eeb8c 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -4575,6 +4575,15 @@ testdb= INSERT INTO my_table VALUES (:'content'); + +%T + + + The current time on the client in HH24:MI:SS format. + + + + %x diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c index 969cd9908e..0c0c725df5 100644 --- a/src/bin/psql/prompt.c +++ b/src/bin/psql/prompt.c @@ -41,6 +41,7 @@ * or a ! if session is not connected to a database; * in prompt2 -, *, ', or "; * in prompt3 nothing + * %T - time in HH24:MI:SS format * %x - transaction status: empty, *, !, ? (unknown or no connection) * %l - The line number inside the current statement, starting from 1. * %? - the error code of the last query (not yet implemented) @@ -223,7 +224,14 @@ get_prompt(promptStatus_t status, ConditionalStack cstack) break; } break; - + /* output HH24:MI:SS */ + case 'T': + { + time_t current_time = time(NULL); + struct tm *tm_info = localtime(&current_time); + sprintf(buf, "%02d:%02d:%02d", tm_info->tm_hour, tm_info->tm_min, tm_info->tm_sec); + } + break; case 'x': if (!pset.db) buf[0] = '?'; -- 2.25.1
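(For anyone who wants to try it out, a minimal usage sketch - the surrounding prompt escapes are just an example, only %T comes from this patch:

  \set PROMPT1 '%T %n@%/%R%# '

after which the prompt looks something like "14:23:05 postgres@postgres=#", refreshed each time the prompt is printed.)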
Re: Minimal logical decoding on standbys
On Mon, 2023-02-27 at 09:40 +0100, Drouvot, Bertrand wrote: > Please find attached V51 tiny rebase due to a6cd1fc692 (for 0001) and > 8a8661828a (for 0005). [ Jumping into this thread late, so I apologize if these comments have already been covered. ] Regarding v51-0004: * Why is the CV sleep not being canceled? * Comments on WalSndWaitForWal need to be updated to explain the difference between the flush (primary) and the replay (standby) cases. Overall, it seems like what you really want for the sleep/wakeup logic in WalSndWaitForLSN is something like this: condVar = RecoveryInProgress() ? replayCV : flushCV; waitEvent = RecoveryInProgress() ? WAIT_EVENT_WAL_SENDER_WAIT_REPLAY : WAIT_EVENT_WAL_SENDER_WAIT_FLUSH; ConditionVariablePrepareToSleep(condVar); for(;;) { ... sleeptime = WalSndComputeSleepTime(GetCurrentTimestamp()); socketEvents = WL_SOCKET_READABLE; if (pq_is_send_pending()) socketEvents = WL_SOCKET_WRITABLE; ConditionVariableTimedSleepOrEvents( condVar, sleeptime, socketEvents, waitEvent); } ConditionVariableCancelSleep(); But the problem is that ConditionVariableTimedSleepOrEvents() doesn't exist, and I think that's what Andres was suggesting here[1]. WalSndWait() only waits for a timeout or a socket event, but not a CV; ConditionVariableTimedSleep() only waits for a timeout or a CV, but not a socket event. I'm also missing how WalSndWait() works currently. It calls ModifyWaitEvent() with NULL for the latch, so how does WalSndWakeup() wake it up? Assuming I'm wrong, and WalSndWait() does use the latch, then I guess it could be extended by having two different latches in the WalSnd structure, and waking them up separately and waiting on the right one. Not sure if that's a good idea though. [1] https://www.postgresql.org/message-id/20230106034036.2m4qnn7ep7b5i...@awork3.anarazel.de -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)
On Thu, Feb 23, 2023 at 2:05 PM Kirk Wolak wrote: > On Thu, Feb 23, 2023 at 9:52 AM Tom Lane wrote: > >> Heikki Linnakangas writes: >> > On 23/02/2023 13:20, Peter Eisentraut wrote: >> >> If you don't have \timing turned on before the query starts, psql won't >> >> record what the time was before the query, so you can't compute the run >> >> time afterwards. This kind of feature would only work if you always >> >> take the start time, even if \timing is turned off. >> >> > Correct. That seems acceptable though? gettimeofday() can be slow on >> > some platforms, but I doubt it's *that* slow, that we couldn't call it >> > two times per query. >> >> Yeah, you'd need to capture both the start and stop times even if >> \timing isn't on, in case you get asked later. But the backend is >> going to call gettimeofday at least once per query, likely more >> depending on what features you use. And there are inherently >> multiple kernel calls involved in sending a query and receiving >> a response. I tend to agree with Heikki that this overhead would >> be unnoticeable. (Of course, some investigation proving that >> wouldn't be unwarranted.) >> >> regards, tom lane >> > > Note, for this above feature, I was thinking we have a ROW_COUNT variable > I use \set to see. > The simplest way to add this is maybe a set variable: EXEC_TIME > And it's set when ROW_COUNT gets set. > +1 > > == > Now, since this opened a lively discussion, I am officially submitting my > first patch. > This includes the small change to prompt.c and the documentation. I had > help from Andrey Borodin, > and Pavel Stehule, who have supported me in how to propose, and use > gitlab, etc. > > We are programmers... It's literally our job to sharpen our tools. And > PSQL is one of my most used. > A small frustration, felt regularly was the motive. > > Regards, Kirk > PS: If I am supposed to edit the subject to say there is a patch here, I > did not know > PPS: I appreciate ANY and ALL feedback... This is how we learn! > Patch Posted with one edit, for line editings (Thanks Jim!) From b9db157177bbdeeeb6d35c3623ca9355141419d7 Mon Sep 17 00:00:00 2001 From: Jim Jones Date: Wed, 1 Mar 2023 00:07:55 +0100 Subject: [PATCH v2] Time option added to psql prompt This adds a useful time option to the prompt: %T. Which does not require a wasteful backquoted shell command which is also not compatible between operating systems. The format is simply HH24:MI:SS no other options available by design! Author: Kirk Wolak Reviewed-By: Andrey Borodin Reviewed-By: Nikolay Samokhvalov Thread: https://postgr.es/m/CACLU5mSRwHr_8z%3DenMj-nXF1tmC7%2BJn5heZQNiKuLyxYUtL2fg%40mail.gmail.com --- doc/src/sgml/ref/psql-ref.sgml | 9 + src/bin/psql/prompt.c | 10 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index dc6528dc11..04ab9eeb8c 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -4575,6 +4575,15 @@ testdb= INSERT INTO my_table VALUES (:'content'); + +%T + + + The current time on the client in HH24:MI:SS format. + + + + %x diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c index 969cd9908e..0c0c725df5 100644 --- a/src/bin/psql/prompt.c +++ b/src/bin/psql/prompt.c @@ -41,6 +41,7 @@ * or a ! if session is not connected to a database; * in prompt2 -, *, ', or "; * in prompt3 nothing + * %T - time in HH24:MI:SS format * %x - transaction status: empty, *, !, ? 
(unknown or no connection) * %l - The line number inside the current statement, starting from 1. * %? - the error code of the last query (not yet implemented) @@ -223,7 +224,14 @@ get_prompt(promptStatus_t status, ConditionalStack cstack) break; } break; - + /* output HH24:MI:SS */ + case 'T': + { + time_t current_time = time(NULL); + struct tm *tm_info = localtime(&current_time); + sprintf(buf, "%02d:%02d:%02d", tm_info->tm_hour, tm_info->tm_min, tm_info->tm_sec); + } + break; case 'x': + if (!pset.db) buf[0] = '?'; -- 2.25.1
Re: Doc update for pg_stat_statements normalization
> + > + Queries on which normalization can be applied may be observed with constant > + values in pg_stat_statements, especially when there > + is a high rate of entry deallocations. To reduce the likelihood of this > + happening, consider increasing pg_stat_statements.max. > + The pg_stat_statements_info view, discussed below > + in , > + provides statistics about entry deallocations. > + > Are you OK with this text? Yes, that makes sense. Regards, -- Sami Imseih Amazon Web Services
Re: pg_upgrade and logical replication
On Tue, Feb 28, 2023 at 08:02:13AM -0800, Nikolay Samokhvalov wrote: > On Fri, Feb 17, 2023 at 7:35 AM Julien Rouhaud wrote: > > > > Any table later added in the > > publication is either already fully replicated until that LSN on the > > upgraded > > node, so only the delta is needed, or has been created after that LSN. In > > the > > latter case, the entirety of the table will be replicated with the logical > > replication as a delta right? > > What if we consider a slightly adjusted procedure? > > 0. Temporarily, forbid running any DDL on the source cluster. This is (at least for me) a non-starter, as I want an approach that doesn't impact the primary node, at least not too much. Also, how would you do that? If you need some new infrastructure it means that you can only upgrade nodes starting from pg16+, while my approach can upgrade any node that supports publications as long as the target version is pg16+. It also raises some concerns: why prevent any DDL when e.g. creating a temporary table shouldn't be a problem, and the same goes for renaming some underlying object, adding indexes... You would have to curate a list of what exactly is allowed, which is never great. Also, how exactly would you ensure that DDL has indeed been forbidden since a long-enough point in time, rather than just "currently" forbidden at the time you do some check?
Re: Doc update for pg_stat_statements normalization
On Tue, Feb 28, 2023 at 11:11:30PM +, Imseih (AWS), Sami wrote: > I agree. We introduce the concept of a plannable statement in a > previous section and we can then make this distinction in the new > paragraph. > > I also added a link to pg_stat_statements_info since that is introduced > later on int the doc. I have reworded the paragraph a bit to be more general so as it would not need an update once more normalization is applied to utility queries (I am going to fix the part where we mention that we use the strings for utilities, which is not the case anymore now): + + Queries on which normalization can be applied may be observed with constant + values in pg_stat_statements, especially when there + is a high rate of entry deallocations. To reduce the likelihood of this + happening, consider increasing pg_stat_statements.max. + The pg_stat_statements_info view, discussed below + in , + provides statistics about entry deallocations. + Are you OK with this text? -- Michael From dd8938e4ba1b1f29d14b3fa2dc76301f42592cad Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 1 Mar 2023 09:05:08 +0900 Subject: [PATCH v3] doc update regarding pg_stat_statements normalization. It is quite possible that a query text willl not normalize (replace constants with $1, $2..) when there is a high rate pgss_hash deallocation. This commit calls this out in docs. --- doc/src/sgml/pgstatstatements.sgml | 10 ++ 1 file changed, 10 insertions(+) diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml index efc36da602..f1ba78c8cb 100644 --- a/doc/src/sgml/pgstatstatements.sgml +++ b/doc/src/sgml/pgstatstatements.sgml @@ -516,6 +516,16 @@ pg_stat_statements entry. + + Queries on which normalization can be applied may be observed with constant + values in pg_stat_statements, especially when there + is a high rate of entry deallocations. To reduce the likelihood of this + happening, consider increasing pg_stat_statements.max. + The pg_stat_statements_info view, discussed below + in , + provides statistics about entry deallocations. + + In some cases, queries with visibly different texts might get merged into a single pg_stat_statements entry. Normally this will happen -- 2.39.2 signature.asc Description: PGP signature
Re: Add LZ4 compression in pg_dump
On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote: > On 2/23/23 16:26, Tomas Vondra wrote: > > Thanks for v30 with the updated commit messages. I've pushed 0001 after > > fixing a comment typo and removing (I think) an unnecessary change in an > > error message. > > > > I'll give the buildfarm a bit of time before pushing 0002 and 0003. > > > > I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.), > and marked the CF entry as committed. Thanks for the patch! I found that e9960732a broke writing of empty gzip-compressed data, specifically LOs. pg_dump succeeds, but then the restore fails: postgres=# SELECT lo_create(1234); lo_create | 1234 $ time ./src/bin/pg_dump/pg_dump -h /tmp -d postgres -Fc |./src/bin/pg_dump/pg_restore -f /dev/null -v pg_restore: implied data-only restore pg_restore: executing BLOB 1234 pg_restore: processing BLOBS pg_restore: restoring large object with OID 1234 pg_restore: error: could not uncompress data: (null) The inline patch below fixes it, but you won't be able to apply it directly, as it's on top of other patches which rename the functions back to "Zlib" and rearranges the functions to their original order, to allow running: git diff --diff-algorithm=minimal -w e9960732a~:./src/bin/pg_dump/compress_io.c ./src/bin/pg_dump/compress_gzip.c The current function order avoids 3 lines of declarations, but it's obviously pretty useful to be able to run that diff command. I already argued for not calling the functions "Gzip" on the grounds that the name was inaccurate. I'd want to create an empty large object in src/test/sql/largeobject.sql to exercise this tested during pgupgrade. But unfortunately that doesn't use -Fc, so this isn't hit. Empty input is an important enough test case to justify a tap test, if there's no better way. diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c index f3f5e87c9a8..68f3111b2fe 100644 --- a/src/bin/pg_dump/compress_gzip.c +++ b/src/bin/pg_dump/compress_gzip.c @@ -55,6 +55,32 @@ InitCompressorZlib(CompressorState *cs, gzipcs = (ZlibCompressorState *) pg_malloc0(sizeof(ZlibCompressorState)); cs->private_data = gzipcs; + + if (cs->writeF) + { + z_streamp zp; + zp = gzipcs->zp = (z_streamp) pg_malloc0(sizeof(z_stream)); + zp->zalloc = Z_NULL; + zp->zfree = Z_NULL; + zp->opaque = Z_NULL; + + /* +* outsize is the buffer size we tell zlib it can output to. We +* actually allocate one extra byte because some routines want to append a +* trailing zero byte to the zlib output. 
+*/ + + gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1); + gzipcs->outsize = ZLIB_OUT_SIZE; + + if (deflateInit(gzipcs->zp, cs->compression_spec.level) != Z_OK) + pg_fatal("could not initialize compression library: %s", + zp->msg); + + /* Just be paranoid - maybe End is called after Start, with no Write */ + zp->next_out = gzipcs->outbuf; + zp->avail_out = gzipcs->outsize; + } } static void @@ -63,7 +89,7 @@ EndCompressorZlib(ArchiveHandle *AH, CompressorState *cs) ZlibCompressorState *gzipcs = (ZlibCompressorState *) cs->private_data; z_streamp zp; - if (gzipcs->zp) + if (cs->writeF != NULL) { zp = gzipcs->zp; zp->next_in = NULL; @@ -131,29 +157,6 @@ WriteDataToArchiveZlib(ArchiveHandle *AH, CompressorState *cs, const void *data, size_t dLen) { ZlibCompressorState *gzipcs = (ZlibCompressorState *) cs->private_data; - z_streamp zp; - - if (!gzipcs->zp) - { - zp = gzipcs->zp = (z_streamp) pg_malloc(sizeof(z_stream)); - zp->zalloc = Z_NULL; - zp->zfree = Z_NULL; - zp->opaque = Z_NULL; - - /* -* outsize is the buffer size we tell zlib it can output to. We -* actually allocate one extra byte because some routines want to -* append a trailing zero byte to the zlib output. -*/ - gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1); - gzipcs->outsize = ZLIB_OUT_SIZE; - - if (deflateInit(zp, cs->compression_spec.level) != Z_OK) - pg_fatal("could not initialize compression library: %s", zp->msg); - - zp->next_out = gzipcs->outbuf; - zp->avail_out = gzipcs->outsize; - } gzipcs->zp->next_in = (void *) unconstify(void *, data); gzipcs->zp->avail_in = dLen; >From 1c707279596f3cffde9c97b450dcbef0b6ddbd94 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sun, 26 Feb 2023 17:12:03 -0600 Subject:
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On 2/16/23 10:38, Jacob Champion wrote: > Thanks for the nudge, v7 rebases over the configure conflict from 9244c11afe. I think/hope this is well-baked enough for a potential commit this CF, so I've adjusted the target version. Let me know if there are any concerns about the approach. Thanks! --Jacob
Re: Ability to reference other extensions by schema in extension scripts
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed I've applied the patch attached to message https://www.postgresql.org/message-id/000401d94bc8%2448dff700%24da9fe500%24%40pcorp.us (md5sum a7d45a32c054919d94cd4a26d7d07c20) onto the current tip of the master branch, commit 128dd9f9eca0b633b51ffcd5b0f798fbc48ec4c0. The review written in https://www.postgresql.org/message-id/20230228224608.ak7br5shev4wic5a%40c19 still fully applies. The `make installcheck-world` test fails for me but the failures seem unrelated to the patch (many occurrences of "+ERROR: function pg_input_error_info(unknown, unknown) does not exist" in the regression.diff). Documentation exists for the new feature. The new status of this patch is: Ready for Committer
Re: Beautify pg_walinspect docs a bit
On Tue, Feb 28, 2023 at 11:57:40AM -0800, Nathan Bossart wrote: > It looks like 58597ed accidentally added an "end_lsn" to the docs for > pg_get_wal_stats_till_end_of_wal(). Indeed. Fixed, thanks! -- Michael signature.asc Description: PGP signature
Re: [PoC] Let libpq reject unexpected authentication requests
On 2/16/23 10:57, Jacob Champion wrote: > v14 rebases over the test and solution conflicts from 9244c11afe2. Since we're to the final CF for PG16, here's a rough summary. This patchset provides two related features: 1) the ability for a client to explicitly allow or deny particular methods of in-band authentication (that is, things like password exchange), and 2) the ability to withhold a client certificate from a server that asks for it. Feature 1 was originally proposed to mitigate abuse where a successful MITM attack can then be used to fish for client credentials [1]. It also lets users disable undesirable authentication types (like plaintext) by default, which seems to be a common interest. Both features came up again in the context of proxies such as postgres_fdw, where it's sometimes important that users authenticate using only their credentials and not piggyback on the authority of the proxy host [2]. And another use case for feature 2 just came up independently [3], to fix connections where the default client certificate isn't valid for a particular server. Since this is all client-side, it's compatible with existing servers. Also since it's client-side, it can't prevent connections from being established by an eager server; it can only drop the connection once it sees that its requirement was not met, similar to how we handle target_session_attrs. That means it can't prevent a login trigger from being processed on behalf of a confused proxy. (I think that would require server-side support.) 0001 and 0002 are the core features. 0003 is a more future-looking refactoring of the internals, to make it easier to handle more SASL mechanisms, but it's not required and contains some unexercised code. Thanks, --Jacob [1] https://www.postgresql.org/message-id/fcc3ebeb7f05775b63f3207ed52a54ea5d17fb42.camel%40vmware.com [2] https://www.postgresql.org/message-id/20230123015255.h3jro3yyitlsqykp%40awork3.anarazel.de [3] https://www.postgresql.org/message-id/CAAWbhmh_QqCnRVV8ct3gJULReQjWxLTaTBqs%2BfV7c7FpH0zbew%40mail.gmail.com
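To give a flavor of the interface (the option names shown here, require_auth for feature 1 and sslcertmode for feature 2, are an assumption based on the patchset rather than a quote from it): a client that wants to allow only SCRAM and to withhold its default client certificate would use something along the lines of

  psql "host=example.org dbname=app sslmode=verify-full require_auth=scram-sha-256 sslcertmode=disable"

and a server that asks for anything else, e.g. a plaintext password, would be rejected by libpq before any credential is sent.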
Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL
Daniel Gustafsson writes: > The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around > SysCacheGetAttr where a NULL value triggers an elog(). +1, seems like a good idea. (I didn't review the patch details.) > This will reduce granularity of error messages, and as the patch sits now it > does so a lot since the message is left to work on - I wanted to see if this > was at all seen as a net positive before spending time on that part. I chose > an elog since I as a user would prefer to hit an elog instead of a silent keep > going with an assert, this is of course debateable. I'd venture that the Assert cases are mostly from laziness, and that once we centralize this it's plenty worthwhile to generate a decent elog message. You ought to be able to look up the table and column name from the info that is at hand. Also ... at least in assert-enabled builds, maybe we could check that the column being fetched this way is actually marked attnotnull? That would help to catch misuse. regards, tom lane
Re: Marking options deprecated in help output
> On 24 Feb 2023, at 21:31, Heikki Linnakangas wrote: > On 05/12/2022 11:42, Daniel Gustafsson wrote: > How about putting the deprecated option on a separate line like this: > >> -h, --host=HOSTNAMEdatabase server host or socket directory >> -H (same as -h, deprecated) If nothing else, it helps to keep it shorter and avoids breaking the line between command and description, so I agree with you. Done in the attached v2. -- Daniel Gustafsson deprecated_options_v2.diff Description: Binary data
Re: Doc update for pg_stat_statements normalization
> I am OK with an addition to the documentation to warn that one may > have to increase the maximum number of entries that can be stored if > seeing a non-normalized entry that should have been normalized. I agree. We introduce the concept of a plannable statement in a previous section and we can then make this distinction in the new paragraph. I also added a link to pg_stat_statements_info since that is introduced later on int the doc. Regards, -- Sami Imseih Amazon Web Services 0001-doc-update-regarding-pg_stat_statements-normalizatio.patch Description: 0001-doc-update-regarding-pg_stat_statements-normalizatio.patch
Re: Stale references to guc.c in comments/tests
Daniel Gustafsson writes: > Specifying the GUCs in question is a good idea, done in the attached. I'm not > sure the phrasing is spot-on though, but I can't think of a better one. If > you > can think of a better one I'm all ears. I'd just change "the definition of" to "the definitions of". LGTM otherwise. regards, tom lane
RE: Ability to reference other extensions by schema in extension scripts
> On Sun, Feb 26, 2023 at 01:39:24AM -0500, Regina Obe wrote: > > > > 1) Just don't allow any extensions referenced by other > > >extensions to be relocatable. > > > > Attached is my revision 3 patch, which follows the proposed #1. > > Don't allow schema relocation of an extension if another extension > > requires it. > > I've built a version of PostgreSQL with this patch applied and I confirm it > works as expected. > > The "ext1" is relocatable and creates a function ext1log(): > > =# create extension ext1 schema n1; > CREATE EXTENSION > > The "ext2" is relocatable and creates a function ext2log() relying on the > ext1log() function from "ext1" extension, referencing it via > @extschema:ext1@: > > =# create extension ext2 schema n2; > CREATE EXTENSION > =# select n2.ext2log('hello'); -- things work here > ext1: ext2: hello > > By creating "ext2", "ext1" becomes effectively non-relocatable: > > =# alter extension ext1 set schema n2; > ERROR: cannot SET SCHEMA of extension ext1 because other extensions > require it > DETAIL: extension ext2 requires extension ext1 > > Drop "ext2" makes "ext1" relocatable again: > > =# drop extension ext2; > DROP EXTENSION > =# alter extension ext1 set schema n2; > ALTER EXTENSION > > Upon re-creating "ext2" the referenced ext1 schema will be the correct one: > > =# create extension ext2 schema n1; > CREATE EXTENSION > =# select n1.ext2log('hello'); > ext1: ext2: hello > > The code itself builds w/out warnings with: > > mkdir build > cd build > ../configure > make 2> ERR # ERR is empty > > The testsuite reports all successes: > > make check > [...] > === >All 213 tests passed. > === > > Since I didn't see the tests for extension in there, I've also explicitly run that > portion: > > make -C src/test/modules/test_extensions/ check > [...] > test test_extensions ... ok 32 ms > test test_extdepend ... ok 12 ms > [...] > = >All 2 tests passed. > = > > > As mentioned already the downside of this patch is that it would not be > possibile to change the schema of an otherwise relocatable extension once > other extension depend on it, but I can't think of any good reason to allow > that, as it would mean dependent code would need to always dynamically > determine the install location of the objects in that extension, which sounds > dangerous, security wise. > > --strk; > > Libre GIS consultant/developer > https://strk.kbt.io/services.html Oops I had forgotten to submit the updated patch strk was testing against in my fork. He had asked me to clean up the warnings off list and the description. Attached is the revised. Thanks strk for the patient help and guidance. Thanks, Regina 0004-Allow-use-of-extschema-reqextname-to-reference.patch Description: Binary data
Re: Stale references to guc.c in comments/tests
> On 27 Feb 2023, at 17:59, Tom Lane wrote: > The grammar is a bit off ("the GUC definition" would read better), > but really I think the wording was vague already and we should tighten > it up. Can we specify exactly which GUC variable(s) we're talking about? Specifying the GUCs in question is a good idea, done in the attached. I'm not sure the phrasing is spot-on though, but I can't think of a better one. If you can think of a better one I'm all ears. -- Daniel Gustafsson v3-0001-Fix-outdated-references-to-guc.c.patch Description: Binary data
Re: RFC: logical publication via inheritance root?
Hi, I'm going to register this in CF for feedback. Summary for potential reviewers: we don't use declarative partitions in the Timescale partitioning scheme, but it'd be really nice to be able to replicate between our tables and standard tables, or between two Timescale-partitioned tables with different layouts. This patch lets extensions (or savvy users) upgrade an existing inheritance relationship between two tables into a "logical partition" relationship, so that they can be handled with the publish_via_partition_root machinery. I hope this might also help pg_partman users migrate between old- and new-style partition schemes, but that's speculation. On 1/20/23 09:53, Jacob Champion wrote: >> 2) While this strategy works well for ongoing replication, it's not >> enough to get the initial synchronization correct. The subscriber >> still does a COPY of the root table directly, missing out on all the >> logical descendant data. The publisher will have to tell the >> subscriber about the relationship somehow, and older subscriber >> versions won't understand how to use that (similar to how old >> subscribers can't correctly handle row filters). > > I partially solved this by having the subscriber pull the logical > hierarchy from the publisher to figure out which tables to COPY. This > works when publish_via_partition_root=true, but it doesn't correctly > return to the previous behavior when the setting is false. I need to > check the publication setting from the subscriber, too, but that opens > up the question of what to do if two different publications conflict. Second draft attached, which fixes that bug. I kept thinking to myself that this would be much easier if the publisher told the subscriber what data to copy rather than having the subscriber hardcode the initial sync process... and then I realized that I could, sort of, move in that direction. This version adds a SQL function to determine the list of source tables to COPY into a subscriber's target table. Now the publisher can make use of whatever catalogs it needs to make that list and the subscriber doesn't need to couple to them. (This could also provide a way for publishers to provide more generic "table indirection" in the future, but I'm wary of selling genericism as a feature here.) I haven't solved the problem where two publications of the same table have different settings for publish_via_partition_root. I was curious to see how the existing partition code prevented problems, but I'm not really sure that it does... Here are some situations where the existing implementation duplicates data on the initial sync: 1) A single subscription to two publications, one with publish_via_partition_root on and the other off, which publish the same partitioned table 2) A single subscription to two publications with publish_via_partition_root on, one of which publishes a root partition and the other of which publishes a descendant/leaf 3) A single subscription to two publications with publish_via_partition_root on, one of which publishes FOR ALL TABLES and the other of which publishes a descendant/leaf Is it expected that DBAs should avoid these cases, or are they worth pursuing with a bug fix? Thanks, --JacobFrom 6d31d13f9d7c400cf717426c4f336f7ad19a4391 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Mon, 26 Sep 2022 13:23:51 -0700 Subject: [PATCH] WIP: introduce pg_set_logical_root for use with pubviaroot Allows regular inherited tables to be published via their root table, just like partitions. 
This works by hijacking pg_inherit's inhseqno column, and replacing a (single) existing entry for the child with the value zero, indicating that it should be treated as a logical partition by the publication machinery. Initial sync works by asking the publisher for a list of logical descendants of the published table, then COPYing them one-by-one into the root. The publisher reuses the existing pubviaroot logic, adding the new logical roots to code that previously looked only for partition roots. Known bugs/TODOs: - The pg_inherits machinery doesn't prohibit changes to inheritance after an entry has been marked as a logical root. - I haven't given any thought to interactions with row filters, or to column lists, or to multiple publications with conflicting pubviaroot settings. - pg_set_logical_root() doesn't check for table ownership yet. Anyone can muck with pg_inherits through it. - I'm not sure that I'm taking all the necessary locks yet, and those I do take may be taken in the wrong order. --- src/backend/catalog/pg_inherits.c | 202 - src/backend/catalog/pg_publication.c| 120 +- src/backend/commands/publicationcmds.c | 10 + src/backend/partitioning/partdesc.c | 3 +- src/backend/replication/logical/tablesync.c | 235 +-- src/backend/replication/pgoutput/pgoutput.c | 54 ++--- src/include/catalog/pg_inherits.h | 7 +-
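To make the intended usage a bit more concrete, here is a sketch (the argument order of pg_set_logical_root() is assumed for illustration, not quoted from the patch):

  -- plain inheritance, as Timescale/pg_partman-style setups use
  CREATE TABLE metrics (ts timestamptz, value float8);
  CREATE TABLE metrics_2023 () INHERITS (metrics);

  -- upgrade the inheritance link into a "logical partition" relationship
  SELECT pg_set_logical_root('metrics_2023', 'metrics');

  -- from here on the existing machinery applies
  CREATE PUBLICATION pub FOR TABLE metrics
    WITH (publish_via_partition_root = true);

With that in place, changes to metrics_2023 are published as if they were changes to metrics, and the initial sync COPYs the descendant's data into the subscriber's root table.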
Re: Ability to reference other extensions by schema in extension scripts
On Sun, Feb 26, 2023 at 01:39:24AM -0500, Regina Obe wrote: > > 1) Just don't allow any extensions referenced by other > >extensions to be relocatable. > > Attached is my revision 3 patch, which follows the proposed #1. > Don't allow schema relocation of an extension if another extension > requires it. I've built a version of PostgreSQL with this patch applied and I confirm it works as expected. "ext1" is relocatable and creates a function ext1log(): =# create extension ext1 schema n1; CREATE EXTENSION "ext2" is relocatable and creates a function ext2log() relying on the ext1log() function from the "ext1" extension, referencing it via @extschema:ext1@: =# create extension ext2 schema n2; CREATE EXTENSION =# select n2.ext2log('hello'); -- things work here ext1: ext2: hello By creating "ext2", "ext1" becomes effectively non-relocatable: =# alter extension ext1 set schema n2; ERROR: cannot SET SCHEMA of extension ext1 because other extensions require it DETAIL: extension ext2 requires extension ext1 Dropping "ext2" makes "ext1" relocatable again: =# drop extension ext2; DROP EXTENSION =# alter extension ext1 set schema n2; ALTER EXTENSION Upon re-creating "ext2" the referenced ext1 schema will be the correct one: =# create extension ext2 schema n1; CREATE EXTENSION =# select n1.ext2log('hello'); ext1: ext2: hello The code itself builds w/out warnings with: mkdir build cd build ../configure make 2> ERR # ERR is empty The testsuite reports all successes: make check [...] === All 213 tests passed. === Since I didn't see the tests for extensions in there, I've also explicitly run that portion: make -C src/test/modules/test_extensions/ check [...] test test_extensions ... ok 32 ms test test_extdepend ... ok 12 ms [...] = All 2 tests passed. = As mentioned already, the downside of this patch is that it would not be possible to change the schema of an otherwise relocatable extension once other extensions depend on it, but I can't think of any good reason to allow that, as it would mean dependent code would need to always dynamically determine the install location of the objects in that extension, which sounds dangerous, security-wise. --strk; Libre GIS consultant/developer https://strk.kbt.io/services.html
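For reference, the two test extensions behave roughly like this (a minimal sketch, not the exact files used): ext1 installs ext1log(), and ext2's script calls it through the new @extschema:ext1@ syntax, so the reference is resolved to ext1's install schema at CREATE EXTENSION time:

  -- ext1--1.0.sql (sketch)
  CREATE FUNCTION ext1log(msg text) RETURNS text
    LANGUAGE sql AS $$ SELECT 'ext1: ' || msg $$;

  -- ext2--1.0.sql (sketch; ext2.control has requires = 'ext1')
  CREATE FUNCTION ext2log(msg text) RETURNS text
    LANGUAGE sql AS $$ SELECT @extschema:ext1@.ext1log('ext2: ' || msg) $$;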
Re: Ability to reference other extensions by schema in extension scripts
On Sat, Feb 25, 2023 at 03:40:24PM -0500, Regina Obe wrote: > > On Mon, Feb 06, 2023 at 05:19:39AM -0500, Regina Obe wrote: > > > > I was thinking: how about using the "refobjsubid" to encode the "level" of > > dependency on an extension ? Right now "refobjsubid" is always 0 when the > > referenced object is an extension. > > Could we consider subid=1 to mean the dependency is not only on the > > extension but ALSO on it's schema location ? > > I like that idea. It's only been ever used for tables I think, but I don't > see why it wouldn't apply in this case as the concept is kinda the same. > Only concern if other parts rely on this being 0. This has to be verified, yes. But it feels to me like "must be 0" was mostly to _allow_ for future extensions like the proposed one. > The other question, should this just update the existing DEPENDENCY_NORMAL > extension or add a new DEPENDENCY_NORMAL between the extensions with > subid=1? I'd use the existing record. --strk; Libre GIS consultant/developer https://strk.kbt.io/services.html
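For illustration, the rows in question today look like this (refobjsubid is currently always 0 for extension-on-extension dependencies; under the idea above it would be 1 when the dependent extension also baked the referenced extension's schema into its objects):

  -- sketch: list extension-level dependencies and their refobjsubid
  SELECT e.extname  AS dependent_extension,
         re.extname AS referenced_extension,
         d.refobjsubid
  FROM pg_depend d
  JOIN pg_extension e  ON d.classid = 'pg_extension'::regclass AND d.objid = e.oid
  JOIN pg_extension re ON d.refclassid = 'pg_extension'::regclass AND d.refobjid = re.oid
  WHERE d.deptype = 'n';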
Making empty Bitmapsets always be NULL
When I designed the Bitmapset module, I set things up so that an empty Bitmapset could be represented either by a NULL pointer, or by an allocated object all of whose bits are zero. I've recently come to the conclusion that that was a bad idea and we should instead have a convention like the longstanding invariant for Lists: an empty list is represented by NIL and nothing else. To do this, we need to fix bms_intersect, bms_difference, and a couple of other functions to check for having produced an empty result; but then we can replace bms_is_empty() by a simple NULL test. I originally guessed that that would be a bad tradeoff, but now I think it likely is a win performance-wise, because we call bms_is_empty() many more times than those other functions put together. However, any performance gain would likely be marginal; the real reason why I'm pushing this is that we have various places that have hand-implemented a rule about "this Bitmapset variable must be exactly NULL if empty", so that they can use checks-for-null in place of bms_is_empty() calls in particularly hot code paths. That is a really fragile, mistake-prone way to do things, and I'm surprised that we've seldom been bitten by it. It's not well documented at all which variables have this property, so you can't readily tell which code might be violating those conventions. So basically I'd like to establish that convention everywhere and get rid of these ad-hoc reduce-to-NULL checks. I put together the attached draft patch to do so. I've not done any hard performance testing on it --- I did do one benchmark that showed maybe 0.8% speedup, but I'd regard that as below the noise. I found just a few places that have issues with this idea. One thing that is problematic is bms_first_member(): assuming you allow it to loop to completion, it ends with the passed Bitmapset being empty, which is now an invariant violation. I made it pfree the argument at that point, and fixed a couple of callers that would be broken thereby; but I wonder if it wouldn't be better to get rid of that function entirely and convert all its callers to use bms_next_member. There are only about half a dozen. I also discovered that nodeAppend.c is relying on bms_del_members not reducing a non-empty set to NULL, because it uses the nullness of appendstate->as_valid_subplans as a state boolean. That was probably acceptable when it was written, but whoever added classify_matching_subplans made a hash of the state invariants here, because that can set as_valid_subplans to empty. I added a separate boolean as an easy way out, but maybe that code could do with a more thorough revisit. I'll add this to the about-to-start CF. regards, tom lane diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 7eb79cee58..a5d74cc462 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -3046,6 +3046,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, modified_attrs = HeapDetermineColumnsInfo(relation, interesting_attrs, id_attrs, , newtup, _has_external); + interesting_attrs = NULL; /* don't try to free it again */ /* * If we're not updating any "key" column, we can grab a weaker lock type. @@ -3860,8 +3861,9 @@ heap_attr_equals(TupleDesc tupdesc, int attrnum, Datum value1, Datum value2, * listed as interesting) of the old tuple is a member of external_cols and is * stored externally. 
* - * The input interesting_cols bitmapset is destructively modified; that is OK - * since this is invoked at most once in heap_update. + * The input interesting_cols bitmapset is destructively modified, and freed + * before we return; that is OK since this is invoked at most once in + * heap_update. */ static Bitmapset * HeapDetermineColumnsInfo(Relation relation, diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c index c33a3c0bec..cfa95a07e4 100644 --- a/src/backend/executor/execUtils.c +++ b/src/backend/executor/execUtils.c @@ -877,15 +877,7 @@ UpdateChangedParamSet(PlanState *node, Bitmapset *newchg) * include anything else into its chgParam set. */ parmset = bms_intersect(node->plan->allParam, newchg); - - /* - * Keep node->chgParam == NULL if there's not actually any members; this - * allows the simplest possible tests in executor node files. - */ - if (!bms_is_empty(parmset)) - node->chgParam = bms_join(node->chgParam, parmset); - else - bms_free(parmset); + node->chgParam = bms_join(node->chgParam, parmset); } /* diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 20d23696a5..51a30ddf65 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -1642,7 +1642,7 @@ find_hash_columns(AggState *aggstate) perhash->hashGrpColIdxHash[i] = i + 1; perhash->numhashGrpCols++; /* delete already mapped columns */ - bms_del_member(colnos,
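(Re the bms_first_member() question above: for reference, the mechanical conversion for those half-dozen callers would look roughly like this sketch, where do_something() stands in for whatever the loop body does:

  /* before: destructively consumes the set as it iterates */
  while ((x = bms_first_member(columns)) >= 0)
      do_something(x);

  /* after: iterate without modifying the set */
  x = -1;
  while ((x = bms_next_member(columns, x)) >= 0)
      do_something(x);

so the only real cost of dropping bms_first_member is carrying the extra iterator variable.)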
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
On Wed, Mar 1, 2023 at 12:03 AM Gregory Stark wrote: > It looks like this patch received some feedback from Andres and hasn't > had any further work posted. I'm going to move it to "Waiting on > Author". I'll post the updated version in the next couple of days. > It doesn't sound like this is likely to get committed this release > cycle unless responding to Andres's points simpler than I expect. I wouldn't think ahead that much. -- Regards, Alexander Korotkov
Re: Non-superuser subscription owners
On Tue, 2023-02-28 at 08:37 -0500, Robert Haas wrote: > The existing SECURITY_RESTRICTED_OPERATION flag basically prevents > you > from tinkering with the session state. Currently, every time we set that flag we also run all the code as the table owner. You're suggesting using the SECURITY_RESTRICTED_OPERATION flag, along with the new security flags, but not switch to the table owner, right? > If we also had a similar flags > like DATABASE_READS_PROHIBITED and DATABASE_WRITES_PROHIBITED (or > just > a combined DATABASE_ACCESS_PROHIBITED flag) I think that would be > pretty close to what we need. The idea would be that, when a user > executes a function or procedure Or default expressions, I presume. If we at least agree on this point, then I think we should try to find a way to treat these other hunks of code in a secure way (which I think is what Andres was suggesting). > owned by a user that they don't trust > completely, we'd set > SECURITY_RESTRICTED_OPERATION|DATABASE_READS_PROHIBITED|DATABASE_WRIT > ES_PROHIBITED. It seems like you're saying to basically just keep the user ID the same, and maybe keep USAGE privileges, but not be able to do anything else? Might be useful. Kind of like running it as a nobody user but without the problems you mentioned. Some details to think about, I'm sure. > And we could provide a user with a way to express the degree of trust > they have in some other user or perhaps even some specific function, > e.g. > > SET trusted_roles='alice:read'; > > ...could mean that I trust alice to read from the database with my > permissions, should I happen to run code provided by her in SECURITY > INVOKER modacke. I'm not very excited about inventing a new privilege language inside a GUC, but perhaps a simpler form could be a reasonable mitigation (or at least a starting place). > I'm sure there's some details to sort out here, e.g. around security > related to the trusted_roles GUC itself. But I don't really see a > fundamental problem. We can invent arbitrary flags that prohibit > classes of operations that are of concern, set them by default in > cases where concern is justified, and then give users who want the > current behavior some kind of escape hatch that causes those flags to > not get set after all. Not only does such a solution not seem > impossible, I can possibly even imagine back-patching it, depending > on > exactly what the shape of the final solution is, how important we > think it is to get a fix out there, and how brave I'm feeling that > day. Unless the trusted roles defaults to '*', then I think it will still break some things. One of my key tests for user-facing proposals is whether the documentation will be reasonable or not. Most of these proposals to make SECURITY INVOKER less bad fail that test. Each of these ideas and sub-ideas affect the semantics, and should be documented. But how do we document that some code runs as you, some as the person who wrote it, sometimes we obey SECURITY INVOKER and sometimes we ignore it and use DEFINER semantics, some code is outside a function and always executes as the invoker, some code has some security flags, and some code has more security flags, code can change between the time you look at it and the time it runs, and it's all filtered through GUCs with their own privilege sub-language? OK, let's assume that we have all of that documented, then how do we guide users on what reasonable best practices are for the GUC settings, etc.? 
Or do we just say "this is mechanically how all these parts work, good luck assembling it into a secure system!". [ Note: I feel like this is the state we are in now. Even if technically we don't have live security bugs that I'm aware of, we are setting users up for security problems. ] On the other hand, if we focus on executing code as the user who wrote it in most places, then the documentation will be something like: "you defined the table, you wrote the code, it runs as you, here are some best practices for writing secure code". And we have some different documentation for writing a cool SECURITY INVOKER function and how to get other users to trust you enough to run it. That sounds a LOT more understandable for users. Regards, Jeff Davis
Re: Non-superuser subscription owners
On Tue, 2023-02-28 at 11:28 -0800, Andres Freund wrote: > I can only repeat myself in stating that SECURITY DEFINER solves none > of the > relevant issues. I included several examples of why it doesn't in the > recent > thread about "blocking SECURITY INVOKER". E.g. that default arguments > of > SECDEF functions are evaluated with the current user's privileges, > not the > function owner's privs: > > https://postgr.es/m/20230113032943.iyxdu7bnxe4cmbld%40awork3.anarazel.de I was speaking a bit loosely, using "SECURITY DEFINER" to mean the semantics of executing code as the one who wrote it. I didn't specifically mean the function marker, because as you pointed out in the other thread, that's not enough. From your email it looks like there is still a path forward: "The proposal to not trust any expressions controlled by untrusted users at least allows to prevent execution of code, even if it doesn't provide a way to execute the code in a safe manner. Given that we don't have the former, it seems foolish to shoot for the latter." And later: "I think the combination of a) a setting that restricts evaluation of any non-trusted expressions, independent of the origin b) an easy way to execute arbitrary statements within SECURITY_RESTRICTED_OPERATION" My takeaway from that thread was that we need a mechanism to deal with non-function code (e.g. default expressions) first; but once we have that, it opens up the design space to better solutions or at least mitigations. Is that right? Regards, Jeff Davis
Re: WIN32 pg_import_system_collations
On 2023-02-28 Tu 11:40, Juan José Santamaría Flecha wrote: On Tue, Feb 28, 2023 at 12:55 PM Andrew Dunstan wrote: On 2023-02-27 Mo 17:20, Juan José Santamaría Flecha wrote: Hmm, yeah. I'm not sure I understand the point of this test anyway: SELECT to_char(date '2010-03-01', 'DD TMMON ' COLLATE "de_DE"); Uhm, they probably don't make much sense except for "tr_TR", so I'm fine with removing them. PFA a patch for so. I think you missed my point, which was that the COLLATE clause above seemed particularly pointless. But I agree that all these are not much use, so I'll remove them as you suggest. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Refactoring SysCacheGetAttr to know when attr cannot be NULL
Today we have two fairly common patterns around extracting an attr from a cached tuple:

    a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull);
    Assert(!isnull);

    a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull);
    if (isnull)
        elog(ERROR, "..");

The error messages in the elog() cases also vary quite a lot. I've been unable to find much in terms of guidelines for when to use an elog or an Assert, with the likelihood of a NULL value seemingly being the guiding principle (but not in all cases IIUC). The attached refactoring introduces SysCacheGetAttrNotNull as a wrapper around SysCacheGetAttr where a NULL value triggers an elog(). This removes a lot of boilerplate error handling which IMO leads to increased readability, as the error handling *in these cases* doesn't add much (there are other cases where checking isnull does a lot of valuable work of course). Personally I much prefer the error-out-automatically style of APIs, like how palloc saves a ton of checking the returned allocation for null; this aims at providing a similar abstraction. This will reduce the granularity of error messages, and as the patch sits now it does so a lot since the message is left to work on - I wanted to see if this was at all seen as a net positive before spending time on that part. I chose an elog since I as a user would prefer to hit an elog instead of silently carrying on with an assert, but this is of course debatable. Thoughts? -- Daniel Gustafsson 0001-Add-SysCacheGetAttrNotNull-for-guarnteed-not-null-at.patch Description: Binary data
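[Editorial note: a minimal sketch of what such a wrapper could look like, for illustration only - the attached patch is the authoritative version, and the exact error text here is a placeholder.]

    Datum
    SysCacheGetAttrNotNull(int cacheId, HeapTuple tup, AttrNumber attributeNumber)
    {
        bool        isnull;
        Datum       attr;

        attr = SysCacheGetAttr(cacheId, tup, attributeNumber, &isnull);
        if (isnull)
            elog(ERROR, "unexpected NULL value in cached tuple attribute %d",
                 attributeNumber);

        return attr;
    }

Callers then shrink to a single line, e.g. a = SysCacheGetAttrNotNull(OID, tuple, Anum_pg_foo_bar); which is the readability win described above.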
Re: Beautify pg_walinspect docs a bit
It looks like 58597ed accidentally added an "end_lsn" to the docs for pg_get_wal_stats_till_end_of_wal(). diff --git a/doc/src/sgml/pgwalinspect.sgml b/doc/src/sgml/pgwalinspect.sgml index 22677e54f2..3d7cdb95cc 100644 --- a/doc/src/sgml/pgwalinspect.sgml +++ b/doc/src/sgml/pgwalinspect.sgml @@ -174,7 +174,7 @@ combined_size_percentage | 2.8634072910530795 - pg_get_wal_stats_till_end_of_wal(start_lsn pg_lsn, end_lsn pg_lsn, per_record boolean DEFAULT false) + pg_get_wal_stats_till_end_of_wal(start_lsn pg_lsn, per_record boolean DEFAULT false) returns setof record -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Memory leak from ExecutorState context?
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote: > Hello all, > > A customer is facing out of memory query which looks similar to this > situation: > > > https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b > > This PostgreSQL version is 11.18. Some settings: > > * shared_buffers: 8GB > * work_mem: 64MB > * effective_cache_size: 24GB > * random/seq_page_cost are by default > * physical memory: 32GB > > The query is really large and actually update kind of a materialized view. > > The customer records the plans of this query on a regular basis. The explain > analyze of this query before running out of memory was: > > https://explain.depesz.com/s/sGOH > > The customer is aware he should rewrite this query to optimize it, but it's a > long time process he can not start immediately. To make it run in the > meantime, > he actually removed the top CTE to a dedicated table. According to their > experience, it's not the first time they had to split a query this way to make > it work. > > I've been able to run this query on a standby myself. I've "call > MemoryContextStats(TopMemoryContext)" every 10s on a run, see the data parsed > (best view with "less -S") and the graph associated with it in attachment. It > shows: > > * HashBatchContext goes up to 1441MB after 240s then stay flat until the end > (400s as the last record) That's interesting. We're using HashBatchContext for very few things, so what could it consume so much memory? But e.g. the number of buckets should be limited by work_mem, so how could it get to 1.4GB? Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets and print how many batches/butches are there? > * ALL other context are stable before 240s, but ExecutorState > * ExecutorState keeps rising up to 13GB with no interruption until the memory > exhaustion > > I did another run with interactive gdb session (see the messy log session in > attachment, for what it worth). Looking at some backtraces during the memory > inflation close to the end of the query, all of them were having these frames > in > common: > > [...] > #6 0x00621ffc in ExecHashJoinImpl (parallel=false, > pstate=0x31a3378) > at nodeHashjoin.c:398 [...] > > ...which is not really helpful but at least, it seems to come from a hash join > node or some other hash related code. See the gdb session log for more > details. > After the out of mem, pmap of this process shows: > > 430: postgres: postgres [local] EXPLAIN > Address Kbytes RSS Dirty Mode Mapping > [...] > 02c5e000 13719620 8062376 8062376 rw--- [ anon ] > [...] > > Is it usual a backend is requesting such large memory size (13 GB) and > actually use less of 60% of it (7.7GB of RSS)? > No idea. Interpreting this info is pretty tricky, in my experience. It might mean the memory is no longer used but sbrk couldn't return it to the OS yet, or something like that. > Sadly, the database is 1.5TB large and I can not test on a newer major > version. > I did not try to check how large would be the required data set to reproduce > this, but it moves 10s of million of rows from multiple tables anyway... > > Any idea? How could I help to have a better idea if a leak is actually > occurring and where exactly? > Investigating memory leaks is tough, especially for generic memory contexts like ExecutorState :-( Even more so when you can't reproduce it on a machine with custom builds. 
What I'd try is this:

1) attach breakpoints to all returns in AllocSetAlloc(), printing the pointer and size for ExecutorState context, so something like

   break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
   commands
     print MemoryChunkGetPointer(chunk) size
     cont
   end

2) do the same for AllocSetFree()

3) Match the palloc/pfree calls (using the pointer address), to determine which ones are not freed and do some stats on the size. Usually there's only a couple distinct sizes that account for most of the leaked memory.

4) Break AllocSetAlloc on those leaked sizes, to determine where the calls come from. This usually gives enough info about the leak or at least allows focusing the investigation to a particular area of code.

regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Auth extensions, with an LDAP/SCRAM example [was: Proposal: Support custom authentication methods using hooks]
On Mon, Feb 27, 2023 at 12:43 PM Stephen Frost wrote: > * Jacob Champion (jchamp...@timescale.com) wrote: > > This patchset should ideally have required zero client side changes, but > > because our SCRAM implementation is slightly nonstandard too -- it > > doesn't embed the username into the SCRAM data -- libpq can't talk to > > the OpenLDAP/Cyrus SASL implementation. I included a quick-and-bad patch > > to fix it in libpq; that needs its own conversation. > > Indeed it does... as there were specific reasons that what we > implemented for PG's SCRAM doesn't embed the username into the SCRAM > data and my recollection is that we don't because SCRAM (or maybe SASL? > either way though...) requires it to be utf-8 encoded and we support a > lot of other encoding that aren't that, so it wouldn't work for a lot > of our users. Yes. Interoperable authentication is going to have to solve those sorts of problems somehow, though. And there are a bunch of levers to pull: clients aren't required to run their sent usernames through SASLprep; our existing servers don't mind having it sent, so we could potentially backport a client change; and then after that it's a game of balancing compatibility and compliance. A deeper conversation for sure. > > I think this is exactly the sort of thing that's too niche to be in core > > but might be extremely useful for the few people it applies to, so I'm > > proposing it as an argument in favor of general authn/z extensibility. > > If it was able to actually work for our users (and maybe it is if we > make it optional somehow) and we have users who want it (which certainly > seems plausible) then I'd say that it's something we would benefit from > having in core. While it wouldn't solve all the issues with 'ldap' > auth, if it works with AD's LDAP servers and doesn't require the > password to be passed from the client to the server (in any of this, be > the client psql, pgadmin, or the PG server when it goes to talk to the > LDAP server..) then that would be a fantastic thing and we could > replace the existing 'ldap' auth with that and be *much* better off for > it, and so would our users. I think the argument you're making here boils down to "if it's not niche, it should be in core", and I agree with that general sentiment. But not all of the prerequisites you stated are met. I see no evidence that Active Directory supports SCRAM [1], for instance, unless the MS documentation is just omitting it. Even if it were applicable to every single LDAP deployment, I'd still like users to be able to choose the best authentication available in their setups without first having to convince the community. They can maintain the things that make sense for them, like they do with extensions. And the authors can iterate on better authentication out of cycle, like extension authors do. --Jacob [1] https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-adts/e7d814a5-4cb5-4b0d-b408-09d79988b550
Re: Non-superuser subscription owners
Hi, On 2023-02-22 09:18:34 -0800, Jeff Davis wrote: > I can't resist mentioning that these are all SECURITY INVOKER problems. > SECURITY INVOKER is insecure unless the invoker absolutely trusts the > definer, and that only really makes sense if the definer is a superuser > (or something very close). That's why we keep adding exceptions with > SECURITY_RESTRICTED_OPERATION, which is really just a way to silently > ignore the SECURITY INVOKER label and use SECURITY DEFINER instead. > > At some point we need to ask: "when is SECURITY INVOKER both safe and > useful?" and contain it to those cases, rather than silently ignoring > it in an expanding list of cases. I can only repeat myself in stating that SECURITY DEFINER solves none of the relevant issues. I included several examples of why it doesn't in the recent thread about "blocking SECURITY INVOKER". E.g. that default arguments of SECDEF functions are evaluated with the current user's privileges, not the function owner's privs: https://postgr.es/m/20230113032943.iyxdu7bnxe4cmbld%40awork3.anarazel.de Greetings, Andres Freund
Re: Experiments with Postgres and SSL
On Tue, Feb 28, 2023 at 10:33 AM Greg Stark wrote: > On Wed, 22 Feb 2023 at 07:27, Heikki Linnakangas wrote: > > Good idea. Do we want to just require the protocol to be "postgres", or > > perhaps "postgres/3.0"? Need to register that with IANA, I guess. > > I had never heard of this before, it does seem useful. But if I > understand it right it's entirely independent of this patch. It can be. If you want to use it in the strongest possible way, though, you'd have to require its use by clients. Introducing that requirement later would break existing ones, so I think it makes sense to do it at the same time as the initial implementation, if there's interest. > We can > add it to all our Client/Server exchanges whether they're the initial > direct SSL connection or the STARTTLS negotiation? I'm not sure it would buy you anything during the STARTTLS-style opening. You already know what protocol you're speaking in that case. (So with the ALPACA example, the damage is already done.) Thanks, --Jacob
Re: Memory leak from ExecutorState context?
On 2/28/23 19:25, Justin Pryzby wrote: > On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote: >> Hello all, >> >> A customer is facing out of memory query which looks similar to this >> situation: >> >> >> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b >> >> This PostgreSQL version is 11.18. Some settings: > > hash joins could exceed work_mem until v13: > > |Allow hash aggregation to use disk storage for large aggregation result > |sets (Jeff Davis) > | That's hash aggregate, not hash join. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Commitfest 2023-03 starting tomorrow!
So I'm not sure if I'll be CFM this month but I'm assuming I will be at this point. Regardless, Commitfest 2023-03 starts tomorrow! So this is a good time to check your submitted patches and ensure they're actually building and don't need a rebase. Take a look at http://cfbot.cputube.org/ for example. I'll do an initial pass marking anything red here as Waiting on Author. The next pass would be to grab any patches not marked Ready for Committer and if they look like they'll need more than one round of feedback and a couple weeks to polish they'll probably get bounced to the next commitfest too. It sucks not getting feedback on your patches for so long but there are really just sooo many patches and so few eyeballs... It would be great if people could do initial reviews of these patches before we bounce them because it really is discouraging for developers to send patches and not get feedback. But realistically it's going to happen to a lot of patches. -- greg
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi, On 2023-02-25 16:00:05 +0530, Amit Kapila wrote: > On Tue, Feb 21, 2023 at 7:55 PM Önder Kalacı wrote: > >> I think this overhead seems to be mostly due to the need to perform >> tuples_equal multiple times for duplicate values. I think more work needs to be done to determine the source of the overhead. It's not clear to me why there'd be an increase in tuples_equal() calls in the tests upthread. > Wouldn't a table-level option like 'apply_index_scan' be better than a > subscription-level option with a default value as false? Anyway, the > bigger point is that we don't see a better way to proceed here than to > introduce some option to control this behavior. I don't think this should default to false. The quadratic apply performance that the sequential scans cause is a much bigger hazard for users than some apply performance regression. > I see this as a way to provide this feature for users but I would > prefer to proceed with this if we can get some more buy-in from senior > community members (at least one more committer) and some user(s) if > possible. So, I once again request others to chime in and share their > opinion. I'd prefer not having an option, because we figure out the cause of the performance regression (reducing it to be small enough to not care). After that, an option defaulting to using indexes. I don't think an option defaulting to false makes sense. I don't care whether it's a subscription- or relation-level option. Greetings, Andres Freund
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy wrote: > +/* + * Guts of XLogReadFromBuffers(). + * + * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL + * fetched WAL buffers on timeline 'tli' and return the read bytes. + */ s/fetched WAL buffers/fetched from WAL buffers + else if (nread < nbytes) + { + /* + * We read some of the requested bytes. Continue to read remaining + * bytes. + */ + ptr += nread; + nbytes -= nread; + dst += nread; + *read_bytes += nread; + } The 'if' condition should always be true. You can replace the same with an assertion instead. s/Continue to read remaining/Continue to read the remaining The good thing about this patch is that it reduces read IO calls without impacting the write performance (at least not that noticeable). It also takes us one step forward towards the enhancements mentioned in the thread. If performance is a concern, we can introduce a GUC to enable/disable this feature. -- Thanks & Regards, Kuntal Ghosh
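[Editorial note: to make the second suggestion concrete, the branch could be restructured roughly like this - a sketch using the variable names from the patch hunk quoted above, not the final code.]

    else
    {
        /*
         * We read only some of the requested bytes; continue with the
         * remaining bytes.
         */
        Assert(nread < nbytes);

        ptr += nread;
        nbytes -= nread;
        dst += nread;
        *read_bytes += nread;
    }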
Re: Experiments with Postgres and SSL
On Wed, 22 Feb 2023 at 07:27, Heikki Linnakangas wrote: > > On 20/01/2023 03:28, Jacob Champion wrote: > > On Wed, Jan 18, 2023 at 7:16 PM Greg Stark wrote: > >> * "Service Mesh" type tools that hide multiple services behind a > >> single host/port ("Service Mesh" is just a new buzzword for "proxy"). > > > > If you want to multiplex protocols on a port, now is an excellent time > > to require clients to use ALPN on implicit-TLS connections. (There are > > no clients that can currently connect via implicit TLS, so you'll > > never have another chance to force the issue without breaking > > backwards compatibility.) That should hopefully make it harder to > > ALPACA yourself or others [2]. > > Good idea. Do we want to just require the protocol to be "postgres", or > perhaps "postgres/3.0"? Need to register that with IANA, I guess. I had never heard of this before, it does seem useful. But if I understand it right it's entirely independent of this patch. We can add it to all our Client/Server exchanges whether they're the initial direct SSL connection or the STARTTLS negotiation? > We implemented a protocol version negotiation mechanism in the libpq > protocol itself, how would this interact with it? If it's just > "postgres", then I guess we'd still negotiate the protocol version and > list of extensions after the TLS handshake. > > >> Incidentally I find the logic in ProcessStartupPacket incredibly > >> confusing. It took me a while before I realized it's using tail > >> recursion to implement the startup logic. I think it would be way more > >> straightforward and extensible if it used a much more common iterative > >> style. I think it would make it possible to keep more state than just > >> ssl_done and gss_done without changing the function signature every > >> time for example. > > > > +1. The complexity of the startup logic, both client- and server-side, > > is a big reason why I want implicit TLS in the first place. That way, > > bugs in that code can't be exploited before the TLS handshake > > completes. > > +1. We need to support explicit TLS for a long time, so we can't > simplify by just removing it. But let's refactor the code somehow, to > make it more clear. > > Looking at the patch, I think it accepts an SSLRequest packet even if > implicit TLS has already been established. That's surely wrong, and > shows how confusing the code is. (Or I'm reading it incorrectly, which > also shows how confusing it is :-) ) I'll double check it but I think I tested that that wasn't the case. I think it accepts the SSL request packet and sends back an N which the client libpq just interprets as the server not supporting SSL and does an unencrypted connection (which is tunneled over stunnel unbeknownst to libpq). I agree I would want to flatten this logic to an iterative approach but having wrapped my head around it now I'm not necessarily rushing to do it now. The main advantage of flattening it would be to make it easy to support other protocol types which I think could be really interesting. It would be much clearer to document the state machine if all the state is in one place and the code just loops through processing startup packets and going to a new state until the connection is established. That's true now but you have to understand how the state is passed in the function parameters and notice that all the recursion is tail recursive (I think). And extending that state would require extending the function signature which would get awkward quickly. 
> Regarding Vladimir's comments on how clients can migrate to this, I > don't have any great suggestions. To summarize, there are several options: > > - Add an "fast_tls" option that the user can enable if they know the > server supports it > > - First connect in old-fashioned way, and remember the server version. > Later, if you reconnect to the same server, use implicit TLS if the > server version was high enough. This would be most useful for connection > pools. Vladimir pointed out that this doesn't necessarily work. The server may be new enough to support it but it could be behind a proxy like pgbouncer or something. The same would be true if the server reported a "connection option" instead of depending on version. > - Connect both ways at the same time, and continue with the fastest, > i.e. "happy eyeballs" That seems way too complex for us to bother with imho. > - Try implicit TLS first, and fall back to explicit TLS if it fails. > For libpq, we don't necessarily need to do anything right now. We can > add the implicit TLS support in a later version. Not having libpq > support makes it hard to test the server codepath, though. Maybe just > test it with 'stunnel' or 'openssl s_client'. I think we should have an option to explicitly enable it in psql, if only for testing. And then wait five years and switch the default on it then. In the meantime users can just set it based on their setup. That's not the way to the quickest adoption but
Re: Memory leak from ExecutorState context?
On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote: > Hello all, > > A customer is facing out of memory query which looks similar to this > situation: > > > https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b > > This PostgreSQL version is 11.18. Some settings: hash joins could exceed work_mem until v13: |Allow hash aggregation to use disk storage for large aggregation result |sets (Jeff Davis) | |Previously, hash aggregation was avoided if it was expected to use more |than work_mem memory. Now, a hash aggregation plan can be chosen despite |that. The hash table will be spilled to disk if it exceeds work_mem |times hash_mem_multiplier. | |This behavior is normally preferable to the old behavior, in which once |hash aggregation had been chosen, the hash table would be kept in memory |no matter how large it got — which could be very large if the planner |had misestimated. If necessary, behavior similar to that can be obtained |by increasing hash_mem_multiplier. > https://explain.depesz.com/s/sGOH This shows multiple plan nodes underestimating the row counts by factors of ~50,000, which could lead to the issue fixed in v13. I think you should try to improve the estimates, which might improve other queries in addition to this one, in addition to maybe avoiding the issue with joins. > The customer is aware he should rewrite this query to optimize it, but it's a > long time process he can not start immediately. To make it run in the > meantime, > he actually removed the top CTE to a dedicated table. Is the table analyzed ? > Is it usual a backend is requesting such large memory size (13 GB) and > actually use less of 60% of it (7.7GB of RSS)? It's possible it's "using less" simply because it's not available. Is the process swapping ? -- Justin
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On 28/02/2023 16:17, Maxim Orlov wrote: Either I do not understand something, or the files from pg_commit_ts directory are not copied. Huh, yeah you're right, pg_upgrade doesn't copy pg_commit_ts. - Heikki
Re: WIN32 pg_import_system_collations
On Tue, Feb 28, 2023 at 12:55 PM Andrew Dunstan wrote: > On 2023-02-27 Mo 17:20, Juan José Santamaría Flecha wrote: > > > Hmm, yeah. I'm not sure I understand the point of this test anyway: > > > SELECT to_char(date '2010-03-01', 'DD TMMON ' COLLATE "de_DE"); > Uhm, they probably don't make much sense except for "tr_TR", so I'm fine with removing them. PFA a patch for so. Regards, Juan José Santamaría Flecha 0002-remove-useless-test-from-collate.windows.win1252.patch Description: Binary data 0001-change-locale-name-for-test-collate.windows.win1252.patch Description: Binary data
Re: pg_upgrade and logical replication
On Fri, Feb 17, 2023 at 7:35 AM Julien Rouhaud wrote: > > Any table later added in the > publication is either already fully replicated until that LSN on the upgraded > node, so only the delta is needed, or has been created after that LSN. In the > latter case, the entirety of the table will be replicated with the logical > replication as a delta right?

What if we consider a slightly adjusted procedure?

0. Temporarily, forbid running any DDL on the source cluster.
1. On the source, create the publication and replication slot, and remember the LSN for it.
2. Restore the target cluster to that LSN using recovery_target_lsn (PITR).
3. Run pg_upgrade on the target cluster.
4. Only now, create the subscription on the target.
5. Wait until logical replication catches up.
6. Perform a switchover to the new cluster, taking care of lags in sequences, etc.
7. Resume DDL when needed.

Do you see any data loss happening in this approach?
RE: Time delayed LR (WAS Re: logical replication restrictions)
Dear Amit, > Few comments: Thank you for reviewing! PSA new version. Note that the starting point of delay for 2PC was not changed; I think it is still under discussion. > 1. > + /* > + * If we've requested to shut down, exit the process. > + * > + * Note that WalSndDone() cannot be used here because the delaying > + * changes will be sent in the function. > + */ > + if (got_STOPPING) > + { > + QueryCompletion qc; > + > + /* Inform the standby that XLOG streaming is done */ > + SetQueryCompletion(&qc, CMDTAG_COPY, 0); > + EndCommand(&qc, DestRemote, false); > + pq_flush(); > > Do we really need to do anything except for breaking the loop and let > the exit handling happen in the main loop when 'got_STOPPING' is set? > AFAICS, this is what we are doing in some other palces (See > WalSndWaitForWal). Won't that work? It seems that will help us sending > all the pending WAL. If we exit the loop after got_STOPPING is set, as you said, the walsender will send the delayed changes and then exit. The behavior is the same as when WalSndDone() is called. But I think it is not suitable for the motivation of the feature. If users notice a mistaken operation like TRUNCATE, they must shut down the publisher once and then recover from a backup or from the old subscriber. If the walsender sends all pending changes, the mistaken operations will also be propagated to the subscriber and the data cannot be protected. So I want to keep the current style. FYI - in the case of physical replication, received WAL is not applied when the secondary is shut down. > 2. > + /* Try to flush pending output to the client */ > + if (pq_flush_if_writable() != 0) > + WalSndShutdown(); > > Is there a reason to try flushing here? IIUC if pq_flush_if_writable() returns non-zero (EOF), it means that there is trouble and the walsender fails to send messages to the subscriber. On Linux, the stack trace from pq_flush_if_writable() finally reaches the send() system call, and according to the man page[1], that failure will be triggered by some unexpected state or by the connection being closed. Based on the above, I think the returned value should be checked. > Apart from the above, I have made a few changes in the comments in the > attached diff patch. If you agree with those then please include them > in the next version. Thanks! I checked, and I think all of them should be included. Moreover, I ran a grammar checker and slightly reworded the commit message. [1]: https://man7.org/linux/man-pages/man3/send.3p.html Best Regards, Hayato Kuroda FUJITSU LIMITED v9-0001-Time-delayed-logical-replication-on-publisher-sid.patch Description: v9-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Re: Maybe we can remove the type cast in typecache.c
qinghao huang writes: > When I was reading postgres code, I found there is a weird type cast. I'm > pondering if it is necessary. > ``` > /* Allocate a new typmod number. This will be wasted if we error out. */ > typmod = (int) > > pg_atomic_fetch_add_u32(&CurrentSession->shared_typmod_registry->next_typmod, > 1); > ``` > typmod has u32 type, but we cast it to int first. typmods really ought to be int32, not uint32, so IMO none of this is exactly right. But it's also true that it makes no real difference. Postgres pretty much assumes that "int" is 32 bits. regards, tom lane
Avoid multiple SetLatch() calls in procsignal_sigusr1_handler()
Hi, Most of the multiplexed SIGUSR1 handlers are setting the latch explicitly even though procsignal_sigusr1_handler() can do that for them at the end. These multiplexed handlers are currently being used as SIGUSR1 handlers, not as independent handlers, so there is no problem if SetLatch() is removed from them. A few others do it right by saying /* latch will be set by procsignal_sigusr1_handler */. Although calling SetLatch() in quick succession does no harm (it just returns if the latch was previously set), it seems unnecessary. I'm attaching a patch that avoids multiple SetLatch() calls. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v1-0001-Avoid-multiple-SetLatch-calls-in-procsignal_sigus.patch Description: Binary data
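[Editorial note: a before/after sketch of the pattern being removed; HandleFooSignal and fooSignalPending are made-up names, the real handlers touched are in the attached patch.]

    /* Before: the multiplexed handler sets the latch itself */
    static void
    HandleFooSignal(void)
    {
        fooSignalPending = true;
        SetLatch(MyLatch);      /* redundant */
    }

    /* After: rely on the single SetLatch() done at the end of
     * procsignal_sigusr1_handler(), as some handlers already do */
    static void
    HandleFooSignal(void)
    {
        fooSignalPending = true;
        /* latch will be set by procsignal_sigusr1_handler */
    }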
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada wrote: > > On Tue, Feb 28, 2023 at 3:42 PM John Naylor > wrote: > > > > > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada > > wrote: > > > > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor > > > wrote: > > > > > > > > That doesn't seem useful to me. If we've done enough testing to > > > > reassure us the new way always gives the same answer, the old way is > > > > not needed at commit time. If there is any doubt it will always give > > > > the same answer, then the whole patchset won't be committed. > > > > > My idea is to make the bug investigation easier but on > > > reflection, it seems not the best idea given this purpose. > > > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old > > tid array. As I've said, that doesn't seem like a good thing to carry > > forward forevermore, in any form. Plus, comparing new code with new code is > > not the same thing as comparing existing code with new code. That was my > > idea upthread. > > > > Maybe the effort my idea requires is too much vs. the likelihood of finding > > a problem. In any case, it's clear that if I want that level of paranoia, > > I'm going to have to do it myself. > > > > > What do you think > > > about the attached patch? Please note that it also includes the > > > changes for minimum memory requirement. > > > > Most of the asserts look logical, or at least harmless. > > > > - int max_off; /* the maximum offset number */ > > + OffsetNumber max_off; /* the maximum offset number */ > > > > I agree with using the specific type for offsets here, but I'm not sure why > > this change belongs in this patch. If we decided against the new asserts, > > this would be easy to lose. > > Right. I'll separate this change as a separate patch. > > > > > This change, however, defies common sense: > > > > +/* > > + * The minimum amount of memory required by TidStore is 2MB, the current > > minimum > > + * valid value for the maintenance_work_mem GUC. This is required to > > allocate the > > + * DSA initial segment, 1MB, and some meta data. This number is applied > > also to > > + * the local TidStore cases for simplicity. > > + */ > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */ > > > > + /* Sanity check for the max_bytes */ > > + if (max_bytes < TIDSTORE_MIN_MEMORY) > > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", > > + TIDSTORE_MIN_MEMORY, max_bytes); > > > > Aside from the fact that this elog's something that would never get past > > development, the #define just adds a hard-coded copy of something that is > > already hard-coded somewhere else, whose size depends on an implementation > > detail in a third place. > > > > This also assumes that all users of tid store are limited by > > maintenance_work_mem. Andres thought of an example of some day unifying > > with tidbitmap.c, and maybe other applications will be limited by work_mem. > > > > But now that I'm looking at the guc tables, I am reminded that work_mem's > > minimum is 64kB, so this highlights a design problem: There is obviously no > > requirement that the minimum work_mem has to be >= a single DSA segment, > > even though operations like parallel hash and parallel bitmap heap scan are > > limited by work_mem. > > Right. > > > It would be nice to find out what happens with these parallel features > > when work_mem is tiny (maybe parallelism is not even considered?). > > IIUC both don't care about the allocated DSA segment size. 
Parallel > hash accounts actual tuple (+ header) size as used memory but doesn't > consider how much DSA segment is allocated behind. Both parallel hash > and parallel bitmap scan can work even with work_mem = 64kB, but when > checking the total DSA segment size allocated during these operations, > it was 1MB. > > I realized that there is a similar memory limit design issue also on > the non-shared tidstore cases. We deduct 70kB from max_bytes but it > won't work fine with work_mem = 64kB. Probably we need to reconsider > it. FYI 70kB comes from the maximum slab block size for node256. Currently, we calculate the slab block size to be enough to allocate 32 chunks from there. For node256, the leaf node is 2,088 bytes and the slab block size is 66,816 bytes. One idea to fix this issue is to decrease it. For example, with 16 chunks the slab block size is 33,408 bytes and with 8 chunks it's 16,704 bytes. I ran a brief benchmark test with 70kB block size and 16kB block size:

* 70kB slab blocks:

select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793

 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     143085184 |    1216 |       750
(1 row)

* 16kB slab blocks:

select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793

 mem_allocated | load_ms | search_ms
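[Editorial note: a quick sanity check on the block-size arithmetic above; the 2,088-byte node256 leaf size is simply taken from the mail, not rederived from the radix tree code.]

    #include <stdio.h>

    int
    main(void)
    {
        const unsigned long leaf_chunk = 2088;  /* node256 leaf size quoted above */

        printf("32 chunks: %lu bytes\n", leaf_chunk * 32);  /* 66,816 */
        printf("16 chunks: %lu bytes\n", leaf_chunk * 16);  /* 33,408 */
        printf(" 8 chunks: %lu bytes\n", leaf_chunk * 8);   /* 16,704 */
        return 0;
    }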
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Hello Tested patch on all cases: CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITION. As expected it works. Also added a description to copy.sgml and made a review on patch. I added 'ignored_errors' integer parameter that should be output after the option is finished. All errors were added to the system logfile with full detailed context. Maybe it's better to log only error message. file:///home/abc13/Documents/todo_copy/postgres/v2-0001-Add-COPY-option-IGNORE_DATATYPE_ERRORS.patch Regards, Damir Belyalov Postgres Professional diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index c25b52d0cb..706b929947 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -34,6 +34,7 @@ COPY { table_name [ ( format_name FREEZE [ boolean ] +IGNORE_DATATYPE_ERRORS [ boolean ] DELIMITER 'delimiter_character' NULL 'null_string' HEADER [ boolean | MATCH ] @@ -233,6 +234,17 @@ COPY { table_name [ ( + +IGNORE_DATATYPE_ERRORS + + + Drops rows that contain malformed data while copying. These are rows + with columns where the data type's input-function raises an error. + Outputs warnings about rows with incorrect data to system logfile. + + + + DELIMITER diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index e34f583ea7..0334894014 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -410,6 +410,7 @@ ProcessCopyOptions(ParseState *pstate, bool format_specified = false; bool freeze_specified = false; bool header_specified = false; + bool ignore_datatype_errors_specified= false; ListCell *option; /* Support external use for option sanity checking */ @@ -449,6 +450,13 @@ ProcessCopyOptions(ParseState *pstate, freeze_specified = true; opts_out->freeze = defGetBoolean(defel); } + else if (strcmp(defel->defname, "ignore_datatype_errors") == 0) + { + if (ignore_datatype_errors_specified) +errorConflictingDefElem(defel, pstate); + ignore_datatype_errors_specified = true; + opts_out->ignore_datatype_errors = defGetBoolean(defel); + } else if (strcmp(defel->defname, "delimiter") == 0) { if (opts_out->delim) diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index af52faca6d..ecaa750568 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -955,10 +955,14 @@ CopyFrom(CopyFromState cstate) errcallback.previous = error_context_stack; error_context_stack = + if (cstate->opts.ignore_datatype_errors) + cstate->ignored_errors = 0; + for (;;) { TupleTableSlot *myslot; bool skip_tuple; + ErrorSaveContext escontext = {T_ErrorSaveContext}; CHECK_FOR_INTERRUPTS(); @@ -991,9 +995,26 @@ CopyFrom(CopyFromState cstate) ExecClearTuple(myslot); + if (cstate->opts.ignore_datatype_errors) + { + escontext.details_wanted = true; + cstate->escontext = escontext; + } + /* Directly store the values/nulls array in the slot */ if (!NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull)) + { + if (cstate->opts.ignore_datatype_errors && cstate->ignored_errors > 0) +ereport(WARNING, errmsg("Errors: %ld", cstate->ignored_errors)); break; + } + + /* Soft error occured, skip this tuple */ + if (cstate->escontext.error_occurred) + { + ExecClearTuple(myslot); + continue; + } ExecStoreVirtualTuple(myslot); diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 91b564c2bc..9c36b0dc8b 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -70,6 +70,7 @@ #include "libpq/pqformat.h" #include "mb/pg_wchar.h" #include "miscadmin.h" 
+#include "nodes/miscnodes.h" #include "pgstat.h" #include "port/pg_bswap.h" #include "utils/builtins.h" @@ -938,10 +939,23 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, cstate->cur_attname = NameStr(att->attname); cstate->cur_attval = string; - values[m] = InputFunctionCall(_functions[m], - string, - typioparams[m], - att->atttypmod); + + /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */ + if (!InputFunctionCallSafe(_functions[m], + string, + typioparams[m], + att->atttypmod, + (Node *) >escontext, + [m])) + { +cstate->ignored_errors++; + +ereport(LOG, + errmsg("%s", cstate->escontext.error_data->message)); + +return true; + } + if (string != NULL) nulls[m] = false; cstate->cur_attname = NULL; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index a0138382a1..d79d293c0d 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -701,7 +701,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); HANDLER HAVING HEADER_P HOLD HOUR_P - IDENTITY_P IF_P ILIKE
Re: Allow logical replication to copy tables in binary format
Hi, Attached v11. vignesh C wrote on Tue, 28 Feb 2023 at 13:59: > Thanks for the patch, Few comments: 1) Are primary key required for the tables, if not required we could > remove the primary key which will speed up the test by not creating > the indexes and inserting in the indexes. Even if required just create > for one of the tables: > I think that having a primary key in tables for logical replication tests is good for checking if log. rep. duplicates any row. Other tests also have primary keys in almost all tables. Bharath Rupireddy wrote on Tue, 28 Feb 2023 at 15:27: > 1. > + > + If true, initial data synchronization will be performed in binary > format > + > It's not just the initial table sync right? The table sync can happen > at any other point of time when ALTER SUBSCRIPTION ... REFRESH > PUBLICATION WITH (copy = true) is run. > How about - "If true, the subscriber requests publication for > pre-existing data in binary format"? > I changed it as you suggested. I sometimes feel like the phrase "initial sync" is used for initial sync of a table, not a subscription. Or table syncs triggered by ALTER SUBSCRIPTION are ignored in some places where "initial sync" is used. 2. > Perhaps, this should cover the recommended cases for enabling this new > option - something like below (may not need to have exact wording, but > the recommended cases?): > "It is recommended to enable this option only when 1) the column data > types have appropriate binary send/receive functions, 2) not > replicating between different major versions or different platforms, > 3) both publisher and subscriber tables have the exact same column > types (not when replicating from smallint to int or numeric to int8 > and so on), 4) both publisher and subscriber supports COPY with binary > option, otherwise the table copy can fail." > I added a line stating that binary format is less portable across machine architectures and versions, as stated in COPY [1]. I don't think we should add a line saying "recommended", but state the restrictions clearly instead. It's similar in the COPY docs as well. I think the explanation now covers all your points, right? Jelte Fennema wrote on Tue, 28 Feb 2023 at 16:25: > > 3. I think the newly added tests must verify if the binary COPY is > > picked up when enabled. Perhaps, looking at the publisher's server log > > for 'COPY ... WITH BINARY format'? Maybe it's an overkill, otherwise, > > we have no way of testing that the option took effect. > > Another way to test that BINARY is enabled could be to trigger one > of the failure cases. Yes, there is already a failure case for binary copy which resolves by switching binary_copy to false. But I also added checks for the publisher logs now too. [1] https://www.postgresql.org/docs/devel/sql-copy.html Thanks, -- Melih Mutlu Microsoft v11-0001-Allow-logical-replication-to-copy-table-in-binary.patch Description: Binary data
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On Mon, 27 Feb 2023 at 12:10, Heikki Linnakangas wrote: > (CC list trimmed, gmail wouldn't let me send otherwise) > > That sounds right for pg_serial, pg_notify, and pg_subtrans. But not for > pg_commit_ts and the pg_multixacts. > > This needs tests for pg_upgrading those SLRUs, after 0, 1 and N > wraparounds. > > Yep, that's my fault. I've forgotten about pg_multixacts. But for the pg_commit_ts it's a bit complicated story. The thing is, if we do upgrade, old files from pg_commit_ts not copied into a new server. For example, I've checked one more time on the current master branch: 1). initdb 2). add "track_commit_timestamp = on" into postgresql.conf 3). pgbench 4). $ ls pg_commit_ts/ 0005 000A 000F 0014 0019 001E 0023... ...009A 009F 00A4 00A9 00AE 00B3 00B8 00BD 00C2 5). do pg_upgrade 6). $ ls pg_commit_ts/ 00C2 Either I do not understand something, or the files from pg_commit_ts directory are not copied. -- Best regards, Maxim Orlov.
Re: Adding CommandID to heap xlog records
I took another stab at this from a different angle, and tried to use this to simplify logical decoding. The theory was that if we included the command ID in the WAL records, we wouldn't need the separate HEAP2_NEW_CID record anymore, and could remove much of the code in reorderbuffer.c that's concerned with tracking ctid->(cmin,cmax) mapping. Unfortunately, it didn't work out. Here's one problem: Insert with cmin 1 Commit Delete the same tuple with cmax 2. Abort Even if we store the cmin in the INSERT record, and set it on the tuple on replay, the DELETE overwrites it. That's OK for the original transactions, because they only look at the cmin/cmax of their own transaction, but it's a problem for logical decoding. If we see the inserted tuple during logical decoding, we need the cmin of the tuple. We could still just replace the HEAP2_NEW_CID records with the CIDs in the heap INSERT/UPDATE/DELETE records, and use that information to maintain the ctid->(cmin,cmax) mapping in reorderbuffer.c like we do today. But that doesn't really simplify reorderbuffer.c much. Attached is a patch for that, for the archives sake. Another problem with that is that logical decoding needs slightly different information than what we store on the tuples on disk. My original motivation for this was for Neon, which needs the WAL replay to restore the same CID as what's stored on disk, whether it's cmin, cmax or combocid. But for logical decoding, we need the cmin or cmax, *not* the combocid. To cater for both uses, we'd need to include both the original cmin/cmax and the possible combocid, which again makes it more complicated. So unfortunately I don't see much opportunity to simplify logical decoding with this. However, please take a look at the first two patches attached. They're tiny cleanups that make sense on their own. - Heikki From d666309a4cffc48e5091c0a97b8e2e7ec668f058 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 28 Feb 2023 13:22:40 +0200 Subject: [PATCH 1/4] Improve comment on why we need ctid->(cmin,cmax) mapping --- src/backend/replication/logical/snapbuild.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c index 62542827e4b..93755646564 100644 --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -41,10 +41,15 @@ * transactions we need Snapshots that see intermediate versions of the * catalog in a transaction. During normal operation this is achieved by using * CommandIds/cmin/cmax. The problem with that however is that for space - * efficiency reasons only one value of that is stored - * (cf. combocid.c). Since combo CIDs are only available in memory we log - * additional information which allows us to get the original (cmin, cmax) - * pair during visibility checks. Check the reorderbuffer.c's comment above + * efficiency reasons, the cmin and cmax are not included in WAL records. We + * cannot read the cmin/cmax from the tuple itself, either, because it is + * reset on crash recovery. Even if we could, we could not decode combocids + * which are only tracked in the original backend's memory. To work around + * that, heapam writes an extra WAL record (XLOG_HEAP2_NEW_CID) every time a + * catalog row is modified, which includes the cmin and cmax of the + * tuple. 
During decoding, we insert the ctid->(cmin,cmax) mappings into the + * reorder buffer, and use them at visibility checks instead of the cmin/cmax + * on the tuple itself. Check the reorderbuffer.c's comment above * ResolveCminCmaxDuringDecoding() for details. * * To facilitate all this we need our own visibility routine, as the normal -- 2.30.2 From 44435eb4121425f4babd9ed8847e6ddc137ba436 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 28 Feb 2023 12:55:49 +0200 Subject: [PATCH 2/4] Remove redundant check for fast_forward --- src/backend/replication/logical/decode.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index 8fe7bb65f1f..d8962345da4 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -394,8 +394,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) switch (info) { case XLOG_HEAP2_MULTI_INSERT: - if (!ctx->fast_forward && -SnapBuildProcessChange(builder, xid, buf->origptr)) + if (SnapBuildProcessChange(builder, xid, buf->origptr)) DecodeMultiInsert(ctx, buf); break; case XLOG_HEAP2_NEW_CID: -- 2.30.2 From bc5103128fb3b6583a0b2b248fd205b59a347fe5 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 28 Feb 2023 13:02:56 +0200 Subject: [PATCH 3/4] Remove combocid field in logical decode that was just for debugging. (I don't think this is worth doing without the next patch) ---
Re: Non-superuser subscription owners
On Mon, Feb 27, 2023 at 7:37 PM Jeff Davis wrote: > > Yeah. That's the idea I was floating, at least. > > Isn't that a hard problem; maybe impossible? It doesn't seem that hard to me; maybe I'm missing something. The existing SECURITY_RESTRICTED_OPERATION flag basically prevents you from tinkering with the session state. If we also had similar flags like DATABASE_READS_PROHIBITED and DATABASE_WRITES_PROHIBITED (or just a combined DATABASE_ACCESS_PROHIBITED flag) I think that would be pretty close to what we need. The idea would be that, when a user executes a function or procedure owned by a user that they don't trust completely, we'd set SECURITY_RESTRICTED_OPERATION|DATABASE_READS_PROHIBITED|DATABASE_WRITES_PROHIBITED. And we could provide a user with a way to express the degree of trust they have in some other user or perhaps even some specific function, e.g. SET trusted_roles='alice:read'; ...could mean that I trust alice to read from the database with my permissions, should I happen to run code provided by her in SECURITY INVOKER mode. I'm sure there's some details to sort out here, e.g. around security related to the trusted_roles GUC itself. But I don't really see a fundamental problem. We can invent arbitrary flags that prohibit classes of operations that are of concern, set them by default in cases where concern is justified, and then give users who want the current behavior some kind of escape hatch that causes those flags to not get set after all. Not only does such a solution not seem impossible, I can possibly even imagine back-patching it, depending on exactly what the shape of the final solution is, how important we think it is to get a fix out there, and how brave I'm feeling that day. -- Robert Haas EDB: http://www.enterprisedb.com
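[Editorial note: a rough sketch of what the proposed flag plumbing might look like; apart from SECURITY_RESTRICTED_OPERATION and SetUserIdAndSecContext(), every name and bit value below is hypothetical - nothing like this exists in the tree today.]

    /* hypothetical additions alongside the existing security-context bits */
    #define DATABASE_READS_PROHIBITED      0x0100   /* placeholder value */
    #define DATABASE_WRITES_PROHIBITED     0x0200   /* placeholder value */

    /* before running code owned by a user the invoker doesn't fully trust */
    SetUserIdAndSecContext(save_userid,
                           save_sec_context |
                           SECURITY_RESTRICTED_OPERATION |
                           DATABASE_READS_PROHIBITED |
                           DATABASE_WRITES_PROHIBITED);

    /* and, e.g., in a database-write path, mirroring the existing
     * InSecurityRestrictedOperation() accessor (hypothetical function) */
    if (InDatabaseWritesProhibitedOperation())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("database writes are not allowed in this context")));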
Re: Allow logical replication to copy tables in binary format
> 3. I think the newly added tests must verify if the binary COPY is > picked up when enabled. Perhaps, looking at the publisher's server log > for 'COPY ... WITH BINARY format'? Maybe it's an overkill, otherwise, > we have no way of testing that the option took effect. Another way to test that BINARY is enabled could be to trigger one of the failure cases.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Feb 28, 2023 at 3:42 PM John Naylor wrote: > > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada > wrote: > > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor > > wrote: > > > > > > That doesn't seem useful to me. If we've done enough testing to reassure > > > us the new way always gives the same answer, the old way is not needed at > > > commit time. If there is any doubt it will always give the same answer, > > > then the whole patchset won't be committed. > > > My idea is to make the bug investigation easier but on > > reflection, it seems not the best idea given this purpose. > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old > tid array. As I've said, that doesn't seem like a good thing to carry forward > forevermore, in any form. Plus, comparing new code with new code is not the > same thing as comparing existing code with new code. That was my idea > upthread. > > Maybe the effort my idea requires is too much vs. the likelihood of finding a > problem. In any case, it's clear that if I want that level of paranoia, I'm > going to have to do it myself. > > > What do you think > > about the attached patch? Please note that it also includes the > > changes for minimum memory requirement. > > Most of the asserts look logical, or at least harmless. > > - int max_off; /* the maximum offset number */ > + OffsetNumber max_off; /* the maximum offset number */ > > I agree with using the specific type for offsets here, but I'm not sure why > this change belongs in this patch. If we decided against the new asserts, > this would be easy to lose. Right. I'll separate this change as a separate patch. > > This change, however, defies common sense: > > +/* > + * The minimum amount of memory required by TidStore is 2MB, the current > minimum > + * valid value for the maintenance_work_mem GUC. This is required to > allocate the > + * DSA initial segment, 1MB, and some meta data. This number is applied also > to > + * the local TidStore cases for simplicity. > + */ > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */ > > + /* Sanity check for the max_bytes */ > + if (max_bytes < TIDSTORE_MIN_MEMORY) > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", > + TIDSTORE_MIN_MEMORY, max_bytes); > > Aside from the fact that this elog's something that would never get past > development, the #define just adds a hard-coded copy of something that is > already hard-coded somewhere else, whose size depends on an implementation > detail in a third place. > > This also assumes that all users of tid store are limited by > maintenance_work_mem. Andres thought of an example of some day unifying with > tidbitmap.c, and maybe other applications will be limited by work_mem. > > But now that I'm looking at the guc tables, I am reminded that work_mem's > minimum is 64kB, so this highlights a design problem: There is obviously no > requirement that the minimum work_mem has to be >= a single DSA segment, even > though operations like parallel hash and parallel bitmap heap scan are > limited by work_mem. Right. > It would be nice to find out what happens with these parallel features when > work_mem is tiny (maybe parallelism is not even considered?). IIUC both don't care about the allocated DSA segment size. Parallel hash accounts actual tuple (+ header) size as used memory but doesn't consider how much DSA segment is allocated behind. 
Both parallel hash and parallel bitmap scan can work even with work_mem = 64kB, but when checking the total DSA segment size allocated during these operations, it was 1MB. I realized that there is a similar memory limit design issue also on the non-shared tidstore cases. We deduct 70kB from max_bytes but it won't work fine with work_mem = 64kB. Probably we need to reconsider it. FYI 70kB comes from the maximum slab block size for node256. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
On Mon, Feb 27, 2023 at 2:33 AM Heikki Linnakangas wrote: > > On 16/11/2022 07:17, kuroda.keis...@nttcom.co.jp wrote: > >> The issue here is pg_rewind looks into control file to determine the > >> soruce timeline, because the control file is not updated until the > >> first checkpoint ends after promotion finishes, even though file > >> blocks are already diverged. > >> > >> Even in that case history file for the new timeline is already > >> created, so searching for the latest history file works. > > > > I think this change is a good one because if I want > > pg_rewind to run automatically after a promotion, > > I don't have to wait for the checkpoint to complete. > > > > The attached patch is Horiguchi-san's patch with > > additional tests. The tests are based on James's tests, > > "010_no_checkpoint_after_promotion.pl" tests that > > pg_rewind is successfully executed without running > > checkpoint after promote. > > I fixed this last week in commit 009746, see thread [1]. I'm sorry I > didn't notice this thread earlier. > > I didn't realize that we had a notice about this in the docs. I'll go > and remove that. Thanks! > > - Heikki > Thanks; I think the missing [1] (for reference) is: https://www.postgresql.org/message-id/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi James
Re: Add shared buffer hits to pg_stat_io
Hi, On 2/25/23 9:16 PM, Melanie Plageman wrote: Hi, As suggested in [1], the attached patch adds shared buffer hits to pg_stat_io. Thanks for the patch! BufferDesc * LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, -bool *foundPtr, IOContext *io_context) +bool *foundPtr, IOContext io_context) { BufferTag newTag; /* identity of requested block */ LocalBufferLookupEnt *hresult; @@ -128,14 +128,6 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, hresult = (LocalBufferLookupEnt *) hash_search(LocalBufHash, , HASH_FIND, NULL); - /* -* IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set -* io_context here (instead of after a buffer hit would have returned) for -* convenience since we don't have to worry about the overhead of calling -* IOContextForStrategy(). -*/ - *io_context = IOCONTEXT_NORMAL; It looks like that io_context is not used in LocalBufferAlloc() anymore and then can be removed as an argument. I am looking for input as to the order of this column in the view. I think it should go after op_bytes since it is not relevant for non-block-oriented IO. Agree. However, I'm not sure what the order of hits, evictions, and reuses should be (all have to do with buffers). I'm not sure there is a strong "correct" ordering but the proposed one looks natural to me. While adding this, I noticed that I had made all of the IOOP columns int8 in the view, and I was wondering if this is sufficient for hits (I imagine you could end up with quite a lot of those). I think that's ok and bigint is what is already used for pg_statio_user_tables.heap_blks_hit for example. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
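For readers following along, the kind of query the new column enables would look roughly like this. Column names follow the patch under discussion (hits is the proposed addition), so treat them as tentative:

```
-- Tentative: 'hits' is the column this patch proposes to add to pg_stat_io.
SELECT backend_type, io_object, io_context,
       hits, reads,
       round(hits::numeric / nullif(hits + reads, 0), 2) AS hit_ratio
FROM pg_stat_io
ORDER BY hits DESC;
```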
Re: Allow logical replication to copy tables in binary format
On Tue, Feb 28, 2023 at 1:22 AM Melih Mutlu wrote: > > Hi, > > Thanks for all of your reviews! > > So, I made some changes in the v10 according to your comments. Thanks. Some quick comments on v10: 1. + + If true, initial data synchronization will be performed in binary format + It's not just the initial table sync right? The table sync can happen at any other point of time when ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy = true) is run. How about - "If true, the subscriber requests publication for pre-existing data in binary format"? 2. + Specifies whether pre-existing data on the publisher will be copied + to the subscriber in binary format. The default is false. + Binary format is very data type specific, it will not allow copying + between different column types as opposed to text format. Note that + if this option is enabled, all data types which will be copied during + the initial synchronization should have binary send and receive functions. + If this option is disabled, data format for the initial synchronization + will be text. Perhaps, this should cover the recommended cases for enabling this new option - something like below (may not need to have exact wording, but the recommended cases?): "It is recommended to enable this option only when 1) the column data types have appropriate binary send/receive functions, 2) not replicating between different major versions or different platforms, 3) both publisher and subscriber tables have the exact same column types (not when replicating from smallint to int or numeric to int8 and so on), 4) both publisher and subscriber supports COPY with binary option, otherwise the table copy can fail." 3. I think the newly added tests must verify if the binary COPY is picked up when enabled. Perhaps, looking at the publisher's server log for 'COPY ... WITH BINARY format'? Maybe it's an overkill, otherwise, we have no way of testing that the option took effect. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
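For reference, the user-facing shape being reviewed would be roughly the following. The option name copy_binary is what the v10 patch proposes and may still change; the subscription and publication names are made up for the example:

```
-- Assumes the v10 patch's proposed option name; subject to review.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (binary = true, copy_binary = true);

-- A later resync of newly added tables would go through the same code path:
ALTER SUBSCRIPTION mysub REFRESH PUBLICATION WITH (copy_data = true);
```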
Re: WIN32 pg_import_system_collations
On 2023-02-27 Mo 17:20, Juan José Santamaría Flecha wrote: El lun, 27 feb 2023, 23:05, Juan José Santamaría Flecha escribió: The command that's failing is "SET lc_time TO 'de_DE';", and that area of code is untouched by this patch. As mentioned in [1], the problem seems to come from a Windows bug that the CI images and my development machines have patched out. What I wanted to post as [1]: https://www.postgresql.org/message-id/CAC%2BAXB1agvrgpyHEfqbDr2MOpcON3d%2BWYte_SLzn1E4TamLs9g%40mail.gmail.com Hmm, yeah. I'm not sure I understand the point of this test anyway: SELECT to_char(date '2010-03-01', 'DD TMMON ' COLLATE "de_DE"); cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: Allow logical replication to copy tables in binary format
On Tue, 28 Feb 2023 at 01:22, Melih Mutlu wrote: > > Hi, > > Thanks for all of your reviews! > > So, I made some changes in the v10 according to your comments. > > 1- copy_format is changed to a boolean parameter copy_binary. > It sounds easier to use a boolean to enable/disable binary copy. Default > value is false, so nothing changes in the current behaviour if copy_binary is > not specified. > It's still persisted into the catalog. This is needed since its value will be > needed by tablesync workers later. And tablesync workers fetch subscription > configurations from the catalog. > In the copy_data case, it is not directly stored anywhere but it affects the > state of the table which is stored in the catalog. So, I guess this is the > convenient way if we decide to go with a new parameter. > > 2- copy_binary is not tied to another parameter > The patch does not disallow any combination of copy_binary with binary and > copy_data options. I tried to explain what binary copy can and cannot do in > the docs. Rest is up to the user now. > > 3- Removed version check for copy_binary > HEAD already does not have any check for binary option. Making binary copy > work only if publisher and subscriber are the same major version can be too > restricted. > > 4- Added separate test file > Although I believe 002_types.pl and 014_binary.pl can be improved more for > binary enabled subscription cases, this patch might not be the best place to > make those changes. > 032_binary_copy.pl now has the tests for binary copy option. There are also > some regression tests in subscription.sql. > > Finally, some other small fixes are done according to the reviews. > > Also, thanks Bharath for performance testing [1]. It can be seen that the > patch has some benefits. > > I appreciate further thoughts/reviews. Thanks for the patch, Few comments: 1) Are primary key required for the tables, if not required we could remove the primary key which will speed up the test by not creating the indexes and inserting in the indexes. Even if required just create for one of the tables: +# Create tables on both sides of the replication +my $ddl = qq( + CREATE TABLE public.test_numerical ( + a INTEGER PRIMARY KEY, + b NUMERIC, + c FLOAT, + d BIGINT + ); + CREATE TABLE public.test_arrays ( + a INTEGER[] PRIMARY KEY, + b NUMERIC[], + c TEXT[] + ); +CREATE TABLE public.test_range_array ( + a INTEGER PRIMARY KEY, + b TSTZRANGE, + c int8range[] + ); +CREATE TYPE public.test_comp_basic_t AS (a FLOAT, b TEXT, c INTEGER); +CREATE TABLE public.test_one_comp ( + a INTEGER PRIMARY KEY, + b public.test_comp_basic_t + );); + 2) We can have the Insert/Select of tables in the same order so that it is easy to verify. test_range_array/test_one_comp insertion/selection order was different. 
+# Insert some content and before creating a subscription +$node_publisher->safe_psql( + 'postgres', qq( +INSERT INTO public.test_numerical (a, b, c, d) VALUES + (1, 1.2, 1.3, 10), +(2, 2.2, 2.3, 20); + INSERT INTO public.test_arrays (a, b, c) VALUES + ('{1,2,3}', '{1.1, 1.2, 1.3}', '{"one", "two", "three"}'), +('{3,1,2}', '{1.3, 1.1, 1.2}', '{"three", "one", "two"}'); +INSERT INTO test_range_array (a, b, c) VALUES + (1, tstzrange('Mon Aug 04 00:00:00 2014 CEST'::timestamptz, 'infinity'), '{"[1,2]", "[10,20]"}'), + (2, tstzrange('Sat Aug 02 00:00:00 2014 CEST'::timestamptz, 'Mon Aug 04 00:00:00 2014 CEST'::timestamptz), '{"[2,3]", "[20,30]"}'); + INSERT INTO test_one_comp (a, b) VALUES + (1, ROW(1.0, 'a', 1)), + (2, ROW(2.0, 'b', 2)); + )); + +# Create the subscription with copy_binary = true +my $publisher_connstring = $node_publisher->connstr . ' dbname=postgres'; +$node_subscriber->safe_psql('postgres', + "CREATE SUBSCRIPTION tsub CONNECTION '$publisher_connstring' " + . "PUBLICATION tpub WITH (slot_name = tpub_slot, copy_binary = true)"); + +# Ensure nodes are in sync with each other +$node_subscriber->wait_for_subscription_sync($node_publisher, 'tsub'); + +my $sync_check = qq( + SET timezone = '+2'; + SELECT a, b, c, d FROM test_numerical ORDER BY a; + SELECT a, b, c FROM test_arrays ORDER BY a; + SELECT a, b FROM test_one_comp ORDER BY a; + SELECT a, b, c FROM test_range_array ORDER BY a; +); 3) Should we check the results for test_myvarchar table only, since there is no change in other tables, we need not check other tables again: +# Now tablesync should succeed +$node_subscriber->wait_for_subscription_sync($node_publisher, 'tsub'); + +$sync_check = qq( + SET timezone = '+2'; + SELECT a, b, c, d FROM test_numerical ORDER BY a; + SELECT a, b, c
Re: Make some xlogreader messages more accurate
+1 for the changes. >1. Why is "wanted >=%u" any better than "wanted at least %u"? IMO, the >wording as opposed to >= symbol in the user-facing messages works >better. I think I agree with Bharath on this: "wanted at least %u" sounds better for user error than "wanted >=%u". Regards, Jeevan Ladhe On Tue, 28 Feb 2023 at 11:46, Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > On Thu, Feb 23, 2023 at 1:06 PM Peter Eisentraut > wrote: > > > > Here is a small patch to make some invalid-record error messages in > > xlogreader a bit more accurate (IMO). > > +1 for these changes. > > > My starting point was that when you have some invalid WAL, you often get > > a message like "wanted 24, got 0". This is a bit incorrect, since it > > really wanted *at least* 24, not exactly 24. So I have updated the > > messages to that effect, and > > Yes, it's not exactly "wanted", but "wanted at least" because > xl_tot_len is the total length of the entire record including header > and payload. > > > also added that detail to one message where > > it was available but not printed. > > Looks okay. > > > Going through the remaining report_invalid_record() calls I then > > adjusted the use of "invalid" vs. "incorrect" in one case. The message > > "record with invalid length" makes it sound like the length was > > something like -5, but really we know what the length should be and what > > we got wasn't it, so "incorrect" sounded better and is also used in > > other error messages in that file. > > I have no strong opinion about this change. We seem to be using > "invalid length" and "incorrect length" interchangeably [1] without > distinguishing between "invalid" if length is < 0 and "incorrect" if > length >= 0 and not something we're expecting. > > Another comment on the patch: > 1. Why is "wanted >=%u" any better than "wanted at least %u"? IMO, the > wording as opposed to >= symbol in the user-facing messages works > better. > +report_invalid_record(state, "invalid record offset at %X/%X: > wanted >=%u, got %u", > + "invalid record length at %X/%X: > wanted >=%u, got %u", > + "invalid record length at %X/%X: wanted > >=%u, got %u", > > [1] > elog(ERROR, "incorrect length %d in streaming transaction's changes > file \"%s\"", > "record with invalid length at %X/%X", > (errmsg("invalid length of checkpoint record"))); > errmsg("invalid length of startup packet"))); > errmsg("invalid length of startup packet"))); > elog(ERROR, "invalid zero-length dimension array in MCVList"); > elog(ERROR, "invalid length (%d) dimension array in MCVList", > errmsg("invalid length in external \"%s\" value", > errmsg("invalid length in external bit string"))); > libpq_append_conn_error(conn, "certificate contains IP address with > invalid length %zu > > -- > Bharath Rupireddy > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com > > >
Re: [HACKERS] make async slave to wait for lsn to be replayed
Intro== The main purpose of the feature is to achieve read-your-writes-consistency, while using async replica for reads and primary for writes. In that case lsn of last modification is stored inside application. We cannot store this lsn inside database, since reads are distributed across all replicas and primary. https://www.postgresql.org/message-id/195e2d07ead315b1620f1a053313f490%40postgrespro.ru Suggestions == Lots of proposals were made how this feature may look like. I aggregate them into the following four types. 1) Classic (wait_classic_v1.patch) https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru == advantages: multiple events, standalone WAIT disadvantages: new words in grammar WAIT FOR [ANY | ALL] event [, ...] BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ WAIT FOR [ANY | ALL] event [, ...]] where event is one of: LSN value TIMEOUT number_of_milliseconds timestamp 2) After style: Kyotaro and Freund (wait_after_within_v1.patch) https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru == advantages: no new words in grammar, standalone AFTER disadvantages: a little harder to understand AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...] BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ AFTER lsn_event [ WITHIN delay_milliseconds ]] START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ AFTER lsn_event [ WITHIN delay_milliseconds ]] 3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch) https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com == advantages: no new words in grammar,like it made in pg_last_wal_replay_lsn, no snapshots need disadvantages: a little harder to remember names SELECT pg_waitlsn(‘LSN’, timeout); SELECT pg_waitlsn_infinite(‘LSN’); SELECT pg_waitlsn_no_wait(‘LSN’); 4) Brackets style: Kondratov https://www.postgresql.org/message-id/a8bff0350a27e0a87a6eaf0905d6737f%40postgrespro.ru == advantages: only one new word in grammar,like it made in VACUUM and REINDEX, ability to extend parameters without grammar fixes disadvantages: WAIT (LSN '16/B374D848', TIMEOUT 100); BEGIN WAIT (LSN '16/B374D848' [, etc_options]); ... COMMIT; Consequence == Below I provide the implementation of patches for the first three types. I propose to discuss this feature again/ Regards -- Ivan Kartyshov Postgres Professional: http://www.postgrespro.com The Russian Postgres Companydiff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml index 54b5f22d6e..18695e013e 100644 --- a/doc/src/sgml/ref/allfiles.sgml +++ b/doc/src/sgml/ref/allfiles.sgml @@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory. + diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml index 016b021487..a2794763b1 100644 --- a/doc/src/sgml/ref/begin.sgml +++ b/doc/src/sgml/ref/begin.sgml @@ -21,13 +21,16 @@ PostgreSQL documentation -BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] +BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] 
] wait_event where transaction_mode is one of: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } READ WRITE | READ ONLY [ NOT ] DEFERRABLE + +where wait_event is: +AFTER lsn_value [ WITHIN number_of_milliseconds ] diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml index 74ccd7e345..46a3bcf1a8 100644 --- a/doc/src/sgml/ref/start_transaction.sgml +++ b/doc/src/sgml/ref/start_transaction.sgml @@ -21,13 +21,16 @@ PostgreSQL documentation -START TRANSACTION [ transaction_mode [, ...] ] +START TRANSACTION [ transaction_mode [, ...] ] wait_event where transaction_mode is one of: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } READ WRITE | READ ONLY [ NOT ] DEFERRABLE + +where wait_event is: +AFTER lsn_value [ WITHIN number_of_milliseconds ] diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml index e11b4b6130..a83ff4551e 100644 --- a/doc/src/sgml/reference.sgml +++ b/doc/src/sgml/reference.sgml @@ -216,6 +216,7 @@ + diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index dbe9394762..e4d2d8d0a1 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -43,6 +43,7 @@ #include "backup/basebackup.h" #include "catalog/pg_control.h" #include "commands/tablespace.h" +#include "commands/wait.h" #include "common/file_utils.h" #include "miscadmin.h" #include "pgstat.h" @@ -1752,6 +1753,15 @@ PerformWalRecovery(void) break; } + /* + * If we replayed an LSN that someone was waiting for, + *
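To ground the read-your-writes motivation in a concrete flow, here is a rough usage sketch against the "classic" grammar above. The exact spelling differs between the four proposals, none of which is committed, and the accounts table is made up for the example:

```
-- On the primary: commit the write, then capture the current WAL LSN.
INSERT INTO accounts (id, balance) VALUES (42, 100);
COMMIT;
SELECT pg_current_wal_lsn();   -- e.g. '16/B374D848', remembered by the application

-- On an async replica: wait for that LSN to be replayed before reading.
WAIT FOR ANY LSN '16/B374D848', TIMEOUT 1000;
SELECT balance FROM accounts WHERE id = 42;
```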
Maybe we can remove the type cast in typecache.c
Hi hackers, While reading the postgres code I found what looks like a weird type cast, and I'm wondering whether it is necessary. ``` /* Allocate a new typmod number. This will be wasted if we error out. */ typmod = (int) pg_atomic_fetch_add_u32(&CurrentSession->shared_typmod_registry->next_typmod, 1); ``` pg_atomic_fetch_add_u32() returns uint32, but we cast the result to int first. I'm also confused about why `NextRecordTypmod` and `TupleDescData.tdtypmod` have type int32 while `SharedTypmodTableEntry.typmod` has type uint32. Best regards, Qinghao Huang
Re: Track IO times in pg_stat_io
Hi, On 2/26/23 5:03 PM, Melanie Plageman wrote: Hi, As suggested in [1], the attached patch adds IO times to pg_stat_io; Thanks for the patch! I started to have a look at it and figured out that a tiny rebase was needed (due to 728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached. The timings will only be non-zero when track_io_timing is on That could lead to incorrect interpretation if one wants to divide the timing per operations, say: - track_io_timing is set to on while there is already operations - or set to off while it was on (and the number of operations keeps growing) Might be worth to warn/highlight in the "track_io_timing" doc? + if (track_io_timing) + { + INSTR_TIME_SET_CURRENT(io_time); + INSTR_TIME_SUBTRACT(io_time, io_start); + pgstat_count_io_time(io_object, io_context, IOOP_EXTEND, io_time); + } + + pgstat_count_io_op(io_object, io_context, IOOP_EXTEND); vs @@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, INSTR_TIME_SUBTRACT(io_time, io_start); pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time)); INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time); + pgstat_count_io_time(io_object, io_context, IOOP_READ, io_time); } That leads to pgstat_count_io_time() to be called before pgstat_count_io_op() (for the IOOP_EXTEND case) and after pgstat_count_io_op() (for the IOOP_READ case). What about calling them in the same order and so that pgstat_count_io_time() is called before pgstat_count_io_op()? If so, the ordering would also need to be changed in: - FlushRelationBuffers() - register_dirty_segment() There is one minor question (in the code as a TODO) which is whether or not it is worth cross-checking that IO counts and times are either both zero or neither zero in the validation function pgstat_bktype_io_stats_valid(). As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that would be a good idea to also check that if counts are not Zero then times are not Zero. 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.comdiff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 6249bb50d0..2c62b0a437 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + + +read_time double precision + + +Time spent in read operations in milliseconds (if is enabled, otherwise zero) + + + + @@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + + +write_time double precision + + +Time spent in write operations in milliseconds (if is enabled, otherwise zero) + + + + @@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + + +extend_time double precision + + +Time spent in extend operations in milliseconds (if is enabled, otherwise zero) + + + + @@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + + +fsync_time double precision + + +Time spent in fsync operations in milliseconds (if is enabled, otherwise zero) + + + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 34ca0e739f..39391bc2fc 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1123,12 +1123,16 @@ SELECT b.io_object, b.io_context, b.reads, + b.read_time, b.writes, + b.write_time, b.extends, + b.extend_time, b.op_bytes, b.evictions, b.reuses, b.fsyncs, + b.fsync_time, b.stats_reset FROM pg_stat_get_io() b; diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 0a05577b68..bbd2af9fae 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1000,11 +1000,26 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (isExtend) { + instr_time io_start, + io_time; /* new
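Returning to the interpretation hazard noted above about dividing times by counts after track_io_timing has been toggled, a consumer query would look something like the following. Column names are taken from the patch (V2), so they are tentative:

```
-- Tentative column names from the patch under review.  read_time only
-- accumulates while track_io_timing is on, so the per-read average is
-- meaningful only if the GUC stayed on since the last stats reset.
SELECT backend_type, io_object, io_context,
       reads, read_time,
       round((read_time / nullif(reads, 0))::numeric, 3) AS avg_read_ms
FROM pg_stat_io
WHERE reads > 0
ORDER BY read_time DESC;
```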
Re: Provide PID data for "cannot wait on a latch owned by another process" in latch.c
On Tue, Feb 28, 2023 at 08:18:16AM +0100, Peter Eisentraut wrote: > I would also have asked for some kind of prefix that introduces the numbers. Okay. > I wonder what these numbers are useful for though? Is this a development > aid? Yes. > Can you do anything with these numbers? Yes. They immediately pointed out that I had missed marking a latch as owned in a process, hence the owner_pid was showing up as 0 when trying to use it. The second showed me the process that was involved, which was still useful once cross-checked with the contents of the logs prefixed with %p. -- Michael
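For readers without the patch at hand, the gist is a sanity check of roughly this shape; the message wording and the prefix for the numbers are exactly what is being discussed, so this is a sketch rather than the committed text:

```
/*
 * Sketch only: the shape of the latch ownership check in latch.c with PID
 * details added to the message.  Wording and formatting are illustrative.
 */
if (latch->owner_pid != MyProcPid)
	elog(ERROR, "cannot wait on a latch owned by another process (owner pid: %d, my pid: %d)",
		 latch->owner_pid, MyProcPid);
```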
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Thu, Feb 23, 2023 at 3:03 AM Melanie Plageman wrote: > > Hi, > > So, I attached a rough implementation of both the autovacuum failsafe > reverts to shared buffers and the vacuum option (no tests or docs or > anything). Thanks for the patches. I have some comments. 0001: 1. I don't quite understand the need for this 0001 patch. Firstly, bstrategy allocated once per autovacuum worker in AutovacMemCxt which goes away with the process. Secondly, the worker exits after do_autovacuum() with which memory context is gone. I think this patch is unnecessary unless I'm missing something. 0002: 1. Don't we need to remove vac_strategy for analyze.c as well? It's pretty-meaningless there than vacuum.c as we're passing bstrategy to all required functions. 0004: 1. I think no multiple sentences in a single error message. How about "of %d, changing it to %d"? +elog(WARNING, "buffer_usage_limit %d is below the minimum buffer_usage_limit of %d. setting it to %d", 2. Typically, postgres error messages start with lowercase letters, hints and detail messages start with uppercase letters. +if (buffers == 0) +elog(ERROR, "Use of shared buffers unsupported for buffer access strategy: %s. nbuffers must be -1.", +strategy_name); + +if (buffers > 0) +elog(ERROR, "Specification of ring size in buffers unsupported for buffer access strategy: %s. nbuffers must be -1.", +strategy_name); 3. A function for this seems unnecessary, especially when a static array would do the needful, something like forkNames[]. +static const char * +btype_get_name(BufferAccessStrategyType btype) +{ +switch (btype) +{ 4. Why are these assumptions needed? Can't we simplify by doing validations on the new buffers parameter only when the btype is BAS_VACUUM? +if (buffers == 0) +elog(ERROR, "Use of shared buffers unsupported for buffer access strategy: %s. nbuffers must be -1.", +strategy_name); +// TODO: DEBUG logging message for dev? +if (buffers == 0) +btype = BAS_NORMAL; 5. Is this change needed for this patch? default: elog(ERROR, "unrecognized buffer access strategy: %d", - (int) btype); -return NULL;/* keep compiler quiet */ +(int) btype); + +pg_unreachable(); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
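On point 3, the static-array shape being suggested (mirroring forkNames[]) could look like the sketch below. The enum values come from bufmgr.h, but the display strings are placeholders chosen for the example:

```
/* Sketch: map BufferAccessStrategyType to a display name, forkNames[]-style. */
static const char *const strategy_names[] = {
	[BAS_NORMAL]	= "normal",
	[BAS_BULKREAD]	= "bulkread",
	[BAS_VACUUM]	= "vacuum",
	[BAS_BULKWRITE] = "bulkwrite",
};
```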
Add documentation for coverage reports with meson
Hi all, I have mentioned on a different thread of -docs that we have no documentation to achieve $subject, so attached is a patch to add something. This can be done with the following steps: meson setup -Db_coverage=true .. blah ninja meson test ninja coverage-html As far as I can see, there is no option to generate anything else than a HTML report? This portion is telling the contrary, still it does not seem to work here and ninja does the job with coverage-html or coverage as only available targets: https://mesonbuild.com/howtox.html#producing-a-coverage-report Side issue: the current code generates no reports for the files that are automatically generated in src/backend/nodes/, which are actually part of src/include/ for a meson build. I have not looked into that yet. Thoughts? -- Michael diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml index a08c7a78af..dd9a004a6d 100644 --- a/doc/src/sgml/regress.sgml +++ b/doc/src/sgml/regress.sgml @@ -825,53 +825,76 @@ PG_TEST_NOCLEAN=1 make -C src/bin/pg_dump check instrumentation, so that it becomes possible to examine which parts of the code are covered by the regression tests or any other test suite that is run with the code. This is currently supported -when compiling with GCC, and it requires the gcov -and lcov programs. +when compiling with GCC, and it requires the gcov, +lcov and genhtml programs. - -A typical workflow looks like this: + +Coverage with configure + + A typical workflow looks like this: ./configure --enable-coverage ... OTHER OPTIONS ... make make check # or other test suite make coverage-html -Then point your HTML browser -to coverage/index.html. - + Then point your HTML browser + to coverage/index.html. + - -If you don't have lcov or prefer text output over an -HTML report, you can run + + If you don't have lcov or prefer text output over an + HTML report, you can run make coverage -instead of make coverage-html, which will -produce .gcov output files for each source file -relevant to the test. (make coverage and make -coverage-html will overwrite each other's files, so mixing them -might be confusing.) - + instead of make coverage-html, which will + produce .gcov output files for each source file + relevant to the test. (make coverage and make + coverage-html will overwrite each other's files, so mixing them + might be confusing.) + - -You can run several different tests before making the coverage report; -the execution counts will accumulate. If you want -to reset the execution counts between test runs, run: + + You can run several different tests before making the coverage report; + the execution counts will accumulate. If you want + to reset the execution counts between test runs, run: make coverage-clean - + - -You can run the make coverage-html or make -coverage command in a subdirectory if you want a coverage -report for only a portion of the code tree. - + + You can run the make coverage-html or make + coverage command in a subdirectory if you want a coverage + report for only a portion of the code tree. + - -Use make distclean to clean up when done. - + + Use make distclean to clean up when done. + + + + +Coverage with meson + + A typical workflow looks like this: + +meson setup -Db_coverage=true ... OTHER OPTIONS ... +ninja +meson test +ninja coverage-html + + Then point your HTML browser + to ./meson-logs/coveragereport/index.html. + + + + You can run several different tests before making the coverage report; + the execution counts will accumulate. + + signature.asc Description: PGP signature
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Mon, Feb 27, 2023 at 2:21 PM Hayato Kuroda (Fujitsu) wrote: > Few comments: 1. + /* + * If we've requested to shut down, exit the process. + * + * Note that WalSndDone() cannot be used here because the delaying + * changes will be sent in the function. + */ + if (got_STOPPING) + { + QueryCompletion qc; + + /* Inform the standby that XLOG streaming is done */ + SetQueryCompletion(&qc, CMDTAG_COPY, 0); + EndCommand(&qc, DestRemote, false); + pq_flush(); Do we really need to do anything except break the loop and let the exit handling happen in the main loop when 'got_STOPPING' is set? AFAICS, this is what we are doing in some other places (See WalSndWaitForWal). Won't that work? It seems that will help us send all the pending WAL. 2. + /* Try to flush pending output to the client */ + if (pq_flush_if_writable() != 0) + WalSndShutdown(); Is there a reason to try flushing here? Apart from the above, I have made a few changes in the comments in the attached diff patch. If you agree with those then please include them in the next version. -- With Regards, Amit Kapila. changes_amit_1.patch Description: Binary data
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Thu, Jan 12, 2023 at 6:06 AM Andres Freund wrote: > > On 2023-01-11 17:26:19 -0700, David G. Johnston wrote: > > Should we just add "ring_buffers" to the existing "shared_buffers" and > > "temp_buffers" settings? > > The different types of ring buffers have different sizes, for good reasons. So > I don't see that working well. I also think it'd be more often useful to > control this on a statement basis - if you have a parallel import tool that > starts NCPU COPYs you'd want a smaller buffer than a single threaded COPY. Of > course each session can change the ring buffer settings, but still. How about having GUCs for each ring buffer (bulk_read_ring_buffers, bulk_write_ring_buffers, vacuum_ring_buffers - ah, 3 more new GUCs)? These options can help especially when statement level controls aren't easy to add (COPY, CREATE TABLE AS/CTAS, REFRESH MAT VIEW/RMV)? If needed users can also set them at the system level. For instance, one can set bulk_write_ring_buffers to other than 16MB or -1 to disable the ring buffer to use shared_buffers and run a bunch of bulk write queries. Although I'm not quite opposing the idea of statement level controls (like the VACUUM one proposed here), it is better to make these ring buffer sizes configurable across the system to help with the other similar cases e.g., a CTAS or RMV can help subsequent reads from shared buffers if ring buffer is skipped. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
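If the GUC route were taken, usage could be as simple as the sketch below. These GUC names exist only in this proposal; nothing is implemented, and the units and special values are assumptions:

```
-- Hypothetical GUCs from this proposal; names, units and semantics unsettled.
SET bulk_write_ring_buffers = '64MB';  -- larger ring for a single-threaded COPY
SET vacuum_ring_buffers = -1;          -- disable the ring, let VACUUM use shared_buffers
COPY big_table FROM '/path/to/data.csv' WITH (FORMAT csv);
```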