Re: parallel vacuum comments
On Fri, Dec 17, 2021 at 10:51 AM Masahiko Sawada wrote:
>
> I've attached updated patches. The first patch just moves common
> function for index bulk-deletion and cleanup to vacuum.c. And the
> second patch moves parallel vacuum code to vacuumparallel.c. The
> comments I got so far are incorporated into these patches. Please
> review them.
>

Thanks, it is helpful for the purpose of review.

Few comments:
=============
1.
+ * dead_items stores TIDs whose index tuples are deleted by index vacuuming.
+ * Each TID points to an LP_DEAD line pointer from a heap page that has been
+ * processed by lazy_scan_prune. Also needed by lazy_vacuum_heap_rel, which
+ * marks the same LP_DEAD line pointers as LP_UNUSED during second heap pass.
  */
- LVDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */

Isn't it better to keep these comments atop the structure VacDeadItems declaration?

2. What is the reason for not moving lazy_vacuum_one_index/lazy_cleanup_one_index to vacuum.c so that they can be called from vacuumlazy.c and vacuumparallel.c? Without this refactoring patch, won't both the leader and workers set the same error context phase (VACUUM_ERRCB_PHASE_VACUUM_INDEX) during index vacuuming? Is it because you want a separate context phase for a parallel vacuum? The other thing related to this is that if we have to do it the way you have it here then we don't need pg_rusage_init() in functions lazy_vacuum_one_index/lazy_cleanup_one_index.

3.
@@ -3713,7 +3152,7 @@ update_index_statistics(LVRelState *vacrel)
  int nindexes = vacrel->nindexes;
  IndexBulkDeleteResult **indstats = vacrel->indstats;

- Assert(!IsInParallelMode());
+ Assert(!ParallelVacuumIsActive(vacrel));

I think we can retain the older Assert. If we do that then I think we don't need to define ParallelVacuumIsActive in vacuumlazy.c.

--
With Regards,
Amit Kapila.
Re: sequences vs. synchronous replication
On 12/18/21 05:52, Tom Lane wrote:
> Tomas Vondra writes:
>> The problem is exactly the same as in [1] - the aborted transaction
>> generated WAL, but RecordTransactionAbort() ignores that and does not
>> update LogwrtResult.Write, with the reasoning that aborted transactions
>> do not matter. But sequences violate that, because we only write WAL
>> once every 32 increments, so the following nextval() gets "committed"
>> without waiting for the replica (because it did not produce WAL).
>
> Ugh.
>
>> I'm not sure this is a clear data corruption bug, but it surely walks
>> and quacks like one. My proposal is to fix this by tracking the LSN of
>> the last WAL record for a sequence increment, and then checking that LSN
>> in RecordTransactionCommit() before calling XLogFlush().
>
> (1) Does that work if the aborted increment was in a different session?
> I think it is okay but I'm tired enough to not be sure.

Good point - it doesn't :-( At least not by simply storing the LSN in a global variable or something like that. The second backend needs to know the LSN of the last WAL-logged sequence increment, but only the first backend knows that. So we'd need to share that between backends somehow.

I doubt we want to track the LSN for every individual sequence (because for clusters with many dbs / sequences that may be a lot). Perhaps we could track just a fixed number of LSN values in shared memory (say, 1024), and update/read just the element determined by hash(oid). That is, the backend WAL-logging a sequence with a given oid would set the current LSN in array[hash(oid) % 1024], and the backend doing nextval() would simply remember the LSN in that slot. Yes, if there are conflicts that'll flush more than needed.

Alternatively we could simply use the current insert LSN, but that's going to flush more stuff than needed all the time.

> (2) I'm starting to wonder if we should rethink the sequence logging
> mechanism altogether. It was cool when designed, but it seems really
> problematic when you start thinking about replication behaviors.
> Perhaps if wal_level > minimal, we don't do things the same way?

Maybe, but I have no idea how the reworked WAL logging should work. Any batching seems to have this issue, and logging individual increments is likely going to be slower. Of course, reworking how sequences are WAL-logged may invalidate the "sequence decoding" patch I've been working on :-(

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
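The fixed-size slot array described above could look roughly like the following minimal sketch. This is illustrative C only, not a patch: in a real implementation the array would live in shared memory behind a lock (or use atomics), and a proper hash function would replace the plain modulo; all names here are invented.

```c
#include <stdint.h>

#define SEQ_LSN_SLOTS 1024

/* Hypothetical slot array; a real implementation would put this in
 * shared memory and protect it against concurrent access. */
static uint64_t seq_last_lsn[SEQ_LSN_SLOTS];

/* Called by whichever backend WAL-logs a sequence increment: remember
 * the LSN of that WAL record in the slot for this sequence's OID. */
void
seq_record_increment_lsn(uint32_t seqoid, uint64_t lsn)
{
    int slot = seqoid % SEQ_LSN_SLOTS;  /* stand-in for hash(oid) % 1024 */

    if (lsn > seq_last_lsn[slot])
        seq_last_lsn[slot] = lsn;
}

/* Called from nextval(): the LSN the transaction would have to flush
 * (and wait for, under synchronous replication) at commit. A hash
 * collision can only return a higher LSN, i.e. cause flushing more
 * than strictly needed - never less. */
uint64_t
seq_lsn_to_flush(uint32_t seqoid)
{
    return seq_last_lsn[seqoid % SEQ_LSN_SLOTS];
}
```

The "flush more than needed on conflicts" behavior mentioned above corresponds to two OIDs landing in the same slot: the slot keeps the higher LSN, which is safe but possibly wasteful.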
Re: sequences vs. synchronous replication
Tomas Vondra writes:
> The problem is exactly the same as in [1] - the aborted transaction
> generated WAL, but RecordTransactionAbort() ignores that and does not
> update LogwrtResult.Write, with the reasoning that aborted transactions
> do not matter. But sequences violate that, because we only write WAL
> once every 32 increments, so the following nextval() gets "committed"
> without waiting for the replica (because it did not produce WAL).

Ugh.

> I'm not sure this is a clear data corruption bug, but it surely walks
> and quacks like one. My proposal is to fix this by tracking the LSN of
> the last WAL record for a sequence increment, and then checking that LSN
> in RecordTransactionCommit() before calling XLogFlush().

(1) Does that work if the aborted increment was in a different session? I think it is okay but I'm tired enough to not be sure.

(2) I'm starting to wonder if we should rethink the sequence logging mechanism altogether. It was cool when designed, but it seems really problematic when you start thinking about replication behaviors. Perhaps if wal_level > minimal, we don't do things the same way?

regards, tom lane
Re: Column Filtering in Logical Replication
On Fri, Dec 17, 2021 at 3:16 PM houzj.f...@fujitsu.com wrote:
>
> On Friday, December 17, 2021 1:55 AM Alvaro Herrera wrote:
> > On 2021-Dec-16, houzj.f...@fujitsu.com wrote:
> >
> > > The patch ensures all columns of RT are in the column list when
> > > CREATE/ALTER publication, but it seems it doesn't prevent the user
> > > from changing the replica identity or dropping the index used in the
> > > replica identity. Do we also need to check those cases?
> >
> > Yes, we do. As it happens, I spent a couple of hours yesterday writing
> > code for that, at least partially. I haven't yet checked what happens
> > with cases like REPLICA NOTHING, or REPLICA INDEX and then dropping
> > that index.
> >
> > My initial ideas were a bit wrong BTW: I thought we should check the
> > combination of column lists in all publications (a bitwise-OR of column
> > bitmaps, so to speak). But conceptually that's wrong: we need to check
> > the column list of each publication individually instead. Otherwise, if
> > you wanted to hide a column from some publication but that column was
> > part of the replica identity, there'd be no way to identify the tuple
> > in the replica. (Or, if the pgoutput code disobeys the column list and
> > sends the replica identity even if it's not in the column list, then
> > you'd be potentially publishing data that you wanted to hide.)
>
> Thanks for the explanation.
>
> Apart from ALTER REPLICA IDENTITY and DROP INDEX, I think there could be
> some other cases we need to handle for the replica identity check:
>
> 1)
> When adding a partitioned table with a column list to the publication, I
> think we need to check the RI of all its leaf partitions, because the RI
> on the partition is the one that actually takes effect.
>
> 2)
> ALTER TABLE ADD PRIMARY KEY;
> ALTER TABLE DROP CONSTRAINT "PRIMARY KEY";
>
> If the replica identity is default, it will use the primary key, so we
> might also need to prevent the user from adding or removing a primary
> key in this case.
>
> Based on the above cases, the RI check seems like it could bring a
> considerable amount of code. So, how about we follow what we already did
> in CheckCmdReplicaIdentity(): we can put the check for the RI in that
> function, so that we can cover all the cases and reduce the code change.
> And if we are worried about the cost of doing the check for UPDATE and
> DELETE every time, we can also save the result in the relcache. It's
> safe because every operation that changes the RI will invalidate the
> relcache. We are using this approach in the row filter patch to make
> sure all columns in the row filter expression are part of the RI.
>

Another point related to the RI is that this patch seems to restrict specifying the RI columns in the column filter list irrespective of the publish action. Do we need to have such a restriction if the publication only publishes 'insert' or 'truncate'?

--
With Regards,
Amit Kapila.
Re: Column Filtering in Logical Replication
On Sat, Dec 18, 2021 at 7:04 AM Alvaro Herrera wrote:
>
> On 2021-Dec-17, Tomas Vondra wrote:
>
> > On 12/17/21 22:07, Alvaro Herrera wrote:
> > > So I've been thinking about this as a "security" item (you can see my
> > > comments to that effect sprinkled all over this thread), in the sense
> > > that if a publication "hides" some column, then the replica just won't
> > > get access to it. But in reality that's mistaken: the filtering that
> > > this patch implements is done based on the queries that *the replica*
> > > executes at its own volition; if the replica decides to ignore the
> > > list of columns, it'll be able to get all columns. All it takes is an
> > > uncooperative replica in order for the lot of data to be exposed
> > > anyway.
> >
> > Interesting, I haven't really looked at this as a security feature. And
> > in my experience if something is not carefully designed to be secure
> > from the get go, it's really hard to add that bit later ...
>
> I guess the way to really harden replication is to use the GRANT system
> at the publisher's side to restrict access for the replication user.
> This would provide actual security. So you're right that I seem to be
> barking up the wrong tree ... maybe I need to give a careful look at
> the documentation for logical replication to understand what is being
> offered, and to make sure that we explicitly indicate that limiting the
> column list does not provide any actual security.
>

IIRC, the use cases as mentioned by other databases (like Oracle) are (a) this helps when the target table doesn't have the same set of columns, or (b) when the columns contain some sensitive information like a personal identification number, etc. I think there could also be a side benefit: less data flows across the network, which could lead to faster replication, especially when the user filters out large column data.

--
With Regards,
Amit Kapila.
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 5:29 PM Greg Nancarrow wrote:
>
> On Fri, Dec 17, 2021 at 7:20 PM Ajin Cherian wrote:
> >
> > On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow wrote:
> > >
> > > So using the v47 patch-set, I still find that the UPDATE above results
> > > in publication of an INSERT of (2,1), rather than an UPDATE of (1,1)
> > > to (2,1).
> > > This is according to the 2nd UPDATE rule below, from patch 0003.
> > >
> > > + * old-row (no match)    new-row (no match)  -> (drop change)
> > > + * old-row (no match)    new-row (match)     -> INSERT
> > > + * old-row (match)       new-row (no match)  -> DELETE
> > > + * old-row (match)       new-row (match)     -> UPDATE
> > >
> > > This is because the old row (1,1) doesn't match the UPDATE filter
> > > "(a>1)", but the new row (2,1) does.
> > > This functionality doesn't seem right to me. I don't think it can be
> > > assumed that (1,1) was never published (and thus requires an INSERT
> > > rather than UPDATE) based on these checks, because in this example,
> > > (1,1) was previously published via a different operation - INSERT (and
> > > using a different filter too).
> > > I think the fundamental problem here is that these UPDATE rules assume
> > > that the old (current) row was previously UPDATEd (and published, or
> > > not published, according to the filter applicable to UPDATE), but this
> > > is not necessarily the case.
> > > Or am I missing something?
> >
> > But it need not be correct in assuming that the old-row was part of a
> > previous INSERT either (and published, or not published according to
> > the filter applicable to an INSERT).
> > For example, change the sequence of inserts and updates prior to the
> > last update:
> >
> > truncate tbl1;
> > insert into tbl1 values (1,5); ==> not replicated since insert and !(b < 2)
> > update tbl1 set b = 1; ==> not replicated since update and !(a > 1)
> > update tbl1 set a = 2; ==> replicated and update converted to insert since (a > 1)
> >
> > In this case, the last update "update tbl1 set a = 2;" is updating a
> > row that was previously updated and not inserted and not replicated to
> > the subscriber.
> > How does the replication logic differentiate between these two cases,
> > and decide if the update was previously published or not?
> > I think it's futile for the publisher side to try and figure out the
> > history of published rows. In fact, if this level of logic is required
> > then it is best implemented on the subscriber side, which then defeats
> > the purpose of a publication filter.
> >
>
> I think it's a concern, for such a basic example with only one row,
> getting unpredictable (and even wrong) replication results, depending
> upon the order of operations.
>

I am not sure how we can deduce that. The results are based on the current and new values of the row, which is what I think we are expecting here.

> Doesn't this problem result from allowing different WHERE clauses for
> different pubactions for the same table?
> My current thoughts are that this shouldn't be allowed, and also WHERE
> clauses for INSERTs should, like UPDATE and DELETE, be restricted to
> using only columns covered by the replica identity or primary key.
>

Hmm, even if we do that, one could have removed the insert row filter by the time we are evaluating the update, so we would get the same result. I think the behavior in your example is as we expect per the specs defined by the patch and I don't see any problem, in this case, w.r.t. replication results. Let us see what others think on this?

--
With Regards,
Amit Kapila.
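For reference, the four UPDATE rules quoted from patch 0003 are simple enough to state as code. The following is just a restatement of that table (the function and enum names are invented, this is not the actual pgoutput implementation): the chosen action depends only on whether the old and new row versions match the UPDATE row filter, with no knowledge of what was previously published.

```c
#include <stdbool.h>

typedef enum RowFilterAction
{
    RF_DROP_CHANGE,     /* publish nothing */
    RF_PUBLISH_INSERT,
    RF_PUBLISH_UPDATE,
    RF_PUBLISH_DELETE
} RowFilterAction;

/* Transformation table from the 0003 patch:
 *   old (no match), new (no match) -> drop change
 *   old (no match), new (match)    -> INSERT
 *   old (match),    new (no match) -> DELETE
 *   old (match),    new (match)    -> UPDATE */
RowFilterAction
transform_update(bool old_matches, bool new_matches)
{
    if (old_matches && new_matches)
        return RF_PUBLISH_UPDATE;
    if (old_matches)
        return RF_PUBLISH_DELETE;   /* row leaves the filtered set */
    if (new_matches)
        return RF_PUBLISH_INSERT;   /* row enters the filtered set */
    return RF_DROP_CHANGE;
}
```

Greg's example is the "row enters the filtered set" branch: old row (1,1) fails (a>1), new row (2,1) passes, so an INSERT is published regardless of whether (1,1) ever reached the subscriber.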
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada wrote:
> > My emphasis here has been on making non-aggressive VACUUMs *always*
> > advance relfrozenxid, outside of certain obvious edge cases. And so
> > with all the patches applied, up to and including the opportunistic
> > freezing patch, every autovacuum of every table manages to advance
> > relfrozenxid during benchmarking -- usually to a fairly recent value.
> > I've focussed on making aggressive VACUUMs (especially anti-wraparound
> > autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> > user keeps canceling autovacuums, maybe due to automated script that
> > performs DDL). That has taken priority over other goals, for now.
>
> Great!

Maybe this is a good time to revisit basic questions about VACUUM. I wonder if we can get rid of some of the GUCs for VACUUM now. Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for physical blocks, but we use logical units (XIDs). We probably shouldn't be using any units, but using XIDs "feels wrong" to me.

Even with my patch, it is theoretically possible that we won't be able to advance relfrozenxid very much, because we cannot get a cleanup lock on one single heap page with one old XID. But even in this extreme case, how relevant is the "age" of this old XID, really? What really matters is whether or not we can advance relfrozenxid in time (with time to spare). And so the wraparound risk of the system is not affected all that much by the age of the single oldest XID. The risk mostly comes from how much total work we still need to do to advance relfrozenxid. If the single old XID is quite old indeed (~1.5 billion XIDs), but there is only one, then we just have to freeze one tuple to be able to safely advance relfrozenxid (maybe advance it by a huge amount!). How long can it take to freeze one tuple, with the freeze map, etc?

On the other hand, the risk may be far greater if we have *many* tuples that are still unfrozen, whose XIDs are only "middle aged" right now. The idea behind vacuum_freeze_min_age seems to be to be lazy about work (tuple freezing) in the hope that we'll never need to do it, but that seems obsolete now. (It probably made a little more sense before the visibility map.)

Using XIDs makes sense for things like autovacuum_freeze_max_age, because there we have to worry about wraparound and relfrozenxid (whether or not we like it). But with this patch, and with everything else (the failsafe, insert-driven autovacuums, everything we've done over the last several years) I think that it might be time to increase the autovacuum_freeze_max_age default. Maybe even to something as high as 800 million transaction IDs, but certainly to 400 million. What do you think? (Maybe don't answer just yet, something to think about.)

> +       vacrel->aggressive = aggressive;
>         vacrel->failsafe_active = false;
>         vacrel->consider_bypass_optimization = true;
>
> How about adding skipwithvm to LVRelState too?

Agreed -- it's slightly better that way. Will change this.

>          */
> -       if (skipping_blocks && !FORCE_CHECK_PAGE())
> +       if (skipping_blocks && blkno < nblocks - 1)
>
> Why do we always need to scan the last page even if heap truncation is
> disabled (or in the failsafe mode)?

My goal here was to keep the behavior from commit e8429082, "Avoid useless truncation attempts during VACUUM", while simplifying things around skipping heap pages via the visibility map (including removing the FORCE_CHECK_PAGE() macro). Of course you're right that this particular change that you have highlighted does change the behavior a little -- now we will always treat the final page as a "scanned page", except perhaps when 100% of all pages in the relation are skipped using the visibility map. This was a deliberate choice (and perhaps even a good choice!).

I think that avoiding accessing the last heap page like this isn't worth the complexity. Note that we may already access heap pages (making them "scanned pages") despite the fact that we know it's unnecessary: the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't even try to avoid wasting CPU cycles on these not-skipped-but-skippable pages). So I think that the performance cost for the last page isn't going to be noticeable.

However, now that I think about it, I wonder... what do you think of SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today? SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly after the original visibility map implementation was committed in 2009. The idea that it helps us to advance relfrozenxid outside of aggressive VACUUMs (per the commit message from bf136cf6e3) seems like it might no longer matter with the patch -- because now we won't ever set a page all-visible but not all-frozen. Plus the idea that we need to do all this work just
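To make the SKIP_PAGES_THRESHOLD discussion concrete, here is a simplified model of the skipping behavior (not the actual vacuumlazy.c code; it ignores aggressive mode and the last-page rule discussed above): runs of all-visible pages shorter than the threshold are read anyway, even though that work is known to be unnecessary.

```c
#include <stdbool.h>

#define SKIP_PAGES_THRESHOLD 32

/* Simplified model: given which pages are all-visible in the VM, count
 * how many pages VACUUM would actually read. Only runs of at least
 * SKIP_PAGES_THRESHOLD consecutive all-visible pages are skipped;
 * shorter runs become "scanned pages" despite being skippable. */
int
pages_scanned(const bool *all_visible, int nblocks)
{
    int scanned = 0;
    int blkno = 0;

    while (blkno < nblocks)
    {
        if (!all_visible[blkno])
        {
            scanned++;
            blkno++;
            continue;
        }

        /* measure the run of consecutive all-visible pages */
        int run = 0;
        while (blkno + run < nblocks && all_visible[blkno + run])
            run++;

        if (run < SKIP_PAGES_THRESHOLD)
            scanned += run;     /* short run: read the pages anyway */
        blkno += run;
    }
    return scanned;
}
```

In this model, a table whose all-visible pages are scattered in short runs gets no benefit from the visibility map at all, which is the "not-skipped-but-skippable" waste described above.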
Re: Column Filtering in Logical Replication
On 12/18/21 02:34, Alvaro Herrera wrote:
> On 2021-Dec-17, Tomas Vondra wrote:
>
>> On 12/17/21 22:07, Alvaro Herrera wrote:
>>> So I've been thinking about this as a "security" item (you can see my
>>> comments to that effect sprinkled all over this thread), in the sense
>>> that if a publication "hides" some column, then the replica just won't
>>> get access to it. But in reality that's mistaken: the filtering that
>>> this patch implements is done based on the queries that *the replica*
>>> executes at its own volition; if the replica decides to ignore the list
>>> of columns, it'll be able to get all columns. All it takes is an
>>> uncooperative replica in order for the lot of data to be exposed anyway.
>>
>> Interesting, I haven't really looked at this as a security feature. And
>> in my experience if something is not carefully designed to be secure
>> from the get go, it's really hard to add that bit later ...
>
> I guess the way to really harden replication is to use the GRANT system
> at the publisher's side to restrict access for the replication user.
> This would provide actual security. So you're right that I seem to be
> barking up the wrong tree ... maybe I need to give a careful look at
> the documentation for logical replication to understand what is being
> offered, and to make sure that we explicitly indicate that limiting the
> column list does not provide any actual security.
>
>> You say it's the replica making the decisions, but my mental model is
>> it's the publisher decoding the data for a given list of publications
>> (which indeed is specified by the subscriber). But the subscriber can't
>> tweak the definition of publications, right? Or what do you mean by
>> queries executed by the replica? What's the gap?
>
> I am thinking of somebody modifying the code that the replica runs, so
> that it ignores the column list that the publication has been configured
> to provide; instead of querying only those columns, it would query all
> columns.

>>> If the server has a *separate* security mechanism to hide the columns
>>> (per-column privs), it is that feature that will protect the data, not
>>> the logical-replication-feature to filter out columns.
>>
>> Right. Although I haven't thought about how logical decoding interacts
>> with column privileges. I don't think logical decoding actually checks
>> column privileges - I certainly don't recall any ACL checks in
>> src/backend/replication ...
>
> Well, in practice if you're confronted with a replica that's controlled
> by a malicious user that can tweak its behavior, then replica-side
> privilege checking won't do anything useful.

I don't follow. Surely the decoding happens on the primary node, right? Which is where the ACL checks would happen, using the role the replication connection is opened with.

>>> This led me to realize that the replica-side code in tablesync.c is
>>> totally oblivious to what's the publication through which a table is
>>> being received from in the replica. So we're not aware of a replica
>>> being exposed only a subset of columns through some specific
>>> publication; and a lot more hacking is needed than this patch does, in
>>> order to be aware of which publications are being used.
>>
>> Does that mean we currently sync all the columns in the initial sync,
>> and only start filtering columns later while decoding transactions?
>
> No, it does filter the list of columns in the initial sync. But the
> current implementation is bogus, because it obtains the list of *all*
> publications in which the table is published, not just the ones that the
> subscription is configured to get data from. And the sync code doesn't
> receive the list of publications. We need more thorough patching of the
> sync code to close that hole.

Ah, got it. Thanks for the explanation. Yeah, that makes no sense.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences
On 12/15/21 14:51, Tomas Vondra wrote:
> On 12/15/21 14:20, Amit Kapila wrote:
>> On Tue, Dec 14, 2021 at 7:02 AM Tomas Vondra wrote:
>>>
>>> Hi, here's an updated version of the patches, dealing with almost all
>>> of the issues (at least in the 0001 and 0002 parts). The main changes:
>>>
>>> 1) I've removed the 'created' flag from fill_seq_with_data, as
>>> discussed. I don't think it's needed by any of the parts (not even
>>> 0003, AFAICS). We still need it in xl_seq_rec, though.
>>>
>>> 2) The GetCurrentTransactionId() calls added to sequence.c are made
>>> only with wal_level=logical, to minimize the overhead.
>>>
>>> There's still one remaining problem, that I already explained in [1].
>>> The problem is that with this:
>>>
>>> BEGIN;
>>> SELECT nextval('s') FROM generate_series(1,100);
>>> ROLLBACK;
>>>
>>> The root cause is that pg_current_wal_lsn() uses LogwrtResult.Write,
>>> which is updated by XLogFlush() - but only in RecordTransactionCommit.
>>> Which makes sense, because only the committed stuff is "visible". But
>>> the non-transactional behavior of sequence decoding disagrees with
>>> this, because now some of the changes from aborted transactions may be
>>> replicated. Which means the wait_for_catchup() ends up not waiting for
>>> the sequence change to be replicated. This is an issue for tests in
>>> patch 0003, at least. My concern is this actually affects other places
>>> waiting for things getting replicated :-/
>>
>> By any chance, will this impact synchronous replication as well which
>> waits for commits to be replicated?
>
> Physical or logical replication? Physical is certainly not replicated.

Actually, I take that back. It does affect physical (sync) replication just as well, and I think it might be considered a data loss issue. I started a new thread to discuss that, so that it's not buried in this thread about logical replication.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
sequences vs. synchronous replication
Hi,

while working on logical decoding of sequences, I ran into an issue with nextval() in a transaction that rolls back, described in [1]. But after thinking about it a bit more (and chatting with Petr Jelinek), I think this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

CREATE SEQUENCE s;

-- shutdown the sync replica

BEGIN;
SELECT nextval('s') FROM generate_series(1,50);
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

The natural expectation would be that the COMMIT gets stuck, waiting for the sync replica (which is not running), right? But it does not.

The problem is exactly the same as in [1] - the aborted transaction generated WAL, but RecordTransactionAbort() ignores that and does not update LogwrtResult.Write, with the reasoning that aborted transactions do not matter. But sequences violate that, because we only write WAL once every 32 increments, so the following nextval() gets "committed" without waiting for the replica (because it did not produce WAL).

I'm not sure this is a clear data corruption bug, but it surely walks and quacks like one. My proposal is to fix this by tracking the LSN of the last WAL record for a sequence increment, and then checking that LSN in RecordTransactionCommit() before calling XLogFlush().

regards

[1] https://www.postgresql.org/message-id/ae3cab67-c31e-b527-dd73-08f196999ad4%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
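The proposed commit-time check could be sketched as follows. This is toy code, not actual xlog.c, and all names are invented: the point is only that the flush target at commit must consider both the transaction's own WAL and the LSN of the last WAL-logged sequence increment it consumed via nextval().

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the proposed RecordTransactionCommit() change: return
 * the LSN that XLogFlush() should be called with at commit time
 * (0 means nothing to flush, hence no sync-rep wait either). */
uint64_t
commit_flush_lsn(bool wrote_xlog, uint64_t xact_end_lsn,
                 uint64_t last_seq_increment_lsn)
{
    uint64_t flush_to = wrote_xlog ? xact_end_lsn : 0;

    /* The sequence increment may have been WAL-logged by an earlier
     * transaction (possibly aborted, possibly another session), so it
     * matters even when this transaction wrote no WAL itself. */
    if (last_seq_increment_lsn > flush_to)
        flush_to = last_seq_increment_lsn;

    return flush_to;
}
```

In the reproduction above, the second transaction wrote no WAL of its own, but it would carry a nonzero sequence-increment LSN from the rolled-back nextval() batch, so its COMMIT would flush and wait for the sync replica as expected.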
Re: Column Filtering in Logical Replication
On 2021-Dec-17, Tomas Vondra wrote:

> On 12/17/21 22:07, Alvaro Herrera wrote:
> > So I've been thinking about this as a "security" item (you can see my
> > comments to that effect sprinkled all over this thread), in the sense
> > that if a publication "hides" some column, then the replica just won't
> > get access to it. But in reality that's mistaken: the filtering that
> > this patch implements is done based on the queries that *the replica*
> > executes at its own volition; if the replica decides to ignore the list
> > of columns, it'll be able to get all columns. All it takes is an
> > uncooperative replica in order for the lot of data to be exposed anyway.
>
> Interesting, I haven't really looked at this as a security feature. And
> in my experience if something is not carefully designed to be secure
> from the get go, it's really hard to add that bit later ...

I guess the way to really harden replication is to use the GRANT system at the publisher's side to restrict access for the replication user. This would provide actual security. So you're right that I seem to be barking up the wrong tree ... maybe I need to give a careful look at the documentation for logical replication to understand what is being offered, and to make sure that we explicitly indicate that limiting the column list does not provide any actual security.

> You say it's the replica making the decisions, but my mental model is
> it's the publisher decoding the data for a given list of publications
> (which indeed is specified by the subscriber). But the subscriber can't
> tweak the definition of publications, right? Or what do you mean by
> queries executed by the replica? What's the gap?

I am thinking of somebody modifying the code that the replica runs, so that it ignores the column list that the publication has been configured to provide; instead of querying only those columns, it would query all columns.

> > If the server has a *separate* security mechanism to hide the columns
> > (per-column privs), it is that feature that will protect the data, not
> > the logical-replication-feature to filter out columns.
>
> Right. Although I haven't thought about how logical decoding interacts
> with column privileges. I don't think logical decoding actually checks
> column privileges - I certainly don't recall any ACL checks in
> src/backend/replication ...

Well, in practice if you're confronted with a replica that's controlled by a malicious user that can tweak its behavior, then replica-side privilege checking won't do anything useful.

> > This led me to realize that the replica-side code in tablesync.c is
> > totally oblivious to what's the publication through which a table is
> > being received from in the replica. So we're not aware of a replica
> > being exposed only a subset of columns through some specific
> > publication; and a lot more hacking is needed than this patch does, in
> > order to be aware of which publications are being used.
>
> Does that mean we currently sync all the columns in the initial sync,
> and only start filtering columns later while decoding transactions?

No, it does filter the list of columns in the initial sync. But the current implementation is bogus, because it obtains the list of *all* publications in which the table is published, not just the ones that the subscription is configured to get data from. And the sync code doesn't receive the list of publications. We need more thorough patching of the sync code to close that hole.

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: WIP: WAL prefetch (another approach)
Greg Stark writes:
> But the bigger question is: are we really concerned about this flaky
> problem? Is it worth investing time and money on? I can get money to
> go buy a G4 or G5 and spend some time on it. It just seems a bit...
> niche. But if it's a real bug that represents something broken on
> other architectures that just happens to be easier to trigger here it
> might be worthwhile.

TBH, I don't know. There seem to be three plausible explanations:

1. Flaky hardware in my unit.
2. Ancient macOS bug, as Andres suggested upthread.
3. Actual PG bug.

If it's #1 or #2 then we're just wasting our time here. I'm not sure how to estimate the relative probabilities, but I suspect #3 is the least likely of the lot.

FWIW, I did just reproduce the problem on that machine with current HEAD:

2021-12-17 18:40:40.293 EST [21369] FATAL: inconsistent page found, rel 1663/167772/2673, forknum 0, blkno 26
2021-12-17 18:40:40.293 EST [21369] CONTEXT: WAL redo at C/3DE3F658 for Btree/INSERT_LEAF: off 208; blkref #0: rel 1663/167772/2673, blk 26 FPW
2021-12-17 18:40:40.522 EST [21365] LOG: startup process (PID 21369) exited with exit code 1

That was after only five loops of the regression tests, so either I got lucky or the failure probability has increased again. In any case, it seems clear that the problem exists independently of Munro's patches, so I don't really think this question should be considered a blocker for those.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
On Fri, 17 Dec 2021 at 18:40, Tom Lane wrote:
>
> Greg Stark writes:
> > Hm. I seem to have picked a bad checkout. I took the last one before
> > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).
>
> FWIW, I think that's the first one *after* the revert.

Doh.

But the bigger question is: are we really concerned about this flaky problem? Is it worth investing time and money on? I can get money to go buy a G4 or G5 and spend some time on it. It just seems a bit... niche. But if it's a real bug that represents something broken on other architectures that just happens to be easier to trigger here, it might be worthwhile.

--
greg
Re: Add id's to various elements in protocol.sgml
On Dec 17, 2021 at 08:13, Peter Eisentraut wrote:
> On 15.12.21 16:59, Brar Piening wrote:
> > On Dec 15, 2021 at 15:49, Alvaro Herrera wrote:
> > > On 2021-Dec-15, Brar Piening wrote:
> > > > Since I can't argue towards some general utility for the xreflabels
> > > > and don't have any other solid argument in favor of adding more, I
> > > > will remove them from my current patch but leave the existing ones
> > > > intact.
> > >
> > > Yeah, I think not adding them until we have a use for them might be wisest.
> >
> > A new version of the patch that doesn't add xreflabels is attached.
>
> Now this patch adds a bunch of ids, but you can't use them to link to,
> because as soon as you do, you will get complaints about a missing
> xreflabel. So what is the remaining purpose?

The purpose is that you can directly link to the id in the public HTML docs, which still get generated (e.g. https://www.postgresql.org/docs/14/protocol-replication.html#PROTOCOL-REPLICATION-BASE-BACKUP). Essentially it gives people discussing the protocol and pointing to a certain command or message format the chance to link to the very thing they are discussing, instead of the top of the lengthy HTML page.
Re: WIP: WAL prefetch (another approach)
Greg Stark writes:
> Hm. I seem to have picked a bad checkout. I took the last one before
> the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).

FWIW, I think that's the first one *after* the revert.

> 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel
> worker" (PID 54073) was terminated by signal 10: Bus error

I'm betting on a weird emulation issue. None of my real PPC machines showed such things.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
On 12/17/21 23:56, Greg Stark wrote:
> Hm. I seem to have picked a bad checkout. I took the last one before
> the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).
>
> Or there's some incompatibility with the emulation and the IPC stuff
> parallel workers use.
>
> 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error
> 2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u;
> 2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes
> 2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode
> 2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing

Interesting. In my experience SIGBUS on PPC tends to be due to incorrect alignment, but I'm not sure how that works with the emulation. Can you get a backtrace?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Hm. I seem to have picked a bad checkout. I took the last one before the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).

Or there's some incompatibility with the emulation and the IPC stuff parallel workers use.

2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes
2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing
Re: Column Filtering in Logical Replication
On 12/17/21 22:07, Alvaro Herrera wrote:
> So I've been thinking about this as a "security" item (you can see my
> comments to that effect sprinkled all over this thread), in the sense
> that if a publication "hides" some column, then the replica just won't
> get access to it. But in reality that's mistaken: the filtering that
> this patch implements is done based on the queries that *the replica*
> executes at its own volition; if the replica decides to ignore the list
> of columns, it'll be able to get all columns. All it takes is an
> uncooperative replica in order for the lot of data to be exposed anyway.

Interesting, I haven't really looked at this as a security feature. And in my experience, if something is not carefully designed to be secure from the get-go, it's really hard to add that bit later ...

You say it's the replica making the decisions, but my mental model is that it's the publisher decoding the data for a given list of publications (which indeed is specified by the subscriber). But the subscriber can't tweak the definition of publications, right? Or what do you mean by queries executed by the replica? What is the gap?

> If the server has a *separate* security mechanism to hide the columns
> (per-column privs), it is that feature that will protect the data, not
> the logical-replication-feature to filter out columns.

Right. Although I haven't thought about how logical decoding interacts with column privileges. I don't think logical decoding actually checks column privileges - I certainly don't recall any ACL checks in src/backend/replication ...

AFAIK we only really check privileges during the initial sync (when creating the slot and copying data), but then we keep replicating data even if the privilege gets revoked for the table/column. In principle the replication role is pretty close to superuser.

> This led me to realize that the replica-side code in tablesync.c is
> totally oblivious to what's the publication through which a table is
> being received from in the replica.
> So we're not aware of a replica being exposed only a subset of columns
> through some specific publication; and a lot more hacking is needed
> than this patch does, in order to be aware of which publications are
> being used. I'm going to have a deeper look at this whole thing.

Does that mean we currently sync all the columns in the initial sync, and only start filtering columns later, while decoding transactions?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Adding CI to our tree
On 2021-12-17 11:31:59 -0800, Andres Freund wrote:
> > Don't we need the ulimit call for FreeBSD?
>
> I think the default core limits were different, I will check.

Yep, FreeBSD has -c unlimited by default:
https://cirrus-ci.com/task/6199382239346688?logs=sysinfo#L23
vs
https://cirrus-ci.com/task/4792007355793408?logs=sysinfo#L32
Re: Column Filtering in Logical Replication
So I've been thinking about this as a "security" item (you can see my comments to that effect sprinkled all over this thread), in the sense that if a publication "hides" some column, then the replica just won't get access to it. But in reality that's mistaken: the filtering that this patch implements is done based on the queries that *the replica* executes at its own volition; if the replica decides to ignore the list of columns, it'll be able to get all columns. All it takes is an uncooperative replica in order for the lot of data to be exposed anyway.

If the server has a *separate* security mechanism to hide the columns (per-column privs), it is that feature that will protect the data, not the logical-replication-feature to filter out columns.

This led me to realize that the replica-side code in tablesync.c is totally oblivious to what's the publication through which a table is being received from in the replica. So we're not aware of a replica being exposed only a subset of columns through some specific publication; and a lot more hacking is needed than this patch does, in order to be aware of which publications are being used. I'm going to have a deeper look at this whole thing.

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: speed up text_position() for utf-8
Attached is a short patch series to develop some ideas of inlining pg_utf_mblen().

0001 puts the main implementation of pg_utf_mblen() into an inline function and uses this in pg_mblen(). This is somewhat faster in the strpos tests, so that gives some measure of the speedup expected for other callers. Text search seems to call this a lot, so this might have a noticeable benefit.

0002 refactors text_position_get_match_pos() to use pg_mbstrlen_with_len(). This itself is significantly faster when combined with 0001, likely because the latter can inline the call to pg_mblen(). The intention is to speed up more than just text_position.

0003 explicitly specializes pg_mbstrlen_with_len() for the inline version of pg_utf_mblen(), but turns out to be almost as slow as master for ascii. It doesn't help if I undo the previous change in pg_mblen(), and I haven't investigated why yet.

0002 looks good now, but the experience with 0003 makes me hesitant to propose this seriously until I can figure out what's going on there.

The test is as earlier, a worst-case substring search, times in milliseconds:

 patch  | no match | ascii | multibyte
--------+----------+-------+-----------
 PG11   |     1220 |  1220 |      1150
 master |      385 |  2420 |      1980
 0001   |      390 |  2180 |      1670
 0002   |      389 |  1330 |      1100
 0003   |      391 |  2100 |      1360

--
John Naylor
EDB: http://www.enterprisedb.com

v2-0002-Refactor-text_position_get_match_pos-to-use-pg_mb.patch
Description: Binary data

v2-0003-Specialize-pg_mbstrlen_with_len-for-UTF-8.patch
Description: Binary data

v2-0001-Move-the-implementation-of-pg_utf_mblen-to-an-inl.patch
Description: Binary data
Re: Adding CI to our tree
> On 17 Dec 2021, at 20:31, Andres Freund wrote:
> On 2021-12-17 12:34:36 +0100, Peter Eisentraut wrote:
>> I don't like that the -s option is used. I would like to see what commands
>> are executed.
>
> I can change it - but it makes it *much* harder to spot compiler warnings.

Having used Cirrus et al. a fair bit, I strongly agree with Andres; working with huge logs in the browser is painful, whereas -s makes it usable even on mobile devices.

--
Daniel Gustafsson
https://vmware.com/
Re: make tuplestore helper function
On Mon, Dec 13, 2021 at 05:27:52PM -0500, Melanie Plageman wrote:
> Done in attached v4

Thanks.

I think these comments can be removed from the callers:
/* it's a simple type, so don't use get_call_result_type */

You removed one call to tuplestore_donestoring(), but not the others. I guess you could remove them all (optionally as a separate patch).

The rest of these are questions more than comments, and I'm not necessarily suggesting to change the patch:

This isn't new in your patch, but for me, it's clearer if the callers have a separate boolean for randomAccess, rather than having the long expression in the function call:

+	tupstore = MakeFuncResultTuplestore(fcinfo, NULL,
+			rsinfo->allowedModes & SFRM_Materialize_Random);

vs

 	randomAccess = (rsinfo->allowedModes & SFRM_Materialize_Random) != 0;
-	tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+	tupstore = MakeFuncResultTuplestore(fcinfo, NULL, randomAccess);

One could also argue for passing rsinfo->allowedModes, instead of (rsinfo->allowedModes & SFRM_Materialize_Random).

There are a couple of places where you're checking expectedDesc where it wasn't being checked before. Is that some kind of live bug?
pg_config()
text_to_table()

It'd be nice if the "expected tuple format" error didn't need to be duplicated for each SRF. I suppose the helper function could just take a boolean determining whether or not to throw an error (passing "expectedDesc==NULL"). You'd have to call the helper before CreateTupleDescCopy() - which you're already doing in a couple places for similar reasons.

I noticed that tuplestore.h isn't included in most places. I suppose it's included via execnodes.h. Your patch doesn't change that; arguably it should've always been included.

--
Justin
Re: WIP: WAL prefetch (another approach)
Greg Stark writes:
> I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1
> or maybe 4.0.1. What version is on your G4?

$ gcc -v
Using built-in specs.
Target: powerpc-apple-darwin9
Configured with: /var/tmp/gcc/gcc-5493~1/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --program-prefix= --host=powerpc-apple-darwin9 --target=powerpc-apple-darwin9
Thread model: posix
gcc version 4.0.1 (Apple Inc. build 5493)

I see that gcc 4.2.1 is also present on this machine, but I've never used it.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
I have

IBUILD:postgresql gsstark$ ls /usr/bin/*gcc*
/usr/bin/gcc
/usr/bin/gcc-4.0
/usr/bin/gcc-4.2
/usr/bin/i686-apple-darwin9-gcc-4.0.1
/usr/bin/i686-apple-darwin9-gcc-4.2.1
/usr/bin/powerpc-apple-darwin9-gcc-4.0.1
/usr/bin/powerpc-apple-darwin9-gcc-4.2.1

I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1 or maybe 4.0.1. What version is on your G4?
Re: Adding CI to our tree
Hi,

On 2021-12-17 09:36:05 -0500, Tom Lane wrote:
> Andrew Dunstan writes:
> > Maybe I have missed it, but why are we using ccache here? That seems a
> > bit pointless in an ephemeral instance.
>
> I believe Munro's cfbot tooling is able to save and re-use ccache
> across successive instantiations of a build instance. I've not
> looked at this code, but if it can do that there'd be point to it.

Yes, the ccache cache is persisted across runs (see the *_cache and upload_cache instructions). It makes a quite substantial difference.

One reason the windows runs are a lot slower than the others is just that visual studio isn't yet supported by ccache, and there don't seem to be good other tools. The ccache maintainers merged more of the msvc support last weekend! So I have quite a bit of hope of being able to use ccache there as well.

Greetings,

Andres Freund
Re: Adding CI to our tree
Hi,

On 2021-12-17 12:34:36 +0100, Peter Eisentraut wrote:
> On 13.12.21 22:12, Andres Freund wrote:
> > Attached is an updated version of the CI patches. An example of a test run
> > with the attached version of this
> > https://cirrus-ci.com/build/6501998521483264
>
> + only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*freebsd.*'
>
> I'm not in favor of this kind of thing. I don't understand how this is
> useful, other than perhaps for developing *this* patch. I don't think
> people would like having these tags in the mainline, and if it's for local
> use, then people can adjust the file locally.

Definitely not for mainline. But it's extremely useful for development. If you iteratively try to fix windows, running all the other tests can be slower - there's a concurrency limit in how many tests you can run for free...

> + CC="ccache cc" CFLAGS="-O0 -ggdb"'
>
> I don't think using -O0 is the right thing. It will miss some compiler
> warnings, and it will not thoroughly test the compiler. We should test
> using the configurations that are close to what users actually use.

Hm. I personally always end up using -O0 for the actual development tree, and it seems a lot of others do as well. Building with -O2 makes backtraces etc. just less useful.

> +- su postgres -c 'gmake -s -j3 && gmake -s -j3 -C contrib'
>
> Why doesn't this use make world (or world-bin, if you prefer).

I started working on this well before world-bin existed. And using 'world' as the target builds the docs, which is quite expensive... I happened to actually make the change to world-bin yesterday for the next version to send :)

> Why does this use -j3 if there are two CPUs configured? (Perhaps the number
> of CPUs should be put into a variable.)

I tried a few and that worked best.

> I don't like that the -s option is used. I would like to see what commands
> are executed.

I can change it - but it makes it *much* harder to spot compiler warnings.
> +cpu: 4
>
> Why does the Linux job use 4 CPUs and the FreeBSD job 2?

I'll add a comment about it. Two reasons:
1) the limits on cirrus are lower for linux than freebsd: https://cirrus-ci.org/faq/
2) There are some issues on freebsd where test performance regresses *very* substantially with higher concurrency. Thomas and I looked a bunch into it without figuring out the details.

> +- export
>
> I don't think that is portable to all shells.

Doesn't really need to be?

> +- su postgres -c 'time script test.log gmake -s -j2 ${CHECK} ${CHECKFLAGS}'
>
> +su postgres -c '\
> + ulimit -c unlimited; \
> + make -s ${CHECK} ${CHECKFLAGS} -j8 \
> + '
>
> Not clear why these are so different. Don't we need the test.log file for
> Linux?

There's a comment about the use of script:
# Use of script is to avoid make complaints about fcntl()
# https://savannah.gnu.org/bugs/?60774
That bug is specific to platforms that don't allow locking pipes, which linux does allow, but freebsd doesn't.

> Don't we need the ulimit call for FreeBSD?

I think the default core limits were different, I will check.

> Why the -j8 option even though 4 CPUs have been configured?

That might have been an accident.

> +brew install \
> + ccache \
> + coreutils \
> + icu4c \
> + krb5 \
> + llvm \
> + lz4 \
> + make \
> + openldap \
> + openssl \
> + python@3.10 \
> + tcl-tk
>
> Curious why coreutils and make are installed? The system-supplied tools
> ought to work.

make because newer versions of make have -Otarget, which makes concurrent check-world output at least kind-of readable.

> +brew cleanup -s
>
> Seems unnecessary?

It reduces the size of the cached downloads. Not much point in keeping older versions of the packages around. Populating the cache takes time.

> +PKG_CONFIG_PATH="/usr/local/opt/krb5/lib/pkgconfig:$PKG_CONFIG_PATH"
>
> AFAICT, we don't use pkg-config for the krb5 package.

I now converted this to a loop.
> +- gmake -s -j12 && gmake -s -j12 -C contrib
>
> These numbers should also be explained or configured somewhere. Possibly
> query the number of CPUs on the instance.

macOS instances have a fixed number of cores - 12. Might make sense to query it, but not sure what a good portable way there is.

> +PROVE_FLAGS: -j10
>
> Why only on Windows?

Because windows doesn't have a way to run tests in parallel in another way; prove-level concurrency is the only thing. Whereas other platforms can run tests in parallel via make. Combining both tends to not work very well in my experience.

> +# Installation on windows currently only completely works from src\tools\msvc
>
> If that is so, let's fix that.

I did report the problem - just haven't gotten around to fixing it. Note this is also how the buildfarm invokes installation... The problem is that Install.pm includes config.pl from the c
Re: In-place persistance change of a relation
On Fri, Dec 17, 2021 at 02:33:25PM +, Jakub Wartak wrote:
> > Justin wrote:
> > On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote:
> > > As the thread didn't get a lot of traction, I've registered it into the
> > > current commitfest, https://commitfest.postgresql.org/36/3461/, with you
> > > as the author and in 'Ready for review' state.
> >
> > The patch is failing:
> [..]
> > I think you ran "make check", but should run something like this:
> > make check-world -j8 >check-world.log 2>&1 && echo Success
>
> Hi Justin,
>
> I've repeated the check-world and it says to me all is ok locally (also with
> --enable-cassert --enable-debug, at least on Amazon Linux 2) and also
> installcheck on default params seems to be ok.
> I don't seem to understand why the testfarm reports errors for tests like
> "path, polygon, rowsecurity" e.g. on Linux/graviton2 and FreeBSD while not
> on MacOS(?).
> Could someone point to me where to start looking/fixing?
Since it says this, it looks a lot like a memory error, like a use-after-free - like in fsync_parent_path():

CREATE TABLE PATH_TBL (f1 path);
+ERROR: could not open file <> Pacific": No such file or directory

I see at least this one is still failing, though:

time make -C src/test/recovery check

From 676ecf794b2b0e98d8f31e4245f6f455da5e19cb Mon Sep 17 00:00:00 2001
From: Jakub Wartak
Date: Thu, 16 Dec 2021 12:03:42 +
Subject: [PATCH 1/2] In-place table persistence change with new command ALTER
 TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs heap rewrite which causes large amount of file I/O. This patch makes the command run without heap rewrite. Addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of massive number of HEAP_INSERT's, which should be smaller. Also this allows for the cleanup of files left behind in the crash of the transaction that created it.

ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED:
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  52 +++
 src/backend/access/transam/README      |   8 +
 src/backend/access/transam/xlog.c      |  17 +
 src/backend/catalog/storage.c          | 518 +++--
 src/backend/commands/tablecmds.c       | 374 --
 src/backend/nodes/copyfuncs.c          |  16 +
 src/backend/nodes/equalfuncs.c         |  15 +
 src/backend/parser/gram.y              |  20 +
 src/backend/replication/basebackup.c   |   3 +-
 src/backend/storage/buffer/bufmgr.c    |  88 +
 src/backend/storage/file/fd.c          |   4 +-
 src/backend/storage/file/reinit.c      | 318 +++
 src/backend/storage/smgr/md.c          |  92 -
 src/backend/storage/smgr/smgr.c        |  32 ++
 src/backend/storage/sync/sync.c        |  20 +-
 src/backend/tcop/utility.c             |  11 +
 src/bin/pg_rewind/parsexlog.c          |  24 ++
 src/common/relpath.c                   |  47 ++-
 src/include/catalog/storage.h          |   2 +
 src/include/catalog/storage_xlog.h     |  42 +-
 src/include/commands/tablecmds.h       |   2 +
 src/include/common/relpath.h           |   9 +-
 src/include/nodes/nodes.h              |   1 +
 src/include/nodes/parsenodes.h         |   9 +
 src/include/storage/bufmgr.h           |   2 +
 src/include/storage/fd.h               |   1 +
 src/include/storage/md.h               |   8 +-
 src/include/storage/reinit.h           |  10 +-
 src/include/storage/smgr.h             |  17 +
 29 files changed, 1583 insertions(+), 179 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 			 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+										   xlrec->rnode.spcNode,
+										   xlrec->rnode.relNode,
+										   InvalidBackendId,
+										   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->a
Re: WIP: WAL prefetch (another approach)
Greg Stark writes:
> What tools and tool versions are you using to build? Is it just GCC for PPC?
> There aren't any special build processes to make a fat binary involved?

Nope, just "configure; make" using that macOS version's regular gcc.

regards, tom lane
Re: WIP: WAL prefetch (another approach)
What tools and tool versions are you using to build? Is it just GCC for PPC? There aren't any special build processes to make a fat binary involved?

On Thu, 16 Dec 2021 at 23:11, Tom Lane wrote:
>
> Greg Stark writes:
> > But if you're interested and can explain the tests to run I can try to
> > get the tests running on this machine:
>
> I'm not sure that machine is close enough to prove much, but by all
> means give it a go if you wish. My test setup was explained in [1]:
>
> >> To recap, the test lashup is:
> >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
> >> * Standard debug build (--enable-debug --enable-cassert)
> >> * Out-of-the-box configuration, except add wal_consistency_checking = all
> >> and configure a wal-streaming standby on the same machine
> >> * Repeatedly run "make installcheck-parallel", but skip the tablespace
> >> test to avoid issues with the standby trying to use the same directory
> >> * Delay long enough after each installcheck-parallel to let the
> >> standby catch up (the run proper is ~24 min, plus 2 min for catchup)
>
> Remember also that the code in question is not in HEAD; you'd
> need to apply Munro's patches, or check out some commit from
> around 2021-04-22.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us

--
greg
[PATCH] Add reloption for views to enable RLS
Hi all!

As part of a customer project we are looking to implement a reloption for views which, when set, runs the subquery as invoked by the user rather than the view owner, as is currently the case. The rewrite rule's table references are then checked as if the user were referencing the table(s) directly. This feature is similar to so-called 'SECURITY INVOKER' views in other DBMS. Although such permission checking could be implemented using views which SELECT from a table function and further using triggers, that approach has obvious performance downsides.

Our initial thought on implementing this was to simply add another reloption for views, just like the already existing `security_barrier`. With this in place, we can then conditionally evaluate in RelationBuildRuleLock() if we need to call setRuleCheckAsUser() or not. The new reloption has been named `security`, which is an enum currently only supporting a single value: `relation_permissions`.

The code for fetching the rules and triggers in RelationBuildDesc() had to be moved after the parsing of the reloptions, since with this change RelationBuildRuleLock() now depends upon having relation->rd_options available. The current behavior of views without that new reloption set is unaltered. This is implemented as such in patch 0001.

Regression tests are included for both the new reloption of CREATE VIEW and the row level security side of this too, contained in patch 0002. All regression tests are passing without errors.

Finally, patch 0003 updates the documentation for this new reloption.
A simplified example on how this feature can be used could look like this:

CREATE TABLE people (id int, name text, company text);
ALTER TABLE people ENABLE ROW LEVEL SECURITY;
INSERT INTO people VALUES (1, 'alice', 'foo'), (2, 'bob', 'bar');
CREATE VIEW customers_no_security AS SELECT * FROM people;
CREATE VIEW customers WITH (security=relation_permissions) AS SELECT * FROM people;

-- We want carol to only see people from company 'foo'
CREATE ROLE carol;
CREATE POLICY company_foo_only ON people FOR ALL TO carol USING (company = 'foo');
GRANT SELECT ON people TO carol;
GRANT SELECT ON customers_no_security TO carol;
GRANT SELECT ON customers TO carol;

Now using these tables as carol:

postgres=# SET ROLE carol;
SET

For the `people` table, the policy is applied as expected:

postgres=> SELECT * FROM people;
 id | name  | company
----+-------+---------
  1 | alice | foo
(1 row)

If we now use the view with the new relopt set, the policy is applied too:

postgres=> SELECT * FROM customers;
 id | name  | company
----+-------+---------
  1 | alice | foo
(1 row)

But without the `security=relation_permissions` relopt, carol gets to see data they should not be able to, due to the policy not being applied, since the rules are checked against the view owner:

postgres=> SELECT * FROM customers_no_security;
 id | name  | company
----+-------+---------
  1 | alice | foo
  2 | bob   | bar
(2 rows)

Excluding regression tests and documentation, the changes boil down to this:

 src/backend/access/common/reloptions.c    | 20
 src/backend/nodes/copyfuncs.c             |  1
 src/backend/nodes/equalfuncs.c            |  1
 src/backend/nodes/outfuncs.c              |  1
 src/backend/nodes/readfuncs.c             |  1
 src/backend/optimizer/plan/subselect.c    |  1
 src/backend/optimizer/prep/prepjointree.c |  1
 src/backend/rewrite/rewriteHandler.c      |  1
 src/backend/utils/cache/relcache.c        | 62
 src/include/nodes/parsenodes.h            |  3
 src/include/utils/rel.h                   | 21
 11 files changed, 84 insertions(+), 29 deletions(-)

All patches are against current master.
Thanks,
Christoph Heiss

From 2ed6b63adcebfff14965b8c9913ae0fafbe904a2 Mon Sep 17 00:00:00 2001
From: Christoph Heiss
Date: Fri, 17 Dec 2021 17:17:54 +0100
Subject: [PATCH 3/3] Add documentation for new 'security' reloption on views

---
 doc/src/sgml/ddl.sgml             |  4
 doc/src/sgml/ref/alter_view.sgml  |  9 +
 doc/src/sgml/ref/create_view.sgml | 18 ++
 3 files changed, 31 insertions(+)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 64d9030652..760ea2f794 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2292,6 +2292,10 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO miriam_rw;
    are not subject to row security.

+   For views, the policies are applied as being referenced through the
+   view owner by default, rather than the user referencing the view. To
+   apply row security policies as defined for the invoking user, the
+   security option can be set on views (see CREATE VIEW) to get the same
+   behavior.

    Row security policies can be specific to commands, or to roles, or
    to both. A policy can be specified to apply to ALL

diff --git a/doc/src/sgml/ref/alter_vie
Re: pg_upgrade should truncate/remove its logs before running
On Thu, Dec 16, 2021 at 12:23:08PM +0100, Daniel Gustafsson wrote:
> > On 16 Dec 2021, at 12:11, Peter Eisentraut wrote:
> >
> > Could we make it write just one log file? Is having multiple log files
> > better?
>
> Having individual .txt files from checks with additional information
> on how to handle the error are quite convenient when writing wrappers around
> pg_upgrade (speaking from experience of having written multiple pg_upgrade
> frontends). Parsing a single logfile is more work, and will break existing
> scripts.
>
> I'm in favor of a predictable by default logpath, with a parameter to
> override, as mentioned upthread.

I put this together in the simplest way, prefixing all the filenames with the configured path. Another option is to chdir() into the given path. But pg_upgrade takes (and requires) a bunch of other paths, like -d -D -b -B, and those are traditionally interpreted relative to the CWD. I could getcwd() and prefix all the -[dDbB] with that, but prefixing a handful of binary/data paths is hardly better than prefixing a handful of dump/logfile paths. I suppose that openat() isn't portable. I don't think it's worth prohibiting relative paths, so I can't think of any less-naive way to do this.

I didn't move the delete-old-cluster.sh, since that's intended to stay around even after a successful upgrade, as opposed to the other logs, which are typically removed at that point.

--
Justin

From 7b608a415645682a9853bb4fbfde54caa6ccc137 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Sat, 11 Dec 2021 19:19:50 -0600
Subject: [PATCH] pg_upgrade: write logfiles and dumps in subdir..

The subdir must not exist when pg_upgrade is started. This avoids appending the logfiles which have pre-existing errors in them. And allows cleaning up after pg_upgrade more easily.
---
 src/bin/pg_upgrade/check.c      | 10 ++---
 src/bin/pg_upgrade/dump.c       |  8 ++--
 src/bin/pg_upgrade/exec.c       |  5 ++-
 src/bin/pg_upgrade/function.c   |  3 +-
 src/bin/pg_upgrade/option.c     | 21 +--
 src/bin/pg_upgrade/pg_upgrade.c | 65 +
 src/bin/pg_upgrade/pg_upgrade.h |  1 +
 src/bin/pg_upgrade/server.c     |  4 +-
 8 files changed, 77 insertions(+), 40 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1b8ef79242..9bc4e106dd 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -552,7 +552,7 @@ create_script_for_old_cluster_deletion(char **deletion_script_file_name)
 	prep_status("Creating script to delete old cluster");

-	if ((script = fopen_priv(*deletion_script_file_name, "w")) == NULL)
+	if ((script = fopen_priv(*deletion_script_file_name, "w")) == NULL)	// OK ?
 		pg_fatal("could not open file \"%s\": %s\n",
 				 *deletion_script_file_name, strerror(errno));
@@ -784,7 +784,7 @@ check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster)
 		return;
 	}

-	snprintf(output_path, sizeof(output_path),
+	snprintf(output_path, sizeof(output_path), "%s/%s", log_opts.basedir,
 			 "contrib_isn_and_int8_pass_by_value.txt");

 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
@@ -861,7 +861,7 @@ check_for_user_defined_postfix_ops(ClusterInfo *cluster)
 	prep_status("Checking for user-defined postfix operators");

-	snprintf(output_path, sizeof(output_path),
+	snprintf(output_path, sizeof(output_path), "%s/%s", log_opts.basedir,
 			 "postfix_ops.txt");

 	/* Find any user defined postfix operators */
@@ -960,7 +960,7 @@ check_for_tables_with_oids(ClusterInfo *cluster)
 	prep_status("Checking for tables WITH OIDS");

-	snprintf(output_path, sizeof(output_path),
+	snprintf(output_path, sizeof(output_path), "%s/%s", log_opts.basedir,
 			 "tables_with_oids.txt");

 	/* Find any tables declared WITH OIDS */
@@ -1215,7 +1215,7 @@ check_for_user_defined_encoding_conversions(ClusterInfo *cluster)
 	prep_status("Checking for user-defined encoding conversions");

-
snprintf(output_path, sizeof(output_path), + snprintf(output_path, sizeof(output_path), "%s/%s", log_opts.basedir, "encoding_conversions.txt"); /* Find any user defined encoding conversions */ diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c index 90060d0f8e..fe979e0df4 100644 --- a/src/bin/pg_upgrade/dump.c +++ b/src/bin/pg_upgrade/dump.c @@ -22,10 +22,10 @@ generate_old_dump(void) /* run new pg_dumpall binary for globals */ exec_prog(UTILITY_LOG_FILE, NULL, true, true, "\"%s/pg_dumpall\" %s --globals-only --quote-all-identifiers " - "--binary-upgrade %s -f %s", + "--binary-upgrade %s -f %s/%s", new_cluster.bindir, cluster_conn_opts(&old_cluster), log_opts.verbose ? "--verbose" : "", - GLOBALS_DUMP_FILE); + log_opts.basedir, GLOBALS_DUMP_FILE); check_ok(); prep_status("Creating dump of database schemas\n"); @@ -52,10 +52,10 @@ generate_old_dump(void) parallel_exec_prog(log_file_name, NULL, "\"%s/pg_dump\" %s --schema-only --quote
Re: Column Filtering in Logical Replication
On 2021-Dec-17, Rahila Syed wrote:

> > This means that we need to support changing the column list of a
> > table in a publication. I'm looking at implementing some form of
> > ALTER PUBLICATION for that.
>
> I think right now the patch contains support only for ALTER
> PUBLICATION .. ADD TABLE with column filters. In order to achieve
> changing the column lists of a published table, I think we can extend
> the ALTER TABLE .. SET TABLE syntax to support specification of column
> list.

Yeah, that's what I was thinking too.

> > So this whole thing can be reduced to just this:
> >
> >     if (att_map != NULL && !bms_is_member(att->attnum, att_map))
> >         continue;       /* that is, don't send this attribute */
>
> I agree the condition can be shortened now. The long if condition was
> included because initially the feature allowed specifying filters
> without replica identity columns (sent those columns internally without
> user having to specify).

Ah, true, I had forgotten that. Thanks.

> > 900 +    * the table is partitioned. Run a recursive query to iterate through all
> > 901 +    * the parents of the partition and retreive the record for the parent
> > 902 +    * that exists in pg_publication_rel.
> > 903 +    */
>
> The above comment in fetch_remote_table_info() can be changed as the
> recursive query is no longer used.

Oh, of course.

I'll finish some loose ends and submit a v10, but it's still not final.

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"Right now the sectors on the hard disk run clockwise, but I heard a
rumor that you can squeeze 0.2% more throughput by running them
counterclockwise. It's worth the effort. Recommended." (Gerry Pourwelle)
\d with triggers: more than one row returned by a subquery used as an expression
I want to mention that the 2nd problem I mentioned here is still broken.
https://www.postgresql.org/message-id/20210717010259.gu20...@telsasoft.com
It happens if non-inherited triggers on child and parent have the same name.

On Fri, Jul 16, 2021 at 08:02:59PM -0500, Justin Pryzby wrote:
> On Fri, Jul 16, 2021 at 06:01:12PM -0400, Alvaro Herrera wrote:
> > On 2021-Jul-16, Justin Pryzby wrote:
> > > CREATE TABLE p(i int) PARTITION BY RANGE(i);
> > > CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1)TO(2);
> > > CREATE FUNCTION foo() returns trigger LANGUAGE plpgsql AS $$begin end$$;
> > > CREATE TRIGGER x AFTER DELETE ON p1 EXECUTE FUNCTION foo();
> > > CREATE TRIGGER x AFTER DELETE ON p EXECUTE FUNCTION foo();
> >
> > Hmm, interesting -- those statement triggers are not cloned, so what is
> > going on here is just that the psql query to show them is tripping on
> > its shoelaces ... I'll try to find a fix.
> >
> > I *think* the problem is that the query matches triggers by name and
> > parent/child relationship; we're missing to ignore triggers by tgtype.
> > It's not great design that tgtype is a bitmask of unrelated flags ...
>
> I see it's the subquery Amit wrote and proposed here:
> https://www.postgresql.org/message-id/CA+HiwqEiMe0tCOoPOwjQrdH5fxnZccMR7oeW=f9fmgszjqb...@mail.gmail.com
>
> .. and I realize that I've accidentally succeeded in breaking what I first
> attempted to break 15 months ago:
>
> On Mon, Apr 20, 2020 at 02:57:40PM -0500, Justin Pryzby wrote:
> > I'm happy to see that this doesn't require a recursive cte, at least.
> > I was trying to think how to break it by returning multiple results or
> > results out of order, but I think that can't happen.
>
> If you assume that pg_partition_ancestors returns its results in order, I
> think you can fix it by adding LIMIT 1. Otherwise I think you need a
> recursive CTE, as I'd feared.
>
> Note also that I'd sent a patch to add newlines, to make psql -E look pretty.
> v6-0001-fixups-c33869cc3bfc42bce822251f2fa1a2a346f86cc5.patch
Re: Addition of --no-sync to pg_upgrade for test speedup
On Fri, Dec 17, 2021 at 10:21:04AM +0100, Peter Eisentraut wrote:
> On 16.12.21 07:50, Michael Paquier wrote:
> > As per $subject, avoiding the flush of the new cluster's data
> > directory shortens a bit the runtime of the test. In some of my slow
> > VMs, aka Windows, this shaves a couple of seconds even if the bulk of
> > the time is still spent on the main regression test suite.
> >
> > In pg_upgrade, we let the flush happen with initdb --sync-only, based
> > on the binary path of the new cluster, so I think that we are not
> > going to miss any test coverage by skipping that.
>
> I think that is reasonable.
>
> Maybe we could have some global option, like some environment variable,
> that enables the "sync" mode in all tests, so it's easy to test that once
> in a while. Not really a requirement for your patch, but an idea in case
> this is a concern.

Yes, I think it would be good to see all the places we might want to pass
the no-sync option.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
Re: Adding CI to our tree
Andrew Dunstan writes:
> Maybe I have missed it, but why are we using ccache here? That seems a
> bit pointless in an ephemeral instance.

I believe Munro's cfbot tooling is able to save and re-use ccache across
successive instantiations of a build instance. I've not looked at this
code, but if it can do that there'd be point to it.

regards, tom lane
RE: In-place persistence change of a relation
> Justin wrote:
> On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote:
> > As the thread didn't get a lot of traction, I've registered it into
> > current commitfest https://commitfest.postgresql.org/36/3461/ with You
> > as the author and in 'Ready for review' state.
>
> The patch is failing: [..]
> I think you ran "make check", but should run something like this:
> make check-world -j8 >check-world.log 2>&1 && echo Success

Hi Justin, I've repeated the check-world and it says to me all is ok
locally (also with --enable-cassert --enable-debug, at least on Amazon
Linux 2), and also installcheck on default params seems to be ok.

I don't seem to understand why the testfarm reports errors for tests like
"path, polygon, rowsecurity" e.g. on Linux/graviton2 and FreeBSD while not
on MacOS(?). Could someone point me to where to start looking/fixing?

-J.
Re: Adding CI to our tree
On 12/13/21 16:12, Andres Freund wrote:
> Hi,
>
> Attached is an updated version of the CI patches. An example of a test
> run with the attached version of this
> https://cirrus-ci.com/build/6501998521483264

Maybe I have missed it, but why are we using ccache here? That seems a
bit pointless in an ephemeral instance.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 14:39:10 CET Tomas Vondra a écrit :
> I wasn't really suggesting to investigate those other allocators in this
> patch - it seems like a task requiring a pretty significant amount of
> work/time. My point was that we should make it reasonably easy to add
> tweaks for those other environments, if someone is interested enough to
> do the legwork.
>
> >> 2) In fact, I wonder if different glibc versions behave differently?
> >> Hopefully it's not changing that much, though. Ditto kernel versions,
> >> but the mmap/sbrk interface is likely more stable. We can test this.
> >
> > That could be tested, yes. As a matter of fact, a commit removing the
> > upper limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday
> > to glibc, which means we can service much bigger allocations without
> > mmap.
>
> Yeah, I noticed that commit too. Most systems stick to one glibc
> version, so it'll take time to reach most systems. Let's continue with
> just one glibc version and then maybe test other versions.

Yes, I also need to figure out how to detect we're using glibc, as I'm not
very familiar with configure.

> >> 3) If we bump the thresholds, won't that work against reusing the
> >> memory? I mean, if we free a whole block (from any allocator we have),
> >> glibc might return it to kernel, depending on mmap threshold value.
> >> It's not guaranteed, but increasing the malloc thresholds will make
> >> that even less likely. So we might just as well increase the minimum
> >> block size, with about the same effect, no?
> >
> > It is my understanding that malloc will try to compact memory by moving
> > it around. So the memory should actually be released to the kernel at
> > some point. In the meantime, malloc can reuse it for our next
> > invocation (which can be in a different memory context on our side).
> >
> > If we increase the minimum block size, this is memory we will actually
> > reserve, and it will not protect us against the ramping-up behaviour:
> > - the first allocation of a big block may be over mmap_threshold, and
> >   serviced by an expensive mmap
> > - when it's freed, the threshold is doubled
> > - the next invocation is serviced by an sbrk call
> > - freeing it will be above the trim threshold, and it will be returned
> >
> > After several "big" allocations, the thresholds will raise to their
> > maximum values (well, it used to; I need to check what happens with
> > that latest patch of glibc...)
> >
> > This will typically happen several times as malloc doubles the
> > threshold each time. This is probably the reason quadrupling the block
> > sizes was more effective.
>
> Hmmm, OK. Can we benchmark the case with large initial block size, at
> least for comparison?

The benchmark I called "fixed" was with a fixed block size of
ALLOCSET_DEFAULT_MAXSIZE (first proposed patch) and showed roughly the
same performance profile as the growing blocks + malloc tuning.

But if I understand correctly, you implemented the growing blocks logic
after concerns about wasting memory with a constant large block size. If
we tune malloc, that memory would not be wasted if we don't alloc it, just
not released as eagerly when it's allocated.

Or do you want a benchmark with an even bigger initial block size? With
the growing blocks patch with a large initial size? I can run either, I
just want to understand what is interesting to you.

One thing that would be interesting would be to trace the total amount of
memory allocated in the different cases. This is something I will need to
do anyway when I propose that patch.

Best regards,

-- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
On 12/17/21 09:08, Ronan Dunklau wrote:
> Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit :
>> On 12/16/21 17:03, Ronan Dunklau wrote:
>>> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
>>>> I will follow up with a benchmark of the test sorting a table with a
>>>> width varying from 1 to 32 columns.
>>>
>>> So please find attached another benchmark for that case.
>>>
>>> The 3 different patchsets tested are:
>>> - master
>>> - fixed (David's original patch)
>>> - adjust (Thomas growing blocks patch)
>>
>> Presumably Thomas is me, right?
>
> I'm really sorry for this typo... Please accept my apologies.

Nah, no apology needed. I was just wondering if I missed some patch from
Thomas Munro ...

>>> So it looks like tuning malloc for this would be very beneficial for
>>> any kind of allocation, and by doing so we reduce the problems seen
>>> with the growing blocks patch to next to nothing, while keeping the
>>> ability to not allocate too much memory from the get go.
>>
>> Thanks for running those tests and investigating the glibc behavior! I
>> find those results very interesting. My conclusion from this is that
>> the interaction between "our" allocator and the allocator in malloc
>> (e.g. glibc) can be problematic. Which makes benchmarking and
>> optimization somewhat tricky because code changes may trigger behavior
>> change in glibc (or whatever allocator backs malloc).
>>
>> I think it's worth exploring if we can tune this in a reasonable way,
>> but I have a couple concerns related to that:
>>
>> 1) I wonder how glibc-specific this is - I'd bet it applies to other
>> allocators (either on another OS or just a different allocator on
>> Linux) too. Tweaking glibc parameters won't affect those systems, of
>> course, but maybe we should allow tweaking those systems too ...
>
> I agree, finding their specific problems and seeing if we can work
> around them would be interesting. I suppose glibc's malloc is the most
> commonly used allocator in production, as it is the default for most
> Linux distributions.

I wasn't really suggesting to investigate those other allocators in this
patch - it seems like a task requiring a pretty significant amount of
work/time. My point was that we should make it reasonably easy to add
tweaks for those other environments, if someone is interested enough to
do the legwork.

>> 2) In fact, I wonder if different glibc versions behave differently?
>> Hopefully it's not changing that much, though. Ditto kernel versions,
>> but the mmap/sbrk interface is likely more stable. We can test this.
>
> That could be tested, yes. As a matter of fact, a commit removing the
> upper limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday
> to glibc, which means we can service much bigger allocations without
> mmap.

Yeah, I noticed that commit too. Most systems stick to one glibc version,
so it'll take time to reach most systems. Let's continue with just one
glibc version and then maybe test other versions.

>> 3) If we bump the thresholds, won't that work against reusing the
>> memory? I mean, if we free a whole block (from any allocator we have),
>> glibc might return it to kernel, depending on mmap threshold value.
>> It's not guaranteed, but increasing the malloc thresholds will make
>> that even less likely. So we might just as well increase the minimum
>> block size, with about the same effect, no?
>
> It is my understanding that malloc will try to compact memory by moving
> it around. So the memory should actually be released to the kernel at
> some point. In the meantime, malloc can reuse it for our next invocation
> (which can be in a different memory context on our side).
>
> If we increase the minimum block size, this is memory we will actually
> reserve, and it will not protect us against the ramping-up behaviour:
> - the first allocation of a big block may be over mmap_threshold, and
>   serviced by an expensive mmap
> - when it's freed, the threshold is doubled
> - the next invocation is serviced by an sbrk call
> - freeing it will be above the trim threshold, and it will be returned
>
> After several "big" allocations, the thresholds will raise to their
> maximum values (well, it used to; I need to check what happens with that
> latest patch of glibc...)
>
> This will typically happen several times as malloc doubles the threshold
> each time. This is probably the reason quadrupling the block sizes was
> more effective.

Hmmm, OK. Can we benchmark the case with large initial block size, at
least for comparison?

>>> I would like to try to implement some dynamic glibc malloc tuning, if
>>> that is something we don't reject on principle from the get go.
>>
>> +1 to that
>
> Ok, I'll work on a patch for this and submit a new thread.

Cool!

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise
PostgreSQL Company
Re: speed up verifying UTF-8
I plan to push v25 early next week, unless there are further comments. -- John Naylor EDB: http://www.enterprisedb.com
Re: In-place persistence change of a relation
On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote: > I'm especially worried if I didn't screw up something/forgot something > related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests > passed". > As the thread didn't get a lot of traction, I've registered it into current > commitfest https://commitfest.postgresql.org/36/3461/ with You as the author > and in 'Ready for review' state. > I think it behaves as almost finished one and apparently after reading all > those discussions that go back over 10years+ time span about this feature, > and lot of failed effort towards wal_level=noWAL I think it would be nice to > finally start getting some of that of it into the core. The patch is failing: http://cfbot.cputube.org/kyotaro-horiguchi.html https://api.cirrus-ci.com/v1/artifact/task/5564333871595520/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs I think you ran "make check", but should run something like this: make check-world -j8 >check-world.log 2>&1 && echo Success -- Justin
Re: Column Filtering in Logical Replication
Hi,

Thank you for updating the patch. The regression tests and tap tests pass
with the v9 patch.

> After working on this a little bit more, I realized that this is a bad
> idea overall. It causes lots of complications and it's just not worth
> it. So I'm back at my original thought that we need to throw an ERROR
> at ALTER TABLE .. DROP COLUMN time if the column is part of a
> replication column filter, and suggest the user to remove the column
> from the filter first and reattempt the DROP COLUMN.
>
> This means that we need to support changing the column list of a table
> in a publication. I'm looking at implementing some form of ALTER
> PUBLICATION for that.

I think right now the patch contains support only for ALTER PUBLICATION ..
ADD TABLE with column filters. In order to achieve changing the column
lists of a published table, I think we can extend the ALTER TABLE .. SET
TABLE syntax to support specification of a column list.

> So this whole thing can be reduced to just this:
>
>     if (att_map != NULL && !bms_is_member(att->attnum, att_map))
>         continue;       /* that is, don't send this attribute */

I agree the condition can be shortened now. The long if condition was
included because initially the feature allowed specifying filters without
replica identity columns (sent those columns internally without the user
having to specify them).

> 900 +    * the table is partitioned. Run a recursive query to iterate through all
> 901 +    * the parents of the partition and retreive the record for the parent
> 902 +    * that exists in pg_publication_rel.
> 903 +    */

The above comment in fetch_remote_table_info() can be changed as the
recursive query is no longer used.

Thank you,
Rahila Syed
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 7:20 PM Ajin Cherian wrote:
>
> On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow wrote:
>
> > So using the v47 patch-set, I still find that the UPDATE above results
> > in publication of an INSERT of (2,1), rather than an UPDATE of (1,1)
> > to (2,1). This is according to the 2nd UPDATE rule below, from patch
> > 0003.
> >
> > + * old-row (no match)  new-row (no match)  -> (drop change)
> > + * old-row (no match)  new-row (match)     -> INSERT
> > + * old-row (match)     new-row (no match)  -> DELETE
> > + * old-row (match)     new-row (match)     -> UPDATE
> >
> > This is because the old row (1,1) doesn't match the UPDATE filter
> > "(a>1)", but the new row (2,1) does.
> > This functionality doesn't seem right to me. I don't think it can be
> > assumed that (1,1) was never published (and thus requires an INSERT
> > rather than UPDATE) based on these checks, because in this example,
> > (1,1) was previously published via a different operation - INSERT (and
> > using a different filter too).
> > I think the fundamental problem here is that these UPDATE rules assume
> > that the old (current) row was previously UPDATEd (and published, or
> > not published, according to the filter applicable to UPDATE), but this
> > is not necessarily the case.
> > Or am I missing something?
>
> But it need not be correct in assuming that the old-row was part of a
> previous INSERT either (and published, or not published according to
> the filter applicable to an INSERT).
> For example, change the sequence of inserts and updates prior to the
> last update:
>
> truncate tbl1;
> insert into tbl1 values (1,5);  ==> not replicated since insert and ! (b < 2)
> update tbl1 set b = 1;          ==> not replicated since update and ! (a > 1)
> update tbl1 set a = 2;          ==> replicated and update converted to insert since (a > 1)
>
> In this case, the last update "update tbl1 set a = 2;" is updating a
> row that was previously updated and not inserted and not replicated to
> the subscriber.
> How does the replication logic differentiate between these two cases,
> and decide if the update was previously published or not?
> I think it's futile for the publisher side to try and figure out the
> history of published rows. In fact, if this level of logic is required
> then it is best implemented on the subscriber side, which then defeats
> the purpose of a publication filter.

I think it's a concern that, for such a basic example with only one row,
we get unpredictable (and even wrong) replication results depending upon
the order of operations.
Doesn't this problem result from allowing different WHERE clauses for
different pubactions for the same table?
My current thoughts are that this shouldn't be allowed, and also that
WHERE clauses for INSERTs should, like UPDATE and DELETE, be restricted to
using only columns covered by the replica identity or primary key.

Regards,
Greg Nancarrow
Fujitsu Australia
Re: Adding CI to our tree
On 13.12.21 22:12, Andres Freund wrote: Attached is an updated version of the CI patches. An example of a test run with the attached version of this https://cirrus-ci.com/build/6501998521483264 + only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*freebsd.*' I'm not in favor of this kind of thing. I don't understand how this is useful, other than perhaps for developing *this* patch. I don't think people would like having these tags in the mainline, and if it's for local use, then people can adjust the file locally. + CC="ccache cc" CFLAGS="-O0 -ggdb"' I don't think using -O0 is the right thing. It will miss some compiler warnings, and it will not thoroughly test the compiler. We should test using the configurations that are close to what users actually use. +- su postgres -c 'gmake -s -j3 && gmake -s -j3 -C contrib' Why doesn't this use make world (or world-bin, if you prefer). Why does this use -j3 if there are two CPUs configured? (Perhaps the number of CPUs should be put into a variable.) I don't like that the -s option is used. I would like to see what commands are executed. +cpu: 4 Why does the Linux job use 4 CPUs and the FreeBSD job 2? +- export I don't think that is portable to all shells. +- su postgres -c 'time script test.log gmake -s -j2 ${CHECK} ${CHECKFLAGS}' +su postgres -c '\ + ulimit -c unlimited; \ + make -s ${CHECK} ${CHECKFLAGS} -j8 \ + ' Not clear why these are so different. Don't we need the test.log file for Linux? Don't we need the ulimit call for FreeBSD? Why the -j8 option even though 4 CPUs have been configured? +brew install \ + ccache \ + coreutils \ + icu4c \ + krb5 \ + llvm \ + lz4 \ + make \ + openldap \ + openssl \ + python@3.10 \ + tcl-tk Curious why coreutils and make are installed? The system-supplied tools ought to work. +brew cleanup -s Seems unnecessary? 
+PKG_CONFIG_PATH="/usr/local/opt/krb5/lib/pkgconfig:$PKG_CONFIG_PATH" AFAICT, we don't use pkg-config for the krb5 package. +- gmake -s -j12 && gmake -s -j12 -C contrib These numbers should also be explained or configured somewhere. Possibly query the number of CPUs on the instance. +PROVE_FLAGS: -j10 Why only on Windows? +# Installation on windows currently only completely works from src\tools\msvc If that is so, let's fix that. But see that install.pl contains if (-e "src/tools/msvc/buildenv.pl") etc. it seems to want to be able to be invoked from the top level. +- cd src\tools\msvc && perl .\install.pl %CIRRUS_WORKING_DIR%\tmp_install Confusing mix of forward and backward slashes in the Windows section. I think forward slashes should work everywhere. + test_plcheck_script: +- perl src/tools/msvc/vcregress.pl plcheck etc. Couldn't we enhance vcregress.pl to take multiple arguments or take a "check-world" argument or something. Otherwise, this will be tedious to keep updated. + test_subscriptioncheck_script: +- perl src/tools/msvc/vcregress.pl taptest .\src\test\subscription\ This is even worse. I don't want to have to hand-register every new TAP test. + always: +gcc_warning_script: | + time ./configure \ +--cache gcc.cache CC="ccache gcc" \ +--enable-dtrace I don't know why we wouldn't need the full set of options here. It's not like optional code never has compiler warnings. + # cross-compile to windows + always: +mingw_cross_warning_script: | I would welcome a native mingw build with full options and test suite run. This cross-compiling stuff is of course interesting, but I'm not sure why it is being preferred over a full native run. --- /dev/null +++ b/.dockerignore @@ -0,0 +1,7 @@ +# Ignore everything, except ci/ I wonder whether this would interfere with other uses of docker. I suppose people might have their custom setups for building docker images from PostgreSQL sources. 
It seems weird that this file gets this prominence, saying that the canonical use of docker inside PostgreSQL sources is for Cirrus CI. It would be useful if the README explained the use of docker. As I mentioned before, it's not immediately clear why docker is used at all in this. The docker file for Windows contains a lot of hardcoded version numbers. This should at least be refactored a little bit so that it is clear which version numbers should be updated and how. Or better yet, avoid the need to constantly update version numbers. For example, the Python patch release changes every few weeks (e.g., 3.10.0 isn't current anymore). Also, the way OpenSSL is installed looks a bit fishy. Is this what people actually use in practice? How can we make it match actual practice better? +# So that tests using the "manually" started postgres on windows can use +# prepared statements +max_prepared_transactions
Re: Skipping logical replication transactions on subscriber side
On Fri, Dec 17, 2021 at 7:12 PM Amit Kapila wrote: > > On Fri, Dec 17, 2021 at 3:23 PM Peter Eisentraut > wrote: > > > > On 13.12.21 04:12, Greg Nancarrow wrote: > > > (ii) "Setting -1 means to reset the transaction ID" > > > > > > Shouldn't it be explained what resetting actually does and when it can > > > be, or is needed to be, done? Isn't it automatically reset? > > > I notice that negative values (other than -1) seem to be regarded as > > > valid - is that right? > > > Also, what happens if this option is set multiple times? Does it just > > > override and use the latest setting? (other option handling errors out > > > with errorConflictingDefElem()). > > > e.g. alter subscription sub skip (xid = 721, xid = 722); > > > > Let's not use magic numbers and instead use a syntax that is more > > explicit, like SKIP (xid = NONE) or RESET SKIP or something like that. > > > > +1 for using SKIP (xid = NONE) because otherwise first we need to > introduce RESET syntax for this command. Agreed. Thank you for the comment! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 1:50 PM Ajin Cherian wrote:
>
> On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow wrote:
>
> > So using the v47 patch-set, I still find that the UPDATE above results
> > in publication of an INSERT of (2,1), rather than an UPDATE of (1,1)
> > to (2,1). This is according to the 2nd UPDATE rule below, from patch
> > 0003.
> >
> > + * old-row (no match)  new-row (no match)  -> (drop change)
> > + * old-row (no match)  new-row (match)     -> INSERT
> > + * old-row (match)     new-row (no match)  -> DELETE
> > + * old-row (match)     new-row (match)     -> UPDATE
> >
> > This is because the old row (1,1) doesn't match the UPDATE filter
> > "(a>1)", but the new row (2,1) does.
> > This functionality doesn't seem right to me. I don't think it can be
> > assumed that (1,1) was never published (and thus requires an INSERT
> > rather than UPDATE) based on these checks, because in this example,
> > (1,1) was previously published via a different operation - INSERT (and
> > using a different filter too).
> > I think the fundamental problem here is that these UPDATE rules assume
> > that the old (current) row was previously UPDATEd (and published, or
> > not published, according to the filter applicable to UPDATE), but this
> > is not necessarily the case.
> > Or am I missing something?
>
> But it need not be correct in assuming that the old-row was part of a
> previous INSERT either (and published, or not published according to
> the filter applicable to an INSERT).
> For example, change the sequence of inserts and updates prior to the
> last update:
>
> truncate tbl1;
> insert into tbl1 values (1,5);  ==> not replicated since insert and ! (b < 2)
> update tbl1 set b = 1;          ==> not replicated since update and ! (a > 1)
> update tbl1 set a = 2;          ==> replicated and update converted to insert since (a > 1)
>
> In this case, the last update "update tbl1 set a = 2;" is updating a
> row that was previously updated and not inserted and not replicated to
> the subscriber.
> How does the replication logic differentiate between these two cases,
> and decide if the update was previously published or not?
> I think it's futile for the publisher side to try and figure out the
> history of published rows.

I also think so. One more thing, even if we want we might not be able to
apply the insert filter as the corresponding values may not be logged.

-- With Regards, Amit Kapila.
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 09:08:06 CET Ronan Dunklau a écrit :
> It is my understanding that malloc will try to compact memory by moving
> it around. So the memory should actually be released to the kernel at
> some point. In the meantime, malloc can reuse it for our next invocation
> (which can be in a different memory context on our side).

I've been told off-list this comment wasn't clear: I meant that it
compacts *free* memory, consolidating it into larger blocks that will
eventually reach the threshold and be released.

-- Ronan Dunklau
Re: Skipping logical replication transactions on subscriber side
On Fri, Dec 17, 2021 at 3:23 PM Peter Eisentraut wrote: > > On 13.12.21 04:12, Greg Nancarrow wrote: > > (ii) "Setting -1 means to reset the transaction ID" > > > > Shouldn't it be explained what resetting actually does and when it can > > be, or is needed to be, done? Isn't it automatically reset? > > I notice that negative values (other than -1) seem to be regarded as > > valid - is that right? > > Also, what happens if this option is set multiple times? Does it just > > override and use the latest setting? (other option handling errors out > > with errorConflictingDefElem()). > > e.g. alter subscription sub skip (xid = 721, xid = 722); > > Let's not use magic numbers and instead use a syntax that is more > explicit, like SKIP (xid = NONE) or RESET SKIP or something like that. > +1 for using SKIP (xid = NONE) because otherwise first we need to introduce RESET syntax for this command. -- With Regards, Amit Kapila.
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 4:11 AM Peter Smith wrote:
>
> PSA the v47* patch set.
>

Few comments on v47-0002:
===
1. The handling to find the row filter for ancestors in RelationGetInvalidRowFilterCol seems complex. It seems you are accumulating non-partition relations as well in toprelid_in_pub. Can we simplify such that we find the ancestor only for 'pubviaroot' publications?

2. I think the name RelationGetInvalidRowFilterCol is confusing because the same function is also used to get publication actions. Can we name it GetRelationPublicationInfo() and pass a bool parameter to indicate whether the row filter info needs to be built? We can get the invalid row filter column as output from that function.

3.
+GetRelationPublicationActions(Relation relation)
{
..
+ if (!relation->rd_pubactions)
+ (void) RelationGetInvalidRowFilterCol(relation);
+
+ return memcpy(pubactions, relation->rd_pubactions,
+ sizeof(PublicationActions));
..
..
}

I think here we can reverse the check: if the actions are already set, just do the memcpy and return; otherwise, build the relation's publication actions info first.

4. invalid_rowfilter_column_walker
{
..
/*
 * If pubviaroot is true, we need to convert the column number of
 * parent to the column number of child relation first.
 */
if (context->pubviaroot)
{
    char *colname = get_attname(context->parentid, attnum, false);
    attnum = get_attnum(context->relid, colname);
}

Here, in the comments, you can tell why you need this conversion. Can we name this function rowfilter_column_walker()?

5.
+/* For invalid_rowfilter_column_walker.
*/
+typedef struct {
+ AttrNumber invalid_rfcolnum;  /* invalid column number */
+ Bitmapset *bms_replident;     /* bitset of replica identity col indexes */
+ bool pubviaroot;              /* true if we are validating the parent
+                                * relation's row filter */
+ Oid relid;                    /* relid of the relation */
+ Oid parentid;                 /* relid of the parent relation */
+} rf_context;

Normally, we declare structs at the beginning of the file, and for the formatting of struct declarations, see other nearby structs like RelIdCacheEnt.

6. Can we name IsRowFilterSimpleNode() as IsRowFilterSimpleExpr()?

--
With Regards,
Amit Kapila.
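As a side note for reviewers, the validation being discussed in points 1-5 — remapping a parent's attribute numbers to the child by column name when 'pubviaroot' is set, then flagging any row-filter column that is not part of the replica identity — can be sketched roughly as follows. This is hypothetical Python with invented names, a restatement of the idea rather than the actual C implementation:

```python
def invalid_rowfilter_column(filter_columns, replident_columns,
                             parent_attnames=None, child_attnames=None):
    """Return the first row-filter column not covered by the replica
    identity, or None if the filter is valid.

    filter_columns: attribute numbers referenced by the row filter.
    replident_columns: attribute numbers of the replica identity.
    When the publication is pubviaroot, filter_columns are the parent's
    attnums; translate each to the child's attnum via the column name,
    mirroring the get_attname()/get_attnum() dance in the walker.
    """
    for attnum in filter_columns:
        if parent_attnames is not None:
            # pubviaroot case: parent attnum -> column name -> child attnum
            colname = parent_attnames[attnum]
            attnum = child_attnames.index(colname)
        if attnum not in replident_columns:
            return attnum  # invalid: filter uses a non-RI column
    return None
```

For example, a filter on a column outside the replica identity is reported as invalid, and a parent/child attnum mismatch is resolved by name before the check.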
Re: psql format output
On 15.12.21 20:58, Florian Koch wrote: I realized that the output of "\df+ func_name" has a formatting problem when a lot of arguments are used. The field 'Arguments data types' gets very long and destroys the whole formatting in the console. The field 'Source code' is most of the time multi-line and I thought that the output for the field 'Arguments data types' could also be multiline with one line for each argument. That's a reasonable idea. I wonder if it would work in general. If someone had a C function (so no source code) with three arguments, they might be annoyed if it now displayed as three lines by default.
Re: Skipping logical replication transactions on subscriber side
On 13.12.21 04:12, Greg Nancarrow wrote: (ii) "Setting -1 means to reset the transaction ID" Shouldn't it be explained what resetting actually does and when it can be, or is needed to be, done? Isn't it automatically reset? I notice that negative values (other than -1) seem to be regarded as valid - is that right? Also, what happens if this option is set multiple times? Does it just override and use the latest setting? (other option handling errors out with errorConflictingDefElem()). e.g. alter subscription sub skip (xid = 721, xid = 722); Let's not use magic numbers and instead use a syntax that is more explicit, like SKIP (xid = NONE) or RESET SKIP or something like that.
RE: Column Filtering in Logical Replication
On Friday, December 17, 2021 1:55 AM Alvaro Herrera wrote:
> On 2021-Dec-16, houzj.f...@fujitsu.com wrote:
>
> > The patch ensures all columns of RT are in the column list when
> > CREATE/ALTER publication, but it seems it doesn't prevent users from
> > changing the replica identity or dropping the index used in the replica
> > identity. Do we also need to check those cases?
>
> Yes, we do. As it happens, I spent a couple of hours yesterday writing code for
> that, at least partially. I haven't yet checked what happens with cases like
> REPLICA NOTHING, or REPLICA INDEX and then dropping that index.
>
> My initial ideas were a bit wrong BTW: I thought we should check the
> combination of column lists in all publications (a bitwise-OR of column bitmaps,
> so to speak). But conceptually that's wrong: we need to check the column list
> of each publication individually instead. Otherwise, if you wanted to hide a
> column from some publication but that column was part of the replica identity,
> there'd be no way to identify the tuple in the replica. (Or, if the pgoutput code
> disobeys the column list and sends the replica identity even if it's not in the
> column list, then you'd be potentially publishing data that you wanted to
> hide.)

Thanks for the explanation.

Apart from ALTER REPLICA IDENTITY and DROP INDEX, I think there could be some other cases we need to handle for the replica identity check:

1) When adding a partitioned table with a column list to the publication, I think we need to check the RI of all its leaf partitions, because the RI on the partition is the one that actually takes effect.

2) ALTER TABLE ADD PRIMARY KEY; ALTER TABLE DROP CONSTRAINT "PRIMARY KEY"; If the replica identity is default, it will use the primary key, so we might also need to prevent users from adding or removing a primary key in this case.

Based on the above cases, the RI check seems like it could require a considerable amount of code.
So, how about we follow what we already did in CheckCmdReplicaIdentity(): we can put the RI check in that function, so that we cover all the cases and reduce the code changes. And if we are worried about the cost of doing the check for every UPDATE and DELETE, we can also save the result in the relcache. That's safe because every operation that changes the RI will invalidate the relcache. We are using this approach in the row filter patch to make sure all columns in the row filter expression are part of the RI.

Best regards,
Hou zj
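Alvaro's point above — that each publication's column list must be checked individually against the replica identity, rather than OR-ing the column bitmaps of all publications together — can be illustrated with a tiny sketch (hypothetical Python with invented names, purely to show why the union is not enough):

```python
def column_lists_cover_replident(publication_column_lists, replident_cols):
    """Return True only if *every* publication's column list contains all
    replica identity columns. Checking the union of the lists would
    wrongly accept a publication that hides an RI column, as long as some
    other publication still publishes it."""
    return all(replident_cols <= cols for cols in publication_column_lists)
```

With a replica identity of {"id"}, the pair of lists [{"id", "val"}, {"val"}] fails the check even though their union contains "id" — the second publication could not identify tuples on the subscriber.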
Re: Addition of --no-sync to pg_upgrade for test speedup
On 16.12.21 07:50, Michael Paquier wrote:

As per $subject, avoiding the flush of the new cluster's data directory shortens the runtime of the test a bit. In some of my slow VMs, aka Windows, this shaves a couple of seconds even if the bulk of the time is still spent on the main regression test suite. In pg_upgrade, we let the flush happen with initdb --sync-only, based on the binary path of the new cluster, so I think that we are not going to miss any test coverage by skipping that.

I think that is reasonable. Maybe we could have some global option, like some environment variable, that enables the "sync" mode in all tests, so it's easy to test that once in a while. Not really a requirement for your patch, but an idea in case this is a concern.
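The environment-variable idea could look something like the sketch below. The variable name PG_TEST_FORCE_SYNC is invented here for illustration, not an existing knob: tests would default to --no-sync for speed, but setting the variable would re-enable flushing everywhere so the fsync paths get exercised once in a while.

```shell
# Decide which sync-related flags a test harness passes to initdb.
choose_sync_opts() {
    if [ "${PG_TEST_FORCE_SYNC:-0}" = "1" ]; then
        echo ""            # occasionally exercise the real fsync paths
    else
        echo "--no-sync"   # default: fast test runs
    fi
}

echo "would run: initdb $(choose_sync_opts) -D tmp_check/data"
```

Running it with PG_TEST_FORCE_SYNC unset prints the --no-sync variant; exporting PG_TEST_FORCE_SYNC=1 drops the flag.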
Re: Transparent column encryption
On 17.12.21 01:41, Jacob Champion wrote: (And I think the client should be able to enforce encryption in the first place, before I distract you too much with other stuff.) Yes, this is a useful point that I have added to my notes.
RE: In-place persistence change of a relation
> Kyotaro wrote:
> > In this version, I got rid of the "CLEANUP FORK"s, and added a new
> > system "Smgr marks". The mark files have the name of the
> > corresponding fork file followed by ".u" (which means Uncommitted.).
> > An "Uncommitted"-marked main fork means the same as the CLEANUP2_FORKNUM,
> > and an uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
> > in the previous version.
> >
> > I noticed that the previous version of the patch still leaves an
> > orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> > before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> > removed at rollback. In this version the responsibility to remove the
> > mark files is moved to SyncPostCheckpoint, where main fork files are
> > actually removed.
>
> For the record, I noticed that basebackup could be confused by the mark files
> but I haven't looked at that yet.
>

Good morning Kyotaro, the patch didn't apply cleanly (it's from March; some hunks were failing), so I've fixed it and the combined git format-patch is attached. It did conflict with the following:

b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().

I'm especially worried that I screwed up or forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".
Basic demonstration of this patch (with wal_level=minimal):

create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 1);
alter table t6 set logged;

on baseline, SET LOGGED takes: ~7min10s
on patched, SET LOGGED takes: 25s

So basically one can - thanks to this patch - use one's application (performing classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite it to use COPY), perform a literal batch upload, and then just switch the tables to LOGGED.

Some more intensive testing also looks good, assuming a table prepared to put pressure on WAL:

create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);

The workload would still be pretty heavy on LWLock/BufferContent, WALInsert and Lock/extend.

t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~3kB of WAL

so using: pgbench -f .sql -T 30 -P 1 -c 32 -j 3 t

with synchronous_commit=on (default):
with t_logged.sql: tps = 229 (lat avg = 138ms)
with t_unlogged.sql: tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite

with synchronous_commit=off:
with t_logged.sql: tps = 413 (lat avg = 77ms)
with t_unlogged.sql: tps = 782 (lat avg = 40ms)

Afterwards, switching the unlogged ~16GB partitions takes 5s per partition.

As the thread didn't get a lot of traction, I've registered it in the current commitfest https://commitfest.postgresql.org/36/3461/ with you as the author and in 'Ready for review' state.
I think it behaves as an almost-finished feature, and after reading all those discussions about it, which go back over a 10-year-plus time span, along with a lot of failed effort towards wal_level=noWAL, I think it would be nice to finally start getting some of it into the core.

-Jakub Wartak.

v7-0001-In-place-table-persistence-change-with-new-comman.patch
Re: [PoC] Delegating pg_ident to a third party
On 17.12.21 00:48, Jacob Champion wrote: WDYT? (My responses here will be slower than usual. Hope you all have a great end to the year!) Looks interesting. I wonder whether putting this into pg_ident.conf is sensible. I suspect people will want to eventually add more features around this, like automatically creating roles or role memberships, at which point pg_ident.conf doesn't seem appropriate anymore. Should we have a new file for this? Do you have any further ideas?
Re: row filtering for logical replication
On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow wrote:
> So using the v47 patch-set, I still find that the UPDATE above results in
> publication of an INSERT of (2,1), rather than an UPDATE of (1,1) to (2,1).
> This is according to the 2nd UPDATE rule below, from patch 0003.
>
> + * old-row (no match)  new-row (no match)  -> (drop change)
> + * old-row (no match)  new row (match)     -> INSERT
> + * old-row (match)     new-row (no match)  -> DELETE
> + * old-row (match)     new row (match)     -> UPDATE
>
> This is because the old row (1,1) doesn't match the UPDATE filter "(a>1)",
> but the new row (2,1) does.
> This functionality doesn't seem right to me. I don't think it can be assumed
> that (1,1) was never published (and thus requires an INSERT rather than
> UPDATE) based on these checks, because in this example, (1,1) was previously
> published via a different operation - INSERT (and using a different filter
> too).
> I think the fundamental problem here is that these UPDATE rules assume that
> the old (current) row was previously UPDATEd (and published, or not
> published, according to the filter applicable to UPDATE), but this is not
> necessarily the case.
> Or am I missing something?

But it need not be correct in assuming that the old-row was part of a previous INSERT either (and published, or not published, according to the filter applicable to an INSERT). For example, change the sequence of inserts and updates prior to the last update:

truncate tbl1;
insert into tbl1 values (1,5); ==> not replicated since insert and ! (b < 2)
update tbl1 set b = 1; ==> not replicated since update and ! (a > 1)
update tbl1 set a = 2; ==> replicated and update converted to insert since (a > 1)

In this case, the last update "update tbl1 set a = 2;" is updating a row that was previously updated and not inserted and not replicated to the subscriber. How does the replication logic differentiate between these two cases, and decide if the update was previously published or not?
I think it's futile for the publisher side to try and figure out the history of published rows. In fact, if this level of logic is required then it is best implemented on the subscriber side, which then defeats the purpose of a publication filter. regards, Ajin Cherian Fujitsu Australia
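For reference, the four-case table from patch 0003 quoted above boils down to the following decision function (a plain Python restatement for illustration, not the patch's code). The dispute in this thread is precisely about whether "matches the filter" is a safe proxy for "exists on the subscriber":

```python
def transform_update(old_matches, new_matches):
    """Patch 0003's publish decision for an UPDATE, given whether the
    old and new row versions match the row filter expression."""
    if not old_matches and not new_matches:
        return None          # drop the change entirely
    if not old_matches and new_matches:
        return "INSERT"      # assume the row was never published before
    if old_matches and not new_matches:
        return "DELETE"      # assume the row currently exists downstream
    return "UPDATE"          # both versions pass the filter
```

In Greg's example, old row (1,1) fails "(a>1)" while new row (2,1) passes it, so the function returns "INSERT" — even though (1,1) may in fact already be on the subscriber, which is the objection raised above.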
Re: Use generation context to speed up tuplesorts
On Thursday, December 16, 2021, 18:00:56 CET, Tomas Vondra wrote:
> On 12/16/21 17:03, Ronan Dunklau wrote:
> > On Thursday, December 16, 2021, 11:56:15 CET, Ronan Dunklau wrote:
> >> I will follow up with a benchmark of the test sorting a table with a
> >> width varying from 1 to 32 columns.
> >
> > So please find attached another benchmark for that case.
> >
> > The 3 different patchsets tested are:
> > - master
> > - fixed (David's original patch)
> > - adjust (Thomas growing blocks patch)
>
> Presumably Thomas is me, right?

I'm really sorry for this typo... Please accept my apologies.

> > So it looks like tuning malloc for this would be very beneficial for any
> > kind of allocation, and by doing so we reduce the problems seen with the
> > growing blocks patch to next to nothing, while keeping the ability to not
> > allocate too much memory from the get go.
>
> Thanks for running those tests and investigating the glibc behavior! I
> find those results very interesting. My conclusion from this is that
> the interaction between "our" allocator and the allocator in
> malloc (e.g. glibc) can be problematic. Which makes benchmarking and
> optimization somewhat tricky because code changes may trigger behavior
> change in glibc (or whatever allocator backs malloc).
>
> I think it's worth exploring if we can tune this in a reasonable way,
> but I have a couple concerns related to that:
>
> 1) I wonder how glibc-specific this is - I'd bet it applies to other
> allocators (either on another OS or just a different allocator on Linux)
> too. Tweaking glibc parameters won't affect those systems, of course,
> but maybe we should allow tweaking those systems too ...

I agree, finding their specific problems and seeing if we can work around them would be interesting. I suppose glibc's malloc is the most commonly used allocator in production, as it is the default for most Linux distributions.

> 2) In fact, I wonder if different glibc versions behave differently?
> Hopefully it's not changing that much, though. Ditto kernel versions,
> but the mmap/sbrk interface is likely more stable. We can test this.

That could be tested, yes. As a matter of fact, a change removing the upper limit for MALLOC_MMAP_THRESHOLD was committed to glibc just yesterday, which means much bigger allocations can be serviced without mmap.

> 3) If we bump the thresholds, won't that work against reusing the
> memory? I mean, if we free a whole block (from any allocator we have),
> glibc might return it to the kernel, depending on the mmap threshold
> value. It's not guaranteed, but increasing the malloc thresholds will
> make that even less likely. So we might just as well increase the
> minimum block size, with about the same effect, no?

It is my understanding that malloc will try to compact memory by moving it around. So the memory should actually be released to the kernel at some point. In the meantime, malloc can reuse it for our next invocation (which can be in a different memory context on our side).

If we increase the minimum block size, this is memory we will actually reserve, and it will not protect us against the ramping-up behaviour:
- the first allocation of a big block may be over mmap_threshold, and is serviced by an expensive mmap
- when it's freed, the threshold is doubled
- the next invocation is serviced by an sbrk call
- freeing it will be above the trim threshold, and the memory will be returned to the kernel

After several "big" allocations, the thresholds will rise to their maximum values (well, they used to; I need to check what happens with that latest patch of glibc...). This will typically happen several times, as malloc doubles the threshold each time. This is probably the reason quadrupling the block sizes was more effective.

> > I would like to try to implement some dynamic glibc malloc tuning, if that
> > is something we don't reject on principle from the get go.
>
> +1 to that

Ok, I'll work on a patch for this and submit a new thread.

--
Ronan Dunklau
Re: pg_dump versus ancient server versions
On Fri, 22 Oct 2021 at 19:27, Robert Haas wrote:
>
> Another thing to think about in that regard: how likely is it that
> PostgreSQL 7.4 and PostgreSQL 15 both compile and run on the same
> operating system? I suspect the answer is "not very." I seem to recall
> Greg Stark trying to compile really old versions of PostgreSQL for a
> conference talk some years ago, and he got back to a point where it
> just became impossible to make work on modern toolchains even with a
> decent amount of hackery.

That was when I compared sorting performance over time. I was able to get Postgres to build back to the point where 64-bit architecture support was added. From Andrew Dunstan's comment later in this thread I'm guessing that was 7.2.

It looks like the earliest date on the graphs in the talk is 2002-11-27, which matches the 7.3 release date. I think building earlier versions would have been doable if I had built them in 32-bit mode.

--
greg