Re: Adding skip scan (including MDAM style range skip scan) to nbtree
On Sat, May 10, 2025 at 10:59 AM Tomas Vondra wrote:
> But doesn't it also highlight how fragile this memory allocation is? The
> skip scan patch didn't do anything wrong - it just added a couple
> fields, using a little bit more memory. I think we understand allocating
> more memory may need more time, but we expect the effect to be somewhat
> proportional. Which doesn't seem to be the case here.
>
> Many other patches add fields somewhere, it seems like bad luck the skip
> scan happened to trigger this behavior. It's quite likely other patches
> ran into the same issue, except that no one noticed. Maybe the skip scan
> did that in much hotter code, not sure.

But what did the skip scan commit (specifically commit 92fe23d9, without
any of the follow-up commits) change about memory allocation, that might
be at issue with your test case? You said that that commit "just added a
couple fields". What specific fields are you talking about, that were
added by commit 92fe23d9?

I already speculated that the issue might be tied to the addition of a new
support routine (skip support), but the experiment we ran to try to
validate that theory disproved it. What else is there?

Again, commit 92fe23d9 didn't make BTScanOpaqueData any larger (follow-up
commit 8a510275 added a new BTScanOpaqueData.skipScan bool field, but that
didn't even affect sizeof BTScanOpaqueData, since the field fit into what
was previously just alignment padding space). AFAICT nothing that seems
like it might be relevant changed, apart from the addition of the new
support routine, which was ruled out already.

--
Peter Geoghegan
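To make the alignment-padding point concrete, here is a small,
self-contained illustration. The structs are hypothetical and only borrow
a few field names for flavor; this is not the real BTScanOpaqueData
layout:

```c
/*
 * Illustration only -- not the real nbtree scan opaque layout.  A bool
 * added after another small member can land in bytes the compiler already
 * had to insert as alignment padding, so sizeof() does not change.
 */
#include <stdbool.h>
#include <stdio.h>

struct before
{
	int		numKilled;		/* 4 bytes */
	bool	needPrimScan;	/* 1 byte, followed by 3 bytes of padding */
	void   *currTuples;		/* 8-byte-aligned pointer forces the padding */
};

struct after
{
	int		numKilled;
	bool	needPrimScan;
	bool	skipScan;		/* occupies one of the former padding bytes */
	void   *currTuples;
};

int
main(void)
{
	/* On typical 64-bit ABIs both print 16 */
	printf("%zu %zu\n", sizeof(struct before), sizeof(struct after));
	return 0;
}
```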
Proposal: Exploring LSM Tree‑Based Storage Engine for PostgreSQL (Inspired by MyRocks)
Hi hackers,

I’ve been exploring the idea of integrating an LSM tree–based storage
engine into PostgreSQL — similar in spirit to MyRocks for MySQL — by
replacing the underlying storage while preserving PostgreSQL’s upper
layers (planner, executor, MVCC, etc.).

The motivation stems from the well‑known write‑amplification issues with
B‑trees under high write throughput. An LSM‑based engine could offer:

- Significant improvements in write performance and space efficiency,
  especially under heavy ingestion workloads.
- Better scalability with larger datasets, particularly when compression
  is applied.
- Comparable read performance (with trade‑offs depending on workload), and
  opportunities to optimize through Bloom filters, tiered compaction
  strategies, etc.
- Reduced reliance on manual VACUUM: obsolete versions would be purged
  naturally during LSM compactions, potentially eliminating routine heap
  vacuuming (transaction‑ID wrap‑around handling and stats collection
  would still need careful design).

The hoped‑for outcome is a faster, more scalable PostgreSQL for >1 TB
workloads, while maintaining the rich feature set and ecosystem
compatibility users expect from Postgres.

Unlike Neon, this approach is not targeting cloud‑native object storage or
remote WAL streaming, but instead optimizing for maximum performance on
local disks or high‑performance block volumes, where write throughput and
compaction efficiency matter most.

This would likely involve implementing a new Table Access Method (TAM),
possibly backed by a forked engine such as BadgerDB or RocksDB, adapted to
support PostgreSQL’s MVCC and WAL semantics.

I’d love to hear your thoughts:

1. Does this direction make sense for experimentation within the Postgres
   ecosystem?
2. Are there known architectural blockers or prior discussions/attempts in
   this space worth revisiting?
3. Would such a project be best developed entirely as a fork, or is there
   openness to evolving TAM to better support pluggable storage with
   LSM‑like semantics?

Looking forward to your feedback.

- Manish
https://github.com/manishrjain
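For reference, the TAM entry point such an engine would plug into looks
roughly like the sketch below. The name lsm_tableam_handler and the SQL
shown in the trailing comment are hypothetical, and every callback in
TableAmRoutine (scans, inserts, vacuum hooks, etc.) would still have to be
implemented against the LSM engine; only the registration boilerplate is
shown:

```c
#include "postgres.h"
#include "access/tableam.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(lsm_tableam_handler);

/* Hypothetical handler table: all real callbacks still need to be filled in. */
static const TableAmRoutine lsm_methods = {
	.type = T_TableAmRoutine,
	/* .scan_begin = ..., .tuple_insert = ..., .relation_vacuum = ..., etc. */
};

Datum
lsm_tableam_handler(PG_FUNCTION_ARGS)
{
	PG_RETURN_POINTER(&lsm_methods);
}

/*
 * SQL side (hypothetical):
 *   CREATE ACCESS METHOD lsm TYPE TABLE HANDLER lsm_tableam_handler;
 *   CREATE TABLE t (...) USING lsm;
 */
```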
Re: Improve hash join's handling of tuples with null join keys
On Tue, May 6, 2025 at 12:12 PM Tomas Vondra wrote:
> On 5/6/25 01:11, Tom Lane wrote:
> > The attached patch is a response to the discussion at [1], where
> > it emerged that lots of rows with null join keys can send a hash
> > join into too-many-batches hell, if they are on the outer side
> > of the join so that they must be null-extended not just discarded.
> > This isn't really surprising given that such rows will certainly
> > end up in the same hash bucket, and no amount of splitting can
> > reduce the size of that bucket. (I'm a bit surprised that the
> > growEnabled heuristic didn't kick in, but it seems it didn't,
> > at least not up to several million batches.)

Good idea. I haven't reviewed it properly, but one observation is that
trapping the null-keys tuples in per-worker tuple stores creates
unfairness. That could be fixed by using a SharedTuplestore instead, but
unfortunately SharedTuplestore always spills to disk at the moment, so
maybe I should think about how to give it some memory for small sets like
regular Tuplestore. Will look more closely after Montreal.

> I don't think that's too surprising - growEnabled depends on all tuples
> getting into the same batch during a split. But even if there are many
> duplicate values, real-world data sets often have a couple more tuples
> that just happen to fall into that bucket too. And then during the split
> some get into one batch and some get into another.

Yeah.

> My personal experience is that the growEnabled heuristics is overly
> sensitive, and probably does not trigger very often. It can also get set
> to "true" too early, but that's (much) harder to hit.
>
> I have suggested to make growEnabled less strict in [2], i.e. to
> calculate the threshold as a percentage of the batch, and not disable
> growth permanently. But it was orthogonal to what that thread did.

+1, I also wrote a thread with a draft patch like that at some point,
which I'll also try to dig up after Montreal.

> But more importantly, wasn't the issue discussed in [1] about parallel
> hash joins? I got quite confused while reading the thread ... I'm asking
> because growEnabled is checked only in ExecHashIncreaseNumBatches, not
> in ExecParallelHashIncreaseNumBatches. So AFAICS the parallel hash joins
> don't use growEnabled at all, no?

There is an equivalent mechanism, but it's slightly more complicated as it
requires consensus (see code near PHJ_GROW_BATCHES_DECIDE). It's possible
that it's not working quite as well as it should. It's definitely less
deterministic in some edge cases, since tuples are packed into chunks
differently, so the memory used can vary slightly run-to-run, but the
tuple count should be stable. I've made a note to review that logic again
too.

Note also that v16 is the first release that could put NULLs in a shared
memory hash table (11c2d6fdf5 enabled Parallel Hash Right|Full Join).
Non-parallel hash join has handled that for a long time, but it couldn't
be used in a parallel query, so I guess it's possible that this stuff is
coming up now because, given the plans available in older releases, it
wasn't often picked for problems that would generate interesting numbers
of NULLs likely to exceed limits. See also the related bug fix in
98c7c7152, spotted soon after this plan type escaped into the field.

While thinking about that, I wanted to note that we have more things to
improve in PHRJ:

(1) Parallelism of unmatched scan: a short but not entirely satisfying
patch was already shared on the PHRJ thread but not committed with the
main feature. I already had some inklings of how to do much better, which
I recently described in a bit more detail on the PBHS thread in vapourware
form, where parallel fairness came up again. "La perfection est le mortel
ennemi du bien" or whatever it is they say in the language of Montreal,
but really the easy patch for unmatched scan parallelism wasn't actually
bon enough, because it was non-deterministic how many processes could
participate due to deadlock avoidance arcana, creating run-to-run
variation that I'd expect Tomáš to find empirically and reject in one of
his benchmarking expeditions :-).

(2) Bogus asymmetries in estimations/planning: I wrote some analysis of
why we don't use PHRJ as much as we could/should near Richard Guo's work
on anti/semi joins, which went in around the same time. My idea there is
to try to debogify the parallel degree logic more generally; it's just
that PHRJ brought key aspects of it into relief for me, ie the bogosity of
the rule-based "driving table" concept.

I'll try to write these projects up on the wiki, instead of in random
threads :-)  In other words, if you just use local Tuplestores as you
showed, it would actually be an improvement in fairness over the status
quo due to (1) not being solved yet... but it will be solved, hence
mentioning it in this context.
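For readers following along, a rough sketch of the "percentage of the
batch" idea mentioned above might look like the following. This is not
the draft patch referred to, and the threshold constant is purely
illustrative:

```c
#include "postgres.h"
#include "executor/hashjoin.h"

/* Hypothetical threshold: a split must move at least this fraction. */
#define HASH_SPLIT_MIN_MOVED_FRACTION	0.05

/*
 * Sketch only.  Today, growEnabled is permanently cleared as soon as one
 * split moves no tuples at all (nfreed == 0 or nfreed == ninmemory).
 * The relaxed idea: judge a split by the fraction of tuples it managed
 * to move, and don't give up for good.
 */
static void
evaluate_split(HashJoinTable hashtable, long ninmemory, long nfreed)
{
	double		moved = (double) Min(nfreed, ninmemory - nfreed);

	if (ninmemory > 0 &&
		moved / ninmemory < HASH_SPLIT_MIN_MOVED_FRACTION)
		hashtable->growEnabled = false;	/* or: back off only temporarily */
	else
		hashtable->growEnabled = true;	/* splits are still productive */
}
```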
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Sun, May 11, 2025 at 8:49 PM Xuneng Zhou wrote:
>
> Hi, I was able to reproduce the failure by adding a 1-second sleep in the
> LogicalRepApplyLoop function.
> However, I noticed that the tests under src/test/subscription run
> significantly slower — is this normal?
>

Yes, because you made apply slower by adding a sleep.

> Also, it looks like the patch mentioned in this thread addresses the issue:
> https://www.postgresql.org/message-id/CALDaNm2Q_pfwiCkaV920iXEbh4D%3D5MmD_tNQm_GRGX6-MsLxoQ%40mail.gmail.com
>

So you are okay with a test-only fix?

With Regards,
Amit Kapila.
Re: Small fixes needed by high-availability tools
On Fri, May 2, 2025 at 6:30 PM Andrey Borodin wrote:
>
> 3. Allow reading LSN written by walreceiver, but not flushed yet
>
> Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might
> be ahead of node1 by flush LSN, but behind by written LSN. If we do a
> failover we choose node2 instead of node1 and lose data recently committed
> with synchronous_commit=remote_write.
>

In that case, can we rely on written WAL that is not yet flushed? Say you
decide based on written WAL and choose node1 in the above case for
failover: what if it restarts without flushing the written WAL?

> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact
> returns flushed LSN, not written. I propose to add a new function which
> returns LSN actually written. Internals of this function are already
> implemented (GetWalRcvWriteRecPtr()), but unused.
>

It seems to me that this is less controversial than your other two
proposals. So, we can discuss this in a separate thread as well.

--
With Regards,
Amit Kapila.
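A sketch of what the proposed function could look like, wrapping the
already-implemented GetWalRcvWriteRecPtr(). The SQL-level name
pg_last_wal_receive_write_lsn is hypothetical and the pg_proc plumbing is
omitted:

```c
#include "postgres.h"
#include "access/xlogdefs.h"
#include "fmgr.h"
#include "replication/walreceiver.h"
#include "utils/pg_lsn.h"

/*
 * Report the WAL position the walreceiver has written (but not
 * necessarily flushed), alongside the existing pg_last_wal_receive_lsn()
 * which reports the flushed position.
 */
Datum
pg_last_wal_receive_write_lsn(PG_FUNCTION_ARGS)
{
	XLogRecPtr	recptr = GetWalRcvWriteRecPtr();

	/* Return NULL if nothing has been received yet */
	if (XLogRecPtrIsInvalid(recptr))
		PG_RETURN_NULL();

	PG_RETURN_LSN(recptr);
}
```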
Re: Valgrind - showing memory leaks
Álvaro Herrera writes:
> On 2025-May-08, Tom Lane wrote:
>> So it seems like valgrind is wrong here, or else we're leaking the
>> whole rd_att structure later on somehow.

> Well, the problem is that if num_check is zero, FreeTupleDesc() doesn't
> free ->check.

Ah-hah.

> Done that way, thanks.

Thanks! I've been putting together a list of fixes to make Valgrind less
noisy, and it just got one shorter.

			regards, tom lane
Re: Add an option to skip loading missing publication to avoid logical replication failure
Hi,

I was able to reproduce the failure by adding a 1-second sleep in the
LogicalRepApplyLoop function. However, I noticed that the tests under
src/test/subscription run significantly slower — is this normal?

Also, it looks like the patch mentioned in this thread addresses the issue:
https://www.postgresql.org/message-id/CALDaNm2Q_pfwiCkaV920iXEbh4D%3D5MmD_tNQm_GRGX6-MsLxoQ%40mail.gmail.com

> A simpler way to consistently reproduce the issue is to add a 1-second
> sleep in the LogicalRepApplyLoop function, just before the call to
> WaitLatchOrSocket. This reproduces the test failure consistently for
> me. The failure reason is the same as in [1].
>
> [1] -
> https://www.postgresql.org/message-id/CALDaNm2Q_pfwiCkaV920iXEbh4D%3D5MmD_tNQm_GRGX6-MsLxoQ%40mail.gmail.com
>
> Regards,
> Vignesh
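For anyone trying to reproduce this, the change being described is roughly
the following (reproduction aid only, not a fix; the surrounding code is
abbreviated):

```c
		/*
		 * Insert a 1-second delay just before the existing
		 * WaitLatchOrSocket() call at the bottom of LogicalRepApplyLoop()
		 * in src/backend/replication/logical/worker.c, to widen the race
		 * window that makes the test failure reproducible.
		 */
		pg_usleep(1000000L);	/* 1 second */

		/* existing code follows: rc = WaitLatchOrSocket(MyLatch, ...); */
```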
Re: Adding skip scan (including MDAM style range skip scan) to nbtree
On 5/11/25 18:07, Peter Geoghegan wrote:
> On Sat, May 10, 2025 at 10:59 AM Tomas Vondra wrote:
>> But doesn't it also highlight how fragile this memory allocation is? The
>> skip scan patch didn't do anything wrong - it just added a couple
>> fields, using a little bit more memory. I think we understand allocating
>> more memory may need more time, but we expect the effect to be somewhat
>> proportional. Which doesn't seem to be the case here.
>>
>> Many other patches add fields somewhere, it seems like bad luck the skip
>> scan happened to trigger this behavior. It's quite likely other patches
>> ran into the same issue, except that no one noticed. Maybe the skip scan
>> did that in much hotter code, not sure.
>
> But what did the skip scan commit (specifically commit 92fe23d9,
> without any of the follow-up commits) change about memory allocation,
> that might be at issue with your test case? You said that that commit
> "just added a couple fields". What specific fields are you talking
> about, that were added by commit 92fe23d9?
>
> I already speculated that the issue might be tied to the addition of a
> new support routine (skip support), but the experiment we ran to try
> to validate that theory disproved it. What else is there?
>

That's a good point. However, it seems I have done something wrong when
running the tests with the support routine removed :-( I just repeated the
tests, and I got this:

  mode      clients  3ba2cdaa454  master  revert  master  revert
  ----------------------------------------------------------------
  prepared        1        10860    3548   11048     33%    102%
                  4        25492   11299   25190     44%     99%
                 32        38399   14142   38493     37%    100%
  ----------------------------------------------------------------
  simple          1         2595    1844    2604     71%    100%
                  4         8266    6090    8126     74%     98%
                 32        11765    7198   11449     61%     97%

where "revert" is master with the removal patch. Sorry about the
confusion, I guess I was distracted and made some mistake. So, this seems
to be in line with the hypothesis ...

regards

--
Tomas Vondra
Re: Why our Valgrind reports suck
I wrote:
> Okay, here is a patch series that updates the
> 0001-Make-memory-contexts-themselves-more-visible-to-valg.patch
> patch you posted in that thread,

I forgot to mention that I did try to implement the "two-level pool"
scheme that the Valgrind documentation talks about, and could not make it
work. There seem to be undocumented interactions between the outer and
inner chunks, and it's not real clear to me that there aren't outright
bugs. Anyway, AFAICS that scheme would bring us no immediate advantages
compared to the flat approach of just adding mempool chunks for the
allocators' management areas.

			regards, tom lane
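As a stand-alone illustration of that flat approach (not the patch
itself), a toy allocator might register its management struct and its
user-visible chunks against a single Valgrind pool like this:

```c
#include <stdlib.h>
#include <valgrind/memcheck.h>

typedef struct DemoContext
{
	char	   *block;			/* single malloc'd block */
	size_t		used;
} DemoContext;

static DemoContext *
demo_context_create(size_t blksize)
{
	DemoContext *cxt = malloc(sizeof(DemoContext) + blksize);

	cxt->block = (char *) (cxt + 1);
	cxt->used = 0;

	/* One Valgrind pool per context ... */
	VALGRIND_CREATE_MEMPOOL(cxt, 0, 0);
	/* ... and a chunk covering the management struct itself, so that any
	 * pointers stored in it are followed during leak detection. */
	VALGRIND_MEMPOOL_ALLOC(cxt, cxt, sizeof(DemoContext));
	return cxt;
}

static void *
demo_alloc(DemoContext *cxt, size_t size)
{
	void	   *p = cxt->block + cxt->used;

	cxt->used += size;			/* no bounds/alignment handling: sketch only */

	/* User chunks are registered against the same pool. */
	VALGRIND_MEMPOOL_ALLOC(cxt, p, size);
	return p;
}

static void
demo_context_delete(DemoContext *cxt)
{
	/* Destroying the pool discards all chunks registered above. */
	VALGRIND_DESTROY_MEMPOOL(cxt);
	free(cxt);
}
```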
Re: add tab-complete for ALTER DOMAIN ADD...
On Sun, 11 May 2025, at 15:48, Álvaro Herrera wrote:
> So we should also have tab-completion for ALTER TABLE ADD NOT NULL, as
> in the attached. Any opinions?

LGTM

--
- ilmari
Re: Suggestion to add --continue-client-on-abort option to pgbench
On Sun, May 11, 2025 at 7:07 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:
>
> On Sat, May 10, 2025 at 8:45 PM ikedarintarof <
> ikedarinta...@oss.nttdata.com> wrote:
> >>
> >> Hi hackers,
> >>
> >> I would like to suggest adding a new option to pgbench, which enables
> >> the client to continue processing transactions even if some errors
> >> occur during a transaction.
> >> Currently, a client stops sending requests when its transaction is
> >> aborted due to reasons other than serialization failures or deadlocks.
> >> I think in some cases, especially when using custom scripts, the
> >> client should be able to rollback the failed transaction and start a
> >> new one.
> >>
> >> For example, my custom script (insert_to_unique_column.sql) follows:
> >> ```
> >> CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
> >> INSERT INTO test (col2) VALUES (random(0, 5));
> >> ```
> >> Assume we need to continuously apply load to the server using 5 clients
> >> for a certain period of time. However, a client sometimes stops when its
> >> transaction in my custom script is aborted due to a check constraint
> >> violation. As a result, the load on the server is lower than expected,
> >> which is the problem I want to address.
> >>
> >> The proposed new option solves this problem. When
> >> --continue-client-on-abort is set to true, the client rolls back the
> >> failed transaction and starts a new one. This allows all 5 clients to
> >> continuously apply load to the server, even if some transactions fail.
>
> +1. I've had similar cases before too, where I'd wanted pgbench to
> continue creating load on the server even if a transaction failed
> server-side for any reason. Sometimes, I'd even want that type of
> load.
>
> On Sat, 10 May 2025 at 17:02, Stepan Neretin wrote:
> > INSERT INTO test (col2) VALUES (random(0, 5));
> >
> > can be rewritten as:
> >
> > \setrandom val 0 5
> > INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
>
> That won't test the same execution paths, so an option to explicitly
> rollback or ignore failed transactions (rather than stopping the
> benchmark) would be a nice feature.
> With e.g. ON CONFLICT DO NOTHING you'll have much higher workload if
> there are many conflicting entries, as that triggers and catches
> per-row errors, rather than per-statement. E.g. INSERT INTO ... SELECT
> ... multiple rows could conflict on multiple rows, but will fail on the
> first conflict. DO NOTHING would cause full execution of the SELECT
> statement, which has an inherently different performance profile.
>
> > This avoids transaction aborts entirely in the presence of uniqueness
> > violations and ensures the client continues to issue load without
> > interruption. In many real-world benchmarking scenarios, this is the
> > preferred and simplest approach.
> >
> > So from that angle, could you elaborate on specific cases where this
> > SQL-level workaround wouldn't be sufficient? Are there error types you
> > intend to handle that cannot be gracefully avoided or recovered from
> > using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?
>
> The issue isn't necessarily whether you can construct SQL scripts that
> don't raise such errors (indeed, it's possible to do so for nearly any
> command; you can run pl/pgsql procedures or DO blocks which catch and
> ignore errors), but rather whether we can make pgbench function in a
> way that can keep load on the server even when it notices an error.
>
> Kind regards,
>
> Matthias van de Meent
> Neon (https://neon.tech)

Hi Matthias,

Thanks for your detailed explanation — it really helped clarify the
usefulness of the patch. I agree that the feature is indeed valuable, and
it's great to see it being pushed forward.

Regarding the patch code, I noticed that there are duplicate case entries
in the command-line option handling (specifically for case 18 or case
ESTATUS_OTHER_SQL_ERROR, the continue-client-on-error option). These
duplicated cases can be merged to simplify the logic and reduce
redundancy.

Best regards,
Stepan Neretin
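A sketch of how that "continue or stop" decision could be centralized,
assuming pgbench's existing EStatus classification
(ESTATUS_SERIALIZATION_ERROR, ESTATUS_DEADLOCK_ERROR,
ESTATUS_OTHER_SQL_ERROR). The flag name and helper are hypothetical; this
is not the submitted patch:

```c
/* Hypothetical option flag set from --continue-client-on-abort */
static bool	continue_on_error = false;

/*
 * Decide whether a failed transaction should be rolled back and the
 * client allowed to carry on, instead of terminating the client.
 */
static bool
canContinueAfterError(EStatus estatus)
{
	/* retriable errors are already handled by pgbench today */
	if (estatus == ESTATUS_SERIALIZATION_ERROR ||
		estatus == ESTATUS_DEADLOCK_ERROR)
		return true;

	/* new: treat any other SQL error as "roll back and start over" */
	if (estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error)
		return true;

	return false;
}
```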
RE: Request for Implementation of Custom Error Messages for CHECK Constraints
Tom Lane writes:
> Why don't you just choose better names for your constraints?

A word versus a sentence. It's a big difference that greatly improves the
user experience.

> I'd argue that the proposed change might actually be a net loss for
> usability, if it entirely obscures the fact that what happened was a
> check-constraint violation.

I understand; I'm looking at it from the point of view of the end user who
is using an application. The application will not need to handle the
message if the database generates a more "user-friendly" message.

> It's also not very clear why we'd stop with check constraints, if the
> desire is to get rid of database-produced error messages in favor of
> something that somebody likes better.

The idea is just to be able to define a different message. If I have to
use a trigger to set a different message, then I have to write the same
rule twice, in CHECK and TRIGGER, which is redundant.

Best regards,
Miguel Ferreira
RE: Request for Implementation of Custom Error Messages for CHECK Constraints
From: David G. Johnston
Sent: 11 May 2025 01:58

ALTER TABLE table_name ADD CONSTRAINT constraint_name CHECK (condition)
MESSAGE 'Custom error message when the condition is not met.';

>> I’m seeing some value here but, odds are, there is not enough obvious
>> benefit or design to convince someone else to try and envision
>> specifically how the behavior should work and effort to push it through.

If I have to use a trigger to set a different message, then I have to
write the same rule twice, in CHECK and TRIGGER, which is redundant.

-- Improved User Experience: Applications could capture and display more
contextual and helpful error messages to end-users, improving usability
and reducing confusion.

>> Arguably a layering violation. To make this point more clearly, do you
>> need to account for i18n?

No.

-- Enhanced Debugging: Developers could immediately identify the specific
business rule that has been violated, speeding up the debugging and
resolution of data integrity issues.

>> Is there really a meaningful gap here?

I think there is a significant gap here, especially with complex
constraints. The constraint name does not always reveal which specific
part of the business rule was violated, which can delay debugging.
Personalized messages could provide this information directly.

-- Implicit Documentation: The custom message would serve as a way to
document the intent of the constraint directly within the database schema,
facilitating understanding and maintenance of the data model.

>> Constraints can be targeted by “comment on”.

I agree, but the proposal aims at specific error messages when the
constraint is violated, for better feedback in development and in
applications.

-- Consistency: It would allow for a more consistent approach to providing
informative feedback on business rule violations, complementing the
existing capability in triggers.

Two different tools that can do the same job. One with structure and one
customizable. Because triggers exist the proposed feature is less useful.

Best regards,
Miguel Ferreira
Re: Suggestion to add --continue-client-on-abort option to pgbench
> On Sat, May 10, 2025 at 8:45 PM ikedarintarof
> wrote:
>>
>> Hi hackers,
>>
>> I would like to suggest adding a new option to pgbench, which enables
>> the client to continue processing transactions even if some errors occur
>> during a transaction.
>> Currently, a client stops sending requests when its transaction is
>> aborted due to reasons other than serialization failures or deadlocks. I
>> think in some cases, especially when using custom scripts, the client
>> should be able to rollback the failed transaction and start a new one.
>>
>> For example, my custom script (insert_to_unique_column.sql) follows:
>> ```
>> CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
>> INSERT INTO test (col2) VALUES (random(0, 5));
>> ```
>> Assume we need to continuously apply load to the server using 5 clients
>> for a certain period of time. However, a client sometimes stops when its
>> transaction in my custom script is aborted due to a check constraint
>> violation. As a result, the load on the server is lower than expected,
>> which is the problem I want to address.
>>
>> The proposed new option solves this problem. When
>> --continue-client-on-abort is set to true, the client rolls back the
>> failed transaction and starts a new one. This allows all 5 clients to
>> continuously apply load to the server, even if some transactions fail.

+1. I've had similar cases before too, where I'd wanted pgbench to
continue creating load on the server even if a transaction failed
server-side for any reason. Sometimes, I'd even want that type of load.

On Sat, 10 May 2025 at 17:02, Stepan Neretin wrote:
> INSERT INTO test (col2) VALUES (random(0, 5));
>
> can be rewritten as:
>
> \setrandom val 0 5
> INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;

That won't test the same execution paths, so an option to explicitly
rollback or ignore failed transactions (rather than stopping the
benchmark) would be a nice feature.
With e.g. ON CONFLICT DO NOTHING you'll have much higher workload if
there are many conflicting entries, as that triggers and catches
per-row errors, rather than per-statement. E.g. INSERT INTO ... SELECT
... multiple rows could conflict on multiple rows, but will fail on the
first conflict. DO NOTHING would cause full execution of the SELECT
statement, which has an inherently different performance profile.

> This avoids transaction aborts entirely in the presence of uniqueness
> violations and ensures the client continues to issue load without
> interruption. In many real-world benchmarking scenarios, this is the
> preferred and simplest approach.
>
> So from that angle, could you elaborate on specific cases where this
> SQL-level workaround wouldn't be sufficient? Are there error types you
> intend to handle that cannot be gracefully avoided or recovered from
> using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?

The issue isn't necessarily whether you can construct SQL scripts that
don't raise such errors (indeed, it's possible to do so for nearly any
command; you can run pl/pgsql procedures or DO blocks which catch and
ignore errors), but rather whether we can make pgbench function in a
way that can keep load on the server even when it notices an error.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Re: Valgrind - showing memory leaks
On 2025-May-08, Tom Lane wrote:

> Uh ... yeah it is, down at the bottom of the function:
>
>     /* Install array only after it's fully valid */
>     relation->rd_att->constr->check = check;
>     relation->rd_att->constr->num_check = found;
>
> So it seems like valgrind is wrong here, or else we're leaking the
> whole rd_att structure later on somehow.

Well, the problem is that if num_check is zero, FreeTupleDesc() doesn't
free ->check.

> In any case, you're right that asking for a zero-size chunk is pretty
> pointless. I'd support doing
>
> +    if (ncheck > 0)
> +        check = (ConstrCheck *)
> +            MemoryContextAllocZero(CacheMemoryContext,
> +                                   ncheck * sizeof(ConstrCheck));
> +    else
> +        check = NULL;
>
> but I think we have to make sure it's null if we don't palloc it.

Done that way, thanks.

> > On the other hand, the bug I was thinking about, is that if the table
> > has an invalid not-null constraint, we leak during detoasting in
> > extractNotNullColumn(). [...] But it's no longer so obvious that
> > extractNotNullColumn is okay to leak those few bytes.
>
> Given your description it still sounds fine to me.

Cool, I left it alone.

--
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)
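For context, the FreeTupleDesc() behaviour being referred to is roughly
the following (paraphrased and simplified, not an exact excerpt; the
helper name is a stand-in). It shows why a palloc'd zero-length check
array would never be freed, and hence why the fix above leaves ->check as
NULL when ncheck is zero:

```c
#include "postgres.h"
#include "access/tupdesc.h"

/* Paraphrase of the constraint-freeing part of FreeTupleDesc(). */
static void
free_constraints(TupleDesc tupdesc)
{
	TupleConstr *constr = tupdesc->constr;

	if (constr == NULL)
		return;

	if (constr->num_check > 0)
	{
		for (int i = 0; i < constr->num_check; i++)
		{
			pfree(constr->check[i].ccname);
			pfree(constr->check[i].ccbin);
		}
		pfree(constr->check);	/* never reached when num_check == 0 */
	}
	/* ... default and not-null arrays are handled similarly ... */
	pfree(constr);
}
```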
Re: Request for Implementation of Custom Error Messages for CHECK Constraints
On Sat, May 10, 2025 at 21:57, David G. Johnston
<david.g.johns...@gmail.com> wrote:

> Constraints can be targeted by “comment on”.

So, why not raise that COMMENT ON message when the constraint gets
violated? A GUC like

    Constraint_Exception_Returns_MessageOn = {Never | Always | If_Exists}

with its default value set to Never, so it runs like today; if changed to
If_Exists it will try to get that message, and with Always it will show
that COMMENT ON even if empty.

alter table b add constraint fk_b_a foreign key(ID) references a(ID);
comment on constraint fk_b_a on b is 'There is a problem on Foreign Key on
Table B related to Table A';

regards
Marcos
Re: Trivial comment fix for tsquerysend()
Hello,

On 2025-Mar-06, Emre Hasegeli wrote:
> Patch is attached.

Looks correct to me. It's harmless obviously (and nobody appears to have
been damaged by it) but I still backpatched to all live branches.

Thanks, Emre.

--
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)
Re: Why our Valgrind reports suck
I wrote:
> And, since there's nothing new under the sun around here,
> we already had a discussion about that back in 2021:
> https://www.postgresql.org/message-id/flat/3471359.1615937770%40sss.pgh.pa.us
> That thread seems to have led to fixing some specific bugs,
> but we never committed any of the discussed valgrind infrastructure
> improvements. I'll have a go at resurrecting that...

Okay, here is a patch series that updates the
0001-Make-memory-contexts-themselves-more-visible-to-valg.patch
patch you posted in that thread, and makes various follow-up fixes that
either fix or paper over various leaks. Some of it is committable I think,
but other parts are just WIP.

Anyway, as of the 0010 patch we can run through the core regression tests
and see no more than a couple of kilobytes total reported leakage in any
process, except for two tests that expose leaks in TS dictionary building.
(That's fixable but I ran out of time, and I wanted to get this posted
before Montreal.) There is work left to do before we can remove the
suppressions added in 0002, but this is already huge progress compared to
where we were.

A couple of these patches are bug fixes that need to be applied and even
back-patched. In particular, I had not realized that autovacuum leaks a
nontrivial amount of memory per relation processed (cf 0009), and
apparently has done for a few releases now. This is horrid in databases
with many tables, and I'm surprised that we've not gotten complaints about
it.

			regards, tom lane

From d4ca3e2e4824ff1f11ee5324b4ba46c4dd971ce9 Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Sun, 11 May 2025 12:39:59 -0400
Subject: [PATCH v1 01/11] Improve our support for Valgrind's leak tracking.

When determining whether an allocated chunk is still reachable, Valgrind
will consider only pointers within what it believes to be allocated
chunks. Normally, all of a block obtained from malloc() would be
considered "allocated" --- but it turns out that if we use
VALGRIND_MEMPOOL_ALLOC to designate sub-section(s) of a malloc'ed block
as allocated, all the rest of that malloc'ed block is ignored. This leads
to lots of false positives of course. In particular, in any
multi-malloc-block context, all but the primary block were reported as
leaked. We also had a problem with context "ident" strings, which were
reported as leaked unless there was some other pointer to them besides
the one in the context header.

To fix, we need to use VALGRIND_MEMPOOL_ALLOC to designate a context's
management structs (the context struct itself and any per-block headers)
as allocated chunks. That forces moving the
VALGRIND_CREATE_MEMPOOL/VALGRIND_DESTROY_MEMPOOL calls into the
per-context-type code, so that the pool identifier can be made as soon as
we've allocated the initial block, but otherwise it's fairly
straightforward. Note that in Valgrind's eyes there is no distinction
between these allocations and the allocations that the mmgr modules hand
out to user code. That's fine for now, but perhaps someday we'll want to
do better yet.
Author: Andres Freund
Co-authored-by: Tom Lane
Discussion: https://postgr.es/m/285483.1746756...@sss.pgh.pa.us
Discussion: https://postgr.es/m/20210317181531.7oggpqevzz6bk...@alap3.anarazel.de
---
 src/backend/utils/mmgr/aset.c       | 69 -
 src/backend/utils/mmgr/bump.c       | 31 -
 src/backend/utils/mmgr/generation.c | 29
 src/backend/utils/mmgr/mcxt.c       | 15 ---
 src/backend/utils/mmgr/slab.c       | 32 +
 src/include/utils/memdebug.h        |  1 +
 6 files changed, 168 insertions(+), 9 deletions(-)

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 666ecd8f78d..21490213842 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -103,6 +103,8 @@
 #define ALLOC_BLOCKHDRSZ	MAXALIGN(sizeof(AllocBlockData))
 #define ALLOC_CHUNKHDRSZ	sizeof(MemoryChunk)
+#define FIRST_BLOCKHDRSZ	(MAXALIGN(sizeof(AllocSetContext)) + \
+							 ALLOC_BLOCKHDRSZ)

 typedef struct AllocBlockData *AllocBlock;	/* forward reference */

@@ -458,6 +460,21 @@ AllocSetContextCreateInternal(MemoryContext parent,
 	 * we'd leak the header/initial block if we ereport in this stretch.
 	 */

+	/* Create a Valgrind "pool" associated with the context */
+	VALGRIND_CREATE_MEMPOOL(set, 0, false);
+
+	/*
+	 * Create a Valgrind "chunk" covering both the AllocSetContext struct and
+	 * the keeper block's header.  (It would be more sensible for these to be
+	 * two separate chunks, but doing that seems to tickle bugs in some
+	 * versions of Valgrind.)  We must have these chunks, and also a chunk for
+	 * each subsequently-added block header, so that Valgrind considers the
+	 * pointers within them while checking for leaked memory.  Note that
+	 * Valgrind doesn't distinguish between these chunks and those created by
+	 * mcxt.c for the user-accessible-data chunks we allocate.
+	 */
+	V
Re: add tab-complete for ALTER DOMAIN ADD...
So we should also have tab-completion for ALTER TABLE ADD NOT NULL, as in
the attached. Any opinions?

--
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)

>From 526f816542b45773a6bb2214c8c23a15075e4b9c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?=
Date: Sun, 11 May 2025 10:45:49 -0400
Subject: [PATCH] Add tab-completion for ALTER TABLE not-nulls

Command is: ALTER TABLE x ADD [CONSTRAINT y] NOT NULL z
---
 src/bin/psql/tab-complete.in.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index ec65ab79fec..a6135f4b6cb 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -2725,17 +2725,24 @@ match_previous_words(int pattern_id,
 	/* ALTER TABLE xxx ADD */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ADD"))
 	{
-		/* make sure to keep this list and the !Matches() below in sync */
+		/*
+		 * make sure to keep this list and the MatchAnyExcept() below in sync
+		 */
 		COMPLETE_WITH("COLUMN", "CONSTRAINT", "CHECK", "UNIQUE", "PRIMARY KEY",
-					  "EXCLUDE", "FOREIGN KEY");
+					  "NOT NULL", "EXCLUDE", "FOREIGN KEY");
 	}
 	/* ALTER TABLE xxx ADD [COLUMN] yyy */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "COLUMN", MatchAny) ||
-			 Matches("ALTER", "TABLE", MatchAny, "ADD", MatchAnyExcept("COLUMN|CONSTRAINT|CHECK|UNIQUE|PRIMARY|EXCLUDE|FOREIGN")))
+			 Matches("ALTER", "TABLE", MatchAny, "ADD", MatchAnyExcept("COLUMN|CONSTRAINT|CHECK|UNIQUE|PRIMARY|NOT|EXCLUDE|FOREIGN")))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_datatypes);
 	/* ALTER TABLE xxx ADD CONSTRAINT yyy */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny))
-		COMPLETE_WITH("CHECK", "UNIQUE", "PRIMARY KEY", "EXCLUDE", "FOREIGN KEY");
+		COMPLETE_WITH("CHECK", "UNIQUE", "PRIMARY KEY", "EXCLUDE", "FOREIGN KEY", "NOT NULL");
+	/* ALTER TABLE xxx ADD NOT NULL */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "NOT", "NULL"))
+		COMPLETE_WITH_ATTR(prev4_wd);
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny, "NOT", "NULL"))
+		COMPLETE_WITH_ATTR(prev6_wd);
 	/* ALTER TABLE xxx ADD [CONSTRAINT yyy] (PRIMARY KEY|UNIQUE) */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "PRIMARY", "KEY") ||
 			 Matches("ALTER", "TABLE", MatchAny, "ADD", "UNIQUE") ||
-- 
2.39.5
Re: add tab-complete for ALTER DOMAIN ADD...
On 2025-Apr-29, Dagfinn Ilmari Mannsåker wrote:

> jian he writes:
> > + /* ALTER DOMAIN ADD */
> > + else if (Matches("ALTER", "DOMAIN", MatchAny, "ADD"))
> > + COMPLETE_WITH("CONSTRAINT", "NOT NULL", "CHECK");
>
> I think the completion for CHECK should include the opening paren too,
> since that's required for the expression.

Yeah, we do that elsewhere.

> We could also add completion after CONSTRAINT, like this:
>
>     else if (Matches("ALTER", "DOMAIN", MatchAny, "ADD", "CONSTRAINT", MatchAny))
>         COMPLETE_WITH("NOT NULL", "CHECK (");

Done that, and pushed.

--
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos. Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)
Re: FileFallocate misbehaving on XFS
Hello,

I was running into the same thing, and the fix for me was mounting xfs
with -o inodes64

Best,
Pierre

On Mon, Dec 9, 2024, at 08:34, Michael Harris wrote:
> Hello PG Hackers
>
> Our application has recently migrated to PG16, and we have experienced
> some failed upgrades. The upgrades are performed using pg_upgrade and
> have failed during the phase where the schema is restored into the new
> cluster, with the following error:
>
> pg_restore: error: could not execute query: ERROR: could not extend
> file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
> FileFallocate(): No space left on device
> HINT: Check free disk space.
>
> This has happened multiple times on different servers, and in each
> case there was plenty of free space available.
>
> We found this thread describing similar issues:
>
> https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
>
> As is the case in that thread, all of the affected databases are using XFS.
>
> One of my colleagues built postgres from source with
> HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
> complete the pg_upgrade, and then switched to a stock postgres build
> after the upgrade. However, as you might expect, after the upgrade we
> have experienced similar errors during regular operation. We make
> heavy use of COPY, which is mentioned in the above discussion as
> pre-allocating files.
>
> We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
> Linux 9 (Kernel 5.14.0).
>
> I am wondering if this bug might be related:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323
>
>> When given an offset of 0 and a length, fallocate (man 2 fallocate) reports
>> ENOSPC if the size of the file + the length to be allocated is greater than
>> the available space.
>
> There is a reproduction procedure at the bottom of the above ubuntu
> thread, and using that procedure I get the same results on both kernel
> 4.18.0 and 5.14.0.
> When calling fallocate with offset zero on an existing file, I get
> enospc even if I am only requesting the same amount of space as the
> file already has.
> If I repeat the experiment with ext4 I don't get that behaviour.
>
> On a surface examination of the code paths leading to the
> FileFallocate call, it does not look like it should be trying to
> allocate already allocated space, but I might have missed something
> there.
>
> Is this already being looked into?
>
> Thanks in advance,
>
> Cheers
> Mike
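A stand-alone probe for the behaviour described in the Ubuntu report might
look like the following (this is not the reproduction procedure from that
report; the path is an assumption, and the failure only appears when the
filesystem does not have enough free space left to cover the existing file
size plus the requested length, per the buggy check described above):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "/mnt/xfs_test/fallocate_probe";	/* assumption */
	const off_t len = 1024L * 1024L * 1024L;	/* 1 GB, adjust to taste */
	int			fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* First call actually reserves the space. */
	int			rc = posix_fallocate(fd, 0, len);

	printf("first  posix_fallocate: %s\n", rc ? strerror(rc) : "ok");

	/* Second call asks for the same already-allocated range again. */
	rc = posix_fallocate(fd, 0, len);
	printf("second posix_fallocate: %s\n", rc ? strerror(rc) : "ok");

	close(fd);
	return 0;
}
```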