Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2025-05-11 Thread Peter Geoghegan
On Sat, May 10, 2025 at 10:59 AM Tomas Vondra  wrote:
> But doesn't it also highlight how fragile this memory allocation is? The
> skip scan patch didn't do anything wrong - it just added a couple
> fields, using a little bit more memory. I think we understand allocating
> more memory may need more time, but we expect the effect to be somewhat
> proportional. Which doesn't seem to be the case here.
>
> Many other patches add fields somewhere, it seems like bad luck the skip
> scan happened to trigger this behavior. It's quite likely other patches
> ran into the same issue, except that no one noticed. Maybe the skip scan
> did that in much hotter code, not sure.

But what did the skip scan commit (specifically commit 92fe23d9,
without any of the follow-up commits) change about memory allocation,
that might be at issue with your test case? You said that that commit
"just added a couple fields". What specific fields are you talking
about, that were added by commit 92fe23d9?

I already speculated that the issue might be tied to the addition of a
new support routine (skip support), but the experiment we ran to try
to validate that theory disproved it. What else is there?

Again, commit 92fe23d9 didn't make BTScanOpaqueData any larger
(follow-up commit 8a510275 added a new BTScanOpaqueData.skipScan bool
field, but that didn't even affect sizeof BTScanOpaqueData, since the
field fit into what was previously just alignment padding space).
AFAICT nothing that seems like it might be relevant changed, apart
from the addition of the new support routine, which was ruled out
already.
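
To illustrate the padding point, here is a hypothetical struct (not the real
BTScanOpaqueData layout): a new bool can land in what was previously trailing
alignment padding, leaving sizeof() unchanged.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct Before
{
	long		a;				/* 8 bytes */
	int			b;				/* 4 bytes, then 4 bytes of padding */
} Before;

typedef struct After
{
	long		a;				/* 8 bytes */
	int			b;				/* 4 bytes */
	bool		skipScan;		/* absorbed by the former padding */
} After;

int
main(void)
{
	/* typically prints "16 16" on an LP64 platform */
	printf("%zu %zu\n", sizeof(Before), sizeof(After));
	return 0;
}
```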

-- 
Peter Geoghegan




Proposal: Exploring LSM Tree‑Based Storage Engine for PostgreSQL (Inspired by MyRocks)

2025-05-11 Thread Manish Rai Jain
Hi hackers,

I’ve been exploring the idea of integrating an LSM tree–based storage engine
into PostgreSQL — similar in spirit to MyRocks for MySQL — by replacing the
underlying storage while preserving PostgreSQL’s upper layers (planner,
executor, MVCC, etc.).

The motivation stems from the well‑known write‑amplification issues with
B‑trees under high write throughput. An LSM‑based engine could offer:

   - Significant improvements in write performance and space efficiency,
   especially under heavy ingestion workloads.
   - Better scalability with larger datasets, particularly when compression
   is applied.
   - Comparable read performance (with trade‑offs depending on workload),
   and opportunities to optimize through Bloom filters, tiered compaction
   strategies, etc.
   - Reduced reliance on manual VACUUM: obsolete versions would be purged
   naturally during LSM compactions, potentially eliminating routine heap
   vacuuming (transaction‑ID wrap‑around handling and stats collection would
   still need careful design).

The hoped‑for outcome is a faster, more scalable PostgreSQL for >1 TB
workloads, while maintaining the rich feature set and ecosystem
compatibility users expect from Postgres.

Unlike Neon, this approach is not targeting cloud‑native object storage or
remote WAL streaming, but instead optimizing for maximum performance on
local disks or high‑performance block volumes, where write throughput and
compaction efficiency matter most.

This would likely involve implementing a new Table Access Method (TAM),
possibly backed by a forked engine such as BadgerDB or RocksDB, adapted to
support PostgreSQL’s MVCC and WAL semantics.
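
As a rough sketch of the entry point involved (the lsm_* names here are
purely illustrative, and the real TableAmRoutine interface has many more
required callbacks than shown):

```c
/*
 * Hypothetical skeleton of an LSM table access method handler.  A real
 * implementation must fill in the full set of TableAmRoutine callbacks
 * (scans, tuple insert/delete/update, vacuum, DDL hooks, etc.), each
 * mapping PostgreSQL's MVCC semantics onto the LSM engine.
 */
#include "postgres.h"
#include "access/tableam.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(lsm_tableam_handler);

static const TableAmRoutine lsm_methods = {
	.type = T_TableAmRoutine,
	/* .scan_begin, .tuple_insert, .tuple_delete, ... omitted here */
};

Datum
lsm_tableam_handler(PG_FUNCTION_ARGS)
{
	PG_RETURN_POINTER(&lsm_methods);
}
```

After a CREATE ACCESS METHOD lsm TYPE TABLE HANDLER lsm_tableam_handler,
individual tables could then opt in with CREATE TABLE ... USING lsm.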

I’d love to hear your thoughts:


   1. Does this direction make sense for experimentation within the Postgres
   ecosystem?
   2. Are there known architectural blockers or prior discussions/attempts in
   this space worth revisiting?
   3. Would such a project be best developed entirely as a fork, or is there
   openness to evolving TAM to better support pluggable storage with LSM‑like
   semantics?

Looking forward to your feedback.
- Manish

https://github.com/manishrjain


Re: Improve hash join's handling of tuples with null join keys

2025-05-11 Thread Thomas Munro
On Tue, May 6, 2025 at 12:12 PM Tomas Vondra  wrote:
> On 5/6/25 01:11, Tom Lane wrote:
> > The attached patch is a response to the discussion at [1], where
> > it emerged that lots of rows with null join keys can send a hash
> > join into too-many-batches hell, if they are on the outer side
> > of the join so that they must be null-extended not just discarded.
> > This isn't really surprising given that such rows will certainly
> > end up in the same hash bucket, and no amount of splitting can
> > reduce the size of that bucket.  (I'm a bit surprised that the
> > growEnabled heuristic didn't kick in, but it seems it didn't,
> > at least not up to several million batches.)

Good idea.  I haven't reviewed it properly, but one observation is
that trapping the null-keys tuples in per-worker tuple stores creates
unfairness.  That could be fixed by using a SharedTuplestore instead,
but unfortunately SharedTuplestore always spills to disk at the
moment, so maybe I should think about how to give it some memory for
small sets like regular Tuplestore.  Will look more closely after
Montreal.

> I don't think that's too surprising - growEnabled depends on all tuples
> getting into the same batch during a split. But even if there are many
> duplicate values, real-world data sets often have a couple more tuples
> that just happen to fall into that bucket too. And then during the split
> some get into one batch and some get into another.

Yeah.

> My personal experience is that the growEnabled heuristic is overly
> sensitive, and probably does not trigger very often. It can also get set
> to "true" too early, but that's (much) harder to hit.
>
> I have suggested making growEnabled less strict in [2], i.e. to
> calculate the threshold as a percentage of the batch, and not disable
> growth permanently. But that was orthogonal to what that thread did.

+1, I also wrote a thread with a draft patch like that at some point,
which I'll also try to dig up after Montreal.
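
To make the all-or-nothing problem concrete, here's a sketch of the two
rules (simplified, not the actual nodeHash.c code, and the 5% threshold is
just a placeholder):

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * nfreed = tuples that moved to the new batch during a split,
 * ninmemory = tuples examined.
 */
static bool
split_was_useless_current(long nfreed, long ninmemory)
{
	/* current rule: a split "did nothing" only if *every* tuple stayed
	 * put or *every* tuple moved -- one stray tuple defeats it */
	return nfreed == 0 || nfreed == ninmemory;
}

static bool
split_was_useless_fractional(long nfreed, long ninmemory)
{
	/* proposed direction: a fractional threshold, and the caller would
	 * skip this split rather than clearing growEnabled permanently */
	double		moved = (double) nfreed / (double) ninmemory;

	return moved < 0.05 || moved > 0.95;
}

int
main(void)
{
	/* 999,999 duplicates plus one stray tuple that moves: the current
	 * rule keeps splitting forever, the fractional rule detects it */
	printf("current: %d, fractional: %d\n",
		   split_was_useless_current(1, 1000000),
		   split_was_useless_fractional(1, 1000000));
	return 0;
}
```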

> But more importantly, wasn't the issue discussed in [1] about parallel
> hash joins? I got quite confused while reading the thread ... I'm asking
> because growEnabled is checked only in ExecHashIncreaseNumBatches, not
> in ExecParallelHashIncreaseNumBatches. So AFAICS the parallel hash joins
> don't use growEnabled at all, no?

There is an equivalent mechanism, but it's slightly more complicated
as it requires consensus (see code near PHJ_GROW_BATCHES_DECIDE).
It's possible that it's not working quite as well as it should.  It's
definitely less deterministic in some edge cases since tuples are
packed into chunks differently so the memory used can vary slightly
run-to-run, but the tuple count should be stable.  I've made a note to
review that logic again too.

Note also that v16 is the first release that could put NULLs in a shared
memory hash table (11c2d6fdf5 enabled Parallel Hash Right|Full Join).
Non-parallel hash join has had that for a long time, but it couldn't be
used in a parallel query, so I guess it's possible that this stuff is
coming up now because, given the plans available in older releases, hash
join wasn't often picked for problems that would generate interesting
numbers of NULLs likely to exceed limits.  See also the related bug fix in
98c7c7152, spotted soon after this plan type escaped into the field.

While thinking about that, I wanted to note that we have more things
to improve in PHRJ:

(1) Parallelism of unmatched scan: a short but not entirely satisfying
patch was already shared on the PHRJ thread but not committed with the
main feature.  I already had some inklings of how to do much better,
which I recently described in a bit more detail on the PBHS thread in
vapourware form, where parallel fairness came up again.  "La perfection
est le mortel ennemi du bien" or whatever it is they say in the language
of Montreal, but really the easy patch for unmatched scan parallelism
wasn't actually bon enough, because it was non-deterministic how many
processes could participate due to deadlock avoidance arcana, creating
run-to-run variation that I'd expect Tomáš to find empirically and
reject in one of his benchmarking expeditions :-).

(2) Bogus asymmetries in estimations/planning: I wrote some analysis of
why we don't use PHRJ as much as we could/should near Richard Guo's work
on anti/semi joins, which went in around the same time.  My idea there
is to try to debogify the parallel degree logic more generally; it's
just that PHRJ brought key aspects of it into relief for me, i.e. the
bogosity of the rule-based "driving table" concept.

I'll try to write these projects up on the wiki, instead of in random
threads :-)

In other words, if you just use local Tuplestores as you showed, it
would actually be an improvement in fairness over the status quo, due to
(1) not being solved yet... but it will be solved, hence mentioning it
in this context.




Re: Add an option to skip loading missing publication to avoid logical replication failure

2025-05-11 Thread Amit Kapila
On Sun, May 11, 2025 at 8:49 PM Xuneng Zhou  wrote:
>
> Hi, I was able to reproduce the failure by adding a 1-second sleep in the
> LogicalRepApplyLoop function. However, I noticed that the tests under
> src/test/subscription run significantly slower — is this normal?
>

Yes, because you made apply slower by adding a sleep.

> Also, it looks like the patch mentioned in this thread addresses the issue:
> https://www.postgresql.org/message-id/CALDaNm2Q_pfwiCkaV920iXEbh4D%3D5MmD_tNQm_GRGX6-MsLxoQ%40mail.gmail.com
>>

So you are okay with a test-only fix?

With Regards,
Amit Kapila.




Re: Small fixes needed by high-availability tools

2025-05-11 Thread Amit Kapila
On Fri, May 2, 2025 at 6:30 PM Andrey Borodin  wrote:
>
> 3. Allow reading LSN written by walreciever, but not flushed yet
>
> Problem: if we have synchronous_standby_names = ANY(node1, node2), node2 might 
> be ahead of node1 by flush LSN, but behind by written LSN. If we do a 
> failover, we choose node2 instead of node1 and lose data recently committed 
> with synchronous_commit=remote_write.
>

In which case, can we rely on written WAL that is not yet flushed?
Say you decide based on written WAL and choose node1 in the above case
for failover: what if it restarts without flushing the written WAL?

> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact 
> returns flushed LSN, not written. I propose to add a new function which 
> returns LSN actually written. Internals of this function are already 
> implemented (GetWalRcvWriteRecPtr()), but unused.
>

It seems to me that this is less controversial than your other two
proposals. So, we can discuss this in a separate thread as well.
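
For reference, a minimal sketch of such a wrapper (the SQL-visible name
pg_last_wal_receive_write_lsn is invented; GetWalRcvWriteRecPtr() is the
existing internal function Andrey mentioned):

```c
#include "postgres.h"
#include "access/xlogdefs.h"
#include "fmgr.h"
#include "replication/walreceiver.h"
#include "utils/pg_lsn.h"

PG_FUNCTION_INFO_V1(pg_last_wal_receive_write_lsn);

/*
 * Return the last write position reported by the walreceiver, or NULL
 * if nothing has been received yet.
 */
Datum
pg_last_wal_receive_write_lsn(PG_FUNCTION_ARGS)
{
	XLogRecPtr	recptr = GetWalRcvWriteRecPtr();

	if (XLogRecPtrIsInvalid(recptr))
		PG_RETURN_NULL();

	PG_RETURN_LSN(recptr);
}
```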

-- 
With Regards,
Amit Kapila.




Re: Valgrind - showing memory leaks

2025-05-11 Thread Tom Lane
=?utf-8?Q?=C3=81lvaro?= Herrera  writes:
> On 2025-May-08, Tom Lane wrote:
>> So it seems like valgrind is wrong here, or else we're leaking the
>> whole rd_att structure later on somehow.

> Well, the problem is that if num_check is zero, FreeTupleDesc() doesn't
> free ->check.

Ah-hah.

> Done that way, thanks.

Thanks! I've been putting together a list of fixes to make
Valgrind less noisy, and it just got one shorter.

regards, tom lane




Re: Add an option to skip loading missing publication to avoid logical replication failure

2025-05-11 Thread Xuneng Zhou
Hi, I was able to reproduce the failure by adding a 1-second sleep in the
LogicalRepApplyLoop function. However, I noticed that the tests under
src/test/subscription run significantly slower — is this normal?

Also, it looks like the patch mentioned in this thread addresses the issue:
https://www.postgresql.org/message-id/CALDaNm2Q_pfwiCkaV920iXEbh4D%3D5MmD_tNQm_GRGX6-MsLxoQ%40mail.gmail.com

>
> A simpler way to consistently reproduce the issue is to add a 1-second
> sleep in the LogicalRepApplyLoop function, just before the call to
> WaitLatchOrSocket. This reproduces the test failure consistently for
> me. The failure reason is the same as in [1].
>
> [1] -
> https://www.postgresql.org/message-id/CALDaNm2Q_pfwiCkaV920iXEbh4D%3D5MmD_tNQm_GRGX6-MsLxoQ%40mail.gmail.com
>
> Regards,
> Vignesh
>


Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2025-05-11 Thread Tomas Vondra



On 5/11/25 18:07, Peter Geoghegan wrote:
> On Sat, May 10, 2025 at 10:59 AM Tomas Vondra  wrote:
>> But doesn't it also highlight how fragile this memory allocation is? The
>> skip scan patch didn't do anything wrong - it just added a couple
>> fields, using a little bit more memory. I think we understand allocating
>> more memory may need more time, but we expect the effect to be somewhat
>> proportional. Which doesn't seem to be the case here.
>>
>> Many other patches add fields somewhere, it seems like bad luck the skip
>> scan happened to trigger this behavior. It's quite likely other patches
>> ran into the same issue, except that no one noticed. Maybe the skip scan
>> did that in much hotter code, not sure.
> 
> But what did the skip scan commit (specifically commit 92fe23d9,
> without any of the follow-up commits) change about memory allocation,
> that might be at issue with your test case? You said that that commit
> "just added a couple fields". What specific fields are you talking
> about, that were added by commit 92fe23d9?
> 
> I already speculated that the issue might be tied to the addition of a
> new support routine (skip support), but the experiment we ran to try
> to validate that theory disproved it. What else is there?
> 

That's a good point. However, it seems I have done something wrong when
running the tests with the support routine removed :-( I just repeated
the tests, and I got this:

  mode      clients   3ba2cdaa454   master   revert   master   revert
  --------------------------------------------------------------------
  prepared        1         10860     3548    11048      33%     102%
                  4         25492    11299    25190      44%      99%
                 32         38399    14142    38493      37%     100%
  --------------------------------------------------------------------
  simple          1          2595     1844     2604      71%     100%
                  4          8266     6090     8126      74%      98%
                 32         11765     7198    11449      61%      97%

I where "revert" is master with the removal patch. Sorry about the
confusion, I guess I was distracted and did some mistake.

So, this seems to be in line with the hypothesis ...


regards

-- 
Tomas Vondra





Re: Why our Valgrind reports suck

2025-05-11 Thread Tom Lane
I wrote:
> Okay, here is a patch series that updates the
> 0001-Make-memory-contexts-themselves-more-visible-to-valg.patch
> patch you posted in that thread,

I forgot to mention that I did try to implement the "two-level
pool" scheme that the Valgrind documentation talks about, and
could not make it work.  There seem to be undocumented interactions
between the outer and inner chunks, and it's not real clear to me
that there aren't outright bugs.  Anyway, AFAICS that scheme
would bring us no immediate advantages, compared to the
flat approach of just adding mempool chunks for the allocators'
management areas.

regards, tom lane




Re: add tab-complete for ALTER DOMAIN ADD...

2025-05-11 Thread Dagfinn Ilmari Mannsåker
On Sun, 11 May 2025, at 15:48, Álvaro Herrera wrote:
> So we should also have tab-completion for ALTER TABLE ADD NOT NULL, as
> in the attached.  Any opinions?

LGTM

-- 
- ilmari




Re: Suggestion to add --continue-client-on-abort option to pgbench

2025-05-11 Thread Stepan Neretin
On Sun, May 11, 2025 at 7:07 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> > On Sat, May 10, 2025 at 8:45 PM ikedarintarof <
> ikedarinta...@oss.nttdata.com> wrote:
> >>
> >> Hi hackers,
> >>
> >> I would like to suggest adding a new option to pgbench, which enables
> >> the client to continue processing transactions even if some errors occur
> >> during a transaction.
> >> Currently, a client stops sending requests when its transaction is
> >> aborted due to reasons other than serialization failures or deadlocks. I
> >> think in some cases, especially when using custom scripts, the client
> >> should be able to rollback the failed transaction and start a new one.
> >>
> >> For example, my custom script (insert_to_unique_column.sql) follows:
> >> ```
> >> CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
> >> INSERT INTO test (col2) VALUES (random(0, 5));
> >> ```
> >> Assume we need to continuously apply load to the server using 5 clients
> >> for a certain period of time. However, a client sometimes stops when its
> >> transaction in my custom script is aborted due to a check constraint
> >> violation. As a result, the load on the server is lower than expected,
> >> which is the problem I want to address.
> >>
> >> The proposed new option solves this problem. When
> >> --continue-client-on-abort is set to true, the client rolls back the
> >> failed transaction and starts a new one. This allows all 5 clients to
> >> continuously apply load to the server, even if some transactions fail.
>
> +1. I've had similar cases before too, where I'd wanted pgbench to
> continue creating load on the server even if a transaction failed
> server-side for any reason. Sometimes, I'd even want that type of
> load.
>
> On Sat, 10 May 2025 at 17:02, Stepan Neretin  wrote:
> > INSERT INTO test (col2) VALUES (random(0, 5));
> >
> > can be rewritten as:
> >
> > \setrandom val 0 5
> > INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
>
> That won't test the same execution paths, so an option to explicitly
> rollback or ignore failed transactions (rather than stopping the
> benchmark) would be a nice feature.
> With e.g. ON CONFLICT DO NOTHING you'll have much higher workload if
> there are many conflicting entries, as that triggers and catches
> per-row errors, rather than per-statement. E.g. INSERT INTO ... SELECT
> ...multiple rows could conflict on multiple rows, but will fail on the
> first conflict. DO NOTHING would cause full execution of the SELECT
> statement, which has an inherently different performance profile.
>
> > This avoids transaction aborts entirely in the presence of uniqueness
> violations and ensures the client continues to issue load without
> interruption. In many real-world benchmarking scenarios, this is the
> preferred and simplest approach.
> >
> > So from that angle, could you elaborate on specific cases where this
> SQL-level workaround wouldn't be sufficient? Are there error types you
> intend to handle that cannot be gracefully avoided or recovered from using
> SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?
>
> The issue isn't necessarily whether you can construct SQL scripts that
> don't raise such errors (indeed, it's possible to do so for nearly any
> command; you can run pl/pgsql procedures or DO blocks which catch and
> ignore errors), but rather whether we can make pgbench function in a
> way that can keep load on the server even when it notices an error.
>
> Kind regards,
>
> Matthias van de Meent
> Neon (https://neon.tech)
>

Hi Matthias,

Thanks for your detailed explanation — it really helped clarify the
usefulness of the patch. I agree that the feature is indeed valuable, and
it's great to see it being pushed forward.

Regarding the patch code, I noticed that there are duplicate case entries
in the command-line option handling (specifically for case 18 or case
ESTATUS_OTHER_SQL_ERROR, the continue-client-on-error option). These
duplicated cases can be merged to simplify the logic and reduce redundancy.

Best regards,
Stepan Neretin


RE: Request for Implementation of Custom Error Messages for CHECK Constraints

2025-05-11 Thread Miguel Ferreira
Tom Lane  writes:
> Why don't you just choose better names for your constraints?

A word versus a sentence. It's a big difference that greatly improves the
user experience.

> I'd argue that the proposed change might actually be a net loss for
> usability, if it entirely obscures the fact that what happened was a
> check-constraint violation.

I understand, I'm looking at it from the point of view of the end user who
is using an application. This application will not need to handle the
message if the database generates a more 'user-friendly' message.

> It's also not very clear why we'd stop with check constraints, if the
> desire is to get rid of database-produced error messages in favor of
> something that somebody likes better.

The idea is just to be able to define a different message. If I have to use
a trigger to set a different message, then I have to write the same rule
twice, in CHECK and TRIGGER, which is redundant.

Best regards,
Miguel Ferreira





RE: Request for Implementation of Custom Error Messages for CHECK Constraints

2025-05-11 Thread Miguel Ferreira
From: David G. Johnston
Sent: May 11, 2025, 01:58

ALTER TABLE table_name
ADD CONSTRAINT constraint_name
CHECK (condition)
MESSAGE 'Custom error message when the condition is not met.';

>> I’m seeing some value here but, odds are, there is not enough obvious
>> benefit or design to convince someone else to try and envision specifically
>> how the behavior should work and effort to push it through.

If I have to use a trigger to set a different message, then I have to write
the same rule twice, in CHECK and TRIGGER, which is redundant.

--

Improved User Experience: Applications could capture and display more
contextual and helpful error messages to end-users, improving usability and
reducing confusion.

>> Arguably a layering violation.  To make this point more clearly, do you
>> need to account for i18n?

No.

--

Enhanced Debugging: Developers could immediately identify the specific
business rule that has been violated, speeding up the debugging and
resolution of data integrity issues.

>> Is there really a meaningful gap here?

I think there is a significant gap here, especially with complex constraints.
The constraint name does not always reveal which specific part of the
business rule was violated, which can delay debugging. Personalized messages
could provide this information directly.

--

Implicit Documentation: The custom message would serve as a way to document
the intent of the constraint directly within the database schema, facilitating
understanding and maintenance of the data model.

>> Constraints can be targeted by “comment on”.

I agree, but the proposal aims at specific error messages when the constraint
is violated, for better feedback in development and in applications.

--

Consistency: It would allow for a more consistent approach to providing
informative feedback on business rule violations, complementing the existing
capability in triggers.

>> Two different tools that can do the same job.  One with structure and one
>> customizable.  Because triggers exist the proposed feature is less useful.

Best regards,

Miguel Ferreira




Re: Suggestion to add --continue-client-on-abort option to pgbench

2025-05-11 Thread Matthias van de Meent
> On Sat, May 10, 2025 at 8:45 PM ikedarintarof  
> wrote:
>>
>> Hi hackers,
>>
>> I would like to suggest adding a new option to pgbench, which enables
>> the client to continue processing transactions even if some errors occur
>> during a transaction.
>> Currently, a client stops sending requests when its transaction is
>> aborted due to reasons other than serialization failures or deadlocks. I
>> think in some cases, especially when using custom scripts, the client
>> should be able to rollback the failed transaction and start a new one.
>>
>> For example, my custom script (insert_to_unique_column.sql) follows:
>> ```
>> CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
>> INSERT INTO test (col2) VALUES (random(0, 5));
>> ```
>> Assume we need to continuously apply load to the server using 5 clients
>> for a certain period of time. However, a client sometimes stops when its
>> transaction in my custom script is aborted due to a check constraint
>> violation. As a result, the load on the server is lower than expected,
>> which is the problem I want to address.
>>
>> The proposed new option solves this problem. When
>> --continue-client-on-abort is set to true, the client rolls back the
>> failed transaction and starts a new one. This allows all 5 clients to
>> continuously apply load to the server, even if some transactions fail.

+1. I've had similar cases before too, where I'd wanted pgbench to
continue creating load on the server even if a transaction failed
server-side for any reason. Sometimes, I'd even want that type of
load.

On Sat, 10 May 2025 at 17:02, Stepan Neretin  wrote:
> INSERT INTO test (col2) VALUES (random(0, 5));
>
> can be rewritten as:
>
> \setrandom val 0 5
> INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;

That won't test the same execution paths, so an option to explicitly
rollback or ignore failed transactions (rather than stopping the
benchmark) would be a nice feature.
With e.g. ON CONFLICT DO NOTHING you'll have much higher workload if
there are many conflicting entries, as that triggers and catches
per-row errors, rather than per-statement. E.g. INSERT INTO ... SELECT
...multiple rows could conflict on multiple rows, but will fail on the
first conflict. DO NOTHING would cause full execution of the SELECT
statement, which has an inherently different performance profile.

> This avoids transaction aborts entirely in the presence of uniqueness 
> violations and ensures the client continues to issue load without 
> interruption. In many real-world benchmarking scenarios, this is the 
> preferred and simplest approach.
>
> So from that angle, could you elaborate on specific cases where this 
> SQL-level workaround wouldn't be sufficient? Are there error types you intend 
> to handle that cannot be gracefully avoided or recovered from using SQL 
> constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?

The issue isn't necessarily whether you can construct SQL scripts that
don't raise such errors (indeed, it's possible to do so for nearly any
command; you can run pl/pgsql procedures or DO blocks which catch and
ignore errors), but rather whether we can make pgbench function in a
way that can keep load on the server even when it notices an error.
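
To sketch where such an option would hook in (approximate, based on
pgbench's existing ESTATUS_/CSTATE_ machinery; not the actual patch):

```c
#include <stdbool.h>

/*
 * Approximate sketch: decide the next client state after a failed
 * transaction.  The ESTATUS_ and CSTATE_ names mirror pgbench's existing
 * enums; continue_client_on_abort is the proposed option.
 */
typedef enum
{
	ESTATUS_SERIALIZATION_ERROR,
	ESTATUS_DEADLOCK_ERROR,
	ESTATUS_OTHER_SQL_ERROR
} EStatus;

typedef enum
{
	CSTATE_RETRY,
	CSTATE_ABORTED
} CState;

static CState
next_state_after_error(EStatus estatus, bool continue_client_on_abort)
{
	if (estatus == ESTATUS_SERIALIZATION_ERROR ||
		estatus == ESTATUS_DEADLOCK_ERROR)
		return CSTATE_RETRY;	/* existing retry path */
	if (continue_client_on_abort)
		return CSTATE_RETRY;	/* proposed: roll back, start a new xact */
	return CSTATE_ABORTED;		/* current behavior: the client stops */
}

int
main(void)
{
	/* today, a unique violation (ESTATUS_OTHER_SQL_ERROR) aborts the client */
	return next_state_after_error(ESTATUS_OTHER_SQL_ERROR, false)
		== CSTATE_ABORTED ? 0 : 1;
}
```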

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Valgrind - showing memory leaks

2025-05-11 Thread Álvaro Herrera
On 2025-May-08, Tom Lane wrote:

> Uh ... yeah it is, down at the bottom of the function:
> 
>   /* Install array only after it's fully valid */
>   relation->rd_att->constr->check = check;
>   relation->rd_att->constr->num_check = found;
> 
> So it seems like valgrind is wrong here, or else we're leaking the
> whole rd_att structure later on somehow.

Well, the problem is that if num_check is zero, FreeTupleDesc() doesn't
free ->check.

> In any case, you're right that asking for a zero-size chunk is pretty
> pointless.  I'd support doing
> 
> +	if (ncheck > 0)
> +		check = (ConstrCheck *)
> +			MemoryContextAllocZero(CacheMemoryContext,
> +								   ncheck * sizeof(ConstrCheck));
> +	else
> +		check = NULL;
> 
> but I think we have to make sure it's null if we don't palloc it.

Done that way, thanks.

> > On the other hand, the bug I was thinking about, is that if the table
> > has an invalid not-null constraint, we leak during detoasting in
> > extractNotNullColumn().  [...] But it's no longer so obvious that
> > extractNotNullColumn is okay to leak those few bytes.
> 
> Given your description it still sounds fine to me.

Cool, I left it alone.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)




Re: Request for Implementation of Custom Error Messages for CHECK Constraints

2025-05-11 Thread Marcos Pegoraro
Em sáb., 10 de mai. de 2025 às 21:57, David G. Johnston <
david.g.johns...@gmail.com> escreveu:

> Constraints can be targeted by “comment on”.
>

So, why not use that COMMENT ON message to raise when that constraint
gets violated?
A GUC like Constraint_Exception_Returns_MessageOn = {Never | Always |
If_Exists}, with its default value set to Never, would run like today;
if changed to If_Exists it would try to get that message, and with
Always it would always show that COMMENT ON, even if empty.

alter table b add constraint fk_b_a foreign key(ID) references a(ID);
comment on constraint fk_b_a on b is 'There is a problem on Foreign Key on
Table B related to Table A';

regards
Marcos


Re: Trivial comment fix for tsquerysend()

2025-05-11 Thread Álvaro Herrera
Hello,

On 2025-Mar-06, Emre Hasegeli wrote:

> Patch is attached.

Looks correct to me.  It's harmless obviously (and nobody appears to
have been damaged by it) but I still backpatched to all live branches.
Thanks, Emre.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)




Re: Why our Valgrind reports suck

2025-05-11 Thread Tom Lane
I wrote:
> And, since there's nothing new under the sun around here,
> we already had a discussion about that back in 2021:
> https://www.postgresql.org/message-id/flat/3471359.1615937770%40sss.pgh.pa.us
> That thread seems to have led to fixing some specific bugs,
> but we never committed any of the discussed valgrind infrastructure
> improvements.  I'll have a go at resurrecting that...

Okay, here is a patch series that updates the
0001-Make-memory-contexts-themselves-more-visible-to-valg.patch
patch you posted in that thread, and makes various follow-up
fixes that either fix or paper over various leaks.  Some of it
is committable I think, but other parts are just WIP.  Anyway,
as of the 0010 patch we can run through the core regression tests
and see no more than a couple of kilobytes total reported leakage
in any process, except for two tests that expose leaks in TS
dictionary building.  (That's fixable but I ran out of time,
and I wanted to get this posted before Montreal.)  There is
work left to do before we can remove the suppressions added in
0002, but this is already huge progress compared to where we were.

A couple of these patches are bug fixes that need to be applied and
even back-patched.  In particular, I had not realized that autovacuum
leaks a nontrivial amount of memory per relation processed (cf 0009),
and apparently has done for a few releases now.  This is horrid in
databases with many tables, and I'm surprised that we've not gotten
complaints about it.

regards, tom lane

From d4ca3e2e4824ff1f11ee5324b4ba46c4dd971ce9 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 11 May 2025 12:39:59 -0400
Subject: [PATCH v1 01/11] Improve our support for Valgrind's leak tracking.

When determining whether an allocated chunk is still reachable,
Valgrind will consider only pointers within what it believes to be
allocated chunks.  Normally, all of a block obtained from malloc()
would be considered "allocated" --- but it turns out that if we use
VALGRIND_MEMPOOL_ALLOC to designate sub-section(s) of a malloc'ed
block as allocated, all the rest of that malloc'ed block is ignored.
This leads to lots of false positives of course.  In particular,
in any multi-malloc-block context, all but the primary block were
reported as leaked.  We also had a problem with context "ident"
strings, which were reported as leaked unless there was some other
pointer to them besides the one in the context header.

To fix, we need to use VALGRIND_MEMPOOL_ALLOC to designate
a context's management structs (the context struct itself and
any per-block headers) as allocated chunks.  That forces moving
the VALGRIND_CREATE_MEMPOOL/VALGRIND_DESTROY_MEMPOOL calls into
the per-context-type code, so that the pool identifier can be
made as soon as we've allocated the initial block, but otherwise
it's fairly straightforward.  Note that in Valgrind's eyes there
is no distinction between these allocations and the allocations
that the mmgr modules hand out to user code.  That's fine for
now, but perhaps someday we'll want to do better yet.

Author: Andres Freund 
Co-authored-by: Tom Lane 
Discussion: https://postgr.es/m/285483.1746756...@sss.pgh.pa.us
Discussion: https://postgr.es/m/20210317181531.7oggpqevzz6bk...@alap3.anarazel.de
---
 src/backend/utils/mmgr/aset.c   | 69 -
 src/backend/utils/mmgr/bump.c   | 31 -
 src/backend/utils/mmgr/generation.c | 29 
 src/backend/utils/mmgr/mcxt.c   | 15 ---
 src/backend/utils/mmgr/slab.c   | 32 +
 src/include/utils/memdebug.h|  1 +
 6 files changed, 168 insertions(+), 9 deletions(-)

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 666ecd8f78d..21490213842 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -103,6 +103,8 @@
 
 #define ALLOC_BLOCKHDRSZ	MAXALIGN(sizeof(AllocBlockData))
 #define ALLOC_CHUNKHDRSZ	sizeof(MemoryChunk)
+#define FIRST_BLOCKHDRSZ	(MAXALIGN(sizeof(AllocSetContext)) + \
+			 ALLOC_BLOCKHDRSZ)
 
 typedef struct AllocBlockData *AllocBlock;	/* forward reference */
 
@@ -458,6 +460,21 @@ AllocSetContextCreateInternal(MemoryContext parent,
 	 * we'd leak the header/initial block if we ereport in this stretch.
 	 */
 
+	/* Create a Valgrind "pool" associated with the context */
+	VALGRIND_CREATE_MEMPOOL(set, 0, false);
+
+	/*
+	 * Create a Valgrind "chunk" covering both the AllocSetContext struct and
+	 * the keeper block's header.  (It would be more sensible for these to be
+	 * two separate chunks, but doing that seems to tickle bugs in some
+	 * versions of Valgrind.)  We must have these chunks, and also a chunk for
+	 * each subsequently-added block header, so that Valgrind considers the
+	 * pointers within them while checking for leaked memory.  Note that
+	 * Valgrind doesn't distinguish between these chunks and those created by
+	 * mcxt.c for the user-accessible-data chunks we allocate.
+	 */
+	V

Re: add tab-complete for ALTER DOMAIN ADD...

2025-05-11 Thread Álvaro Herrera
So we should also have tab-completion for ALTER TABLE ADD NOT NULL, as
in the attached.  Any opinions?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)
>From 526f816542b45773a6bb2214c8c23a15075e4b9c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= 
Date: Sun, 11 May 2025 10:45:49 -0400
Subject: [PATCH] Add tab-completion for ALTER TABLE not-nulls

Command is: ALTER TABLE x ADD [CONSTRAINT y] NOT NULL z
---
 src/bin/psql/tab-complete.in.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index ec65ab79fec..a6135f4b6cb 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -2725,17 +2725,24 @@ match_previous_words(int pattern_id,
 	/* ALTER TABLE xxx ADD */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ADD"))
 	{
-		/* make sure to keep this list and the !Matches() below in sync */
+		/*
+		 * make sure to keep this list and the MatchAnyExcept() below in sync
+		 */
 		COMPLETE_WITH("COLUMN", "CONSTRAINT", "CHECK", "UNIQUE", "PRIMARY KEY",
-	  "EXCLUDE", "FOREIGN KEY");
+	  "NOT NULL", "EXCLUDE", "FOREIGN KEY");
 	}
 	/* ALTER TABLE xxx ADD [COLUMN] yyy */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "COLUMN", MatchAny) ||
-			 Matches("ALTER", "TABLE", MatchAny, "ADD", MatchAnyExcept("COLUMN|CONSTRAINT|CHECK|UNIQUE|PRIMARY|EXCLUDE|FOREIGN")))
+			 Matches("ALTER", "TABLE", MatchAny, "ADD", MatchAnyExcept("COLUMN|CONSTRAINT|CHECK|UNIQUE|PRIMARY|NOT|EXCLUDE|FOREIGN")))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_datatypes);
 	/* ALTER TABLE xxx ADD CONSTRAINT yyy */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny))
-		COMPLETE_WITH("CHECK", "UNIQUE", "PRIMARY KEY", "EXCLUDE", "FOREIGN KEY");
+		COMPLETE_WITH("CHECK", "UNIQUE", "PRIMARY KEY", "EXCLUDE", "FOREIGN KEY", "NOT NULL");
+	/* ALTER TABLE xxx ADD NOT NULL */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "NOT", "NULL"))
+		COMPLETE_WITH_ATTR(prev4_wd);
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny, "NOT", "NULL"))
+		COMPLETE_WITH_ATTR(prev6_wd);
 	/* ALTER TABLE xxx ADD [CONSTRAINT yyy] (PRIMARY KEY|UNIQUE) */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "PRIMARY", "KEY") ||
 			 Matches("ALTER", "TABLE", MatchAny, "ADD", "UNIQUE") ||
-- 
2.39.5



Re: add tab-complete for ALTER DOMAIN ADD...

2025-05-11 Thread Álvaro Herrera
On 2025-Apr-29, Dagfinn Ilmari Mannsåker wrote:

> jian he  writes:

> > +   /* ALTER DOMAIN  ADD */
> > +   else if (Matches("ALTER", "DOMAIN", MatchAny, "ADD"))
> > +   COMPLETE_WITH("CONSTRAINT", "NOT NULL", "CHECK");
> 
> I think the completion for CHECK should include the opening paren too,
> since that's required for the expression.

Yeah, we do that elsewhere.

> We could also add completion after CONSTRAINT , like this:
> 
>   else if(Matches("ALTER", "DOMAIN", MatchAny, "ADD", "CONSTRAINT", 
> MatchAny))
>   COMPLETE_WITH("NOT NULL", "CHECK (");

Done that, and pushed.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)




Re: FileFallocate misbehaving on XFS

2025-05-11 Thread Pierre Barre
Hello,

I was running into the same thing, and the fix for me was mounting xfs with
-o inode64.

Best,
Pierre

On Mon, Dec 9, 2024, at 08:34, Michael Harris wrote:
> Hello PG Hackers
>
> Our application has recently migrated to PG16, and we have experienced
> some failed upgrades. The upgrades are performed using pg_upgrade and
> have failed during the phase where the schema is restored into the new
> cluster, with the following error:
>
> pg_restore: error: could not execute query: ERROR:  could not extend
> file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
> FileFallocate(): No space left on device
> HINT:  Check free disk space.
>
> This has happened multiple times on different servers, and in each
> case there was plenty of free space available.
>
> We found this thread describing similar issues:
>
> https://www.postgresql.org/message-id/flat/AS1PR05MB91059AC8B525910A5FCD6E699F9A2%40AS1PR05MB9105.eurprd05.prod.outlook.com
>
> As is the case in that thread, all of the affected databases are using XFS.
>
> One of my colleagues built postgres from source with
> HAVE_POSIX_FALLOCATE not defined, and using that build he was able to
> complete the pg_upgrade, and then switched to a stock postgres build
> after the upgrade. However, as you might expect, after the upgrade we
> have experienced similar errors during regular operation. We make
> heavy use of COPY, which is mentioned in the above discussion as
> pre-allocating files.
>
> We have seen this on both Rocky Linux 8 (kernel 4.18.0) and Rocky
> Linux 9 (Kernel 5.14.0).
>
> I am wondering if this bug might be related:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791323
>
>> When given an offset of 0 and a length, fallocate (man 2 fallocate) reports 
>> ENOSPC if the size of the file + the length to be allocated is greater than 
>> the available space.
>
> There is a reproduction procedure at the bottom of the above ubuntu
> thread, and using that procedure I get the same results on both kernel
> 4.18.0 and 5.14.0.
> When calling fallocate with offset zero on an existing file, I get
> enospc even if I am only requesting the same amount of space as the
> file already has.
> If I repeat the experiment with ext4 I don't get that behaviour.
>
> On a surface examination of the code paths leading to the
> FileFallocate call, it does not look like it should be trying to
> allocate already allocated space, but I might have missed something
> there.
>
> Is this already being looked into?
>
> Thanks in advance,
>
> Cheers
> Mike
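
The reproduction procedure mentioned above boils down to something like the
following sketch (assuming Linux; run it against an existing file on a
nearly-full XFS mount):

```c
/*
 * Sketch of the reported misbehavior: re-request, at offset 0, exactly
 * the space an existing file already occupies.  On the affected XFS
 * kernels this reportedly fails with ENOSPC when free space on the
 * filesystem is smaller than the file, even though no new blocks should
 * be needed; on ext4 it succeeds.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "testfile";
	int			fd = open(path, O_RDWR);
	struct stat st;
	int			rc;

	if (fd < 0 || fstat(fd, &st) < 0)
	{
		perror("open/fstat");
		return 1;
	}

	/* posix_fallocate returns an error number directly, not via errno */
	rc = posix_fallocate(fd, 0, st.st_size);
	if (rc != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
	else
		printf("posix_fallocate succeeded\n");

	close(fd);
	return 0;
}
```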