Re: Add last_commit_lsn to pg_stat_database

2024-05-28 Thread James Coleman
On Mon, Feb 19, 2024 at 3:56 PM Michael Paquier  wrote:
>
> On Mon, Feb 19, 2024 at 10:26:43AM +0100, Tomas Vondra wrote:
> > On 2/19/24 07:57, Michael Paquier wrote:
> > > On Sun, Feb 18, 2024 at 02:28:06AM +0100, Tomas Vondra wrote:
> >>> 1) Do we really need to modify RecordTransactionCommitPrepared and
> >>> XactLogCommitRecord to return the LSN of the commit record? Can't we
> >>> just use XactLastRecEnd? It's good enough for advancing
> >>> replorigin_session_origin_lsn, why wouldn't it be good enough for this?
> >>
> >> Or XactLastCommitEnd?
> >
> > But that's not set in RecordTransactionCommitPrepared (or twophase.c in
> > general), and the patch seems to need that.
>
> Hmm.  Okay.
>
> > > The changes in AtEOXact_PgStat() are not really
> > > attractive for what's a static variable in all the backends.
> >
> > I'm sorry, I'm not sure what "changes not being attractive" means :-(
>
> It just means that I am not much of a fan of the signature changes done
> for RecordTransactionCommit() and AtEOXact_PgStat_Database(), adding a
> dependency to a specific LSN.  Your suggestion of switching to
> XactLastRecEnd should avoid that.

Attached is an updated patch that uses XactLastCommitEnd and therefore
avoids all of those signature changes.

We can't use XactLastRecEnd because it gets set to 0 immediately
after RecordTransactionCommit() sets XactLastCommitEnd to its value.

I added a test for two-phase commit, as well, and in so doing I
discovered something that I found curious: when I do "COMMIT PREPARED
't1'", not only does RecordTransactionCommitPrepared() get called, but
eventually RecordTransactionCommit() is called as well before the
command returns (despite the comments for those functions describing
them as two equivalents you need to change at the same time).

So it appears that we don't need to set XactLastCommitEnd in
RecordTransactionCommitPrepared() for this to work, and, indeed,
adding some logging to verify, the value of XactLastRecEnd we'd use to
update XactLastCommitEnd is the same at the end of both of those
functions during COMMIT PREPARED.

I'd like to have someone weigh in on whether relying on
RecordTransactionCommit() being called during COMMIT PREPARED is
correct or not (if perchance there are non-obvious reasons why we
shouldn't).
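
For anyone wanting to reproduce the observation, the two-phase sequence
the new test exercises looks roughly like this (last_commit_lsn is the
column this patch adds; requires max_prepared_transactions > 0, and the
table name is illustrative):

create table t(i int);
begin;
insert into t values (1);
prepare transaction 't1';
commit prepared 't1';
select last_commit_lsn from pg_stat_database
where datname = current_database();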

Regards,
James Coleman


v4-0001-Add-last-commit-s-LSN-to-pg_stat_database.patch
Description: Binary data


Re: Add last_commit_lsn to pg_stat_database

2024-05-28 Thread James Coleman
On Thu, Mar 7, 2024 at 10:30 AM Heikki Linnakangas  wrote:
>
> > I've previously noted in "Add last commit LSN to
> > pg_last_committed_xact()" [1] that it's not possible to monitor how
> > many bytes of WAL behind a logical replication slot is (computing such
> > is obviously trivial for physical slots) because the slot doesn't need
> > to replicate beyond the last commit. In some cases it's possible for
> > the current WAL location to be far beyond the last commit. A few
> > examples:
> >
> > - An idle server with checkpoint_timeout at a lower value (so lots of
> > empty WAL advance).
> > - Very large transactions: particularly insidious because committing a
> > 1 GB transaction after a small transaction may show almost zero time
> > lag even though quite a bit of data needs to be processed and sent
> > over the wire (so time to replay is significantly different from
> > current lag).
> > - A cluster with multiple databases complicates matters further,
> > because while physical replication is cluster-wide, the LSNs that
> > matter for logical replication are database specific.
> >
> >
> > Since we don't expose the most recent commit's LSN there's no way to
> > say "the WAL is currently 1250, the last commit happened at 1000, the
> > slot has flushed up to 800, therefore there are at most 200 bytes
> > replication needs to read through to catch up."
>
> I'm not sure I fully understand the problem. What are you doing
> currently to measure the lag? If you look at pg_replication_slots today,
> confirmed_flush_lsn advances also when you do pg_switch_wal(), so
> looking at the diff between confirmed_flush_lsn and pg_current_wal_lsn()
> works, no?

No, it's not that simple because of the "large, uncommitted
transaction" case I noted in the OP. Suppose I have a batch system,
and a job in that system has written 1 GB of data but not yet
committed (or rolled back). In this case confirmed_flush_lsn cannot
advance, correct?

And so having a "last commit LSN" allows me to know how far back the
"last possibly replicatable write" is, which bounds how far behind the
slot can actually be.

> And on the other hand, even if you expose the database's last commit
> LSN, you can have a publication that includes only a subset of tables.
> Or commits that don't write to any table at all. So I'm not sure why the
> database's last commit LSN is relevant. Getting the last LSN that did
> something that needs to be replicated through the publication might be
> useful, but that's not what this patch does.

I think that's fine, because as you noted earlier the
confirmed_flush_lsn could advance beyond that point anyway (if there's
nothing to replicate), but in the case where:

1. confirmed_flush_lsn is not advancing, and
2. last_commit_lsn is not advancing, and
3. pg_current_wal_lsn() has advanced a lot

then you can probably infer that there's a large amount of data that
simply can't yet be consumed by the subscriber, and so there's no "real"
delay. It also gives you an idea of how much data you will need to
churn through (even if not replicated) once the transaction commits.
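
As a sketch, the proposed column would let all three signals be read in
one query (last_commit_lsn is the addition proposed here; the other
columns and functions all exist today):

select s.slot_name,
       pg_current_wal_lsn() as current_lsn,
       d.last_commit_lsn,
       s.confirmed_flush_lsn,
       pg_wal_lsn_diff(d.last_commit_lsn, s.confirmed_flush_lsn)
         as max_replicatable_bytes
from pg_replication_slots s
join pg_stat_database d on d.datname = s.database
where s.slot_type = 'logical';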

Certainly understanding the data here will be simplest in the case
where 1.) there's a single database and 2.) all tables are in the
replication set, but I don't think the value is limited to that
situation either.

Regards,
James Coleman




Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-28 Thread James Coleman
On Fri, May 17, 2024 at 9:59 AM Robert Haas  wrote:
>
> On Fri, May 17, 2024 at 11:05 AM Tom Lane  wrote:
> > > We already have gone back to that model. We just haven't admitted it
> > > yet. And we're never going to get out of it until we find a way to get
> > > the contents of the CommitFest application down to a more reasonable
> > > size and level of complexity. There's just no way everyone's up for
> > > that level of pain. I'm not sure not up for that level of pain.
> >
> > Yeah, we clearly need to get the patch list to a point of
> > manageability, but I don't agree that abandoning time-boxed CFs
> > will improve anything.
>
> I'm not sure. Suppose we plotted commits generally, or commits of
> non-committer patches, or reviews on-list, vs. time. Would we see any
> uptick in activity during CommitFests? Would it vary by committer? I
> don't know. I bet the difference wouldn't be as much as Tom Lane would
> like to see. Realistically, we can't manage how contributors spend
> their time all that much, and trying to do so is largely tilting at
> windmills.
>
> To me, the value of time-based CommitFests is as a vehicle to ensure
> freshness, or cleanup, or whatever word you want to use. If you just
> make a list of things that need attention and keep incrementally
> updating it, eventually for various reasons that list no longer
> reflects your current list of priorities. At some point, you have to
> throw it out and make a new list, or at least that's what always
> happens to me. We've fallen into a system where the default treatment
> of a patch is to be carried over to the next CommitFest and CfMs are
> expected to exert effort to keep patches from getting that default
> treatment when they shouldn't. But that does not scale. We need a
> system where things drop off the list unless somebody makes an effort
> to keep them on the list, and the easiest way to do that is to
> periodically make a *fresh* list that *doesn't* just clone some
> previous list.
>
> I realize that many people here are (rightly!) concerned with
> burdening patch authors with more steps that they have to follow. But
> the current system is serving new patch authors very poorly. If they
> get attention, it's much more likely to be because somebody saw their
> email and wrote back than it is to be because somebody went through
> the CommitFest and found their entry and was like "oh, I should review
> this". Honestly, if we get to a situation where a patch author is sad
> because they have to click a link every 2 months to say "yeah, I'm
> still here, please review my patch," we've already lost the game. That
> person isn't sad because we asked them to click a link. They're sad
> it's already been N * 2 months and nothing has happened.

Yes, this is exactly right.

This is an off-the-wall idea, but what if the inverse is actually what
we need? Suppose there's been a decent amount of activity previously
on the thread, but no new patch version or CF app activity (e.g.,
status changes moving it forward) or maybe even just the emails died
off: perhaps that should trigger a question to the author to see what
they want the status to be -- i.e., "is this still 'needs review', or
is it really 'waiting on author' or 'not my priority right now'?"

It seems possible to me that that would actually remove a lot of the
patches from the current CF when an author simply hasn't had time to
respond yet (I know this is the case for me because the time I have to
work on patches fluctuates significantly), but it might also serve to
highlight patches that simply haven't had any review at all.

I'd like to add a feature to the CF app that shows me my current
patches by status, and I'd also like to have the option to have the CF
app notify me when someone changes the status (I've noticed before
that often a status gets changed without notification on list, and
then I get surprised months later when it's stuck in "waiting on
author"). Do either/both of those seem reasonable to add?

Regards,
James Coleman




Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-28 Thread James Coleman
On Thu, May 16, 2024 at 4:00 PM Joe Conway  wrote:
>
> On 5/16/24 17:36, Jacob Champion wrote:
> > On Thu, May 16, 2024 at 2:29 PM Joe Conway  wrote:
> >> If no one, including the author (new or otherwise) is interested in
> >> shepherding a particular patch, what chance does it have of ever getting
> >> committed?
> >
> > That's a very different thing from what I think will actually happen, which is
> >
> > - new author posts patch
> > - community member says "use commitfest!"
>
> Here is where we should point them at something that explains the care
> and feeding requirements to successfully grow a patch into a commit.
>
> > - new author registers patch
> > - no one reviews it
> > - patch gets automatically booted
>
> Part of the care and feeding instructions should be a warning regarding
> what happens if you are unsuccessful in the first CF and still want to
> see it through.
>
> > - community member says "register it again!"
> > - new author says ಠ_ಠ
>
> As long as this is not a surprise ending, I don't see the issue.

I've experienced this in another large open-source project that runs
on Github, not mailing lists, and here's how it goes:

1. I open a PR with a small bugfix (test case included).
2. PR is completely ignored by committers (presumably because they all
mostly work on their own projects they're getting paid to do).
3. <3 months goes by>
4. I may get a comment with "please rebase!", or, more frequently, a
bot closes the issue.

That cycle is _infuriating_ as a contributor. As much as I don't like
to hear "we're not doing this", I'd far prefer that outcome to some
automated process closing out my submission without my input when, as
far as I can tell, the real problem is not my lack of activity but the
required reviewers simply not looking at it.

So I'm genuinely confused when you say "As long as this is not a
surprise ending, I don't see the issue." Perhaps we're imagining
something different here, reading between the lines?

Regards,
James Coleman




Re: Teach predtest about IS [NOT] proofs

2024-04-05 Thread James Coleman
On Mon, Apr 1, 2024 at 8:06 AM James Coleman  wrote:
>
> On Mon, Mar 25, 2024 at 5:53 PM Tom Lane  wrote:
> >
> > James Coleman  writes:
> > > [ v6 patchset ]
> >
> > I went ahead and committed 0001 after one more round of review
> >
> > statements; my bad).  I also added the changes in test_predtest.c from
> > 0002.  I attach a rebased version of 0002, as well as 0003 which isn't
> > changed, mainly to keep the cfbot happy.
> >
> > I'm still not happy with what you did in predicate_refuted_by_recurse:
> > it feels wrong and rather expensively so.  There has to be a better
> > way.  Maybe strong vs. weak isn't quite the right formulation for
> > refutation tests?
>
> Possibly. Earlier I'd mused that:
>
> > Alternatively (to avoid unnecessary CPU burn) we could modify
> > predicate_implied_by_recurse (and functions called by it) to have an
> > argument beyond "weak = true/false", e.g., an enum that allows for
> > something like "WEAK", "STRONG", and "EITHER". That's a bigger change,
> > so I didn't want to do that right away unless there was agreement on
> > that direction.
>
> I'm going to try implementing that and see how I feel about what it
> looks like in practice.

Attached is v8 which does this. Note that I kept the patch 0001 as
before and inserted a new 0002 to show exactly what's changed from the
previous version -- I wouldn't expect that to be committed
separately, of course. With this change we only need to recurse a
single time and can check for both strong and weak refutation when
either will do for proving refutation of the "NOT x" construct.

Regards,
James Coleman


v8-0001-Teach-predtest.c-about-BooleanTest.patch
Description: Binary data


v8-0003-Add-temporary-all-permutations-test.patch
Description: Binary data


v8-0002-Recurse-weak-and-strong-implication-at-the-same-t.patch
Description: Binary data


Re: Teach predtest about IS [NOT] proofs

2024-04-01 Thread James Coleman
On Mon, Mar 25, 2024 at 5:53 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > [ v6 patchset ]
>
> I went ahead and committed 0001 after one more round of review
>
> statements; my bad).  I also added the changes in test_predtest.c from
> 0002.  I attach a rebased version of 0002, as well as 0003 which isn't
> changed, mainly to keep the cfbot happy.
>
> I'm still not happy with what you did in predicate_refuted_by_recurse:
> it feels wrong and rather expensively so.  There has to be a better
> way.  Maybe strong vs. weak isn't quite the right formulation for
> refutation tests?

Possibly. Earlier I'd mused that:

> Alternatively (to avoid unnecessary CPU burn) we could modify
> predicate_implied_by_recurse (and functions called by it) to have an
> argument beyond "weak = true/false", e.g., an enum that allows for
> something like "WEAK", "STRONG", and "EITHER". That's a bigger change,
> so I didn't want to do that right away unless there was agreement on
> that direction.

I'm going to try implementing that and see how I feel about what it
looks like in practice.

Regards,
James Coleman




Re: Teach predtest about IS [NOT] proofs

2024-04-01 Thread James Coleman
On Mon, Mar 25, 2024 at 11:45 PM Tom Lane  wrote:
>
> I wrote:
> > I went ahead and committed 0001 after one more round of review
> >
> > statements; my bad).  I also added the changes in test_predtest.c from
> > 0002.  I attach a rebased version of 0002, as well as 0003 which isn't
> > changed, mainly to keep the cfbot happy.
>
> [ squint.. ]  Apparently I managed to hit ^K right before sending this
> email.  The missing line was meant to be more or less
>
> > which found a couple of missing "break"
>
> Not too important, but perhaps future readers of the archives will
> be confused.

I was wondering myself :) so thanks for clarifying.

Regards,
James Coleman




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2024-03-23 Thread James Coleman
On Thu, Mar 21, 2024 at 1:09 PM Robert Haas  wrote:
>
> On Thu, Mar 14, 2024 at 9:07 PM James Coleman  wrote:
> > If the goal here is the most minimal patch possible, then please
> > commit what you proposed. I am interested in improving the document
> > further, but I don't know how to do that easily if the requirement is
> > effectively "must only change one specific detail at a time".
>
> So, yesterday I wrote a long email on how I saw the goals here.
> Despite our disagreements, I believe we agree that the text I proposed
> is better than what's there, so I've committed that change now. I've
> also marked the CF entry as committed. Please propose the other
> changes you want separately.

Thanks for committing the fix.

Regards,
James Coleman




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2024-03-23 Thread James Coleman
On Wed, Mar 20, 2024 at 2:15 PM Robert Haas  wrote:
>
> On Thu, Mar 14, 2024 at 9:07 PM James Coleman  wrote:
> > Obviously I have reasons for the other changes I made: for example,
> > "no longer visible" improves the correctness, since being an old
> > version isn't sufficient. I removed the "In summary" sentence because
> > it simply doesn't follow from the prose before it. That sentence
> > simply restates information already appearing earlier in almost as
> > simple a form, so it's redundant. But more importantly it's just not
> > actually a summary of the text before it, so removing it improves the
> > documentation.
> >
> > I can explain my reasoning further if desired, but I fear it would
> > simply frustrate you further, so I'll stop here.
> >
> > If the goal here is the most minimal patch possible, then please
> > commit what you proposed. I am interested in improving the document
> > further, but I don't know how to do that easily if the requirement is
> > effectively "must only change one specific detail at a time". So, that
> > leaves me feeling a bit frustrated also.
>
> I don't think the goal is to have the most minimal patch possible, but
> at the same time, I'm responsible for what I commit. If I commit a
> patch that changes something and somebody, especially another
> committer, writes an email and says "hey, why the heck did you change
> this, you dummy?" then I need to be able to answer that question, and
> saying "uh, well, James Coleman had it in his patch and he didn't
> explain why it was there and I didn't know either but I just kind of
> committed it anyway" is not where I want to be. Almost without
> exception, it's not the patch author who gets asked why they included
> a certain thing in the patch; it's the committer who gets asked why it
> got committed that way -- and at least as I see it, if there's not a
> good answer to that question, it's the committer who gets judged, not
> the patch author. In some cases, such questions and judgements arrive
> without warning YEARS after the commit.
>
> So, to protect myself against questions from other hackers that end,
> at least implicitly, in "you dummy," I try to make sure there's an
> adequate paper trail for everything I commit. Between the commit
> message, the comments, and the email discussion, it needs to be
> crystal clear why something was thought to be a good idea at the time.
> Hopefully, it will be clear enough that I never even get a question,
> because the reader will be able to understand ON THEIR OWN why
> something was done and either go "oh, that make sense" or "well, I get
> why they did that, but I disagree because $REASON" ... but if that
> doesn't quite work out, then I hope that at the very least the paper
> trail will be good enough that I can reconstruct the reasoning if
> needed. For a recent example of a case where I clearly failed do a
> good enough job to keep someone from asking a "you dummy" question,
> see the http://postgr.es/m/6030bdec-0de1-4da2-b0b3-335007eae...@iki.fi
> and in particular the paragraph that starts with "However".
>
> Therefore, if I realize when reading the patch that it contains
> odd-looking changes which I can't relate to the stated goal, it's
> getting bounced, pretty much 100% of the time. If it's a small patch
> and I really care about it and it's a new submitter who doesn't know
> any better, I might choose to remove those changes myself and commit
> the rest; and if I like those changes for other reasons, I might
> choose to commit them separately. In any other situation, I'm just
> going to tell the submitter that they need to take out the unrelated
> changes and move on to the next email thread.
>
> In this particular situation, I see that this whole discussion is
> slightly moot by virtue of 7a9328e8e40534fb4de41b4ac152e3c6989d84e7
> (Jeff Davis, 2024-03-05) which removed the "In summary, heap-only
> tuple updates can only be created if columns used by indexes are not
> updated" sentence that you also removed. But there is an important
> difference between that patch and yours, at least IMHO. You argue that
> the sentence in question didn't flow from what precedes it, but I
> don't agree with that as a rationale; I think that sentence flows
> perfectly well from what precedes it and is a generally adequate
> summary of the point of the section. That we disagree on the merits of
> that sentence is fine; there's no hope that we're all going to agree
> on everything. What makes me frustrated is that you didn't provide
> that rationale, which greatly complicates the whole discussion.
>

Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2024-03-14 Thread James Coleman
On Thu, Mar 14, 2024 at 10:28 AM Robert Haas  wrote:
>
> On Wed, Oct 4, 2023 at 9:12 PM James Coleman  wrote:
> > All right, attached is a v3 which attempts to fix the wrong
> > information with an economy of words. I may at some point submit a
> > separate patch that adds a broader pruning section, but this at least
> > brings the docs inline with reality insofar as they address it.
>
> I don't think this is as good as what I proposed back on October 2nd.
> IMHO, that version does a good job making the text accurate and clear,
> and is directly responsive to your original complaint, namely, that
> the root of the HOT chain can't be removed. But this version seems to
> contain a number of incidental changes that are unrelated to that
> point, e.g. "old versions" -> "old, no longer visible versions", "can
> be completely removed" -> "may be pruned", and the removal of the
> sentence "In summary, heap-only tuple updates can only be created - if
> columns used by indexes are not updated" which AFAICT is both
> completely correct as-is and unrelated to the original complaint.
>
> Maybe I shouldn't be, but I'm slightly frustrated here. I thought I
> had proposed an alternative which you found acceptable, but then you
> proposed several more versions that did other things instead, and I
> never really understood why we couldn't just adopt the version that
> you seemed to think was OK. If there's a problem with that, say what
> it is. If there's not, let's do that and move on.

I think there's simply a misunderstanding here. I read your proposal
as "here's an idea to consider as you work on the patch" (as happens
on many other threads), and so I attempted to incorporate your primary
points of feedback into my next version of the patch.

Obviously I have reasons for the other changes I made: for example,
"no longer visible" improves the correctness, since being an old
version isn't sufficient. I removed the "In summary" sentence because
it simply doesn't follow from the prose before it. That sentence
simply restates information already appearing earlier in almost as
simple a form, so it's redundant. But more importantly it's just not
actually a summary of the text before it, so removing it improves the
documentation.

I can explain my reasoning further if desired, but I fear it would
simply frustrate you further, so I'll stop here.

If the goal here is the most minimal patch possible, then please
commit what you proposed. I am interested in improving the document
further, but I don't know how to do that easily if the requirement is
effectively "must only change one specific detail at a time". So, that
leaves me feeling a bit frustrated also.

Regards,
James Coleman




Re: RFC: Logging plan of the running query

2024-03-02 Thread James Coleman
On Wed, Feb 28, 2024 at 1:18 AM Robert Haas  wrote:
>
> On Mon, Feb 26, 2024 at 5:31 PM torikoshia  wrote:
> > It would be nice if there was a place accessed once every few seconds or
> > so..
>
> I think this comment earlier from Andres deserves close attention:
>
> # If we went with something like that approach, I think we'd have to do 
> something
> # like redirecting node->ExecProcNode to a wrapper, presumably from within a
> # CFI. That wrapper could then implement the explain support, without slowing
> # down the normal execution path.
>
> If this is correctly implemented, the overhead in the case where the
> feature isn't used should be essentially zero, I believe.

If I can rephrase this idea: it's basically "delay this interrupt
until inline to the next ExecProcNode execution".

That seems pretty promising to me as well.

Regards,
James Coleman




Re: RFC: Logging plan of the running query

2024-02-24 Thread James Coleman
On Fri, Feb 23, 2024 at 10:23 AM Robert Haas  wrote:
>
> On Fri, Feb 23, 2024 at 7:50 PM Julien Rouhaud  wrote:
> > On Fri, Feb 23, 2024 at 10:22:32AM +0530, Robert Haas wrote:
> > > On Thu, Feb 22, 2024 at 6:25 AM James Coleman  wrote:
> > > > This is potentially a bit of a wild idea, but I wonder if having some
> > > > kind of argument to CHECK_FOR_INTERRUPTS() signifying we're in
> > > > "normal" as opposed to "critical" (using that word differently than
> > > > the existing critical sections) would be worth it.
> > >
> > > It's worth considering, but the definition of "normal" vs. "critical"
> > > might be hard to pin down. Or, we might end up with a definition that
> > > is specific to this particular case and not generalizable to others.
> >
> > But it doesn't have to be all or nothing right?  I mean each call could say
> > what the situation is like in their context, like
> > CHECK_FOR_INTERRUPTS(GUARANTEE_NO_HEAVYWEIGHT_LOCK | GUARANTEE_WHATEVER), 
> > and
> > slowly tag calls as needed, similarly to how we add already CFI based on 
> > users
> > report.
>
> Absolutely. My gut feeling is that it's going to be simpler to pick a
> small number of places that are safe and sufficient for this
> particular feature and add an extra call there, similar to how we do
> vacuum_delay_point(). The reason I think that's likely to be better is
> that it will likely require changing only a relatively small number of
> places. If we instead start annotating CFIs, well, we've got hundreds
> of those. That's a lot more to change, and it also inconveniences
> third-party extension authors and people doing back-patching. I'm not
> here to say it can't work; I just think it's likely not the easiest
> path.

Yes, I suspect it's not the easiest path. I have a small hunch it
might end up paying more dividends in the future: there isn't just one
of these things that is regularly a thorny discussion for the same
reasons each time (basically "no way to trigger this safely from
another backend interrupting another one at an arbitrary point"), and
if we were able to generalize a solution we may have multiple wins (a
very obvious one in my mind is the inability of auto explain to run an
explain at the precise time it's most useful: when statement timeout
fires).

But it's also possible there are simpler ways that get us more than
this scenario also, so I might be wrong; it's merely a hunch.

Regards,
James Coleman




Re: RFC: Logging plan of the running query

2024-02-21 Thread James Coleman
On Mon, Feb 19, 2024 at 11:53 PM Robert Haas  wrote:
>
> On Fri, Feb 16, 2024 at 12:29 AM Andres Freund  wrote:
> > If we went with something like that approach, I think we'd have to do 
> > something
> > like redirecting node->ExecProcNode to a wrapper, presumably from within a
> > CFI. That wrapper could then implement the explain support, without slowing
> > down the normal execution path.
>
> That's an annoying complication; maybe there's some better way to
> handle this. But I think we need to do something different than what
> the patch does currently because...
>
> > > It's really hard for me to accept that the heavyweight lock problem
> > > for which the patch contains a workaround is the only one that exists.
> > > I can't see any reason why that should be true.
> >
> > I suspect you're right.
>
> ...I think the current approach is just plain dead, because of this
> issue. We can't take an approach that creates an unbounded number of
> unclear reentrancy issues and then insert hacks one by one to cure
> them (or hack around them, more to the point) as they're discovered.
>
> The premise has to be that we only allow logging the query plan at
> points where we know it's safe, rather than, as at present, allowing
> it in places that are unsafe and then trying to compensate with code
> elsewhere. That's not likely to ever be as stable as we want
> PostgreSQL to be.

This is potentially a bit of a wild idea, but I wonder if having some
kind of argument to CHECK_FOR_INTERRUPTS() signifying we're in
"normal" as opposed to "critical" (using that word differently than
the existing critical sections) would be worth it.

Regards,
James Coleman




Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2024-02-08 Thread James Coleman
On Thu, Feb 8, 2024 at 4:47 AM Ashutosh Bapat
 wrote:
>
> On Thu, Feb 8, 2024 at 9:57 AM Laurenz Albe  wrote:
> >
> > On Thu, 2024-02-08 at 13:40 +1100, Peter Smith wrote:
> > > -   how to set the replica identity.  If a table without a replica 
> > > identity is
> > > +   how to set the replica identity.  If a table without a replica 
> > > identity
> > > +   (or with replica identity behavior the same as 
> > > NOTHING) is
> > > added to a publication that replicates UPDATE
> > > or DELETE operations then
> > > subsequent UPDATE or DELETE
> >
> > I had the impression that the root of the confusion was the perceived 
> > difference
> > between "REPLICA IDENTITY NOTHING" and "no replica identity", and that 
> > change
> > doesn't improve that.
> >
> > How about:
> >
> >   If a table without a replica identity (explicitly set to 
> > NOTHING,
> >   or set to a primary key or index that doesn't exist) is added ...
>
> Another possibility is just to improve the documentation of various
> options as follows.
>
> DEFAULT
>
> If there is a primary key, record the old values of the columns of the
> primary key. Otherwise it acts as NOTHING. This is the default for
> non-system tables.
>
> USING INDEX index_name
>
> Records the old values of the columns covered by the named index, that
> must be unique, not partial, not deferrable, and include only columns
> marked NOT NULL. If this index is dropped, the behavior is the same as
> NOTHING.
>
> FULL
>
> Records the old values of all columns in the row.
>
> NOTHING
>
> Records no information about the old row. This is equivalent to having
> no replica identity. This is the default for system tables.

This is the simplest change, and it does solve the confusion, so I'd
be happy with it also. The other proposals have the benefit of having
all the information necessary on the publications page rather than
requiring the user to refer to the ALTER TABLE REPLICA IDENTITY page
to understand what's meant.

Regards,
James Coleman




Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2024-02-08 Thread James Coleman
On Wed, Feb 7, 2024 at 11:27 PM Laurenz Albe  wrote:
>
> On Thu, 2024-02-08 at 13:40 +1100, Peter Smith wrote:
> > -   how to set the replica identity.  If a table without a replica identity 
> > is
> > +   how to set the replica identity.  If a table without a replica identity
> > +   (or with replica identity behavior the same as 
> > NOTHING) is
> > added to a publication that replicates UPDATE
> > or DELETE operations then
> > subsequent UPDATE or DELETE
>
> I had the impression that the root of the confusion was the perceived 
> difference
> between "REPLICA IDENTITY NOTHING" and "no replica identity", and that change
> doesn't improve that.
>
> How about:
>
>   If a table without a replica identity (explicitly set to 
> NOTHING,
>   or set to a primary key or index that doesn't exist) is added ...

I think that would work also. I was reading the initial suggestion as
"(or with replica identity behavior the same as..." as defining what
"without a replica identity" meant, which would avoid the confusion.
But your proposal is more explicit and more succinct, so I think it's
the better option of the two.

Regards,
James Coleman




Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2024-02-07 Thread James Coleman
On Wed, Feb 7, 2024 at 6:04 PM Peter Smith  wrote:
>
> On Thu, Feb 8, 2024 at 9:04 AM James Coleman  wrote:
> >
> > On Wed, Feb 7, 2024 at 3:22 PM Laurenz Albe  
> > wrote:
> > >
> > > On Wed, 2024-02-07 at 15:12 -0500, James Coleman wrote:
> > > > We recently noticed some behavior that seems reasonable but also
> > > > surprised our engineers based on the docs.
> > > >
> > > > If we have this setup:
> > > > create table items(i int);
> > > > insert into items(i) values (1);
> > > > create publication test_pub for all tables;
> > > >
> > > > Then when we:
> > > > delete from items where i = 1;
> > > >
> > > > we get:
> > > > ERROR:  cannot delete from table "items" because it does not have a
> > > > replica identity and publishes deletes
> > > > HINT:  To enable deleting from the table, set REPLICA IDENTITY using
> > > > ALTER TABLE.
> > > >
> > > > Fair enough. But if we do this:
> > > > alter table items replica identity nothing;
> > > >
> > > > because the docs [1] say that NOTHING means "Records no information
> > > > about the old row." We still get the same error when we try the DELETE
> > > > again.
> > >
> > > Well, "REPLICA IDENTITY NOTHING" is the same as "has no replica identity".
> > > So is "REPLICA IDENTITY DEFAULT" if there is no primary key, or
> > > "REPLICA IDENTITY USING INDEX ..." if the index is dropped.
> > >
> > > See "pg_class": the column "relreplident" is not nullable.
> >
> > Right, I think the confusing point for us is that the docs for NOTHING
> > ("Records no information about the old row") imply you can decide you
> > don't have to record anything if you don't want to do so, but the
> > publication feature is effectively overriding that and asserting that
> > you can't make that choice.
> >
>
> Hi, I can see how the current docs could be interpreted in a way that
> was not intended.
>
> ~~~
>
> To emphasise the DEFAULT behaviour that Laurenze described, I felt
> there could be another sentence about DEFAULT, the same as there is
> already for the USING INDEX case.
>
> BEFORE [1]
> Records the old values of the columns of the primary key, if any. This
> is the default for non-system tables.
>
> SUGGESTION
> Records the old values of the columns of the primary key, if any. This
> is the default for non-system tables. If there is no primary key, the
> behavior is the same as NOTHING.
>
> ~~~
>
> If that is done, then would a publication docs tweak like the one
> below clarify things sufficiently?
>
> BEFORE [2]
> If a table without a replica identity is added to a publication that
> replicates UPDATE or DELETE operations then subsequent UPDATE or
> DELETE operations will cause an error on the publisher.
>
> SUGGESTION
> If a table without a replica identity (or with replica identity
> behavior equivalent to NOTHING) is added to a publication that
> replicates UPDATE or DELETE operations then subsequent UPDATE or
> DELETE operations will cause an error on the publisher.
>
> ==
> [1] 
> https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY
> [2] 
> https://www.postgresql.org/docs/current/logical-replication-publication.html
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia

Thanks for looking at this!

Yes, both of those changes together would make this unambiguous (and,
I think, easier to mentally parse).

Thanks,
James Coleman




Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2024-02-07 Thread James Coleman
On Wed, Feb 7, 2024 at 3:22 PM Laurenz Albe  wrote:
>
> On Wed, 2024-02-07 at 15:12 -0500, James Coleman wrote:
> > We recently noticed some behavior that seems reasonable but also
> > surprised our engineers based on the docs.
> >
> > If we have this setup:
> > create table items(i int);
> > insert into items(i) values (1);
> > create publication test_pub for all tables;
> >
> > Then when we:
> > delete from items where i = 1;
> >
> > we get:
> > ERROR:  cannot delete from table "items" because it does not have a
> > replica identity and publishes deletes
> > HINT:  To enable deleting from the table, set REPLICA IDENTITY using
> > ALTER TABLE.
> >
> > Fair enough. But if we do this:
> > alter table items replica identity nothing;
> >
> > because the docs [1] say that NOTHING means "Records no information
> > about the old row." We still get the same error when we try the DELETE
> > again.
>
> Well, "REPLICA IDENTITY NOTHING" is the same as "has no replica identity".
> So is "REPLICA IDENTITY DEFAULT" if there is no primary key, or
> "REPLICA IDENTITY USING INDEX ..." if the index is dropped.
>
> See "pg_class": the column "relreplident" is not nullable.

Right, I think the confusing point for us is that the docs for NOTHING
("Records no information about the old row") imply you can decide you
don't have to record anything if you don't want to do so, but the
publication feature is effectively overriding that and asserting that
you can't make that choice.
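
Incidentally, the catalog makes this concrete: relreplident always
holds one of 'd' (default), 'n' (nothing), 'f' (full), or 'i' (index),
so after the ALTER TABLE above you can confirm the setting was recorded
even though it behaves as "no replica identity":

select relname, relreplident from pg_class where relname = 'items';
-- returns 'n' for relreplident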

Regards,
James Coleman




Question about behavior of deletes with REPLICA IDENTITY NOTHING

2024-02-07 Thread James Coleman
Hello,

We recently noticed some behavior that seems reasonable but also
surprised our engineers based on the docs.

If we have this setup:
create table items(i int);
insert into items(i) values (1);
create publication test_pub for all tables;

Then when we:
delete from items where i = 1;

we get:
ERROR:  cannot delete from table "items" because it does not have a
replica identity and publishes deletes
HINT:  To enable deleting from the table, set REPLICA IDENTITY using
ALTER TABLE.

Fair enough. But if we do this:
alter table items replica identity nothing;

because the docs [1] say that NOTHING means "Records no information
about the old row." We still get the same error when we try the DELETE
again.
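
For contrast, configuring an identity that actually records old rows
makes the DELETE publishable again, which reinforces that the error is
about having a usable identity rather than about whether any setting
has been chosen:

alter table items replica identity full;
delete from items where i = 1;  -- succeeds; the whole old row is logged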

The publication docs [2] say "A published table must have a replica
identity configured in order to be able to replicate UPDATE and DELETE
operations, so that appropriate rows to update or delete can be
identified on the subscriber side."

We interpreted the intersection of these two docs to imply that if you
explicitly configured NOTHING that the publication would simply not
log anything about the original row. Part of the confusion I think was
fed by reading "must have a replica identity set" as "have selected
one of the options via ALTER TABLE REPLICA IDENTITY" -- i.e., as
meaning that a setting has been configured rather than being about a
subset of those possible configuration values/a specific key existing
on the table.

I'm wondering if this might be a surprise to anyone else, and if so,
is there a minor docs tweak that might avoid the confusion?

Thanks,
James Coleman

1: 
https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY
2: https://www.postgresql.org/docs/current/logical-replication-publication.html




Re: set_cheapest without checking pathlist

2024-02-01 Thread James Coleman
On Thu, Feb 1, 2024 at 1:36 AM Richard Guo  wrote:
>
>
> On Thu, Feb 1, 2024 at 11:37 AM David Rowley  wrote:
>>
>> On Thu, 1 Feb 2024 at 16:29, Richard Guo  wrote:
>> > On Thu, Feb 1, 2024 at 10:04 AM James Coleman  wrote:
>> >> I don't see any inherent reason why we must always assume that
>> >> gather_grouping_paths will always result in having at least one entry
>> >> in pathlist. If, for example, we've disabled incremental sort and the
>> >> cheapest partial path happens to already be sorted, then I don't
>> >> believe we'll add any paths. And if that happens then set_cheapest
>> >> will error with the message "could not devise a query plan for the
>> >> given query". So I propose we add a similar guard to this call point.
>> >
>> >
>> > I don't believe that would happen.  It seems to me that we should, at
>> > the very least, have a path which is Gather on top of the cheapest
>> > partial path (see generate_gather_paths), as long as the
>> > partially_grouped_rel has partial paths.
>>
>> It will have partial paths because it's nested inside "if
>> (partially_grouped_rel && partially_grouped_rel->partial_pathlist)"
>
>
> Right.  And that leads to the conclusion that gather_grouping_paths will
> always generate at least one path for partially_grouped_rel.  So it
> seems to me that the check added in the patch is not necessary.  But
> maybe we can add an Assert or a comment clarifying that the pathlist of
> partially_grouped_rel cannot be empty here.

Yes, that's true currently. I don't particularly love that we have the
knowledge of that baked in at this level, but it won't break anything
as the code stands.

I'll put this as a patch in the parallelization patch series
referenced earlier since in that series my changes result in
generate_gather_paths() not necessarily always being able to build a
path.

Regards,
James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2024-01-31 Thread James Coleman
On Wed, Jan 31, 2024 at 3:18 PM Robert Haas  wrote:
>
> On Tue, Jan 30, 2024 at 9:56 PM James Coleman  wrote:
> > I don't follow the "Idle since July" since it just hasn't received
> > review since then, so there's been nothing to reply to.
>
> It wasn't clear to me if you thought that the patch was ready for
> review since July, or if it was waiting on you since July. Those are
> quite different, IMV.

Agreed they're very different. I'd thought it was actually in "Needs
review" and with no outstanding questions on the thread since July,
but maybe I'm missing something -- I've definitely misunderstood CF
app status before, but usually that's been in the other direction
(forgetting to mark it back to Needs Review after responding to a
Waiting on Author).

Regards,
James Coleman




set_cheapest without checking pathlist

2024-01-31 Thread James Coleman
Hello,

Robert: I've taken the liberty of cc'ing you since you worked on most
of this code. My apologies if that wasn't appropriate.

While working on "Parallelize correlated subqueries that execute
within each worker" [1] I noticed that while in the other call to
set_cheapest (for partially_grouped_rel) in the same function the call
after gather_grouping_paths(root, partially_grouped_rel) is not
similarly guarded with a check for a NIL pathlist on
partially_grouped_rel.

I don't see any inherent reason why we must always assume that
gather_grouping_paths will always result in having at least one entry
in pathlist. If, for example, we've disabled incremental sort and the
cheapest partial path happens to already be sorted, then I don't
believe we'll add any paths. And if that happens then set_cheapest
will error with the message "could not devise a query plan for the
given query". So I propose we add a similar guard to this call point.

I could be convinced that this should be simply part of the patch in
the other thread, but it seemed to me it'd be worth considering
independently because as noted above I don't see any reason why this
couldn't happen separately. That being said, on master I don't have a
case showing this is necessary.

Thanks,
James Coleman

1: 
https://www.postgresql.org/message-id/flat/CAAaqYe-Aq6oNf9NPZnpPE7SgRLomXXWJA1Gz9L9ndi_Nv%3D94Yw%40mail.gmail.com#e0b1a803d0fdb97189ce493f15f99c14


v1-0001-Guard-set_cheapest-with-pathlist-NIL-check.patch
Description: Binary data


Re: Parallelize correlated subqueries that execute within each worker

2024-01-31 Thread James Coleman
On Tue, Jan 30, 2024 at 10:34 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > I've finally had a chance to look at this, and I don't believe there's
> > any real failure here, merely drift of how the planner works on master
> > resulting in this query now being eligible for a different plan shape.
>
> > I was a bit wary at first because the changing test query is one I'd
> > previously referenced in [1] as likely exposing the bug I'd fixed
> > where params were being used across worker boundaries. However
> > looking at the diff in the patch at that point (v10) that particular
> > test query formed a different plan shape (there were two gather nodes
> > being created, and params crossing between them).
>
> > But in the current revision of master with the current patch applied
> > that's no longer true: we have a Gather node, and the Subplan using
> > the param is properly under that Gather node, and the param should be
> > being both generated and consumed within the same worker process.
>
> Hmm ... so the question this raises for me is: was that test intended
> to verify behavior of params being passed across workers?  If so,
> haven't you broken the point of the test?  This doesn't mean that
> your code change is wrong; but I think maybe you need to find a way
> to modify that test case so that it still tests what it's meant to.
> This is a common hazard when changing the planner's behavior.

I'd been thinking it was covered by another test I'd added in 0001,
but looking at it again that test doesn't exercise parallel append
(though it does exercise a different case of cross-worker param
usage), so I'll add another test for the parallel append behavior.

Regards,
James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2024-01-30 Thread James Coleman
On Tue, Jan 30, 2024 at 11:54 AM Robert Haas  wrote:
>
> On Tue, Jan 30, 2024 at 11:17 AM Akshat Jaimini  wrote:
> > I think we should move this patch to the next CF as I believe that work is 
> > still going on resolving the last reported bug.
>
> We shouldn't just keep pushing this forward to the next CF. It's been
> idle since July. If it needs more work, mark it RwF and it can be
> reopened when there's something for a reviewer to do.

I don't follow the "Idle since July" since it just hasn't received
review since then, so there's been nothing to reply to.

That being said, Vignesh's note in January about a now-failing test is
relevant activity, and I've just today responded to that, so I'm
changing the status back from Waiting on Author to Needs Review.

Regards,
James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2024-01-30 Thread James Coleman
On Tue, Jan 9, 2024 at 2:09 AM vignesh C  wrote:
> ...
> > Given all of that I settled on this approach:
> > 1. Modify is_parallel_safe() to by default ignore PARAM_EXEC params.
> > 2. Add is_parallel_safe_with_params() that checks for the existence of
> > such params.
> > 3. Store the required params in a bitmapset on each base rel.
> > 4. Union the bitmapset on join rels.
> > 5. Only insert a gather node if that bitmapset is empty.
> >
> > I have an intuition that there's some spot (e.g. joins) that we should
> > be removing params from this set (e.g., when we've satisfied them),
> > but I haven't been able to come up with such a scenario as yet.
> >
> > The attached v11 fixes the issue you reported.
>
> One of the tests has failed in CFBot at [1] with:
> +++ 
> /tmp/cirrus-ci-build/build/testrun/pg_upgrade/002_pg_upgrade/data/results/select_parallel.out
> 2023-12-20 20:08:42.480004000 +
> @@ -137,23 +137,24 @@
>  explain (costs off)
>   select (select max((select pa1.b from part_pa_test pa1 where pa1.a = 
> pa2.a)))
>   from part_pa_test pa2;
> -  QUERY PLAN
> ---
> - Aggregate
> + QUERY PLAN
> +
> + Finalize Aggregate
> ->  Gather
>   Workers Planned: 3
> - ->  Parallel Append
> -   ->  Parallel Seq Scan on part_pa_test_p1 pa2_1
> -   ->  Parallel Seq Scan on part_pa_test_p2 pa2_2
> + ->  Partial Aggregate
> +   ->  Parallel Append
> + ->  Parallel Seq Scan on part_pa_test_p1 pa2_1
> + ->  Parallel Seq Scan on part_pa_test_p2 pa2_2
> +   SubPlan 1
> + ->  Append
> +   ->  Seq Scan on part_pa_test_p1 pa1_1
> + Filter: (a = pa2.a)
> +   ->  Seq Scan on part_pa_test_p2 pa1_2
> + Filter: (a = pa2.a)
> SubPlan 2
>   ->  Result
> -   SubPlan 1
> - ->  Append
> -   ->  Seq Scan on part_pa_test_p1 pa1_1
> - Filter: (a = pa2.a)
> -   ->  Seq Scan on part_pa_test_p2 pa1_2
> - Filter: (a = pa2.a)
> -(14 rows)
> +(15 rows)
>
> More details of the failure is available at [2].
>
> [1] - https://cirrus-ci.com/task/5685696451575808
> [2] - 
> https://api.cirrus-ci.com/v1/artifact/task/5685696451575808/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

Thanks for noting this here.

I've finally had a chance to look at this, and I don't believe there's
any real failure here, merely drift of how the planner works on master
resulting in this query now being eligible for a different plan shape.

I was a bit wary at first because the changing test query is one I'd
previously referenced in [1] as likely exposing the bug I'd fixed
where params were being used across worker boundaries. However
looking at the diff in the patch at that point (v10) that particular
test query formed a different plan shape (there were two gather nodes
being created, and params crossing between them).

But in the current revision of master with the current patch applied
that's no longer true: we have a Gather node, and the Subplan using
the param is properly under that Gather node, and the param should be
being both generated and consumed within the same worker process.

So I've updated the patch to show that plan change as part of the diff.

See attached v12

Regards,
James Coleman

1: 
https://www.postgresql.org/message-id/CAAaqYe-_TObm5KwmZLYXBJ3BJGh4cUZWM0v1mY1gWTMkRNQXDQ%40mail.gmail.com


v12-0002-Parallelize-correlated-subqueries.patch
Description: Binary data


v12-0001-Add-tests-before-change.patch
Description: Binary data


Re: Opportunistically pruning page before update

2024-01-29 Thread James Coleman
On Fri, Jan 26, 2024 at 8:33 PM James Coleman  wrote:
>
> On Tue, Jan 23, 2024 at 2:46 AM Dilip Kumar  wrote:
> >
> > On Tue, Jan 23, 2024 at 7:18 AM James Coleman  wrote:
> > >
> > > On Mon, Jan 22, 2024 at 8:21 PM James Coleman  wrote:
> > > >
> > > > See rebased patch attached.
> > >
> > > I just realized I left a change in during the rebase that wasn't 
> > > necessary.
> > >
> > > v4 attached.
> >
> > I have noticed that you are performing the opportunistic pruning after
> > we decided that the updated tuple can not fit in the current page and
> > then we are performing the pruning on the new target page.  Why don't
> > we first perform the pruning on the existing page of the tuple itself?
> >  Or this is already being done before this patch? I could not find
> > such existing pruning so got this question because such pruning can
> > convert many non-hot updates to the HOT update right?
>
> First off I noticed that I accidentally sent a different version of
> the patch I'd originally worked on. Here's the one from the proper
> branch. It's still similar, but I want to make sure the right one is
> being reviewed.
>
> I'm working on a demo case for updates (to go along with the insert
> case I sent earlier) to test out your question, and I'll reply when I
> have that.

All right, getting all this loaded back into my head, as you noted
earlier the patch currently implements points 1 and 2 of my list of
possible improvements:

> 1. The most trivial case where this is useful is INSERT: we have a
> target page, and it may have dead tuples, so trying to prune may
> result in us being able to use the target page rather than getting a
> new page.
> 2. The next most trivial case is where UPDATE (potentially after
> failing to find space for a HOT tuple on the source tuple's page);
> much like the INSERT case our backend's target page may benefit from
> pruning.

What you're describing above would be implementing (at least part of) point 3:

> 3. A more complex UPDATE case occurs when we check the tuple's page
> for space in order to insert a HOT tuple and fail to find enough
> space. While we've already opportunistically pruned the page on
> initial read of the tuple, in complex queries this might be some time
> in the past, so it may be worth attempting again.
> ...

If we try to design a simple test case for updates (like my insert
test case above) we might end up with something like:

drop table if exists foo;
create table foo(pk serial primary key, t text);
insert into foo(t) select repeat('a', 250) from generate_series(1, 27);
select pg_relation_size('foo');
delete from foo where pk = 1;
update foo set t = repeat('b', 250) where pk = 2;
select pg_relation_size('foo');

But that actually works as expected on master, because we call
heap_page_prune_opt from heapam_index_fetch_tuple as part of the index
scan that drives the update query.

I was theorizing that if there are concurrent writes to the page we
might be able to trigger the need to re-prune a page in the for
loop in heap_update(), and I tried both regular pgbench and a
custom pgbench script with inserts/deletes/updates (including some
artificial delays).
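
For reference, a hypothetical sketch of such a custom script (not the
one actually used), against the foo table from the demo above:

\set id random(1, 10000)
insert into foo(t) values (repeat('a', 250));
\sleep 10 ms
delete from foo where pk = :id;
update foo set t = repeat('b', 250) where pk = :id;

run with something like "pgbench -n -f script.sql -c 8".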

What I concluded is that this isn't likely to be fruitful: we need
the buffer to be local to our backend (no other pins) to be able to
clean it, but since we've already pruned it on read, we need to have
had another backend modify the page (and dropped its pin!) between our
read and our write.

If someone believes there's a scenario that would demonstrate
otherwise, I would of course be interested to hear any ideas, but at
this point I think it's probably worth focusing on the first two cases
this patch already addresses.

Regards,
James Coleman




Re: Opportunistically pruning page before update

2024-01-26 Thread James Coleman
On Tue, Jan 23, 2024 at 2:46 AM Dilip Kumar  wrote:
>
> On Tue, Jan 23, 2024 at 7:18 AM James Coleman  wrote:
> >
> > On Mon, Jan 22, 2024 at 8:21 PM James Coleman  wrote:
> > >
> > > See rebased patch attached.
> >
> > I just realized I left a change in during the rebase that wasn't necessary.
> >
> > v4 attached.
>
> I have noticed that you are performing the opportunistic pruning after
> we decided that the updated tuple can not fit in the current page and
> then we are performing the pruning on the new target page.  Why don't
> we first perform the pruning on the existing page of the tuple itself?
>  Or this is already being done before this patch? I could not find
> such existing pruning so got this question because such pruning can
> convert many non-hot updates to the HOT update right?

First off I noticed that I accidentally sent a different version of
the patch I'd originally worked on. Here's the one from the proper
branch. It's still similar, but I want to make sure the right one is
being reviewed.

I'm working on a demo case for updates (to go along with the insert
case I sent earlier) to test out your question, and I'll reply when I
have that.

Regards,
James Coleman


v5-0004-log-when-pruning-2.patch
Description: Binary data


v5-0002-log-when-pruning.patch
Description: Binary data


v5-0003-Opportunistically-prune-to-avoid-building-a-new-p.patch
Description: Binary data


v5-0001-Allow-getting-lock-before-calling-heap_page_prune.patch
Description: Binary data


Re: Opportunistically pruning page before update

2024-01-22 Thread James Coleman
On Mon, Jan 22, 2024 at 8:21 PM James Coleman  wrote:
>
> See rebased patch attached.

I just realized I left a change in during the rebase that wasn't necessary.

v4 attached.

Regards,
James Coleman


v4-0002-Opportunistically-prune-to-avoid-building-a-new-p.patch
Description: Binary data


v4-0001-Allow-getting-lock-before-calling-heap_page_prune.patch
Description: Binary data


Re: Opportunistically pruning page before update

2024-01-22 Thread James Coleman
On Sun, Jan 21, 2024 at 9:58 PM Peter Smith  wrote:
>
> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> like there was some CFbot test failure last time it was run [2].
> Please have a look and post an updated version if necessary.
>
> ==
> [1] https://commitfest.postgresql.org/46/4384//
> [2] 
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4384

See rebased patch attached.

Thanks,
James Coleman


v3-0001-Allow-getting-lock-before-calling-heap_page_prune.patch
Description: Binary data


v3-0002-Opportunistically-prune-to-avoid-building-a-new-p.patch
Description: Binary data


Re: Add last_commit_lsn to pg_stat_database

2024-01-22 Thread James Coleman
On Sun, Jan 21, 2024 at 10:26 PM Peter Smith  wrote:
>
> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review", but it seems like
> there was some CFbot test failure last time it was run [1]. Please
> have a look and post an updated version if necessary.
>
> ==
> [1] 
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4355
>
> Kind Regards,
> Peter Smith.

Updated patch attached,

Thanks,
James Coleman


v3-0001-Add-last-commit-s-LSN-to-pg_stat_database.patch
Description: Binary data


Re: PG12 change to DO UPDATE SET column references

2024-01-20 Thread James Coleman
On Sat, Jan 20, 2024 at 5:57 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > On Sat, Jan 20, 2024 at 12:59 PM Tom Lane  wrote:
> >> A HINT if the bogus column name (1) matches the relation name and
> >> (2) is field-qualified seems plausible to me.  Then it's pretty
> >> likely to be a user misunderstanding about whether to write the
> >> relation name.
>
> > Attached is a patch to do just that. We could also add tests for
> > regular UPDATEs if you think that's useful.
>
> Pushed with minor alterations:
>
> 1. I think our usual style for conditional hints is to use a ternary
> expression within the ereport, rather than duplicating code.  In this
> case that way allows not touching any of the existing lines, making
> review easier.

Ah, I'd wondered if we had a pattern for that, but I didn't know what
I was looking for.

> 2. I thought we should test the UPDATE case as well as the ON CONFLICT
> case, but I didn't think we needed quite as many tests as you had
> here.  I split up the responsibility so that one test covers the
> alias case and the other the no-alias case.

That all makes sense. I figured it was better to have tests show all
the possible combinations for review and then you could whittle them
down as you saw fit.

Thanks for reviewing and committing!

Regards,
James Coleman




Re: PG12 change to DO UPDATE SET column references

2024-01-20 Thread James Coleman
On Sat, Jan 20, 2024 at 12:59 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > I do wonder if it's plausible (and sufficiently easy) to improve the
> > error message here. "column 'foo' of relation 'foo'" makes one thing
> > that you've written foo.foo, (in my real-world case the error message
> > also cut off the sql past "foo.", and so I couldn't even tell if the
> > sql was just malformed). At the very least it'd be nice to have a HINT
> > here (perhaps just when the relation and column name match).
>
> > Before I look at where it is, Is such an improvement something we'd be
> > interested in?
>
> A HINT if the bogus column name (1) matches the relation name and
> (2) is field-qualified seems plausible to me.  Then it's pretty
> likely to be a user misunderstanding about whether to write the
> relation name.

Attached is a patch to do just that. We could also add tests for
regular UPDATEs if you think that's useful.

Regards,
James Coleman


v1-0001-Helpful-hint-for-qualified-target-column-in-UPDAT.patch
Description: Binary data


Re: PG12 change to DO UPDATE SET column references

2024-01-20 Thread James Coleman
On Sat, Jan 20, 2024 at 11:12 AM Tom Lane  wrote:
>
> James Coleman  writes:
> > Suppose I have this table:
> > create table foo (id int primary key);
>
> > On PG11 this works:
> > postgres=# insert into foo (id) values (1) on conflict (id) do update
> > set foo.id = 1;
> > INSERT 0 1
>
> Hmm, are you sure about that?  I get
>
> ERROR:  column "foo" of relation "foo" does not exist
> LINE 2:   on conflict (id) do update set foo.id = 1;
>  ^
>
> in every branch back to 9.5 where ON CONFLICT was introduced.
>
> I'm checking branch tip in each case, so conceivably this is
> something that was changed post-11.0, but I kinda doubt we
> would have back-patched it.

Hmm, I just tested it on the official 11.15 docker image and couldn't
reproduce it. That leads me to believe that the difference isn't in
PG11 vs. 12, but rather in 2ndQuadrant Postgres (which we are running
for PG11, but are not using for > 11). Egg on my face twice in this
thread.

I do wonder if it's plausible (and sufficiently easy) to improve the
error message here. "column 'foo' of relation 'foo'" makes one think
that you've written foo.foo (in my real-world case the error message
also cut off the sql past "foo.", and so I couldn't even tell if the
sql was just malformed). At the very least it'd be nice to have a HINT
here (perhaps just when the relation and column name match).

Before I look at where it is, Is such an improvement something we'd be
interested in?

Regards,
James Coleman




Re: PG12 change to DO UPDATE SET column references

2024-01-20 Thread James Coleman
On Fri, Jan 19, 2024 at 1:53 PM David G. Johnston
 wrote:
>
> On Fri, Jan 19, 2024 at 10:01 AM James Coleman  wrote:
>>
>> Making this more confusing is the fact that if I want to do something
>> like "SET bar = foo.bar + 1" the table qualification cannot be present
>> on the setting column but is required on the reading column.
>>
>> There isn't anything in the docs that I see about this, and I don't
>> see anything scanning the release notes for PG12 either (though I
>> could have missed a keyword to search for).
>>
>
> https://www.postgresql.org/docs/12/sql-insert.html
>
> "When referencing a column with ON CONFLICT DO UPDATE, do not include the 
> table's name in the specification of a target column. For example, INSERT 
> INTO table_name ... ON CONFLICT DO UPDATE SET table_name.col = 1 is invalid 
> (this follows the general behavior for UPDATE)."
>
> The same text exists for v11.

Well, egg on my face for definitely missing that in the docs.

Unfortunately that doesn't explain why it works on PG11 and not on PG12.

Regards,
James Coleman




PG12 change to DO UPDATE SET column references

2024-01-19 Thread James Coleman
Hello,

I realize this is almost ancient history at this point, but I ran into
a surprising behavior change from PG11->12 with ON CONFLICT ... DO
UPDATE SET ...

Suppose I have this table:
create table foo (id int primary key);

On PG11 this works:
postgres=# insert into foo (id) values (1) on conflict (id) do update
set foo.id = 1;
INSERT 0 1

But on PG12+ this is the result:
postgres=# insert into foo (id) values (1) on conflict (id) do update
set foo.id = 1;
ERROR:  column "foo" of relation "foo" does not exist
LINE 1: ...oo (id) values (1) on conflict (id) do update set foo.id = 1...

Making this more confusing is the fact that if I want to do something
like "SET bar = foo.bar + 1" the table qualification cannot be present
on the setting column but is required on the reading column.
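
For concreteness, with a hypothetical two-column variant of the table,
the form that works is:

create table foo2 (id int primary key, bar int);
insert into foo2 (id, bar) values (1, 0)
on conflict (id) do update set bar = foo2.bar + 1;

The read side must be qualified (a bare "bar" would be ambiguous against
the excluded pseudo-relation), while writing "set foo2.bar = ..." is
rejected outright.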

There isn't anything in the docs that I see about this, and I don't
see anything scanning the release notes for PG12 either (though I
could have missed a keyword to search for).

Was this intended? Or a side effect? And should we explicitly document
the expectations here?

The error is also pretty confusing: by contrast, when you miss the
required qualification on the read column the error is more understandable:
ERROR:  column reference "bar" is ambiguous

It seems to me that it'd be desirable to either allow the unnecessary
qualification or give an error that's more easily understood.

Regards,
James Coleman




Re: Add last_commit_lsn to pg_stat_database

2024-01-17 Thread James Coleman
On Sun, Jan 14, 2024 at 6:01 AM vignesh C  wrote:
>
> On Sat, 10 Jun 2023 at 07:57, James Coleman  wrote:
> >
> > I've previously noted in "Add last commit LSN to
> > pg_last_committed_xact()" [1] that it's not possible to monitor how
> > many bytes of WAL behind a logical replication slot is (computing such
> > is obviously trivial for physical slots) because the slot doesn't need
> > to replicate beyond the last commit. In some cases it's possible for
> > the current WAL location to be far beyond the last commit. A few
> > examples:
> >
> > - An idle server with checkpoint_timeout at a lower value (so lots of
> > empty WAL advance).
> > - Very large transactions: particularly insidious because committing a
> > 1 GB transaction after a small transaction may show almost zero time
> > lag even though quite a bit of data needs to be processed and sent
> > over the wire (so time to replay is significantly different from
> > current lag).
> > - A cluster with multiple databases complicates matters further,
> > because while physical replication is cluster-wide, the LSNs that
> > matter for logical replication are database specific.
> >
> > Since we don't expose the most recent commit's LSN there's no way to
> > say "the WAL is currently 1250, the last commit happened at 1000, the
> > slot has flushed up to 800, therefore there are at most 200 bytes
> > replication needs to read through to catch up."
> >
> > In the aforementioned thread [1] I'd proposed a patch that added a SQL
> > function pg_last_commit_lsn() to expose the most recent commit's LSN.
> > Robert Haas didn't think the initial version's modifications to
> > commit_ts made sense, and a subsequent approach adding the value to
> > PGPROC didn't have strong objections, from what I can see, but it also
> > didn't generate any enthusiasm.
> >
> > As I was thinking about how to improve things, I realized that this
> > information (since it's for monitoring anyway) fits more naturally
> > into the stats system. I'd originally thought of exposing it in
> > pg_stat_wal, but that's per-cluster rather than per-database (indeed,
> > this is a flaw I hadn't considered in the original patch), so I think
> > pg_stat_database is the correct location.
> >
> > I've attached a patch to track the latest commit's LSN in pg_stat_database.
>
> I have changed the status of commitfest entry to "Returned with
> Feedback" as Aleksander's comments have not yet been resolved. Please
> feel free to post an updated version of the patch and update the
> commitfest entry accordingly.

Thanks for reminding me; I'd lost track of this patch.

Regards,
James Coleman




Re: Add last_commit_lsn to pg_stat_database

2024-01-17 Thread James Coleman
Hello,

Thanks for reviewing!

On Tue, Sep 19, 2023 at 8:16 AM Aleksander Alekseev
 wrote:
>
> Hi,
>
> > [...]
> > As I was thinking about how to improve things, I realized that this
> > information (since it's for monitoring anyway) fits more naturally
> > into the stats system. I'd originally thought of exposing it in
> > pg_stat_wal, but that's per-cluster rather than per-database (indeed,
> > this is a flaw I hadn't considered in the original patch), so I think
> > pg_stat_database is the correct location.
> >
> > I've attached a patch to track the latest commit's LSN in pg_stat_database.
>
> Thanks for the patch. It was marked as "Needs Review" so I decided to
> take a brief look.
>
> All in all the code seems to be fine but I have a couple of nitpicks:
>
> - If you are modifying pg_stat_database the corresponding changes to
> the documentation are needed.

Updated.

> - pg_stat_get_db_last_commit_lsn() has the same description as
> pg_stat_get_db_xact_commit() which is confusing.

I've fixed this.

> - pg_stat_get_db_last_commit_lsn() is marked as STABLE which is
> probably correct. But I would appreciate a second opinion on this.

Sounds good.

> - Wouldn't it be appropriate to add a test or two?

Added.

> - `if (!XLogRecPtrIsInvalid(commit_lsn))` - I suggest adding
> XLogRecPtrIsValid() macro for better readability

We have 36 usages of !XLogRecPtrIsInvalid(...) already, so I think we
should avoid making this change in this patch.

v2 is attached.
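
As a sketch of the intended monitoring use (the column name is the one
this patch adds; the slot-to-database matching is kept deliberately
simple):

select d.datname,
       pg_wal_lsn_diff(d.last_commit_lsn, s.confirmed_flush_lsn)
         as max_bytes_behind
from pg_stat_database d
join pg_replication_slots s on s.database = d.datname;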

Regards,
James Coleman


v2-0001-Add-last-commit-s-LSN-to-pg_stat_database.patch
Description: Binary data


Re: Teach predtest about IS [NOT] proofs

2024-01-17 Thread James Coleman
On Fri, Dec 22, 2023 at 2:48 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > I've not yet applied all of your feedback, but I wanted to get an
> > initial read on your thoughts on how using switch statements ends up
> > looking. Attached is a single (pure refactor) patch that converts the
> > various if/else levels that check things like node tag and
> > boolean/null test type into switch statements. I've retained 'default'
> > keyword usages for now for simplicity (my intuition is that we
> > generally prefer to list out all options for compiler safety benefits,
> > though I'm not 100% sure that's useful in the outer node tag check
> > since it's unlikely that someone adding a new node would modify
> > this...).
>
> > My big question is: are you comfortable with the indentation explosion
> > this creates? IMO it's a lot wordier, but it is also more obvious what
> > the structural goal is. I'm not sure how we want to make the right
> > trade-off though.
>
> Yeah, I see what you mean.  Also, I'd wanted to shove most of
> the text in the function header in-line and get rid of the short
> restatements of those paras.  I carried that through just for
> predicate_implied_by_simple_clause, as attached.  The structure is
> definitely clearer, but we end up with an awful lot of indentation,
> which makes the comments less readable than I'd like.  (I did some
> minor rewording to make them flow better.)
>
> On balance I think this is probably better than what we have, but
> maybe we'd be best off to avoid doubly nested switches?  I think
> there's a good argument for the outer switch on nodeTag, but
> maybe we're getting diminishing returns from an inner switch.
>
> regards, tom lane
>

Apologies for the long delay.

Attached is a new patch series.

0001 does the initial pure refactor. 0002 makes a lot of modifications
to what we can prove about implication and refutation. Finally, 0003
isn't intended to be committed, but attempts to validate more
holistically that none of the changes creates any invalid proofs
beyond the mostly happy-path tests added in 0002.

I ended up not tackling changing how test_predtest tests run for now.
That's plausibly still useful, and I'd be happy to add that if you
generally agree with the direction of the patch and with that
abstraction being useful.

I added some additional verifications to the test_predtest module to
prevent additional obvious flaws.

Regards,
James Coleman


v4-0003-Add-temporary-all-permutations-test.patch
Description: Binary data


v4-0001-Use-switch-statements-in-predicate_-implied-refut.patch
Description: Binary data


v4-0002-Teach-predtest.c-about-BooleanTest.patch
Description: Binary data


Re: Teach predtest about IS [NOT] proofs

2023-12-22 Thread James Coleman
On Thu, Dec 14, 2023 at 4:38 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > On Wed, Dec 13, 2023 at 1:36 PM Tom Lane  wrote:
> >> I don't have an objection in principle to adding more smarts to
> >> predtest.c.  However, we should be wary of slowing down cases where
> >> no BooleanTests are present to be optimized.  I wonder if it could
> >> help to use a switch on nodeTag rather than a series of if(IsA())
> >> tests.  (I'd be inclined to rewrite the inner if-then-else chains
> >> as switches too, really.  You get some benefit from the compiler
> >> noticing whether you've covered all the enum values.)
>
> > I think I could take this on; would you prefer it as a patch in this
> > series? Or as a new patch thread?
>
> No, keep it in the same thread (and make a CF entry, if you didn't
> already).  It might be best to make a series of 2 patches, first
> just refactoring what's there per this discussion, and then a
> second one to add BooleanTest logic.

CF entry is already created; I'll keep it here then.

> >> I note you've actively broken the function's ability to cope with
> >> NULL input pointers.  Maybe we don't need it to, but I'm not going
> >> to accept a patch that just side-swipes that case without any
> >> justification.
>
> > [ all callers have previously used predicate_classify ]
>
> OK, fair enough.  The checks for nulls are probably from ancient
> habit, but I agree we could remove 'em here.
>
> >> Perhaps, rather than hoping people will notice comments that are
> >> potentially offscreen from what they're modifying, we should relocate
> >> those comment paras to be adjacent to the relevant parts of the
> >> function?
>
> > Splitting up that block comment makes sense to me.
>
> Done, let's make it so.
>
> >> I've not gone through the patch in detail to see whether I believe
> >> the proposed proof rules.  It would help to have more comments
> >> justifying them.
>
> > Most of them are sufficiently simple -- e.g., X IS TRUE implies X --
> > that I don't think there's a lot to say in justification. In some
> > cases I've noted the cases that force only strong or weak implication.
>
> Yeah, it's the strong-vs-weak distinction that makes me cautious here.
> One's high-school-algebra instinct for what's obviously true tends to
> not think about NULL/UNKNOWN, and you do have to consider that.
>
> >>> As noted in a TODO in the patch itself, I think it may be worth 
> >>> refactoring
> >>> the test_predtest module to run the "x, y" case as well as the "y, x" case
>
> >> I think that's actively undesirable.  It is not typically the case that
> >> a proof rule for A => B also works in the other direction, so this would
> >> encourage wasting cycles in the tests.  I fear it might also cause
> >> confusion about which direction a proof rule is supposed to work in.
>
> > That makes sense in the general case.
>
> > Boolean expressions seem like a special case in that regard: (subject
> > to what it looks like) would you be OK with a wrapping function that
> > does both directions (with output that shows which direction is being
> > tested) used only for the cases where we do want to check both
> > directions?
>
> Using a wrapper where appropriate would remove the inefficiency
> concern, but I still worry whether it will promote confusion about
> which direction we're proving things in.  You'll need to be very clear
> about the labeling.

I've not yet applied all of your feedback, but I wanted to get an
initial read on your thoughts on how using switch statements ends up
looking. Attached is a single (pure refactor) patch that converts the
various if/else levels that check things like node tag and
boolean/null test type into switch statements. I've retained 'default'
keyword usages for now for simplicity (my intuition is that we
generally prefer to list out all options for compiler safety benefits,
though I'm not 100% sure that's useful in the outer node tag check
since it's unlikely that someone adding a new node would modify
this...).

My big question is: are you comfortable with the indentation explosion
this creates? IMO it's a lot wordier, but it is also more obvious what
the structural goal is. I'm not sure how we want to make the right
trade-off though.

Once there's agreement on this part, I'll add back a second patch
applying my changes on top of the refactor as well as apply other
feedback (e.g., splitting up the block comment).

Regards,
James Coleman


v2-0001-WIP-use-switch-statements.patch
Description: Binary data


Re: Teach predtest about IS [NOT] proofs

2023-12-13 Thread James Coleman
Thanks for taking a look!

On Wed, Dec 13, 2023 at 1:36 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > Attached is a patch that solves that issue. It also teaches predtest about
> > quite a few more cases involving BooleanTest expressions (e.g., how they
> > relate to NullTest expressions). One thing I could imagine being an
> > objection is that not all of these warrant cycles in planning. If that
> > turns out to be the case there's not a particularly clear line in my mind
> > about where to draw that line.
>
> I don't have an objection in principle to adding more smarts to
> predtest.c.  However, we should be wary of slowing down cases where
> no BooleanTests are present to be optimized.  I wonder if it could
> help to use a switch on nodeTag rather than a series of if(IsA())
> tests.  (I'd be inclined to rewrite the inner if-then-else chains
> as switches too, really.  You get some benefit from the compiler
> noticing whether you've covered all the enum values.)

I think I could take this on; would you prefer it as a patch in this
series? Or as a new patch thread?

> I note you've actively broken the function's ability to cope with
> NULL input pointers.  Maybe we don't need it to, but I'm not going
> to accept a patch that just side-swipes that case without any
> justification.

I should have explained that. I don't think I've broken it:

1. predicate_implied_by_simple_clause() is only ever called by
predicate_implied_by_recurse()
2. predicate_implied_by_recurse() starts with:
pclass = predicate_classify(predicate, &pred_info);
3. predicate_classify(Node *clause, PredIterInfo info) starts off with:
Assert(clause != NULL);

I believe this means we are currently guaranteed by the caller to
receive a non-NULL pointer, but I could be missing something.

The same argument (just substituting the equivalent "refute" function
names) applies to predicate_refuted_by_simple_clause().

> Another way in which the patch needs more effort is that you've
> not bothered to update the large comment block atop the function.
> Perhaps, rather than hoping people will notice comments that are
> potentially offscreen from what they're modifying, we should relocate
> those comment paras to be adjacent to the relevant parts of the
> function?

Splitting up that block comment makes sense to me.

> I've not gone through the patch in detail to see whether I believe
> the proposed proof rules.  It would help to have more comments
> justifying them.

Most of them are sufficiently simple -- e.g., X IS TRUE implies X --
that I don't think there's a lot to say in justification. In some
cases I've noted the cases that force only strong or weak implication.

There are a few cases, though (e.g., "X is unknown weakly implies X
is not true") that, reading over this again, don't immediately strike
me as obvious, so I'll expand on those.
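
As a quick sanity check for cases like that one, a truth table over the
three boolean values is handy (a is the antecedent, b the consequent):

select x,
       (x is unknown)  as a,
       (x is not true) as b
from (values (true), (false), (null::boolean)) as v(x);

Every row where a is true also has b true, and b is never null here.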

> > As noted in a TODO in the patch itself, I think it may be worth refactoring
> > the test_predtest module to run the "x, y" case as well as the "y, x" case
> > with a single call so as to eliminate a lot of repetition in
> > clause/expression test cases. If reviewers agree that's desirable, then I
> > could do that as a precursor.
>
> I think that's actively undesirable.  It is not typically the case that
> a proof rule for A => B also works in the other direction, so this would
> encourage wasting cycles in the tests.  I fear it might also cause
> confusion about which direction a proof rule is supposed to work in.

That makes sense in the general case.

Boolean expressions seem like a special case in that regard: (subject
to what it looks like) would you be OK with a wrapping function that
does both directions (with output that shows which direction is being
tested) used only for the cases where we do want to check both
directions?

Thanks,
James Coleman




Teach predtest about IS [NOT] proofs

2023-12-11 Thread James Coleman
Hello,

I recently encountered a case where partial indexes were surprisingly not
being used. The issue is that predtest doesn't understand how boolean
values and IS [NOT] expressions relate.

For example if I have:

create table foo(i int, bar boolean);
create index on foo(i) where bar is true;

then this query:

select * from foo where i = 1 and bar;

doesn't use the partial index.
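
For comparison, spelling the qual to match the index predicate exactly:

select * from foo where i = 1 and bar is true;

should already be provable (it's an exact clause match), so the partial
index is usable in that form; the point of the patch is that the bare
boolean spelling ought to be provable too.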

Attached is a patch that solves that issue. It also teaches predtest about
quite a few more cases involving BooleanTest expressions (e.g., how they
relate to NullTest expressions). One thing I could imagine being an
objection is that not all of these warrant cycles in planning. If that
turns out to be the case there's not a particularly clear line in my mind
about where to draw that line.

As noted in a TODO in the patch itself, I think it may be worth refactoring
the test_predtest module to run the "x, y" case as well as the "y, x" case
with a single call so as to eliminate a lot of repetition in
clause/expression test cases. If reviewers agree that's desirable, then I
could do that as a precursor.
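
For context, a test_predtest case is a single two-column query, so
covering both directions today means writing each pair twice (hypothetical
example; booleans stands in for a table covering true, false, and null):

select * from test_predtest($$
select x is not true, x is unknown
from booleans
$$);
select * from test_predtest($$
select x is unknown, x is not true
from booleans
$$);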

Regards,
James Coleman


v1-0001-Teach-predtest-about-IS-NOT-bool-proofs.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2023-10-18 Thread James Coleman
On Wed, Oct 18, 2023 at 2:09 AM torikoshia  wrote:
>
> On 2023-10-16 18:46, Ashutosh Bapat wrote:
> > On Thu, Oct 12, 2023 at 6:51 PM torikoshia 
> > wrote:
> >>
> >> On 2023-10-11 16:22, Ashutosh Bapat wrote:
> >> >
> >> > Considering the similarity with auto_explain I wondered whether this
> >> > function should be part of auto_explain contrib module itself? If we
> >> > do that users will need to load auto_explain extension and thus
> >> > install executor hooks when this function doesn't need those. So may
> >> > not be such a good idea. I didn't see any discussion on this.
> >>
> >> I once thought about adding this to auto_explain, but I left it as is
> >> for the below reasons:
> >>
> >> - One of the typical use case of pg_log_query_plan() would be
> >> analyzing
> >> slow query on customer environments. On such environments, We cannot
> >> always control what extensions to install.
> >
> > The same argument applies to auto_explain functionality as well. But
> > it's not part of the core.
>
> Yeah, and when we have a situation where we want to run
> pg_log_query_plan(), we can run it in any environment as long as it is
>   bundled with the core.
> On the other hand, if it is built into auto_explain, we need to start by
> installing auto_explain if we do not have auto_explain, which is often
> difficult to do in production environments.
>
> >>Of course auto_explain is a major extension and it is quite
> >> possible
> >> that they installed auto_explain, but but it is also possible they do
> >> not.
> >> - It seems a bit counter-intuitive that pg_log_query_plan() is in an
> >> extension called auto_explain, since it "manually" logs plans
> >>
> >
> > pg_log_query_plan() may not fit auto_explain but
> > pg_explain_backend_query() does. What we are logging is more than just
> > plan of the query, it might expand to be closer to explain output.
> > While auto in auto_explain would refer to its automatically logging
> > explain outputs, it can provide an additional function which provides
> > similar functionality by manually triggering it.
> >
> > But we can defer this to a committer, if you want.
> >
> > I am more interested in avoiding the duplication of code, esp. the
> > first comment in my reply
>
> If there are no objections, I will try porting it to auto_explain and
> see its feasibility.
>
> >>> There is a lot of similarity between what this feature does and what
> >>> auto explain does. I see the code is also duplicated. There is some
> >>> merit in avoiding this duplication
> >>> 1. we will get all the features of auto_explain automatically like
> >>> choosing a format (this was expressed somebody earlier in this
> >>> thread), setings etc.
> >>> 2. avoid bugs. E.g your code switches context after ExplainState has
> >>> been allocated. These states may leak depending upon when this
> >>> function gets called.
> >>> 3. Building features on top as James envisions will be easier.

In my view the fact that auto_explain is itself not part of core is a
mistake, and I know there are more prominent hackers than myself who
hold that view.

While that decision as regards auto_explain has long since been made
(and there would be work to undo it), I don't believe that we should
repeat that choice here. I'm -10 on moving this into auto_explain.

However perhaps there is still an opportunity for moving some of the
auto_explain code into core so as to enable deduplicating the code.

Regards,
James Coleman




Re: RFC: Logging plan of the running query

2023-10-06 Thread James Coleman
On Fri, Oct 6, 2023 at 8:58 AM torikoshia  wrote:
>
> On 2023-10-04 03:00, James Coleman wrote:
> > On Thu, Sep 7, 2023 at 2:09 AM torikoshia 
> > wrote:
> >>
> >> On 2023-09-06 11:17, James Coleman wrote:
> >>
> >> >> > I've never been able to reproduce it (haven't tested the new version,
> >> >> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> >> >> > (first buster and now bullseye).
> >> >> >
> >> >> > I'm attaching several stacktraces in the hope that they provide some
> >> >> > clues. These all match the ps output I sent earlier, though note in
> >> >> > that output there is both the regress instance and my test instance
> >> >> > (pid 3213249) running (different ports, of course, and they are from
> >> >> > the exact same compilation run). I've attached ps output for the
> >> >> > postgres processes under the make check process to simplify cross
> >> >> > referencing.
> >> >> >
> >> >> > A few interesting things:
> >> >> > - There's definitely a lock on a relation that seems to be what's
> >> >> > blocking the processes.
> >> >> > - When I try to connect with psql the process forks but then hangs
> >> >> > (see the ps output with task names stuck in "authentication"). I've
> >> >> > also included a trace from one of these.
> >> >>
> >> >> Thanks for sharing them!
> >> >>
> >> >> Many processes are waiting to acquire the LW lock, including the
> >> >> process
> >> >> trying to output the plan(select1.trace).
> >> >>
> >> >> I suspect that this is due to a lock that was acquired prior to being
> >> >> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able
> >> >> to
> >> >> reproduce the same situation..
> >> >>
> >> >
> >> > I don't have time immediately to dig into this, but perhaps loading up
> >> > the core dumps would allow us to see what query is running in each
> >> > backend process (if it hasn't already been discarded by that point)
> >> > and thereby determine what point in each test process led to the error
> >> > condition?
> >>
> >> Thanks for the suggestion.
> >> I am concerned that core dumps may not be readable on different
> >> operating systems or other environments. (Unfortunately, I do not have
> >> Debian on hand)
> >>
> >> It seems that we can know what queries were running from the stack
> >> traces you shared.
> >> As described above, I suspect a lock which was acquired prior to
> >> ProcessLogQueryPlanInterrupt() caused the issue.
> >> I will try a little more to see if I can devise a way to create the
> >> same
> >> situation.
> >>
> >> > Alternatively we might be able to apply the same trick to the test
> >> > client instead...
> >> >
> >> > BTW, for my own easy reference in this thread: relid 1259 is pg_class
> >> > if I'm not mistaken.
> >>
> >> Yeah, and I think it's strange that the lock to 1259 appears twice and
> >> must be avoided.
> >>
> >>#10 0x559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1)
> >> at
> >> lmgr.c:117
> >>..
> >>#49 0x559d61b4989d in ProcessLogQueryPlanInterrupt () at
> >> explain.c:5158
> >>..
> >>#53 0x559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1)
> >> at
> >> lmgr.c:117
> >
> > I chatted with Andres and David about this at PGConf.NYC,
> Thanks again for the discussion with hackers!
>
> > and I think
> > what we need to do is explicitly disallow running this code any time
> > we are inside of lock acquisition code.
>
> Yeah, I think it's a straightforward workaround.
> And I'm also concerned that the condition of being in the process
> of acquiring some kind of lock is too strict and will make it almost
> impossible to output a running plan.

I was concerned about this too, but I was wondering if we might be
able to cheat a bit by making such a case not clear the flag but
instead just skip it for now.

Regards,
James




Re: Opportunistically pruning page before update

2023-10-06 Thread James Coleman
Hi,

Thanks for taking a look!

On Fri, Oct 6, 2023 at 1:18 AM Dilip Kumar  wrote:
>
> On Thu, Oct 5, 2023 at 2:35 AM James Coleman  wrote:
> >
> > I talked to Andres and Peter again today, and out of that conversation
> > I have some observations and ideas for future improvements.
> >
> > 1. The most trivial case where this is useful is INSERT: we have a
> > target page, and it may have dead tuples, so trying to prune may
> > result in us being able to use the target page rather than getting a
> > new page.
> > 2. The next most trivial case is where UPDATE (potentially after
> > failing to find space for a HOT tuple on the source tuple's page);
> > much like the INSERT case our backend's target page may benefit from
> > pruning.
>
> By looking at the patch I believe that v2-0003 is implementing these 2
> ideas.  So my question is are we planning to prune the backend's
> current target page only or if we can not find space in that then we
> are targetting to prune the other target pages as well which we are
> getting from FSM?  Because in the patch you have put a check in a loop
> it will try to prune every page it gets from the FSM not just the
> current target page of the backend.  Just wanted to understand if this
> is intentional.

Yes, just like with our opportunistically pruning on each read during
a select I think we should at least check when we have a new target
page. This seems particularly true since we're hoping to write to the
page anyway, and the cost of additionally making pruning changes to
that page is low. I looked at freespace.c, and it doesn't appear that
getting the block from the FSM does this already, so we're not
duplicating any existing work.

Regards,
James




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-10-04 Thread James Coleman
On Wed, Oct 4, 2023 at 9:42 AM Robert Haas  wrote:
>
> On Wed, Oct 4, 2023 at 9:36 AM James Coleman  wrote:
> > Are you thinking we should simply elide the fact that there is pruning
> > that happens outside of HOT? Or add that information onto the HOT
> > page, even though it doesn't directly fit?
>
> I think we should elide it. Maybe with a much larger rewrite there
> would be a good place to include that information, but with the
> current structure, the page is about why HOT is good, and talking
> about pruning that can happen apart from HOT doesn't advance that
> message.

All right, attached is a v3 which attempts to fix the wrong
information with an economy of words. I may at some point submit a
separate patch that adds a broader pruning section, but this at least
brings the docs inline with reality insofar as they address it.

James


v3-0001-Correct-HOT-docs-to-account-for-LP_REDIRECT.patch
Description: Binary data


Re: Opportunistically pruning page before update

2023-10-04 Thread James Coleman
On Tue, Sep 26, 2023 at 8:30 AM James Coleman  wrote:
>
> On Tue, Sep 5, 2023 at 1:40 PM Melanie Plageman
>  wrote:
> >
> > On Wed, Jun 21, 2023 at 8:51 AM James Coleman  wrote:
> > > While at PGCon I was chatting with Andres (and I think Peter G. and a
> > > few others who I can't remember at the moment, apologies) and Andres
> > > noted that while we opportunistically prune a page when inserting a
> > > tuple (before deciding we need a new page) we don't do the same for
> > > updates.
> > >
> > > Attached is a patch series to do the following:
> > >
> > > 0001: Make it possible to call heap_page_prune_opt already holding an
> > > exclusive lock on the buffer.
> > > 0002: Opportunistically prune pages on update when the current tuple's
> > > page has no free space. If this frees up enough space, then we
> > > continue to put the new tuple on that page; if not, then we take the
> > > existing code path and get a new page.
> >
> > I've reviewed these patches and have questions.
> >
> > Under what conditions would this be exercised for UPDATE? Could you
> > provide an example?
> >
> > With your patch applied, when I create a table, the first time I update
> > it heap_page_prune_opt() will return before actually doing any pruning
> > because the page prune_xid hadn't been set (it is set after pruning as
> > well as later in heap_update() after RelationGetBufferForTuple() is
> > called).
> >
> > I actually added an additional parameter to heap_page_prune() and
> > heap_page_prune_opt() to identify if heap_page_prune() was called from
> > RelationGetBufferForTuple() and logged a message when this was true.
> > Running the test suite, I didn't see any UPDATEs executing
> > heap_page_prune() from RelationGetBufferForTuple(). I did, however, see
> > other statement types doing so (see RelationGetBufferForTuple()'s other
> > callers). Was that intended?
> >
> > > I started to work on benchmarking this, but haven't had time to devote
> > > properly to that, so I'm wondering if there's anyone who might be
> > > interested in collaborating on that part.
> >
> > I'm interested in this feature and in helping with it/helping with
> > benchmarking it, but I don't yet understand the design in its current
> > form.
>
> Hi Melanie,
>
> Thanks for taking a look at this! Apologies for the long delay in
> replying: I started to take a look at your questions earlier, and it
> turned into more of a rabbit hole than I'd anticipated. I've since
> been distracted by other things. So -- I don't have any conclusions
> here yet, but I'm hoping at or after PGConf NYC that I'll be able to
> dedicate the time this deserves.

Hi,

I poked at this a decent amount last night and uncovered a couple of
things (whether or not Andres and I had discussed these details at
PGCon...I don't remember):

1. We don't ever opportunistically prune on INSERT, but we do
(somewhat, see below) on UPDATE, since we call it the first time we
read the page with the to-be-updated tuple on it.
2. The reason that original testing on v1 didn't see any real changes
is because PageClearHasFreeLinePointers() wasn't the right fastpath
gate on this; I should have been using !PageIsFull().

With the change to use !PageIsFull() I can trivially show that there
is improvement functionally. Consider the following commands:

drop table if exists foo;
create table foo(pk serial primary key, t text);
insert into foo(t) select repeat('a', 250) from generate_series(1, 27);
select pg_relation_size('foo');
delete from foo where pk <= 10;
insert into foo(t) select repeat('b', 250) from generate_series(1, 10);
select pg_relation_size('foo');

On master this will result in a final relation size of 16384 while
with the patch applied the final relation size is 8192.
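
To see where the reclaimed space comes from, pgstattuple (assuming the
extension is available) can be run between the DELETE and the second
INSERT:

create extension if not exists pgstattuple;
select tuple_count, dead_tuple_count, free_space from pgstattuple('foo');

The dead tuples reported there are what pruning turns back into usable
free space on the page.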

I talked to Andres and Peter again today, and out of that conversation
I have some observations and ideas for future improvements.

1. The most trivial case where this is useful is INSERT: we have a
target page, and it may have dead tuples, so trying to prune may
result in us being able to use the target page rather than getting a
new page.
2. The next most trivial case is where UPDATE (potentially after
failing to find space for a HOT tuple on the source tuple's page);
much like the INSERT case our backend's target page may benefit from
pruning.
3. A more complex UPDATE case occurs when we check the tuple's page
for space in order to insert a HOT tuple and fail to find enough
space. While we've already opportunistically pruned the page on
initial read of the tuple, in complex queries this might be some time
in the past, so it may be worth attempting again. Beyond that context
is k

Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-10-04 Thread James Coleman
On Wed, Oct 4, 2023 at 9:18 AM Robert Haas  wrote:
>
> On Tue, Oct 3, 2023 at 3:35 PM James Coleman  wrote:
> > I like your changes. Reading through this several times, and noting
> > Peter's comments about pruning being more than just HOT, I'm thinking
> > that rather than a simple fixup for this one paragraph what we
> > actually want is to split out the concept of page pruning into its own
> > section of the storage docs. Attached is a patch that does that,
> > incorporating much of your language about LP_REDIRECT, along with
> > LP_DEAD so that readers know this affects more than just heap-only
> > tuple workloads.
>
> I considered this kind of approach, but I don't think it's better. I
> think the point of this documentation is to give people a general idea
> of how HOT works, not all the technical details. If we start getting
> into the nitty gritty, we're going to have to explain a lot more
> stuff, which is going to detract from the actual purpose of this
> documentation. For example, your patch talks about LP_REDIRECT and
> LP_DEAD, which are not referenced in any other part of the
> documentation. It also uses the term line pointer instead of page item
> identifier without explaining that those are basically the same thing.
> Obviously those kinds of things can be fixed, but in my opinion, your
> version doesn't really add much information that is likely to be
> useful to a reader of this section, while at the same time it does add
> some things that might be confusing. If we wanted to have a more
> technical discussion of all of this somewhere in the documentation, we
> could, but that would be quite a bit more work to write and review,
> and it would probably duplicate some of what we've already got in
> README.HOT.
>
> I suggest that we should be looking for a patch that tries to correct
> the wrong stuff in the present wording while adding the minimum
> possible amount of text.

There's one primary thing I think this approach adds (I don't mean the
patch I proposed is the only way to address this concern): the fact
that pages get cleaned up outside of the HOT optimization. We could
add a short paragraph about that on the HOT page, but that seems a bit
confusing.

Are you thinking we should simply elide the fact that there is pruning
that happens outside of HOT? Or add that information onto the HOT
page, even though it doesn't directly fit?

Regards,
James




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-10-03 Thread James Coleman
On Mon, Oct 2, 2023 at 2:55 PM Robert Haas  wrote:
>
> On Sat, Sep 30, 2023 at 1:05 AM Peter Geoghegan  wrote:
> > > This is why I discovered it: it says "indexes do not reference their
> > > page item identifiers", which is manifestly not true when talking
> > > about the root item, and in fact would defeat the whole purpose of HOT
> > > (at least in an old-to-new chain like Postgres uses).
> >
> > Yeah, but...that's not what was intended. Obviously, the index hasn't
> > changed, and we expect index scans to continue to give correct
> > answers. So it is pretty strongly implied that it continues to point
> > to something valid.
>
> I took a look at this. I agree with James that the current wording is
> just plain wrong.
>
>  periodic vacuum operations.  (This is possible because indexes
>  do not reference their page item identifiers.)
>
> Here, the antecedent of "their" is "old versions of updated rows". It
> is slightly unclear whether we should interpret this to mean (a) the
> tuples together with the line pointers which point to them or (b) only
> the tuple data to which the line pointers point. If (a), then it's
> wrong because we can't actually get rid of the root line pointer;
> rather, we have to change it to a redirect. If (b), then it's wrong
> because heap_page_prune() removes dead tuples in this sense whether
> HOT is involved or not. I can see no interpretation under which this
> statement is true as written.
>
> I reviewed James's proposed alternative:
>
> + periodic vacuum operations. However because indexes reference the old
> + version's page item identifiers
> + the line pointer must remain in place. Such a line pointer has its
> + LP_REDIRECT bit set and its offset updated to the
> + page item identifiers of
> + the updated row.
>
> I don't think that's really right either. That's true for the root
> line pointer, but the phrasing seems to be referring to old versions
> generally, which would seem to include not only the root, for which
> this is correct, and also all subsequent now-dead row versions, for
> which it is wrong.
>
> Here is my attempt:
>
> When a row is updated multiple times, row versions other than the
> oldest and the newest can be completely removed during normal
> operation, including SELECTs, instead of requiring
> periodic vacuum operations. (Indexes always refer to the page item identifiers of the
> original row version. The tuple data associated with that row version
> is removed, and its item identifier is converted to a redirect that
> points to the oldest version that may still be visible to some
> concurrent transaction. Intermediate row versions that are no longer
> visible to anyone are completely removed, and the associated page item
> identifiers are made available for reuse.)

Hi Robert,

Thanks for reviewing!

I like your changes. Reading through this several times, and noting
Peter's comments about pruning being more than just HOT, I'm thinking
that rather than a simple fixup for this one paragraph what we
actually want is to split out the concept of page pruning into its own
section of the storage docs. Attached is a patch that does that,
incorporating much of your language about LP_REDIRECT, along with
LP_DEAD so that readers know this affects more than just heap-only
tuple workloads.

Regards,
James


v2-0001-Correct-HOT-docs-to-account-for-LP_REDIRECT.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2023-10-03 Thread James Coleman
On Thu, Sep 7, 2023 at 2:09 AM torikoshia  wrote:
>
> On 2023-09-06 11:17, James Coleman wrote:
>
> >> > I've never been able to reproduce it (haven't tested the new version,
> >> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> >> > (first buster and now bullseye).
> >> >
> >> > I'm attaching several stacktraces in the hope that they provide some
> >> > clues. These all match the ps output I sent earlier, though note in
> >> > that output there is both the regress instance and my test instance
> >> > (pid 3213249) running (different ports, of course, and they are from
> >> > the exact same compilation run). I've attached ps output for the
> >> > postgres processes under the make check process to simplify cross
> >> > referencing.
> >> >
> >> > A few interesting things:
> >> > - There's definitely a lock on a relation that seems to be what's
> >> > blocking the processes.
> >> > - When I try to connect with psql the process forks but then hangs
> >> > (see the ps output with task names stuck in "authentication"). I've
> >> > also included a trace from one of these.
> >>
> >> Thanks for sharing them!
> >>
> >> Many processes are waiting to acquire the LW lock, including the
> >> process
> >> trying to output the plan(select1.trace).
> >>
> >> I suspect that this is due to a lock that was acquired prior to being
> >> interrupted by ProcessLogQueryPlanInterrupt(), but have not been able
> >> to
> >> reproduce the same situation..
> >>
> >
> > I don't have time immediately to dig into this, but perhaps loading up
> > the core dumps would allow us to see what query is running in each
> > backend process (if it hasn't already been discarded by that point)
> > and thereby determine what point in each test process led to the error
> > condition?
>
> Thanks for the suggestion.
> I am concerned that core dumps may not be readable on different
> operating systems or other environments. (Unfortunately, I do not have
> Debian on hand)
>
> It seems that we can know what queries were running from the stack
> traces you shared.
> As described above, I suspect a lock which was acquired prior to
> ProcessLogQueryPlanInterrupt() caused the issue.
> I will try a little more to see if I can devise a way to create the same
> situation.
>
> > Alternatively we might be able to apply the same trick to the test
> > client instead...
> >
> > BTW, for my own easy reference in this thread: relid 1259 is pg_class
> > if I'm not mistaken.
>
> Yeah, and I think it's strange that the lock to 1259 appears twice and
> must be avoided.
>
>    #10 0x559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at
> lmgr.c:117
>..
>#49 0x559d61b4989d in ProcessLogQueryPlanInterrupt () at
> explain.c:5158
>..
>#53 0x559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at
> lmgr.c:117

I chatted with Andres and David about this at PGConf.NYC, and I think
what we need to do is explicitly disallow running this code any time
we are inside of lock acquisition code.

Regards,
James Coleman




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread James Coleman
On Fri, Sep 29, 2023 at 4:06 PM Peter Geoghegan  wrote:
>
> On Fri, Sep 29, 2023 at 11:45 AM James Coleman  wrote:
> > Hmm, to my reading the issue is that "old versions" doesn't say
> > anything about "old HOT versions"; it seems to be describing what
> > happens generally when a heap-only tuple is written -- which would
> > include the first time a heap-only tuple is written.
>
> I think that it's talking about what happens during opportunistic
> pruning, in particular what happens to HOT chains. (Though pruning
> does almost the same amount of useful work with non-heap-only tuples,
> so it's a bit unfortunate that the name "HOT pruning" seems to have
> stuck.)

That's very likely what the intention was. I read it again, and the
same confusion still sticks out to me: it doesn't say anything
explicitly about opportunistic pruning (I'm not sure if that term is
"public docs" level, so that's probably fine), and it doesn't scope
the claim to intermediate tuples in a HOT chain -- indeed the context
is the HOT feature generally.

This is why I discovered it: it says "indexes do not reference their
page item identifiers", which is manifestly not true when talking
about the root item, and in fact would defeat the whole purpose of HOT
(at least in an old-to-new chain like Postgres uses).

Assuming people can be convinced this is confusing (I realize you may
not be yet), I see two basic options:

1. Update this to discuss both intermediate tuples and root items
separately. This could entail either one larger paragraph or splitting
such that instead of "two optimizations" we say "three" optimizations.

2. Change "old versions" to something like "intermediate versions in a
series of updates".

I prefer some form of (1) since it more fully describes the behavior,
but we could tweak further for concision.

> > And when it's the
> > first heap-only tuple the "old version" would be the original version,
> > which would not be a heap-only tuple.
>
> The docs say "Old versions of updated rows can be completely removed
> during normal operation". Opportunistic pruning removes dead heap-only
> tuples completely, and makes their line pointers LP_UNUSED right away.
> But it can also entail removing storage for the original root item
> heap tuple, and making its line pointer LP_REDIRECT right away (not
> LP_DEAD or LP_UNUSED) at most once in the life of each HOT chain. So
> yeah, we're not quite limited to removing storage for heap-only tuples
> when pruning a HOT chain. Does that distinction really matter, though?

Given pageinspect can show you the original tuple still exists and
that the index still references it...I think it does.

I suppose very few people go checking that out, of course, but I'd
like to be precise.
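
For anyone who does want to check, a minimal pageinspect sketch (assuming
a table t whose block 0 holds a HOT-updated and since-pruned row):

create extension if not exists pageinspect;
select lp, lp_flags, t_ctid
from heap_page_items(get_raw_page('t', 0));

lp_flags = 2 is LP_REDIRECT: the root item's tuple storage is gone, but
the line pointer the index references remains in place.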

Regards,
James Coleman




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread James Coleman
On Fri, Sep 29, 2023 at 2:39 PM Peter Geoghegan  wrote:
>
> On Fri, Sep 29, 2023 at 11:04 AM Peter Geoghegan  wrote:
> > > But when a HOT update happens the entry in a (logically unchanged)
> > > index still points to the original heap tid, and that line item is
> > > updated with a pointer to the new line pointer in the same page.
> >
> > It's true that the original root heap tuple (which is never a
> > heap-only tuple) must have its line pointer changed from LP_NORMAL to
> > LP_REDIRECT the first time pruning takes place that affects its HOT
> > chain. But I don't think that referring to the root item as something
> > along the lines of "an obsolescent/old tuple's line pointer" is
> > particularly helpful.
>
> To be clear, the existing wording seems correct to me. Even heap-only
> tuples require line pointers. These line pointers are strictly
> guaranteed to never be *directly* referenced from indexes (if they are
> then we'll actually detect it and report data corruption on recent
> versions). The name "heap-only tuple" quite literally means that the
> tuple and its line pointer are only represented in the heap, and never
> in indexes.

Hmm, to my reading the issue is that "old versions" doesn't say
anything about "old HOT versions; it seems to be describing what
happens generally when a heap-only tuple is written -- which would
include the first time a heap-only tuple is written. And when it's the
first heap-only tuple the "old version" would be the original version,
which would not be a heap-only tuple.

I can work up a tweaked version of the patch that shows there are two
paths here (original tuple is being updated versus an intermediate
heap-only tuple is being updated); would you be willing to consider
that?

Thanks,
James Coleman




Re: Fix incorrect comment reference

2023-09-29 Thread James Coleman
On Fri, Sep 29, 2023 at 2:26 PM Bruce Momjian  wrote:
>
> On Mon, Jan 23, 2023 at 06:42:45PM -0500, James Coleman wrote:
> > On Mon, Jan 23, 2023 at 4:07 PM James Coleman  wrote:
> > >
> > > On Mon, Jan 23, 2023 at 3:41 PM Robert Haas  wrote:
> > > >
> > > > On Mon, Jan 23, 2023 at 3:19 PM James Coleman  wrote:
> > > > > On Mon, Jan 23, 2023 at 1:26 PM Robert Haas  
> > > > > wrote:
> > > > > > On Mon, Jan 23, 2023 at 8:31 AM James Coleman  
> > > > > > wrote:
> > > > > > > See the attached for a simple comment fix -- the referenced
> > > > > > > generate_useful_gather_paths call isn't in grouping_planner it's 
> > > > > > > in
> > > > > > > apply_scanjoin_target_to_paths.
> > > > > >
> > > > > > The intended reading of the comment is not clear. Is it telling you 
> > > > > > to
> > > > > > look at grouping_planner because that's where we
> > > > > > generate_useful_gather_paths, or is it telling you to look there to
> > > > > > see how we get the final target list together? If it's the former,
> > > > > > then your fix is correct. If the latter, it's fine as it is.
> > > > > >
> > > > > > The real answer is probably that some years ago both things happened
> > > > > > in that function. We've moved on from there, but I'm still not sure
> > > > > > what the most useful phrasing of the comment is.
> > > > >
> > > > > Yeah, almost certainly, and the comments just didn't keep up.
> > > > >
> > > > > Would you prefer something that notes both that the broader concern is
> > > > > happening via the grouping_planner() stage but still points to the
> > > > > proper callsite (so that people don't go looking for that confused)?
> > > >
> > > > I don't really have a strong view on what the best thing to do is. I
> > > > was just pointing out that the comment might not be quite so obviously
> > > > wrong as you were supposing.
> > >
> > > "Wrong" is certainly too strong; my apologies.
> > >
> > > I'm really just hoping to improve it for future readers to save them
> > > some confusion I had initially reading it.
> >
> > Updated patch attached.
>
> Patch applied.

Thanks!

Regards,
James Coleman




[DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread James Coleman
Hello,

While working on my talk for PGConf.NYC next week I came across this
bullet in the docs on heap only tuples:

> Old versions of updated rows can be completely removed during normal
> operation, including SELECTs, instead of requiring periodic vacuum
> operations. (This is possible because indexes do not reference their page
> item identifiers.)

But when a HOT update happens the entry in a (logically unchanged)
index still points to the original heap tid, and that line item is
updated with a pointer to the new line pointer in the same page.
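
This is observable with pageinspect (a sketch; foo_pkey is a hypothetical
primary-key index small enough that block 1 is its only leaf):

create extension if not exists pageinspect;
select itemoffset, ctid from bt_page_items('foo_pkey', 1);

After a HOT update the ctid values there still carry the original heap
TID; it's the heap line pointer at that TID that redirects to the new
version.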

Assuming I'm understanding this correctly, attached is a patch
correcting the description.

Thanks,
James Coleman


v1-0001-Correct-HOT-docs-to-account-for-LP_REDIRECT.patch
Description: Binary data


Re: Opportunistically pruning page before update

2023-09-26 Thread James Coleman
On Tue, Sep 5, 2023 at 1:40 PM Melanie Plageman
 wrote:
>
> On Wed, Jun 21, 2023 at 8:51 AM James Coleman  wrote:
> > While at PGCon I was chatting with Andres (and I think Peter G. and a
> > few others who I can't remember at the moment, apologies) and Andres
> > noted that while we opportunistically prune a page when inserting a
> > tuple (before deciding we need a new page) we don't do the same for
> > updates.
> >
> > Attached is a patch series to do the following:
> >
> > 0001: Make it possible to call heap_page_prune_opt already holding an
> > exclusive lock on the buffer.
> > 0002: Opportunistically prune pages on update when the current tuple's
> > page has no free space. If this frees up enough space, then we
> > continue to put the new tuple on that page; if not, then we take the
> > existing code path and get a new page.
>
> I've reviewed these patches and have questions.
>
> Under what conditions would this be exercised for UPDATE? Could you
> provide an example?
>
> With your patch applied, when I create a table, the first time I update
> it heap_page_prune_opt() will return before actually doing any pruning
> because the page prune_xid hadn't been set (it is set after pruning as
> well as later in heap_update() after RelationGetBufferForTuple() is
> called).
>
> I actually added an additional parameter to heap_page_prune() and
> heap_page_prune_opt() to identify if heap_page_prune() was called from
> RelationGetBufferForTuple() and logged a message when this was true.
> Running the test suite, I didn't see any UPDATEs executing
> heap_page_prune() from RelationGetBufferForTuple(). I did, however, see
> other statement types doing so (see RelationGetBufferForTuple()'s other
> callers). Was that intended?
>
> > I started to work on benchmarking this, but haven't had time to devote
> > properly to that, so I'm wondering if there's anyone who might be
> > interested in collaborating on that part.
>
> I'm interested in this feature and in helping with it/helping with
> benchmarking it, but I don't yet understand the design in its current
> form.

Hi Melanie,

Thanks for taking a look at this! Apologies for the long delay in
replying: I started to take a look at your questions earlier, and it
turned into more of a rabbit hole than I'd anticipated. I've since
been distracted by other things. So -- I don't have any conclusions
here yet, but I'm hoping at or after PGConf NYC that I'll be able to
dedicate the time this deserves.

Thanks,
James Coleman




Re: RFC: Logging plan of the running query

2023-09-05 Thread James Coleman
On Tue, Sep 5, 2023 at 9:59 AM torikoshia  wrote:
>
> On 2023-08-28 22:47, James Coleman wrote:
> > On Mon, Aug 28, 2023 at 3:01 AM torikoshia 
> > wrote:
> >>
> >> On 2023-08-26 21:03, James Coleman wrote:
> >> > On Fri, Aug 25, 2023 at 7:43 AM James Coleman  wrote:
> >> >>
> >> >> On Thu, Aug 17, 2023 at 10:02 AM torikoshia
> >> >>  wrote:
> >> >> >
> >> >> > On 2023-06-16 01:34, James Coleman wrote:
> >> >> > > Attached is v28
> >> >> > > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
> >> >> > > when necessary. Once built with those two patches I'm simply running
> >> >> > > `make check`.
> >> >> >
> >> >> > With v28-0001 and v28-0002 patch, I confirmed backend processes 
> >> >> > consume
> >> >> > huge
> >> >> > amount of memory and under some environments they were terminated by 
> >> >> > OOM
> >> >> > killer.
> >> >> >
> >> >> > This was because memory was allocated from existing memory contexts 
> >> >> > and
> >> >> > they
> >> >> > were not freed after ProcessLogQueryPlanInterrupt().
> >> >> > Updated the patch to use dedicated memory context for
> >> >> > ProcessLogQueryPlanInterrupt().
> >> >> >
> >> >> > Applying attached patch and v28-0002 patch, `make check` successfully
> >> >> > completed after 20min and 50GB of logs on my environment.
> >> >> >
> >> >> > >>> On 2023-06-15 01:48, James Coleman wrote:
> >> >> > >>> > The tests have been running since last night, but have been 
> >> >> > >>> > apparently
> >> >> > >>> > hung now for many hours.
> >> >> >
> >> >> > I don't know if this has anything to do with the hang you faced, but I
> >> >> > thought
> >> >> > it might be possible that the large amount of memory usage resulted in
> >> >> > swapping, which caused a significant delay in processing.
> >> >>
> >> >> Ah, yes, I think that could be a possible explanation. I was delaying
> >> >> on this thread because I wasn't comfortable with having caused an
> >> >> issue once (even if I couldn't easily reproduce) without at least some
> >> >> theory as to the cause (and a fix).
> >> >>
> >> >> > If possible, I would be very grateful if you could try to reproduce 
> >> >> > this
> >> >> > with
> >> >> > the v29 patch.
> >> >>
> >> >> I'll kick off some testing.
> >> >>
> >> >
> >> > I don't have time to investigate what's happening here, but 24 hours
> >> > later the first "make check" is still running, and at first glance it
> >> > seems to have the same behavior I'd seen that first time. The test
> >> > output is to this point:
> >> >
> >> > # parallel group (5 tests):  index_including create_view
> >> > index_including_gist create_index create_index_spgist
> >> > ok 66+ create_index26365 ms
> >> > ok 67+ create_index_spgist 27675 ms
> >> > ok 68+ create_view  1235 ms
> >> > ok 69+ index_including  1102 ms
> >> > ok 70+ index_including_gist 1633 ms
> >> > # parallel group (16 tests):  create_aggregate create_cast errors
> >> > roleattributes drop_if_exists hash_func typed_table create_am
> >> > infinite_recurse
> >> >
> >> > and it hasn't progressed past that point since at least ~16 hours ago
> >> > (the first several hours of the run I wasn't monitoring it).
> >> >
> >> > I haven't connected up gdb yet, and won't be able to until maybe
> >> > tomorrow, but here's the ps output for postgres processes that are
> >> > running:
> >> >
> >> > admin3213249  0.0  0.0 196824 20552 ?Ss   Aug25   0:00
> >> > /home/admin/postgresql-test/bin/postgres -D
> >> > /home/admin/postgresql-test-data
> >> > admin3213250  0.0  0.0 196964  72

Re: RFC: Logging plan of the running query

2023-08-28 Thread James Coleman
On Mon, Aug 28, 2023 at 3:01 AM torikoshia  wrote:
>
> On 2023-08-26 21:03, James Coleman wrote:
> > On Fri, Aug 25, 2023 at 7:43 AM James Coleman  wrote:
> >>
> >> On Thu, Aug 17, 2023 at 10:02 AM torikoshia
> >>  wrote:
> >> >
> >> > On 2023-06-16 01:34, James Coleman wrote:
> >> > > Attached is v28
> >> > > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
> >> > > when necessary. Once built with those two patches I'm simply running
> >> > > `make check`.
> >> >
> >> > With v28-0001 and v28-0002 patch, I confirmed backend processes consume
> >> > huge
> >> > amount of memory and under some environments they were terminated by OOM
> >> > killer.
> >> >
> >> > This was because memory was allocated from existing memory contexts and
> >> > they
> >> > were not freed after ProcessLogQueryPlanInterrupt().
> >> > Updated the patch to use dedicated memory context for
> >> > ProcessLogQueryPlanInterrupt().
> >> >
> >> > Applying attached patch and v28-0002 patch, `make check` successfully
> >> > completed after 20min and 50GB of logs on my environment.
> >> >
> >> > >>> On 2023-06-15 01:48, James Coleman wrote:
> >> > >>> > The tests have been running since last night, but have been 
> >> > >>> > apparently
> >> > >>> > hung now for many hours.
> >> >
> >> > I don't know if this has anything to do with the hang you faced, but I
> >> > thought
> >> > it might be possible that the large amount of memory usage resulted in
> >> > swapping, which caused a significant delay in processing.
> >>
> >> Ah, yes, I think that could be a possible explanation. I was delaying
> >> on this thread because I wasn't comfortable with having caused an
> >> issue once (even if I couldn't easily reproduce) without at least some
> >> theory as to the cause (and a fix).
> >>
> >> > If possible, I would be very grateful if you could try to reproduce this
> >> > with
> >> > the v29 patch.
> >>
> >> I'll kick off some testing.
> >>
> >
> > I don't have time to investigate what's happening here, but 24 hours
> > later the first "make check" is still running, and at first glance it
> > seems to have the same behavior I'd seen that first time. The test
> > output is to this point:
> >
> > # parallel group (5 tests):  index_including create_view
> > index_including_gist create_index create_index_spgist
> > ok 66+ create_index26365 ms
> > ok 67+ create_index_spgist 27675 ms
> > ok 68+ create_view  1235 ms
> > ok 69+ index_including  1102 ms
> > ok 70+ index_including_gist 1633 ms
> > # parallel group (16 tests):  create_aggregate create_cast errors
> > roleattributes drop_if_exists hash_func typed_table create_am
> > infinite_recurse
> >
> > and it hasn't progressed past that point since at least ~16 hours ago
> > (the first several hours of the run I wasn't monitoring it).
> >
> > I haven't connected up gdb yet, and won't be able to until maybe
> > tomorrow, but here's the ps output for postgres processes that are
> > running:
> >
> > admin3213249  0.0  0.0 196824 20552 ?Ss   Aug25   0:00
> > /home/admin/postgresql-test/bin/postgres -D
> > /home/admin/postgresql-test-data
> > admin3213250  0.0  0.0 196964  7284 ?Ss   Aug25   0:00
> > postgres: checkpointer
> > admin3213251  0.0  0.0 196956  4276 ?Ss   Aug25   0:00
> > postgres: background writer
> > admin3213253  0.0  0.0 196956  8600 ?Ss   Aug25   0:00
> > postgres: walwriter
> > admin3213254  0.0  0.0 198424  7316 ?Ss   Aug25   0:00
> > postgres: autovacuum launcher
> > admin3213255  0.0  0.0 198412  5488 ?Ss   Aug25   0:00
> > postgres: logical replication launcher
> > admin3237967  0.0  0.0   2484   516 pts/4S+   Aug25   0:00
> > /bin/sh -c echo "# +++ regress check in src/test/regress +++" &&
> > PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH"
> > LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgr

Re: RFC: Logging plan of the running query

2023-08-26 Thread James Coleman
On Fri, Aug 25, 2023 at 7:43 AM James Coleman  wrote:
>
> On Thu, Aug 17, 2023 at 10:02 AM torikoshia  
> wrote:
> >
> > On 2023-06-16 01:34, James Coleman wrote:
> > > Attached is v28
> > > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
> > > when necessary. Once built with those two patches I'm simply running
> > > `make check`.
> >
> > With v28-0001 and v28-0002 patch, I confirmed backend processes consume
> > huge
> > amount of memory and under some environments they were terminated by OOM
> > killer.
> >
> > This was because memory was allocated from existing memory contexts and
> > they
> > were not freed after ProcessLogQueryPlanInterrupt().
> > Updated the patch to use dedicated memory context for
> > ProcessLogQueryPlanInterrupt().
> >
> > Applying attached patch and v28-0002 patch, `make check` successfully
> > completed after 20min and 50GB of logs on my environment.
> >
> > >>> On 2023-06-15 01:48, James Coleman wrote:
> > >>> > The tests have been running since last night, but have been apparently
> > >>> > hung now for many hours.
> >
> > I don't know if this has anything to do with the hang you faced, but I
> > thought
> > it might be possible that the large amount of memory usage resulted in
> > swapping, which caused a significant delay in processing.
>
> Ah, yes, I think that could be a possible explanation. I was delaying
> on this thread because I wasn't comfortable with having caused an
> issue once (even if I couldn't easily reproduce) without at least some
> theory as to the cause (and a fix).
>
> > If possible, I would be very grateful if you could try to reproduce this
> > with
> > the v29 patch.
>
> I'll kick off some testing.
>

I don't have time to investigate what's happening here, but 24 hours
later the first "make check" is still running, and at first glance it
seems to have the same behavior I'd seen that first time. The test
output is to this point:

# parallel group (5 tests):  index_including create_view
index_including_gist create_index create_index_spgist
ok 66+ create_index26365 ms
ok 67+ create_index_spgist 27675 ms
ok 68+ create_view  1235 ms
ok 69+ index_including  1102 ms
ok 70+ index_including_gist 1633 ms
# parallel group (16 tests):  create_aggregate create_cast errors
roleattributes drop_if_exists hash_func typed_table create_am
infinite_recurse

and it hasn't progressed past that point since at least ~16 hours ago
(the first several hours of the run I wasn't monitoring it).

I haven't connected up gdb yet, and won't be able to until maybe
tomorrow, but here's the ps output for postgres processes that are
running:

admin3213249  0.0  0.0 196824 20552 ?Ss   Aug25   0:00
/home/admin/postgresql-test/bin/postgres -D
/home/admin/postgresql-test-data
admin3213250  0.0  0.0 196964  7284 ?Ss   Aug25   0:00
postgres: checkpointer
admin3213251  0.0  0.0 196956  4276 ?Ss   Aug25   0:00
postgres: background writer
admin3213253  0.0  0.0 196956  8600 ?Ss   Aug25   0:00
postgres: walwriter
admin3213254  0.0  0.0 198424  7316 ?Ss   Aug25   0:00
postgres: autovacuum launcher
admin3213255  0.0  0.0 198412  5488 ?Ss   Aug25   0:00
postgres: logical replication launcher
admin3237967  0.0  0.0   2484   516 pts/4S+   Aug25   0:00
/bin/sh -c echo "# +++ regress check in src/test/regress +++" &&
PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH"
LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/lib"
INITDB_TEMPLATE='/home/admin/postgres'/tmp_install/initdb-template
../../../src/test/regress/pg_regress --temp-instance=./tmp_check
--inputdir=. --bindir= --dlpath=. --max-concurrent-tests=20
--schedule=./parallel_schedule
admin3237973  0.0  0.0 197880 20688 pts/4S+   Aug25   0:00
postgres -D /home/admin/postgres/src/test/regress/tmp_check/data -F -c
listen_addresses= -k /tmp/pg_regress-7mmGUa
admin3237976  0.0  0.1 198332 44608 ?Ss   Aug25   0:00
postgres: checkpointer
admin3237977  0.0  0.0 198032  4640 ?Ss   Aug25   0:00
postgres: background writer
admin3237979  0.0  0.0 197880  8580 ?Ss   Aug25   0:00
postgres: walwriter
admin3237980  0.0  0.0 199484  7608 ?Ss   Aug25   0:00
postgres: autovacuum launcher
admin3237981  0.0  0.0 199460  5488 ?Ss   Aug25   0:00
postgres: logical replication launcher
admin3243644  0.0  0.2 252400 74656 ?Ss   Au

Re: RFC: Logging plan of the running query

2023-08-25 Thread James Coleman
On Thu, Aug 17, 2023 at 10:02 AM torikoshia  wrote:
>
> On 2023-06-16 01:34, James Coleman wrote:
> > Attached is v28
> > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
> > when necessary. Once built with those two patches I'm simply running
> > `make check`.
>
> With v28-0001 and v28-0002 patch, I confirmed backend processes consume
> huge
> amount of memory and under some environments they were terminated by OOM
> killer.
>
> This was because memory was allocated from existing memory contexts and
> they
> were not freed after ProcessLogQueryPlanInterrupt().
> Updated the patch to use dedicated memory context for
> ProcessLogQueryPlanInterrupt().
>
> Applying attached patch and v28-0002 patch, `make check` successfully
> completed after 20min and 50GB of logs on my environment.
>
> >>> On 2023-06-15 01:48, James Coleman wrote:
> >>> > The tests have been running since last night, but have been apparently
> >>> > hung now for many hours.
>
> I don't know if this has anything to do with the hang you faced, but I
> thought
> it might be possible that the large amount of memory usage resulted in
> swapping, which caused a significant delay in processing.

Ah, yes, I think that could be a possible explanation. I was delaying
on this thread because I wasn't comfortable with having caused an
issue once (even if I couldn't easily reproduce) without at least some
theory as to the cause (and a fix).

> If possible, I would be very grateful if you could try to reproduce this
> with
> the v29 patch.

I'll kick off some testing.

Thanks,
James Coleman




Re: pgindent (probably my missing something obvious)

2023-07-03 Thread James Coleman
On Mon, Jul 3, 2023 at 9:20 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > This is the first time I've run pgindent on my current machine, and it
> > doesn't seem to be making any modifications to source files. For
> > example this command:
>
> > ./src/tools/pgindent/pgindent src/backend/optimizer/path/allpaths.c
>
> > leaves the allpaths.c file unchanged despite my having some very long
> > function calls.
>
> "Long function calls" aren't necessarily something pgindent would
> change.  Have you tried intentionally misindenting some lines?
>
> regards, tom lane

Hmm, yeah, that works.

My heuristic for what pgindent changes must be wrong. The long
function calls (and 'if' conditions) seem obviously out of place to my
eyes with the surrounding code. Does that mean the surrounding code
was just hand-prettified?

Thanks,
James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2023-07-03 Thread James Coleman
On Sun, Jun 11, 2023 at 10:23 PM James Coleman  wrote:
>
> ...
> > And while trying the v9 patch I came across a crash with the query
> > below.
> >
> > set min_parallel_table_scan_size to 0;
> > set parallel_setup_cost to 0;
> > set parallel_tuple_cost to 0;
> >
> > explain (costs off)
> > select * from pg_description t1 where objoid in
> > (select objoid from pg_description t2 where t2.description = 
> > t1.description);
> >QUERY PLAN
> > 
> >  Seq Scan on pg_description t1
> >Filter: (SubPlan 1)
> >SubPlan 1
> >  ->  Gather
> >Workers Planned: 2
> >->  Parallel Seq Scan on pg_description t2
> >  Filter: (description = t1.description)
> > (7 rows)
> >
> > select * from pg_description t1 where objoid in
> > (select objoid from pg_description t2 where t2.description = 
> > t1.description);
> > WARNING:  terminating connection because of crash of another server process
> >
> > Seems something is wrong when extracting the argument from the Param in
> > parallel worker.
>
> With what I'm trying to change I don't think this plan should ever be
> generated since it means we'd have to pass a param from the outer seq
> scan into the parallel subplan, which we can't do (currently).
>
> I've attached the full backtrace to the email, but as you hinted at
> the parallel worker is trying to get the param (in this case
> detoasting it), but the param doesn't exist on the worker, so it seg
> faults. Looking at this further I think there's an existing test case
> that exposes the misplanning here (the one right under the comment
> "Parallel Append is not to be used when the subpath depends on the
> outer param" in select_parallel.sql), but it doesn't seg fault because
> the param is an integer, doesn't need to be detoasted, and therefore
> (I think) we skate by (but probably with wrong results in depending on
> the dataset).
>
> Interestingly this is one of the existing test queries my original
> patch approach didn't change, so this gives me something specific to
> work with improving the path. Thanks for testing this and bringing
> this to my attention!

Here's what I've found debugging this:

There's only a single gather path ever created when planning this
query, making it easy to know which one is the problem. That gather
path is created with this stacktrace:

frame #0: 0x000105291590
postgres`create_gather_path(root=0x00013081ae78,
rel=0x00013080c8e8, subpath=0x00013081c080,
target=0x00013081c8c0, required_outer=0x,
rows=0x) at pathnode.c:1971:2
frame #1: 0x000105208e54
postgres`generate_gather_paths(root=0x00013081ae78,
rel=0x00013080c8e8, override_rows=false) at allpaths.c:3097:4
frame #2: 0x0001052090ec
postgres`generate_useful_gather_paths(root=0x00013081ae78,
rel=0x00013080c8e8, override_rows=false) at allpaths.c:3241:2
frame #3: 0x000105258754
postgres`apply_scanjoin_target_to_paths(root=0x00013081ae78,
rel=0x00013080c8e8, scanjoin_targets=0x00013081c978,
scanjoin_targets_contain_srfs=0x,
scanjoin_target_parallel_safe=true, tlist_same_exprs=true) at
planner.c:7696:3
frame #4: 0x0001052533cc
postgres`grouping_planner(root=0x00013081ae78, tuple_fraction=0.5)
at planner.c:1611:3
frame #5: 0x000105251e9c
postgres`subquery_planner(glob=0x0001308188d8,
parse=0x00013080caf8, parent_root=0x00013080cc38,
hasRecursion=false, tuple_fraction=0.5) at planner.c:1062:2
frame #6: 0x00010526b134
postgres`make_subplan(root=0x00013080cc38,
orig_subquery=0x00013080ff58, subLinkType=ANY_SUBLINK,
subLinkId=0, testexpr=0x00013080d848, isTopQual=true) at
subselect.c:221:12
frame #7: 0x000105268b8c
postgres`process_sublinks_mutator(node=0x00013080d6d8,
context=0x00016b0998f8) at subselect.c:1950:10
frame #8: 0x000105268ad8
postgres`SS_process_sublinks(root=0x00013080cc38,
expr=0x00013080d6d8, isQual=true) at subselect.c:1923:9
frame #9: 0x0001052527b8
postgres`preprocess_expression(root=0x00013080cc38,
expr=0x00013080d6d8, kind=0) at planner.c:1169:10
frame #10: 0x000105252954
postgres`preprocess_qual_conditions(root=0x00013080cc38,
jtnode=0x00013080d108) at planner.c:1214:14
frame #11: 0x000105251580
postgres`subquery_planner(glob=0x0001308188d8,
parse=0x000137010d68, parent_root=0x,
hasRecursion=false, tuple_fraction=0) at planner.c:832:2
frame #12: 0x00010525042c
postgres`standard_planner(parse=0x000137010d68,
query_string="explain (costs off)\nsel

pgindent (probably my missing something obvious)

2023-07-03 Thread James Coleman
Hello,

This is the first time I've run pgindent on my current machine, and it
doesn't seem to be making any modifications to source files. For
example this command:

./src/tools/pgindent/pgindent src/backend/optimizer/path/allpaths.c

leaves the allpaths.c file unchanged despite my having some very long
function calls. I've downloaded the latest typedefs list, but I
haven't added any types anyway.

What obvious thing am I missing?

Thanks,
James Coleman




Re: Analyze on table creation?

2023-06-27 Thread James Coleman
On Mon, Jun 26, 2023 at 4:16 PM James Coleman  wrote:
>
> On Mon, Jun 26, 2023 at 4:00 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-06-26 13:40:49 -0400, James Coleman wrote:
> > > Have we ever discussed running an analyze immediately after creating a 
> > > table?
> >
> > That doesn't make a whole lot of sense to me - we could just insert the
> > constant stats we wanted in that case.
> >
>
> I thought that was implicit in that, but fair enough :)
>
> > > Consider the following:
> > >
> > > create table stats(i int, t text not null);
> > > explain select * from stats;
> > >Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
> > > analyze stats;
> > > explain select * from stats;
> > >Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> > >
> > > Combined with rapidly increasing error margin on row estimates when
> > > adding joins, this means that a query joining a bunch of empty tables
> > > when a database first starts up can result in some pretty wild plan
> > > costs.
> >
> > The issue is that the table stats are likely to go quickly out of date in
> > that case; even a handful of inserts (which wouldn't trigger
> > autovacuum analyzing) would lead to the "0 rows" stats causing very bad 
> > plans.
> >
>
> It's not obvious to me (as noted elsewhere in the thread) which is
> worse: a bunch of JOINs on empty tables can result in (specific
> example) plans with cost=15353020, and then trigger JIT, and...here we
> collide with my other thread about JIT [1].
>
> Regards,
> James Coleman
>
> 1: 
> https://www.postgresql.org/message-id/CAAaqYe-g-Q0Mm5H9QLcu8cHeMwok%2BHaxS4-UC9Oj3bK3a5jPvg%40mail.gmail.com

Thinking about this a bit more: it seems like what we're missing is either:

1. A heuristic for "this table will probably remain empty", or
2. A way to invalidate "0 rows" stats more quickly on even a handful of inserts.

I think one of those (ignoring questions about "how" for now) would
solve both cases?
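
For context on (2) as things stand: auto-analyze only fires once
n_mod_since_analyze exceeds autovacuum_analyze_threshold (default 50)
plus autovacuum_analyze_scale_factor (default 0.1) times reltuples, so
a table the planner believes is empty needs on the order of 50 modified
rows before its stats self-correct. A per-table workaround sketch (the
values are purely illustrative):

alter table stats set (autovacuum_analyze_threshold = 1,
                       autovacuum_analyze_scale_factor = 0);
-- now a couple of inserts are enough to schedule an analyze on the
-- next autovacuum pass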

Regards,
James Coleman




Re: Analyze on table creation?

2023-06-26 Thread James Coleman
On Mon, Jun 26, 2023 at 4:00 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-06-26 13:40:49 -0400, James Coleman wrote:
> > Have we ever discussed running an analyze immediately after creating a 
> > table?
>
> That doesn't make a whole lot of sense to me - we could just insert the
> constant stats we wanted in that case.
>

I thought that was implicit in that, but fair enough :)

> > Consider the following:
> >
> > create table stats(i int, t text not null);
> > explain select * from stats;
> >Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
> > analyze stats;
> > explain select * from stats;
> >Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> >
> > Combined with rapidly increasing error margin on row estimates when
> > adding joins, this means that a query joining a bunch of empty tables
> > when a database first starts up can result in some pretty wild plan
> > costs.
>
> The issue is that the table stats are likely to go quickly out of date in
> that case; even a handful of inserts (which wouldn't trigger
> autovacuum analyzing) would lead to the "0 rows" stats causing very bad plans.
>

It's not obvious to me (as noted elsewhere in the thread) which is
worse: a bunch of JOINs on empty tables can result in (specific
example) plans with cost=15353020, and then trigger JIT, and...here we
collide with my other thread about JIT [1].

Regards,
James Coleman

1: 
https://www.postgresql.org/message-id/CAAaqYe-g-Q0Mm5H9QLcu8cHeMwok%2BHaxS4-UC9Oj3bK3a5jPvg%40mail.gmail.com




Re: Analyze on table creation?

2023-06-26 Thread James Coleman
cc'ing Tom because I'm curious if he's willing to provide some greater
context on the commit in question.

On Mon, Jun 26, 2023 at 2:16 PM Pavel Stehule  wrote:
>
>
>
> po 26. 6. 2023 v 19:48 odesílatel James Coleman  napsal:
>>
>> On Mon, Jun 26, 2023 at 1:45 PM Pavel Stehule  
>> wrote:
>> >
>> >
>> >
>> > po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule  
>> > napsal:
>> >>
>> >> Hi
>> >>
>> >> po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:
>> >>>
>> >>> Hello,
>> >>>
>> >>> Have we ever discussed running an analyze immediately after creating a 
>> >>> table?
>> >>>
>> >>> Consider the following:
>> >>>
>> >>> create table stats(i int, t text not null);
>> >>> explain select * from stats;
>> >>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
>> >>> analyze stats;
>> >>> explain select * from stats;
>> >>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>> >>>
>> >>> Combined with rapidly increasing error margin on row estimates when
>> >>> adding joins, this means that a query joining a bunch of empty tables
>> >>> when a database first starts up can result in some pretty wild plan
>> >>> costs.
>> >>>
>> >>> This feels like a simple idea to me, and so I assume people have
>> >>> considered it before. If so, I'd like to understand why the conclusion
>> >>> was not to do it, or, alternatively if it's a lack of tuits.
>> >>
>> >>
>> I like this. On the other hand, the described behaviour is designed to 
>> ensure backward compatibility.
>> >
>> >
> if you break this backward compatibility, then the immediate ANALYZE is not 
>> > necessary
>>
>> I don't follow what backwards compatibility you're referencing. Could
>> you expand on that?
>
>
> Originally, until the table had at least one row, PostgreSQL assumed it 
> had 10 pages. That was changed in PostgreSQL 14.
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3d351d916b20534f973eda760cde17d96545d4c4
>

From that commit message:
> Historically, we've considered the state with relpages and reltuples
> both zero as indicating that we do not know the table's tuple density.
> This is problematic because it's impossible to distinguish "never yet
> vacuumed" from "vacuumed and seen to be empty".  In particular, a user
> cannot use VACUUM or ANALYZE to override the planner's normal heuristic
> that an empty table should not be believed to be empty because it is
> probably about to get populated.  That heuristic is a good safety
> measure, so I don't care to abandon it, but there should be a way to
> override it if the table is indeed intended to stay empty.

So that implicitly provides our reasoning for not analyzing up-front
on table creation.

I haven't thought about this too deeply yet, but it seems plausible to
me that the dangers of overestimating row count here (at minimum in
queries like I described with lots of joins) are higher than the
dangers of underestimating, which we would do if we believed the table
was empty. One critical question would be how fast we can assume the
table will be auto-analyzed (i.e., how fast the underestimate would be
corrected).
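
For what it's worth, the heuristic and its override are easy to see on
PostgreSQL 14+ (the table name is mine; estimates assume a default
configuration):

create table maybe_empty (i int, t text not null);
select reltuples from pg_class where relname = 'maybe_empty'; -- -1, unknown
explain select * from maybe_empty;  -- assumes ~10 pages: rows=1270
vacuum maybe_empty;
select reltuples from pg_class where relname = 'maybe_empty'; -- 0, known empty
explain select * from maybe_empty;  -- now rows=1, the clamped minimum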

Regards,
James Coleman




Re: Analyze on table creation?

2023-06-26 Thread James Coleman
On Mon, Jun 26, 2023 at 1:45 PM Pavel Stehule  wrote:
>
>
>
> po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule  
> napsal:
>>
>> Hi
>>
>> po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:
>>>
>>> Hello,
>>>
>>> Have we ever discussed running an analyze immediately after creating a 
>>> table?
>>>
>>> Consider the following:
>>>
>>> create table stats(i int, t text not null);
>>> explain select * from stats;
>>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
>>> analyze stats;
>>> explain select * from stats;
>>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>>>
>>> Combined with rapidly increasing error margin on row estimates when
>>> adding joins, this means that a query joining a bunch of empty tables
>>> when a database first starts up can result in some pretty wild plan
>>> costs.
>>>
>>> This feels like a simple idea to me, and so I assume people have
>>> considered it before. If so, I'd like to understand why the conclusion
>>> was not to do it, or, alternatively if it's a lack of tuits.
>>
>>
>> I like this. On the other hand, the described behaviour is designed to 
>> ensure backward compatibility.
>
>
> if you break this backward compatibility, then the immediate ANALYZE is not 
> necessary

I don't follow what backwards compatibility you're referencing. Could
you expand on that?

Regards,
James Coleman




Analyze on table creation?

2023-06-26 Thread James Coleman
Hello,

Have we ever discussed running an analyze immediately after creating a table?

Consider the following:

create table stats(i int, t text not null);
explain select * from stats;
   Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
analyze stats;
explain select * from stats;
   Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)

Combined with rapidly increasing error margin on row estimates when
adding joins, this means that a query joining a bunch of empty tables
when a database first starts up can result in some pretty wild plan
costs.
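
To make the compounding concrete, a sketch against the same empty table
(the estimates are from a default configuration and will vary):

explain select *
  from stats s1
  join stats s2 using (i)
  join stats s3 using (i);
-- each empty "stats" instance is assumed to hold 1270 rows, and the
-- join selectivity math multiplies those guesses at every level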

This feels like a simple idea to me, and so I assume people have
considered it before. If so, I'd like to understand why the conclusion
was not to do it, or, alternatively if it's a lack of tuits.

Regards,
James Coleman




Re: Stampede of the JIT compilers

2023-06-25 Thread James Coleman
On Sun, Jun 25, 2023 at 5:10 AM Michael Banck  wrote:
>
> Hi,
>
> On Sat, Jun 24, 2023 at 01:54:53PM -0400, Tom Lane wrote:
> > I don't know whether raising the default would be enough to fix that
> > in a nice way, and I certainly don't pretend to have a specific value
> > to offer.  But it's undeniable that we have a serious problem here,
> > to the point where JIT is a net negative for quite a few people.
>
> Some further data: to my knowledge, most major managed postgres
> providers disable jit for their users. Azure certainly does, but I don't
> have a Google Cloud SQL or RDS instance running right to verify their
> settings. I do seem to remember that they did as well though, at least a
> while back.
>
>
> Michael

I believe it's off by default in Aurora Postgres also.

Regards,
James Coleman




Re: Stampede of the JIT compilers

2023-06-24 Thread James Coleman
On Sat, Jun 24, 2023 at 8:14 PM David Rowley  wrote:
>
> On Sun, 25 Jun 2023 at 05:54, Tom Lane  wrote:
> >
> > James Coleman  writes:
> > > On Sat, Jun 24, 2023 at 7:40 AM Tomas Vondra
> > >  wrote:
> > >> On 6/24/23 02:33, David Rowley wrote:
> > >>> On Sat, 24 Jun 2023 at 02:28, James Coleman  wrote:
> > >>>> There are a couple of issues here. I'm sure it's been discussed
> > >>>> before, and it's not the point of my thread, but I can't help but note
> > >>>> that the default value of jit_above_cost of 100000 seems absurdly low.
> > >>>> On good hardware like we have, even well-planned queries with costs
> > >>>> well above that won't take as long as JIT compilation does.
> >
> > >>> It would be good to know your evidence for thinking it's too low.
> >
> > > It's definitely possible that I stated this much more emphatically
> > > than I should have -- it was coming out of my frustration with this
> > > situation after all.
> >
> > I think there is *plenty* of evidence that it is too low, or at least
> > that for some reason we are too willing to invoke JIT when the result
> > is to make the overall cost of a query far higher than it is without.
>
> I've seen plenty of other reports and I do agree there is a problem,
> but I think you're jumping to conclusions in this particular case.
> I've seen nothing here that couldn't equally indicate the planner
> didn't overestimate the costs or some row estimate for the given
> query.  The solution to those problems shouldn't be bumping up the
> default JIT thresholds it could be to fix the costs or tune/add
> statistics to get better row estimates.
>
> I don't think it's too big an ask to see a few more details so that we
> can confirm what the actual problem is.

I did say in the original email "encountered a situation where a
misplanned query (analyzing helped with this, but I think the issue is
still relevant)".

I'll look at specifics again on Monday, but what I do remember is that
there were a lot of joins, and we already know we have cases where
those are planned poorly too (even absent bad stats).

What I wanted to get at more broadly here was thinking along the lines
of how to prevent the misplanning from causing such a disaster.

Regards,
James Coleman




Re: Stampede of the JIT compilers

2023-06-24 Thread James Coleman
On Sat, Jun 24, 2023 at 1:54 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > In that context capping the number of backends compiling, particularly
> > where plans (and JIT?) might be cached, could well save us (depending
> > on workload).
>
> TBH I do not find this proposal attractive in the least.  We have
> a problem here even when you consider a single backend.  If we fixed
> that, so that we don't invoke JIT unless it really helps, then it's
> not going to help less just because you have a lot of backends.
> Plus, the overhead of managing a system-wide limit is daunting.
>
> regards, tom lane

I'm happy to withdraw that particular idea. My mental model was along
the lines "this is a startup cost, and then we'll have it cached, so
the higher than expected cost won't matter as much when the system
settles down", and in that scenario limiting the size of the herd can
make sense.

But that's not the broader problem, so...

Regards,
James Coleman




Re: Stampede of the JIT compilers

2023-06-24 Thread James Coleman
On Sat, Jun 24, 2023 at 7:40 AM Tomas Vondra
 wrote:
>
>
>
> On 6/24/23 02:33, David Rowley wrote:
> > On Sat, 24 Jun 2023 at 02:28, James Coleman  wrote:
> >> There are a couple of issues here. I'm sure it's been discussed
> >> before, and it's not the point of my thread, but I can't help but note
> >> that the default value of jit_above_cost of 100000 seems absurdly low.
> >> On good hardware like we have, even well-planned queries with costs
> >> well above that won't take as long as JIT compilation does.
> >
> > It would be good to know your evidence for thinking it's too low.

It's definitely possible that I stated this much more emphatically
than I should have -- it was coming out of my frustration with this
situation after all.

I think, though, that my later comments here will provide some
philosophical justification for it.

> > The main problem I see with it is that the costing does not account
> > for how many expressions will be compiled. It's quite different to
> > compile JIT expressions for a query to a single table with a simple
> > WHERE clause vs some query with many joins which scans a partitioned
> > table with 1000 partitions, for example.
> >
>
> I think it's both - as explained by James, there are queries with much
> higher cost, but the JIT compilation takes much more than just running
> the query without JIT. So a 100k cost difference is clearly not
> sufficient to make up for the extra JIT compilation cost.
>
> But it's true that's because the JIT costing is very crude, and there's
> little effort to account for how expensive the compilation will be (say,
> how many expressions, ...).
>
> IMHO there's no "good" default that wouldn't hurt an awful lot of cases.
>
> There's also a lot of bias - people are unlikely to notice/report cases
> when the JIT (including costing) works fine. But they sure are annoyed
> when it makes the wrong choice.
>
> >> But on the topic of the thread: I'd like to know if anyone has ever
> >> considered implemented a GUC/feature like
> >> "max_concurrent_jit_compilations" to cap the number of backends that
> >> may be compiling a query at any given point so that we avoid an
> >> optimization from running amok and consuming all of a servers
> >> resources?
> >
> > Why do the number of backends matter?  JIT compilation consumes the
> > same CPU resources that the JIT compilation is meant to save.  If the
> > JIT compilation in your query happened to be a net win rather than a
> > net loss in terms of CPU usage, then why would
> > max_concurrent_jit_compilations be useful? It would just restrict us
> > on what we could save. This idea just covers up the fact that the JIT
> > costing is disconnected from reality.  It's a bit like trying to tune
> > your radio with the volume control.
> >
>
> Yeah, I don't quite get this point either. If JIT for a given query
> helps (i.e. makes execution shorter), it'd be harmful to restrict the
> maximum number of concurrent compilations. It we just disable JIT after
> some threshold is reached, that'd make queries longer and just made the
> pileup worse.

My thought process here is that given the poor modeling of JIT costing
you've both described that we're likely to estimate the cost of "easy"
JIT compilation acceptably well but also likely to estimate "complex"
JIT compilation far lower than actual cost.

Another way of saying this is that our range of JIT compilation costs
may well be fine on the bottom end but clamped on the high end, and
that means that our failure modes will tend towards the worst
mis-costings being the most painful (e.g., 2s compilation time for a
100ms query). This is even more the case in an OLTP system where the
majority of queries are already known to be quite fast.

In that context capping the number of backends compiling, particularly
where plans (and JIT?) might be cached, could well save us (depending
on workload).

That being said, I could imagine an alternative approach solving a
similar problem -- a way of exiting early from compilation if it takes
longer than we expect.

> If it doesn't help for a given query, we shouldn't be doing it at all.
> But that should be based on better costing, not some threshold.
>
> In practice there'll be a mix of queries where JIT does/doesn't help,
> and this threshold would just arbitrarily (and quite unpredictably)
> enable/disable costing, making it yet harder to investigate slow queries
> (as if we didn't have enough trouble with that already).
>
> > I think the JIT costs would be better taking into account how useful
> > each expression will be to JIT compile.  There were some ideas thrown
> > around in [1].
> >
>
> +1 to that

That does sound like an improvement.

One thing about our JIT that is different from e.g. browser JS engine
JITing is that we don't substitute in the JIT code "on the fly" while
execution is already underway. That'd be another, albeit quite
difficult, way to solve these issues.

Regards,
James Coleman




Stampede of the JIT compilers

2023-06-23 Thread James Coleman
Hello,

We recently brought online a new database cluster, and in the course
of ramping up traffic to it encountered a situation where a misplanned
query (analyzing helped with this, but I think the issue is still
relevant) resulted in that query being compiled with JIT, and soon a
large number of backends were running that same shape of query, all of
them JIT compiling it. Since each JIT compilation took ~2s, this
starved the server of resources.

There are a couple of issues here. I'm sure it's been discussed
before, and it's not the point of my thread, but I can't help but note
that the default value of jit_above_cost of 100000 seems absurdly low.
On good hardware like we have, even well-planned queries with costs
well above that won't take as long as JIT compilation does.
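
For illustration, here's one way to check whether JIT is a net loss for
a given query (the table is a stand-in, and the timings are entirely
hardware-dependent):

set jit = on;
set jit_above_cost = 100000;  -- the stock default
explain (analyze) select sum(i) from big_table;
-- the JIT section at the bottom of the output reports Functions,
-- Options, and Timing; if "Timing: Total" rivals the execution time
-- itself, the threshold is too low for this workload, and raising it
-- is a blunt per-session mitigation:
set jit_above_cost = 1000000;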

But on the topic of the thread: I'd like to know if anyone has ever
considered implemented a GUC/feature like
"max_concurrent_jit_compilations" to cap the number of backends that
may be compiling a query at any given point so that we avoid an
optimization from running amok and consuming all of a servers
resources?

Regards,
James Coleman




Re: Memory leak in incremental sort re-scan

2023-06-21 Thread James Coleman
On Thu, Jun 15, 2023 at 6:35 PM Tomas Vondra
 wrote:
>
>
>
> On 6/15/23 22:36, Tom Lane wrote:
> > Tomas Vondra  writes:
> >> On 6/15/23 22:11, Tom Lane wrote:
> >>> I see zero leakage in that example after applying the attached quick
> >>> hack.  (It might be better to make the check in the caller, or to just
> >>> move the call to ExecInitIncrementalSort.)
> >
> >> Thanks for looking. Are you planning to work on this and push the fix,
> >> or do you want me to finish this up?
> >
> > I'm happy to let you take it -- got lots of other stuff on my plate.
> >
>
> OK, will do.

I think the attached is enough to fix it -- rather than nulling out
the sort states in rescan, we can reset them (as the comment says),
but not set them to null (we also have the same mistake with
presorted_keys). That avoids unnecessary recreation of the sort
states, and it also fixes the problem Tom noted: the call to
preparePresortedCols() is already guarded by a test on fullsort_state
being NULL, so with this change we also won't unnecessarily redo that
work.

Regards,
James Coleman


v2-0001-Fix-memory-leak-in-incremental-sort-rescan.patch
Description: Binary data


Re: Use of additional index columns in rows filtering

2023-06-21 Thread James Coleman
On Wed, Jun 21, 2023 at 11:28 AM Tomas Vondra
 wrote:
>
>
>
> On 6/21/23 14:45, James Coleman wrote:
> > Hello,
> >
> > I've cc'd Jeff Davis on this due to a conversation we had at PGCon
> > about applying filters on index tuples during index scans.
> >
> > I've also cc'd Andres Freund because I think this relates to his
> > musing in [1] that:
> >> One thing I have been wondering around this is whether we should not have
> >> split the code for IOS and plain indexscans...
> >
> > I think I remember Peter Geoghegan also wondering (I can't remember if
> > this was in conversation at PGCon about index skip scans or in a
> > hackers thread) about how we compose these various index scan
> > optimizations.
> >
> > To be certain this is probably a thing to tackle as a follow-on to
> > this patch, but it does seem to me that what we are implicitly
> > realizing is that (unlike with bitmap scans, I think) it doesn't
> > really make a lot of conceptual sense to have index only scans be a
> > separate node from index scans. Instead it's likely better to consider
> > it an optimization to index scans that can dynamically kick in when
> > it's able to be of use. That would allow it to compose with e.g.
> > prefetching in the aforelinked thread. At the very least we would need
> > pragmatic (e.g., cost of dynamically applying optimizations) rather
> > than conceptual reasons to argue they should continue to be separate.
> >
>
> I agree it seems a bit weird to have IOS as a separate node. In a way, I
> think there are two dimensions for "index-only" scans - which pages can
> be scanned like that, and which clauses can be evaluated with only the
> index tuple. The current approach focuses on page visibility, but
> ignores the other aspect entirely. Or more precisely, it disables IOS
> entirely as soon as there's a single condition requiring heap tuple.
>
> I agree it's probably better to see this as a single node with various
> optimizations that can be applied when possible / efficient (based on
> planner and/or dynamically).
>
> I'm not sure I see a direct link to the prefetching patch, but it's true
> that needs to deal with tids (instead of slots), just like IOS. So if
> the node worked with tids, maybe the prefetching could be done at that
> level (which I now realize may be what Andres meant by doing prefetching
> in the executor).

The link to prefetching is that IOS (as a separate node) won't benefit
from prefetching (I think) with your current prefetching patch (in the
case where the VM doesn't allow us to just use the index tuple),
whereas if the nodes were combined that would more naturally be
composable.

> > Apologies for that lengthy preamble; on to the patch under discussion:
> >
> > On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
> >  wrote:
> >>
> >> Hi,
> >>
> >> I took a stab at this and implemented the trick with the VM - during
> >> index scan, we also extract the filters that only need the indexed
> >> attributes (just like in IOS). And then, during the execution we:
> >>
> >>   1) scan the index using the scan keys (as before)
> >>
> >>   2) if the heap page is all-visible, we check the new filters that can
> >>  be evaluated on the index tuple
> >>
> >>   3) fetch the heap tuple and evaluate the filters
> >
> > Thanks for working on this; I'm excited about this class of work
> > (along with index prefetching and other ideas I think there's a lot of
> > potential for improving index scans).
> >
> >> This is pretty much exactly the same thing we do for IOS, so I don't see
> >> why this would be incorrect while IOS is correct.
> >>
> >> This also adds "Index Filter" to explain output, to show which filters
> >> are executed on the index tuple (at the moment the filters are a subset
> >> of "Filter"), so if the index tuple matches we'll execute them again on
> >> the heap tuple. I guess that could be fixed by having two "filter"
> >> lists, depending on whether we were able to evaluate the index filters.
> >
> > Given that we show index filters and heap filters separately it seems
> > like we might want to maintain separate instrumentation counts of how
> > many tuples were filtered by each set of filters.
> >
>
> Yeah, separate instrumentation counters would be useful. What I was
> talking about was more about the conditions themselves, because right now we
> re-evaluate the index-only clauses on the heap tuple.
>
> Imagine an index on t(a) and a query th

Opportunistically pruning page before update

2023-06-21 Thread James Coleman
Hello,

While at PGCon I was chatting with Andres (and I think Peter G. and a
few others who I can't remember at the moment, apologies) and Andres
noted that while we opportunistically prune a page when inserting a
tuple (before deciding we need a new page) we don't do the same for
updates.

Attached is a patch series to do the following:

0001: Make it possible to call heap_page_prune_opt already holding an
exclusive lock on the buffer.
0002: Opportunistically prune pages on update when the current tuple's
page has no free space. If this frees up enough space, then we
continue to put the new tuple on that page; if not, then we take the
existing code path and get a new page.

One would plausibly expect the following improvements:
- Reduced table bloat
- Increased HOT update rate
- Improved performance on updates

I started to work on benchmarking this, but haven't had time to devote
properly to that, so I'm wondering if there's anyone who might be
interested in collaborating on that part.
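
In case it helps anyone who picks this up, here's the rough shape of
the micro-benchmark I have in mind (the table shape and fillfactor are
arbitrary):

create table upd_bench (id int primary key, payload text)
  with (fillfactor = 90);
insert into upd_bench
  select g, repeat('x', 500) from generate_series(1, 100000) g;
-- drive repeated updates of payload (e.g. with pgbench), then compare
-- patched vs. unpatched builds on:
select n_tup_upd, n_tup_hot_upd
  from pg_stat_user_tables where relname = 'upd_bench';
select pg_size_pretty(pg_table_size('upd_bench'));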

Other TODOs:
- Audit other callers of RelationSetTargetBlock() to ensure they don't
hold pointers into the page.

Regards,
James Coleman


v1-0001-Allow-getting-lock-before-calling-heap_page_prune.patch
Description: Binary data


v1-0002-Opportunistically-prune-to-avoid-building-a-new-p.patch
Description: Binary data


Re: Use of additional index columns in rows filtering

2023-06-21 Thread James Coleman
Hello,

I've cc'd Jeff Davis on this due to a conversation we had at PGCon
about applying filters on index tuples during index scans.

I've also cc'd Andres Freund because I think this relates to his
musing in [1] that:
> One thing I have been wondering around this is whether we should not have
> split the code for IOS and plain indexscans...

I think I remember Peter Geoghegan also wondering (I can't remember if
this was in conversation at PGCon about index skip scans or in a
hackers thread) about how we compose these various index scan
optimizations.

To be certain this is probably a thing to tackle as a follow-on to
this patch, but it does seem to me that what we are implicitly
realizing is that (unlike with bitmap scans, I think) it doesn't
really make a lot of conceptual sense to have index only scans be a
separate node from index scans. Instead it's likely better to consider
it an optimization to index scans that can dynamically kick in when
it's able to be of use. That would allow it to compose with e.g.
prefetching in the aforelinked thread. At the very least we would need
pragmatic (e.g., cost of dynamically applying optimizations) rather
than conceptual reasons to argue they should continue to be separate.

Apologies for that lengthy preamble; on to the patch under discussion:

On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
 wrote:
>
> Hi,
>
> I took a stab at this and implemented the trick with the VM - during
> index scan, we also extract the filters that only need the indexed
> attributes (just like in IOS). And then, during the execution we:
>
>   1) scan the index using the scan keys (as before)
>
>   2) if the heap page is all-visible, we check the new filters that can
>  be evaluated on the index tuple
>
>   3) fetch the heap tuple and evaluate the filters

Thanks for working on this; I'm excited about this class of work
(along with index prefetching and other ideas I think there's a lot of
potential for improving index scans).

> This is pretty much exactly the same thing we do for IOS, so I don't see
> why this would be incorrect while IOS is correct.
>
> This also adds "Index Filter" to explain output, to show which filters
> are executed on the index tuple (at the moment the filters are a subset
> of "Filter"), so if the index tuple matches we'll execute them again on
> the heap tuple. I guess that could be fixed by having two "filter"
> lists, depending on whether we were able to evaluate the index filters.

Given that we show index filters and heap filters separately it seems
like we might want to maintain separate instrumentation counts of how
many tuples were filtered by each set of filters.

> Most of the patch is pretty mechanical - particularly the planning part
> is about identifying filters that can be evaluated on the index tuple,
> and that code was mostly shamelessly copied from index-only scan.
>
> The matching of filters to index is done in check_index_filter(), and
> it's simpler than match_clause_to_indexcol() as it does not need to
> consider operators etc. (I think). But maybe it should be careful about
> other things, not sure.

This would end up requiring some refactoring of the existing index
matching code (or alternative caching on IndexOptInfo), but
match_filter_to_index() calling check_index_filter() results in
constructing a bitmapset of index columns for every possible filter,
which seems wasteful (I recognize this is a bit of a proof-of-concept
level v1).

> The actual magic happens in IndexNext (nodeIndexscan.c). As mentioned
> earlier, the idea is to check VM and evaluate the filters on the index
> tuple if possible, similar to index-only scans. Except that we then have
> to fetch the heap tuple. Unfortunately, this means the code can't use
> index_getnext_slot() anymore. Perhaps we should invent a new variant
> that'd allow evaluating the index filters in between.

It does seem there are some refactoring opportunities there.

> With the patch applied, the query plan changes from:
>
> ...
>
> to
>
>  QUERY PLAN
> ---
>  Limit  (cost=0.42..3662.15 rows=1 width=12)
> (actual time=13.663..13.667 rows=0 loops=1)
>Buffers: shared hit=544
>->  Index Scan using t_a_include_b on t
>(cost=0.42..3662.15 rows=1 width=12)
>(actual time=13.659..13.660 rows=0 loops=1)
>  Index Cond: (a > 100)
>  Index Filter: (b = 4)
>  Rows Removed by Index Recheck: 197780
>  Filter: (b = 4)
>  Buffers: shared hit=544
>  Planning Time: 0.105 ms
>  Execution Time: 13.690 ms
> (10 rows)
>
> ...

I did also confirm that this properly identifies

Re: path->param_info only set for lateral?

2023-06-20 Thread James Coleman
On Sun, Jun 18, 2023 at 10:57 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > Over in "Parallelize correlated subqueries that execute within each
> > worker" [1} Richard Guo found a bug in the current version of my patch
> > in that thread. While debugging that issue I've been wondering why
> > Path's param_info field seems to be NULL unless there is a LATERAL
> > reference even though there may be non-lateral outer params
> > referenced.
>
> Per pathnodes.h:
>
>  * "param_info", if not NULL, links to a ParamPathInfo that identifies outer
>  * relation(s) that provide parameter values to each scan of this path.
>  * That means this path can only be joined to those rels by means of nestloop
>  * joins with this path on the inside.  ...
>
> We're only interested in this for params that are coming from other
> relations of the same query level, so that they affect join order and
> join algorithm choices.  Params coming down from outer query levels
> are much like EXTERN params to the planner: they are pseudoconstants
> for any one execution of the current query level.
>
> This isn't just LATERAL stuff; it's also intentionally-generated
> nestloop-with-inner-indexscan-cases.  But it's not outer-level Params.
> Even though those are also PARAM_EXEC Params, they are fundamentally
> different animals for the planner's purposes.

Thanks for the explanation.

I wonder if it'd be worth clarifying the comment slightly to hint in
that direction (like the attached)?

Thanks,
James Coleman


v1-0001-Clarify-param_info-query-level.patch
Description: Binary data


path->param_info only set for lateral?

2023-06-18 Thread James Coleman
Hello,

Over in "Parallelize correlated subqueries that execute within each
worker" [1} Richard Guo found a bug in the current version of my patch
in that thread. While debugging that issue I've been wondering why
Path's param_info field seems to be NULL unless there is a LATERAL
reference even though there may be non-lateral outer params
referenced.

Consider the query:
select * from pg_description t1 where objoid in
(select objoid from pg_description t2 where t2.description =
t1.description);

The subquery's rel has a baserestrictinfo containing an OpExpr
comparing a Var (t2.description) to a Param of type PARAM_EXEC
(t1.description). But the generated SeqScan path doesn't have its
param_info field set, which means PATH_REQ_OUTER returns NULL also
despite there being an obvious param referencing a required outer
relid. Looking at create_seqscan_path we see that param_info is
initialized with:

get_baserel_parampathinfo(root, rel, required_outer)

where required_outer is passed in from set_plain_rel_pathlist as
rel->lateral_relids. And get_baserel_parampathinfo always returns NULL
if required_outer is empty, so obviously with this query (no lateral
reference) we're not going to get any ParamPathInfo added to the path
or the rel.

Is there a reason why we don't track the required relids providing the
PARAM_EXEC params in this case?

Thanks,
James Coleman

1: 
https://www.postgresql.org/message-id/CAMbWs4_evjcMzN8Gw78bHfhfo2FKJThqhEjRJRmoMZx%3DNXcJ7w%40mail.gmail.com




Re: RFC: Logging plan of the running query

2023-06-15 Thread James Coleman
On Thu, Jun 15, 2023 at 9:00 AM torikoshia  wrote:
>
> On 2023-06-15 01:48, James Coleman wrote:
> > On Tue, Jun 13, 2023 at 11:53 AM James Coleman 
> > wrote:
> >>
> >> ...
> >> I'm going to re-run tests with my patch version + resetting the flag
> >> on SIGINT (and any other error condition) to be certain that the issue
> >> you uncovered (where backends get stuck after a SIGINT not responding
> >> to the requested plan logging) wasn't masking any other issues.
> >>
> >> As long as that run is clean also then I believe the patch is safe
> >> as-is even without the re-entrancy guard.
> >>
> >> I'll report back with the results of that testing.
> >
> > The tests have been running since last night, but have apparently been
> > hung for many hours now. I haven't been able to fully look into it,
> > but so far I know the hung (100% CPU) backend last logged this:
> >
> > 2023-06-14 02:00:30.045 UTC client backend[84461]
> > pg_regress/updatable_views LOG:  query plan running on backend with
> > PID 84461 is:
> > Query Text: SELECT table_name, column_name, is_updatable
> >   FROM information_schema.columns
> >  WHERE table_name LIKE E'r_\\_view%'
> >  ORDER BY table_name, ordinal_position;
> >
> > The last output from the regression test harness was:
> >
> > # parallel group (5 tests):  index_including create_view
> > index_including_gist create_index create_index_spgist
> > ok 66+ create_index 36508 ms
> > ok 67+ create_index_spgist 38588 ms
> > ok 68+ create_view  1394 ms
> > ok 69+ index_including   654 ms
> > ok 70+ index_including_gist 1701 ms
> > # parallel group (16 tests):  errors create_cast drop_if_exists
> > create_aggregate roleattributes constraints hash_func typed_table
> > infinite_recurse
> >
> > Attaching gdb to the hung backend shows this:
> >
> > #0  0x5601ab1f9529 in ProcLockWakeup
> > (lockMethodTable=lockMethodTable@entry=0x5601ab6484e0
> > , lock=lock@entry=0x7f5325c913f0) at proc.c:1655
> > #1  0x5601ab1e99dc in CleanUpLock (lock=lock@entry=0x7f5325c913f0,
> > proclock=proclock@entry=0x7f5325d40d60,
> > lockMethodTable=lockMethodTable@entry=0x5601ab6484e0
> > ,
> > hashcode=hashcode@entry=573498161, wakeupNeeded=)
> > at lock.c:1673
> > #2  0x5601ab1e9e21 in LockRefindAndRelease
> > (lockMethodTable=lockMethodTable@entry=0x5601ab6484e0
> > , proc=,
> > locktag=locktag@entry=0x5601ac3d7998, lockmode=lockmode@entry=1,
> >
> > decrement_strong_lock_count=decrement_strong_lock_count@entry=false)
> > at lock.c:3150
> > #3  0x5601ab1edb27 in LockReleaseAll
> > (lockmethodid=lockmethodid@entry=1, allLocks=false) at lock.c:2295
> > #4  0x5601ab1f8599 in ProcReleaseLocks
> > (isCommit=isCommit@entry=true) at proc.c:781
> > #5  0x5601ab37f1f4 in ResourceOwnerReleaseInternal
> > (owner=, phase=phase@entry=RESOURCE_RELEASE_LOCKS,
> > isCommit=isCommit@entry=true, isTopLevel=isTopLevel@entry=true) at
> > resowner.c:618
> > #6  0x5601ab37f7b7 in ResourceOwnerRelease (owner=,
> > phase=phase@entry=RESOURCE_RELEASE_LOCKS,
> > isCommit=isCommit@entry=true, isTopLevel=isTopLevel@entry=true) at
> > resowner.c:494
> > #7  0x5601aaec1d84 in CommitTransaction () at xact.c:2334
> > #8  0x5601aaec2b22 in CommitTransactionCommand () at xact.c:3067
> > #9  0x5601ab200a66 in finish_xact_command () at postgres.c:2783
> > #10 0x5601ab20338f in exec_simple_query (
> > query_string=query_string@entry=0x5601ac3b0858 "SELECT table_name,
> > column_name, is_updatable\n  FROM information_schema.columns\n WHERE
> > table_name LIKE E'r__view%'\n ORDER BY table_name,
> > ordinal_position;") at postgres.c:1300
> >
> > I am unable to connect to the regression test Postgres instance --
> > psql just hangs, so the lock seems to have affected the postmaster
> > also.
> >
> > I'm wondering if this might represent a bug in the current patch.
>
> Thanks for running and analyzing the test!

Sure thing!

> Could you share me how you are running the test?
>
> I imagined something like below, but currently couldn't reproduce it.
> - apply both v26-0001 and v27-0002 and build
> - run PostgreSQL with default GUCs
> - make installcheck-world
> - run 'SELECT pg_log_query_plan(pid) FROM pg_stat_activity \watch 0.1'
> during make installcheck-world

Apologies, I should have attached my updated patch (with the fix for
the bug you'd reported with the re-entrancy guard). Attached is v28
which sets ProcessLogQueryPlanInterruptActive to false in errfinish
when necessary. Once built with those two patches I'm simply running
`make check`.
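
For anyone skimming, a minimal sketch of the guard pattern described
above (the function and variable names follow the patch; the body here
is illustrative rather than the patch itself):

    static bool ProcessLogQueryPlanInterruptActive = false;

    void
    ProcessLogQueryPlanInterrupt(void)
    {
        /* Re-entrancy guard: if a CHECK_FOR_INTERRUPTS() fires while
         * we are already explaining, bail out instead of recursing. */
        if (ProcessLogQueryPlanInterruptActive)
            return;
        ProcessLogQueryPlanInterruptActive = true;

        /* ... run EXPLAIN on the currently active query and LOG it ... */

        ProcessLogQueryPlanInterruptActive = false;
    }

    /* and in errfinish(), so an ERROR thrown mid-explain cannot leave
     * the flag stuck at true: */
    ProcessLogQueryPlanInterruptActive = false;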

Regards,
James Coleman


v28-0001-Add-function-to-log-the-plan-of-the-query.patch
Description: Binary data


v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2023-06-14 Thread James Coleman
On Tue, Jun 13, 2023 at 11:53 AM James Coleman  wrote:
>
> ...
> I'm going to re-run tests with my patch version + resetting the flag
> on SIGINT (and any other error condition) to be certain that the issue
> you uncovered (where backends get stuck after a SIGINT not responding
> to the requested plan logging) wasn't masking any other issues.
>
> As long as that run is clean also then I believe the patch is safe
> as-is even without the re-entrancy guard.
>
> I'll report back with the results of that testing.

The tests have been running since last night, but have apparently been
hung for many hours now. I haven't been able to fully look into it,
but so far I know the hung (100% CPU) backend last logged this:

2023-06-14 02:00:30.045 UTC client backend[84461]
pg_regress/updatable_views LOG:  query plan running on backend with
PID 84461 is:
Query Text: SELECT table_name, column_name, is_updatable
  FROM information_schema.columns
 WHERE table_name LIKE E'r_\\_view%'
 ORDER BY table_name, ordinal_position;

The last output from the regression test harness was:

# parallel group (5 tests):  index_including create_view
index_including_gist create_index create_index_spgist
ok 66+ create_index 36508 ms
ok 67+ create_index_spgist 38588 ms
ok 68+ create_view  1394 ms
ok 69+ index_including   654 ms
ok 70+ index_including_gist 1701 ms
# parallel group (16 tests):  errors create_cast drop_if_exists
create_aggregate roleattributes constraints hash_func typed_table
infinite_recurse

Attaching gdb to the hung backend shows this:

#0  0x5601ab1f9529 in ProcLockWakeup
(lockMethodTable=lockMethodTable@entry=0x5601ab6484e0
, lock=lock@entry=0x7f5325c913f0) at proc.c:1655
#1  0x5601ab1e99dc in CleanUpLock (lock=lock@entry=0x7f5325c913f0,
proclock=proclock@entry=0x7f5325d40d60,
lockMethodTable=lockMethodTable@entry=0x5601ab6484e0
,
hashcode=hashcode@entry=573498161, wakeupNeeded=)
at lock.c:1673
#2  0x5601ab1e9e21 in LockRefindAndRelease
(lockMethodTable=lockMethodTable@entry=0x5601ab6484e0
, proc=,
locktag=locktag@entry=0x5601ac3d7998, lockmode=lockmode@entry=1,
decrement_strong_lock_count=decrement_strong_lock_count@entry=false)
at lock.c:3150
#3  0x5601ab1edb27 in LockReleaseAll
(lockmethodid=lockmethodid@entry=1, allLocks=false) at lock.c:2295
#4  0x5601ab1f8599 in ProcReleaseLocks
(isCommit=isCommit@entry=true) at proc.c:781
#5  0x5601ab37f1f4 in ResourceOwnerReleaseInternal
(owner=, phase=phase@entry=RESOURCE_RELEASE_LOCKS,
isCommit=isCommit@entry=true, isTopLevel=isTopLevel@entry=true) at
resowner.c:618
#6  0x5601ab37f7b7 in ResourceOwnerRelease (owner=,
phase=phase@entry=RESOURCE_RELEASE_LOCKS,
isCommit=isCommit@entry=true, isTopLevel=isTopLevel@entry=true) at
resowner.c:494
#7  0x5601aaec1d84 in CommitTransaction () at xact.c:2334
#8  0x5601aaec2b22 in CommitTransactionCommand () at xact.c:3067
#9  0x5601ab200a66 in finish_xact_command () at postgres.c:2783
#10 0x5601ab20338f in exec_simple_query (
query_string=query_string@entry=0x5601ac3b0858 "SELECT table_name,
column_name, is_updatable\n  FROM information_schema.columns\n WHERE
table_name LIKE E'r__view%'\n ORDER BY table_name,
ordinal_position;") at postgres.c:1300

I am unable to connect to the regression test Postgres instance --
psql just hangs, so the lock seems to have affected the postmaster
also.

I'm wondering if this might represent a bug in the current patch.

Regards,
James Coleman




Re: RFC: Logging plan of the running query

2023-06-13 Thread James Coleman
On Tue, Jun 13, 2023 at 11:22 AM torikoshia  wrote:
>
> On 2023-06-13 00:52, James Coleman wrote:
> >>
> >> > I've attached v27. The important change here in 0001 is that it
> >> > guarantees the interrupt handler is re-entrant, since that was a bug
> >> > exposed by my testing. I've also included 0002 which is only meant for
> >> > testing (it attempts to log in the plan in every
> >> > CHECK_FOR_INTERRUPTS() call).
> >>
> >> When SIGINT is sent during ProcessLogQueryPlanInterrupt(),
> >> ProcessLogQueryPlanInterruptActive can remain true.
> >> After that, pg_log_query_plan() does nothing but just returns.
> >>
> >> AFAIU, v26 logs plan for each pg_log_query_plan() even when another
> >> pg_log_query_plan() is running, but it doesn't cause errors or
> >> critical
> >> problem.
> >>
> >> Considering the problem solved and introduced by v27, I think v26
> >> might
> >> be fine.
> >> How do you think?
> >
> > The testing I did with calling this during every CFI is what uncovered
> > the re-entrancy problem. IIRC (without running that test again) the
> > problem was a stack overflow. Now, to be sure this is a particularly
> > degenerate case because in real-world usage it'd be impossible in
> > practice, I think, to trigger that many calls to this function (and by
> > extension the interrupt handler).
>
> Yeah. In addition, currently only superusers are allowed to execute
> pg_log_query_plan(), I think we don't need to think about cases
> where users are malicious.
>
> > If SIGINT is the only concern we could reset
> > ProcessLogQueryPlanInterruptActive in error handling code. I admit
> > that part of my thought process here is thinking ahead to an
> > additional patch I'd like to see on top of this, which is logging a
> > query plan before cleaning up when a statement timeout occurs.
>
> I remember this is what you wanted to do. [1]
>
> > The
> > re-entrancy issue becomes more interesting then, I think, since we
> > would then have automated calling of the logging code. BTW: I'd
> > thought that would make a nice follow-up patch for this, but if you'd
> > prefer I could add it as another patch in the series here.
> >
> > What do you think about resetting the flag versus just not having it?
>
> If I understand you correctly, adding the flag is not necessary for this
> proposal.
> To keep the patch simple, I prefer not having it.
>

I'm going to re-run tests with my patch version + resetting the flag
on SIGINT (and any other error condition) to be certain that the issue
you uncovered (where backends get stuck after a SIGINT not responding
to the requested plan logging) wasn't masking any other issues.

As long as that run is clean also then I believe the patch is safe
as-is even without the re-entrancy guard.

I'll report back with the results of that testing.

Regards,
James Coleman




Re: RFC: Logging plan of the running query

2023-06-12 Thread James Coleman
On Sun, Jun 11, 2023 at 11:07 PM torikoshia  wrote:
>
> On 2023-06-06 03:26, James Coleman wrote:
> > On Mon, Jun 5, 2023 at 4:30 AM torikoshia 
> > wrote:
> >>
> >> On 2023-06-03 02:51, James Coleman wrote:
> >> > Hello,
> >> >
> >> > Thanks for working on this patch!
> >
> > Sure thing! I'm *very interested* in seeing this available, and I
> > think it paves the way for some additional features later on...
> >
> >> > On Thu, Dec 8, 2022 at 12:10 AM torikoshia 
> >> ...
> >> > To put it positively: we believe that, for example, catalog accesses
> >> > inside CHECK_FOR_INTERRUPTS() -- assuming that the CFI call is inside
> >> > an existing valid transaction/query state, as it would be for this
> >> > patch -- are safe. If there were problems, then those problems are
> >> > likely bugs we already have in other CFI cases.
> >>
> >> Thanks a lot for the discussion and sharing it!
> >> I really appreciate it.
> >>
> >> BTW I'm not sure whether all the CFIs are called in a valid transaction;
> >> do you think we should check each of them?
> >
> > I kicked off the regression tests with a call to
> > ProcessLogQueryPlanInterrupt() in every single CHECK_FOR_INTERRUPTS()
> > call. Several hours and 52 GB of logs later I have confirmed that
> > (with the attached revision) at the very least the regression test
> > suite can't trigger any kind of failures regardless of when we trigger
> > this. The existing code in the patch handles that by only running the
> > explain when there's an active query.
>
> Thanks for the testing!
>
> > I've attached v27. The important change here in 0001 is that it
> > guarantees the interrupt handler is re-entrant, since that was a bug
> > exposed by my testing. I've also included 0002 which is only meant for
> > testing (it attempts to log in the plan in every
> > CHECK_FOR_INTERRUPTS() call).
>
> When SIGINT is sent during ProcessLogQueryPlanInterrupt(),
> ProcessLogQueryPlanInterruptActive can remain true.
> After that, pg_log_query_plan() does nothing but just returns.
>
> AFAIU, v26 logs plan for each pg_log_query_plan() even when another
> pg_log_query_plan() is running, but it doesn't cause errors or critical
> problem.
>
> Considering the problem solved and introduced by v27, I think v26 might
> be fine.
> How do you think?

The testing I did with calling this during every CFI is what uncovered
the re-entrancy problem. IIRC (without running that test again) the
problem was a stack overflow. Now, to be sure this is a particularly
degenerate case because in real-world usage it'd be impossible in
practice, I think, to trigger that many calls to this function (and by
extension the interrupt handler).

If SIGINT is the only concern we could reset
ProcessLogQueryPlanInterruptActive in error handling code. I admit
that part of my thought process here is thinking ahead to an
additional patch I'd like to see on top of this, which is logging a
query plan before cleaning up when a statement timeout occurs. The
re-entrancy issue becomes more interesting then, I think, since we
would then have automated calling of the logging code. BTW: I'd
thought that would make a nice follow-up patch for this, but if you'd
prefer I could add it as another patch in the series here.

What do you think about resetting the flag versus just not having it?

Regards,
James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2023-06-11 Thread James Coleman
On Tue, Jun 6, 2023 at 4:36 AM Richard Guo  wrote:
>
>
> On Mon, Jan 23, 2023 at 10:00 PM James Coleman  wrote:
>>
>> With this patch we do in fact now see (as expected) rels with
>> non-empty lateral_relids showing up in generate_[useful_]gather_paths.
>> And the partial paths can now have non-empty required outer rels. I'm
>> not able to come up with a plan that would actually be caught by those
>> checks; I theorize that because of the few places we actually call
>> generate_[useful_]gather_paths we are in practice already excluding
>> those, but for now I've left these as a conditional rather than an
>> assertion because it seems like the kind of guard we'd want to ensure
>> those methods are safe.
>
>
> I'm trying to understand this part.  AFAICS we will not create partial
> paths for a rel, base or join, if it has lateral references.  So it
> seems to me that in generate_[useful_]gather_paths after we've checked
> that there are partial paths, the checks for lateral_relids are not
> necessary because lateral_relids should always be empty in this case.
> Maybe I'm missing something.

At first I was thinking "isn't the point of the patch to generate
partial paths for rels with lateral references" given what I'd written
back in January, but I added "Assert(bms_is_empty(required_outer));"
to both of those functions and the assertion never fails running the
tests (including my newly parallelizable queries). I'm almost positive
I'd checked this back in January (not only had I explicitly written
that I'd confirmed we had non-empty lateral_relids there, but also it
was the entire basis of the alternate approach to the patch), but...I
can't go back to 5 months ago and remember what I'd done.

Ah! Your comment about "after we've checked that there are partial
paths" triggered a thought. I think originally I'd had the
"bms_is_subset(required_outer, rel->relids)" check first in these
functions. And indeed if I run the tests with that the assertion moved
to above the partial paths check, I get failures in
generate_useful_gather_paths specifically. Mystery solved!

> And while trying the v9 patch I came across a crash with the query
> below.
>
> set min_parallel_table_scan_size to 0;
> set parallel_setup_cost to 0;
> set parallel_tuple_cost to 0;
>
> explain (costs off)
> select * from pg_description t1 where objoid in
> (select objoid from pg_description t2 where t2.description = 
> t1.description);
>QUERY PLAN
> 
>  Seq Scan on pg_description t1
>Filter: (SubPlan 1)
>SubPlan 1
>  ->  Gather
>Workers Planned: 2
>->  Parallel Seq Scan on pg_description t2
>  Filter: (description = t1.description)
> (7 rows)
>
> select * from pg_description t1 where objoid in
> (select objoid from pg_description t2 where t2.description = 
> t1.description);
> WARNING:  terminating connection because of crash of another server process
>
> Seems something is wrong when extracting the argument from the Param in
> parallel worker.

With what I'm trying to change I don't think this plan should ever be
generated since it means we'd have to pass a param from the outer seq
scan into the parallel subplan, which we can't do (currently).

I've attached the full backtrace to the email, but as you hinted at
the parallel worker is trying to get the param (in this case
detoasting it), but the param doesn't exist on the worker, so it seg
faults. Looking at this further I think there's an existing test case
that exposes the misplanning here (the one right under the comment
"Parallel Append is not to be used when the subpath depends on the
outer param" in select_parallel.sql), but it doesn't seg fault because
the param is an integer, doesn't need to be detoasted, and therefore
(I think) we skate by (but probably with wrong results depending on
the dataset).

Interestingly this is one of the existing test queries my original
patch approach didn't change, so this gives me something specific to
work with in improving the patch. Thanks for testing this and bringing
this to my attention!

BTW are you by any chance testing on ARM macOS? I reproduced the issue
there, but for some reason I did not reproduce the error (and the plan
wasn't parallelized) when I tested this on linux. Perhaps I missed
setting something up; it seems odd.

> BTW another rebase is needed as it no longer applies to HEAD.

Apologies; I'd rebased, but hadn't updated the thread. See attached
for an updated series (albeit still broken on your test query).

Thanks,
James
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0x0)
  * frame #0: 0x00010449cab0 postgre

Add last_commit_lsn to pg_stat_database

2023-06-09 Thread James Coleman
I've previously noted in "Add last commit LSN to
pg_last_committed_xact()" [1] that it's not possible to monitor how
many bytes of WAL behind a logical replication slot is (computing such
is obviously trivial for physical slots) because the slot doesn't need
to replicate beyond the last commit. In some cases it's possible for
the current WAL location to be far beyond the last commit. A few
examples:

- An idle server with checkpoint_timeout at a lower value (so lots of
empty WAL advance).
- Very large transactions: particularly insidious because committing a
1 GB transaction after a small transaction may show almost zero time
lag even though quite a bit of data needs to be processed and sent
over the wire (so time to replay is significantly different from
current lag).
- A cluster with multiple databases complicates matters further,
because while physical replication is cluster-wide, the LSNs that
matter for logical replication are database specific.

Since we don't expose the most recent commit's LSN there's no way to
say "the WAL is currently 1250, the last commit happened at 1000, the
slot has flushed up to 800, therefore there are at most 200 bytes
replication needs to read through to catch up."
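
To make the arithmetic concrete, a toy C illustration (the numbers come
from the example above and are not real LSNs; XLogRecPtr is a 64-bit
byte position in the WAL stream):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's typedef */

    int main(void)
    {
        XLogRecPtr current_wal = 1250;  /* pg_current_wal_lsn() */
        XLogRecPtr last_commit = 1000;  /* the proposed last commit LSN */
        XLogRecPtr slot_flush  = 800;   /* the slot's confirmed flush LSN */

        /* upper bound on bytes the slot still has to read through: */
        printf("at most %llu bytes behind\n",
               (unsigned long long) (last_commit - slot_flush));   /* 200 */
        /* the naive current_wal - slot_flush would report 450 and
         * overstate the lag */
        return 0;
    }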

In the aforementioned thread [1] I'd proposed a patch that added a SQL
function pg_last_commit_lsn() to expose the most recent commit's LSN.
Robert Haas didn't think the initial version's modifications to
commit_ts made sense, and a subsequent approach adding the value to
PGPROC didn't have strong objections, from what I can see, but it also
didn't generate any enthusiasm.

As I was thinking about how to improve things, I realized that this
information (since it's for monitoring anyway) fits more naturally
into the stats system. I'd originally thought of exposing it in
pg_stat_wal, but that's per-cluster rather than per-database (indeed,
this is a flaw I hadn't considered in the original patch), so I think
pg_stat_database is the correct location.

I've attached a patch to track the latest commit's LSN in pg_stat_database.

Regards,
James Coleman

1: 
https://www.postgresql.org/message-id/flat/caaaqye9qbiau+j8rbun_jkbre-3hekluhfvvsyfsxqg0vql...@mail.gmail.com


v1-0001-Add-last-commit-s-LSN-to-pg_stat_database.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2023-06-05 Thread James Coleman
On Mon, Jun 5, 2023 at 4:30 AM torikoshia  wrote:
>
> On 2023-06-03 02:51, James Coleman wrote:
> > Hello,
> >
> > Thanks for working on this patch!

Sure thing! I'm *very interested* in seeing this available, and I
think it paves the way for some additional features later on...

> > On Thu, Dec 8, 2022 at 12:10 AM torikoshia 
> ...
> > To put it positively: we believe that, for example, catalog accesses
> > inside CHECK_FOR_INTERRUPTS() -- assuming that the CFI call is inside
> > an existing valid transaction/query state, as it would be for this
> > patch -- are safe. If there were problems, then those problems are
> > likely bugs we already have in other CFI cases.
>
> Thanks a lot for the discussion and sharing it!
> I really appreciate it.
>
> BTW I'm not sure whether all the CFIs are called in a valid transaction;
> do you think we should check each of them?

I kicked off the regression tests with a call to
ProcessLogQueryPlanInterrupt() in every single CHECK_FOR_INTERRUPTS()
call. Several hours and 52 GB of logs later I have confirmed that
(with the attached revision) at the very least the regression test
suite can't trigger any kind of failures regardless of when we trigger
this. The existing code in the patch handles that by only running the
explain when there's an active query.

> > Another concern Robert raised may apply here: what if a person tries
> > to cancel a query when we're explaining? I believe that's a reasonable
> > question to ask, but I believe it's a trade-off that's worth making
> > for the (significant) introspection benefits this patch would provide.
>
> Is the concern here limited to the case where explain code goes crazy
> as Robert pointed out?
> If so, this may be a trade-off worth doing.
> I am a little concerned about whether there will be cases where the
> explain code is not problematic but just takes a long time.

The explain code could take a long time in some rare cases (e.g., we
discovered a bug a few years back with the planning code that actually
descends an index to find the max value), but I think the trade-off is
worth it.

> > On to the patch itself: overall I think it looks like it's in pretty
> > good shape. I also noticed we don't have any tests (I assume it'd have
> > to be TAP tests) of the actual output happening, and I think it would
> > be worth adding that.
>
> Checking the log output may be better, but I didn't add it since there
> is only a little content that can be checked, and the similar function
> pg_log_backend_memory_contexts() does not have such tests.

Fair enough. I still think that would be an improvement here, but
others could also weigh in.

> > Are you interested in re-opening this patch? I'd be happy to provide
> > further review and help to try to push this along.
> Sure, I'm going to re-open this.

I've attached v27. The important change here in 0001 is that it
guarantees the interrupt handler is re-entrant, since that was a bug
exposed by my testing. I've also included 0002 which is only meant for
testing (it attempts to log in the plan in every
CHECK_FOR_INTERRUPTS() call).

Regards,
James


v27-0001-Add-function-to-log-the-plan-of-the-query.patch
Description: Binary data


v27-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2023-06-02 Thread James Coleman
Hello,

Thanks for working on this patch!

On Thu, Dec 8, 2022 at 12:10 AM torikoshia  wrote:
>
> BTW, since this patch depends on ProcessInterrupts() and EXPLAIN code
> which is used in auto_explain, I feel that the following discussion
> also applies to this patch.
>
> > --
> > https://www.postgresql.org/message-id/CA%2BTgmoYW_rSOW4JMQ9_0Df9PKQ%3DsQDOKUGA4Gc9D8w4wui8fSA%40mail.gmail.com
> >
> > explaining a query is a pretty
> > complicated operation that involves catalog lookups and lots of
> > complicated stuff, and I don't think that it would be safe to do all
> > of that at any arbitrary point in the code where ProcessInterrupts()
> > happened to fire.
>
> If I can't come up with some workaround during the next Commitfest, I'm
> going to withdraw this proposal.

While at PGCon this week I'd brought up this idea with a few people
without realizing you'd already worked on it previously, and then
after I discovered this thread several of us (Greg, Ronan, David,
Heikki, and myself -- all cc'd) discussed the safety concerns over
dinner last night.

Our conclusion was that all of the concerns we could come up with (for
example, walking through the code for ExplainTargetRel() and discussing
the relevant catalog and syscache accesses) all applied specifically
to Robert's concerns about running explain inside an aborted
transaction. After all, I'd originally started that thread to ask
about running auto-explain after a query timeout.

To put it positively: we believe that, for example, catalog accesses
inside CHECK_FOR_INTERRUPTS() -- assuming that the CFI call is inside
an existing valid transaction/query state, as it would be for this
patch -- are safe. If there were problems, then those problems are
likely bugs we already have in other CFI cases.

Another concern Robert raised may apply here: what if a person tries
to cancel a query when we're explaining? I believe that's a reasonable
question to ask, but I believe it's a trade-off that's worth making
for the (significant) introspection benefits this patch would provide.

On to the patch itself: overall I think it looks like it's in pretty
good shape. I also noticed we don't have any tests (I assume it'd have
to be TAP tests) of the actual output happening, and I think it would
be worth adding that.

Are you interested in re-opening this patch? I'd be happy to provide
further review and help to try to push this along.

I've rebased the patch and attached as v26.

Thanks,
James Coleman


v26-0001-Add-function-to-log-the-plan-of-the-query.patch
Description: Binary data


Re: An inefficient query caused by unnecessary PlaceHolderVar

2023-06-01 Thread James Coleman
On Wed, May 31, 2023 at 10:30 PM Richard Guo  wrote:
>
>
> On Wed, May 31, 2023 at 1:27 AM James Coleman  wrote:
>>
>> This looks good to me.
>
>
> Thanks for the review!

Sure thing!

>>
>> A few small tweaks suggested to comment wording:
>>
>> +-- lateral reference for simple Var can escape PlaceHolderVar if the
>> +-- referenced rel is under the same lowest nulling outer join
>> +explain (verbose, costs off)
>>
>> I think this is clearer: "lateral references to simple Vars do not
>> need a PlaceHolderVar when the referenced rel is part of the same
>> lowest nulling outer join"?
>
>
> Thanks for the suggestion!  How about we go with "lateral references to
> simple Vars do not need a PlaceHolderVar when the referenced rel is
> under the same lowest nulling outer join"?  This seems a little more
> consistent with the comment in prepjointree.c.

That sounds good to me.

>>
>>   * lateral references to something outside the subquery 
>> being
>> - * pulled up.  (Even then, we could omit the PlaceHolderVar 
>> if
>> - * the referenced rel is under the same lowest outer join, 
>> but
>> - * it doesn't seem worth the trouble to check that.)
>> + * pulled up.  Even then, we could omit the PlaceHolderVar 
>> if
>> + * the referenced rel is under the same lowest nulling outer
>> + * join.
>>
>> I think this is clearer: "references something outside the subquery
>> being pulled up and is not under the same lowest outer join."
>
>
> Agreed.  Will use this one.
>
>>
>> One other thing: it would be helpful to have the test query output be
>> stable between HEAD and this patch; perhaps add:
>>
>> order by 1, 2, 3, 4, 5, 6, 7
>>
>> to ensure stability?
>
>
> Thanks for the suggestion!  I wondered about that too but I'm a bit
> confused about whether we should add ORDER BY in test cases.  I checked
> 'sql/join.sql' and found that some queries are using ORDER BY but some
> are not.  Not sure what the criteria are.

I think it's just "is this helpful in this test". Obviously we don't
need it for correctness of this particular check, but as long as the
plan change still occurs as desired (i.e., the ORDER BY doesn't change
the plan from what you're testing) I think it's fine to consider it
author's choice.

James




Re: An inefficient query caused by unnecessary PlaceHolderVar

2023-05-30 Thread James Coleman
On Fri, May 12, 2023 at 2:35 AM Richard Guo  wrote:
>
> I happened to notice that the query below can be inefficient.
>
> # explain (costs off)
> select * from
>   int8_tbl a left join
>   (int8_tbl b inner join
>lateral (select *, b.q2 as x from int8_tbl c) ss on b.q2 = ss.q1)
>   on a.q1 = b.q1;
>  QUERY PLAN
> 
>  Hash Right Join
>Hash Cond: (b.q1 = a.q1)
>->  Nested Loop
>  ->  Seq Scan on int8_tbl b
>  ->  Seq Scan on int8_tbl c
>Filter: (b.q2 = q1)
>->  Hash
>  ->  Seq Scan on int8_tbl a
> (8 rows)
>
> For B/C join, currently we only have one option, i.e., nestloop with
> parameterized inner path.  This could be extremely inefficient in some
> cases, such as when C does not have any indexes, or when B is very
> large.  I believe the B/C join can actually be performed with hashjoin
> or mergejoin here, as it is an inner join.
>
> This happens because when we pull up the lateral subquery, we notice
> that Var 'x' is a lateral reference to 'b.q2' which is outside the
> subquery.  So we wrap it in a PlaceHolderVar.  This is necessary for
> correctness if it is a lateral reference from nullable side to
> non-nullable item.  But here in this case, the referenced item is also
> nullable, so actually we can omit the PlaceHolderVar with no harm.  The
> comment in pullup_replace_vars_callback() also explains this point.
>
>* (Even then, we could omit the PlaceHolderVar if
>* the referenced rel is under the same lowest outer join, but
>* it doesn't seem worth the trouble to check that.)

It's nice that someone already thought about this and left us this comment :)

> All such PHVs would imply lateral dependencies which would make us have
> no choice but nestloop.  I think we should avoid such PHVs as much as
> possible.  So IMO it may 'worth the trouble to check that'.
>
> Attached is a patch to check that for simple Vars.  Maybe we can extend
> it to avoid PHVs for more complex expressions, but that requires some
> codes because for now we always wrap non-var expressions to PHVs in
> order to have a place to insert nulling bitmap.  As a first step, let's
> do it for simple Vars only.
>
> Any thoughts?

This looks good to me.

A few small tweaks suggested to comment wording:

+-- lateral reference for simple Var can escape PlaceHolderVar if the
+-- referenced rel is under the same lowest nulling outer join
+explain (verbose, costs off)

I think this is clearer: "lateral references to simple Vars do not
need a PlaceHolderVar when the referenced rel is part of the same
lowest nulling outer join"?

  * lateral references to something outside the subquery being
- * pulled up.  (Even then, we could omit the PlaceHolderVar if
- * the referenced rel is under the same lowest outer join, but
- * it doesn't seem worth the trouble to check that.)
+ * pulled up.  Even then, we could omit the PlaceHolderVar if
+ * the referenced rel is under the same lowest nulling outer
+ * join.

I think this is clearer: "references something outside the subquery
being pulled up and is not under the same lowest outer join."

One other thing: it would be helpful to have the test query output be
stable between HEAD and this patch; perhaps add:

order by 1, 2, 3, 4, 5, 6, 7

to ensure stability?

Thanks,
James Coleman




Re: pg_rewind: warn when checkpoint hasn't happened after promotion

2023-02-28 Thread James Coleman
On Mon, Feb 27, 2023 at 2:33 AM Heikki Linnakangas  wrote:
>
> On 16/11/2022 07:17, kuroda.keis...@nttcom.co.jp wrote:
> >> The issue here is pg_rewind looks into the control file to determine the
> >> source timeline, because the control file is not updated until the
> >> first checkpoint ends after promotion finishes, even though file
> >> blocks are already diverged.
> >>
> >> Even in that case history file for the new timeline is already
> >> created, so searching for the latest history file works.
> >
> > I think this change is a good one because if I want
> > pg_rewind to run automatically after a promotion,
> > I don't have to wait for the checkpoint to complete.
> >
> > The attached patch is Horiguchi-san's patch with
> > additional tests. The tests are based on James's tests,
> > "010_no_checkpoint_after_promotion.pl" tests that
> > pg_rewind is successfully executed without running
> > checkpoint after promote.
>
> I fixed this last week in commit 009746, see thread [1]. I'm sorry I
> didn't notice this thread earlier.
>
> I didn't realize that we had a notice about this in the docs. I'll go
> and remove that. Thanks!
>
> - Heikki
>

Thanks; I think the missing [1] (for reference) is:
https://www.postgresql.org/message-id/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi

James




Re: Parallelize correlated subqueries that execute within each worker

2023-02-08 Thread James Coleman
On Mon, Feb 6, 2023 at 11:39 AM Antonin Houska  wrote:
>
> James Coleman  wrote:
> > With this patch we do in fact now see (as expected) rels with
> > non-empty lateral_relids showing up in generate_[useful_]gather_paths.
> > And the partial paths can now have non-empty required outer rels. I'm
> > not able to come up with a plan that would actually be caught by those
> > checks; I theorize that because of the few places we actually call
> > generate_[useful_]gather_paths we are in practice already excluding
> > those, but for now I've left these as a conditional rather than an
> > assertion because it seems like the kind of guard we'd want to ensure
> > those methods are safe.
>
> Maybe we can later (in separate patches) relax the restrictions imposed on
> partial path creation a little bit, so that more parameterized partial paths
> are created.
>
> One particular case that should be rejected by your checks is a partial index
> path, which can be parameterized, but I couldn't construct a query that makes
> your checks fire. Maybe the reason is that a parameterized index path is
> mostly used on the inner side of a NL join, however no partial path can be
> used there. (The problem is that each worker evaluating the NL join would only
> see a subset of the inner relation, which would lead to incorrect results.)
>
> So I'd also choose conditions rather than assert statements.

Thanks for confirming.

>
> Following are my (minor) findings:
>
> In generate_gather_paths() you added this test
>
> /*
>  * Delay gather path creation until the level in the join tree where 
> all
>  * params used in a worker are generated within that worker.
>  */
> if (!bms_is_subset(required_outer, rel->relids))
> return;
>
> but I'm not sure if required_outer can contain anything of rel->relids. How
> about using bms_is_empty(required_outer), or even this?
>
> if (required_outer)
> return;
>
> Similarly,
>
> /* We can't pass params to workers. */
> if (!bms_is_subset(PATH_REQ_OUTER(cheapest_partial_path), 
> rel->relids))
>
> might look like
>
> if (!bms_is_empty(PATH_REQ_OUTER(cheapest_partial_path)))
>
> or
>
> if (PATH_REQ_OUTER(cheapest_partial_path))

I'm not sure about this change. Deciding is difficult given the fact
that we don't seem to currently generate these paths, but I don't see
a reason why lateral_relids can't be present on the rel, and if so,
then we need to check to see if they're a subset of relids we can
satisfy rather than checking that they don't exist.
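
To make the distinction concrete, the two shapes under discussion (a
sketch; rel->relids here is the relation set covered by the path being
built):

    /* suggested: reject any parameterization outright */
    if (!bms_is_empty(required_outer))
        return;

    /* patch's check: allow params as long as every rel providing them
     * is inside the subtree this path covers */
    if (!bms_is_subset(required_outer, rel->relids))
        return;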

> In particular, build_index_paths() does the following when setting
> outer_relids (which eventually becomes path->param_info->ppi_req_outer):
>
> /* Enforce convention that outer_relids is exactly NULL if empty */
> if (bms_is_empty(outer_relids))
> outer_relids = NULL;
>
>
> Another question is whether in this call
>
> simple_gather_path = (Path *)
> create_gather_path(root, rel, cheapest_partial_path, 
> rel->reltarget,
>required_outer, rowsp);
>
> required_outer should be passed to create_gather_path(). Shouldn't it rather
> be PATH_REQ_OUTER(cheapest_partial_path) that you test just above? Again,
> build_index_paths() initializes outer_relids this way
>
> outer_relids = bms_copy(rel->lateral_relids);
>
> but then it may add some more relations:
>
> /* OK to include this clause */
> index_clauses = lappend(index_clauses, iclause);
> outer_relids = bms_add_members(outer_relids,
>
> rinfo->clause_relids);
>
> So I think that PATH_REQ_OUTER(cheapest_partial_path) in
> generate_gather_paths() can eventually contain more relations than
> required_outer, and therefore it's safer to check the first.

Yes, this is a good catch. Originally I didn't know about
PATH_REQ_OUTER, and I'd missed using it in these places.

>
> Similar comments might apply to generate_useful_gather_paths(). Here I also
> suggest to move this test
>
> /* We can't pass params to workers. */
> if (!bms_is_subset(PATH_REQ_OUTER(subpath), rel->relids))
> continue;
>
> to the top of the loop because it's relatively cheap.

Moved.

Attached is v9.

James Coleman


v9-0001-Add-tests-before-change.patch
Description: Binary data


v9-0002-Parallelize-correlated-subqueries.patch
Description: Binary data


Re: Fix incorrect comment reference

2023-01-23 Thread James Coleman
On Mon, Jan 23, 2023 at 4:07 PM James Coleman  wrote:
>
> On Mon, Jan 23, 2023 at 3:41 PM Robert Haas  wrote:
> >
> > On Mon, Jan 23, 2023 at 3:19 PM James Coleman  wrote:
> > > On Mon, Jan 23, 2023 at 1:26 PM Robert Haas  wrote:
> > > > On Mon, Jan 23, 2023 at 8:31 AM James Coleman  wrote:
> > > > > See the attached for a simple comment fix -- the referenced
> > > > > generate_useful_gather_paths call isn't in grouping_planner; it's in
> > > > > apply_scanjoin_target_to_paths.
> > > >
> > > > The intended reading of the comment is not clear. Is it telling you to
> > > > look at grouping_planner because that's where we
> > > > generate_useful_gather_paths, or is it telling you to look there to
> > > > see how we get the final target list together? If it's the former,
> > > > then your fix is correct. If the latter, it's fine as it is.
> > > >
> > > > The real answer is probably that some years ago both things happened
> > > > in that function. We've moved on from there, but I'm still not sure
> > > > what the most useful phrasing of the comment is.
> > >
> > > Yeah, almost certainly, and the comments just didn't keep up.
> > >
> > > Would you prefer something that notes both that the broader concern is
> > > happening via the grouping_planner() stage but still points to the
> > > proper callsite (so that people don't go looking for it and get confused)?
> >
> > I don't really have a strong view on what the best thing to do is. I
> > was just pointing out that the comment might not be quite so obviously
> > wrong as you were supposing.
>
> "Wrong" is certainly too strong; my apologies.
>
> I'm really just hoping to improve it for future readers to save them
> some of the confusion I had when initially reading it.

Updated patch attached.

Thanks,
James Coleman


v2-0001-Fixup-incorrect-comment.patch
Description: Binary data


Re: Fix incorrect comment reference

2023-01-23 Thread James Coleman
On Mon, Jan 23, 2023 at 3:41 PM Robert Haas  wrote:
>
> On Mon, Jan 23, 2023 at 3:19 PM James Coleman  wrote:
> > On Mon, Jan 23, 2023 at 1:26 PM Robert Haas  wrote:
> > > On Mon, Jan 23, 2023 at 8:31 AM James Coleman  wrote:
> > > > See the attached for a simple comment fix -- the referenced
> > > > generate_useful_gather_paths call isn't in grouping_planner; it's in
> > > > apply_scanjoin_target_to_paths.
> > >
> > > The intended reading of the comment is not clear. Is it telling you to
> > > look at grouping_planner because that's where we
> > > generate_useful_gather_paths, or is it telling you to look there to
> > > see how we get the final target list together? If it's the former,
> > > then your fix is correct. If the latter, it's fine as it is.
> > >
> > > The real answer is probably that some years ago both things happened
> > > in that function. We've moved on from there, but I'm still not sure
> > > what the most useful phrasing of the comment is.
> >
> > Yeah, almost certainly, and the comments just didn't keep up.
> >
> > Would you prefer something that notes both that the broader concern is
> > happening via the grouping_planner() stage but still points to the
> > proper callsite (so that people don't go looking for it and get confused)?
>
> I don't really have a strong view on what the best thing to do is. I
> was just pointing out that the comment might not be quite so obviously
> wrong as you were supposing.

"Wrong" is certainly too strong; my apologies.

I'm really just hoping to improve it for future readers to save them
some of the confusion I had when initially reading it.

James Coleman




Re: Fix incorrect comment reference

2023-01-23 Thread James Coleman
On Mon, Jan 23, 2023 at 1:26 PM Robert Haas  wrote:
>
> On Mon, Jan 23, 2023 at 8:31 AM James Coleman  wrote:
> > See the attached for a simple comment fix -- the referenced
> > generate_useful_gather_paths call isn't in grouping_planner; it's in
> > apply_scanjoin_target_to_paths.
>
> The intended reading of the comment is not clear. Is it telling you to
> look at grouping_planner because that's where we
> generate_useful_gather_paths, or is it telling you to look there to
> see how we get the final target list together? If it's the former,
> then your fix is correct. If the latter, it's fine as it is.
>
> The real answer is probably that some years ago both things happened
> in that function. We've moved on from there, but I'm still not sure
> what the most useful phrasing of the comment is.

Yeah, almost certainly, and the comments just didn't keep up.

Would you prefer something that notes both that the broader concern is
happening via the grouping_planner() stage but still points to the
proper callsite (so that people don't go looking for it and get confused)?

James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2023-01-23 Thread James Coleman
On Sat, Jan 21, 2023 at 10:07 PM James Coleman  wrote:
> ...
> While working through Tomas's comment about a conditional in the
> max_parallel_hazard_waker being guaranteed true I realized that in the
> current version of the patch the safe_param_ids tracking in
> is_parallel_safe isn't actually needed any longer. That seemed
> suspicious, and so I started digging, and I found out that in the
> current approach all of the tests pass with only the changes in
> clauses.c. I don't believe that the other changes aren't needed;
> rather I believe there isn't yet a test case exercising them, but I
> realize that means I can't prove they're needed. I spent some time
> poking at this, but at least with my current level of imagination I
> haven't stumbled across a query that would exercise these checks.

I played with this a good bit more yesterday, and I'm now much more
confident this is correct. I've cleaned up the patch; see attached for
v7.

Here's some of my thought process:
The comments in src/include/nodes/pathnodes.h:2953 tell us that
PARAM_EXEC params are used to pass values around from one plan node to
another in the following ways:
1. Values down into subqueries (for outer references in subqueries)
2. Up out of subqueries (for the results of a subplan)
3. From a NestLoop plan node into its inner relation (when the inner
scan is parameterized with values from the outer relation)

Case (2) is already known to be safe (we currently add these params to
safe_param_ids in max_parallel_hazard_walker when we encounter a
SubPlan node).

I also believe case (3) is already handled. We don't build partial
paths for joins when joinrel->lateral_relids is non-empty, and join
order calculations already require that parameterization here go the
correct way (i.e., inner depends on outer rather than the other way
around).

That leaves us with only case (1) to consider in this patch. Another
way of saying this is that this is really the only thing the
safe_param_ids tracking is guarding against. For params passed down
into subqueries we can further distinguish between init plans and
"regular" subplans. We already know that params from init plans are
safe (at the right level). So we're concerned here with a way to know
if the params passed to subplans are safe. We already track required
rels in ParamPathInfo, so it's fairly simple to do this test.
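
For context, the safe_param_ids checks this interacts with live in
max_parallel_hazard_walker; condensed from clauses.c as I recall it, so
details may vary by version:

    else if (IsA(node, Param))
    {
        Param *param = (Param *) node;

        if (param->paramkind == PARAM_EXTERN)
            return false;       /* pseudoconstant for this execution */

        if (param->paramkind != PARAM_EXEC ||
            !list_member_int(context->safe_param_ids, param->paramid))
        {
            /* a PARAM_EXEC param not known to be safe is treated as
             * parallel-restricted */
            if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
                return true;
        }
        return false;
    }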

With this patch we do in fact now see (as expected) rels with
non-empty lateral_relids showing up in generate_[useful_]gather_paths.
And the partial paths can now have non-empty required outer rels. I'm
not able to come up with a plan that would actually be caught by those
checks; I theorize that because of the few places we actually call
generate_[useful_]gather_paths we are in practice already excluding
those, but for now I've left these as a conditional rather than an
assertion because it seems like the kind of guard we'd want to ensure
those methods are safe.

The only other place that we actually create gather[_merge] paths is
gather_grouping_paths(), and there I've chosen to use assertions,
because the point at which grouping happens in planning suggests to me
that we shouldn't have lateral dependencies at that point. If someone
is concerned about that, I'd be happy to change those to conditionals
also.

James Coleman


v8-0001-Add-tests-before-change.patch
Description: Binary data


v8-0002-Parallelize-correlated-subqueries.patch
Description: Binary data


Fix incorrect comment reference

2023-01-23 Thread James Coleman
Hello,

See the attached for a simple comment fix -- the referenced
generate_useful_gather_paths call isn't in grouping_planner; it's in
apply_scanjoin_target_to_paths.

Thanks,
James Coleman


v1-0001-Fixup-incorrect-comment.patch
Description: Binary data


Re: Parallelize correlated subqueries that execute within each worker

2023-01-21 Thread James Coleman
On Wed, Jan 18, 2023 at 9:34 PM James Coleman  wrote:
>
> On Wed, Jan 18, 2023 at 2:09 PM Tomas Vondra
>  wrote:
> >
> > Hi,
> >
> > This patch hasn't been updated since September, and it got broken by
> > 4a29eabd1d91c5484426bc5836e0a7143b064f5a, which changed the incremental
> > sort stuff a little bit. But the breakage was rather limited, so I took a
> > stab at fixing it - attached is the result, hopefully correct.
>
> Thanks for fixing this up; the changes look correct to me.
>
> > I also added a couple minor comments about stuff I noticed while
> > rebasing and skimming the patch, I kept those in separate commits.
> > There's also a couple pre-existing TODOs.
>
> I started work on some of these, but wasn't able to finish this
> evening, so I don't have an updated series yet.
>
> > James, what's your plan with this patch? Do you intend to work on it for
> > PG16, or are there some issues I missed in the thread?
>
> I'd love to see it get into PG16. I don't have any known issues, but
> reviewing activity has been light. Originally Robert had had some
> concerns about my original approach; I think my updated approach
> resolves those issues, but it'd be good to have that sign-off.
>
> Beyond that I'm mostly looking for review and evaluation of the
> approach I've taken; of note is my description of that in [1].

Here's an updated patch version incorporating feedback from Tomas as
well as some additional comments and tweaks.

While working through Tomas's comment about a conditional in the
max_parallel_hazard_walker being guaranteed true I realized that in the
current version of the patch the safe_param_ids tracking in
is_parallel_safe isn't actually needed any longer. That seemed
suspicious, and so I started digging, and I found out that in the
current approach all of the tests pass with only the changes in
clauses.c. I don't believe that the other changes aren't needed;
rather I believe there isn't yet a test case exercising them, but I
realize that means I can't prove they're needed. I spent some time
poking at this, but at least with my current level of imagination I
haven't stumbled across a query that would exercise these checks. One
of the reasons I'm fairly confident that this is true is that the
original approach (which was significantly more invasive) definitely
required rechecking parallel safety at each level until we reached the
point where the subquery was known to be generated within the current
worker through the safe_param_ids tracking mechanism. Of course it is
possible that that complexity is actually required and this simplified
approach isn't feasible (but I don't have a good reason to suspect
that currently). It's also possible that the restrictions on
subqueries just aren't necessary...but that isn't compelling because
it would require proving that you can never have a query level with
as-yet unsatisfied lateral rels.

Note: All of the existing tests for "you can't parallelize a
correlated subquery" are all simple versions which are not actually
parallel unsafe in theory. I assume they were added to show that the
code excluded that broad case, and there wasn't any finer grain of
detail required since the code simply didn't support making the
decision with that granularity anyway. But that means there weren't
any existing test cases to exercise the granularity I'm now trying to
achieve.

If someone is willing to help out what I'd like help with currently is
finding such a test case (where a gather or gather merge path would
otherwise be created but at the current plan level not all of the
required lateral rels are yet available -- meaning that we can't
perform all of the subqueries within the current worker). In support
of that patch 0004 converts several of the new parallel safety checks
into WARNING messages instead to make it easy to see if a query
happens to encounter any of those checks.

James Coleman


v7-0004-Warning-messages-for-finding-test-cases.patch
Description: Binary data


v7-0001-Add-tests-before-change.patch
Description: Binary data


v7-0003-Possible-additional-checks.patch
Description: Binary data


v7-0002-Parallelize-correlated-subqueries.patch
Description: Binary data


Re: Parallelize correlated subqueries that execute within each worker

2023-01-18 Thread James Coleman
On Wed, Jan 18, 2023 at 2:09 PM Tomas Vondra
 wrote:
>
> Hi,
>
> This patch hasn't been updated since September, and it got broken by
> 4a29eabd1d91c5484426bc5836e0a7143b064f5a, which changed the incremental
> sort stuff a little bit. But the breakage was rather limited, so I took a
> stab at fixing it - attached is the result, hopefully correct.

Thanks for fixing this up; the changes look correct to me.

> I also added a couple minor comments about stuff I noticed while
> rebasing and skimming the patch, I kept those in separate commits.
> There's also a couple pre-existing TODOs.

I started work on some of these, but wasn't able to finish this
evening, so I don't have an updated series yet.

> James, what's your plan with this patch? Do you intend to work on it for
> PG16, or are there some issues I missed in the thread?

I'd love to see it get into PG16. I don't have any known issues, but
reviewing activity has been light. Originally Robert had had some
concerns about my original approach; I think my updated approach
resolves those issues, but it'd be good to have that sign-off.

Beyond that I'm mostly looking for review and evaluation of the
approach I've taken; of note is my description of that in [1].

> One of the queries in in incremental_sort changed plans a little bit:
>
> explain (costs off) select distinct
>   unique1,
>   (select t.unique1 from tenk1 where tenk1.unique1 = t.unique1)
> from tenk1 t, generate_series(1, 1000);
>
> switched from
>
>  Unique  (cost=18582710.41..18747375.21 rows=1 width=8)
>->  Gather Merge  (cost=18582710.41..18697375.21 rows=1000 ...)
>  Workers Planned: 2
>  ->  Sort  (cost=18582710.39..18593127.06 rows=417 ...)
>Sort Key: t.unique1, ((SubPlan 1))
>  ...
>
> to
>
>  Unique  (cost=18582710.41..18614268.91 rows=1 ...)
>->  Gather Merge  (cost=18582710.41..18614168.91 rows=2 ...)
>  Workers Planned: 2
>  ->  Unique  (cost=18582710.39..18613960.39 rows=1 ...)
>->  Sort  (cost=18582710.39..18593127.06 ...)
>  Sort Key: t.unique1, ((SubPlan 1))
>...
>
> which probably makes sense, as the cost estimate decreases a bit.

Off the cuff that seems fine. I'll read it over again when I send the
updated series.

James Coleman

1: 
https://www.postgresql.org/message-id/CAAaqYe8m0DHUWk7gLKb_C4abTD4nMkU26ErE%3Dahow4zNMZbzPQ%40mail.gmail.com




Re: Commit fest 2022-11

2022-11-14 Thread James Coleman
On Mon, Nov 14, 2022 at 7:08 AM Ian Lawrence Barwick  wrote:
>
> On Wed, Nov 9, 2022 at 8:12 AM Justin Pryzby  wrote:

> > If my script is not wrong, these patches add TAP tests, but don't update
> > the requisite ./meson.build file.  It seems like it'd be reasonable to
> > set them all as WOA until that's done.
> >
> > $ for a in `git branch -a |sort |grep commitfest/40`; do : echo "$a..."; 
> > x=`git log -1 --compact-summary "$a"`; echo "$x" |grep '/t/.*pl.*new' 
> > >/dev/null || continue; echo "$x" |grep -Fw meson >/dev/null && continue; 
> > git log -1 --oneline "$a"; done
> > ... [CF 40/3558] Allow file inclusion in pg_hba and pg_ident files
> > ... [CF 40/3628] Teach pg_waldump to extract FPIs from the WAL stream
> > ... [CF 40/3646] Skip replicating the tables specified in except table 
> > option
> > ... [CF 40/3663] Switching XLog source from archive to streaming when 
> > primary available
> > ... [CF 40/3670] pg_rewind: warn when checkpoint hasn't happened after 
> > promotion
> > ... [CF 40/3729] Testing autovacuum wraparound
> > ... [CF 40/3877] vacuumlo: add test to vacuumlo for test coverage
> > ... [CF 40/3985] TDE key management patches
>
> Looks like your script is correct, will update accordingly.
>
> Do we have a FAQ/checklist of meson things to consider for patches anywhere?

It's possible this has been discussed before, but it seems less than
ideal to have notifications about moving to WOA be sent only in a bulk
email hanging off the "current CF" email chain as opposed to being
sent to the individual threads. Perhaps that's something the app
should do for us in this situation. Without that, though, the patch
authors are left to wade through unrelated discussion, and, probably
more importantly, the patch discussion thread doesn't show the current
state (I think bumping there is more likely to prompt activity as
well).

James Coleman




Re: cirrus-ci cross-build interactions?

2022-09-26 Thread James Coleman
On Mon, Sep 26, 2022 at 10:48 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-09-26 22:36:24 -0400, James Coleman wrote:
> > I had a build on Cirrus CI fail tonight in what I have to assume was
> > either a problem with caching across builds or some such similar
> > flakiness. In the Debian task [1] I received this error:
> >
> > su postgres -c "make -s -j${BUILD_JOBS} world-bin"
> > In file included from parser.c:25:
> > ./gramparse.h:29:10: fatal error: 'gram.h' file not found
> > #include "gram.h"
> > ^~~~
> > 1 error generated.
> > make[3]: *** [../../../src/Makefile.global:1078: parser.bc] Error 1
> > make[3]: *** Waiting for unfinished jobs
> > make[2]: *** [common.mk:36: parser-recursive] Error 2
> > make[1]: *** [Makefile:42: all-backend-recurse] Error 2
> > make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2
> >
> > There were no changes in the commits I'd made to either parser.c or
> > gramparse.h or gram.h. After running "git commit --amend --no-edit"
> > (with zero changes) to rewrite the commit and forcing pushing the
> > build [2] seems to be fine. I've double-checked there are no
> > differences between the commits on the two builds (git diff shows no
> > output).
> >
> > Is it possible we're missing some kind of necessary build isolation in
> > the Cirrus CI scripting?
>
> Very unlikely - most of the tasks, including debian, use VMs that are thrown
> away after a single use.
>
> The explanation is likely that you're missing
>
> commit 16492df70bb25bc99ca3c340a75ba84ca64171b8
> Author: John Naylor 
> Date:   2022-09-15 10:24:55 +0700
>
> Blind attempt to fix LLVM dependency in the backend
>
> and that the reason you noticed this in one build but not another is purely
> due to scheduling variances.

Yes, as noted in my other reply on this thread, the egg is on my face
-- I hadn't rebased onto the latest commits in a little too long.
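
For anyone else hitting the same failure, a quick way to check whether
a checkout already contains the fix Andres quoted above (a sketch,
using the commit hash from his message):

git merge-base --is-ancestor 16492df70bb25bc99ca3c340a75ba84ca64171b8 HEAD \
    && echo "LLVM dependency fix present" \
    || echo "fix missing -- rebase onto a newer master"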

Thanks for the troubleshooting and relevant fix.

James Coleman




Re: Parallelize correlated subqueries that execute within each worker

2022-09-26 Thread James Coleman
On Mon, Mar 21, 2022 at 8:48 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-01-22 20:25:19 -0500, James Coleman wrote:
> > On the other hand this is a dramatically simpler patch series.
> > Assuming the approach is sound, it should much easier to maintain than
> > the previous version.
> >
> > The final patch in the series is a set of additional checks I could
> > imagine to try to be more explicit, but at least in the current test
> > suite there isn't anything at all they affect.
> >
> > Does this look at least somewhat more like what you'd envisionsed
> > (granting the need to squint hard given the relids checks instead of
> > directly checking params)?
>
> This fails on freebsd (so likely a timing issue): 
> https://cirrus-ci.com/task/4758411492458496?logs=test_world#L2225
>
> Marked as waiting on author.

I've finally gotten around to checking this out, and the issue was an
"explain analyze" test whose actual loop counts differed on FreeBSD.
There doesn't seem to be a way to disable the loops output, so instead
of post-processing the explain output with e.g. a function (as we do
in some other places) to strip the offending and unnecessary fields,
I've simply removed the "analyze" (I don't believe it was actually
necessary).
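
To illustrate why that stabilizes the output (a hypothetical sketch
against the standard regression database, not the actual test from the
patch): with ANALYZE each plan node reports "loops=N", and in a
parallel plan N reflects how rows happened to be distributed across
workers at run time, which can differ between platforms.

psql -X -d regression <<'SQL'
-- Encourage a parallel plan, as the regression tests commonly do.
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
-- Unstable: prints per-node "actual rows=... loops=N" counters, where
-- N depends on run-time scheduling of the parallel workers.
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF) SELECT count(*) FROM tenk1;
-- Stable: prints only the plan shape.
EXPLAIN (COSTS OFF) SELECT count(*) FROM tenk1;
SQL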

Attached is an updated patch series. In this version I've removed the
"parallelize some subqueries with limit" patch, since discussion is
proceeding in the spun-off thread. The first patch adds additional
tests so that you can see how they change with the code changes in the
second patch in the series. As before, the final patch in the series
includes changes where we may also want to verify correctness but
don't have a test demonstrating the need.

Thanks,
James Coleman


v5-0002-Parallelize-correlated-subqueries.patch
Description: Binary data


v5-0003-Possible-additional-checks.patch
Description: Binary data


v5-0001-Add-tests-before-change.patch
Description: Binary data


Re: cirrus-ci cross-build interactions?

2022-09-26 Thread James Coleman
On Mon, Sep 26, 2022 at 10:36 PM James Coleman  wrote:
>
> I had a build on Cirrus CI fail tonight in what I have to assume was
> either a problem with caching across builds or some such similar
> flakiness. In the Debian task [1] I received this error:
>
> su postgres -c "make -s -j${BUILD_JOBS} world-bin"
> In file included from parser.c:25:
> ./gramparse.h:29:10: fatal error: 'gram.h' file not found
> #include "gram.h"
> ^~~~
> 1 error generated.
> make[3]: *** [../../../src/Makefile.global:1078: parser.bc] Error 1
> make[3]: *** Waiting for unfinished jobs
> make[2]: *** [common.mk:36: parser-recursive] Error 2
> make[1]: *** [Makefile:42: all-backend-recurse] Error 2
> make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2
>
> There were no changes in the commits I'd made to either parser.c or
> gramparse.h or gram.h. After running "git commit --amend --no-edit"
> (with zero changes) to rewrite the commit and force-pushing, the
> build [2] seems to be fine. I've double-checked there are no
> differences between the commits on the two builds (git diff shows no
> output).
>
> Is it possible we're missing some kind of necessary build isolation in
> the Cirrus CI scripting?
>
> Thanks,
> James Coleman
>
> 1: https://cirrus-ci.com/task/6141559258218496
> 2: https://cirrus-ci.com/build/6309235720978432

Hmm, it looks like I don't have the commit that came out of this
thread [1] about gram.h issues; perhaps that's the cause.

I'm not sure why it fails sometimes and not others, though I noticed
that on the second build from my original email the Debian task passed
while the compiler-warnings task failed with the same error.

James Coleman

1: 
https://www.postgresql.org/message-id/20220914210427.y26tkagmxo5wwbvp%40awork3.anarazel.de




cirrus-ci cross-build interactions?

2022-09-26 Thread James Coleman
I had a build on Cirrus CI fail tonight in what I have to assume was
either a problem with caching across builds or some such similar
flakiness. In the Debian task [1] I received this error:

su postgres -c "make -s -j${BUILD_JOBS} world-bin"
In file included from parser.c:25:
./gramparse.h:29:10: fatal error: 'gram.h' file not found
#include "gram.h"
^~~~
1 error generated.
make[3]: *** [../../../src/Makefile.global:1078: parser.bc] Error 1
make[3]: *** Waiting for unfinished jobs
make[2]: *** [common.mk:36: parser-recursive] Error 2
make[1]: *** [Makefile:42: all-backend-recurse] Error 2
make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2

There were no changes in the commits I'd made to either parser.c or
gramparse.h or gram.h. After running "git commit --amend --no-edit"
(with zero changes) to rewrite the commit and force-pushing, the
build [2] seems to be fine. I've double-checked there are no
differences between the commits on the two builds (git diff shows no
output).

Is it possible we're missing some kind of necessary build isolation in
the Cirrus CI scripting?

Thanks,
James Coleman

1: https://cirrus-ci.com/task/6141559258218496
2: https://cirrus-ci.com/build/6309235720978432



