its" issue seems like an issue with the
instrumentation itself. Possibly one that is totally unrelated to
everything else we're discussing.
--
Peter Geoghegan
#x27;s split IO handling") fixed the issue, without anyone
>> realizing that the bug in question could manifest like this.
>
> I can't explain that. If you can consistently reproduce the change at
> the two base commits, maybe bisect?
Commit b4212231 was a wild guess on my part. Probably should have refrained
from that.
--
Peter Geoghegan
don't need to depend on heuristic-driven OS
readahead. Maybe that was wrong.
--
Peter Geoghegan
Thomas ("Fix rare bug in
read_stream.c's split IO handling") fixed the issue, without anyone
realizing that the bug in question could manifest like this.
--
Peter Geoghegan
guess we could re-read the internal page
only when prefetching later leaf pages starts to look like a good
idea, but that's another complicated code path to maintain.
--
Peter Geoghegan
y similarly, but somehow
have very different performance). I can't really justify that, but my
gut feeling is that that's the best place to focus our efforts for the
time being.
--
Peter Geoghegan
t might still be interesting to
> think about opportunities to do fuzzier speculative lookahead. I'll
> start a new thread.
That sounds interesting. I worry that we won't ever be able to get
away without some fallback that behaves roughly like OS readahead.
--
Peter Geoghegan
pin-like mechanism to avoid unsafe
concurrent TID recycling hazards).
If, in the end, the only solution that really works for GiST is a more
aggressive/invasive one than we'd prefer, then making those changes
must have been inevitable all along -- even with the old amgettuple
interface. That's why I'm not too worried about GiST ordered scans;
we're not making that problem any harder to solve. It's even possible
that it'll be a bit *easier* to fix the problem with the new batch
interface, since it somewhat normalizes the idea of hanging on to
buffer pins for longer.
--
Peter Geoghegan
ranch is subtly broken, but we can't in good conscience ignore those
problems while making these kinds of changes.
> It doesn't need to be committable, just good enough to be reasonably
> certain it's possible.
That's what I have in mind, too. If we have support for a second index
AM, then we're much less likely to over-optimize for nbtree in a way
that doesn't really make sense.
> Understood, and I agree in principle. It's just that given the fuzziness
> I find it hard how it should look like.
I suspect that index AMs are much more similar for the purposes of
prefetching than they are in other ways.
--
Peter Geoghegan
that's what's needed to make optimal choices.
> 4) More testing to minimize the risk of regressions.
>
> 5) Figuring out how to make this work for IOS (the simple patch has some
> special logic in the callback, which may not be great, not sure what's
> the right solution in the complex patch).
I agree that all these items are probably the biggest risks to the
project. I'm not sure that I can attribute this to the use of the
"complex" approach over the "simple" approach.
> 6)
I guess that this means "unknown unknowns", which are another significant risk.
--
Peter Geoghegan
bute_always_inline" can be used to force the compiler to
inline a function. I'd be very surprised if GCC 10 failed to honor the
underlying "__attribute__((always_inline))" function attribute.
--
Peter Geoghegan
ded batches. In
> principle you get that from v3 by filtering, but it might be slow on
> large indexes. I'll try doing that in v3.
Cool.
--
Peter Geoghegan
ve to the right without doing anything. Note that the
rightmost page cannot be P_IGNORE().
This scheme will always succeed, no matter the nblocks argument,
provided the initial leaf page is a valid leaf page (and provided the
nblocks arg is >= 1).
I get that this is just a prototype that might not go anywhere, but
the scheme I've described requires few changes.
--
Peter Geoghegan
On Wed, Jul 23, 2025 at 9:59 PM Peter Geoghegan wrote:
> Tomas' index-prefetch-simple-master branch:
> │ I/O Timings: shared read=1490.918
> │ Execution Time: 2015.731 ms
> Complex patch (same prewarming/eviction are omitted this time):
> │ I/O Timings:
On Wed, Jul 23, 2025 at 12:36 PM Peter Geoghegan wrote:
> * The TPC-C order line table primary key.
I tested this for myself.
Tomas' index-prefetch-simple-master branch:
set max_parallel_workers_per_gather =0;
SELECT pg_buffercache_evict_relation('order_line');
y memory bloat problems in the common case where the scan
isn't running in an EXPLAIN ANALYZE.
--
Peter Geoghegan
On Tue, Jul 22, 2025 at 9:31 PM Peter Geoghegan wrote:
> I'll get back to you on this soon. There are plenty of indexes that
> are not perfectly correlated (like pgbench_accounts_pkey is) that'll
> nevertheless benefit significantly from the approach taken by the
> complex p
of data sets from "perfectly clean" to "random" and see how the
> patch(es) behave on all of them.
Obviously none of your test cases are invalid -- they're all basically
reasonable, when considered in isolation. But the "linear_1" test is
*far* closer to the &
7;t know
about Tomas, but I've given almost no thought to
INDEX_SCAN_MAX_BATCHES specifically just yet.
--
Peter Geoghegan
On Tue, Jul 22, 2025 at 5:11 PM Andres Freund wrote:
> On 2025-07-18 23:25:38 -0400, Peter Geoghegan wrote:
> > > To some degree the table AM will need to care about the index level
> > > batching -
> > > we have to be careful about how many pages we keep p
which will store
about 370 when the index is in a pristine state.
It does matter, but in the grand scheme of things it's unlikely to be decisive.
--
Peter Geoghegan
302 │ (356,16) │
│ 303 │ (235,10) │
│304 │ (812,18) │
│305 │ (675,1) │
│306 │ (258,13) │
│307 │ (1187,9) │
│308 │ (185,2) │
│309 │ (179,2) │
│310 │ (951,2) │
└┴───┘
(310 rows)
There's actually 55,556 heap blocks in total in the underlying table.
So clearly there is some correlation here. Just not enough to ever
matter very much to prefetching. Again, the sole test case that has
that quality to it is the "linear" test case.
--
Peter Geoghegan
On Tue, Jul 22, 2025 at 1:35 PM Peter Geoghegan wrote:
> What is the difference between cases like "linear / eic=16 / sync" and
> "linear_1 / eic=16 / sync"?
I figured this out for myself.
> One would imagine that these tests are very similar, based on the fact
On Tue, Jul 22, 2025 at 1:35 PM Peter Geoghegan wrote:
> I attach pgbench_accounts_pkey_nhblks.txt, which shows a query that
> (among other things) ouputs "nhblks" for each leaf page from a given
> index (while showing the details of each leaf page in index key space
> or
t
that they have very similar names. But we see very different results
for each: with the former ("linear") test results, the "complex" patch
is 2x-4x faster than the "simple" patch. But, with the latter test
results ("linear_1", and other similar pairs of &qu
into a new layer that will decide when to release each
index page buffer pin (and call _hash_kill_items-like routines) based
on its own criteria.
--
Peter Geoghegan
hat the table AM has to call some indexam function to release
> index-batches, whenever it doesn't need the reference anymore? And the
> index-batch release can then unpin?
It does. But that can be fairly generic -- btfreebatch will probably
end up looking very similar to (say) hashfreebatch and gistfreebatch.
Again, the indexam.c layer actually gets to decide when it happens --
that's what I meant about it being under its control (I didn't mean
that it literally did everything without involving the index AM).
--
Peter Geoghegan
AM the freedom to do its own kind of batch access at the level
of heap pages. We don't necessarily have to figure all that out in the
first committed version, though.
--
Peter Geoghegan
"complex" approach, but as I
> said, I'm sure I can't pull that off on my own. With your help, I think
> the chance of success would be considerably higher.
I can commit to making this project my #1 focus for Postgres 19 (#1
focus by far), provided the "complex" approach is used - just say the
word.
I cannot promise that we will be successful. But I can say for sure
that I'll have skin in the game. If the project fails, then I'll have
failed too.
> Does this clarify how I think about the complex patch?
Yes, it does.
BTW, I don't think that there's all that much left to be said about
nbtree in particular here. I don't think that there's very much work
left there.
--
Peter Geoghegan
On Thu, Jul 17, 2025 at 2:26 PM Peter Geoghegan wrote:
> The loop has an early check for this (for non-itemDead entries) here:
>
> /* Quickly skip over items never ItemDead-set by btgettuple */
> if (!kitem->itemDead)
> continue;
>
> I really do
ch this patch removes) can be in
ascending leaf-page-wise order, descending leaf-page-wise order, or
(with a scrollable cursor) some random mix of the two -- even when
there's no posting list tuples involved.
--
Peter Geoghegan
x" approach. We chatted
briefly on IM, and he seems more optimistic about it than I thought
(in my on-list remarks from earlier). It is definitely his patch, and I don't
want to speak for him.
--
Peter Geoghegan
hat's what heapam expects, and
what amgettuple (which I'd like to replace with amgetbatch) does.
--
Peter Geoghegan
that won't
break either patch/approach.
The index AM is obligated to pass back heap TIDs, without any external
code needing to understand these sorts of implementation details. The
on-disk representation of TIDs remains an implementation detail known
only to index AMs.
--
Peter Geoghegan
My guess is that it's due to the less efficient
memory allocation with batching. Obviously this isn't acceptable, but
I'm not particularly concerned about it right now. I was actually
pleased to see that there wasn't a much larger regression there.
--
Peter Geoghegan
keeps btgettuple in largely its
current form.
I agree that having such a GUC is important during development, and
will try to add it back soon. It'll have to work in some completely
different way, but that still shouldn't be difficult.
--
Peter Geoghegan
that (say) Thomas will be able to add pause-and-resume
to the read stream interface some time soon, at which point the
regressions we see with the "simple" patch (but not the "complex"
patch) go away?
--
Peter Geoghegan
petergeoghegan/postgres/tree/index-prefetch-2025-pg-revisions-v0.11
I think that the version that Tomas must have used is a few days old,
and might be a tiny bit different. But I don't think that that's
likely to matter, especially not if you just want to get the general
idea.
--
Peter Geoghegan
On Wed, Jul 16, 2025 at 1:42 PM Tomas Vondra wrote:
> On 7/16/25 16:45, Peter Geoghegan wrote:
> > I get that index characteristics could be the limiting factor,
> > especially in a world where we're not yet eagerly reading leaf pages.
> > But that in no way justi
magine how it could possibly be relevant.
Thanks for the review
--
Peter Geoghegan
On Wed, Jul 16, 2025 at 11:29 AM Peter Geoghegan wrote:
> For example, with "linear_10 / eic=16 / sync", it looks like "complex"
> has about half the latency of "simple" in tests where selectivity is
> 10. The advantage for "complex" is even gre
t;, it looks like "complex"
has about half the latency of "simple" in tests where selectivity is
10. The advantage for "complex" is even greater at higher
"selectivity" values. All of the other "linear" test results look
about the same.
Have I missed something?
--
Peter Geoghegan
> I think it'll need to do something like that in some cases, when we need
> to limit the number of leaf pages kept in memory to something sane.
That's the only reason? The memory usage for batches?
That doesn't seem like a big deal. It's something to keep an eye on,
but I see no reason why it'd be particularly difficult.
Doesn't this argue for the "complex" patch's approach?
--
Peter Geoghegan
uot; the stream,
> without resetting the distance etc. But we don't have that, and the
> reset thing was suggested to me as a workaround.
Does the "complex" patch require a similar workaround? Why or why not?
--
Peter Geoghegan
t BM_VALID set,
> of course). Thus we won't start another IO for that block.
Even if it was worth optimizing for, it'd probably still be too far
down the list of problems to be worth discussing right now.
--
Peter Geoghegan
sue with no starting the I/O earlier. The fadvise is just
> easier to trace/inspect.
It's not at all surprising that you're seeing duplicate prefetch
requests. I have no reason to believe that it's important to suppress
those ourselves, rather than leaving it up to the OS (though I also
have no reason to believe that the opposite is true).
--
Peter Geoghegan
On Wed, Jul 16, 2025 at 9:36 AM Peter Geoghegan wrote:
> Another issue with the "simple" patch: it adds 2 bool fields to
> "BTScanPosItem". That increases its size considerably. We're very
> sensitive to the size of this struct (I think that you know about this
about this
already). Bloating it like this will blow up our memory usage, since
right now we allocate MaxTIDsPerBTreePage/1358 such structs for
so->currPos (and so->markPos). Wasting all that memory on alignment
padding is probably going to have consequences beyond memory bloat.
--
Peter Geoghegan
seems worth considering.
--
Peter Geoghegan
mpiler.
> I'd be a bit worried about
> creating a back-patching mine-field. But maybe these are all
> in spots we're unlikely to touch?
That seems like much less of a problem for a purely subtractive change
such as this.
--
Peter Geoghegan
compiler quiet"
variable initializations that follow an elog/ereport with elevel >=
ERROR. My guess is that we have several hundred.
--
Peter Geoghegan
elog(ERROR) related warnings/to
suppose that we no longer have to add "keep compiler quiet" variable
initializations after an elog(ERROR). If it works for MSVC +
pg_assume, then it ought to also work for MSVC + pg_unreachable.
Right?
--
Peter Geoghegan
essage. v2 also slightly simplifies the logic that the notnullkey
code block uses to select the key's strategy (which is also the
strategy that'll be used as _bt_first's strat_total later on), since
that seems like it'll make my explanation slightly clearer to anybody
that reads the code
On Tue, Jul 15, 2025 at 2:19 PM Peter Geoghegan wrote:
> * gistkillitems() correctly checks if the page's LSN has changed in
> the period between when we initially read the leaf page and the period
> when/after we accessed the heap. But (unlike nbtree), it fails to
> acc
On Sat, Jul 12, 2025 at 7:50 PM Peter Geoghegan wrote:
> Why wouldn't we want to relieve all AMs of that responsibility?
> Leaving it up to index AMs has resulted in subtle bugs [2][3], and
> AFAICT has no redeeming quality. If affected index AMs were *forced*
> to do *exactly*
al index AMs, and into indexam.c
[1]. My hope is that centralizing this kind of logic in indexam.c will
make these kinds of problems impossible.
[1] https://commitfest.postgresql.org/patch/5721/
[2] https://commitfest.postgresql.org/patch/5542/
[3]
https://postgr.es/m/cah2-wzmvvu2kmkyq8sueyxrc_roa5lfogde1-vdtorpk_m3...@mail.gmail.com
--
Peter Geoghegan
x27;t
find that inappropriate.
> Do we want that as the name of this new extension?
Personally, I think it's fine. We ought to discourage the idea that
this is just "explain, but better".
--
Peter Geoghegan
ng on _bt_advance_array_keys to advance the array keys properly
later on. If you're interested in why we need the remaining hard reset
of the arrays within amrestrpos/btrestrpos, let me know and I'll
explain.)
--
Peter Geoghegan
were *forced*
to do *exactly* the same thing as each other (not just *oblidged* to
do *almost* the same thing), it would make life easier for everybody.
[1] https://www.postgresql.org/docs/current/index-locking.html
[2] https://commitfest.postgresql.org/patch/5721/
[3] https://commitfest.postgresql.org/patch/5542/
--
Peter Geoghegan
on this thread. I also plan on using
injection points to write a simple/serial regression test exercising
the nbtree code that completes an incomplete split (following a hard
crash/error).
Thanks again
--
Peter Geoghegan
_lock_and_validate_left as expected. I don't see any second
WARNING indicating that we've taken _bt_lock_and_validate_left's
unhappy path, though (and the test still fails). This doesn't look
like an issue with the planner.
I attach the relevant regression test output, that sho
n "meson test --suite setup --suite nbtree -q --print-errorlogs"
in a loop 500 times on my Debian workstation without seeing any
failures. Seems stable there. Whereas the FreeBSD target hasn't even
passed once out of more than a dozen attempts. Seems to be reliably
broken on FreeBSD.
; explain what the detach and wait functions actually do, and how and
> > why one might want to call them together.
>
> If you see ways to improve the existing template, please feel free to
> propose something, sure.
I'll need to figure this out for myself first.
--
Peter Geoghegan
7;t seem to
explain what the detach and wait functions actually do, and how and
why one might want to call them together.
--
Peter Geoghegan
all with similar code. That seems borderline unacceptable.
It would be far preferable if I could just use some built-in way of
waiting exactly once, that can be used directly from SQL, through the
injection_points extension. That would allow me to write the isolation
test without having to add an
hich seems worthwhile given that the patch is so
simple.
I'll submit this patch to the next open commitfest for Postgres 19.
--
Peter Geoghegan
v1-0001-nbtree-Use-only-one-notnullkey-ScanKeyData.patch
Description: Binary data
wer-order row compare member keys, at least.
I'll submit this to the next open commitfest for Postgres 19.
--
Peter Geoghegan
v1-0001-Teach-nbtree-to-set-pstate.ikey-beyond-a-row-comp.patch
Description: Binary data
> PS. Not a new thing, but the "Add coverage..." comments in the tests
> seem redundant. All tests add coverage for something.
I'll adjust them along those lines.
Unless there are any objections, and assuming you're done giving
feedback, I'll commit both patches tomorrow.
Thanks!
--
Peter Geoghegan
On Fri, Jun 27, 2025 at 5:35 PM Peter Geoghegan wrote:
> Attached is v4, which is largely just a polished version of v3. It has
> improved comments and more worked out commit messages. Plus the second
> patch (the row compare patch) now teaches _bt_first to fully rely on
> scan key
On Wed, Jun 25, 2025 at 5:44 PM Peter Geoghegan wrote:
> Correction: there *is* a live bug like this on Postgres 18/HEAD, which
> involves simple scalar inequalities with an incomplete opfamily.
> Attached v3 fixes the bug in its new 0001-* patch.
Attached is v4, which is largely just a
exactly one best way to order and
required-mark the keys output by preprocessing -- even with an
incomplete opfamily. (And even with row compares that happen to be
redundant or contradictory, which are similar.)
--
Peter Geoghegan
On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan wrote:
> In general, when we end a primitive index scan, the code that sets
> continuescan=false (any such code, not just _bt_check_rowcompare NULL
> row member code) has to make sure that starting the next primitive
> index scan will ac
On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan wrote:
> Attached patch shows how this could work. It refines the rules around
> NULL row comparison members in _bt_check_rowcompare, and in _bt_first.
> The fundamental idea here is to make _bt_check_rowcompare *consistent*
> with _bt
heck_rowcompare
behaviors from a great distance, which is pretty grotty. Seems worth
fixing sooner rather than later.
--
Peter Geoghegan
v1-0001-Improve-_bt_first-row-compare-NULL-member-handlin.patch
Description: Binary data
On Mon, Jun 16, 2025 at 12:46 PM Peter Geoghegan wrote:
> Making this change in _bt_readpage creates a problem in _bt_killitems:
> it currently expects posting list TIDs in ascending order (the loop
> invariant relies on that), which is why the deduplication patch didn't
> ju
te that the qsort is completely useless during standard forwards
scans. It's hard to tell those apart from scrollable cursor scans that
briefly changed directions "within a page", though, so I'm inclined to
just always do the qsort, rather than trying to cleverly avoid it.
That'
On Fri, May 2, 2025 at 3:04 PM Peter Geoghegan wrote:
> The second patch is more complicated, and seems like something that
> I'll need to spend more time thinking about before proceeding with
> commit. It has subtle behavioral implications, in that it causes the
> pstate
ight now is the complex rules around buffer
pins, particularly with mark/restore, which is generally a very hard
code path to test.
[1]
https://postgr.es/m/CAH2-Wzm3a8i31aBXi6oQt9uG7m1-xpX-MXjMMYoDJH=sbj1...@mail.gmail.com
--
Peter Geoghegan
v1-0001-Stop-conditioning-nbtree-actions-on-pin-being-hel.patch
Description: Binary data
On Tue, Jun 10, 2025 at 3:00 PM Peter Geoghegan wrote:
> I have confirmed that this flavor of the problem has existed for a
> long time. We'll need to backpatch the fix to all supported branches.
> Current plan: Commit 0001 on all supported branches in the next couple
> of da
On Mon, Jun 9, 2025 at 10:47 AM Peter Geoghegan wrote:
> As I said, I think that this is actually an old bug. After all,
> _bt_killitems is required to test so->currPos.lsn against the same
> page's current LSN if the page pin was dropped since _bt_readpage was
> first called
On Mon, Jun 9, 2025 at 10:47 AM Peter Geoghegan wrote:
> I've always thought that the whole idiom of testing
> BTScanPosIsPinned() in a bunch of places was very brittle. I wonder if
> it makes sense to totally replace those calls/tests with similar tests
> of the new so->
This is also why we do things like allocate so->currTuples within
btrescan. We don't yet know if the scan will be an index-only scan
when btbeginscan is called.
--
Peter Geoghegan
've always thought that the whole idiom of testing
BTScanPosIsPinned() in a bunch of places was very brittle. I wonder if
it makes sense to totally replace those calls/tests with similar tests
of the new so->dropPin field.
--
Peter Geoghegan
v1-0001-Fix-issue-with-mark-restore-nbtree-pins.patch
Description: Binary data
On Fri, Jun 6, 2025 at 6:58 PM Masahiko Sawada wrote:
> Agreed. Given the above test results, it's unlikely always sorting the
> array helps speedups.
Did you try specializing the sort? In my experience, it makes a big difference.
--
Peter Geoghegan
On Wed, Jun 4, 2025 at 1:39 PM Peter Geoghegan wrote:
> My current plan is to commit this in the next couple of days.
Pushed.
It would be great if we could also teach BufferGetLSNAtomic() to just
call PageGetLSN() (at least on PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
platforms), but my sense is t
On Wed, Jun 4, 2025 at 10:21 AM Peter Geoghegan wrote:
> On Wed, Jun 4, 2025 at 7:33 AM Tomas Vondra wrote:
> > So better to get this in now, otherwise we may have to wait until PG19,
> > because of ABI (the patch adds a field into BTScanPosData, but maybe
> > it'd
. There's a kind of symmetry to how
things work with the patch in place, which IMV makes things simpler.
--
Peter Geoghegan
On Tue, Jun 3, 2025 at 1:59 PM Peter Geoghegan wrote:
> We do currently end up using OldestXmin-1 as our
> snapshotConflictHorizon when this happens, but as I said in the other
> email, I don't think that that's really required.
Actually, that's not really true, either.
our
snapshotConflictHorizon when this happens, but as I said in the other
email, I don't think that that's really required.
--
Peter Geoghegan
lly planning (originally) to write a patch to try
> and change the horizon to make it older in more cases when it's
> correct. I'm trying to figure out the most straightforward code to
> calculate the combined snapshot conflict horizon for a prune/freeze/vm
> update record.
I figured that that was what this was really about.
--
Peter Geoghegan
the oldest safe value in all cases. How much that actually buys you
seems much less clear.
--
Peter Geoghegan
he same place, it might make
sense to do it just to make things more consistent.
--
Peter Geoghegan
in cases where we still give up and just use
OldestXmin (cases where we just use "OldestXmin - 1", I should say).
I'm not sure how much this matters in practice; I imagine that using
prunestate->visibility_cutoff_xid is very much the common case with
page-level freezing.
--
Peter Geoghegan
On Mon, Jun 2, 2025 at 10:27 AM Melanie Plageman
wrote:
> Attached what I plan to push shortly.
Looks good to me.
--
Peter Geoghegan
o that
it took place next to the initialization of other, similar BlockNumber
fields from LVRelState. IIRC I wrote a comment about this issue at
least in part because I understood the temptation to do that.
> I'll push the fix tomorrow.
Cool.
--
Peter Geoghegan
ning/unfrozen XID value from one of these unscanned heap
pages.
--
Peter Geoghegan
;break" here, it doesn't matter what visibility_cutoff_xid has
been set to. It cannot be used for any purpose.
--
Peter Geoghegan
On Fri, May 30, 2025 at 5:16 PM Melanie Plageman
wrote:
> On Thu, May 29, 2025 at 11:37 AM Peter Geoghegan wrote:
> > How can taking the "Avoids false conflicts when hot_standby_feedback
> > in use" path more often result in fewer unnecessary conflicts on
> > s
idTransactionId;
}
else
{
/* Avoids false conflicts when hot_standby_feedback in use */
snapshotConflictHorizon = vacrel->cutoffs.OldestXmin;
TransactionIdRetreat(snapshotConflictHorizon);
}
How can taking the "Avoids false conflicts when hot_standby_feedback
in use" path m
On Fri, May 23, 2025 at 11:42 AM Andres Freund wrote:
> > I'm envisioning this patch series as v19 work, were you
> > thinking we should be more aggressive?
>
> Mostly agreed - but I am wondering if the AV fix should be backpatched?
I think that it probably should be.
--
Peter Geoghegan
1 - 100 of 1855 matches
Mail list logo