Re: cleanup patches for incremental backup

2024-01-15 Thread Matthias van de Meent
On Mon, 15 Jan 2024 at 17:58, Robert Haas  wrote:
>
> On Sat, Jan 13, 2024 at 1:00 PM Alexander Lakhin  wrote:
> > I've found one more typo in the sgml:
> > summarized_pid
> > And one in a comment:
> > sumamry
> >
> > A trivial fix is attached.
>
> Thanks, committed.

Off-list I was notified that the new WAL summarizer process was not
yet added to the glossary, so PFA a patch that does that.
In passing, it also adds "incremental backup" to the glossary, and
updates the documented types of backends in monitoring.sgml with the
new backend type, too.

Kind regards,

Matthias van de Meent.


v1-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-01-15 Thread Matthias van de Meent
I also notice that the merging of values doesn't seem to be applied
optimally with mixed-type array operations: num = int[] AND num =
bigint[] AND num = int[] doesn't seem to merge the first and last
array ops. I'm also concerned about whether such mixed-type array
operations can be merged correctly at all.

> +/*
> + * _bt_merge_arrays() -- merge together duplicate array keys
> + *
> + * Both scan keys have array elements that have already been sorted and
> + * deduplicated.
> + */

As I mentioned upthread, I find this function to be very wasteful, as
it uses N binary searches to merge join two already sorted arrays,
resulting in an O(n log(m)) complexity, whereas a normal merge join
should be O(n + m) once the input datasets are sorted.
Please fix this, as it shows up in profiling of large array merges.
Additionally, as it merges two arrays of unique items into one,
storing only matching entries, I feel that it is quite wasteful to do
this additional allocation here. Why not reuse the original allocation
immediately?
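
For illustration, a minimal sketch of the merge-join-style approach
(hypothetical code, not taken from the patch; cmp() stands in for the
opclass comparison):

/*
 * Intersect two sorted, deduplicated Datum arrays in a single forward
 * pass: O(n + m) comparisons instead of n binary searches.
 */
static int
merge_sorted_arrays(Datum *a, int na, Datum *b, int nb,
                    Datum *out, int (*cmp) (Datum, Datum))
{
    int i = 0, j = 0, nout = 0;

    while (i < na && j < nb)
    {
        int c = cmp(a[i], b[j]);

        if (c == 0)
        {
            out[nout++] = a[i];    /* present in both: keep */
            i++;
            j++;
        }
        else if (c < 0)
            i++;                   /* only in a: skip */
        else
            j++;                   /* only in b: skip */
    }
    return nout;
}

Since nout never exceeds i, 'out' could even alias 'a', which would
also address the allocation reuse point above.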

> +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
> + IndexTuple tuple, int sktrig, bool validtrig)

I don't quite understand what the 'validtrig' argument is used for.
There is an assertion that it is false under some conditions in this
code, but it's not at all clear to me why that would have to be the
case - it is called with `true` in one of the three call sites. Could
the meaning of this be clarified?

I also feel that this patch includes several optimizations such as
this sktrig argument which aren't easy to understand. Could you pull
that into a separately reviewable patch?

Additionally, could you try to create a single point of entry for the
array key stuff that covers the new systems? I've been trying to wrap
my head around this, and it's taking a lot of effort.

> _bt_advance_array_keys

Thinking about the implementation here:
We require transitivity for btree opclasses, where A < B implies NOT A
= B, etc. Does this also carry over into cross-type operators? E.g. a
type 'truncatedint4' that compares only the highest 16 bits of an
integer would be strictly sorted, and could compare 0::truncatedint4 =
0::int4 as true, as well as 0::truncatedint4 = 2::int4, while 0::int4
= 2::int4 is false.
Would it be valid to add support methods for truncatedint4 to an int4
btree opclass, or is transitivity also required for all operations?
That is, must all values that one operator class considers unique
within an opfamily also be considered unique by all additional
operators in the opfamily, or is that not required?
If not, then that would pose problems for this patch, as the ordering
of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY
('{0,65536}'::truncatedint4[]) could potentially skip results.

I'm also no fan of the (tail) recursion. I would agree that this is
unlikely to consume a lot of stack, but it does consume stack space
nonetheless, and I'd prefer it not be done this way.

I notice an assertion error here:
> +Assert(cur->sk_strategy != BTEqualStrategyNumber);
> +Assert(all_required_sk_satisfied);
> +Assert(!foundRequiredOppositeDirOnly);
> +
> +foundRequiredOppositeDirOnly = true;

This assertion can be hit with the following test case:

CREATE TABLE test AS
SELECT i a, i b, i c FROM generate_series(1, 1000) i;
CREATE INDEX ON test(a, b, c); ANALYZE;
SELECT count(*) FROM test
WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1
AND b = ANY ('{1,2,3}');

> +_bt_update_keys_with_arraykeys(IndexScanDesc scan)

I keep getting confused by the mixing of integer increments and
pointer increments. Could you explain why in this code you chose to
increment a pointer for "ScanKey cur", while using array indexing for
other fields? It feels very arbitrary to me, and that makes the code
difficult to follow.

> +++ b/src/test/regress/sql/btree_index.sql
> +-- Add tests to give coverage of various subtle issues.
> +--
> +-- XXX This may not be suitable for commit, due to taking up too many cycles.
> +--
> +-- Here we don't remember the scan's array keys before processing a page, only
> +-- after processing a page (which is implicit, it's just the scan's current
> +-- keys).  So when we move the scan backwards we think that the top-level scan
> +-- should terminate, when in reality it should jump backwards to the leaf page
> +-- that we last visited.

I notice this adds a complex test case that outputs many rows. Could we
do with fewer rows if we build the index after data insertion, and with
a lower (non-default) fillfactor?

Note: I did not yet do any in-depth review of the planner changes in
indxpath.c/selfuncs.c.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Matthias van de Meent
On Wed, 10 Jan 2024 at 02:44, Andy Fan  wrote:
> Hi,
>
> I want to know if Andres or you have plans
> to do some code review. I don't expect this to happen very soon; I just
> want to make sure we don't end up in a situation where each of you
> thinks the other one will do it, but in fact neither does. A commitfest
> entry [1] has been added for this.


> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -5419,6 +5419,7 @@ LockBufHdr(BufferDesc *desc)
> perform_spin_delay();
> }
> finish_spin_delay();
> +START_SPIN_LOCK();
> return old_buf_state | BM_LOCKED;
> }

I think that we need to 'arm' the checks just before we lock the spin
lock, and 'disarm' the checks just after we unlock the spin lock,
rather than after and before, respectively. That way, we won't have a
chance of false negatives: with your current patch it is possible that
an interrupt fires between the acquisition of the lock and the code in
START_SPIN_LOCK() marking the thread as holding a spin lock, which
would cause any check in that signal handler to incorrectly read that
we don't hold any spin locks.
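
To sketch the ordering I have in mind (using the patch's
START_SPIN_LOCK(); the name of the disarming counterpart is my
assumption):

START_SPIN_LOCK();        /* arm: before we can possibly hold the lock */
while (TAS_SPIN(lock))
    perform_spin_delay(&delayStatus);
finish_spin_delay(&delayStatus);

/* ... a few instructions of critical section ... */

S_UNLOCK(lock);
END_SPIN_LOCK();          /* disarm: only once the lock is released */

With this ordering, a signal handler can at worst see a false positive
(we're marked as holding between the release and the disarm), which is
the safe direction to err in.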

> +++ b/src/backend/storage/lmgr/lock.c
> @@ -776,6 +776,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
> boolfound_conflict;
> boollog_lock = false;
>
> +Assert(SpinLockCount == 0);
> +

I'm not 100% sure about the policy here, but theoretically you could
use LockAcquireExtended(dontWait=true) while holding a spin lock, as
that would not have an unknown duration. Then again, this function
also does elog/ereport, which would still cause issues, so this code
may be the better option.

> +elog(PANIC, "stuck spinlock detected at %s, %s:%d after waiting for %u ms",
> + func, file, line, delay_ms);

pg_usleep doesn't actually guarantee that we'll wait for exactly that
duration; depending on signals received while spinning and/or OS
scheduling decisions it may be off by orders of magnitude.

> +++ b/src/common/scram-common.c

This is unrelated to the main patchset.

> +++ b/src/include/storage/spin.h

Minor: I think these changes would be better placed in miscadmin.h, or
at least the definition of SpinLockCount should be moved there: the
spin lock system itself shouldn't be needed in places where we need to
make sure that we don't hold any spinlocks, and miscadmin.h already
holds things related to "System interrupt and critical section
handling", which seems quite related.

Kind regards,

Matthias van de Meent




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-01-04 Thread Matthias van de Meent
On Mon, 25 Dec 2023 at 15:12, Michail Nikolaev  wrote:
>
> Hello!
>
> It seems like the idea of "old" snapshot is still a valid one.
>
> > Should this deal with any potential XID wraparound, too?
>
> As far as I understand in our case, we are not affected by this in any way.
> Vacuum in our table is not possible because of locking, so, nothing
> may be frozen (see below).
> In the case of super long index building, transactional limits will
> stop new connections using current
> regular infrastructure because it is based on relation data (but not
> actual xmin of backends).
>
> > How does this behave when the newly inserted tuple's xmin gets frozen?
> > This would be allowed to happen during heap page pruning, afaik - no
> > rules that I know of which are against that - but it would create
> > issues where normal snapshot visibility rules would indicate it
> > visible to both snapshots regardless of whether it actually was
> > visible to the older snapshot when that snapshot was created...
>
> As I can see, heap_page_prune never freezes any tuples.
> In the case of regular vacuum, it is used this way: call heap_page_prune,
> then heap_prepare_freeze_tuple, and then
> heap_freeze_execute_prepared.

Correct, but there are changes being discussed where we would freeze
tuples during pruning as well [0], which would invalidate that
implementation detail. And, if I had to choose between improved
opportunistic freezing and improved R/CIC, I'd probably choose
improved freezing over R/CIC.

As an alternative, we _could_ keep track of concurrent index inserts
using a dummy index (with the same predicate) which only holds the
TIDs of the inserted tuples. We'd keep it as an empty index in phase
1, and every time we reset the visibility snapshot we now only need to
scan that index to know what tuples were concurrently inserted. This
should have a significantly lower IO overhead than repeated full index
bulkdelete scans for the new index in the second table scan phase of
R/CIC. However, in the worst case it could still require another
O(tablesize) of storage.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/caakru_a+g2oe6ahjcbibftnfiy2aib4e31x9qyj_qkjxzmz...@mail.gmail.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Matthias van de Meent
On Thu, 4 Jan 2024 at 08:09, Andy Fan  wrote:
>
> My question is: if someone doesn't obey the rule by mistake (everyone
> can make mistakes), shall we PANIC in a production environment? IMO it
> can be a WARNING in a production environment, and a hard stop
> when 'ifdef USE_ASSERT_CHECKING'.
> [...]
> I think an experienced engineer like Thomas can make this mistake, and
> though the patch was reviewed by 3 people, the bug is still there. It is
> not easy to say "just don't do it".
>
> The attached code shows the prototype in my mind. Any feedback is welcome.

While I understand your point and could maybe agree with the change
itself (a crash isn't great), I don't think it's an appropriate fix
for the problem of holding a spinlock while waiting for a LwLock, as
spin.h specifically mentions the following (and you quoted the same):

"""
Keep in mind the coding rule that spinlocks must not be held for more
than a few instructions.
"""

I suspect that we'd be better off with stronger protections against
waiting for LwLocks while we hold any spin lock. More specifically,
I'm thinking about something like tracking how many spin locks we
hold, and Assert()-ing that we don't hold any such locks when we start
to wait for an LwLock or run CHECK_FOR_INTERRUPTS-related code (with
potential manual contextual overrides in specific areas of code where
specific care has been taken to make it safe to hold spin locks while
doing those operations - although I consider their existence unlikely
I can't rule them out as I've yet to go through all lock-touching
code). This would probably work in a similar manner as
START_CRIT_SECTION/END_CRIT_SECTION.
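
Roughly along these lines (all names are hypothetical, mirroring the
critical section machinery):

extern PGDLLIMPORT int SpinLockHeldCount;

#define START_SPIN_LOCK()   (SpinLockHeldCount++)
#define END_SPIN_LOCK() \
    do { \
        Assert(SpinLockHeldCount > 0); \
        SpinLockHeldCount--; \
    } while (0)

/* and then, e.g. at the top of LWLockAcquire() and inside
 * CHECK_FOR_INTERRUPTS(): */
Assert(SpinLockHeldCount == 0);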

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Reducing output size of nodeToString

2024-01-03 Thread Matthias van de Meent
On Tue, 2 Jan 2024 at 11:30, Peter Eisentraut  wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
> >
> > The basic techniques used are
> >   - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> >   - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> >   - Reset more fields to their default values before storing the values.
> >   - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
>
> Based on our discussions, my understanding is that you wanted to produce
> an updated patch set that is split up a bit.

I mentioned that I've been working on implementing (but have not yet
completed) a binary serialization format, with an implementation based
on Andres' generated metadata idea. However, that requires more
elaborate infrastructure than is currently available, so while I said
I'd expected it to be complete before the Christmas weekend, it'll
take some more time - I'm not sure it'll be ready for PG17.

In the meantime, here's an updated version of the v0 patch, formally
keeping the textual format alive while reducing the size
significantly (nearing a 2/3 reduction), taking your comments into
account. I think the gains are worth considering even without taking
into account the as-yet unimplemented binary format.

> My suggestion is to make incremental patches along these lines:
> [...]

Something like the attached? It splits out into the following
0001: basic 'omit default values'
0002: reset location and other querystring-related node fields for all
catalogs of type pg_node_tree.
0003: add default marking on typmod fields.
0004 & 0006: various node fields marked with default() based on
observed common or initial values of those fields
0005: truncate trailing 0s from outDatum
0007 (new): do run-length + gap coding for bitmapset and the various
integer list types. This saves a surprising number of bytes; a sketch
of the gap-coding idea follows below.
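
To sketch the gap-coding part of 0007 (illustrative code only, not the
patch itself): a sorted integer list is serialized as its first element
followed by successive differences, so runs of nearby values become
streams of small numbers that cost few bytes in the textual format.

static void
write_int_list_gaps(StringInfo str, const int *values, int n)
{
    int prev = 0;

    for (int i = 0; i < n; i++)
    {
        /* emit the distance to the previous value, not the value */
        appendStringInfo(str, " %d", values[i] - prev);
        prev = values[i];
    }
}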

> The last one I have some doubts about, as previously expressed, but the
> first few seem sensible to me.  By splitting it up we can consider these
> incrementally.

That makes a lot of sense. The numbers for the full patchset do seem
quite positive though: The metrics of the query below show a 40%
decrease in size of a fresh pg_rewrite (standard toast compression)
and a 5% decrease in size of the template0 database. The uncompressed
data of pg_rewrite.ev_action is also 60% smaller.

select pg_database_size('template0') as "template0"
 , pg_total_relation_size('pg_rewrite') as "pg_rewrite"
 , sum(pg_column_size(ev_action)) as "compressed"
 , sum(octet_length(ev_action)) as "raw"
from pg_rewrite;

 version | template0 | pg_rewrite | compressed |   raw
---------+-----------+------------+------------+---------
 master  |   7545359 |     761856 |     573307 | 2998712
 0001    |   7365135 |     622592 |     438224 | 1943772
 0002    |   7258639 |     573440 |     401660 | 1835803
 0003    |   7258639 |     565248 |     386211 | 1672539
 0004    |   7176719 |     483328 |     317099 | 1316552
 0005    |   7176719 |     483328 |     315556 | 1300420
 0006    |   7160335 |     466944 |     302806 | 1208621
 0007    |   7143951 |     450560 |     287659 | 1187237

While looking through the data, I noticed that a significant portion
of the larger views now consists of range table entries, specifically
the Alias and Var nodes (which are mostly repeated and/or repetitive
values, but split across Nodes). I think column-major storage would be
more efficient to write, but I'm not sure it's worth the effort in
planner code.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v1-0001-pg_node_tree-Don-t-serialize-fields-with-type-def.patch
Description: Binary data


v1-0002-pg_node_tree-reset-node-location-before-catalog-s.patch
Description: Binary data


v1-0005-NodeSupport-Don-t-emit-trailing-0s-in-outDatum.patch
Description: Binary data


v1-0004-NodeSupport-add-some-more-default-markers-for-var.patch
Description: Binary data


v1-0003-Nodesupport-add-support-for-custom-default-values.patch
Description: Binary data


v1-0007-NodeSupport-Apply-RLE-and-differential-encoding-o.patch
Description: Binary data


v1-0006-NodeSupport-Apply-some-more-defaults-serializatio.patch
Description: Binary data


Re: Reducing output size of nodeToString

2024-01-03 Thread Matthias van de Meent
On Wed, 3 Jan 2024 at 03:02, David Rowley  wrote:
>
> On Thu, 14 Dec 2023 at 19:21, Matthias van de Meent
>  wrote:
> >
> > On Thu, 7 Dec 2023 at 13:09, David Rowley  wrote:
> > > We could also easily serialize plans to binary format for copying to
> > > parallel workers rather than converting them to a text-based
> > > serialized format. It would also allow us to do things like serialize
> > > PREPAREd plans into a nicely compact single allocation that we could
> > > just pfree in a single pfree call on DEALLOCATE.
> >
> > I'm not sure what benefit you're referring to. If you mean "it's more
> > compact than the current format" then sure; but the other points can
> > already be covered by either the current nodeToString format, or by
> > nodeCopy-ing the prepared plan into its own MemoryContext, which would
> > allow us to do essentially the same thing.
>
> There's significantly less memory involved in just having a plan
> serialised into a single chunk of memory vs a plan stored in its own
> MemoryContext.  With the serialised plan, you don't have any power of
> 2 rounding up wastage that aset.c does and don't need extra space for
> all the MemoryChunks that would exist for every single palloc'd chunk
> in the MemoryContext version.

I was envisioning this to use the Bump memory context you proposed
over in [0]: to the best of my knowledge prepared plans are not
modified, so nodeCopy-ing a prepared plan into a bump context could be
a good use case for those contexts. This should remove the issue of
power-of-2 rounding and MemoryChunk wastage in aset.c.

> I think it would be nice if one day in the future if a PREPAREd plan
> could have multiple different plans cached. We could then select which
> one to use by looking at statistics for the given parameters and
> choose the plan that's most suitable for the given parameters.   Of
> course, this is a whole entirely different project. I mention it just
> because being able to serialise a plan would make the memory
> management and overhead for such a feature much more manageable.
> There'd likely need to be some eviction logic in such a feature as the
> number of possible plans for some complex query is quite likely to be
> much more than we'd care to cache.

Yeah, that'd be nice, but is also definitely future work.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: 
https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM%3DBp%3D6xjL_nLys4zdZOpfNyWBk97Xrdj2w%40mail.gmail.com




Re: Next step towards 64bit XIDs: Switch to FullTransactionId for PGPROC->xid and XLogRecord->xl_xid

2023-12-29 Thread Matthias van de Meent
On Fri, 29 Dec 2023, 13:49 Maxim Orlov,  wrote:

> Hi!
>
> As was discussed in [0], our overall goal is to make Postgres 64-bit
> XIDs.  It's obvious that such a big patch set
> couldn't possibly be committed "at once".  The SLRU patch set [1] was
> committed a short while ago as a first significant
> step in this direction.
>
> This thread is a next step in this enterprise.  My objective here is to
> commit some changes which are mandatory,
> as far as I understand, for any type of 64-bit XID implementation. And I'm
> sure there will be points for discussion here.
>
> My original intention was to make PGPROC->xmin, PGPROC->xid and
> PROC_HDR->xids 64-bit.  But in reality,
> it turned out to be much more difficult than I expected.  On the one hand,
> the patch became too big, and on the other
> hand, it relies heavily on the epoch and XID "adjustment" to FXID.  Therefore,
> for now, I decided to limit myself to
> more atomic and independent changes. However, as I said above, these
> changes are required for any implementation
> of 64-bit XIDs.
>
> So, PFA patches to switch PGPROC->xid

I think this could be fine, but ...

> and XLogRecord->xl_xid to FullTransactionId.

I don't think this is an actionable change, as this wastes 4 more bytes (or
8 with alignment) in nearly all WAL records that don't use the
HEAP/HEAP2/XLOG rmgrs, which would then be up to 10 (if not 14, when
64-bit-aligned) bytes per record. Unless something like [0] gets committed,
this will add a significant write overhead to all operations, even if they
are not doing anything that needs an XID.

Also, I don't think we need to support transactions that stay alive and
change things for longer than 2^31 concurrently created transactions, so we
could well add an fxid to each WAL segment header (and checkpoint record?)
and calculate the fxid of each record as a relative fxid off of that.
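
A sketch of what I mean (hypothetical code; it assumes every record's
xid lies within 2^31 transactions of the segment's base fxid):

static inline uint64
record_full_xid(uint64 segment_base_fxid, uint32 xl_xid)
{
    /* signed 32-bit distance from the base, widened back to 64 bits */
    int32 diff = (int32) (xl_xid - (uint32) segment_base_fxid);

    return segment_base_fxid + diff;
}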


Kind regards

Matthias van de Meent

[0] https://commitfest.postgresql.org/46/4386/


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-20 Thread Matthias van de Meent
On Wed, 20 Dec 2023 at 10:56, Michail Nikolaev  wrote:
> > Note that the use of such expressions would be a violation of the
> > function's definition; it would depend on data from other tables which
> > makes the function behave like a STABLE function, as opposed to the
> > IMMUTABLE that is required for index expressions. So, I don't think we
> > should specially care about being correct for incorrectly marked
> > function definitions.
>
> Yes, but such cases could probably cause crashes too...
> So, I think it is better to check them for custom functions. But I'm
> still not sure
> whether such limitations are still required for the proposed optimization.

I think the contents could be inconsistent, but not more inconsistent
than if the index was filled across multiple transactions using inserts.
Either way, I don't see it breaking more things than are already broken
that way in other places - at most it will introduce another path that
exposes the broken state caused by mislabeled functions.

> > I just realised there is one issue with this design: We can't cheaply
> > reset the snapshot during the second table scan:
> > It is critically important that the second scan of R/CIC uses an index
> > contents summary (made with index_bulk_delete) that was created while
> > the current snapshot was already registered.
>
> > So, the "reset the snapshot every so often" trick cannot be applied in
> > phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> > every time we reset the snapshot. Rescanning might be worth the cost
> > (e.g. when using BRIN), but that is very unlikely.
>
> Hm, I think it is still possible. We could just manually recheck the
> tuples we see
> against the snapshot currently used for the scan. If an "old" snapshot can
> see the tuple too (HeapTupleSatisfiesHistoricMVCC), then search for it in
> the index summary.

That's an interesting method.

How would this deal with tuples not visible to the old snapshot?
Presumably we can assume they're newer than that snapshot (the old
snapshot didn't have it, but the new one does, so it's committed after
the old snapshot, making them newer), so that backend must have
inserted it into the index already, right?

> HeapTupleSatisfiesHistoricMVCC

That function has this comment marker:
   "Only usable on tuples from catalog tables!"
Is that correct even for this?

Should this deal with any potential XID wraparound, too?
How does this behave when the newly inserted tuple's xmin gets frozen?
This would be allowed to happen during heap page pruning, afaik - no
rules that I know of which are against that - but it would create
issues where normal snapshot visibility rules would indicate it
visible to both snapshots regardless of whether it actually was
visible to the older snapshot when that snapshot was created...

Either way, "Historic snapshot" isn't something I've worked with
before, so that goes onto my "figure out how it works" pile.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: pg_waldump

2023-12-19 Thread Matthias van de Meent
On Tue, 19 Dec 2023, 12:27 Fabrice Chapuis,  wrote:
>
> Hi,
> Is it possible to visualize the DDL with the pg_waldump tool? I created a
> postgres user, but I cannot find the creation command in the WALs.

Not really, no. PostgreSQL does not log DDL or DML as such in WAL.
Essentially all catalog updates are logged only as changes on a
certain page in some file: a new user getting inserted would be
approximately "Insert tuple [user's pg_authid row data] on page X in
file [the file corresponding to the pg_authid table]".

You could likely derive most DDL commands from Heap/Insert,
Heap/Delete, and Heap/Update records (after cross-referencing the
database's relfilenode map), as most DDL is "just" a lot of in-memory
operations plus some record insertions/updates/deletes in catalog
tables. You'd also need to keep track of any relfilenode changes while
processing the WAL, as VACUUM FULL on the catalog tables would change
their file numbering...
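
For the cross-referencing, core ships helper functions you can use, as
long as you still have a live cluster to query (illustrative example):

-- relation -> current file number
SELECT pg_relation_filenode('pg_authid');
-- (tablespace, file number) -> relation; a tablespace of 0 means the
-- database's default tablespace
SELECT pg_filenode_relation(0, 1260);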

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Add --check option to pgindent

2023-12-18 Thread Matthias van de Meent
On Mon, 18 Dec 2023 at 16:45, Tristan Partin  wrote:
>
> On Mon Dec 18, 2023 at 6:41 AM CST, Daniel Gustafsson wrote:
> > > On 15 Dec 2023, at 16:43, Tristan Partin  wrote:
> >
> > > Here is a v3.
> >
> > I think this is pretty much ready to go, the attached v4 squashes the 
> > changes
> > and fixes the man-page which also needed an update.  The referenced Wiki 
> > page
> > will need an edit or two after this goes in, but that's easy enough.
>
> I have never edited the Wiki before. How can I do that? More than happy
> to do it.

As mentioned at the bottom of the main page of the wiki:

  NOTE: due to recent spam activity "editor" privileges are granted
manually for the time being.
  If you just created a new community account or if your current
account used to have "editor" privileges ask on either the PostgreSQL
-www Mailinglist or the PostgreSQL IRC Channel (direct your request to
'pginfra', multiple individuals in the channel highlight on that
string) for help. Please include your community account name in those
requests.

After that, it's just a case of logging in on the wiki (link in the top
right corner, which uses the community account), going to your
preferred page, clicking edit, and making your modifications. Don't
forget to save the modifications - I'm not sure whether the wiki
editor's state is locally persisted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-17 Thread Matthias van de Meent
On Sun, 17 Dec 2023, 21:14 Michail Nikolaev,  wrote:
>
> Hello!
>
> > I've thought about alternative solutions, too: how about getting a new 
> > snapshot every so often?
> > We don't really care about the liveness of the already-scanned data; the 
> > snapshots used for RIC
> > are used only during the scan. C/RIC's relation's lock level means vacuum 
> > can't run to clean up
> > dead line items, so as long as we only swap the backend's reported snapshot 
> > (thus xmin) while
> > the scan is between pages we should be able to reduce the time C/RIC is the 
> > one backend
> > holding back cleanup of old tuples.
>
> Hm, it looks like an interesting idea! It may be more dangerous, but
> at least it feels much more elegant than an LP_DEAD-related way.
> Also, it feels like we may apply this to both phases (the first and the
> second scans).
> The original patch (1) helped only the second one (after the call
> to set_indexsafe_procflags).
>
> But for the first scan, we're allowed to do so only for non-unique indexes
> because of:
>
> > * The reason for doing that is to avoid
> > * bogus unique-index failures due to concurrent UPDATEs (we might see
> > * different versions of the same row as being valid when we pass over them,
> > * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
> > * does not contain any tuples added to the table while we built the index.

Yes, for that we'd need an extra scan of the index that validates
uniqueness. I think there was a proposal (though it may only have been
an idea) some time ago, about turning existing non-unique indexes into
unique ones by validating the data. Such a system would likely be very
useful to enable this optimization.

> Also, (1) was limited to indexes without expressions and predicates
> (2) because such may execute queries to other tables (sic!).

Note that the use of such expressions would be a violation of the
function's definition; it would depend on data from other tables which
makes the function behave like a STABLE function, as opposed to the
IMMUTABLE that is required for index expressions. So, I don't think we
should specially care about being correct for incorrectly marked
function definitions.

> One possible solution is to add some checks to make sure no
> user-defined functions are used.
> But as far as I understand, it affects only CIC for now and does not
> affect the ability to use the proposed technique (updating snapshot
> time to time).
>
> However, I think we need some more-less formal proof it is safe - it
> is really challenging to keep all the possible cases in the head. I’ll
> try to do something here.

I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered. If we didn't do that, the
following would occur:

1. The index is marked ready for inserts from concurrent backends, but
not yet ready for queries.
2. We get the bulkdelete data
3. A concurrent backend inserts a new tuple T on heap page P, inserts
it into the index, and commits. This tuple is not in the summary, but
has been inserted into the index.
4. R/CIC resets the snapshot, making T visible.
5. R/CIC scans page P, finds that tuple T has to be indexed but is not
present in the summary, thus inserts that tuple into the index (which
already had it inserted at 3)

This thus would be a logic bug, as indexes assume at-most-once
semantics for index tuple insertion; duplicate insertion are an error.

So, the "reset the snapshot every so often" trick cannot be applied in
phase 3 (the rescan), or we'd have to do an index_bulk_delete call
every time we reset the snapshot. Rescanning might be worth the cost
(e.g. when using BRIN), but that is very unlikely.

Alternatively, we'd need to find another way to prevent us from
inserting these duplicate entries - maybe by storing the scan's data
in a buffer to later load into the index after another
index_bulk_delete()? Counterpoint: for BRIN indexes that'd likely
require a buffer much larger than the result index would be.

Either way, for the first scan (i.e. phase 2 "build new indexes") this
is not an issue: we don't care about what transaction adds/deletes
tuples at that point.
For all we know, all tuples of the table may be deleted concurrently
before we even allow concurrent backends to start inserting tuples,
and the algorithm would still work as it does right now.

> Another possible issue may be caused by the new locking pattern - we
> will be required to wait for all transaction started before the ending
> of the phase to exit.

What do you mean by "new locking pat

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-15 Thread Matthias van de Meent
On Fri, 15 Dec 2023, 20:07 Michail Nikolaev,  wrote:

> Hello, hackers!
>
> I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> improvements) in some lighter way.
>
> Yes, a serious bug (2) was caused by this optimization, and it is now reverted.
>
> But what about a more safe idea in that direction:
> 1) add new horizon which ignores PROC_IN_SAFE_IC backends and standbys
> queries
> 2) use this horizon for settings LP_DEAD bit in indexes (excluding
> indexes being built of course)
>
> Index LP_DEAD hints are not used by standbys in any way (they are just
> ignored); the heap scan done by index building does not use them
> either.
>
> But, at the same time:
> 1) index scans will be much faster during index creation or standby
> reporting queries
> 2) indexes can keep them fit using different optimizations
> 3) less WAL due to the huge number of full-page writes (which are caused
> by tons of LP_DEAD in indexes)
>
> The patch seems more or less easy to implement.
> Is it worth implementing? Or too scary?
>

I highly doubt this is worth the additional cognitive overhead of another
liveness state, and I think there might be other issues with marking index
tuples dead in indexes before the table tuple is dead that I can't think of
right now.

I've thought about alternative solutions, too: how about getting a new
snapshot every so often?
We don't really care about the liveness of the already-scanned data; the
snapshots used for RIC are used only during the scan. C/RIC's relation's
lock level means vacuum can't run to clean up dead line items, so as long
as we only swap the backend's reported snapshot (thus xmin) while the scan
is between pages we should be able to reduce the time C/RIC is the one
backend holding back cleanup of old tuples.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


> [1]: https://postgr.es/m/20210115133858.GA18931@alvherre.pgsql
> [2]: https://postgr.es/m/17485-396609c6925b982d%40postgresql.org
>
>
>


Re: Reducing output size of nodeToString

2023-12-13 Thread Matthias van de Meent
On Thu, 7 Dec 2023 at 13:09, David Rowley  wrote:
>
> On Thu, 7 Dec 2023 at 10:09, Matthias van de Meent
>  wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
>
> It would be very cool to have the technology proposed by Andres back
> in 2019 [1]. With that, we could easily write various output
> functions.  One could be compact and easily machine-readable and
> another designed to be better for humans for debugging purposes.
>
> We could also easily serialize plans to binary format for copying to
> parallel workers rather than converting them to a text-based
> serialized format. It would also allow us to do things like serialize
> PREPAREd plans into a nicely compact single allocation that we could
> just pfree in a single pfree call on DEALLOCATE.

I'm not sure what benefit you're referring to. If you mean "it's more
compact than the current format" then sure; but the other points can
already be covered by either the current nodeToString format, or by
nodeCopy-ing the prepared plan into its own MemoryContext, which would
allow us to do essentially the same thing.

> Likely we could just use the existing Perl scripts to form the
> metadata arrays rather than the clang parsing stuff Andres used in his
> patch.
>
> Anyway, just wanted to ensure you knew about this idea.

I knew about that thread, but didn't notice the metadata arrays
part of it, which indeed looks interesting for this patch. Thanks for
pointing it out. I'll see if I can incorporate parts of that into this
patchset.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: initdb caching during tests

2023-12-07 Thread Matthias van de Meent
On Thu, 7 Dec 2023 at 15:06, Daniel Gustafsson  wrote:
>
> > On 7 Dec 2023, at 14:50, Matthias van de Meent 
> >  wrote:
>
> > Attached a patch that fixes this for both make and meson, by adding
> > --no-clean to the initdb template.
>
> Makes sense.  While in there I think we should rename -N to the long option
> --no-sync to make it easier to grep for and make the buildfiles more
> self-documenting.

Then that'd be the attached patch, which also includes --auth instead
of -A, for the same reason as -N vs. --no-sync.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v2-0001-Don-t-remove-initdb-template-when-initdb-fails.patch
Description: Binary data


Re: initdb caching during tests

2023-12-07 Thread Matthias van de Meent
On Fri, 25 Aug 2023 at 00:16, Andres Freund  wrote:
>
> Hi,
>
> On 2023-08-23 10:10:31 +0200, Daniel Gustafsson wrote:
> > > On 23 Aug 2023, at 03:17, Andres Freund  wrote:
> > > On 2023-08-22 23:47:24 +0200, Daniel Gustafsson wrote:
> >
> > >> My only small gripe is that I keep thinking about template databases for 
> > >> CREATE
> > >> DATABASE when reading the error messages in this patch, which is clearly 
> > >> not
> > >> related to what this does.
> > >>
> > >> +   note("initializing database system by copying initdb template");
> > >>
> > >> I personally would've used cache instead of template in the user facing 
> > >> parts
> > >> to keep concepts separated, but thats personal taste.
> > >
> > > I am going back and forth on that one (as one can notice with $subject). 
> > > It
> > > doesn't quite seem like a cache, as it's not "created" on demand and only
> > > usable when the exactly same parameters are used repeatedly. But template 
> > > is
> > > overloaded as you say...
> >
> > That's a fair point, cache is not a good word to describe a stored copy of
> > something prefabricated.  Let's go with template, we can always refine 
> > in-tree
> > if a better wording comes along.
>
> Cool. Pushed that way. Only change I made is to redirect the output of cp
> (and/or robocopy) in pg_regress, similar to how that was done for initdb
> proper.

While working on some things that are prone to breaking initdb, I
noticed that this template isn't generated with --no-clean, while
pg_regress does pass that option. This meant `make check` didn't
produce any meaningful debuggable output when I broke initdb,
which is undesirable.

Attached a patch that fixes this for both make and meson, by adding
--no-clean to the initdb template.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v1-0001-Don-t-remove-initdb-template-when-initdb-fails.patch
Description: Binary data


Re: Reducing output size of nodeToString

2023-12-07 Thread Matthias van de Meent
On Thu, 7 Dec 2023 at 11:26, Peter Eisentraut  wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
> >
> > The basic techniques used are
> >   - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> >   - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> >   - Reset more fields to their default values before storing the values.
> >   - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
> >
> > Future work will probably have to be on a significantly different
> > storage format, as the textual format is about to hit its entropy
> > limits.
>
> One thing that was mentioned repeatedly is that we might want different
> formats for human consumption and for machine storage.
> For human consumption, I would like some format like what you propose,
> because it generally omits the "unset" or "uninteresting" fields.
>
> But since you also talk about the size of pg_rewrite, I wonder whether
> it would be smaller if we just didn't write the field names at all but
> instead all the field values.  (This should be pretty easy to test,
> since the read functions currently ignore the field names anyway; you
> could just write out all field names as "x" and see what happens.)

I've been thinking about using a more binary storage format similar to
protobuf (but with system knowledge baked in, instead of PB's
defaults), but that would depend on functions that change the
output format of pg_node_tree too, which Michel mentioned he would
work on a year ago (iiuc).

I think that would be a logical next step after this, but this patch
just builds infrastructure that reduces the stored size without
getting in the way of Michel's work, if that produced any result.

> I don't much like the way your patch uses the term "default".  Most of
> these default values are not defaults at all, but perhaps "most common
> values".

Yes, some 'defaults' are curated, but they have sound logic behind
them: *typmod is essentially always copied from an attypmod, which
defaults to -1. *isnull for any constant is generally unset. Many of
those other fields (once initialized by the relevant code) default to
those values I used.

> In theory, I would expect a default value to be initialized by
> makeNode().  (That could be an interesting feature, but let's stay
> focused here.)  But even then most of these "defaults" wouldn't be
> appropriate for a real default value.  This part seems quite
> controversial to me, and I would like to see some more details about how
> much this specifically really saves.

The tuning of these "defaults" got the savings from 20-30% to this
50%+ reduction in raw size.

> I don't quite understand why in your patch you have some fields as
> optional and some not.  Or is that what WRITE_NODE_FIELD() vs.
> WRITE_NODE_FIELD_OPT() means?  How is it decided which one to use?

I use _OPT when I know the value is likely to be its default value,
and don't change over to _OPT when I know with great certainty that the
value is going to be dynamic, such as the relation ID in RTEs; but this
is only relevant for manual code, as generated code essentially always
uses the _OPT paths.

> The part that clears out the location fields in pg_rewrite entries might
> be worth considering as a separate patch.  Could you explain it more?
> Does it affect location pointers when using views at all?

Views don't store the original query string, so the location pointers
in views point to locations in a now non-existent query string.
Additionally, unless WRITE_READ_PARSE_PLAN_TREES is defined,
READ_LOCATION_FIELD does not actually read the stored value but
instead stores -1 in the indicated field, so in most cases there won't
be any difference between the deserialized data before and after this
part of the patch; the only difference is the amount of debuggable
information stored in the view's internal data.
Note that resetting them to an 'invalid' value thus makes sense, and it
improves compressibility and allows removal from the serialized format
when serialization omits fields with default values.
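
For reference, the reading side behaves approximately like this
(paraphrased, not the exact readfuncs.h source):

#define READ_LOCATION_FIELD(fldname) \
    token = pg_strtok(&length);     /* skip :fldname */ \
    token = pg_strtok(&length);     /* get field value */ \
    (void) token;                   /* value is deliberately ignored */ \
    local_node->fldname = -1        /* set field to "unknown" */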

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: automating RangeTblEntry node support

2023-12-06 Thread Matthias van de Meent
On Wed, 6 Dec 2023 at 21:02, Peter Eisentraut  wrote:
>
> I have been looking into what it would take to get rid of the
> custom_read_write and custom_query_jumble for the RangeTblEntry node
> type.  This is one of the larger and more complex exceptions left.
> [...]
> Now one could probably rightfully complain that having all these unused
> fields dumped would make the RangeTblEntry serialization bigger.  I'm
> not sure how big of a problem that actually is, considering how many
> often-unset fields other node types have.  But it deserves some
> consideration.  I think the best way to work around that would be to
> have a mode that omits fields that have their default value (zero).
> This would be more generally useful; for example Query also has a bunch
> of fields that are not often set.  I think this would be pretty easy to
> implement, for example like

Actually, I've worked on this last weekend, and got some good results.
It did need some fine-tuning and field annotations, but got raw
nodeToString sizes down 50%+ for the pg_rewrite table's ev_action
column, and compressed-with-pglz size of pg_rewrite total down 30%+.

> #define WRITE_INT_FIELD(fldname) \
>  if (full_mode || node->fldname) \
>  appendStringInfo(str, " :" CppAsString(fldname) " %d",
> node->fldname)
>
> There is also the discussion over at [0] about larger redesigns of the
> node serialization format.  I'm also interested in that, but here I'm
> mainly trying to remove more special cases to make that kind of work
> easier in the future.
>
> Any thoughts about the direction?

I've created a new thread [0] with my patch. It actually didn't need
_that_ many manual changes - most of it was just updating the
gen_node_support.pl code generation, and making the macros do a good
job.

In general I'm all for reducing special cases, so +1 on the idea. I'll
have to check the specifics of the patches at a later point in time.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Reducing output size of nodeToString

2023-12-06 Thread Matthias van de Meent
Hi,

PFA a patch that reduces the output size of nodeToString by 50%+ in
most cases (measured on pg_rewrite), which on my system reduces the
total size of pg_rewrite by 33% to 472KiB. This does keep the textual
pg_node_tree format alive, but reduces its size significantly.

The basic techniques used are
 - Don't emit scalar fields when they contain a default value, and
make the reading code aware of this (see the sketch after this list).
 - Reasonable defaults are set for most datatypes, and overrides can
be added with new pg_node_attr() attributes. No introspection into
non-null Node/Array/etc. is being done though.
 - Reset more fields to their default values before storing the values.
 - Don't write trailing 0s in outDatum calls for by-ref types. This
saves many bytes for Name fields, but also some other pre-existing
entry points.
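
As a sketch of that first technique (hypothetical macro, not the patch
verbatim): the writer only emits a field when it differs from its
annotated default, and the reader falls back to the same default
whenever the token is absent.

#define WRITE_INT_FIELD_DEFAULT(fldname, defval) \
    do { \
        if (node->fldname != (defval)) \
            appendStringInfo(str, " :" CppAsString(fldname) " %d", \
                             node->fldname); \
    } while (0)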

Future work will probably have to be on a significantly different
storage format, as the textual format is about to hit its entropy
limits.

See also [0], [1] and [2], where complaints about the verbosity of
nodeToString were vocalized.

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com
[1] 
https://www.postgresql.org/message-id/flat/CACxu%3DvL_SD%3DWJiFSJyyBuZAp_2v_XBqb1x9JBiqz52a_g9z3jA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/4b27fc50-8cd6-46f5-ab20-88dbaadca645%40eisentraut.org


v0-0001-Reduce-the-size-of-serialized-nodes-in-nodeToStri.patch
Description: Binary data


Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-06 Thread Matthias van de Meent
On Wed, 6 Dec 2023 at 19:55, Peter Geoghegan  wrote:
>
> On Wed, Dec 6, 2023 at 5:27 AM Matthias van de Meent
>  wrote:
> > 1. When scanning an index in ascending order using scankey a > 1 (so,
> > one that defines a start point of the scan), we don't need to check
> > items for consistency with that scankey once we've found the first
> > value that is consistent with the scankey, as all future values will
> > also be consistent with the scankey (if we assume no concurrent page
> > deletions).
>
> BTW, I don't think that page deletion is a concern for these
> optimizations in the way that it is for the similar idea of "dynamic
> prefix compression", which works against insertion-type scan keys
> (used to descend the tree and to do an initial binary search of a leaf
> page).
>
> We already rely on the first call to _bt_readpage (the one that takes
> place from within _bt_first rather than from _bt_next) passing a page
> offset number that's exactly at the start of where matches begin --
> this is crucial in the case of scans with required equality strategy
> scan keys (most scans). If we just skipped the _bt_binsrch and passed
> P_FIRSTDATAKEY(opaque) to _bt_readpage within _bt_first instead, that
> would break lots of queries. So the _bt_binsrch step within _bt_first
> isn't just an optimization -- it's crucial. This is nothing new.

I was thinking more along the lines of page splits+deletions while
we're doing _bt_stepright(), but forgot to consider that we first lock
the right sibling, and only then release the left sibling for splits,
so we should be fine here.

Kind regards,

Matthias van de Meent




Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-06 Thread Matthias van de Meent
On Wed, 6 Dec 2023 at 14:11, Robert Haas  wrote:
>
> On Tue, Dec 5, 2023 at 8:15 PM Peter Geoghegan  wrote:
> > Just to be clear, you're raising a concern that seems to me to apply
> > to "the other optimization" from the same commit, specifically -- the
> > precheck optimization. Not the one I found a problem in. (They're
> > closely related but distinct optimizations.)
>
> It isn't very clear from the commit message that this commit is doing
> two different things, and in fact I'm still unclear on what exactly
> the other optimization is.

I feel that Peter referred to these two distinct optimizations
(illustrated by the example after this list):

1. When scanning an index in ascending order using scankey a > 1 (so,
one that defines a start point of the scan), we don't need to check
items for consistency with that scankey once we've found the first
value that is consistent with the scankey, as all future values will
also be consistent with the scankey (if we assume no concurrent page
deletions).

2. When scanning an index in ascending order using scankey a < 10 (one
that defines an endpoint of the scan), we can look ahead and check if
the last item on the page is consistent. If so, then all other items
on the page will also be consistent with that scankey.
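
For illustration (a hypothetical example, not from the commit):

CREATE TABLE t AS SELECT i a FROM generate_series(1, 1000) i;
CREATE INDEX ON t (a);

-- 1: "a > 1" only determines where the scan starts; once the first
-- matching item is found, no later item needs rechecking against it.
SELECT count(*) FROM t WHERE a > 1;

-- 2: "a < 10" determines where the scan ends; if the last item on a
-- page satisfies it, per-item checks of that key are skipped for the
-- whole page.
SELECT count(*) FROM t WHERE a < 10;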

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: backtrace_on_internal_error

2023-12-05 Thread Matthias van de Meent
On Tue, 5 Dec 2023 at 19:30, Robert Haas  wrote:
>
> On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart  
> wrote:
> > On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote:
> > > I think we should consider unconditionally emitting a backtrace when
> > > an elog() is hit, instead of requiring a GUC. Or at least any elog()
> > > that's not at a DEBUGn level. If the extra output annoys anybody, that
> > > means they're regularly hitting an elog(), and it ought to be promoted
> > > to ereport().
> >
> > Perhaps this should be a GUC that defaults to LOG or ERROR.
>
> Why?

I can't speak for Nathan, but my reason would be that I'm not in the
habit of attaching a debugger to my program to keep track of state
progression, but instead use elog() during patch development. I'm not
super stoked about getting my developmental elog(LOG)-s spammed with
stack traces, so I'd want to set this to at least ERROR, while in
production LOG could be fine.

Similarly, there are probably extensions that do not use ereport()
directly, but instead use elog(), because of reasons like 'not
planning on doing translations' and 'elog() is the easier API'.
Forcing a change over to ereport because of stack trace spam in logs
caused by elog would be quite annoying.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Extensible storage manager API - SMGR hook Redux

2023-12-04 Thread Matthias van de Meent
On Mon, 4 Dec 2023 at 22:03, Kirill Reshke  wrote:
>
> On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent 
>  wrote:
>>
>> On Mon, 4 Dec 2023 at 17:51, Kirill Reshke  wrote:
>> >
>> > So, 0002 patch uses the `get_tablespace` function, which searches Catalog 
>> > to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
>>
>> That's a very good point I hadn't considered in detail yet. Quite
>> clearly, the current code is wrong in assuming that the catalog is
>> accessible, and it should probably be stored in a way similar to
>> pg_filenode.map in a file managed outside the buffer pool.
>>
> Hmm, pg_filenode.map  is a nice idea. So, simply maintain TableSpaceOId -> 
> smgr id mapping in a separate file and update the whole file on any changes, 
> right?
> Looks reasonable to me, but it is clear that this solution can be really slow
> in some patterns, like if we create many, many tablespaces (the way you
> suggested it in the per-relation SMGR feature). Maybe we can store data in
> files somehow separately, and only update one chunk per operation.

Yes, but that's a later issue... I'm not sure many, many tablespaces is
actually a good thing. There are already very few reasons to store
tables in more than just the default tablespace. For temporary
relations, there is indeed a GUC to automatically put them into one
tablespace; and I can see a similar thing being useful for unlogged
relations, too. Then there is high-performance local disks vs.
lower-performance (but cheaper) local disks as something
reasonable. But that only gets us to ~6 tablespaces, assuming separate
tablespaces for each combination of (normal, temp, unlogged) * (fast,
cheap). I'm not sure there are many other reasons to add tablespaces,
let alone making one for each table.

Note that you can select which tablespace a table is stored in, so I
see very little reason to actually do something about large numbers of
tablespaces being prohibitively expensive performance-wise.

Why do you want to have a whole new storage configuration for each of
your relations?

> Anyway, if we use a `pg_filenode.map` - like solution, we need to reuse its 
> code infrasture, right? For example, it seems that code that calculates 
> checksums can be reused.
> So, we need to refactor code here, define something like FileMap API maybe. 
> Or is it not really worth it? We can just write similar code twice.

I'm not sure about that. I really doubt we'll need things that are
that similar: right now, the tablespace->smgr mapping could be
considered to be implied by the symlinks in /pg_tblspc/. Non-MD
tablespaces could add a file .tblspc that details their
configuration, which would also fix the issue of spcoid->smgr mapping.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Extensible storage manager API - SMGR hook Redux

2023-12-04 Thread Matthias van de Meent
On Mon, 4 Dec 2023 at 17:51, Kirill Reshke  wrote:
>
> So, 0002 patch uses the `get_tablespace` function, which searches Catalog to 
> tablespace SMGR id. I wonder how `smgr_redo` would work with it?

That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.

> Is it possible to query the system catalog during crash recovery? As far as i 
> understand the answer is "no", correct me if I'm wrong.

Yes, you're correct, we can't access buffers like this during
recovery. That's going to need some more effort.

> Furthermore, why do we only allow tablespace to have its own SMGR 
> implementation, can we have per-relation SMGR? Maybe we can do it in a way 
> similar to custom RMGR (meaning, write SMGR OID into WAL and use it in crash 
> recovery etc.)?

AMs (and by extension, their RMGRs) that use Postgres' buffer pool
have control over how they want to layout their blocks and files, but
generally don't care about where those blocks and files are located,
as long as they _can_ be retrieved.

Tablespaces, however, describe 'drives' or 'storage pools' in which
the tables/relations are stored, which to me seems to be the more
logical place to configure the SMGR abstraction of how and where to
store the actual data, as SMGRs manage the low-level relation block IO
(= file accesses), and tablespaces manage where files are stored.

Note that nothing prevents you from using one tablespace (thus
different SMGR) per relation, apart from bloated catalogs and the
superuser permissions required for creating those tablespaces. It'd be
difficult to manage, but not impossible.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Proposal to use JSON for Postgres Parser format

2023-12-04 Thread Matthias van de Meent
On Mon, 31 Oct 2022 at 15:56, Michel Pelletier  wrote:
> On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent 
>  wrote:
>> On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov  
>> wrote:
>>> On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan  wrote:
>>>> On 2022-10-27 Th 19:38, Andres Freund wrote:
>>>>> Hi,
>>>>>
>>>>> On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
>>>>>> Maybe a compromise could be found whereby we provide a conversion 
>>>>>> function
>>>>>> that converts whatever the catalog storage format is to some JSON
>>>>>> equivalent.  That would address the needs of external code that doesn't 
>>>>>> want
>>>>>> to write a custom parser, while not tying us directly to JSON.
>>>>> +1
>>>>
>>>> Agreed.
>>>
>>> +1
>>>
>>> Michel, it seems that you now have a green light to implement node to
>>> json function.
>>
>> I think that Tom's proposal that we +1 is on a pg_node_tree to json
>> SQL function / cast; which is tangentially related to the "nodeToJson
>> / changing the storage format of pg_node_tree to json" proposal, but
>> not the same.
>
>
> I agree.
>
>>
>> I will add my +1 to Tom's proposal for that function/cast, but I'm not
>> sure on changing the storage format of pg_node_tree to json.
>
>
> I'm going to spike on this function and will get back to the thread with any 
> updates.

Michel, did you get a result from this spike?

I'm asking because I've spiked most of my ideas on updating the node
text format myself, and am working on wrapping it up into a patch (or
patchset) later this week. The ideas for this are:

1. Don't write fields with default values for their types, such as
NULL for Node* fields;
2. Reset location fields before transforming the node tree to text
when we don't have a copy of the original query, which removes
location fields from serialization with step 1;
3. Add default() node labels to struct fields that do not share the
field type's default, allowing more fields to be omitted with step 1;
4. Add a special default_ref() pg_node_attr for node fields that default
to another node field's value, used for Var's varnosyn/varattnosyn as
referring to varno/varattno; and
5. Truncate trailing 0s in Const's outDatum notation of by-ref types,
so that e.g. Consts with `name` data don't waste so much space on 0s
(see the illustrative sketch below).
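
To sketch the combined effect (field sets are from a pg16-era Var and
purely illustrative, not the final format): a Var that currently
serializes as

    {VAR :varno 1 :varattno 2 :vartype 25 :vartypmod -1 :varcollid 100
     :varlevelsup 0 :varnosyn 1 :varattnosyn 2 :location 7}

could come out as roughly

    {VAR :varno 1 :varattno 2 :vartype 25 :varcollid 100}

with :varlevelsup dropped by (1), :location by (2) plus (1), :vartypmod
by a default(-1) label per (3), and :varnosyn/:varattnosyn by (4).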

Currently, it reduces the pg_total_relation_size metric of pg_rewrite
after TOAST compression by 35% vs pg16, down to 483328 bytes / 59
pages, from 753664 bytes / 92 pages. The raw size of the ev_action
column's data (that is, before compression) is reduced by 55% to
1.18MB (from 2.80MB), and the largest default shipped row (the
information_schema.columns view) in that table is reduced to 'only'
78kB raw, from 193kB.

RW performance hasn't been tested yet, so that is still to be determined...

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Parallel CREATE INDEX for BRIN indexes

2023-12-04 Thread Matthias van de Meent
On Sun, 3 Dec 2023 at 17:46, Tomas Vondra  wrote:
> On 11/30/23 18:47, Matthias van de Meent wrote:
> > ...
> >
> > I just ran some more tests in less favorable environments, and it
> > looks like I hit a bug:
> >
> > % SET max_parallel_workers = 0;
> > % CREATE INDEX ... USING brin (...);
> > ERROR:  cannot update tuples during a parallel operation
> >
> > Fix attached in 0002.
>
> Yeah, that's a bug, thanks for the fix. Just jumping to a "cleanup"
> label seems a bit cleaner (if that can be said about using goto), so I
> tweaked the patch to do that instead.

Good point, I agree that's cleaner.

> > In 0003 I add the mentioned backfilling of empty ranges at the end of
> > the table. I added it for both normal and parallel index builds, as
> > normal builds apparently also didn't have this yet.
> >
>
> Right. I was thinking about doing that too, but you beat me to it. I
> don't want to bury this in the main patch adding parallel builds, it's
> not really related to parallel CREATE INDEX. And it'd be weird to have
> this for parallel builds first, so I rebased it as 0001.

OK.

> As for the backfilling, I think we need to simplify the code a bit.
>
> So 0004 simplifies this - the backfilling is done by a function called
> from all the places. The main complexity is in ensuring all three places
> have the same concept of how to specify the range (of ranges) to fill.

Good points, +1. However, the simplification in 0005 breaks that with
an underflow:

> @@ -1669,6 +1672,19 @@ initialize_brin_buildstate(Relation idxRel, BrinRevmap *revmap,
> state->bs_worker_id = 0;
> state->bs_spool = NULL;
>
> +/*
> + * Calculate the start of the last page range. Page numbers are 0-based,
> + * so to get the index of the last page we need to subtract one. Then the
> + * integer division gives us the proper 0-based range index.
> + */
> +state->bs_maxRangeStart = ((tablePages - 1) / pagesPerRange) * pagesPerRange;

When the table is empty, tablePages is 0 and (tablePages - 1) wraps
around (BlockNumber is unsigned), so this will try to fill all
potential ranges up to InvalidBlockNo's range, which is obviously
invalid. It also breaks the regression tests, as shown in CFBot.
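
A minimal sketch of one way to guard that (names as in the quoted
hunk):

    /* An empty table has no ranges to backfill */
    if (tablePages == 0)
        state->bs_maxRangeStart = 0;
    else
        state->bs_maxRangeStart = ((tablePages - 1) / pagesPerRange) *
            pagesPerRange;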

> skipping the last page range?
> -----------------------------
>
> I noticed you explicitly skipped backfilling empty tuple for the last
> page range. Can you explain? I suspect the idea was that the user
> activity would trigger building the tuple once that page range is
> filled, but we don't really know if the table receives any changes. It
> might easily be just a static table, in which case the last range would
> remain unsummarized. If this is the right thing to do, the serial build
> should do that too probably ...
>
> But I don't think that's the correct thing to do - I think CREATE INDEX
> is expected to always build a complete index, so my version always
> builds an index for all table pages.

Hmm. My idea here is to create an index that is closer to what you get
when you hit the insertion path with aminsert. This isn't 1:1 how the
index builds ranges during (re)index when there is data for that
range, but I thought it a close enough analog. Either way, I don't
mind adding an empty range for the last range if that's considered
useful.

> BlockNumber overflows
> ---------------------
>
> The one thing that I'm not quite sure is correct is whether this handles
> overflows/underflows correctly. I mean, imagine you have a huge table
> that's almost 0x blocks, pages_per_range is prime, and the last
> range ends less than pages_per_range from 0x. Then this
>
> blkno += pages_per_range;
>
> can overflow, and might start inserting index tuples again (so we'd end
> up with a duplicate).
>
> I do think the current patch does this correctly, but AFAICS this is a
> pre-existing issue ...

Yes, I know I've flagged this at least once before. IIRC, the response
back then was that it's a very unlikely issue, as you'd have to extend
the relation to at least the first block of the last range, which
would currently be InvalidBlockNo - 131072 + 1, or just shy of 32TB of
data at 8kB BLCKSZ. That's not exactly a common use case, and BRIN
range ID wraparound is likely the least of your worries at that point.

> Anyway, while working on this / stress-testing it, I realized there's a
> bug in how we allocate the emptyTuple. It's allocated lazily, but it can
> easily happen in the per-range context we introduced last week. It needs
> to be allocated in the context covering the whole index build.

Yeah, I hadn't tested with (very) sparse datasets yet.

> I think the best way to do that is per 0006, i.e. allocate it in the
> BrinBuildState, along with the appropriate memory context.

That fix looks fine to me.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Avoid detoast overhead when possible

2023-12-04 Thread Matthias van de Meent
On Mon, 4 Dec 2023 at 14:23,  wrote:
>
>
> Hi,
>
> Matthias van de Meent  writes:
>
> > On Mon, 4 Dec 2023 at 07:56,  wrote:
>
> > ..It would also add overhead when
> > we write results to disk, such as spilling merge sorts, hash join
> > spills, or CTE materializations.
> >
> > Could you find a way to reduce this memory and IO usage when the value
> > is not going to be used immediately? Using the toast pointer at such
> > points surely will be cheaper than storing the full value again and
> > again.
>
> I'm not sure I understand you correctly, I think the issue you raised
> here is covered by the below design (not implemented in the patch).
>
> "
> However this patch just throws away almost all the benefits of toast, so
> how can we draw a line between should vs should not do this code path?
> IMO, we should only run the 'eagerly detoast' when we know that we will
> have a FuncCall against the toast_col on **the current plan node**. I
> think this information can be get from Qual and TargetList. If so, we
> can set the slot->detoast_attrs accordingly.
> "
>
> Let's see an example of this:
>
> SELECT f(t1.toastable_col) FROM t1 join t2 using(c);
>
> Suppose it is using hash join and t1 should be hashed.  With the above
> design, we will NOT detoast toastable_col at the scan of t1 or hash t1
> since there is no one "funcall" access it in either SeqScan of t1 or
> hash (t1). But when we do the projection on the joinrel, the detoast
> would happen.

I assume that you detoast the column only once, and not in a separate
per-node context? This would indicate to me that a query like the
following would detoast toastable_col and never "retoast" it.

SELECT toastable_col FROM t1
WHERE f(t1.toastable_col)
ORDER BY nonindexed;

or the equivalent in current PG catalogs:

SELECT ev_class
FROM pg_rewrite
WHERE octet_length(ev_action) > 1
ORDER BY ev_class;

whose plan is

 Sort
   Sort Key: ev_class
   ->  Seq Scan on pg_rewrite
         Filter: (octet_length((ev_action)::text) > 1)

This would first apply the filter condition (because sort-then-filter
is generally more expensive than filter-then-sort) and thus permanently
detoast the column, so the detoasted value is what gets fed into the
sort, making the sort much more expensive than it would be without the
aggressive detoasting.

Or do I still misunderstand something here?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Avoid detoast overhead when possible

2023-12-04 Thread Matthias van de Meent
On Mon, 4 Dec 2023 at 07:56,  wrote:
> 'SELECT f1(toast_col) FROM t;' will apply this code path, but nothing
> gain and nothing lost.  Applying this code path only when the toast
> datum is accessed 1+ times needs some extra run-time effort. I don't
> implement this so far, I'd like to see if I miss some obvious points.
> Any feedback is welcome.

This does add some measurable memory overhead to query execution where
the produced derivative of the large toasted field is small (e.g. 1MB
toast value -> 2x BIGINT), and when the toasted value is deep in the
query tree (e.g. 3 nested loops deep). It would also add overhead when
we write results to disk, such as spilling merge sorts, hash join
spills, or CTE materializations.

Could you find a way to reduce this memory and IO usage when the value
is not going to be used immediately? Using the toast pointer at such
points surely will be cheaper than storing the full value again and
again.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Parallel CREATE INDEX for BRIN indexes

2023-11-30 Thread Matthias van de Meent
On Thu, 30 Nov 2023 at 01:10, Tomas Vondra
 wrote:
>
> On 11/29/23 23:59, Matthias van de Meent wrote:
>> On Wed, 29 Nov 2023 at 21:56, Tomas Vondra
>>  wrote:
>>>
>>> On 11/29/23 21:30, Matthias van de Meent wrote:
>>>> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
>>>>  wrote:
>>>>> I did try to measure how much it actually saves, but none of the tests I
>>>>> did actually found measurable improvement. So I'm tempted to just not
>>>>> include this part, and accept that we may deserialize some of the tuples
>>>>> unnecessarily.
>>>>>
>>>>> Did you actually observe measurable improvements in some cases?
>>>>
>>>> The improvements would mostly stem from brin indexes with multiple
>>>> (potentially compressed) by-ref types, as they go through more complex
>>>> and expensive code to deserialize, requiring separate palloc() and
>>>> memcpy() calls each.
>>>> For single-column and by-value types the improvements are expected to
>>>> be negligible, because there is no meaningful difference between
>>>> copying a single by-ref value and copying its container; the
>>>> additional work done for each tuple is marginal for those.
>>>>
>>>> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
>>>> (sha256((id+1)::text::bytea)::text),
>>>> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
>>>> measured a difference of 10x less time spent in the main loop of
>>>> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
>>>> ranges. It's not a lot, but worth at least something, I guess?
>>>>
>>>
>>> It is something, but I can't really convince myself it's worth the extra
>>> code complexity. It's a somewhat extreme example, and the parallelism
>>> certainly saves much more than this.
>>
>> True. For this, I usually keep in mind that the docs on multi-column
>> indexes still indicate to use 1 N-column brin index over N 1-column
>> brin indexes (assuming the same storage parameters), so multi-column
>> BRIN indexes should not be considered to be uncommon:
>>
>> "The only reason to have multiple BRIN indexes instead of one
>> multicolumn BRIN index on a single table is to have a different
>> pages_per_range storage parameter."
>>
>> Note that most of the time in my example index is spent in creating
>> the actual tuples due to the use of hashing for data generation; for
>> index or plain to-text formatting the improvement is much more
>> pronounced: If I use an 8-column index (id::text, id, ...), index
>> creation takes ~500ms with 4+ workers. Of this, deforming takes some
>> 20ms, though when skipping the deforming step (i.e.,with my patch) it
>> takes ~3.5ms. That's a 3% shaved off the build time when the index
>> shape is beneficial.
>>
>
> That's all true, and while 3.5% is not something to ignore, my POV is
> that the parallelism speeds this up from ~2000ms to ~500ms. Yes, it
> would be great to shave off the extra 1% (relative to the original
> duration). But I don't have a great idea how to do code that in a way
> that is readable, and I don't want to stall the patch indefinitely
> because of a comparatively small improvement.
>
> Therefore I propose we get the simpler code committed and leave this as
> a future improvement.

That's fine with me, it is one reason why I kept it as a separate patch file.

>>>> The attached patch fixes the issue that you called out.
>>>> It also further updates _brin_end_parallel: the final 'write empty
>>>> tuples' loop is never hit and is thus removed, because if there were
>>>> any tuples in the spool we'd have filled the empty ranges at the end
>>>> of the main loop, and if there were no tuples in the spool then the
>>>> memtuple would still be at its original initialized value of 0 thus
>>>> resulting in a constant false condition. I also updated some comments.
>>>>
>>>
>>> Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
>>> insert the empty ranges in the main loop, because we're already looking
>>> at the *next* summary.
>>
>> Yes, merging adds some significant complexity here. I don't think we
>> can easily get around that though...
>>
>>> But I think the idea was to insert empty ranges if there's a chunk of
>>> empty ranges at the end of the table, after the last tuple the index
>>> build reads. But I'm not sure that can actually happen ...

Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Matthias van de Meent
On Wed, 29 Nov 2023 at 21:56, Tomas Vondra
 wrote:
>
> On 11/29/23 21:30, Matthias van de Meent wrote:
>> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
>>  wrote:
>>> I did try to measure how much it actually saves, but none of the tests I
>>> did actually found measurable improvement. So I'm tempted to just not
>>> include this part, and accept that we may deserialize some of the tuples
>>> unnecessarily.
>>>
>>> Did you actually observe measurable improvements in some cases?
>>
>> The improvements would mostly stem from brin indexes with multiple
>> (potentially compressed) by-ref types, as they go through more complex
>> and expensive code to deserialize, requiring separate palloc() and
>> memcpy() calls each.
>> For single-column and by-value types the improvements are expected to
>> be negligible, because there is no meaningful difference between
>> copying a single by-ref value and copying its container; the
>> additional work done for each tuple is marginal for those.
>>
>> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
>> (sha256((id+1)::text::bytea)::text),
>> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
>> measured a difference of 10x less time spent in the main loop of
>> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
>> ranges. It's not a lot, but worth at least something, I guess?
>>
>
> It is something, but I can't really convince myself it's worth the extra
> code complexity. It's a somewhat extreme example, and the parallelism
> certainly saves much more than this.

True. For this, I usually keep in mind that the docs on multi-column
indexes still indicate to use 1 N-column brin index over N 1-column
brin indexes (assuming the same storage parameters), so multi-column
BRIN indexes should not be considered to be uncommon:

"The only reason to have multiple BRIN indexes instead of one
multicolumn BRIN index on a single table is to have a different
pages_per_range storage parameter."

Note that most of the time in my example index is spent in creating
the actual tuples due to the use of hashing for data generation; for
index or plain to-text formatting the improvement is much more
pronounced: If I use an 8-column index (id::text, id, ...), index
creation takes ~500ms with 4+ workers. Of this, deforming takes some
20ms, though when skipping the deforming step (i.e.,with my patch) it
takes ~3.5ms. That's a 3% shaved off the build time when the index
shape is beneficial.

> > The attached patch fixes the issue that you called out.
> > It also further updates _brin_end_parallel: the final 'write empty
> > tuples' loop is never hit and is thus removed, because if there were
> > any tuples in the spool we'd have filled the empty ranges at the end
> > of the main loop, and if there were no tuples in the spool then the
> > memtuple would still be at its original initialized value of 0 thus
> > resulting in a constant false condition. I also updated some comments.
> >
>
> Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
> insert the empty ranges in the main loop, because we're already looking
> at the *next* summary.

Yes, merging adds some significant complexity here. I don't think we
can easily get around that though...

> But I think the idea was to insert empty ranges if there's a chunk of
> empty ranges at the end of the table, after the last tuple the index
> build reads. But I'm not sure that can actually happen ...

This would be trivial to construct with partial indexes; e.g. WHERE
(my_pk IS NULL) would consist of exclusively empty ranges.
I don't see a lot of value in partial BRIN indexes, but I may be
overlooking something.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Matthias van de Meent
On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
 wrote:
>
> On 11/29/23 15:52, Tomas Vondra wrote:
> >> ...
> >>
> >> This also made me think a bit more about how we're working with the
> >> tuples. With your latest patch, we always deserialize and re-serialize
> >> the sorted brin tuples, just in case the next tuple will also be a
> >> BRIN tuple of the same page range. Could we save some of that
> >> deserialization time by optimistically expecting that we're not going
> >> to need to merge the tuple, and by only storing a local copy of it?
> >> See attached 0002; this saves some cycles in common cases.
> >>
> >
> > Good idea!
> >
>
> FWIW there's a bug, in this part of the optimization:
>
> --
> +if (memtuple == NULL)
> +memtuple = brin_deform_tuple(state->bs_bdesc, btup,
> + memtup_holder);
> +
>  union_tuples(state->bs_bdesc, memtuple, btup);
>  continue;
> --
>
> The deforming should use prevbtup, otherwise union_tuples() just combines
> two copies of the same tuple.

Good point. There were some more issues as well, fixes are attached.

> Which however brings me to the bigger issue with this - my stress test
> found this issue pretty quickly, but then I spent quite a bit of time
> trying to find what went wrong. I find this reworked code pretty hard to
> understand, and not necessarily because of how it's written. The problem
> is that the same loop tries to juggle multiple pieces of information with
> different lifespans, and so on. I find it really hard to reason about
> how it behaves ...

Yeah, it'd be nice if we had a peek option for sortsupport, that'd
improve context handling.

> I did try to measure how much it actually saves, but none of the tests I
> did actually found measurable improvement. So I'm tempted to just not
> include this part, and accept that we may deserialize some of the tuples
> unnecessarily.
>
> Did you actually observe measurable improvements in some cases?

The improvements would mostly stem from brin indexes with multiple
(potentially compressed) by-ref types, as they go through more complex
and expensive code to deserialize, requiring separate palloc() and
memcpy() calls each.
For single-column and by-value types the improvements are expected to
be negligible, because there is no meaningful difference between
copying a single by-ref value and copying its container; the
additional work done for each tuple is marginal for those.

For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
(sha256((id+1)::text::bytea)::text),
(sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
measured a difference of 10x less time spent in the main loop of
_brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
ranges. It's not a lot, but worth at least something, I guess?

The attached patch fixes the issue that you called out.
It also further updates _brin_end_parallel: the final 'write empty
tuples' loop is never hit and is thus removed, because if there were
any tuples in the spool we'd have filled the empty ranges at the end
of the main loop, and if there were no tuples in the spool then the
memtuple would still be at its original initialized value of 0 thus
resulting in a constant false condition. I also updated some comments.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v6-0003-NOCOMMIT-Instrumentation-for-time-spent-in-_brin_.patch
Description: Binary data


v6-0002-Reduce-de-forming-of-BRIN-tuples-in-parallel-BRIN.patch
Description: Binary data


v6-0001-Allow-BRIN-to-build-its-index-in-parallel.patch
Description: Binary data


Re: Parallel CREATE INDEX for BRIN indexes

2023-11-29 Thread Matthias van de Meent
On Tue, 28 Nov 2023 at 18:59, Tomas Vondra
 wrote:
>
> On 11/28/23 16:39, Matthias van de Meent wrote:
> > On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
> >  wrote:
> >> On 11/23/23 13:33, Matthias van de Meent wrote:
> >>> The union operator may leak (lots of) memory, so I think it makes
> >>> sense to keep a context around that can be reset after we've extracted
> >>> the merge result.
> >>>
> >>
> >> But does the current code actually achieve that? It does create a "brin
> >> union" context, but then it only does this:
> >>
> >> /* Use our own memory context to avoid retail pfree */
> >> cxt = AllocSetContextCreate(CurrentMemoryContext,
> >> "brin union",
> >> ALLOCSET_DEFAULT_SIZES);
> >> oldcxt = MemoryContextSwitchTo(cxt);
> >> db = brin_deform_tuple(bdesc, b, NULL);
> >> MemoryContextSwitchTo(oldcxt);
> >>
> >> Surely that does not limit the amount of memory used by the actual union
> >> functions in any way?
> >
> > Oh, yes, of course. For some reason I thought that covered the calls
> > to the union operator function too, but it indeed only covers
> > deserialization. I do think it is still worthwhile to not do the
> > create/delete cycle, but won't hold the patch back for that.
> >
>
> I think the union_tuples() changes are better left for a separate patch.
>
> >>>> However, I don't think the number of union_tuples calls is likely to be
> >>>> very high, especially for large tables. Because we split the table into
> >>>> 2048 chunks, and then cap the chunk size by 8192. For large tables
> >>>> (where this matters) we're likely close to 8192.
> >>>
> >>> I agree that the merging part of the index creation is the last part,
> >>> and usually has no high impact on the total performance of the reindex
> >>> operation, but in memory-constrained environments releasing and then
> >>> requesting the same chunk of memory over and over again just isn't
> >>> great.
> >>
> >> OK, I'll take a look at the scratch context you suggested.
> >>
> >> My point however was we won't actually do that very often, because on
> >> large tables the BRIN ranges are likely smaller than the parallel scan
> >> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
> >> ranges are large, there'll be few of them.
> >
> > That's true, so maybe I'm concerned about something that amounts to
> > only marginal gains.
> >
>
> However, after thinking about this a bit more, I think we actually do
> need to do something about the memory management when merging tuples.
> AFAIK the general assumption was that union_tuple() only runs for a
> single range, and then the whole context gets freed.

Correct, but it also is (or should be) assumed that union_tuple
will be called several times in the same context to handle repeated
concurrent updates. Presumably, that only happens rarely, but it's
something that should be kept in mind regardless.

> But the way the
> merging was implemented, it all runs in a single context. And while a
> single union_tuple() may not need a lot of memory, in total it may be
> annoying. I just added a palloc(1MB) into union_tuples and ended up with
> ~160MB allocated in the PortalContext on just a 2GB table. In practice the
> memory will grow more slowly, but not great :-/
>
> The attached 0003 patch adds a memory context that's reset after
> producing a merged BRIN tuple for each page range.

Looks good.

This also made me think a bit more about how we're working with the
tuples. With your latest patch, we always deserialize and re-serialize
the sorted brin tuples, just in case the next tuple will also be a
BRIN tuple of the same page range. Could we save some of that
deserialization time by optimistically expecting that we're not going
to need to merge the tuple, and by only storing a local copy of it?
See attached 0002; this saves some cycles in common cases.

The v20231128 version of the patchset (as squashed, attached v5-0001)
looks good to me.

Kind regards,

Matthias van de Meent
Neon (http://neon.tech)


v5-0002-Reduce-de-forming-of-BRIN-tuples-in-parallel-BRIN.patch
Description: Binary data


v5-0001-Allow-BRIN-to-build-its-index-in-parallel.patch
Description: Binary data


Re: Parallel CREATE INDEX for BRIN indexes

2023-11-28 Thread Matthias van de Meent
On Thu, 23 Nov 2023 at 14:35, Tomas Vondra
 wrote:
> On 11/23/23 13:33, Matthias van de Meent wrote:
>> The union operator may leak (lots of) memory, so I think it makes
>> sense to keep a context around that can be reset after we've extracted
>> the merge result.
>>
>
> But does the current code actually achieve that? It does create a "brin
> union" context, but then it only does this:
>
> /* Use our own memory context to avoid retail pfree */
> cxt = AllocSetContextCreate(CurrentMemoryContext,
> "brin union",
> ALLOCSET_DEFAULT_SIZES);
> oldcxt = MemoryContextSwitchTo(cxt);
> db = brin_deform_tuple(bdesc, b, NULL);
> MemoryContextSwitchTo(oldcxt);
>
> Surely that does not limit the amount of memory used by the actual union
> functions in any way?

Oh, yes, of course. For some reason I thought that covered the calls
to the union operator function too, but it indeed only covers
deserialization. I do think it is still worthwhile to not do the
create/delete cycle, but won't hold the patch back for that.

>>> However, I don't think the number of union_tuples calls is likely to be
>>> very high, especially for large tables. Because we split the table into
>>> 2048 chunks, and then cap the chunk size by 8192. For large tables
>>> (where this matters) we're likely close to 8192.
>>
>> I agree that the merging part of the index creation is the last part,
>> and usually has no high impact on the total performance of the reindex
>> operation, but in memory-constrained environments releasing and then
>> requesting the same chunk of memory over and over again just isn't
>> great.
>
> OK, I'll take a look at the scratch context you suggested.
>
> My point however was we won't actually do that very often, because on
> large tables the BRIN ranges are likely smaller than the parallel scan
> chunk size, so few overlaps. OTOH if the table is small, or if the BRIN
> ranges are large, there'll be few of them.

That's true, so maybe I'm concerned about something that amounts to
only marginal gains.

I noticed that the v4 patch doesn't yet update the documentation in
indexam.sgml with am->amcanbuildparallel.
Once that is included and reviewed I think this will be ready, unless
you want to address any of my comments upthread (that I marked with
'not in this patch') in this patch.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: POC, WIP: OR-clause support for indexes

2023-11-27 Thread Matthias van de Meent
On Mon, 27 Nov 2023, 23:16 Peter Geoghegan,  wrote:

> On Mon, Nov 27, 2023 at 1:04 PM Robert Haas  wrote:
> > The use of op_mergejoinable() seems pretty random to me. Why should we
> > care about that? If somebody writes a<1 or a<2 or a<3 or a<4, you can
> > transform that to a < ANY ('{1,2,3,4}'). I don't know whether that's a
> > good idea, but I think it's a legal transformation.
>
> That kind of transformation is likely to be a very good idea, because
> nbtree's _bt_preprocess_array_keys() function knows how to perform
> preprocessing that makes the final index qual "a < 1". Obviously that
> could be far more efficient.
>

a < 4, you mean? The example mentioned ANY, not ALL.

> Further suppose you have a machine generated query "a<1 or a<2 or a<3
> or a<4 AND a = 2" -- same as before, except that I added "AND a = 2"
> to the end. Now _bt_preprocess_array_keys() will be able to do the
> aforementioned inequality preprocessing, just as before. But this time
> _bt_preprocess_keys() (a different function with a similar name) can
> see that the quals are contradictory. That makes the entire index scan
> end, before it ever really began.
>

With the given WHERE-clause I would hope it did *not* return before
scanning the index, given that any row with a < 3 is valid for that
constraint with current rules of operator precedence.
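
To spell that out (hypothetical table t; AND binds tighter than OR):

SELECT * FROM t WHERE a < 1 OR a < 2 OR a < 3 OR (a < 4 AND a = 2);
-- the parenthesized form is what the quoted qual parses as, and it
-- simplifies to:
SELECT * FROM t WHERE a < 3;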

- Matthias


Re: Questions regarding Index AMs and natural ordering

2023-11-27 Thread Matthias van de Meent
On Fri, 24 Nov 2023, 19:58 Tom Lane,  wrote:
>
> Peter Geoghegan  writes:
> > On Fri, Nov 24, 2023 at 8:44 AM Matthias van de Meent
> >  wrote:
> >> Yes, the part where btree opclasses determine a type's ordering is
> >> clear. But what I'm looking for is "how do I, as an index AM
> >> implementation, get the signal that I need to return column-ordered
> >> data?" If that signal is "index AM marked amcanorder == index must
> >> return ordered data", then that's suboptimal for the index AM writer,
> >> but understandable. If amcanorder does not imply always ordered
> >> retrieval, then I'd like to know how it is signaled to the AM. But as
> >> of yet I've not found a clear and conclusive answer either way.
>
> > I suppose that amcanorder=true cannot mean that, since we have the
> > SAOP path key thing (at least for now).
>
> As things stand, amcanorder definitely means that the index always
> returns ordered data, since the planner will unconditionally apply
> pathkeys to the indexscan Paths generated for it (see plancat.c's
> get_relation_info which sets up info->sortopfamily, and
> build_index_pathkeys which uses that).  We could reconsider that
> definition if there were a reason to, but so far it hasn't seemed
> interesting.

For GIST there is now a case for improving the support for optionally
ordered retrieval, as there is a patch that tries to hack ORDER BY
support into GIST. Right now that patch applies (what I consider) a
hack by implicitly adding an operator ordering clause for each ORDER BY
column, using the column type's btree ordering operator; with improved
APIs we shouldn't need such a hacky approach.

> The hack in 807a40c5 is a hack, without a doubt, but
> that doesn't necessarily mean we should spend time on generalizing it,
> and even less that we should have done so in 2012.  Maybe by now there
> are non-core index AMs that have cases where it's worth being pickier.
> We'd have to invent some API that allows the index AM to have a say in
> what pathkeys get applied.

I think that would be quite useful, as it would allow indexes to
return ordered results in other orders than the defined key order, and
it would allow e.g. BRIN to run its sort for ordered retrieval inside
the index scan node (rather than requiring its own sort node type).

CC: Tomas, maybe you have some ideas about this as well? What was the
reason for moving BRIN-assisted sort into its own node? Was there more
to it than "BRIN currently doesn't have amgettuple, and amgettuple
can't always be used"?

Kind regards,

Matthias van de Meent




Re: Questions regarding Index AMs and natural ordering

2023-11-24 Thread Matthias van de Meent
On Thu, 23 Nov 2023 at 19:52, Peter Geoghegan  wrote:
>
> On Thu, Nov 23, 2023 at 9:16 AM Matthias van de Meent
>  wrote:
> > For example, btree ignores any ordering scan keys that it is given in
> > btrescan, which seems fine for btree because the ordering of a btree
> > is static (and no other order than that is expected apart from its
> > reverse order), but this becomes problematic for other indexes that
> > could return ordered data but would prefer not to have to go through
> > the motions of making sure the return order is 100% correct, rather
> > than a k-sorted sequence, or just the matches to the quals (like
> > GIST). Is returning index scan results in (reverse) natural order not
> > optional but required with amcanorder? If it is required, why is the
> > am indicator called 'canorder' instead of 'willorder', 'doesorder' or
> > 'isordered'?
>
> I don't know. I have a hard time imagining an index AM that is
> amcanorder=true that isn't either nbtree, or something very similar
> (so similar that it seems unlikely that anybody would actually go to
> the trouble of implementing it from scratch).

Well, BRIN (with minmax opclasses) could return ordered results if it
needs to (see [0]; though that implements it as a distinct plan node).
Ordering the tuples correctly takes quite some effort, but is quite
likely to use less effort and/or scratch space than a table/bitmap
scan + sort, because we won't have to manage all tuples of the table
at the same time. However, it would be extremely expensive if the
planner expects this to always return the data in btree order.

For GIST with the btree_gist opclasses it is even easier to return
ordered results (patch over at [1]), but then still it prefers not to
have to make a strict ordering as it adds overhead vs 'normal' index
scans.

Also, was that a confirmation that amcanorder is a requirement for the
AM to return data in index order (unless amrescan's orderbys is not
null), or just a comment on the reason for the name of 'amcanorder'
being unclear?

> You didn't mention support for merge joins. That's one of the defining
> characteristics of an amcanorder=true index AM, since an
> "ammarkpos/amrestrpos function need only be provided if the access
> method supports ordered scans". It's hard to imagine how that could
> work with a loosely ordered index. It seems to imply that the scan
> must always work with a simple linear order.

I probably didn't think of merge join support because 'merge join' is
not mentioned as such in the index AM API - I knew of
ammarkpos/amrestrpos, but hadn't yet gone into detail about what
they're used for.

> Cases where the planner uses a merge join often involve an index path
> with an "interesting sort order" that "enables" the merge join.
> Perhaps most of the alternative plans (that were almost as cheap as
> the merge join plan) would have had to scan the same index in the same
> way anyway, so it ends up making sense to use a merge join. The merge
> join can get ordered results from the index "at no extra cost". (All
> of this is implicit, of course -- the actual reason why the planner
> chose the merge join plan is because it worked out to be the cheapest
> plan.)

Couldn't the merge join (or scan node) use a tuple store to return to
some earlier point in the index scan when a native version of markpos
is not easily supported? It would add (potentially very significant)
IO overhead, but it would also allow merge joins on ordered paths that
currently don't have a natural way of marking their position.

> > Alternatively, if an am should be using the order scan keys from
> > .amrescan and natural order scans also get scan keys, is there some
> > place where I can find the selection process for ordered index AMs, and
> > how this ordering should be interpreted? There are no good examples
> > available
> > in core code because btree is special-cased, and there are no other
> > in-tree AMs that have facilities where both `amcanorderbyop` and
> > `amcanorder` are set.
>
> The general notion of a data type's sort order comes from its default
> btree operator class, so the whole idea of a generic sort order is
> deeply tied to the nbtree AM. That's why we sometimes have btree
> operator classes for types that you'd never actually want to index (at
> least not using a btree index).

Yes, the part where btree opclasses determine a type's ordering is
clear. But what I'm looking for is "how do I, as an index AM
implementation, get the signal that I need to return column-ordered
data?" If that signal is "index AM marked amcanorder == index must
return ordered data", then that's suboptimal for the index AM writer,
but understandable. If amcanorder does not imply always ordered
retrieval, then I'd like to know how it is signaled to the AM. But as
of yet I've not found a clear and conclusive answer either way.

Re: Table AM Interface Enhancements

2023-11-23 Thread Matthias van de Meent
> these enhancements will significantly
> improve the flexibility and capabilities of the PostgreSQL Table AM
> interface.

I've noticed there is not a lot of rationale for several of the
changes as to why PostgreSQL needs these changes implemented like
this, amongst which the index-related tableAM changes.

I understand that index-organized tables can be quite useful, but I
don't see design solutions to the more complex questions that would
still be required before we could host table AMs like OrioleDB's as
first-party citizens: for index-organized tables, you also need
index AM APIs that support TIDs with more than 48 bits of data
(assuming we actually want primary keys with >48 bits of usable
space), and for undo-based logging you would probably need index APIs
for retail index tuple deletion. Neither is supplied here, nor is
described why these APIs were omitted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Questions regarding Index AMs and natural ordering

2023-11-23 Thread Matthias van de Meent
Hi,

Over in [0] and [1] there are patches that touch on the topic of
'natural ordering' index retrieval, and [2] also touches on the topic.
For those patches, I've been looking at how the planner and executor
indicate to index AMs that they expects the output to be ordered, and
how this ordering should work.
I've mostly found out how it works for index_key opr constant quals, but I've
yet to find a good mental model for how the planner handles indexes
that can expose the 'intrinsic order' of data, i.e. indexes with
`amcanorder=true`, because there is very little (if any) real
documentation on what is expected from indexes when it advertises
certain features, and how the executor signals to the AM that it wants
to make use of those features.

For example, btree ignores any ordering scan keys that it is given in
btrescan, which seems fine for btree because the ordering of a btree
is static (and no other order than that is expected apart from its
reverse order), but this becomes problematic for other indexes that
could return ordered data but would prefer not to have to go through
the motions of making sure the return order is 100% correct, rather
than a k-sorted sequence, or just the matches to the quals (like
GIST). Is returning index scan results in (reverse) natural order not
optional but required with amcanorder? If it is required, why is the
am indicator called 'canorder' instead of 'willorder', 'doesorder' or
'isordered'?

Alternatively, if an am should be using the order scan keys from
.amrescan and natural order scans also get scan keys, is there some
place where I can find the selection process for ordered index AMs, and
how this ordering should be interpreted? There are no good examples
available
in core code because btree is special-cased, and there are no other
in-tree AMs that have facilities where both `amcanorderbyop` and
`amcanorder` are set.

I did read through indexam.sgml, but that does not give a clear answer
on this distinction of 'amcanorder' having required ordered results or
not, nor on how to interpret amrescan's orderbys argument. I also
looked at planner code where it interacts with amcanorder /
amcanorderbyop, but I couldn't wrap my head around its interactions
with indexes, either (more specifically, the ordering part of those
indexes) due to the complexity of the planner and the many layers that
the various concepts are passed through. The README in
backend/optimizer didn't answer this question for me, either.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/flat/EB2AF704-70FC-4D73-A97A-A7884A0381B5%40kleczek.org
[1] 
https://www.postgresql.org/message-id/flat/CAH2-Wz%3DksvN_sjcnD1%2BBt-WtifRA5ok48aDYnq3pkKhxgMQpcw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/e70fa091-e338-1598-9de4-6d0ef6b693e2%40enterprisedb.com




Re: Parallel CREATE INDEX for BRIN indexes

2023-11-23 Thread Matthias van de Meent
Hi,

On Wed, 22 Nov 2023 at 20:16, Tomas Vondra
 wrote:
>
> On 11/20/23 20:48, Matthias van de Meent wrote:
>> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra  
>> wrote:
>>>
>>> Hi,
>>>
>>> here's an updated patch, addressing the review comments, and reworking
>>> how the work is divided between the workers & leader etc.
>>>
>>
>> After code-only review, here are some comments:
>>
>>> +++ b/src/backend/access/brin/brin.c
>>> [...]
>>> +/* Magic numbers for parallel state sharing */
> >>> +#define PARALLEL_KEY_BRIN_SHARED    UINT64CONST(0xA000000000000001)
> >>> +#define PARALLEL_KEY_TUPLESORT      UINT64CONST(0xA000000000000002)
>>
>> These shm keys use the same constants also in use in
>> access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
>> operations, I'd prefer if we didn't actively introduce conflicting
>> identifiers when we still have significant amounts of unused values
>> remaining.
>>
>
> Hmmm. Is there some rule of thumb how to pick these key values?

None that I know of.
There is a warning in various places that define these constants that
they take care not to conflict with plan nodes' node_id: parallel plan
execution uses plain plan node IDs as keys, and as node_id is
int-sized, anyone manually picking a key value < 2^32 should make sure
it can't collide with a plan node ID executed in a parallel backend.
But apart from that one case, I can't find a convention, no.

> >>> +#define PARALLEL_KEY_QUERY_TEXT     UINT64CONST(0xA000000000000003)
>>
> >> This is the fourth definition of a PARALLEL*_KEY_QUERY_TEXT, the
> >> others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
> >> more than brin's), backend/executor/execParallel.c
> >> (0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
>> I've not checked that their uses are exactly the same, I'd expect at
>> least btree to match mostly, if not fully, 1:1).
>> I think we could probably benefit from a less ad-hoc sharing of query
>> texts. I don't think that needs to happen specifically in this patch,
>> but I think it's something to keep in mind in future efforts.
>>
>
> I'm afraid I don't quite get what you mean by "ad hoc sharing of query
> texts". Are you saying we shouldn't propagate the query text to the
> parallel workers? Why? Or what's the proper solution?

What I mean is that we have several different keys that all look like
they contain the debug query string, and always for the same debugging
purposes. For debugging, I think it'd be useful to use one well-known
key, rather than N well-known keys in each of the N parallel
subsystems.

But as mentioned, it doesn't need to happen in this patch, as that'd
increase scope beyond brin/index ams.

>>> +state->bs_numtuples = brinshared->indtuples;
>>
>> My IDE complains about bs_numtuples being an integer. This is a
>> pre-existing issue, but still valid: we can hit an overflow on tables
>> with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
>> but problematic nonetheless.
>>
>
> True. I think I've been hesitant to make this a double because it seems
> a bit weird to do +1 with a double, and at some point (d == d+1). But
> this seems safe, we're guaranteed to be far away from that threshold.

Yes, ignoring practical constraints like page space, we "only" have
bitspace for 2^48 tuples in each (non-partitioned) relation, so
double's 53 significant bits should be more than enough to count
tuples.

>> I also noticed that this is likely to execute `union_tuples` many
>> times when pages_per_range is coprime with the parallel table scan's
>> block stride (or when we for other reasons have many tuples returned
>> for each range); and this `union_tuples` internally allocates and
>> deletes its own memory context for its deserialization of the 'b'
>> tuple. I think we should just pass a scratch context instead, so that
>> we don't have the overhead of continously creating then deleting the
>> same memory context
>
> Perhaps. Looking at the code, isn't it a bit strange how union_tuples
> uses the context? It creates the context, calls brin_deform_tuple in
> that context, but then the rest of the function (including datumCopy and
> similar stuff) happens in the caller's context ...

The union operator may leak (lots of) memory, so I think it makes
sense to keep a context around that can be reset after we've extracted
the merge result.

> However, I don't think the number of union_tuples calls is likely to be
> very high, especially for large tables. Because we split the table into
> 2048 chunks, and then cap the chunk size by 8192. For large tables
> (where this matters) we're likely close to 8192.

Re: Parallel CREATE INDEX for BRIN indexes

2023-11-20 Thread Matthias van de Meent
heap does, and should thus be
compatible with BRIN. Thus, "heap" is not a useful name here.

There are 2 new mentions of "tuplestore" in the patch, while the
structure used is a tuplesort: one in form_and_spill_tuple, and one in
brinbuildCallbackParallel. Please update those comments.

That's it for code review. I'll do some performance comparisons and
testing soon, too.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Inquiry on Generating Bitmaps from Filter Conditions in Index Scans

2023-11-20 Thread Matthias van de Meent
On Mon, 20 Nov 2023 at 09:30, Jinjing Zhou  wrote:
>
> Hi hackers,
>
> I hope this message finds you well. I am reaching out to seek guidance on a 
> specific aspect of PostgreSQL's index scanning functionality.
>
> I am currently working on a vector search extension for postgres, where I 
> need to generate bitmaps based on filter conditions during an index scan. The 
> goal is to optimize the query performance by efficiently identifying the rows 
> that meet the given criteria.
>
> The query plan looks like this
>
> Index Scan using products_feature_idx on products  (cost=0.00..27.24 rows=495 
> width=12)
>  Order By: (feature <-> '[0.5, 0.5, 0.5]'::vector)
>  Filter: ((price > '0.2'::double precision) AND (price <= 
> '0.7'::double precision))
>
>
> We have a custom index for the order by clause on the feature column. Now we 
> want to utilize the index on other columns like price column. We want to 
> access the bitmap of price column's filter condition in the feature column 
> index. Is there any way I can achieve this goal?

If you mean "I'd like to use bitmaps generated by combining filter
results from index A, B, and C for (pre-)filtering the ordered index
lookups in index D",
then there is no current infrastructure to do this. Bitmap scans
currently generate a data structure that is not indexable, and can
thus not be used efficiently to push one index's generated bitmap down
into another index's scan.

There are efforts to improve the data structures we use for storing
TIDs during vacuum [0] which could extend to the TID bitmap structure,
but even then we'd need some significant effort to rewire Postgres'
internals to push down the bitmap filters; and that is even under the
assumption that pushing the bitmap down into the index AM is more
efficient than doing the merges above the index AM and then re-sorting
the data.

So, in short, it's not currently available in community PostgreSQL.
You could probably create a planner hook + custom executor node that
does this, but it won't be able to use many of the features available
inside PostgreSQL.
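
For reference, a minimal sketch of just the hook part (hypothetical
extension code; the plan-tree rewrite and the custom scan node are the
hard part and are omitted here):

#include "postgres.h"
#include "fmgr.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

void _PG_init(void);

static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
bitmap_pushdown_planner(Query *parse, const char *query_string,
                        int cursorOptions, ParamListInfo boundParams)
{
    PlannedStmt *stmt;

    if (prev_planner_hook)
        stmt = prev_planner_hook(parse, query_string,
                                 cursorOptions, boundParams);
    else
        stmt = standard_planner(parse, query_string,
                                cursorOptions, boundParams);

    /*
     * A real implementation would walk stmt->planTree here and replace
     * the relevant index scan with a CustomScan that builds bitmaps
     * from the filter indexes and consults them during the ordered scan.
     */
    return stmt;
}

void
_PG_init(void)
{
    prev_planner_hook = planner_hook;
    planner_hook = bitmap_pushdown_planner;
}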

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/flat/CANWCAZbrZ58-w1W_3pg-0tOfbx8K41_n_03_0ndGV78hJWswBA%2540mail.gmail.com




Re: RFC: Pluggable TOAST

2023-11-15 Thread Matthias van de Meent
On Tue, 14 Nov 2023, 14:12 Nikita Malakhov,  wrote:
>
> Hi!
>
> Matthias, regarding your message above, I have a question to ask.
> On typed TOAST implementations - we thought that TOAST method used
> for storing data could depend not only on data type, but on the flow or 
> workload,
> like our bytea appendable toaster which is much (hundreds of times) faster on
> update compared to regular procedure. That was one of ideas behind the
> Pluggable TOAST - we can choose the most suitable TOAST implementation
> available.
>
> If we have a single TOAST entry point for data type - then we should have
> some means to control it or choose a TOAST method suitable to our needs.
> Or should not?

I'm not sure my interpretation of the question is correct, but I'll
assume it's "would you want something like STORAGE
[plain/external/...] for controlling type-specific toast operations?".

I don't see many reasons why we'd need a system to disable (some of)
those features, with the only one being "the workload is mostly
read-only of the full attributes, so any performance overhead of
type-aware detoasting is not worth the temporary space savings during
updates". So, while I do think there would be good reasons for typed
toasting to be disabled, I don't see a good reason for only specific
parts of type-specific toasting to be disabled (no reason for 'disable
the append optimization for bytea, but not the splice optimization').

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Why do indexes and sorts use the database collation?

2023-11-14 Thread Matthias van de Meent
On Wed, 15 Nov 2023 at 00:28, Jeff Davis  wrote:
>
> On Tue, 2023-11-14 at 14:47 -0500, Tom Lane wrote:
> > Why should that ever be different from the column's own declared
> > collation?
>
> Because an index with the "C" collation is more efficient in terms of
> building/maintaining/searching the index, and it also doesn't carry
> risks of corrupting your PK index when you upgrade libc or other
> dependency headaches.

That doesn't really answer the question for me. Why would you have a
primary key that has different collation rules (which include equality
rules) than the columns that this primary key contains? It is not
unlikely that users are misinformed about the behaviour of the
collation they're creating, thus breaking any primary key or equality
lookup that uses indexes auto-converted from that collation to the "C"
collation.

If the collation on my primary key's columns changes from one that is
deterministic to one that isn't, then my primary key surely has to be
reindexed. If the collation of the underlying index was overwritten to
'C' for performance, then that's a problem, right, as we wouldn't have
the expectation that the index is based on the columns' actual
collation's properties?
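
To sketch the hazard (the collation definition is the case-insensitive
example from the docs; table and collation names are illustrative):

CREATE COLLATION ci (provider = icu, locale = 'und-u-ks-level2',
                     deterministic = false);
CREATE TABLE t (id text COLLATE ci PRIMARY KEY);

-- Under "ci" these two values are equal, so the second insert must
-- fail; a PK index silently built with the "C" collation instead
-- would happily accept both.
INSERT INTO t VALUES ('abc');
INSERT INTO t VALUES ('ABC');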

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Parallel CREATE INDEX for BRIN indexes

2023-11-12 Thread Matthias van de Meent
On Wed, 8 Nov 2023 at 12:03, Tomas Vondra  wrote:
>
> Hi,
>
> here's an updated patch, addressing the review comments, and reworking
> how the work is divided between the workers & leader etc.

Thanks!

> In general I'm quite happy with the current state, and I believe it's
> fairly close to be committable.

Are you planning on committing the patches separately, or squashed? I
won't have much time this week for reviewing the patch, and it seems
like these patches are incremental, so some guidance on what you want
to be reviewed would be appreciated.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-11 Thread Matthias van de Meent
On Wed, 8 Nov 2023 at 02:53, Peter Geoghegan  wrote:
>
> On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent
>  wrote:
> > On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan  wrote:
> > > I should be able to post v6 later this week. My current plan is to
> > > commit the other nbtree patch first (the backwards scan "boundary
> > > cases" one from the ongoing CF) -- since I saw your review earlier
> > > today. I think that you should probably wait for this v6 before
> > > starting your review.
> >
> > Okay, thanks for the update, then I'll wait for v6 to be posted.
>
> On second thought, I'll just post v6 now (there won't be conflicts
> against the master branch once the other patch is committed anyway).

Thanks. Here's my review of the btree-related code:

> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
>  * set flag to true if all required keys are satisfied and false
>  * otherwise.
>  */
> -(void) _bt_checkkeys(scan, itup, indnatts, dir,
> - &continuescan, false);
> +_bt_checkkeys(scan, &pstate, itup, false, false);
> +requiredMatchedByPrecheck = pstate.continuescan;
> +pstate.continuescan = true; /* reset */

The comment above the updated section needs to be updated.

> @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
>  * set flag to true if all required keys are satisfied and false
>  * otherwise.
>  */
> -(void) _bt_checkkeys(scan, itup, indnatts, dir,
> - &continuescan, false);
> +_bt_checkkeys(scan, &pstate, itup, false, false);

This 'false' finaltup argument is surely wrong for the rightmost
page's rightmost tuple, no?

> +++ b/src/backend/access/nbtree/nbtutils.c
> @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
> +/* We could pfree(elem_values) after, but not worth the cycles */
> +num_elems = _bt_merge_arrays(scan, cur,
> + (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0,
> + prev->elem_values, prev->num_elems,
> + elem_values, num_elems);

This code can get hit several times when there are multiple = ANY
clauses, which may result in repeated leakage of these arrays during
this scan. I think cleaning up may well be worth the cycles when the
total size of the arrays is large enough.

> @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
>_bt_compare_array_elements, );
> +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse,
> + Datum *elems_orig, int nelems_orig,
> + Datum *elems_next, int nelems_next)
> [...]
> +/*
> + * Incrementally copy the original array into a temp buffer, skipping 
> over
> + * any items that are missing from the "next" array
> + */

Given that we only keep the members that both arrays have in common,
the result array will be a strict subset of the original array. So, I
don't quite see why we need the temporary buffer here - we can reuse
the entries of the elems_orig array that we've already compared
against the elems_next array.

We may want to optimize this further by iterating over only the
smallest array: with the current code, [1, 2] + [1..1000] is faster
to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much
smaller than 1000*log(2). In practice this may matter very little,
though.
An even better optimized version would do a merge join on the two
arrays, rather than loop + binary search.
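
Roughly like this (a sketch only; plain ints and "<" stand in for the
Datum arrays and the btree comparator used here):

static int
merge_intersect(const int *a, int na, const int *b, int nb, int *out)
{
    int         i = 0,
                j = 0,
                nout = 0;

    /* both inputs sorted and deduplicated; single forward pass */
    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else
        {
            /* keep only elements present in both arrays */
            out[nout++] = a[i];
            i++;
            j++;
        }
    }
    return nout;                /* number of elements kept */
}

Since the result is always a subset of the first input, "out" may even
alias "a", which would also address the extra-allocation point above.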


> @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void 
> *b, void *arg)
> [...]
> +_bt_binsrch_array_skey(FmgrInfo *orderproc,

Is there a reason for this complex initialization of high/low_elem,
rather than this easier to understand and more compact
initialization?:

+ low_elem = 0;
+ high_elem = array->num_elems - 1;
+ if (cur_elem_start)
+ {
+ if (ScanDirectionIsForward(dir))
+ low_elem = array->cur_elem;
+ else
+ high_elem = array->cur_elem;
+ }


> @@ -661,20 +1008,691 @@ _bt_restore_array_keys(IndexScanDesc scan)
> [...]
> + _bt_array_keys_remain(IndexScanDesc scan, ScanDirection dir)
> [...]
> +if (scan->parallel_scan != NULL)
> +_bt_parallel_done(scan);
> +
> +/*
> + * No more primitive index scans.  Terminate the top-level scan.
> + */
> +return false;

I think the conditional _bt_parallel_done(scan) feels misplaced here,
as the comment immediately below indicates the sc

Re: Parallel aggregates in PG 16.1

2023-11-10 Thread Matthias van de Meent
On Fri, 10 Nov 2023 at 11:47, ZIMANYI Esteban  wrote:
>
> In MobilityDB
> https://github.com/MobilityDB/MobilityDB
> we have defined a tstzspan type which is a fixed-size equivalent of the 
> tstzrange type in PostgreSQL.
>
> We have a span_union aggregate function which is the equivalent of the 
> range_agg function in PostgreSQL defined as follows
>
> CREATE FUNCTION tstzspan_union_finalfn(internal)
>   RETURNS tstzspanset
>   AS 'MODULE_PATHNAME', 'Span_union_finalfn'
>   LANGUAGE C IMMUTABLE PARALLEL SAFE;
>
> CREATE AGGREGATE span_union(tstzspan) (
>   SFUNC = array_agg_transfn,
>   STYPE = internal,
>   COMBINEFUNC = array_agg_combine,
>   SERIALFUNC = array_agg_serialize,
>   DESERIALFUNC = array_agg_deserialize,
>   FINALFUNC = tstzspan_union_finalfn
> );
>
> As can be seen, we reuse the array_agg function to accumulate the values in 
> an array and the final function just does similar work as the 
> range_agg_finalfn to merge the overlapping spans.

Did you note the following section in the CREATE AGGREGATE documentation [0]?

"""
An aggregate can optionally support partial aggregation, as described
in Section 38.12.4.
This requires specifying the COMBINEFUNC parameter. If the
state_data_type is internal, it's usually also appropriate to provide
the SERIALFUNC and DESERIALFUNC parameters so that parallel
aggregation is possible.
Note that the aggregate must also be marked PARALLEL SAFE to enable
parallel aggregation.
"""

From this, it seems like the PARALLEL = SAFE argument is missing from
your aggregate definition as provided above.
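
Something along these lines should do the trick (an untested sketch,
identical to your definition apart from the added PARALLEL option):

CREATE AGGREGATE span_union(tstzspan) (
  SFUNC = array_agg_transfn,
  STYPE = internal,
  COMBINEFUNC = array_agg_combine,
  SERIALFUNC = array_agg_serialize,
  DESERIALFUNC = array_agg_deserialize,
  FINALFUNC = tstzspan_union_finalfn,
  PARALLEL = SAFE
);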


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/docs/16/sql-createaggregate.html




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-09 Thread Matthias van de Meent
 ALL (array(table a));

This will never return any rows, but it does hit 9990 buffers in the
new btree code, while I expected that to be 0 buffers based on the
query and index (that is, I expected to hit 0 buffers, until I
realized that we don't push ALL into index filters). I shall assume
ALL isn't used all that often (heh), but it sure feels like we're
missing out on performance here.

2. We also don't seem to support array keys for row compares, which
probably is an even more niche use case:

SELECT count(*)
FROM tenk1
WHERE (thousand, tenthous) = ANY (ARRAY[(1, 1), (1, 2), (2, 1)]);

This is no different from master, too, but it'd be nice if there was
support for arrays of row operations, too, just so that composite
primary keys can also be looked up with SAOPs.


Kind regards,

Matthias van de Meent




Re: pg_walfile_name_offset can return inconsistent values

2023-11-09 Thread Matthias van de Meent
On Thu, 9 Nov 2023 at 20:22, Bruce Momjian  wrote:
> I know this bug report is four years old, but it is still a
> pg_walfile_name_offset() bug.  Here is the bug:
>
> SELECT *
> FROM (VALUES ('0/16ff'), ('0/1700'), ('0/1701')) AS 
> t(lsn),
>  LATERAL pg_walfile_name_offset(lsn::pg_lsn);
>
> lsn |file_name | file_offset
> +--+-
>  0/16ff | 00010016 |16777215
> -->  0/1700 | 00010016 |   0
>  0/1701 | 00010017 |   1
>
> The bug is in the indicated line --- it shows the filename as 00016 but
> offset as zero, when clearly the LSN is pointing to 17/0.  The bug is
> essentially that the code for pg_walfile_name_offset() uses the exact
> offset from the LSN, but uses the file name from the previous byte of
> the LSN.

Yes, that's definitely not a correct result.

> The fix involves deciding what the description or purpose of
> pg_walfile_name_offset() means, and adjusting it to be clearer.  The
> current documentation says:
>
> Converts a write-ahead log location to a WAL file name and byte
> offset within that file.
>
> Fix #1:  If we assume write-ahead log location means LSN, it is saying
> show the file/offset of the LSN, and that is most clearly:
>
> lsn |file_name | file_offset
> +--+-
>  0/16ff | 00010016 |16777215
>  0/1700 | 00010017 |   0
>  0/1701 | 00010017 |   1
>
> Fix #2:  Now, there are some who have said they want the output to be
> the last written WAL byte (the byte before the LSN), not the current
> LSN, for archiving purposes.  However, if we do that, we have to update
> the docs to clarify it.  Its output would be:
>
> lsn |file_name | file_offset
> +--+-
>  0/16ff | 00010016 |16777214
>  0/1700 | 00010016 |16777215
>  0/1701 | 00010017 |   0
>
> I have attached fix #1 as offset1.diff and fix #2 as offset2.diff.

I believe you got the references wrong; fix #1 looks like the output
of offset2's changes, and fix #2 looks like the result of offset1's
changes.

Either way, I think fix #1 is most correct (as was attached in
offset2.diff, and quoted verbatim here), because that has no chance of
having surprising underflowing behaviour when you use '0/0'::lsn as
input.

> diff --git a/src/backend/access/transam/xlogfuncs.c 
> b/src/backend/access/transam/xlogfuncs.c
> index 45a70668b1..e65502d51e 100644
> --- a/src/backend/access/transam/xlogfuncs.c
> +++ b/src/backend/access/transam/xlogfuncs.c
> @@ -414,7 +414,7 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
> /*
>  * xlogfilename
>  */
> -XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
> +XLByteToSeg(locationpoint, xlogsegno, wal_segment_size);
> XLogFileName(xlogfilename, GetWALInsertionTimeLine(), xlogsegno,
>  wal_segment_size);
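
For anyone following along, the practical difference is only in how
the segment number is rounded (a self-contained sketch; the real macro
definitions live in src/include/access/xlog_internal.h):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const uint64_t seg_size = 16 * 1024 * 1024; /* default 16MB */
    const uint64_t lsn = UINT64_C(0x17000000);  /* '0/17000000' */

    uint64_t    prevseg = (lsn - 1) / seg_size; /* ~XLByteToPrevSeg */
    uint64_t    curseg = lsn / seg_size;        /* ~XLByteToSeg */
    uint64_t    offset = lsn % seg_size;        /* ~XLogSegmentOffset */

    /*
     * Prints "prev=22 cur=23 off=0": pairing prevseg (file ...16) with
     * offset 0 mixes the previous file's name with the new file's
     * offset, which is exactly the reported inconsistency.  Deriving
     * the name with XLByteToSeg keeps the pair consistent, and cannot
     * underflow at '0/0'.
     */
    printf("prev=%" PRIu64 " cur=%" PRIu64 " off=%" PRIu64 "\n",
           prevseg, curseg, offset);
    return 0;
}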

Kind regards,

Matthias van de Meent




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-08 Thread Matthias van de Meent
On Wed, 8 Nov 2023 at 05:13, Andres Freund  wrote:
>
> Hi,
>
> On 2023-11-06 14:00:13 +0530, Bharath Rupireddy wrote:
> > Well, that's a total of ~17 LWLocks this new function takes to make
> > the stats reset atomic. I'm not sure if this atomicity is worth the
> > effort which can easily be misused - what if someone runs something
> > like SELECT pg_stat_reset_shared() FROM generate_series(1,
> > 10n); to cause heavy lock acquisition and release cycles?
>
> Yea, this seems like an *extremely* bad idea to me. Without careful analysis
> it could very well cause deadlocks.

I didn't realize that it'd take 17 LWLocks to reset those stats; I
thought it was one shared system using the same lock, or a very
limited set of locks. Acquiring 17 locks is quite likely not worth the
chance of having to wait for some stats lock or another and thus
generating 'bubbles' in other stats gathering pipelines.

> > IMV, atomicity is not something that applies for the stats reset
> > operation because stats are approximate numbers by nature after all.
> > If the pg_stat_reset_shared() resets stats for only a bunch of stats
> > types and fails, it's the basic application programming style that
> > when a query fails it's the application that needs to have a retry
> > mechanism. FWIW, the atomicity doesn't apply today if someone wants to
> > reset stats in a loop for all stats types.
>
> Yea. Additionally it's not really atomic regardless of the lwlocks, due to
> various processes all accumulating in local counters first, and only
> occasionally updating the shared data. So even after holding all the locks at
> the same time, the shared stats would still not actually represent a truly
> atomic state.

Good points that I hadn't thought much about yet. I agree that atomic
reset isn't worth implementing in this stats system - it's too costly
and complex to do it correctly.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: ALTER TABLE uses a bistate but not for toast tables

2023-11-07 Thread Matthias van de Meent
Hi Justin,

This patch went stale quite some time ago; CFbot does not seem to
have any history of a successful apply attempt, nor do we have any
successful build history (which was introduced some time ago already).

Are you planning on rebasing this patch?

Kind regards,

Matthias van de Meent




Re: Buffer Cache Problem

2023-11-07 Thread Matthias van de Meent
On Tue, 7 Nov 2023 at 14:28, jacktby jacktby  wrote:
>
> Hi, postgres hackers, I’m studying postgres buffer cache part. So I open this 
> thread to communicate some buffer cache codes design and try to improve some 
> tricky codes.
>
> For Buffer Cache, we know it’s a buffer array, every bucket of this array is 
> consist of a data page and its header which is used to describe the state of 
> the buffer.
>
> For field wait_backend_pgprocno, the comment is "backend of pin-count 
> waiter”, I have problems below:

Did you read the README at src/backend/storage/buffer/README, as well
as the comments and documentation in and around the buffer-locking
functions?

> 1. it means which processId is waiting this buffer, right?
> 2. and if wait_backend_pgprocno is valid, so it says this buffer is in use by 
> one process, right?
> 3. if one buffer is wait by another process, it means all buffers are out of 
> use, right? So let’s try this: we have 5 buffers with ids (1,2,3,4,5), and 
> they  are all in use, now another process  with processId 8017 is coming, and 
> it choose buffer id 1, so  buffer1’s wait_backend_pgprocno is 8017, but later
> buffer4 is released, can process 8017 change to get buffer4? how?

I believe these questions are generally answered by the README and the
comments in bufmgr.c/buf_internals.h for the functions that try to lock
buffers.

> 4. wait_backend_pgprocno is a “integer” type, not an array, why can one 
> buffer be wait by only one process?

Yes, that is correct. It seems like PostgreSQL has yet to find a
workload that requires more than one backend to wait for
super-exclusive access to a buffer at the same time.
VACUUM seems to be the only workload that currently can wait and sleep
for this exclusive buffer access, and that is already limited to one
process per relation, so there are no explicit concurrent
super-exclusive waits in the system right now.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: 2023-11-09 release announcement draft

2023-11-07 Thread Matthias van de Meent
On Mon, 6 Nov 2023 at 23:04, Jonathan S. Katz  wrote:
>
> Hi,
>
> Attached is the release announcement draft for the 2023-11-09 release
> (16.1 et al.).
>
> Please review for accuracy and notable omissions. Please have all
> feedback in by 2023-11-09 08:00 UTC at the latest (albeit the sooner the
> better).

> 20231109updaterelease.md
> [...]
> * Provide more efficient indexing of `date`, `timestamptz`, and `timestamp`
> values in BRIN indexes. While not required, we recommend
> [reindexing](https://www.postgresql.org/docs/current/sql-reindex.html) BRIN
> indexes that include these data types after installing this update.

As the types' minmax_multi opclasses are not marked as default, I
believe it makes sense to explicitly mention that only indexes that
use the types' minmax_multi opclasses would need to be reindexed for
them to see improved performance. The types' *_bloom and *_minmax
opclasses were not affected and therefore do not need to be reindexed.

Kind regards,

Matthias van de Meent.




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-07 Thread Matthias van de Meent
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan  wrote:
>
> On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent
>  wrote:
> > I'm planning on reviewing this patch tomorrow, but in an initial scan
> > through the patch I noticed there's little information about how the
> > array keys state machine works in this new design. Do you have a more
> > toplevel description of the full state machine used in the new design?
>
> This is an excellent question. You're entirely right: there isn't
> enough information about the design of the state machine.
>
> I should be able to post v6 later this week. My current plan is to
> commit the other nbtree patch first (the backwards scan "boundary
> cases" one from the ongoing CF) -- since I saw your review earlier
> today. I think that you should probably wait for this v6 before
> starting your review.

Okay, thanks for the update, then I'll wait for v6 to be posted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: RFC: Pluggable TOAST

2023-11-07 Thread Matthias van de Meent
of type-aware toasting
optimization. The toast storage relation growing too large is not
unique to jsonb- or bytea-typed columns, so I believe this is better
solved in a different thread. Ideas like 'toast relation per column'
also don't really solve the issue when the main table only has one
bigint and one jsonb column, so I think this needs a different
approach, too. I think solutions could probably best be discussed in a
separate thread.

Kind regards,

Matthias van de Meent.




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-06 Thread Matthias van de Meent
On Sat, 21 Oct 2023 at 00:40, Peter Geoghegan  wrote:
>
> On Sun, Oct 15, 2023 at 1:50 PM Peter Geoghegan  wrote:
> > Attached is v4, which applies cleanly on top of HEAD. This was needed
> > due to Alexandar Korotkov's commit e0b1ee17, "Skip checking of scan
> > keys required for directional scan in B-tree".
> >
> > Unfortunately I have more or less dealt with the conflicts on HEAD by
> > disabling the optimization from that commit, for the time being.
>
> Attached is v5, which deals with the conflict with the optimization
> added by Alexandar Korotkov's commit e0b1ee17 sensibly: the
> optimization is now only disabled in cases without array scan keys.
> (It'd be very hard to make it work with array scan keys, since an
> important principle for my patch is that we can change search-type
> scan keys right in the middle of any _bt_readpage() call).

I'm planning on reviewing this patch tomorrow, but in an initial scan
through the patch I noticed there's little information about how the
array keys state machine works in this new design. Do you have a more
toplevel description of the full state machine used in the new design?
If not, I'll probably be able to discover my own understanding of the
mechanism used in the patch, but if there is a framework to build that
understanding on (rather than having to build it from scratch) that'd
be greatly appreciated.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Add bump memory context type and use it for tuplesorts

2023-11-06 Thread Matthias van de Meent
On Tue, 11 Jul 2023 at 01:51, David Rowley  wrote:
>
> On Tue, 27 Jun 2023 at 21:19, David Rowley  wrote:
> > I've attached the bump allocator patch and also the script I used to
> > gather the performance results in the first 2 tabs in the attached
> > spreadsheet.
>
> I've attached a v2 patch which changes the BumpContext a little to
> remove some of the fields that are not really required.  There was no
> need for the "keeper" field as the keeper block always comes at the
> end of the BumpContext as these are allocated in a single malloc().
> The pointer to the "block" also isn't really needed. This is always
> the same as the head element in the blocks dlist.

Neat idea, +1.

I think it would make sense to split the "add a bump allocator"
changes from the "use the bump allocator in tuplesort" patches.

Tangent: Do we have specific notes on worst-case memory usage of
memory contexts with various allocation patterns? This new bump
allocator seems to be quite efficient, but in a worst-case allocation
pattern it can still waste about 1/3 of its allocated memory due to
never using free space on previous blocks after an allocation didn't
fit on that block.
It probably isn't going to be a huge problem in general, but this
seems like something that could be documented as a potential problem
when you're looking for which allocator to use and compare it with
other allocators that handle different allocation sizes more
gracefully.
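
To illustrate, bump allocation is essentially this (a toy model, not
the patch's actual API):

#include <stddef.h>

typedef struct BumpBlockSketch
{
    char       *freeptr;        /* next allocation starts here */
    char       *endptr;         /* first byte past the end of the block */
} BumpBlockSketch;

static void *
bump_alloc(BumpBlockSketch *block, size_t size)
{
    char       *ret = block->freeptr;

    /*
     * No chunk headers, no free list: allocating is just bumping a
     * pointer.  But when a request doesn't fit, the tail of the block
     * is abandoned for good; e.g. repeatedly allocating just over 2/3
     * of the block size strands roughly 1/3 of every block, which is
     * the worst case described above.
     */
    if (ret + size > block->endptr)
        return NULL;            /* caller must start a new block */
    block->freeptr = ret + size;
    return ret;
}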

> +++ b/src/backend/utils/mmgr/bump.c
> +BumpBlockIsEmpty(BumpBlock *block)
> +{
> +/* it's empty if the freeptr has not moved */
> +return (block->freeptr == (char *) block + Bump_BLOCKHDRSZ);
> [...]
> +static inline void
> +BumpBlockMarkEmpty(BumpBlock *block)
> +{
> +#if defined(USE_VALGRIND) || defined(CLOBBER_FREED_MEMORY)
> +char   *datastart = ((char *) block) + Bump_BLOCKHDRSZ;

These two use different definitions of the start pointer. Is that deliberate?

> +++ b/src/include/utils/tuplesort.h
> @@ -109,7 +109,8 @@ typedef struct TuplesortInstrumentation
>  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
>  * which is a separate palloc chunk --- we assume it is just one chunk and
>  * can be freed by a simple pfree() (except during merge, when we use a
> - * simple slab allocator).  SortTuples also contain the tuple's first key
> + * simple slab allocator and when performing a non-bounded sort where we
> + * use a bump allocator).  SortTuples also contain the tuple's first key

I'd go with something like the following:

+ * ...(except during merge *where* we use a
+ * simple slab allocator, and during a non-bounded sort where we
+ * use a bump allocator).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Optimizing "boundary cases" during backward scan B-Tree index descents

2023-11-06 Thread Matthias van de Meent
On Sun, 15 Oct 2023 at 22:56, Peter Geoghegan  wrote:
>
> On Mon, Sep 18, 2023 at 4:58 PM Peter Geoghegan  wrote:
> > Attached is v3, which is a straightforward rebase of v2. v3 is needed
> > to get the patch to apply cleanly against HEAD - so no real changes
> > here.
>
> Attached is v4. Just to keep CFTester happy.

> @@ -402,10 +405,27 @@ _bt_binsrch(Relation rel,
> +if (unlikely(key->backward))
> +return OffsetNumberPrev(low);
> +
> return low;

I wonder if this is (or can be) optimized to the mostly equivalent
"return low - (OffsetNumber) key->backward", as that would remove an
"unlikely" branch that isn't very unlikely during page deletion, even
if page deletion by itself is quite rare.
I'm not sure it's worth the additional cognitive overhead, or if there
are any significant performance implications for the hot path.

> @@ -318,9 +318,12 @@ _bt_moveright(Relation rel,
> [...]
>  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
> [...]
> + * key >= given scankey, or > scankey if nextkey is true for forward scans.
> + * _bt_binsrch() also "steps back" by one item/tuple on the leaf level in the
> + * case of backward scans.  (NOTE: this means it is possible to return a 
> value
> + * that's 1 greater than the number of keys on the leaf page.  It also means
> + * that we can return an item 1 less than the first non-pivot tuple on any
> + * leaf page.)

I think this can use a bit more wordsmithing: the use of "also" with
"steps back" implies we also step back in other cases, which aren't
mentioned. Could you update the wording to be more clear about this?

> @@ -767,7 +787,7 @@ _bt_compare(Relation rel,
> [...]
> - * Most searches have a scankey that is considered greater than a
> + * Forward scans have a scankey that is considered greater than a

Although it's not strictly an issue for this patch, the comment here
doesn't describe backward scans in as much detail as forward scans
here. The concepts are mostly "do the same but in reverse", but the
difference is noticable.

Apart from these comments, no further noteworthy comments. Looks good.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Moving forward with TDE [PATCH v3]

2023-11-06 Thread Matthias van de Meent
On Sat, 4 Nov 2023 at 03:38, Andres Freund  wrote:
>
> Hi,
>
> On 2023-11-02 22:09:40 +0100, Matthias van de Meent wrote:
> > I'm quite surprised at the significant number of changes being made
> > outside the core storage manager files. I thought that changing out
> > mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired)
> > would be the most obvious change to implement cluster-wide encryption
> > with the least code touched, as relations don't need to know whether
> > the files they're writing are encrypted, right? Is there a reason to
> > not implement this at the smgr level that I overlooked in the
> > documentation of these patches?
>
> You can't really implement encryption transparently inside an smgr without
> significant downsides. You need a way to store an initialization vector
> associated with the page (or you can store that elsewhere, but then you've
> doubled the worst cse amount of random reads/writes). The patch uses the LSN
> as the IV (which I doubt is a good idea). For authenticated encryption further
> additional storage space is required.

I am unaware of any user of the smgr API that doesn't also use the
buffer cache, and thus implicitly the Page layout with PageHeader
[^1]. The API of smgr is also tailored to page-sized quanta of data
with mostly relation-level information. I don't see why there would be
a veil covering the layout of Page for smgr when all other information
already points to the use of PageHeader and Page layouts. In my view,
it would even make sense to allow the smgr to get exclusive access to
some part of the page in the current Page layout.

Yes, I agree that there will be an impact on usable page size if you
want authenticated encryption, and that AMs will indeed need to
account for storage space now being used by the smgr - inconvenient,
but it serves a purpose. That would happen regardless of whether smgr
or some higher system decides where to store the data for encryption -
as long as it is on the page, the AM effectively can't use those
bytes.
But I'd say that's best solved by making the Page documentation and
PageInit API explicit about the potential use of that space by the
chosen storage method (encrypted, plain, ...) instead of requiring the
various AMs to manually consider encryption when using Postgres' APIs
for writing data to disk without hitting shared buffers; page space
management is already a task of AMs, but handling the actual
encryption is not.

Should the AM really care whether the data on disk is encrypted or
not? I don't think so. When the disk contains encrypted bytes, but
smgrread() and smgrwrite() both produce and accept plaintext data,
who's going to complain? Requiring AMs to be mindful about encryption
on all common paths only adds pitfalls where encryption would be
forgotten by the developer of AMs in one path or another.
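
To sketch the idea (hypothetical names, not the actual smgr API, and
the "cipher" below is a placeholder rather than real cryptography):

#include <stdint.h>

#define BLCKSZ 8192

typedef void (*md_write_fn) (uint32_t blkno, const char *page);

/* placeholder "cipher"; a real implementation would use AES etc. */
static void
encrypt_page_sketch(const char *plain, char *cipher, uint32_t blkno)
{
    for (int i = 0; i < BLCKSZ; i++)
        cipher[i] = plain[i] ^ (char) (blkno + i);
}

/*
 * An "encrypting smgr" write: encrypt into a scratch buffer, then
 * delegate to the wrapped md-level writer.  Code above the smgr only
 * ever sees plaintext pages, so AMs can't forget to encrypt.
 */
static void
enc_smgr_write(uint32_t blkno, const char *page, md_write_fn md_write)
{
    static char ciphertext[BLCKSZ];

    encrypt_page_sketch(page, ciphertext, blkno);
    md_write(blkno, ciphertext);
}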

> To be able to to use the LSN as the IV, the patch needs to ensure that the LSN
> increases in additional situations (a more aggressive version of
> wal_log_hint_bits) - which can't be done below smgr, where we don't know about
> what WAL logging was done. Nor can you easily just add space on the page below
> md.c, for the purpose of storing an LSN independent IV and the authentication
> data.

I think that getting PageInit to allocate the smgr-specific area would
take some effort, too (which would potentially require adding some
relational context to PageInit, so that it knows which page of which
relation it is going to initialize), but IMHO that would be more
natural than requiring all index and table AMs to be aware the actual
encryption of its pages and require manual handling of that encryption
when the page needs to be written to disk, when it otherwise already
conforms to the various buffer management and file extension APIs
currently in use in PostgreSQL. I would expect "transparent" data
encryption to be handled at the file write layer (i.e. smgr), not
inside the AMs.

Kind regards,

Matthias van de Meent

[^1] ReadBuffer_common uses PageIsVerifiedExtended which verifies that
a page conforms with Postgres' Page layout if checksums are enabled.
Furthermore, all builtin index AMs utilize pd_special, further
implying the use of a PageInit/PageHeader-based page layout.
Additionally, the heap tableAM also complies, and both FSM and VM also
use postgres' Page layout.
As for other AMs that I could check: bloom, rum, and pgvector's
ivfflat and hnsw all use page layouts.




Re: brininsert optimization opportunity

2023-11-03 Thread Matthias van de Meent
On Fri, 3 Nov 2023 at 19:37, Tomas Vondra  wrote:
>
> Hi,
>
> I took a look at this patch today. I had to rebase the patch, due to
> some minor bitrot related to 9f0602539d (but nothing major). I also did
> a couple tiny cosmetic tweaks, but other than that the patch seems OK.
> See the attached v6.
> [...]
> Barring objections, I'll try to push this early next week, after another
> round of cleanup.

No hard objections: The principle looks fine.

I do think we should choose a better namespace than bs_* for the
fields of BrinInsertState, as BrinBuildState already uses the bs_*
namespace for its fields in the same file, but that's only cosmetic.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: DRAFT GIST support for ORDER BY

2023-11-03 Thread Matthias van de Meent
On Mon, 30 Oct 2023 at 14:39, Michał Kłeczek  wrote:
>> On 30 Oct 2023, at 13:31, Matthias van de Meent 
>>  wrote:
>>
>>> The problem is though that right now handling of ORDER BY column clauses is 
>>> tightly coupled to BTree.
>>> It would be good to refactor the code so that semantics of ORDER BY column 
>>> could be more flexible.
>>
>> The existence of a BTREE operator class for the type is the indicator
>> that (and how) the type can be ordered - that is where PostgreSQL gets
>> its methods for ordering most types. Although I agree that it's a
>> quirk, I don't mind it that much as an indicator of how a type is
>> ordered.
>> I do agree, though, that operator classes by themselves should be able
>> to say "hey, we support full ordered retrieval as well". Right now,
>> that seems to be limited to btrees, but indeed a GiST index with
>> btree_gist columns should be able to support the same.
>
> Right now opfamily and strategy are set in PathKey before creating index scan 
> paths.
>
> The patch actually copies existing code from create_indexscan_plan
> that finds an operator OID for (pk_opfamily, pk_strategy).
> The operator is supposed to be binary with specific operand types.
>
> To create a path:
> 1) do the operator OID lookup as above
> 2) look for sortfamily of pg_amop entry for (operator did, index opfamily)
> If the sort family is the same as pk_opfamily we can create a path.
>
> The side effect is that it is possible to “ORDER BY column < ‘constant’” as 
> we have more ordering operators in pg_amop.
>
> Ideally we could look up _unary_ operator in pg_amop instead - that would 
> make sense we are actually measuring some “absolute distance”.
> But this would require more changes - createplan.c would need to decide when 
> to lookup unary and when - binary operator.

After researching this a bit more, I'm confused: If I register an opclass

CREATE OPERATOR CLASS gist_mytype_btree
DEFAULT FOR mytype USING gist
AS
OPERATOR 1 < (mytype, mytype) FOR ORDER BY mytype_ops, -- operator
<(mytype, mytype) returns bool
...
OPERATOR 15 <-> (mytype, mytype) FOR ORDER BY mytype_ops, --
operator <->(mytype, mytype) returns mytype
...

Then which order of values does the system expect the index to return
tuples in when either of these operators is applied?
Is that
  ORDER BY (index_column opr constant); but bool isn't the type
supported by the FOR ORDER BY opclass, or
  ORDER BY (index_column); but this makes no sense for distance operators.

After looking at get_relation_info() in optimizer/util/plancat.c, I
guess the difference is between amhandler->amcanorder and
amhandler->amcanorderbyop? But still it's not quite clear what the
implication for this is. Does it mean an index AM can either provide
natural ordering, or operator ordering, but not both?

>>> ORDER BY a == ORDER BY a <-> MIN_VALUE
>>> and
>>> ORDER BY a DESC == ORDER BY a <-> MAX_VALUE
>>>
>>> This allows implementing GIST ordered scans for btree_gist datatypes.
>>>
>>> This in turn makes using GIST with partitioning feasible (I have described 
>>> issues with such usage in my previous e-mails - see below).
>>
>> Did you take into account that GiST's internal distance function uses
>> floating point, and is thus only an approximation for values that
>> require more than 53 significant bits in their distance function?
>> For example, GiST wouldn't be guaranteed to yield correct ordering of
>> int8/bigint when you use `my_column <-> INT64_MAX` because as far as
>> the floating point math is concerned, 0 is about as far away from
>> INT64_MAX as (say) 20 and -21.
>
> Hmm… Good point but it means ORDER BY <-> is broken for these types then?
> The patch assumes it works correctly and just uses it for ordered scans.

Huh, I didn't know this before, but apparently values are pushed onto
a reorderqueue/pairingheap if the index scan is marked
xs_recheckorderby (i.e. when the tuple order is not exact), which
would be used in this case.

So it seems like this wouldn't be much of an issue for the patch,
apart from the potential issue where this could use the pairingheap
much more than the usual ordered scan operations, which could result
in larger-than-normal memory usage. E.g. float btree ops wouldn't work
effectively at all because every reasonable value is extremely distant
from its max value.

Kind regards,

Matthias van de Meent




Re: Popcount optimization using AVX512

2023-11-03 Thread Matthias van de Meent
On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D  wrote:
>
> This proposal showcases the speed-up provided to popcount feature when using 
> AVX512 registers. The intent is to share the preliminary results with the 
> community and get feedback for adding avx512 support for popcount.
>
> Revisiting the previous discussion/improvements around this feature, I have 
> created a micro-benchmark based on the pg_popcount() in PostgreSQL's current 
> implementations for x86_64 using the newer AVX512 intrinsics. Playing with 
> this implementation has improved performance up to 46% on Intel's Sapphire 
> Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.

How does this compare to older CPUs, and to more mixed workloads?
IIRC, the use of AVX512 (which I believe includes this popcount
instruction) has significant implications for core clock frequency
when those instructions are being executed, reducing overall
performance if they're not a large part of the workload.

> My setup:
>
> Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
> OS : Ubuntu 22.04
> GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl 
> -march=native -O2".
>
> 1. I copied the pg_popcount() implementation into a new C/C++ project using 
> cmake/make.
> a. Software only and
> b. SSE 64 bit version
> 2. I created an implementation using the following AVX512 intrinsics:
> a. _mm512_popcnt_epi64()
> b. _mm512_reduce_add_epi64()
> 3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; 
> repeatable with RNG seed [std::mt19937_64])

Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?
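
For reference, I assume the measured kernel looks something like the
following (my own sketch built from the intrinsics named above, not
your benchmark code; tail handling of the last <8 words omitted):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* build with e.g. gcc -O2 -mavx512vpopcntdq -mavx512vl */
static uint64_t
popcount_avx512(const uint64_t *words, size_t nwords)
{
    __m512i     accum = _mm512_setzero_si512();
    size_t      i;

    for (i = 0; i + 8 <= nwords; i += 8)
    {
        /* per-lane popcount of 8 x 64-bit words, accumulated */
        __m512i     v = _mm512_loadu_si512((const void *) (words + i));

        accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(v));
    }
    return (uint64_t) _mm512_reduce_add_epi64(accum);
}

If the <8KiB on-page cases dominate, the loop above runs at most ~128
iterations per call, which is where any AVX512 frequency-transition
cost would be hardest to amortize.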

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Detoasting optionally to make Explain-Analyze less misleading

2023-11-02 Thread Matthias van de Meent
On Thu, 2 Nov 2023 at 22:25, Tomas Vondra  wrote:
>
>
>
> On 11/2/23 21:02, Matthias van de Meent wrote:
> > On Thu, 2 Nov 2023 at 20:32, Tomas Vondra  
> > wrote:
> >> On 11/2/23 20:09, stepan rutz wrote:
> >>> db1=# explain (analyze, serialize) select * from test;
> >>> QUERY PLAN
> >>> ---
> >>>  Seq Scan on test  (cost=0.00..22.00 rows=1200 width=40) (actual
> >>> time=0.023..0.027 rows=1 loops=1)
> >>>  Planning Time: 0.077 ms
> >>>  Execution Time: 303.281 ms
> >>>  Serialized Bytes: 7953 Bytes. Mode Text. Bandwidth 248.068 MB/sec
> >> [...]
> >> BTW if you really want to print amount of memory, maybe print it in
> >> kilobytes, like every other place in explain.c?
> >
> > Isn't node width in bytes, or is it an opaque value not to be
> > interpreted by users? I've never really investigated that part of
> > Postgres' explain output...
> >
>
> Right, "width=" is always in bytes. But fields like amount of sorted
> data is in kB, and this seems closer to that.
>
> >> Also, explain generally
> >> prints stuff in "key: value" style (in text format).
> >
> > That'd be key: metrickey=metricvalue for expanded values like those in
> > plan nodes and the buffer usage, no?
> >
>
> Possibly. But the proposed output does neither. Also, it starts with
> "Serialized Bytes" but then prints info about bandwidth.
>
>
> >>>  Serialized Bytes: 7953 Bytes. Mode Text. Bandwidth 248.068 MB/sec
> >
> > I was thinking more along the lines of something like this:
> >
> > [...]
> > Execution Time: xxx ms
> > Serialization: time=yyy.yyy (in ms) size=yyy (in KiB, or B) mode=text
> > (or binary)
> > > This is significantly different from your output, as it doesn't hide
> > the measured time behind a lossy calculation of bandwidth, but gives
> > the measured data to the user; allowing them to derive their own
> > precise bandwidth if they're so inclined.
> >
>
> Might work. I'm still not convinced we need to include the mode, or that
> the size is that interesting/useful, though.

I'd say size is interesting for systems where network bandwidth is
constrained, but CPU isn't. We currently only show estimated widths &
accurate number of tuples returned, but that's not an accurate
explanation of why your 30-row 3GB resultset took 1h to transmit on a
10mbit line - that is only explained by the bandwidth of your
connection and the size of the dataset. As we can measure the size of
the returned serialized dataset here, I think it's in the interest of
any debugability to also present it to the user. Sadly, we don't have
good measures of bandwidth without sending that data across, so that's
the only metric that we can't show here, but total query data size is
definitely something that I'd be interested in here.
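
(As a back-of-the-envelope check: 3 GB is roughly 24 Gbit, and 24 Gbit
at 10 Mbit/s is ~2400 seconds, so result size and line speed alone
account for the bulk of that hour - nothing in the current EXPLAIN
output hints at that.)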

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Moving forward with TDE [PATCH v3]

2023-11-02 Thread Matthias van de Meent
On Tue, 31 Oct 2023 at 22:23, David Christensen
 wrote:
>
> Greetings,
>
> I am including an updated version of this patch series; it has been rebased 
> onto 6ec62b7799 and reworked somewhat.
>
> The patches are as follows:
>
> 0001 - doc updates
> 0002 - Basic key management and cipher support
> 0003 - Backend-related changes to support heap encryption

I'm quite surprised at the significant number of changes being made
outside the core storage manager files. I thought that changing out
mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired)
would be the most obvious change to implement cluster-wide encryption
with the least code touched, as relations don't need to know whether
the files they're writing are encrypted, right? Is there a reason to
not implement this at the smgr level that I overlooked in the
documentation of these patches?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-02 Thread Matthias van de Meent
On Thu, 2 Nov 2023 at 20:26, Bharath Rupireddy
 wrote:
>
> On Wed, Nov 1, 2023 at 4:24 AM Michael Paquier  wrote:
> >
> > On Tue, Oct 31, 2023 at 04:26:18PM +0900, torikoshia wrote:
> > > Yes, calling pg_stat_reset_shared() for all stats types can do what I 
> > > wanted
> > > to do.
> > > But calling it with 6 different parameters seems tiresome and I thought it
> > > would be convenient to have a parameter to delete all cluster-wide
> > > statistics at once.
> > >
> > > I may be wrong, but I imagine that it's more common to want to delete all 
> > > of
> > > the statistics for an entire cluster rather than just a portion of it.
> >
> > If more flexibility is wanted in this function, could it be an option
> > to consider a flavor like pg_stat_reset_shared(text[]), where it is
> > possible to specify a list of shared stats types to reset?  Perhaps
> > there are no real use cases for it, just wanted to mention it anyway
> > regarding the fact that it could have benefits to refactor this code
> > to use a bitwise operator for its internals with bit flags for each
> > type.
>
> I don't see a strong reason to introduce yet-another API when someone
> can just call things in a loop. I could recollect a recent analogy - a
> proposal to have a way to define multiple custom wait events with a
> single function call instead of callers defining in a loop didn't draw
> much interest.

Knowing that your metrics have a shared starting point can be quite
valuable, as it allows you to do some math that would otherwise be
much less accurate when working with stats over a short amount of
time. I've not used these stats systems much myself, but skew between
metrics caused by different reset points can be difficult to detect
and debug, so I think an atomic call to reset all these stats could be
worth implementing.
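
For scale: the non-atomic alternative is a single statement anyway,
something like this (stats kind names as accepted by
pg_stat_reset_shared() in PG16; adjust per version):

SELECT pg_stat_reset_shared(kind)
FROM unnest(ARRAY['archiver', 'bgwriter', 'checkpointer',
                  'io', 'recovery_prefetch', 'wal']) AS kind;

but each call resets its stats type at a slightly different instant,
which is exactly the skew I'm worried about above.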

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Detoasting optionally to make Explain-Analyze less misleading

2023-11-02 Thread Matthias van de Meent
On Thu, 2 Nov 2023 at 20:32, Tomas Vondra  wrote:
> On 11/2/23 20:09, stepan rutz wrote:
> > db1=# explain (analyze, serialize) select * from test;
> > QUERY PLAN
> > ---
> >  Seq Scan on test  (cost=0.00..22.00 rows=1200 width=40) (actual
> > time=0.023..0.027 rows=1 loops=1)
> >  Planning Time: 0.077 ms
> >  Execution Time: 303.281 ms
> >  Serialized Bytes: 7953 Bytes. Mode Text. Bandwidth 248.068 MB/sec
> [...]
> BTW if you really want to print amount of memory, maybe print it in
> kilobytes, like every other place in explain.c?

Isn't node width in bytes, or is it an opaque value not to be
interpreted by users? I've never really investigated that part of
Postgres' explain output...

> Also, explain generally
> prints stuff in "key: value" style (in text format).

That'd be key: metrickey=metricvalue for expanded values like those in
plan nodes and the buffer usage, no?

> >  Serialized Bytes: 7953 Bytes. Mode Text. Bandwidth 248.068 MB/sec

I was thinking more along the lines of something like this:

[...]
Execution Time: xxx ms
Serialization: time=yyy.yyy (in ms) size=yyy (in KiB, or B) mode=text
(or binary)

This is significantly different from your output, as it doesn't hide
the measured time behind a lossy calculation of bandwidth, but gives
the measured data to the user; allowing them to derive their own
precise bandwidth if they're so inclined.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2023-11-01 Thread Matthias van de Meent
On Wed, 1 Nov 2023 at 07:47, Pavel Stehule  wrote:
>
> Hi
>
> On Tue, 31 Oct 2023 at 22:12, Matthias van de Meent
>  wrote:
>> This patch was originally suggested at [0], but it was mentioned that
>> they could be pulled out into it's own thread. Earlier, the
>> performance gains were not clearly there for just this patch, but
>> after further benchmarking this patch stands on its own for
>> performance: it sees no obvious degradation of performance, while
>> gaining 0-5% across various normal indexes on the cc-complete sample
>> dataset, with the current worst-case index shape getting a 60%+
>> improved performance on INSERTs in the tests at [0].
>
>
> +1

Thanks for showing interest.

> This can be nice functionality. I had a customer with a very slow index scan 
> - the main problem was a long common prefix like prg010203.

I'll have to note that this patch doesn't cover cases where e.g. text
attributes have large shared prefixes, but are still unique: the
dynamic prefix compression in this patch is only implemented at the
tuple attribute level; it doesn't implement type aware dynamic prefix
compression inside the attributes. So, a unique index on a column of
int32 formatted like '%0100i' would not materially benefit from this
patch.

While it would certainly be possible to add some type-level prefix
truncation in the framework of this patch, adding that would require
significant code churn in btree compare operators, because we'd need
an additional return argument to contain a numerical "shared prefix",
and that is not something I was planning to implement in this patch.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




btree: downlink right separator/HIKEY optimization

2023-10-31 Thread Matthias van de Meent
(now really to -hackers)
Hi,

Over at [0] I'd implemented an optimization that allows us to skip
calling _bt_compare in _bt_moveright in many common cases. This patch,
when stacked on top of the prefix truncation patch, improves INSERT
performance by an additional 2-9%pt, with an extreme case of 45% in
the worst-case index tests at [0].

The optimization is that we now recognize that our page split algorithm
all but guarantees that the HIKEY matches this page's downlink's right
separator key bytewise, excluding the data stored in the
IndexTupleData struct.

By caching the right separator index tuple in _bt_search, we can
compare the downlink's right separator and the HIKEY, and when they
are equal (memcmp() == 0) we don't have to call _bt_compare - the
HIKEY is known to be larger than the scan key, because our key is
smaller than the right separator, and thus transitively also smaller
than the HIKEY because it contains the same data. As _bt_compare can
call expensive user-provided functions, this can be a large
performance boon, especially when there are only a small number of
column getting compared on each page (e.g. index tuples of many 100s
of bytes, or dynamic prefix truncation is enabled).

By adding this, the number of _bt_compare calls per _bt_search is
often reduced by one per btree level.
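
In pseudo-C, the fast path amounts to this (an illustrative sketch;
see the attached patch for the real implementation):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * "hikey"/"right_sep" are the key payloads with their IndexTupleData
 * headers already stripped (the headers may legitimately differ).
 * Returns whether the expensive _bt_compare() call is still needed.
 */
static bool
hikey_compare_needed(const char *hikey, size_t hikey_len,
                     const char *right_sep, size_t right_sep_len)
{
    if (right_sep == NULL)
        return true;    /* nothing cached (e.g. rightmost downlink) */
    if (hikey_len != right_sep_len ||
        memcmp(hikey, right_sep, hikey_len) != 0)
        return true;    /* concurrent split/deletion changed the HIKEY */

    /*
     * Byte-identical to the parent's right separator: reuse the parent
     * level's comparison result, i.e. scan key < HIKEY.
     */
    return false;
}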

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. Best served with dynamic prefix truncation [1] and btree specialization [0].

[0] 
https://www.postgresql.org/message-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8s...@mail.gmail.com
[1] 
https://www.postgresql.org/message-id/flat/CAEze2Wh-h20DmPSMXp4qHR0-ykh9=z3ejx8mssbikboqaye...@mail.gmail.com
From 55a2d06037f530b6d79bc73ed21bd27b78a1cc53 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Sun, 29 Oct 2023 21:39:23 +0100
Subject: [PATCH v1] btree: optimize _bt_moveright using downlink's right
 separator

Due to the inner workings of the btree, it is extremely likely that the
right separator of the btree page's downlink on the parent page matches
the downlinked page's HIKEY: Only when the page was split after we
accessed its parent page (and this split wasn't completed yet) will
the HIKEY not match.

So, instead of doing _bt_compare in _bt_moveright, we can store the
right separator key of the downlink, and check if it matches the HIKEY of
the linked page. If they match, we don't have to call the relatively
expensive _bt_compare, which allows us to reuse the _bt_compare result of
the right separator key.

This reduces the number of all-attribute _bt_compare operations
we need to do on a page by 1 flat, thus increasing performance
of indexes that have expensive compare operations.

In passing, move the declaration of _bt_moveright into nbtsearch.c - I
found no user of the function anywhere but in nbtsearch.c.
---
 src/backend/access/nbtree/README  | 20 +
 src/backend/access/nbtree/nbtsearch.c | 65 +--
 src/include/access/nbtree.h   |  3 --
 3 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..c75793da5a 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,26 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about the right separator/HIKEY moveright optimization
+
+
+Page splits consistently cause the HIKEY of the split page to be inserted
+into the parent page as the right separator of the page's downlink (or,
+as the split page's new right sibling's left link), with the only
+difference between the HIKEY and the separator being the contents of the
+IndexTupleData struct of these tuples: the payloads are bit-identical.
+This allows us to reuse the _bt_compare result of the right separator key
+for the downlinked page's HIKEY, if we can determine that those are indeed
+bit-identical: Concurrent page splits and deletions may have caused the
+downlinked page to get a different HIKEY, which could have a different
+_bt_compare result.  To make this work, in _bt_search we cache the
+current downlink's right separator key, to then in the _bt_moveright
+phase of the layer below use memcmp() to validate our assumptions about
+the HIKEY matching the downlink's right separator key.  Only if the
+assumption is proven wrong (memcmp(HIKEY, right_sep) != 0), we call
+_bt_compare(), otherwise we can be certain that the parent page's result
+is unchanged, i.e. that _bt_compare would return "<".
+
 Notes about deduplication
 -
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5b..602a0f45e1 100644
--- a/src/backend/access/nbtree/nbtsearc

btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2023-10-31 Thread Matthias van de Meent
Hi,

Currently, nbtree code compares each and every column of an index
tuple during the binary search on the index page. With large indexes
that have many duplicate prefix column values (e.g. an index on (bool,
bool, uuid)) that means a lot of wasted time getting to the right
column.

The attached patch improves on that by doing per-page dynamic prefix
truncation: If we know that on both the right and left side there are
index tuples where the first two attributes are equal to the scan key,
we skip comparing those attributes at the current index tuple and
start with comparing attribute 3, saving two attribute compares. We
gain performance whenever comparing prefixing attributes is expensive
and when there are many tuples with a shared prefix - in unique
indexes this doesn't gain much, but we also don't lose much in this
case.
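
Modeled on integer rows rather than index tuples, the idea looks like
this (an illustrative sketch, not the patch's code):

#include <stddef.h>

/* compare from column "skip" onward; report via *eqcols how many
 * leading columns matched */
static int
cmp_from(const int *key, const int *row, int ncols, int skip, int *eqcols)
{
    int         col = skip;

    while (col < ncols && key[col] == row[col])
        col++;
    *eqcols = col;
    if (col == ncols)
        return 0;
    return (key[col] < row[col]) ? -1 : 1;
}

/* binary search that skips columns known equal on both bounds */
static int
binsrch_dynprefix(const int *key, const int (*rows)[4], int nrows)
{
    int         low = 0,
                high = nrows,
                low_eq = 0,
                high_eq = 0;

    while (low < high)
    {
        int         mid = low + (high - low) / 2;

        /*
         * The page is sorted, so every tuple between the bounds shares
         * at least min(low_eq, high_eq) leading columns with the key;
         * start comparing at that column.
         */
        int         skip = (low_eq < high_eq) ? low_eq : high_eq;
        int         eqcols;
        int         cmp = cmp_from(key, rows[mid], 4, skip, &eqcols);

        if (cmp > 0)
        {
            low = mid + 1;
            low_eq = eqcols;
        }
        else
        {
            high = mid;
            high_eq = eqcols;
        }
    }
    return low;
}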

This patch was originally suggested at [0], but it was mentioned that
they could be pulled out into it's own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. Best served with the downlink right separator/HIKEY optimization
(separate patch to be submitted soon), and specialization over at [0].

[0] 
https://www.postgresql.org/message-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8s...@mail.gmail.com


v14-0001-btree-Implement-dynamic-prefix-compression.patch
Description: Binary data


Re: DRAFT GIST support for ORDER BY

2023-10-30 Thread Matthias van de Meent
On Mon, 30 Oct 2023 at 09:04, Michał Kłeczek  wrote:
>
> Hi All,
>
> Attached is a first attempt to implement GIST index (only) scans for ORDER BY 
> column clauses.

Cool!

> The solution is not ideal as it requires registering “<“ and “>” operators as 
> ordering operators in opfamily
> (which in turn makes it possible to issue somewhat meaningless “ORDER BY a < 
> ‘constant’)

I don't quite understand why we need to register new "<" and ">"
operators. Can't we update the current ones?

> The problem is though that right now handling of ORDER BY column clauses is 
> tightly coupled to BTree.
> It would be good to refactor the code so that semantics of ORDER BY column 
> could be more flexible.

The existence of a BTREE operator class for the type is the indicator
that (and how) the type can be ordered - that is where PostgreSQL gets
its methods for ordering most types. Although I agree that it's a
quirk, I don't mind it that much as an indicator of how a type is
ordered.
I do agree, though, that operator classes by themselves should be able
to say "hey, we support full ordered retrieval as well". Right now,
that seems to be limited to btrees, but indeed a GiST index with
btree_gist columns should be able to support the same.

> It would be great if someone could take a look at it.

I've not looked in detail at the patch, but here's some comments:

> --- a/contrib/btree_gist/btree_gist--1.6--1.7.sql
> +++ b/contrib/btree_gist/btree_gist--1.6--1.7.sql

You seem to be modifying an existing migration of a released version
of the btree_gist extension. I suggest you instead add a migration
from 1.7 to a new version 1.8, and update the control file's default
installed version.

> ORDER BY a == ORDER BY a <-> MIN_VALUE
> and
> ORDER BY a DESC == ORDER BY a <-> MAX_VALUE
>
> This allows implementing GIST ordered scans for btree_gist datatypes.
>
> This in turn makes using GIST with partitioning feasible (I have described 
> issues with such usage in my previous e-mails - see below).

Did you take into account that GiST's internal distance function uses
floating point, and is thus only an approximation for values that
require more than 53 significant bits in their distance function?
For example, GiST wouldn't be guaranteed to yield correct ordering of
int8/bigint when you use `my_column <-> INT64_MAX` because as far as
the floating point math is concerned, 0 is about as far away from
INT64_MAX as (say) 20 and -21.
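
A quick demonstration of that loss of precision (doubles carry 53
significand bits, so integers near 2^63 round to multiples of 2^10):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    double      d0 = (double) (INT64_MAX - 0);
    double      d20 = (double) (INT64_MAX - 20);

    /* prints "1": the two distances are indistinguishable */
    printf("%d\n", d0 == d20);
    return 0;
}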


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: RFC: Pluggable TOAST

2023-10-26 Thread Matthias van de Meent
On Thu, 26 Oct 2023 at 15:18, Aleksander Alekseev
 wrote:
>
> Hi,
>
> > And the goal of *THIS* topic is to gather a picture on how the community 
> > sees
> > improvements in TOAST mechanics if it doesn't want it the way we proposed
> > before, to understand which way to go with JSON advanced storage and other
> > enhancements we already have. Previous topic was not of any help here.
>
> Publish your code under an appropriate license first so that 1. anyone
> can test/benchmark it and 2. merge it to the PostgreSQL core if
> necessary.
>
> Or better consider participating in the [1] discussion where we
> reached a consensus on RFC and are working on improving TOAST for JSON
> and other types. We try to be mindful of use cases you named before
> like 64-bit TOAST pointers but we still could use your input.

I feel that the no. 2 proposal is significantly different from the
discussion over at [1] in that it concerns changes in the interface
between types and TOAST, as opposed to the no. 1 proposal's (and
[1]'s) changes, which stay mostly inside the current TOAST APIs and
abstractions.

The "Compression dictionaries for JSONB" thread that you linked went
the way of "store and use compression dictionaries for TOAST
compression algorithms", which is at a lower level than one of the
other ideas, which was to "allow JSONB to use a dictionary of common
values to dictionary-encode some of the contained entries". Naive
compression of the Datum's bytes makes the compressed datum
unparseable without decompression, even when dictionaries are used to
decrease the compressed size, while a type's own compression
dictionary substitutions could allow it to maintain it's structure and
would thus allow for a lower memory and storage footprint of the
column's datums during query processing.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: RFC: Pluggable TOAST

2023-10-26 Thread Matthias van de Meent
g the data
until it fits:

Things that it currently does: varlena values are compressed and
out-of-lined with generic compression algorithms and a naive
slice-and-dice algorithm, and reconstructed (fully, or just a prefix)
when needed.

Things that it could potentially do in the future: Interface with
types to allow the type to slice the tuple; use type-aware
compression (or encoding) algorithms to allow partial detoasting and
partial updates of a single value.

This would presumably be implemented using a set of new varattrib_1b_e
pointer subtypes whose contents are mostly managed by the type;
allowing for partial detoasting of the original datum, and allowing
for more efficient access to not just the prefix, but intermediate
spans as well: If compression spans .

So, the question would be: how do we expose such an API?

I suspect that each type will have only one meaningful specialized
method to toast its values. I don't see much value for registering
custom TOASTers when they only work with only the types that have code
to support explicitly that toaster. This was visible in the 'Pluggable
Toaster' patch that was provided earlier as well - both example
implementations of this pluggable toaster were specialized to the
needs of one type each, and the type had direct calls into those
"pluggable" toaster's internals, showing no good reason to extend this
support to elsewhere outside the type.

Because there would be only one meaningful type-aware method of
TOASTing a value, we could implement this as an optional type support
function that would allow the type to specify how it wants to TOAST
its values, with the default TOAST as backup in case of still
too-large tuples or if the type does not implement these support
functions. With this I'm thinking mostly towards "new inout functions
for on-disk representations; which return/consume TOASTed slices to
de/construct the original datum", and less "replacement of all of
toast's internals".

So, in short, I don't think there is a need for a specific "Pluggable
toast API" like the one in the patchset at [0] that can be loaded
on-demand, but I think that updating our current TOAST system to a
system for which types can provide support functions would likely be
quite beneficial, for efficient extraction of data from composite
values.

Example support functions:

/* TODO: bikeshedding on names, signatures, further support functions. */
Datum typsup_roastsliceofbread(Datum ptr, int sizetarget, char cmethod)
Datum typsup_unroastsliceofbread(Datum ptr)
void typsup_releaseroastedsliceofbread(Datum ptr) /* in case of
non-unitary in-memory datums */

We would probably want at least 2 more subtypes of varattrib_1b_e -
one for on-disk pointers, and one for in-memory pointers - where the
payload of those pointers is managed by the type's toast mechanism and
considered opaque to the rest of PostgreSQL (and thus not compatible
with the binary transfer protocol). Types are currently already
expected to be able to handle their own binary representation, so
allowing types to manage parts of the toast representation should IMHO
not be too dangerous, though we should make sure that BINARY COERCIBLE
types share this toast support routine, or be returned to their
canonical binary version before they are cast to the coerced type, as
using different detoasting mechanisms could result in corrupted data
and thus crashes.

Lastly, there is the compression part of TOAST. I think it should be
relatively straightforward to expose the compression-related
components of TOAST through functions that can then be used by
type-specific toast support functions.
Note that this would be opt-in for a type, thus all functions that use
that type's internals should be aware of the different on-disk format
for toasted values and should thus be able to handle it gracefully.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru




Re: PostgreSQL domains and NOT NULL constraint

2023-10-23 Thread Matthias van de Meent
On Mon, 23 Oct 2023, 19:34 Tom Lane,  wrote:
>
> I wrote:
> > Given the exception the spec makes for CAST, I wonder if we shouldn't
> > just say "NULL is a valid value of every domain type, as well as every
> > base type.  If you don't like it, too bad; write a separate NOT NULL
> > constraint for your table column."
>
> After ruminating on this for awhile, here's a straw-man proposal:
>
> 1. Domains are data types, with the proviso that NULL is always
> a valid value no matter what the domain constraints might say.
> Implementation-wise, this'd just require that CoerceToDomain
> immediately return any null input without checking the constraints.
> This has two big attractions:

Agreed.

> 2. In INSERT and UPDATE queries, thumb through the constraints of
> any domain-typed target columns to see if any of them are NOT NULL
> or CHECK(VALUE IS NOT NULL).  If so, act as though there's a table
> NOT NULL constraint on that column.

How does this work w.r.t. concurrently created tables that contain the
domain? Right now, you can do something along the lines of the
following due to a lack of locking on domains for new columns/tables
that use said domain, and I believe that this is the main source of
domain constraint violations:

CREATE DOMAIN mydomain text;
CREATE TABLE c (d mydomain);

S1: BEGIN; INSERT INTO c VALUES (''); CREATE TABLE t (d mydomain);
INSERT INTO t VALUES (NULL);

S2: BEGIN; ALTER DOMAIN mydomain SET NOT NULL;
-- waits for S1 to release lock on c

S1: COMMIT;
-- S2's ALTER DOMAIN gets unblocked and succeeds, despite the NULL
value in "t" because that table is invisible to the transaction of
ALTER DOMAIN.

So my base question is, should we then require e.g. SHARE locks on
types that depend on domains when we do DDL that depends on the type,
and SHARE UPDATE EXCLUSIVE when we modify the type?

> The idea of point #2 is to have a cheap check that 99% satisfies
> what the spec says about not-null constraints on domains.  If we
> don't do #2, I think we have to fully recheck all the domain's
> constraints during column assignment.  I find that ugly as well
> as expensive performance-wise.  It does mean that if you have
> some domain constraint that would act to reject NULLs, but it's
> spelled in some weird way, it won't reject NULLs.  I don't find
> that possibility compelling enough to justify the performance hit
> of recomputing every constraint just in case it acts like that.

Makes sense.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Lowering the default wal_blocksize to 4K

2023-10-11 Thread Matthias van de Meent
On Wed, 11 Oct 2023 at 01:29, Andres Freund  wrote:
>
> Hi,
>
> On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > On Tue, 10 Oct 2023 at 06:14, Andres Freund  wrote:
> > > On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
> > >> Andres Freund  writes:
> > >>> There's an alternative approach we could take, which is to write in 4KB
> > >>> increments, while keeping 8KB pages. With the current format that's not
> > >>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > >>> pages, it seems we should just go for 4KB?
> > >>
> > >> Seems like that's doubling the overhead of WAL page headers.  Do we need
> > >> to try to skinny those down?
> > >
> > > I think the overhead is small, and we are wasting so much space in other
> > > places, that I am not worried about the proportional increase page header
> > > space usage at this point, particularly compared to saving in overall write
> > > rate and increase in TPS. There's other areas we can save much more space, if
> > > we want to focus on that.
> > >
> > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > the differing space is worse than the gain. Or do something inbetween - having
> > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > convinced that having segment and block size in there, as 32bit numbers no
> > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > the xlog block or segment size differ.
> >
> > Hmm. I don't think we should remove those checks, as I can see people
> > that would want to change their XLog block size with e.g.
> > pg_reset_wal.
>
> I don't think that's something we need to address in every physical
> segment. For one, there's no option to do so.

Not block size, but xlog segment size is modifiable with pg_resetwal,
and could thus reasonably change across restarts. Apart from more
practical concerns around compile-time options requiring you to swap
out binaries, I don't really see why xlog block size couldn't be
changed with pg_resetwal in a securely shutdown cluster as one does
with the WAL segment size.
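For reference, changing the segment size on a cleanly shut down cluster
already works along these lines (pg_resetwal's --wal-segsize option
exists today; a --wal-blocksize equivalent is the hypothetical part):

$ pg_ctl stop -D $PGDATA
$ pg_resetwal --wal-segsize=64 $PGDATA    # new segment size, in megabytes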

> But more importantly, if they
> don't change the xlog block size, we'll just accept random WAL as well. If
> somebody goes to the trouble of writing a custom tool, they can live with the
> consequences of that potentially causing breakage. Particularly if the checks
> wouldn't meaningfully prevent that anyway.

I don't understand what you mean by that "we'll just accept random WAL
as well". We do significant validation in XLogReaderValidatePageHeader
to make sure that all pages of WAL are sufficiently formatted so that
they can securely be read by the available infrastructure with the
least chance of misreading data. There is no chance currently that we
read WAL from WAL segments that contain correct data for different
segment or block sizes. That includes WAL from segments created before
a pg_resetwal changed the WAL segment size.

If this "custom tool" refers to the typo-ed name of pg_resetwal, that
is hardly a custom tool, it is shipped with PostgreSQL and you can
find the sources under src/bin/pg_resetwal.

> > After that we'll only have the system ID left from the extended
> > header, which we could store across 2 pages in the (current) alignment
> > losses of xlp_rem_len - even pages the upper half, uneven pages the
> > lower half of the ID. This should allow for enough integrity checks
> > without further increasing the size of XLogPageHeader in most
> > installations.
>
> I doubt that that's a good idea - what if there's just a single page in a
> segment? And there aren't earlier segments? That's not a rare case, IME.

Then we'd still have 50% of a system ID which we can check against for
any corruption. I agree that it increases the chance of conflicts, but
it's still strictly better than nothing at all.
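As a sketch, validating that half-ID could look like this (the names
are mine, for illustration only):

#include <stdbool.h>
#include <stdint.h>

/* Even-numbered pages store the upper 32 bits of the system ID in the
 * alignment padding of xlp_rem_len; odd pages store the lower 32 bits.
 */
static bool
xlp_sysid_part_valid(uint64_t sysid, uint64_t pageno, uint32_t stored)
{
    uint32_t expected = (pageno % 2 == 0) ? (uint32_t) (sysid >> 32)
                                          : (uint32_t) sysid;

    return stored == expected;
}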
An alternative solution would be to write the first two pages of a WAL
segment regardless of contents, so that we essentially never only have
access to the first page during crash recovery. Physical replication's
recovery wouldn't be able to read ahead, but I consider that as less
problematic.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Lowering the default wal_blocksize to 4K

2023-10-10 Thread Matthias van de Meent
On Tue, 10 Oct 2023 at 06:14, Andres Freund  wrote:
>
> Hi,
>
> On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> There's an alternative approach we could take, which is to write in 4KB
>>> increments, while keeping 8KB pages. With the current format that's not
>>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
>>> pages, it seems we should just go for 4KB?
>>
>> Seems like that's doubling the overhead of WAL page headers.  Do we need
>> to try to skinny those down?
>
> I think the overhead is small, and we are wasting so much space in other
> places, that I am not worried about the proportional increase page header
> space usage at this point, particularly compared to saving in overall write
> rate and increase in TPS. There's other areas we can save much more space, if
> we want to focus on that.
>
> I was thinking we should perhaps do the opposite, namely getting rid of short
> page headers. The overhead in the "byte position" <-> LSN conversion due to
> the differing space is worse than the gain. Or do something inbetween - having
> the system ID in the header adds a useful crosscheck, but I'm far less
> convinced that having segment and block size in there, as 32bit numbers no
> less, is worthwhile. After all, if the system id matches, it's not likely that
> the xlog block or segment size differ.

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.
But I think we can relatively easily move segsize/blocksize checks to
a different place in the normal page header, which would reduce the
number of bytes we'd have to store elsewhere.

We could move segsize/blocksize into the xlp_info flags: 12 of the 16
bits are currently unused. Using 4 of these bits for segsize
(indicating 2^N MB, current accepted values are N=0..10 for 1 MB ...
1024MB) and 4 (or 3) for blcksz (as we currently support 1..64 kB
blocks, or 2^{0..6} kB). This would remove the need for 2 of the 3
fields in the large xlog block header.
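As an illustration, that flag-bit encoding could look roughly like this
(macro names, bit positions, and the helper are mine, not from a patch):

#include <stdint.h>

#define XLP_SEGSIZE_SHIFT   4   /* bits 4..7: segment size = 2^N MB  */
#define XLP_BLCKSZ_SHIFT    8   /* bits 8..10: block size = 2^N kB   */

/* Fold segment size (in MB) and block size (in kB), both powers of
 * two, into the unused upper bits of a 16-bit xlp_info word, leaving
 * the low 4 bits for the existing XLP_* flags.
 */
static uint16_t
xlp_encode_sizes(uint16_t info, uint32_t seg_mb, uint32_t blck_kb)
{
    uint16_t n;

    for (n = 0; (1U << n) < seg_mb; n++)    /* log2(seg_mb)  */
        ;
    info |= (uint16_t) (n << XLP_SEGSIZE_SHIFT);

    for (n = 0; (1U << n) < blck_kb; n++)   /* log2(blck_kb) */
        ;
    info |= (uint16_t) (n << XLP_BLCKSZ_SHIFT);

    return info;
}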

After that we'll only have the system ID left from the extended
header, which we could store across 2 pages in the (current) alignment
losses of xlp_rem_len - even pages the upper half, uneven pages the
lower half of the ID. This should allow for enough integrity checks
without further increasing the size of XLogPageHeader in most
installations.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Lowering the default wal_blocksize to 4K

2023-10-10 Thread Matthias van de Meent
On Tue, 10 Oct 2023 at 01:08, Andres Freund  wrote:
>
> Hi,
>
> I've mentioned this to a few people before, but forgot to start an actual
> thread. So here we go:
>
> I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
> the current 8192.

Seems like a good idea.

> It's IMO quite interesting that even at the higher client counts, the number
> of bytes written don't reach parity.
>
> It's fun to see how the total number of writes *decreases* at higher
> concurrency, because it becomes more likely that pages are filled completely.

With higher client counts and short transactions I think it is not too
unexpected to see commit_delay+commit_siblings configured. Did you
measure the impact of this change on such configurations?

> One thing I noticed is that our auto-configuration of wal_buffers leads to
> different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
> great.

Hmm.

> Performing the same COPY workload (1024 files, split across N clients) for
> both settings shows no performance difference, but a very slight increase in
> total bytes written (about 0.25%, which is roughly what I'd expect).
>
> Personally I'd say the slight increase in WAL volume is more than outweighed
> by the increase in throughput and decrease in bytes written.

Agreed.

> There's an alternative approach we could take, which is to write in 4KB
> increments, while keeping 8KB pages. With the current format that's not
> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> pages, it seems we should just go for 4KB?

It is not just the disk overhead of blocks, but we also maintain some
other data (currently in the form of XLogRecPtrs) in memory for each
WAL buffer, the overhead of which will also increase when we increase
the number of XLog pages per MB of WAL that we cache.
Additionally, highly concurrent workloads with transactions that write
a high multiple of XLOG_BLCKSZ bytes to WAL may start to see increased
overhead due to the 0.25% additional WAL getting written and a doubling
of the number of XLog pages being touched (both initialization and the
smaller memcpy for records that would now cross an extra page
boundary).

However, for all of these issues I doubt that they actually matter
much in the grand scheme of things, so I definitely wouldn't mind
moving to 4KiB XLog pages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Comparing two double values method

2023-10-10 Thread Matthias van de Meent
On Tue, 10 Oct 2023 at 12:33, Bowen Shi  wrote:
>
> Dears,
>
> I noticed that in the `check_GUC_init` function, there is a direct
> comparison using the != operator for two double values, which seems
> problematic.

I don't think I understand the problem. The code checks that the
dynamic initialization values are equal to the current value of the
GUC, or 0. Why would a "margin for error" of 1e-6 be of any use?
Why was the margin of 1e-6 chosen instead of one based on the exponent
of the GUC's current value (if any)?

In my view, this would break the code, not fix it, as it would
decrease the cases where we detect broken GUC registrations.
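For context, the check in question has roughly this shape (paraphrased
from guc.c, not quoted verbatim):

/* Flag a GUC whose C variable was statically initialized to something
 * that is neither zero nor the registered boot value.  Widening this
 * to an epsilon comparison would only let more broken registrations
 * slip through undetected.
 */
if (*conf->variable != 0.0 && *conf->variable != conf->boot_val)
    elog(LOG, "GUC (PGC_REAL) %s, boot val=%g, C-var=%g",
         conf->gen.name, conf->boot_val, *conf->variable);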

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Pre-proposal: unicode normalized text

2023-10-06 Thread Matthias van de Meent
On Fri, 6 Oct 2023, 21:08 Jeff Davis,  wrote:

> On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> > What I think people really want is a whole column in
> > some encoding that isn't the normal one for that database.
>
> Do people really want that? I'd be curious to know why.
>

One reason someone would like this is because a database cluster may have
been initialized with something like --no-locale (thus getting defaulted to
LC_COLLATE=C, which is desired behaviour and gets fast strcmp operations
for indexing, and LC_CTYPE=SQL_ASCII, which is not exactly expected but can
be sufficient for some workloads), but now that the data has grown they
want to use utf8.EN_US collations in some of their new and modern tables'
fields.
Or, a user wants to maintain literal translation tables, where different
encodings would need to be used for different languages to cover the full
script when Unicode might not cover the full character set yet.
Additionally, I'd imagine specialized encodings like Shift_JIS could be
more space-efficient than UTF-8 for e.g. Japanese text, which might be
useful for someone who wants to be a bit more frugal with storage when they
know text is guaranteed to be in some encoding's native language:
compression can do the same work, but also adds significant overhead.

I've certainly experienced situations where I forgot to explicitly include
the encoding in initdb --no-locale and then only much later noticed that my
big data load is useless due to an inability to create UTF-8 collated
indexes.
I often use --no-locale to make string indexing fast (locales/collation are
not often important to my workload) and to block any environment variables
from being carried over into the installation. An ability to set or update
the encoding of columns would help reduce the pain: I would no longer have
to re-initialize the database or cluster from 0.
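For example, pinning the encoding while keeping the fast C locale avoids
that trap entirely (these are existing initdb options):

$ initdb --no-locale --encoding=UTF8 -D ./data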

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


Re: pg16: invalid page/page verification failed

2023-10-05 Thread Matthias van de Meent
On Thu, 5 Oct 2023 at 18:48, Justin Pryzby  wrote:
>
> On an instance running pg16.0:
>
> log_time | 2023-10-05 10:03:00.014-05
> backend_type | autovacuum worker
> left | page verification failed, calculated checksum 5074 but expected 5050
> context  | while scanning block 119 of relation "public.postgres_log_2023_10_05_0900"
>
> This is the only error I've seen so far, and for all I know there's a
> issue on the storage behind the VM, or a cosmic ray hit.  But I moved
> the table out of the way and saved a copy of get_raw_page() in case
> someone wants to ask about it.
>
> postgres=# SELECT * FROM heap_page_item_attrs(get_raw_page(801594131::regclass::text, 119), 801594131);
>  lp  | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_attrs
>    1 |   2304 |        1 |     16 |        |        |          |        |             |            |        |        |       |
>    2 |   8160 |        1 |     16 |        |        |          |        |             |            |        |        |       |
>    3 |   8144 |        1 |     16 |        |        |          |        |             |            |        |        |       |
> ...all the same except for lp_off...
>  365 |   2352 |        1 |     16 |        |        |          |        |             |            |        |        |       |
>  366 |   2336 |        1 |     16 |        |        |          |        |             |            |        |        |       |
>  367 |   2320 |        1 |     16 |        |        |          |        |             |            |        |        |       |

That's not a HEAP page; it looks more like a btree page: lp_len is too
short for heap (which starts at lp_len = 24), and there are too many
line pointers for an 8KiB heap page. btree often has lp_len of 16: 8
bytes indextuple header, one maxalign of data (e.g. int or bigint).

So, assuming it's a block of a different relation kind, then it's also
likely it was originally located elsewhere in that other relation,
indeed causing the checksum failure. You can further validate this by
looking at the page header's pd_special value - if it is 8176, that'd
be another indicator for it being a btree.
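For example, with pageinspect (using the same saved raw page):

SELECT lower, upper, special
  FROM page_header(get_raw_page(801594131::regclass::text, 119));

A special value of 8176 leaves exactly 16 bytes at the end of an 8KiB
page, which matches the size of a btree's BTPageOpaqueData.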

Kind regards,

Matthias van de Meent.




Re: Change of behaviour for creating same type name in multiple schemas

2023-10-05 Thread Matthias van de Meent
On Thu, 5 Oct 2023 at 14:13, Dave Cramer  wrote:
>
> Greetings,
>
> Before 16 if I created an array type in schema1 it would be named 
> schema1._array_type
> if I created the same type in schema 2 it would have been named
>
> schema2.__array_type
>
> Can someone point me to where the code was changed ?

This was with commit 70988b7b [0] in July 2022, based on this thread
[1] (moved from -bugs).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://github.com/postgres/postgres/commits/70988b7b0a0bd03c59a2314d0b5bcf2135692349
[1] 
https://www.postgresql.org/message-id/flat/b84cd82c-cc67-198a-8b1c-60f44e1259ad%40postgrespro.ru




Re: Allow deleting enumerated values from an existing enumerated data type

2023-10-03 Thread Matthias van de Meent
On Tue, 3 Oct 2023 at 22:49, Tom Lane  wrote:
>
> Andrew Dunstan  writes:
> > On 2023-09-28 Th 14:46, Tom Lane wrote:
> >> We went through all these points years ago when the enum feature
> >> was first developed, as I recall.  Nobody thought that the ability
> >> to remove an enum value was worth the amount of complexity it'd
> >> entail.
>
> > That's quite true, and I accept my part in this history. But I'm not
> > sure we were correct back then.
>
> I think it was the right decision at the time, given that the
> alternative was to not add the enum feature at all.  The question
> is whether we're now prepared to do additional work to support DROP
> VALUE.  But the tradeoff still looks pretty grim, because the
> problems haven't gotten any easier.
>
> I've been trying to convince myself that there'd be some value in
> your idea about a DISABLE flag, but I feel like there's something
> missing there.  The easiest implementation would be to have
> enum_in() reject disabled values, while still allowing enum_out()
> to print them.  But that doesn't seem to lead to nice results:
>
> [...]
>
> On the whole this is still a long way from a clean easy-to-use DROP
> facility, and it adds a lot of complexity of its own for pg_dump.
> So I'm not sure we want to build it.

I don't quite get what the hard problem is that we haven't already
solved for other systems:
We already can add additional constraints to domains (e.g. VALUE::int
<> 4), which (according to docs) scan existing data columns for
violations. We already drop columns without rewriting the table to
remove the column's data, and reject new data insertions for those
still-in-the-catalogs-but-inaccessible columns.

So, if a user wants to drop an enum value, why couldn't we "just" use
the DOMAIN facilities and 1.) add a constraint WHERE value NOT IN
(deleted_values), and after validation of that constraint 2.) mark the
enum value as deleted like we do with table column's pg_attribute
entries?
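To illustrate the analogy: step 1 corresponds to what domains can
already do today, and the end result would be some (hypothetical)
DROP VALUE syntax:

-- existing: adding a domain constraint scans existing columns
-- of the domain type for violations
ALTER DOMAIN mydomain ADD CONSTRAINT not_four CHECK (VALUE::int <> 4);

-- hypothetical syntax for the proposed enum feature:
ALTER TYPE rainbow DROP VALUE 'purple';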

The only real issue that I can think of is making sure that concurrent
backends don't modify this data, but that shouldn't be very different
from the other locks we already have to take in e.g. ALTER TYPE ...
DROP ATTRIBUTE.

Kind regards,

Matthias van de Meent




Re: Index AmInsert Parameter Confused?

2023-09-27 Thread Matthias van de Meent
On Wed, 27 Sept 2023 at 05:03, jacktby jacktby  wrote:
>
>
>
> > On 2023-09-27 00:45, Matthias van de Meent wrote:
> >
> > On Tue, 26 Sept 2023 at 18:38, jacktby jacktby  wrote:
> >>
> >> typedef bool (*aminsert_function) (Relation indexRelation,
> >>  Datum *values,
> >>  bool *isnull,
> >>  ItemPointer heap_tid,
> >>  Relation heapRelation,
> >>  IndexUniqueCheck checkUnique,
> >>  bool indexUnchanged,
> >>  struct IndexInfo *indexInfo);
> >>
> >> Why is there a heap_tid, We haven’t inserted the value, so where does it 
> >> from ?
> >
> > Index insertion only happens after the TableAM tuple has been
> > inserted. As indexes refer to locations in the heap, this TID contains
> > the TID of the table tuple that contains the indexed values, so that
> > the index knows which tuple to refer to.
> >
> > Note that access/amapi.h describes only index AM APIs; it does not
> > cover the table AM APIs described in access/tableam.h
> >
> > Kind regards,
> >
> > Matthias van de Meent
> 1. Thanks, so if we insert a tuple into a table which has an index on it,
> pg will insert the tuple into the heap first, and then give the heap TID
> from the heap to the index AM API, right?

Correct. I think this is also detailed in various places of the
documentation, yes.

> 2. I’m trying to implement a new index, but I just need the data held in
> the index, and I hope it’s not inserted into the heap, because all the data
> I want can be in the index.

In PostgreSQL, a table maintains the source of truth for the data, and
indexes are ephemeral data structures that improve the speed of
querying the data in their table. As such, dropping an index should
not impact the availability of the table's data.
If the only copy of your (non-derived) data is in the index, then it
is likely that some normal table operations will result in failures
due to the tableAM/indexAM breaking built-in assumptions about access
methods and data availability.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Index AmInsert Parameter Confused?

2023-09-26 Thread Matthias van de Meent
On Tue, 26 Sept 2023 at 18:38, jacktby jacktby  wrote:
>
> typedef bool (*aminsert_function) (Relation indexRelation,
>   Datum *values,
>   bool *isnull,
>   ItemPointer heap_tid,
>   Relation heapRelation,
>   IndexUniqueCheck checkUnique,
>   bool indexUnchanged,
>   struct IndexInfo *indexInfo);
>
> Why is there a heap_tid, We haven’t inserted the value, so where does it from 
> ?

Index insertion only happens after the TableAM tuple has been
inserted. As indexes refer to locations in the heap, this TID contains
the TID of the table tuple that contains the indexed values, so that
the index knows which tuple to refer to.

Note that access/amapi.h describes only index AM APIs; it does not
cover the table AM APIs described in access/tableam.h

Kind regards,

Matthias van de Meent




Re: XLog size reductions: Reduced XLog record header size for PG17

2023-09-25 Thread Matthias van de Meent
On Wed, 20 Sept 2023 at 07:06, Michael Paquier  wrote:
>
> On Tue, Sep 19, 2023 at 12:07:07PM +0200, Matthias van de Meent wrote:
> > V5 is a rebased version of v4, and includes the latest patch from
> > "smaller XLRec block header" [0] as 0001.
>
> 0001 and 0007 are the meat of the changes.

Correct.

> -#define XLR_CHECK_CONSISTENCY  0x02
> +#define XLR_CHECK_CONSISTENCY  (0x20)
>
> I can't help but notice that there are a few stylistic choices like
> this one that are part of the patch.  Using parenthesis in the case of
> hexa values is inconsistent with the usual practices I've seen in the
> tree.

Yes, I'll take another look at that.

>  #define COPY_HEADER_FIELD(_dst, _size)\
>  do {\
> -if (remaining < _size)\
> +if (remaining < (_size))\
>  goto shortdata_err;\
>
> There are a couple of stylistic changes like this one, that I guess
> could just use their own patch to make these macros easier to use.

They actually fix complaints of my IDE, but are otherwise indeed stylistic.

> -#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
> +#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info & 
> XLR_INFO_MASK)
> +#define XLogRecGetRmgrInfo(decoder) (((decoder)->record->header.xl_info) & 
> XLR_RMGR_INFO_MASK)
>
> This stuff in 0002 is independent of 0001, am I right?  Doing this
> split with an extra macro is okay by me, reducing the presence of
> XLR_INFO_MASK and bitwise operations based on it.

Yes, that change is to stop making use of (~XLR_INFO_MASK) where
XLR_RMGR_INFO_MASK is the correct bitmask (whilst also being quite
useful in the later patch).

> 0003 is also mechanical, but if you begin to enforce the use of
> XLR_RMGR_INFO_MASK as the bits allowed to be passed down to the RMGR
> identity callback, we should have at least a validity check to make
> sure that nothing, even custom RMGRs, pass down unexpected bits?

I think that's already handled in XLogInsert(), but I'll make sure to
add more checks if they're not in place yet.

> I am not convinced that XLOG_INCLUDE_XID is a good interface, TBH, and
> I fear that people are going to forget to set it.  Wouldn't it be
> better to use an option where the XID is excluded instead, making the
> inclusion of an XID the default?

Most rmgrs don't actually use the XID. Only XACT, MULTIXACT, HEAP,
HEAP2, and LOGICALMSG use the xid, so I thought it would be easier to
just find the places where those RMGR's records were being logged than
to update all other places.

I don't mind changing how we decide to log the XID, but I don't think
EXCLUDE_XID is a good alternative: most records just don't need the
transaction ID. There are many more index AMs with logging than table
AMs, so I don't think it is that weird to default to 'not included'.
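With the patch, an rmgr that does use the XID opts in at insert time,
something like this (XLOG_INCLUDE_XID is the patchset's flag; the
surrounding calls are the existing xloginsert.h API, and the function
is a simplified illustration):

XLogRecPtr
log_heap_delete_sketch(xl_heap_delete *xlrec)
{
    XLogBeginInsert();
    XLogSetRecordFlags(XLOG_INCLUDE_XID);   /* HEAP records use the xid */
    XLogRegisterData((char *) xlrec, SizeOfHeapDelete);

    return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
}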

> > The resource manager has ID = 0, thus requiring some special
> > handling in other code. Apart from being generally useful, it is
> > used in future patches to detect the end of wal in lieu of a zero-ed
> > fixed-size xl_tot_len field.
>
> Err, no, that may not be true.  See for example this thread where the
> topic of improving the checks of xl_tot_len and rely on this value on
> when a record header has been validated, even across page borders:
> https://www.postgresql.org/message-id/17928-aa92416a70ff4...@postgresql.org

Yes, there are indeed exceptions when reusing WAL segments, but it's
still a good canary, like xl_tot_len before this patch.

> Except that, in which cases could an invalid RMGR be useful?

A sentinel value that is obviously invalid is available for several
types, e.g. BlockNumber, TransactionId, XLogRecPtr, Buffer, and this
is quite useful if you want to check if something is definitely
invalid. I think that's fine in principle, we're already "wasting"
some IDs in the gap between RM_MAX_BUILTIN_ID and RM_MIN_CUSTOM_ID.

In the current xlog infrastructure, we use xl_tot_len as that sentinel
to detect whether a new record may exist, but in this patch that can't
be used because the field may not exist and depends on other bytes. So
I used xl_rmgr_id as the field to base the 'may a next record exist'
checks on, which required the 0 rmgr ID to be invalid.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: XLog size reductions: smaller XLRec block header for PG17

2023-09-25 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 01:03, Andres Freund  wrote:
>
> Hi,
>
> On 2023-05-18 19:22:26 +0300, Heikki Linnakangas wrote:
> > On 18/05/2023 17:59, Matthias van de Meent wrote:
> > > It changes the block IDs used to fit in 6 bits, using the upper 2 bits
> > > of the block_id field to store how much data is contained in the
> > > record (0, <=UINT8_MAX, or <=UINT16_MAX bytes).
> >
> > Perhaps we should introduce a few generic inline functions to do varint
> > encoding. That could be useful in many places, while this scheme is very
> > tailored for XLogRecordBlockHeader.

This scheme is reused later for the XLogRecord xl_tot_len field over
at [0], so it is not a one-off. Sure, it's tailored to this WAL
use case, but IMO we're getting good value from it. We don't use
protobuf or JSON for WAL, we use our own serialization format. Having
some specialized encoding/decoding in that format for certain fields
is IMO quite acceptable.

> Yes - I proposed that and wrote an implementation of reasonably efficient
> varint encoding. Here's my prototype:
> https://postgr.es/m/20221004234952.anrguppx5owewb6n%40awork3.anarazel.de

As I mentioned on that thread, that prototype has a significant
probability of doing nothing to improve WAL size, or even increasing
the WAL size for installations which consume a lot of OIDs.

> I think it's a bad tradeoff to write lots of custom varint encodings, just to
> eek out a bit more space savings.

This is only a single "custom" varint encoding though, if you can even
call it that. It makes a field's size depend on flags set in another
byte, which is not that much different from the existing use of
XLR_BLOCK_ID_DATA_[LONG, SHORT].
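For reference, the 6+2 bit encoding under discussion is roughly the
following (a sketch; the macro names and helper are mine, not the
patch's identifiers):

#include <stdint.h>

#define XLR_BLOCK_ID_MASK    0x3F   /* lower 6 bits: block ID          */
#define XLR_LEN_CLASS_SHIFT  6      /* upper 2 bits: data length class */

/* Length class 0: no data; 1: length fits in a uint8;
 * 2: length fits in a uint16.
 */
static uint8_t
encode_block_id(uint8_t block_id, uint32_t data_len)
{
    uint8_t len_class = (data_len == 0) ? 0 :
                        (data_len <= UINT8_MAX) ? 1 : 2;

    return (uint8_t) ((block_id & XLR_BLOCK_ID_MASK) |
                      (len_class << XLR_LEN_CLASS_SHIFT));
}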

> The increase in code complexity IMO makes it a bad tradeoff.

Pardon me for asking, but what would you consider to be a good
tradeoff then? I think the code relating to the WAL storage format is
about as simple as you can get it within the feature set it provides
and the size of the resulting records. While I think there is still
much to gain w.r.t. WAL record size, I don't think we can get much of
those improvements without adding at least some amount of complexity,
something I think to be true for most components in PostgreSQL.

So, except for redesigning significant parts of the public WAL APIs,
are we just going to ignore any potential improvements because they
"increase code complexity"?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://commitfest.postgresql.org/43/4386/




Re: Improving btree performance through specializing by key shape, take 2

2023-09-25 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 22:49, Peter Geoghegan  wrote:
>
> On Tue, Sep 19, 2023 at 6:28 AM Matthias van de Meent
>  wrote:
> > > To be clear, page deletion does what I described here (it does an
> > > in-place update of the downlink to the deleted page, so the same pivot
> > > tuple now points to its right sibling, which is our page of concern),
> > > in addition to fully removing the original pivot tuple whose downlink
> > > originally pointed to our page of concern. This is why page deletion
> > > makes the key space "move to the right", very much like a page split
> > > would.
> >
> > I am still aware of this issue, and I think we've discussed it in
> > detail earlier. I think it does not really impact this patchset. Sure,
> > I can't use dynamic prefix compression to its full potential, but I
> > still do get serious performance benefits:
>
> Then why have you linked whatever the first patch does with the high
> key to dynamic prefix compression in the first place? Your commit
> message makes it sound like it's a way to get around the race
> condition that affects dynamic prefix compression, but as far as I can
> tell it has nothing whatsoever to do with that race condition.

We wouldn't have to store the downlink's right separator and compare
it to the highkey if we didn't deviate from L's algorithm for DELETE
operations (which causes the race condition): just the right sibling's
block number would be enough.

(Yes, the right sibling's block number isn't available for the
rightmost downlink of a page. In those cases, we'd have to reuse the
parent page's high key with that of the downlink page, but I suppose
that'll be relatively rare).

> Questions:
>
> 1. Why shouldn't the high key thing be treated as an unrelated piece of work?

Because it was only significant and relatively visible after getting
rid of the other full key compare operations, and it touches
essentially the same areas. Splitting them out in more patches would
be a hassle.

> I guess it's possible that it really should be structured that way,
> but even then it's your responsibility to make it clear why that is.

Sure. But I think I've made that clear upthread too.

> As things stand, this presentation is very confusing.

I'll take a look at improving the presentation.

> 2. Separately, why should dynamic prefix compression be tied to the
> specialization work? I also see no principled reason why it should be
> tied to the other two things.

My performance results show that insert performance degrades by 2-3%
for single-column indexes if only the dynamic prefix truncation patch
is applied [0]. The specialization patches fix that regression on my
machine (5950x) due to having optimized code for the use case. I can't
say for certain that other machines will see the same results, but I
think results will at least be similar.

> I didn't mind this sort of structure so much back when this work was
> very clearly exploratory -- I've certainly structured work in this
> area that way myself, in the past. But if you want this patch set to
> ever go beyond being an exploratory patch set, something has to
> change.

I think it's fairly complete, and mostly waiting for review.

> I don't have time to do a comprehensive (or even a fairly
> cursory) analysis of which parts of the patch are helping, and which
> are marginal or even add no value.

It is a shame that you don't have the time to review this patch.

> > > You'd have
> > > to compare the lower bound separator key from the parent (which might
> > > itself be the page-level low key for the parent) to the page low key.
> > > That's not a serious suggestion; I'm just pointing out that you need
> > > to be able to compare like with like for a canary condition like this
> > > one, and AFAICT there is no lightweight practical way of doing that
> > > that is 100% robust.
> >
> > True, if we had consistent LOWKEYs on pages, that'd make this job much
> > easier: the prefix could indeed be carried over in full. But that's
> > not currently the case for the nbtree code, and this is the next best
> > thing, as it also has the benefit of working with all currently
> > supported physical formats of btree indexes.
>
> I went over the low key thing again because I had to struggle to
> understand what your high key optimization had to do with dynamic
> prefix compression. I'm still struggling. I think that your commit
> message very much led me astray. Quoting it here:
>
> """
> Although this limits [...] relatively expensive _bt_compare.
> """
>
> You're directly tying the high key optimization to the dynamic prefix
> compression optimization. But why?

The value 

Re: GenBKI emits useless open;close for catalogs without rows

2023-09-22 Thread Matthias van de Meent
On Fri, 22 Sept 2023 at 00:25, Andres Freund  wrote:
>
> Hi,
>
> On 2023-09-19 21:05:41 +0300, Heikki Linnakangas wrote:
> > On 18/09/2023 17:50, Matthias van de Meent wrote:
> > > (initdb takes about 73ms locally with syncing disabled)
> >
> > That's impressive. It takes about 600 ms on my laptop. Of which about 140 ms
> > goes into processing the BKI file. And that's with "initdb -no-sync" option.
>
> I think there must be a digit missing in Matthias' numbers.

Yes, kind of. The run was on 50 iterations, not the assumed 250.
Also note that the improved measurements were recorded inside the
bootstrap-mode PostgreSQL instance, not inside the initdb that was
processing the postgres.bki file. So it might well be that I didn't
improve the total timing by much.

> > > Various methods of reducing the size of postgres.bki were applied, as
> > > detailed in the patch's commit message. I believe the current output
> > > is still quite human readable.
> >
> > Overall this does not seem very worthwhile to me.
>
> Because the wins are too small?
>
> FWIW, Making postgres.bki smaller and improving bootstrapping time does seem
> worthwhile to me. But it doesn't seem quite right to handle the batching in
> the file format, it should be on the server side, no?

The main reason I did batching in the file format is to reduce the
storage overhead of the current one "INSERT" per row. Batching
improved that by replacing the token with a different construct, but
it's not necessarily the only solution. The actual parser still
inserts the tuples one by one into the relation, as I didn't spend time
on making a simple_heap_insert analog for bulk insertions.

> We really should stop emitting WAL during initdb...

I think it's quite elegant that we're able to bootstrap the relation
data of a new PostgreSQL cluster from the WAL generated in another
cluster, even if it is indeed a bit wasteful. I do see your point
though - the WAL shouldn't be needed if we're already fsyncing the
files to disk.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: GenBKI emits useless open;close for catalogs without rows

2023-09-22 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 20:05, Heikki Linnakangas  wrote:
>
> On 18/09/2023 17:50, Matthias van de Meent wrote:
> > (initdb takes about 73ms locally with syncing disabled)
>
> That's impressive. It takes about 600 ms on my laptop. Of which about
> 140 ms goes into processing the BKI file. And that's with "initdb
> -no-sync" option.

Hmm, yes, I misinterpreted my own benchmark setup, the actual value
would be somewhere around 365ms: I thought I was doing 50*50 runs in
one timed run, but really I was doing only 50 runs. To add insult to
injury, I divided the total time taken by 250 instead of either 50 or
2500... Thanks for correcting me on that.

> > Various methods of reducing the size of postgres.bki were applied, as
> > detailed in the patch's commit message. I believe the current output
> > is still quite human readable.
>
> Overall this does not seem very worthwhile to me.

Reducing the size of redistributables sounds worthwhile to me, but if
none of these changes are worth the effort, then alright, nothing
gained, only time lost.

> Looking at "perf" profile of initdb, I also noticed that a small but
> measurable amount of time is spent in the "isatty(0)" call in do_end().
> Does anyone care about doing bootstrap mode interactively? We could
> probably remove that.

Yeah, that sounds like a good idea.

Kind regards,

Matthias van de Meent




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 18:52, Robert Haas  wrote:
>
> On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  
> wrote:
> > I was thinking something vaguely like "a table size that's roughly what
> > an optimal autovacuuming schedule would leave the table at" assuming 0.2
> > vacuum_scale_factor.  You would determine the absolute minimum size for
> > the table given the current live tuples in the table, then add 20% to
> > account for a steady state of dead tuples and vacuumed space.  So it's
> > not 1.2x of the "current" table size at the time the local_update_limit
> > feature is installed, but 1.2x of the optimal table size.
>
> Right, that would be great. And honestly if that's something we can
> figure out, then why does the parameter even need to be an integer
> instead of a Boolean? If the system knows the optimal table size, then
> the user can just say "try to compact this table" and need not say to
> what size. The 1.2 multiplier is probably situation dependent and
> maybe the multiplier should indeed be a configuration parameter, but
> we would be way better off if the absolute size didn't need to be.

Mostly agreed, but I think there's a pitfall here. You seem to assume
we have a perfect oracle that knows the optimal data size, but we
already know that our estimates can be significantly off. I don't
quite trust the statistics enough to do any calculations based on the
number of tuples in the relation. That also ignores the fact that we
don't actually have any good information about the average size of the
tuples in the table. So with current statistics, any automated "this
is how large the table should be" decisions would result in an
automated footgun, instead of the current patch's approach, where the
user has to decide on an explicit value to configure.

But about that: I'm not sure what the "footgun" is that you've
mentioned recently?
The issue with excessive bloat (when the local_update_limit is set too
small and fillfactor is low) was fixed in the latest patch nearly
three weeks ago, so the only remaining issue with misconfiguration is
slower updates. Sure, that's not great, but in my opinion not a
"footgun": performance returns immediately after resetting
local_update_limit, and no space was lost.

> > This makes me think that maybe the logic needs to be a little more
> > complex to avoid the problem you describe: if an UPDATE is prevented
> > from being HOT because of this setting, but then it goes and consults
> > FSM and it gives the update a higher block number than the tuple's
> > current block (or it fails to give a block number at all so it is forced
> > to extend the relation), then the update should give up on that strategy
> > and use a HOT update after all.  (I have not read the actual patch;
> > maybe it already does this?  It sounds kinda obvious.)
>
> +1 to all of that. Anything we can do to reduce the chance of the
> parameter doing the opposite of what it's intended to do is, IMHO,
> really, really valuable. If you're in the situation where you really
> need something like this, you're probably having a pretty bad day
> already.

Yes, it does that with the latest patch, from not quite 3 weeks ago.

> Just to be more clear about my position, I don't think that having
> some kind of a feature along these lines is a bad idea.

Thanks for clarifying.

> I do think
> that this is one of those cases where the perfect is the enemy of the
> good, and we can fall into the trap of saying that since we can't do
> the perfect thing let's not do anything at all. At the same time, just
> because we need to do something doesn't mean we should do exactly the
> first thing that anybody thought up, or that we shouldn't try as hard
> as we can to mitigate the downsides. If we add something like this I
> bet it will get a lot of use. Even a minor improvement to the design
> that removes one pitfall of many could turn out to help a lot of
> people.

100% agreed.

> > Having to set AEL is not nice for sure, but wouldn't
> > ShareUpdateExclusiveLock be sufficient?  We have a bunch of reloptions
> > for which that is sufficient.
>
> Hmm, yeah, I think you're right.

Updating the reloption after relation truncation implies having the
same lock as relation truncation, i.e. AEL (if the vacuum docs are to
be believed). So the AEL is not required for updating the storage
option (that would only require SUEL), but for the block truncation
operation itself.

Kind regards,

Matthias van de Meent
Neon (http://neon.tech)




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 18:56, Andres Freund  wrote:
>
> Hi,
>
> On 2023-09-19 18:30:44 +0200, Alvaro Herrera wrote:
> > This makes me think that maybe the logic needs to be a little more
> > complex to avoid the problem you describe: if an UPDATE is prevented
> > from being HOT because of this setting, but then it goes and consults
> > FSM and it gives the update a higher block number than the tuple's
> > current block (or it fails to give a block number at all so it is forced
> > to extend the relation), then the update should give up on that strategy
> > and use a HOT update after all.  (I have not read the actual patch;
> > maybe it already does this?  It sounds kinda obvious.)
>
> Yea, a setting like what's discussed here seems, uh, not particularly useful
> for achieving the goal of compacting tables.  I don't think guiding this
> through SQL makes a lot of sense. For decent compaction you'd want to scan the
> table backwards, and move rows from the end to earlier, but stop once
> everything is filled up. You can somewhat do that from SQL, but it's going to
> be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
> logging.

We can't move tuples around (not that I know of, anyway) without using a
transaction ID to control the visibility of the two locations of that
tuple. Doing table compaction would thus likely require using
transactions to move these tuples around. Using a single backend and
bulk operations, it'll still lock each tuple that is being moved, and
that can be noticed by user DML queries. I'd rather make the user's
queries move the data around than run this long-duration, locking
background operation.

> I think having explicit compaction support in VACUUM or somewhere similar
> would make sense, but I don't think the proposed GUC is a useful stepping
> stone.

The point of this GUC is that the compaction can happen organically in
the user's UPDATE workflow, so that there is no long locking operation
going on (as you would see with VACUUM FULL / CLUSTER / pg_repack).

> > > But without any kind of auto-tuning, in my opinion, it's a fairly poor
> > > feature. Sure, some people will get use out of it, if they're
> > > sufficiently knowledgeable and sufficiently determined. But I think
> > > for most people in most situations, it will be a struggle.
>
> Indeed. I think it'd often just explode table and index sizes, because HOT
> pruning won't be able to make usable space in pages anymore (due to dead
> items).

You seem to misunderstand the latest patch. It explicitly only blocks
local updates if the update can then move the new tuple to an earlier
page. If that is not possible, then it'll insert locally (assuming
that is still possible) and HOT can then still apply.

And yes, moving tuples to earlier pages will indeed increase index
bloat, because it does create dead tuples where previously we could've
applied HOT. But we do have VACUUM and REINDEX CONCURRENTLY to clean
that up without serious long-duration stop-the-world actions, while
the other builtin cleanup methods don't.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Improving btree performance through specializing by key shape, take 2

2023-09-19 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 03:56, Peter Geoghegan  wrote:
>
> On Mon, Sep 18, 2023 at 6:29 PM Peter Geoghegan  wrote:
> > I also have significant doubts about your scheme for avoiding
> > invalidating the bounds of the page based on its high key matching the
> > parent's separator. The subtle dynamic prefix compression race
> > condition that I was worried about was one caused by page deletion.
> > But page deletion doesn't change the high key at all (it does that for
> > the deleted page, but that's hardly relevant). So how could checking
> > the high key possibly help?
>
> To be clear, page deletion does what I described here (it does an
> in-place update of the downlink to the deleted page, so the same pivot
> tuple now points to its right sibling, which is our page of concern),
> in addition to fully removing the original pivot tuple whose downlink
> originally pointed to our page of concern. This is why page deletion
> makes the key space "move to the right", very much like a page split
> would.

I am still aware of this issue, and I think we've discussed it in
detail earlier. I think it does not really impact this patchset. Sure,
I can't use dynamic prefix compression to its full potential, but I
still do get serious performance benefits:

FULL KEY _bt_compare calls:
'Optimal' full-tree DPT: average O(3)
Paged DPT (this patch):  average O(2 * height)
... without HK opt:  average O(3 * height)
Current: O(log2(n))

Single-attribute compares:
'Optimal' full-tree DPT: O(log(N))
Paged DPT (this patch):  O(log(N))
Current: 0 (or, O(log(N) * natts))

So, in effect, this patch moves most compare operations to the level
of only one or two full key compare operations per page (on average).

I use "on average": on a sorted array with values ranging from
potentially minus infinity to positive infinity, it takes on average 3
compares before a binary search can determine the bounds of the
keyspace it has still to search. If one side's bounds is already
known, it takes on average 2 compare operations before these bounds
are known.

> IMV it would be better if it made the key space "move to the left"
> instead, which would make page deletion close to the exact opposite of
> a page split -- that's what the Lanin & Shasha paper does (sort of).
> If you have this symmetry, then things like dynamic prefix compression
> are a lot simpler.
>
> ISTM that the only way that a scheme like yours could work, assuming
> that making page deletion closer to Lanin & Shasha is not going to
> happen, is something even more invasive than that: it might work if
> you had a page low key (not just a high key) on every page.

Note that the "dynamic prefix compression" is currently only active on
the page level.

True, the patch does carry over _bt_compare's prefix result for the
high key on the child page, but we do that only if the highkey is
actually an exact copy of the right separator on the parent page. This
carry-over opportunity is extremely likely to happen, because the high
key generated in _bt_split is then later inserted on the parent page.
The only case where it could differ is in concurrent page deletions.
That is thus a case of betting a few cycles to commonly save many
cycles (a memcmp vs a full-key _bt_compare).
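In pseudo-C, the gate for the carry-over is simply this (the helper
name is mine; IndexTupleSize() is the existing macro):

static bool
highkey_matches_parent_separator(IndexTuple highkey, IndexTuple rightsep)
{
    Size len = IndexTupleSize(highkey);

    return len == IndexTupleSize(rightsep) &&
           memcmp(highkey, rightsep, len) == 0;
}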

Again, we do not actually skip a prefix on the compare call of the
P_HIGHKEY tuple, nor for the compares of the midpoints unless we've
found a tuple on the page that compares as smaller than the search
key.

> You'd have
> to compare the lower bound separator key from the parent (which might
> itself be the page-level low key for the parent) to the page low key.
> That's not a serious suggestion; I'm just pointing out that you need
> to be able to compare like with like for a canary condition like this
> one, and AFAICT there is no lightweight practical way of doing that
> that is 100% robust.

True, if we had consistent LOWKEYs on pages, that'd make this job much
easier: the prefix could indeed be carried over in full. But that's
not currently the case for the nbtree code, and this is the next best
thing, as it also has the benefit of working with all currently
supported physical formats of btree indexes.

Kind regards,

Matthias van de Meent




Re: XLog size reductions: smaller XLRec block header for PG17

2023-09-18 Thread Matthias van de Meent
On Tue, 5 Sept 2023 at 15:04, Aleksander Alekseev
 wrote:
>
> Hi,
>
> I noticed that the patch needs review and decided to take a look.

Thanks for reviewing!

> All in all the patch looks good to me, but I have a couple of nitpicks:
>
> * The comment for XLogSizeClass seems to be somewhat truncated as if
> Ctrl+S was not pressed before creating the patch. I also suggest
> double-checking the grammar.

I've updated the various comments with improved wording.

> * `Size written = -1;` in XLogWriteLength() can lead to compiler
> warnings some day considering the fact that Size / size_t are
> unsigned. Also this assignment doesn't seem to serve any particular
> purpose. So I suggest removing it.

Fixed, it now uses `int` instead, as does XLogReadLength().

> * I don't see much value in using the WRITE_OP macro in
> XLogWriteLength(). The code is read more often than it's written and I
> wouldn't call this code particularly readable (although it's shorter).
> * XLogReadLength() - ditto

I use READ_OP and WRITE_OP mostly to make sure that each operation's
code is clear. Manually expanding the macro would allow the handling
of each variant to have different structure code, and that would allow
for more coding errors. I think it's extra important to make sure the
code isn't wrong because this concerns WAL (de)serialization, and one
copy is (in my opinion) easier to check for errors than 3 copies.

I've had my share of issues in copy-edited code, so I rather like keep
the template around as long as I don't need to modify the underlying
code.

> * `if (read < 0)` in DecodeXLogRecord() is noop since `read` is unsigned

Yes, thanks for noticing. I've been working with Rust recently, where
the unsigned size type is `usize` and the signed one is `isize`. The
issue has been fixed in the attached patch with 'int' types instead.

Kind regards,

Matthias van de Meent


v2-0001-Reduce-overhead-of-small-block-data-in-xlog-recor.patch
Description: Binary data


Re: Improving btree performance through specializing by key shape, take 2

2023-09-18 Thread Matthias van de Meent
On Wed, 30 Aug 2023 at 21:50, Matthias van de Meent
 wrote:
>
> Updated in the attached version 12 of the patchset (which is also
> rebased on HEAD @ 9c13b681). No changes apart from rebase fixes and
> these added comments.

Rebased again to v13 to account for API changes in 9f060253 "Remove
some more "snapshot too old" vestiges."

Kind regards,

Matthias van de Meent




Re: GenBKI emits useless open;close for catalogs without rows

2023-09-18 Thread Matthias van de Meent
On Tue, 12 Sept 2023 at 17:51, Matthias van de Meent
 wrote:
>
> On Fri, 1 Sept 2023 at 19:52, Tom Lane  wrote:
> >
> > Alvaro Herrera  writes:
> > > On 2023-Sep-01, Matthias van de Meent wrote:
> > >> A potential addition to the patch would be to stop manually closing
> > >> relations: initdb and check-world succeed without manual 'close'
> > >> operations because the 'open' command auto-closes the previous open
> > >> relation (in boot_openrel). Testing also suggests that the last opened
> > >> relation apparently doesn't need closing - check-world succeeds
> > >> without issues (incl. with TAP enabled). That is therefore implemented
> > >> in attached patch 2 - it removes the 'close' syntax in its entirety.
> >
> > > Hmm, what happens with the last relation in the bootstrap process?  Is
> > > closerel() called via some other path for that one?
> >
> > Taking a quick census of existing closerel() callers: there is
> > cleanup() in bootstrap.c, but it's called uncomfortably late
> > and outside any transaction, so I misdoubt that it works
> > properly if asked to actually shoulder any responsibility.
> > (A little code reshuffling could fix that.)
> > There are also a couple of low-level elog warnings in CREATE
> > that would likely get triggered, though I suppose we could just
> > remove those elogs.
>
> Yes, that should be easy to fix.
>
> > I guess my reaction to this patch is "why bother?".  It seems
> > unlikely to yield any measurable benefit, though of course
> > that guess could be wrong.
>
> There is a small but measurable decrease in size of the generated bki
> (2kb with both patches, on an initial 945kB), and there is some
> related code that can be eliminated. If that's not worth bothering,
> then I can drop the patch. Otherwise, I can update the patch to do the
> cleanup that was within the transaction boundaries at the end of
> boot_yyparse.
>
> If decreasing the size of postgres.bki is not worth the effort, I'll
> drop any effort on doing so, but considering that it is about 1MB of
> our uncompressed distributables, I'd say decreases in size are worth
> the effort, most of the time.

With the attached patch I've see a significant decrease in the size of
postgres.bki of about 25%, and a likely related decrease in wall clock
time spent in the bootstrap transaction: with timestamp logs inserted
around the boot_yyparse() transaction the measured time went from
around 49 ms on master to around 45 ms patched. In the grand scheme of
initdb that might not be a lot of time (initdb takes about 73ms
locally with syncing disabled) but it is a nice gain in performance.

Comparison:

master @ 9c13b681
 $ du -b pg_install/share/postgres.bki
945220
 $ initdb --no-instructions --auth=md5 --pwfile pwfile -N -D ~/test-dbinit/
[...]
2023-09-16 02:22:57.339 CEST [10422] LOG:  Finished bootstrapping:
to_start: 10 ms, transaction: 49 ms, finishing: 1 ms, total: 59 ms
[...]

patched
 $ du -b pg_install/share/postgres.bki
702574
 $ initdb --no-instructions --auth=md5 --pwfile pwfile -N -D ~/test-dbinit/
[...]
2023-09-16 02:25:57.664 CEST [15645] LOG:  Finished bootstrapping:
to_start: 10 ms, transaction: 45 ms, finishing: 1 ms, total: 54 ms
[...]

Various methods of reducing the size of postgres.bki were applied, as
detailed in the patch's commit message. I believe the current output
is still quite human readable.

There are other potential avenues for further reducing the bki size,
e.g. through using smaller generated OIDs (reducing the number of
characters used per OID), applying RLE on sequential NULLs (there are
3k+ occurrences of /( __){2,10}/ remaining in the generated bki file),
and other tricks, but several of those are likely to be detrimental to
the readability and manual verifiability of the bki.
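To sketch the RLE idea with invented syntax - say a "__*4" token
standing for four null columns; the attached patch implements none of
this - the boot scanner could expand such a token along these lines:

#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch only: expand a run-length token like "__*4"
 * into N null columns, where a bare "__" still means a single null.
 */
static void
expand_null_run(const char *token, void (*emit_null_column)(void))
{
    const char *star = strchr(token, '*');
    int         count = (star != NULL) ? atoi(star + 1) : 1;

    for (int i = 0; i < count; i++)
        emit_null_column();
}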

Kind regards,

Matthias van de Meent


v2-0001-Update-BKI-syntax-bootstrap-performance.patch
Description: Binary data


Re: GenBKI emits useless open;close for catalogs without rows

2023-09-12 Thread Matthias van de Meent
On Fri, 1 Sept 2023 at 19:52, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2023-Sep-01, Matthias van de Meent wrote:
> >> A potential addition to the patch would be to stop manually closing
> >> relations: initdb and check-world succeed without manual 'close'
> >> operations because the 'open' command auto-closes the previous open
> >> relation (in boot_openrel). Testing also suggests that the last opened
> >> relation apparently doesn't need closing - check-world succeeds
> >> without issues (incl. with TAP enabled). That is therefore implemented
> >> in attached patch 2 - it removes the 'close' syntax in its entirety.
>
> > Hmm, what happens with the last relation in the bootstrap process?  Is
> > closerel() called via some other path for that one?
>
> Taking a quick census of existing closerel() callers: there is
> cleanup() in bootstrap.c, but it's called uncomfortably late
> and outside any transaction, so I misdoubt that it works
> properly if asked to actually shoulder any responsibility.
> (A little code reshuffling could fix that.)
> There are also a couple of low-level elog warnings in CREATE
> that would likely get triggered, though I suppose we could just
> remove those elogs.

Yes, that should be easy to fix.

> I guess my reaction to this patch is "why bother?".  It seems
> unlikely to yield any measurable benefit, though of course
> that guess could be wrong.

There is a small but measurable decrease in the size of the generated
bki (2 kB with both patches, on an initial 945 kB), and there is some
related code that can be eliminated. If that's not worth bothering
with, then I can drop the patch. Otherwise, I can update the patch so
that the cleanup happens within the transaction boundaries, at the end
of boot_yyparse.

If decreasing the size of postgres.bki is not worth the effort, I'll
stop working on it; but considering that it is about 1 MB of our
uncompressed distributables, I'd say decreases in size are worth the
effort, most of the time.

Kind regards,

Matthias van de Meent




Re: Detoasting optionally to make Explain-Analyze less misleading

2023-09-12 Thread Matthias van de Meent
On Tue, 12 Sept 2023 at 12:56, stepan rutz  wrote:
>
> Hi,
>
> I have fallen into this trap and others have too. If you run
> EXPLAIN(ANALYZE), no de-toasting happens. This makes query runtimes
> differ a lot. The bigger point is that the average user expects more
> from EXPLAIN(ANALYZE) than what it provides. This can be surprising.
> You can force detoasting during explain with explicit calls to
> length(), but that is tedious. Those of us who are forced to work
> using Java stacks, ORMs and still store mostly documents fall into
> this trap sooner or later. I have already received some good feedback
> on this one, so this is an issue that bothers quite a few people out
> there.

Yes, the lack of being able to see the impact of detoasting (amongst
others) in EXPLAIN (ANALYZE) can hide performance issues.

> It would be great to get some feedback on the subject and how to address
> this, maybe in totally different ways.

Hmm, maybe we should measure the overhead of serializing the tuples
instead. The difference between your patch and "serializing the
tuples, but not sending them" is that serializing not only does the
detoasting, but also includes any time spent in the types' output
(serialization) functions. So an option "SERIALIZE" which measures all
the time the server spent on the query (except for the final step of
sending the bytes to the client) would likely be more useful than
"just" detoasting.
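
Usage could then look something like "EXPLAIN (ANALYZE, SERIALIZE)
SELECT * FROM documents;" - with the option name and exact syntax of
course invented here, and up for bikeshedding.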

> 0001_explain_analyze_and_detoast.patch

I notice that this patch creates and destroys a memory context for
every tuple that the DestReceiver receives. I think that's quite
wasteful, as you should be able to create only one memory context and
reset it before (or after) each processed tuple. That also reduces the
differences in measurements between EXPLAIN and normal query
processing of the tuples - after all, we don't create new memory
contexts for every tuple in the normal DestRemote receiver either,
right?
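
To illustrate what I mean - a minimal sketch only, with hypothetical
names, not code from your patch: create the context once in the
receiver's rStartup callback and then reset it per tuple, e.g.

#include "postgres.h"
#include "executor/tuptable.h"
#include "tcop/dest.h"
#include "utils/memutils.h"

typedef struct DetoastReceiver
{
    DestReceiver pub;
    MemoryContext tmpcontext;   /* created once in rStartup, e.g. with
                                 * AllocSetContextCreate(..., "detoast",
                                 * ALLOCSET_DEFAULT_SIZES) */
} DetoastReceiver;

static bool
detoast_receive(TupleTableSlot *slot, DestReceiver *self)
{
    DetoastReceiver *myState = (DetoastReceiver *) self;
    MemoryContext oldcontext;

    /* cheap: reuses the context's blocks instead of freeing them */
    MemoryContextReset(myState->tmpcontext);
    oldcontext = MemoryContextSwitchTo(myState->tmpcontext);

    /* deform the tuple; allocations now land in tmpcontext */
    slot_getallattrs(slot);
    /* ... the patch's detoast pass over slot->tts_values goes here ... */

    MemoryContextSwitchTo(oldcontext);
    return true;
}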

Kind regards,

Matthias van de Meent




Re: How to add a new pg oid?

2023-09-05 Thread Matthias van de Meent
On Tue, 5 Sept 2023 at 18:13, jacktby jacktby  wrote:
>
> I'm trying to add a new data type for my pg. How do I do that? Can
> you give me more details or an example?

You could get started by looking at the documentation on custom SQL
types at https://www.postgresql.org/docs/current/sql-createtype.html,
or at the comments in pg_type.dat and the comments on TypInfo in
bootstrap.c, which describe how the built-in types are created and
managed.
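
For the C side, the usual shape is the complex-number example from the
"User-Defined Types" chapter of the docs; roughly this (paraphrased
from the docs, so double-check against your server version):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

typedef struct Complex
{
    double      x;
    double      y;
} Complex;

PG_FUNCTION_INFO_V1(complex_in);

Datum
complex_in(PG_FUNCTION_ARGS)
{
    char       *str = PG_GETARG_CSTRING(0);
    double      x, y;
    Complex    *result;

    if (sscanf(str, " ( %lf , %lf )", &x, &y) != 2)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("invalid input syntax for type complex: \"%s\"",
                        str)));

    result = (Complex *) palloc(sizeof(Complex));
    result->x = x;
    result->y = y;
    PG_RETURN_POINTER(result);
}

You'd pair that with a matching output function, register both with
CREATE FUNCTION ... LANGUAGE C, and then wire them into CREATE TYPE,
as that documentation page shows.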

Lastly, you could look at pg_class and the genbki documentation if you
want to add new catalog types.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Commitfest 2023-09 starts soon

2023-09-04 Thread Matthias van de Meent
On Mon, 4 Sept 2023 at 18:19, Aleksander Alekseev
 wrote:
>
> Hi Matthias,
>
> > I'm a bit confused about your use of "consensus". True, there was no
> > objection, but it looks like no patch author or reviewer was informed
> > (cc-ed or directly referenced) that the patch was being discussed
> > before achieving this "consensus", and the "consensus" was reached
> > within 4 days, of which 2 were weekend days, in a thread that has (until now)
> > involved only you and Peter E.
> >
> > Usually, you'd expect discussion about a patch to happen on the
> > patch's thread before any action is taken (or at least a mention on
> > that thread), but quite clearly that hasn't happened here.
> > Are patch authors expected to follow any and all discussion on threads
> > with "Commitfest" in the title?
> > If so, shouldn't the relevant wiki pages be updated, and/or the
> > -hackers community be updated by mail in a new thread about these
> > policy changes?
>
> I understand your disappointment and assure you that no one is acting
> with bad intentions here. Also please note that English is a second
> language for many of us which represents a challenge when it comes to
> expressing thoughts on the mailing list. We have a common goal here,
> to make PostgreSQL an even better system than it is now.
>
> The patches under question were in "Waiting on Author" state for a
> *long* time and the authors were notified about this. We could toss
> such patches from one CF to another month after month, or mark them
> as RwF and let the author know that no one is going to review the
> patch until the author takes action. It's been noted that the latter
> approach is more productive in the long run.

This far I agree - we can't keep patches around with issues if they're
not being worked on. And I do appreciate your work on pruning dead or
stale patches. But:

> The discussion can
> continue in the same thread and the same thread can be registered for
> the upcoming CF.

This is one of my major concerns here: Patch resolution is being
discussed on -hackers, but outside of the thread used to discuss that
patch (as indicated in the CF app), and without apparent author
inclusion. To me, that feels like going behind the author's back, and I
don't think that this should be normalized.

As mentioned in the earlier mail, my other concern is the use of
"consensus" in this context. You link to a message on -hackers, with
no visible agreements. As a patch author myself, if a lack of comments
on my patch in an otherwise unrelated thread is "consensus", then I'll
probably move all patches that have yet to be commented on to RfC, as
there'd be "consensus" that they should be included as-is in
PostgreSQL. But I digress.

I think it would be better to just remove the "consensus" part of your
mail and put down the real reason why each patch is being RwF-ed or
rejected. That is, don't imply that there are hackers who OK-ed it
when there are none, and inform patch authors directly about the
reasons why the patch is being returned; so without "see consensus in
[0]".

Kind regards,

Matthias van de Meent




Re: Commitfest 2023-09 starts soon

2023-09-04 Thread Matthias van de Meent
On Thu, 31 Aug 2023 at 14:35, Aleksander Alekseev
 wrote:
>
> Hi,
> > On Thu, 31 Aug 2023 at 11:37, Peter Eisentraut  wrote:
> > > There are a number of patches carried over from the PG16 development
> > > cycle that have been in "Waiting on author" for several months.  I will
> > > aggressively prune those after the start of this commitfest if there
> > > hasn't been any author activity by then.
> >
> > [1 patch]
>
> This was the one that I could name off the top of my head.
>
> [5 more patches]
>
> I will apply the corresponding status changes if there are no objections.

On Mon, 4 Sept 2023 at [various], Aleksander Alekseev
 wrote:
>
> Hi,
>
> > [various patches]
>
> A consensus was reached [1] to mark this patch as RwF for now. There
> are many patches to be reviewed and this one doesn't seem to be in the
> best shape, so we have to prioritise. Please feel free to re-submit
> the patch for the next commitfest.

I'm a bit confused about your use of "consensus". True, there was no
objection, but it looks like no patch author or reviewer was informed
(cc-ed or directly referenced) that the patch was being discussed
before achieving this "consensus", and the "consensus" was reached
within 4 days, of which 2 were weekend days, in a thread that has (until now)
involved only you and Peter E.

Usually, you'd expect discussion about a patch to happen on the
patch's thread before any action is taken (or at least a mention on
that thread), but quite clearly that hasn't happened here.
Are patch authors expected to follow any and all discussion on threads
with "Commitfest" in the title?
If so, shouldn't the relevant wiki pages be updated, and/or the
-hackers community be updated by mail in a new thread about these
policy changes?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://wiki.postgresql.org/wiki/Submitting_a_Patch




Re: GenBKI emits useless open;close for catalogs without rows

2023-09-01 Thread Matthias van de Meent
On Fri, 1 Sept 2023 at 19:43, Alvaro Herrera  wrote:
>
> On 2023-Sep-01, Matthias van de Meent wrote:
>
> > A potential addition to the patch would be to stop manually closing
> > relations: initdb and check-world succeed without manual 'close'
> > operations because the 'open' command auto-closes the previous open
> > relation (in boot_openrel). Testing also suggests that the last opened
> > relation apparently doesn't need closing - check-world succeeds
> > without issues (incl. with TAP enabled). That is therefore implemented
> > in attached patch 2 - it removes the 'close' syntax in its entirety.
>
> Hmm, what happens with the last relation in the bootstrap process?  Is
> closerel() called via some other path for that one?

There is a final cleanup() call that closes the last open boot_reldesc
relation (if any) at the end of BootstrapModeMain, after boot_yyparse
has completed and its changes have been committed.
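
For reference, that cleanup is roughly of this shape (paraphrased from
memory; check bootstrap.c for the real code):

static void
cleanup(void)
{
    if (boot_reldesc != NULL)
        closerel(NULL);     /* close the still-open boot relation */
}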

- Matthias




GenBKI emits useless open;close for catalogs without rows

2023-09-01 Thread Matthias van de Meent
Hi,

Whilst looking at PostgreSQL's bootstrapping process, I noticed that
postgres.bki contains quite a few occurrences of the pattern "open
$catname; close $catname".
I suppose this pattern isn't too expensive, but according to my
limited research a combined open+close cycle doesn't do anything
meaningful, so it does waste some CPU cycles in the process.
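
Concretely, for a catalog with no initial rows (pg_statistic, for
example - the catalog name here is just illustrative), the generated
file contains back-to-back lines of this shape:

open pg_statistic
close pg_statistic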

The attached patch 1 removes the occurrences of those combined
open/close statements in postgres.bki. Locally it passes
check-world, so I assume that opening and closing a table is indeed
not required for initializing a data-less catalog during
bootstrapping.

A potential addition to the patch would be to stop manually closing
relations: initdb and check-world succeed without manual 'close'
operations because the 'open' command auto-closes the previous open
relation (in boot_openrel). Testing also suggests that the last opened
relation apparently doesn't need closing - check-world succeeds
without issues (incl. with TAP enabled). That is therefore implemented
in attached patch 2 - it removes the 'close' syntax in its entirety.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


0002-Remove-the-bki-close-command.patch
Description: Binary data


0001-Stop-emitting-open-nodata-close-patterns-in-genbki.p.patch
Description: Binary data

