Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-20 Thread Matthias van de Meent
On Fri, 17 May 2024 at 15:02, Peter Eisentraut  wrote:
>
> On 17.05.24 09:32, Heikki Linnakangas wrote:
> > Dunno about having to click a link or sparkly gold borders, but +1 on
> > having a free-form text box for notes like that. Things like "cfbot says
> > this has bitrotted" or "Will review this after other patch this depends
> > on". On the mailing list, notes like that are both noisy and easily lost
> > in the threads. But as a little free-form text box on the commitfest, it
> > would be handy.
> >
> > One risk is that if we start to rely too much on that, or on the other
> > fields in the commitfest app for that matter, we de-value the mailing
> > list archives. I'm not too worried about it, the idea is that the
> > summary box just summarizes what's already been said on the mailing
> > list, or is transient information like "I'll get to this tomorrow"
> > that's not interesting to archive.
>
> We already have the annotations feature, which is kind of this.

But annotations are bound to mails in attached mail threads, rather
than a generic text input at the CF entry level. There isn't always an
appropriate link between (mail or in-person) conversations about the
patch, and a summary of that conversation.



The CommitFest App has several features, but contains little
information about how we're expected to use it. To start addressing
this limitation, I've just created a wiki page about the CFA [0], with
a handbook section. Feel free to extend or update the information as
appropriate; I've only added that information to the best of my
knowledge, so it may contain wrong, incomplete, and/or inaccurate
information.

Kind regards,

Matthias van de Meent

[0] https://wiki.postgresql.org/wiki/CommitFest_App




Re: WAL_LOG CREATE DATABASE strategy broken for non-standard page layouts

2024-05-13 Thread Matthias van de Meent
On Mon, 13 May 2024 at 16:13, Tom Lane  wrote:
>
> Matthias van de Meent  writes:
> > PFA a patch that fixes this issue, by assuming that all pages in the
> > source database utilize a non-standard page layout.
>
> Surely that cure is worse than the disease?

I don't know where we would get the information whether the selected
relation fork's pages are standard-compliant. We could base it off of
the fork number (that info is available locally) but that doesn't
guarantee much.
For VM and FSM pages we know they're essentially never
standard-compliant (hence this thread), but for the main fork it is
anyone's guess once the user has installed an additional AM - which we
don't detect nor pass through to the offending
RelationCopyStorageUsingBuffer.

As for "worse", the default template database is still much smaller
than the working set of most databases. This will indeed regress the
workload a bit, but only by the fraction of holes in the page + all
FSM/VM data.
I think the additional WAL volume during CREATE DATABASE is worth it
when the alternative is losing that data with physical
replication/secondary instances. Note that this does not disable page
compression, it just stops the logging of holes in pages; holes which
generally are only a fraction of the whole database.
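For illustration, here's a back-of-envelope model of that tradeoff; all the numbers (page counts, fill factor, header size) are assumptions for the sketch, not measurements:

```python
# Model the WAL volume of CREATE DATABASE ... STRATEGY=WAL_LOG for a small
# template database, before and after the fix. Illustrative numbers only.
BLCKSZ = 8192
heap_pages, fsm_vm_pages = 900, 100   # assumed template database shape
avg_hole = BLCKSZ // 4                # assume heap pages are ~75% full

# Buggy behaviour: every page is logged with hole elision. Heap pages merely
# skip their free space, but FSM/VM pages (whose payload sits where the
# "hole" would be) are logged as little more than a 24-byte header.
buggy = heap_pages * (BLCKSZ - avg_hole) + fsm_vm_pages * 24

# Fixed behaviour (page_std = false everywhere): full page images.
fixed = (heap_pages + fsm_vm_pages) * BLCKSZ

extra_fraction = fixed / buggy - 1
assert 0 < extra_fraction < 0.5   # a modest, bounded regression in this model
```

Under these assumptions the extra WAL is well under 50%, dominated by the holes in heap pages plus the FSM/VM payloads, which matches the "fraction of holes in the page + all FSM/VM data" estimate above.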

It's not inconceivable that this will significantly increase WAL
volume, but I think we should go for correctness rather than fastest
copy. If we went with fastest copy, we'd better just skip logging the
FSM and VM forks because we're already ignoring the data of the pages,
so why not ignore the pages themselves, too? I don't think that holds
water when we want to be crash-proof in CREATE DATABASE, with a full
data copy of the template database.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




WAL_LOG CREATE DATABASE strategy broken for non-standard page layouts

2024-05-13 Thread Matthias van de Meent
Hi,

My colleague Konstantin Knizhnik pointed out that we fail to mark pages
with a non-standard page layout with page_std=false in
RelationCopyStorageUsingBuffer when we WAL-log them. This causes us to
interpret the registered buffer as a standard buffer, and omit the
hole in the page, which for FSM/VM pages covers the whole page.
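To make the failure mode concrete, here's a small sketch of how hole elision treats such pages (Python model; the 8192-byte BLCKSZ and 24-byte page header are assumed values matching typical 64-bit builds):

```python
BLCKSZ = 8192
SIZE_OF_PAGE_HEADER = 24   # assumed; typical 64-bit SizeOfPageHeaderData

def bytes_logged(pd_lower, pd_upper, page_std):
    """Sketch of WAL hole elision: with page_std, the region between
    pd_lower and pd_upper is assumed to be an empty hole and is omitted
    from the registered page image."""
    if page_std and SIZE_OF_PAGE_HEADER <= pd_lower <= pd_upper <= BLCKSZ:
        return BLCKSZ - (pd_upper - pd_lower)
    return BLCKSZ

# FSM and VM pages leave pd_lower/pd_upper at their freshly-initialized
# values and store all their data inside what looks like the "hole":
assert bytes_logged(SIZE_OF_PAGE_HEADER, BLCKSZ, page_std=True) == 24
# ...so logging them as "standard" keeps only the header; logging them
# as non-standard keeps the whole page:
assert bytes_logged(SIZE_OF_PAGE_HEADER, BLCKSZ, page_std=False) == BLCKSZ
```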

The immediate effect of this bug is that replicas and primaries in a
physical replication system won't have the same data in their VM- and
FSM-forks until the first VACUUM on the new database has WAL-logged
these pages again. Whilst not actively harmful for the VM/FSM
subsystems, it's definitely suboptimal.
Secondary unwanted effects are that AMs that use the buffer cache, but
don't use or update the page header, also won't see the main data
logged in WAL, thus potentially losing user data in the physical
replication stream or with a system crash. I've not looked for any
such AMs and am unaware of any that would have this issue, but it's
better to fix this.


PFA a patch that fixes this issue, by assuming that all pages in the
source database utilize a non-standard page layout.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v1-0001-Fix-logging-of-non-standard-pages-in-RelationCopy.patch
Description: Binary data


Re: Allowing additional commas between columns, and at the end of the SELECT clause

2024-05-13 Thread Matthias van de Meent
On Mon, 13 May 2024 at 10:42, Artur Formella  wrote:
> Motivation:
> Commas of this type are allowed in many programming languages, in some
> it is even recommended to use them at the ends of lists or objects.

Single trailing commas are a feature that's more and more common in
languages, yes, but arbitrary excess commas are new to me. Could you
provide some examples of popular languages which have that? I can't
think of any.

> Accepted:
>  SELECT 1,;
>  SELECT 1,;
>  SELECT *, from information_schema.sql_features;
>  (...) RETURNING a,,b,c,;
>
> Not accepted:
>  SELECT ,;
>  SELECT ,1;
>  SELECT ,,,;
>
> Advantages:
> - simplifies the creation and debugging of queries by reducing the most
> common syntax error,
> - eliminates the need to use the popular `1::int as dummy` at the end of
> a SELECT list,

This is the first time I've heard of this `1 as dummy`.

> - simplifies query generators,
> - the query is still deterministic,

What part of a query would (or would not) be deterministic? I don't
think I understand the potential concern here. Is it about whether the
statement can be parsed deterministically?

> Disadvantages:
> - counting of returned columns can be difficult,
> - syntax checkers will still report errors,
> - probably not SQL standard compliant,

I'd argue you'd better raise this with the standards committee if this
isn't compliant. I don't see enough added value to break standard
compliance here, especially when the standard may at some point allow
only a single trailing comma (and not arbitrarily many).

> What do you think?

Do you expect `SELECT 1,,,` to have an equivalent query identifier
to `SELECT 1;` in pg_stat_statements? Why, or why not?

Overall, I don't think unlimited commas are a good feature. A trailing
comma in the select list would be less problematic, but I'd still want
to follow the standard first and foremost.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Comments about TLS (no SSLRequest) and ALPN

2024-05-12 Thread Matthias van de Meent
acks (assuming
your ssl library does its job correctly). What more would PostgreSQL
need to do?

> I assume that more and more TLS on the serverside will be handled by proxies

I see only negative value there: We have TLS to ensure end-to-end
connection security. A proxy in between adds overhead and negates this
security principle.

> Don't add extra flags

We can't add completely new configurable features without adding new
configuration options (i.e. flags) for those configurable features. If
you don't want new options, you're free to stay at older versions.

> Allow the user to specify ALPN
>
> I don't think this is particularly controversial or nuanced, so I don't have 
> much to say here - most TLS tools allow the user to specify ALPN for the same 
> reason they allow specifying the port number - either for privacy, 
> security-by-obscurity, or navigating some form of application or user routing.

As I mentioned above, I don't see any value, and only demerits, in the
psql client reporting support for anything other than the protocol
that PostgreSQL supports, i.e. the ALPN identifier "postgresql".

Kind regards,

Matthias van de Meent




Re: SQL:2011 application time

2024-05-12 Thread Matthias van de Meent
On Sun, 12 May 2024 at 05:26, Paul Jungwirth
 wrote:
> On 5/9/24 17:44, Matthias van de Meent wrote:
> > I haven't really been following this thread, but after playing around
> > a bit with the feature I feel there are new gaps in error messages. I
> > also think there are gaps in the functionality regarding the (lack of)
> > support for CREATE UNIQUE INDEX, and attaching these indexes to
> > constraints
> Thank you for trying this out and sharing your thoughts! I think these are 
> good points about CREATE
> UNIQUE INDEX and then creating the constraint by handing it an existing 
> index. This is something
> that I am hoping to add, but it's not covered by the SQL:2011 standard, so I 
> think it needs some
> discussion, and I don't think it needs to go into v17.

Okay.

> For instance you are saying:
>
>  > pg=# CREATE UNIQUE INDEX ON temporal_testing USING gist (id, valid_during);
>  > ERROR:  access method "gist" does not support unique indexes
>
> To me that error message seems correct. The programmer hasn't said anything 
> about the special
> temporal behavior they are looking for.

But I showed that I had a GIST index that does have the indisunique
flag set, which shows that GIST does support indexes with unique
semantics.

That I can't use CREATE UNIQUE INDEX to create such an index doesn't
mean the feature doesn't exist, which is what the error message
implies.

> To get non-overlapping semantics from an index, this more
> explicit syntax seems better, similar to PKs in the standard:

Yes, agreed on that part.

>  > pg=# CREATE UNIQUE INDEX ON temporal_testing USING gist (id, valid_during 
> WITHOUT OVERLAPS);
>  > ERROR:  access method "gist" does not support unique indexes
>
> We could also support *non-temporal* unique GiST indexes, particularly now 
> that we have the stratnum
> support function. Those would use the syntax you gave, omitting WITHOUT 
> OVERLAPS. But that seems
> like a separate effort to me.

No objection on that.

Kind regards,

Matthias van de Meent




Re: Use generation memory context for tuplestore.c

2024-05-10 Thread Matthias van de Meent
On Sat, 4 May 2024 at 04:02, David Rowley  wrote:
>
> On Sat, 4 May 2024 at 03:51, Matthias van de Meent
>  wrote:
> > Was a bump context considered? If so, why didn't it make the cut?
> > If tuplestore_trim is the only reason why the type of context in patch
> > 2 is a generation context, then couldn't we make the type of context
> > conditional on state->eflags & EXEC_FLAG_REWIND, and use a bump
> > context if we require rewind capabilities (i.e. where _trim is never
> > effectively executed)?
>
> I didn't really want to raise all this here, but to answer why I
> didn't use bump...
>
> There's a bit more that would need to be done to allow bump to work in
> use-cases where no trim support is needed. Namely, if you look at
> writetup_heap(), you'll see a heap_free_minimal_tuple(), which is
> pfreeing the memory that was allocated for the tuple in either
> tuplestore_puttupleslot(), tuplestore_puttuple() or
> tuplestore_putvalues().   So basically, what happens if we're still
> loading the tuplestore and we've spilled to disk, once the
> tuplestore_put* function is called, we allocate memory for the tuple
> that might get stored in RAM (we don't know yet), but then call
> tuplestore_puttuple_common() which decides if the tuple goes to RAM or
> disk, then because we're spilling to disk, the write function pfree's
> the memory we allocate in the tuplestore_put function after the tuple
> is safely written to the disk buffer.

Thanks, that's exactly the non-obvious issue I was looking for, but
couldn't find immediately.
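To spell out the flow described above, a toy model (hypothetical names, heavily simplified): every put copies the tuple into the store's context, and once the store has spilled, that copy is immediately written out and freed again, which is the pfree a bump context couldn't provide.

```python
class ToyTuplestore:
    """Toy model of the tuplestore_put* path: the tuple is always copied
    ("palloc'd") into the store's tuple context first; if the store has
    already spilled to disk, writetup writes and then frees that copy."""
    def __init__(self, mem_limit):
        self.mem_tuples, self.disk_tuples = [], []
        self.mem_limit = mem_limit
        self.wasted_copies = 0   # alloc+free pairs; bump contexts can't pfree

    def put(self, tup):
        copy = tuple(tup)                    # "palloc" in the tuple context
        if len(self.mem_tuples) >= self.mem_limit:
            self.disk_tuples.append(copy)    # "writetup" to the disk buffer
            self.wasted_copies += 1          # then "pfree" the copy
        else:
            self.mem_tuples.append(copy)

ts = ToyTuplestore(mem_limit=2)
for t in [(1,), (2,), (3,), (4,)]:
    ts.put(t)
assert ts.wasted_copies == 2   # every spilled tuple was allocated, then freed
```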

> This is a fairly inefficient design.  While, we do need to still form
> a tuple and store it somewhere for tuplestore_putvalues(), we don't
> need to do that for a heap tuple. I think it should be possible to
> write directly from the table's page.

[...]

> I'd rather tackle these problems independently and I believe there are
> much bigger wins to moving from aset to generation than generation to
> bump, so that's where I've started.

That's entirely reasonable, and I wouldn't ask otherwise.

Thanks,

Matthias van de Meent




Re: SQL:2011 application time

2024-05-09 Thread Matthias van de Meent
Hi,

I haven't really been following this thread, but after playing around
a bit with the feature I feel there are new gaps in error messages. I
also think there are gaps in the functionality regarding the (lack of)
support for CREATE UNIQUE INDEX, and attaching these indexes to
constraints.

pg=# CREATE TABLE temporal_testing (
pg(#  id bigint NOT NULL
pg(#generated always as identity,
pg(#  valid_during tstzrange
pg(# );
CREATE TABLE
pg=# ALTER TABLE temporal_testing
pg-#  ADD CONSTRAINT temp_unique UNIQUE (id, valid_during WITHOUT OVERLAPS);
ALTER TABLE
pg=# \d+ temp_unique
             Index "public.temp_unique"
    Column    |    Type     | Key? |  Definition  | Storage  | Stats target
--------------+-------------+------+--------------+----------+--------------
 id           | gbtreekey16 | yes  | id           | plain    |
 valid_during | tstzrange   | yes  | valid_during | extended |
unique, gist, for table "public.temporal_testing"
-- ^^ note the "unique, gist"
pg=# CREATE UNIQUE INDEX ON temporal_testing USING gist (id, valid_during);
ERROR:  access method "gist" does not support unique indexes

Here we obviously have a unique GIST index in the catalogs, but
they're "not supported" by GIST when we try to create such index
ourselves (!). Either the error message needs updating, or we need to
have a facility to actually support creating these unique indexes
outside constraints.

Additionally, because I can't create my own non-constraint-backing
unique GIST indexes, I can't pre-create my unique constraints
CONCURRENTLY as one could do for the non-temporal case: UNIQUE
constraints hold ownership of the index and would drop the index if
the constraint is dropped, too, and don't support a CONCURRENTLY
modifier, nor an INVALID modifier. This means temporal unique
constraints have much less administrative wiggle room than normal
unique constraints, and I think that's not great.

Kind regards,

Matthias van de Meent.




Re: Parallel CREATE INDEX for GIN indexes

2024-05-09 Thread Matthias van de Meent
On Thu, 9 May 2024 at 15:13, Tomas Vondra  wrote:
> Let me explain the relevant part of the patch, and how I understand the
> improvement suggested by Matthias. The patch does the work in three phases:
>
>
> 1) Worker gets data from table, split that into index items and add
> those into a "private" tuplesort, and finally sorts that. So a worker
> may see a key many times, with different TIDs, so the tuplesort may
> contain many items for the same key, with distinct TID lists:
>
>key1: 1, 2, 3, 4
>key1: 5, 6, 7
>key1: 8, 9, 10
>key2: 1, 2, 3
>...

This step is actually split into several components/phases, too.
As opposed to btree, which directly puts each tuple's data into the
tuplesort, this GIN approach actually buffers the tuples in memory to
generate these TID lists for data keys, and flushes these pairs of Key
+ TID list into the tuplesort when its own memory limit is exceeded.
That means we essentially double the memory used for this data: One
GIN deform buffer, and one in-memory sort buffer in the tuplesort.
This is fine for now, but feels duplicative, hence my "let's allow
tuplesort to merge the key+TID pairs into pairs of key+TID list"
comment.
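As a rough sketch of that first phase (simplified model with hypothetical names; the real code works on keys and ItemPointers, and its memory accounting is byte-based rather than a simple entry count):

```python
from collections import defaultdict

def worker_phase1(index_entries, mem_limit):
    """Buffer TIDs per key in memory, and flush sorted (key, tid_list)
    runs into the tuplesort whenever the buffer exceeds its budget."""
    buf = defaultdict(list)
    buffered = 0
    runs = []                          # stand-in for tuplesort input runs
    for key, tid in index_entries:
        buf[key].append(tid)
        buffered += 1
        if buffered >= mem_limit:      # "memory limit exceeded": flush
            runs.append(sorted(buf.items()))
            buf = defaultdict(list)
            buffered = 0
    if buf:
        runs.append(sorted(buf.items()))
    return runs

runs = worker_phase1([("key1", 1), ("key1", 2), ("key2", 3), ("key1", 4)], 3)
assert runs == [[("key1", [1, 2]), ("key2", [3])], [("key1", [4])]]
```

The doubled memory use is visible here: the `buf` dict plays the role of the GIN deform buffer, on top of whatever the tuplesort itself buffers in memory.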

> The trouble with (2) is that it "just copies" data from one tuplesort
> into another, increasing the disk space requirements. In an extreme
> case, when nothing can be combined, it pretty much doubles the amount of
> disk space, and makes the build longer.
>
> What I think Matthias is suggesting, is that this "TID list merging"
> could be done directly as part of the tuplesort in step (1). So instead
> of just moving the "sort tuples" from the appropriate runs, it could
> also do an optional step of combining the tuples and writing this
> combined tuple into the tuplesort result (for that worker).

Yes, but with a slightly more extensive approach than that even, see above.

> Matthias also mentioned this might be useful when building btree indexes
> with key deduplication.
>
> AFAICS this might work, although it probably requires for the "combined"
> tuple to be smaller than the sum of the combined tuples (in order to fit
> into the space).

*must not be larger than the sum; not "must be smaller than the sum" [^0].
For btree tuples with posting lists this is guaranteed to be true: The
added size of a btree tuple with a posting list (containing at least 2
values) vs one without is the maxaligned size of 2 TIDs, or 16 bytes
(12 on 32-bit systems). The smallest btree tuple with data is also 16
bytes (or 12 bytes on 32-bit systems), so this works out nicely.
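The arithmetic can be checked quickly (assuming 6-byte ItemPointers and the usual 8-byte/4-byte MAXALIGN):

```python
def maxalign(size, alignof=8):
    """MAXALIGN-style rounding up to the platform's maximum alignment."""
    return (size + alignof - 1) & ~(alignof - 1)

ITEM_POINTER_SIZE = 6   # sizeof(ItemPointerData)

# Added size of a posting list holding 2 TIDs, maxaligned:
assert maxalign(2 * ITEM_POINTER_SIZE) == 16              # 64-bit systems
assert maxalign(2 * ITEM_POINTER_SIZE, alignof=4) == 12   # 32-bit systems
# ...which does not exceed the smallest btree tuple with data, so the
# combined tuple is never larger than the sum of the tuples it replaces.
```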

> But at least in the GIN build in the workers this is
> likely true, because the TID lists do not overlap (and thus not hurting
> the compressibility).
>
> That being said, I still see this more as an optimization than something
> required for the patch,

Agreed.

> and I don't think I'll have time to work on this
> anytime soon. The patch is not extremely complex, but it's not trivial
> either. But if someone wants to take a stab at extending tuplesort to
> allow this, I won't object ...

Same here: While I do have some ideas on where and how to implement
this, I'm not planning on working on that soon.


Kind regards,

Matthias van de Meent

[^0] There's some overhead in the tuplesort serialization too, so
there is some leeway there as well.




Expand applicability of aggregate's sortop optimization

2024-05-08 Thread Matthias van de Meent
Hi,

As you may know, aggregates like SELECT MIN(unique1) FROM tenk1; are
rewritten as SELECT unique1 FROM tenk1 ORDER BY unique1 USING < LIMIT
1; by using the optional sortop field in the aggregator.
However, this optimization is disabled for clauses that themselves have
an ORDER BY clause such as `MIN(unique1 ORDER BY ), because
 can cause reordering of distinguishable values like 1.0 and
1.00, which then causes measurable differences in the output. In the
general case, that's a good reason to not apply this optimization, but
in some cases, we could still apply the index optimization.

One of those cases is fixed in the attached patch: if we order by the
same column that we're aggregating, using the same order class as the
aggregate's sort operator (i.e. the aggregate's sortop is in the same
btree opclass as the ORDER BY's sort operator), then we can still use
the index operation: The sort behaviour isn't changed, thus we can
apply the optimization.

PFA the small patch that implements this.

Note that we can't blindly accept just any ordering by the same
column: If we had an opclass that sorted numeric values by the length
of the significant/mantissa, then that'd provide a different (and
distinct) ordering from that which is expected by the normal
min()/max() aggregates for numeric, which could cause us to return
arguably incorrect results for the aggregate expression.
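A minimal model of the safety check this implies (hypothetical operator and opfamily names; the real patch consults the catalogs via get_op_btree_interpretation and op_in_opfamily):

```python
# Each btree opfamily defines one consistent ordering, so two operators in
# the same family are guaranteed to sort values identically.
BTREE_OPFAMILIES = {
    "numeric_ops": {"<", ">"},      # ordinary numeric ordering
    "mantissa_len_ops": {"<<<"},    # hypothetical: order by mantissa length
}

def order_matches_aggregate(agg_sortop, order_by_op):
    """The optimization is safe iff the ORDER BY sorts by the aggregated
    column using an operator from the same btree opfamily as the
    aggregate's sortop."""
    return any(agg_sortop in fam and order_by_op in fam
               for fam in BTREE_OPFAMILIES.values())

assert order_matches_aggregate("<", "<")        # MIN(x ORDER BY x): index OK
assert not order_matches_aggregate("<", "<<<")  # different ordering: unsafe
```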

Alternatively, the current code could be changed to build indexed
paths that append the SORT BY paths to the aggregate's sort operator,
but that'd significantly increase complexity here.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
From 215600d12164f214aae8f0de94b16373bd202d69 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Thu, 25 Apr 2024 15:23:57 +0200
Subject: [PATCH v1] Plan min()/max()-like aggregates with index accesses in
 more cases

When the aggregate's operator is included in the ORDER BY's ordering
opclass, we know they have a common ordering, and thus should get the
same output (or at least same consistency guarantee for that output).

So, check if the ORDER BY orders things by only the aggregated
expression, and check if that ordering shares a btree opclass with
the aggregator's operator. If so, we can use the index.
---
 src/backend/optimizer/plan/planagg.c | 87 
 src/test/regress/expected/aggregates.out | 65 ++
 src/test/regress/sql/aggregates.sql  | 18 +
 3 files changed, 156 insertions(+), 14 deletions(-)

diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index afb5445b77..d8479fe286 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -253,6 +253,16 @@ can_minmax_aggs(PlannerInfo *root, List **context)
 		if (list_length(aggref->args) != 1)
 			return false;		/* it couldn't be MIN/MAX */
 
+		/*
+		 * We might implement the optimization when a FILTER clause is present
+		 * by adding the filter to the quals of the generated subquery.  For
+		 * now, just punt.
+		 */
+		if (aggref->aggfilter != NULL)
+			return false;
+
+		curTarget = (TargetEntry *) linitial(aggref->args);
+
 		/*
 		 * ORDER BY is usually irrelevant for MIN/MAX, but it can change the
 		 * outcome if the aggsortop's operator class recognizes non-identical
@@ -267,22 +277,71 @@ can_minmax_aggs(PlannerInfo *root, List **context)
 		 * quickly.
 		 */
 		if (aggref->aggorder != NIL)
-			return false;
+		{
+			SortGroupClause *orderClause;
+
+			/*
+			 * If the order clause is the same column as the one we're
+			 * aggregating, we can still use the index: It is undefined which
+			 * value is MIN() or MAX(), as well as which value is first or
+			 * last when sorted. So, we can still use the index IFF the
+			 * aggregated expression equals the expression used in the
+			 * ordering operation.
+			 */
+
+			/*
+			 * We only accept a single argument to min/max aggregates,
+			 * orderings that have more clauses won't provide correct results.
+			 */
+			if (list_length(aggref->aggorder) > 1)
+				return false;
+
+			orderClause = castNode(SortGroupClause, linitial(aggref->aggorder));
+
+			if (orderClause->tleSortGroupRef != curTarget->ressortgroupref)
+				return false;
+
+			aggsortop = fetch_agg_sort_op(aggref->aggfnoid);
+			if (!OidIsValid(aggsortop))
+				return false;		/* not a MIN/MAX aggregate */
+
+			if (orderClause->sortop != aggsortop)
+			{
+				List	   *btclasses;
+				ListCell   *cell;
+				bool		match = false;
+
+				btclasses = get_op_btree_interpretation(orderClause->sortop);
+
+				foreach(cell, btclasses)
+				{
+					OpBtreeInterpretation *interpretation;
+					interpretation = (OpBtreeInterpretation *) lfirst(cell);
+
+					if (!match)
+					{
+						if (op_in_opfamily(aggsortop, interpretation->opfamily_id))
+							match = true;
+					}
+					/* Now useless */
+					pfree(interpretation);
+				}
+
+				/* Now not useful anymore *

Re: Use generation memory context for tuplestore.c

2024-05-03 Thread Matthias van de Meent
On Fri, 3 May 2024 at 15:55, David Rowley  wrote:
>
> (40af10b57 did this for tuplesort.c, this is the same, but for tuplestore.c)
>
> I was looking at the tuplestore.c code a few days ago and noticed that
> it allocates tuples in the memory context that tuplestore_begin_heap()
> is called in, which for nodeMaterial.c, is ExecutorState.
>
> I didn't think this was great because:
> 1. Allocating many chunks in ExecutorState can bloat the context with
> many blocks worth of free'd chunks, stored on freelists that might
> never be reused for anything.
> 2. To clean up the memory, pfree must be meticulously called on each
> allocated tuple
> 3. ExecutorState is an aset.c context which isn't the most efficient
> allocator for this purpose.

Agreed on all counts.

> I've attached 2 patches:
>
> 0001:  Adds memory tracking to Materialize nodes, which looks like:
>
>  ->  Materialize (actual time=0.033..9.157 rows=1 loops=2)
>Storage: Memory  Maximum Storage: 10441kB
>
> 0002: Creates a Generation MemoryContext for storing tuples in tuplestore.
>
> Using generation has the following advantages:

[...]
> 6. Generation has a page-level freelist, so is able to reuse pages
> instead of freeing and mallocing another if tuplestore_trim() is used
> to continually remove no longer needed tuples. aset.c can only
> efficiently do this if the tuples are all in the same size class.

Was a bump context considered? If so, why didn't it make the cut?
If tuplestore_trim is the only reason why the type of context in patch
2 is a generation context, then couldn't we make the type of context
conditional on state->eflags & EXEC_FLAG_REWIND, and use a bump
context if we require rewind capabilities (i.e. where _trim is never
effectively executed)?

> master @ 8f0a97dff
> Storage: Memory  Maximum Storage: 16577kB
>
> patched:
> Storage: Memory  Maximum Storage: 8577kB

Those are some impressive numbers.
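The roughly-halved number is consistent with aset.c's power-of-two rounding when the stored tuples sit just past a boundary. A quick model (simplified: large allocations take a different path in the real allocator):

```python
def aset_chunk_size(request):
    """aset.c-style rounding of small chunk requests up to the next
    power of two (sketch; minimum chunk of 8 bytes assumed)."""
    size = 8
    while size < request:
        size <<= 1
    return size

# A tuple just past a power-of-two boundary nearly doubles its footprint,
# consistent with the ~2x drop reported above; a generation context
# charges close to the requested size instead.
assert aset_chunk_size(1040) == 2048
assert aset_chunk_size(64) == 64
```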

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Why is FOR ORDER BY function getting called when the index is handling ordering?

2024-05-02 Thread Matthias van de Meent
On Thu, 2 May 2024 at 18:50, Chris Cleveland  wrote:
>
> Sorry to have to ask for help here, but no amount of stepping through code is 
> giving me the answer.
>
> I'm building an index access method which supports an ordering operator:
[...]
> so there's no reason the system needs to know the actual float value returned 
> by rank_match(), the ordering operator distance function. In any case, that 
> value can only be calculated based on information in the index itself, and 
> can't be calculated by rank_match().
>
> Nevertheless, the system calls rank_match() after every call to amgettuple(), 
> and I can't figure out why. I've stepped through the code, and it looks like 
> it has something to do with ScanState.ps.ps.ps_ProjInfo, but I can't figure 
> out where or why it's getting set.
>
> Here's a sample query. I have not found a query that does *not* call 
> rank_match():
[...]
> I'd be grateful for any help or insights.

The ordering clause produces a junk column that's used to keep track
of the ordering, and because it's a projected column (not the indexed
value, but an expression over that column) the executor will execute
that projection. This happens regardless of its use in downstream
nodes, due to planner or executor limitations.

See also Heikki's thread over at [0], and a comment of me about the
same issue over at pgvector's issue board at [1].

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/2ca5865b-4693-40e5-8f78-f3b45d5378fb%40iki.fi
[1] https://github.com/pgvector/pgvector/issues/359#issuecomment-1840786021




Re: Parallel CREATE INDEX for GIN indexes

2024-05-02 Thread Matthias van de Meent
 (but with many same-valued
tuples at once). Now that we have a tuplesort with the full table's
data, couldn't the code be adapted to do more efficient btree loading,
such as that seen in the nbtree code, where the rightmost pages are
cached and filled sequentially without requiring repeated searches
down the tree? I suspect we can gain a lot of time there.

I don't need you to do that, but what's your opinion on this?

> 2) v20240502-0002-Use-mergesort-in-the-leader-process.patch
>
> The approach implemented by 0001 works, but there's a little bit of
> issue - if there are many distinct keys (e.g. for trigrams that can
> happen very easily), the workers will hit the memory limit with only
> very short TID lists for most keys. For serial build that means merging
> the data into a lot of random places, and in parallel build it means the
> leader will have to merge a lot of tiny lists from many sorted rows.
>
> Which can be quite annoying and expensive, because the leader does so
> using qsort() in the serial part. It'd be better to ensure most of the
> sorting happens in the workers, and the leader can do a mergesort. But
> the mergesort must not happen too often - merging many small lists is
> not cheaper than a single qsort (especially when the lists overlap).
>
> So this patch changes the workers to process the data in two phases. The
> first works as before, but the data is flushed into a local tuplesort.
> And then each workers sorts the results it produced, and combines them
> into results with much larger TID lists, and those results are written
> to the shared tuplesort. So the leader only gets very few lists to
> combine for a given key - usually just one list per worker.

Hmm, I was hoping we could implement the merging inside the tuplesort
itself during its own flush phase, as it could save significantly on
IO, and could help other users of tuplesort with deduplication, too.

> 3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch
>
> In 0002 the workers still do an explicit qsort() on the TID list before
> writing the data into the shared tuplesort. But we can do better - the
> workers can do a merge sort too. To help with this, we add the first TID
> to the tuplesort tuple, and sort by that too - it helps the workers to
> process the data in an order that allows simple concatenation instead of
> the full mergesort.
>
> Note: There's a non-obvious issue due to parallel scans always being
> "sync scans", which may lead to very "wide" TID ranges when the scan
> wraps around. More about that later.

As this note seems to imply, this seems to have a strong assumption
that data received in parallel workers is always in TID order, with
one optional wraparound. Non-HEAP TAMs may break this assumption,
so what's the plan for handling those?

> 4) v20240502-0004-Compress-TID-lists-before-writing-tuples-t.patch
>
> The parallel build passes data between processes using temporary files,
> which means it may need significant amount of disk space. For BRIN this
> was not a major concern, because the summaries tend to be pretty small.
>
> But for GIN that's not the case, and the two-phase processing introduced
> by 0002 make it worse, because the worker essentially creates another
> copy of the intermediate data. It does not need to copy the key, so
> maybe it's not exactly 2x the space requirement, but in the worst case
> it's not far from that.
>
> But there's a simple way how to improve this - the TID lists tend to be
> very compressible, and GIN already implements a very light-weight TID
> compression, so this patch does just that - when building the tuple to
> be written into the tuplesort, we just compress the TIDs.

See note on 0002: Could we do this in the tuplesort writeback, rather
than by moving the data around multiple times?
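As a rough illustration of the kind of light-weight compression involved, here is a delta-plus-varbyte encoder. It is conceptually similar to GIN's posting-list compression, but it is not the exact on-disk format.

```python
def compress_tids(tids):
    # Delta-encode an ascending TID list, then varbyte-encode each
    # delta: 7 data bits per byte, high bit set means "more bytes follow".
    out = bytearray()
    prev = 0
    for tid in tids:
        delta = tid - prev
        prev = tid
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                break
    return bytes(out)

def decompress_tids(data):
    tids, prev, delta, shift = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:          # last byte of this delta
            prev += delta
            tids.append(prev)
            delta, shift = 0, 0
    return tids
```

Dense TID lists compress very well under this scheme, since most deltas fit in a single byte.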

[...]
> So 0007 does something similar - it tracks if the TID value goes
> backward in the callback, and if it does it dumps the state into the
> tuplesort before processing the first tuple from the beginning of the
> table. Which means we end up with two separate "narrow" TID list, not
> one very wide one.

See note above: we may still need a merge phase, just to make sure we
handle all TAM parallel scans correctly, even if that merge phase
wouldn't get hit in vanilla PostgreSQL.
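The wraparound-splitting behaviour of 0007 that this refers to can be sketched in a few lines (toy model; the real code flushes into a tuplesort rather than building lists):

```python
def split_wrapped_scan(tid_stream):
    # When a TID goes backward, the synchronized scan has wrapped past
    # the end of the table: flush the accumulated state and start a new
    # run. The result is two narrow sorted runs instead of one "wide"
    # run spanning nearly the whole TID range.
    runs, current = [], []
    prev = None
    for tid in tid_stream:
        if prev is not None and tid < prev:   # wraparound detected
            runs.append(current)
            current = []
        current.append(tid)
        prev = tid
    if current:
        runs.append(current)
    return runs
```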

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Possible to get LIMIT in an index access method?

2024-04-29 Thread Matthias van de Meent
On Mon, 29 Apr 2024 at 18:17, Chris Cleveland
 wrote:
>
> I'm developing an index access method.
>
> SELECT *
> FROM foo
> WHERE col <=> constant
> ORDER BY col <==> constant
> LIMIT 10;
>
> I'm successfully getting the WHERE and the ORDER BY clauses in my beginscan() 
> method. Is there any way to get the LIMIT (or OFFSET, for that matter)?

No, that is not possible.
The index AM does not know (*should* not know) about the visibility
state of indexed entries, except in those cases where the indexed
entries are dead to all running transactions. Additionally, it can't
(shouldn't) know about filters on columns that are not included in the
index. As such, pushing down limits into the index AM is only possible
in situations where you know that the table is fully visible (which
can't be guaranteed at runtime) and that no other quals on the table's
columns exist (which is possible, but unlikely to be useful).

GIN has one "solution" to this when you enable gin_fuzzy_search_limit
(default: disabled), where it throws an error if you try to extract
more results from the resultset after it's been exhausted, even though
the AM knows more results could exist.

> My access method is designed such that you have to fetch the entire result 
> set in one go. It's not streaming, like most access methods. As such, it 
> would be very helpful to know up front how many items I need to fetch from 
> the index.

Sorry, but I don't think we can know in advance how many tuples are
going to be extracted from an index scan.


Kind regards,

Matthias van de Meent.




Re: Extend ALTER DEFAULT PRIVILEGES for large objects

2024-04-26 Thread Matthias van de Meent
On Fri, 26 Apr 2024 at 10:54, Yugo NAGATA  wrote:
>
> On Wed, 24 Apr 2024 16:08:39 -0500
> Nathan Bossart  wrote:
>
> > On Tue, Apr 23, 2024 at 11:47:38PM -0400, Tom Lane wrote:
> > > On the whole I find this proposed feature pretty unexciting
> > > and dubiously worthy of the implementation/maintenance effort.
> >
> > I don't have any particularly strong feelings on $SUBJECT, but I'll admit
> > I'd be much more interested in resolving any remaining reasons folks are
> > using large objects over TOAST.  I see a couple of reasons listed in the
> > docs [0] that might be worth examining.
> >
> > [0] https://www.postgresql.org/docs/devel/lo-intro.html
>
> If we could replace large objects with BYTEA in any use cases, large objects
> would be completely obsolete. However, currently some users use large objects
> in fact, so improvement in this feature seems beneficial for them.
>
>
> Apart from that, extending TOAST to support more than 1GB of data and
> stream-style access seems a good challenge. I don't know if there was a
> proposal for this in the past. This is just a thought: for this purpose, we
> will need a new type of varlena that can contain large size information,
> and a new toast table schema that can store offset information, or some way
> to convert an offset to chunk_seq.

If you're interested in this, you may want to check out [0] and [1] as
threads on the topic of improving TOAST handling of large values ([1]
being a thread where the limitations of our current external TOAST
pointer became clear once more), and maybe talk with Aleksander
Alekseev and Nikita Malakhov. They've been working closely with
systems that involve toast pointers and their limitations.

The most recent update on the work of Nikita (reworking TOAST
handling) [2] is that he got started adapting their externally
pluggable toast into type-internal methods only, though I've not yet
noticed any updated patches appear on the list.

As for other issues with creating larger TOAST values:
TOAST has a value limit of ~1GB, which means a single large value (or
two, for that matter) won't break anything in the wire protocol, as
DataRow messages have a message size field of uint32 [^3]. However, if
we're going to allow even larger values to be stored in table's
attributes, we'll have to figure out how we're going to transfer those
larger values to (and from) clients. For large objects, this is much
less of an issue because the IO operations are already chunked by
design, but this may not work well for types that you want to use in
your table's columns.
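To put rough numbers on this, here is a back-of-the-envelope sketch of DataRow framing under the v3 wire protocol. The exact accounting of the length field is simplified here; the point is only the order of magnitude.

```python
UINT32_MAX = 2**32 - 1
VALUE_MAX = 1 << 30            # the ~1GB per-value limit discussed above

def datarow_size(col_sizes):
    # uint32 message length + int16 column count + a per-column int32
    # length word + the column payloads (simplified DataRow layout).
    return 4 + 2 + sum(4 + s for s in col_sizes)

assert datarow_size([VALUE_MAX]) <= UINT32_MAX       # one ~1GB value fits
assert datarow_size([VALUE_MAX] * 2) <= UINT32_MAX   # two still fit
assert datarow_size([VALUE_MAX] * 4) > UINT32_MAX    # ~4GB+ does not
```

So one or two maximum-size values per row survive the framing, but values much beyond the current TOAST limit would overflow a single uint32-framed message, which is why chunked transfer would become necessary.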

Kind regards,

Matthias van de Meent

[0] 
https://postgr.es/m/flat/CAN-LCVMq2X%3Dfhx7KLxfeDyb3P%2BBXuCkHC0g%3D9GF%2BJD4izfVa0Q%40mail.gmail.com
[1] 
https://postgr.es/m/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com
[2] 
https://postgr.es/m/CAN-LCVOgMrda9hOdzGkCMdwY6dH0JQa13QvPsqUwY57TEn6jww%40mail.gmail.com

[^3] Most, if not all PostgreSQL wire protocol messages have this
uint32 message size field, but the DataRow one is relevant here as
it's the one way users get their data out of the database.




Re: Statistics Import and Export

2024-04-24 Thread Matthias van de Meent
On Wed, 24 Apr 2024 at 21:31, Bruce Momjian  wrote:
>
> On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote:
> > I've heard of use cases where dumping stats without data would help
> > with production database planner debugging on a non-prod system.
> >
> > Sure, some planner inputs would have to be taken into account too, but
> > having an exact copy of production stats is at least a start and can
> > help build models and alerts for what'll happen when the tables grow
> > larger with the current stats.
> >
> > As for other planner inputs: table size is relatively easy to shim
> > with sparse files; cumulative statistics can be copied from a donor
> > replica if needed, and btree indexes only really really need to
> > contain their highest and lowest values (and need their height set
> > correctly).
>
> Is it possible to prevent stats from being updated by autovacuum

You can set autovacuum_analyze_threshold and *_scale_factor to
excessively high values, which has the effect of disabling autoanalyze
until it has had similarly excessive tuple churn. But that won't
guarantee autoanalyze won't run; that guarantee only exists with
autovacuum = off.

> and other methods?

No nice ways. AFAIK there is no command (or command sequence) that can
"disable" only ANALYZE and that also guarantees statistics won't be
updated until ANALYZE is manually "re-enabled" for that table. An
extension could maybe do this, but I'm not aware of any extension
points where this would hook into PostgreSQL in a nice way.

You can limit maintenance access on the table to only trusted roles
that you know won't go in and run ANALYZE for those tables, or even
only your superuser (so only they can run ANALYZE, and have them
promise they won't). Alternatively, you can also constantly keep a
lock on the table that conflicts with ANALYZE. The last few are just
workarounds though, and not all something I'd suggest running on a
production database.

Kind regards,

Matthias van de Meent




Re: Cleanup: remove unused fields from nodes

2024-04-24 Thread Matthias van de Meent
On Wed, 24 Apr 2024 at 09:28, Michael Paquier  wrote:
>
> On Tue, Apr 23, 2024 at 11:03:40PM -0400, Tom Lane wrote:
> > Hah.  Seems like the comment for isall needs to explain that it
> > exists for this purpose, so we don't make this mistake again.
>
> How about something like the attached?

LGTM.

Thanks,

Matthias




Re: Statistics Import and Export

2024-04-23 Thread Matthias van de Meent
On Tue, 23 Apr 2024, 05:52 Tom Lane,  wrote:
> Jeff Davis  writes:
> > On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
> >> Loading data without stats, and hoping
> >> that auto-analyze will catch up sooner not later, is exactly the
> >> current behavior that we're doing all this work to get out of.
>
> > That's the disconnect, I think. For me, the main reason I'm excited
> > about this work is as a way to solve the bad-plans-after-upgrade
> > problem and to repro planner issues outside of production. Avoiding the
> > need to ANALYZE at the end of a data load is also a nice convenience,
> > but not a primary driver (for me).
>
> Oh, I don't doubt that there are use-cases for dumping stats without
> data.  I'm just dubious about the reverse.  I think data+stats should
> be the default, even if only because pg_dump's default has always
> been to dump everything.  Then there should be a way to get stats
> only, and maybe a way to get data only.  Maybe this does argue for a
> four-section definition, despite the ensuing churn in the pg_dump API.

I've heard of use cases where dumping stats without data would help
with production database planner debugging on a non-prod system.

Sure, some planner inputs would have to be taken into account too, but
having an exact copy of production stats is at least a start and can
help build models and alerts for what'll happen when the tables grow
larger with the current stats.

As for other planner inputs: table size is relatively easy to shim
with sparse files; cumulative statistics can be copied from a donor
replica if needed, and btree indexes only really really need to
contain their highest and lowest values (and need their height set
correctly).
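The sparse-file shim mentioned above can be as simple as seeking past the end of a file and writing one byte; the apparent size grows without allocating the blocks. This is illustration only: a real shim would have to match the relation's segment-file layout, and whether blocks stay unallocated depends on the filesystem.

```python
import os
import tempfile

def make_sparse(path, size):
    # Create a file whose apparent size is `size` bytes while
    # allocating (almost) no disk blocks.
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")

path = os.path.join(tempfile.mkdtemp(), "fake_relation")
make_sparse(path, 1 << 30)      # pretend to be a 1 GiB relation fork
```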

Kind regards,

Matthias van de Meent




Re: Cleanup: remove unused fields from nodes

2024-04-22 Thread Matthias van de Meent
On Mon, 22 Apr 2024 at 17:41, Tom Lane  wrote:
>
> Matthias van de Meent  writes:
> > 0002/0004 remove fields in ExecRowMark which were added for FDWs to
> > use, but there are no FDWs which use this: I could only find two FDWs
> > who implement RefetchForeignRow, one being BlackHoleFDW, and the other
> > a no-op implementation in kafka_fdw [0]. We also don't seem to have
> > any testing on this feature.
>
> I'm kind of down on removing either of these.  ermExtra is explicitly
> intended for extensions to use, and just because we haven't found any
> users doesn't mean there aren't any, or might not be next week.

That's a good point, and also why I wasn't 100% sure removing it was a
good idea. I'm not quite sure why this would be used (rather than the
internal state of the FDW, or no state at all), but haven't looked
very deep into it either, so I'm quite fine with not changing that.

> Similarly, ermActive seems like it's at least potentially useful:
> is there another way for onlookers to discover that state?

The ermActive field is always true when RefetchForeignRow is called
(in ExecLockRows(), in nodeLockRows.c), and we don't seem to care
about the value of the field afterwards. Additionally, we always set
erm->curCtid to a valid value when ermActive is also first set in that
code path.
In all, it feels like a duplicative field with no real uses inside
PostgreSQL itself. If an extension (FDW) needs it, it should probably
use ermExtra instead, as ermActive seemingly doesn't carry any
meaningful value into the FDW call.

> I think it would be a good idea to push 0003 for v17, just so nobody
> grows an unnecessary dependency on that field.  0001 and 0005 could
> be left for v18, but on the other hand they're so trivial that it
> could also be sensible to just push them to get them out of the way.

Beta 1 isn't scheduled to be released for quite some time, so I don't
think there are any problems with fixing these kinds of minor issues in
the provisional ABI for v17.

Kind regards,

Matthias van de Meent




Cleanup: remove unused fields from nodes

2024-04-22 Thread Matthias van de Meent
Hi,

While working on the 'reduce nodeToString output' patch, I noticed
that my IDE marked one field in the TidScanState node as 'unused'.
After checking this seemed to be accurate, and I found a few more such
fields in Node structs.

PFA some patches that clean this up: 0001 is plain removal of fields
that are not accessed anywhere anymore, 0002 and up clean up fields
that are effectively write-only, with no effective use inside
PostgreSQL's own code, and no effective usage found on Debian code
search, nor Github code search.

I'm quite confident about the correctness of patches 1 and 3 (no usage
at all, and newly introduced with no meaningful uses), while patches
2, 4, and 5 could be considered 'as designed'.
For those last ones I have no strong opinion for removal or against
keeping them around, this is just to point out we can remove the
fields, as nobody seems to be using them.

/cc Tom Lane and Etsuro Fujita: 2 and 4 were introduced with your
commit afb9249d in 2015.
/cc Amit Kapila: 0003 was introduced with your spearheaded commit
6185c973 this year.

Kind regards,

Matthias van de Meent


0001 removes two old fields that are not in use anywhere anymore, but
at some point these were used.

0002/0004 remove fields in ExecRowMark which were added for FDWs to
use, but there are no FDWs which use this: I could only find two FDWs
who implement RefetchForeignRow, one being BlackHoleFDW, and the other
a no-op implementation in kafka_fdw [0]. We also don't seem to have
any testing on this feature.

0003 drops the input_finfo field on the new JsonExprState struct. It
wasn't read anywhere, so keeping it around makes little sense IMO.

0005 drops field DeallocateStmt.isall: the value of the field is
implied by !name, and that field was used as such.


[0] https://github.com/cohenjo/kafka_fdw/blob/master/src/kafka_fdw.c#L1793


v1-0002-Remove-field-ExecRowMark.ermActive.patch
Description: Binary data


v1-0003-Remove-field-JsonExprState.input_finfo.patch
Description: Binary data


v1-0005-Remove-field-DeallocateStmt.isall.patch
Description: Binary data


v1-0001-Remove-some-unused-fields-from-nodes-in-execnodes.patch
Description: Binary data


v1-0004-Remove-field-ExecRowMark.ermExtra.patch
Description: Binary data


Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-10 Thread Matthias van de Meent
On Wed, 3 Apr 2024 at 23:50, Tom Lane  wrote:
> I've pushed this after a good deal of cosmetic polishing -- for
> example, I spent some effort on making serializeAnalyzeReceive
> look as much like printtup as possible, in hopes of making it
> easier to keep the two functions in sync in future.

Upthread at [0], Stepan mentioned that we should default to SERIALIZE
when ANALYZE is enabled. I suspect a patch in that direction would
primarily contain updates in the test plan outputs, but I've not yet
worked on that.

Does anyone else have a strong opinion for or against adding SERIALIZE
to the default set of explain features enabled with ANALYZE?

I'll add this to "Decisions to Recheck Mid-Beta"-section if nobody has
a clear objection.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://postgr.es/m/ea885631-21f1-425a-97ed-c4bfb8cf9c63%40gmx.de




Re: Parallel Recovery in PostgreSQL

2024-04-09 Thread Matthias van de Meent
On Tue, 9 Apr 2024 at 13:12, 范润泽  wrote:
>
> Hi, Hackers
> I have observed that there has been a paucity of discussion concerning the 
> parallel replay of WAL logs.
> The most recent discourse on this subject was during the PGCon event in 2023, 
> where it was noted that PostgreSQL utilizes a single process for WAL replay.
> However, when configuring primary-secondary replication, there exists a 
> tangible scenario where the primary accumulates an excessive backlog that the 
> secondary cannot replay promptly.
> This situation prompts me to question whether it is pertinent to consider 
> integrating a parallel replay feature.
> Such a functionality could potentially mitigate the risk of delayed WAL 
> application on replicas and enhance overall system resilience and performance.
> I am keen to hear your thoughts on this issue and whether you share the view 
> that parallel WAL replay is a necessity that we should collaboratively 
> explore further.

I think we should definitely explore this further, yes.

Note that parallel WAL replay is not the only thing we can improve
here: A good part of WAL redo time is spent in reading and validating
the WAL records. If we can move that part to a parallel worker/thread,
that could probably improve the throughput by a good margin.

Then there is another issue with parallel recovery that I also called
out at PGCon 2023: You can't just reorder WAL records in a simple
page- or relation-based parallelization approach.

Indexes often contain transient information whose replay isn't easily
parallelizable with other relations' redo, due to dependencies on other
relations' pages that must be respected to guarantee correct query
results.

E.g., the index insertion of page 2 in one backend's operation order
{ 1) Page 1: heap insertion, 2) Page 2: index insert, 3) commit }
cannot be reordered to before the heap page insertion at 1), because
that would allow a concurrent index access, after replay of 2) but
before replay of 1), to see an "all-visible" page 1 in its index scan
while the heap tuple of the index entry hasn't been inserted yet.
Index-only scans could thus return invalid results.
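A toy model makes the hazard concrete. This is a hypothetical mini-redo sketch (dicts standing in for heap and index pages), not PostgreSQL code:

```python
def replay(records, heap, index):
    # Apply toy WAL records to a toy heap (tid -> tuple) and a toy
    # index (key -> tid).
    for rec in records:
        if rec[0] == "heap_insert":
            _, tid, value = rec
            heap[tid] = value
        elif rec[0] == "index_insert":
            _, key, tid = rec
            index[key] = tid

wal = [("heap_insert", 1, "hello"), ("index_insert", "k", 1)]

# Correct replay order: the index entry always points at a real tuple.
heap, index = {}, {}
replay(wal, heap, index)
assert index["k"] in heap

# Reordered replay: after the index record alone, a concurrent lookup
# would follow "k" -> tid 1 into a heap tuple that doesn't exist yet.
heap, index = {}, {}
replay(wal[1:], heap, index)
dangling = index["k"] not in heap
assert dangling
```

In other words, per-page parallelization is only safe if cross-page ordering constraints like this one are tracked and enforced.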

See also the wiki page [0] on parallel recovery, and Koichi-san's
repository [1] with his code for parallel replay based on PG 14.6.

Kind regards,

Matthias van de Meent

[0] https://wiki.postgresql.org/wiki/Parallel_Recovery
[1] https://github.com/koichi-szk/postgres/commits/parallel_replay_14_6/




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-09 Thread Matthias van de Meent
On Mon, 8 Apr 2024 at 20:15, Tomas Vondra  wrote:
>
>
> On 4/8/24 17:48, Matthias van de Meent wrote:
>> On Mon, 8 Apr 2024 at 17:21, Tomas Vondra  
>> wrote:
>>>
>>> Maybe it'd be better to start by expanding the existing rule about not
>>> committing patches introduced for the first time in the last CF.
>>
>> I don't think adding more hurdles about what to include into the next
>> release is a good solution. Why the March CF, and not earlier? Or
>> later? How about unregistered patches? Changes to the docs? Bug fixes?
>> The March CF already has a submission deadline of "before march", so
>> that already puts a soft limit on the patches considered for the april
>> deadline.
>>
>>> What if
>>> we said that patches in the last CF must not go through significant
>>> changes, and if they do it'd mean the patch is moved to the next CF?
>>
>> I also think there is already a big issue with a lack of interest in
>> getting existing patches from non-committers committed, reducing the
>> set of patches that could be considered is just cheating the numbers
>> and discouraging people from contributing. For one, I know I have
>> motivation issues keeping up with reviewing other people's patches
>> when none (well, few, as of this CF) of my patches get reviewed
>> materially and committed. I don't see how shrinking the window of
>> opportunity for significant review from 9 to 7 months is going to help
>> there.
>>
>
> I 100% understand how frustrating the lack of progress can be, and I
> agree we need to do better. I tried to move a number of stuck patches
> this CF, and I hope (and plan) to do more of this in the future.

Thanks in advance.

> But I don't quite see how would this rule modification change the
> situation for non-committers. AFAIK we already have the rule that
> (complex) patches should not appear in the last CF for the first time,
> and I'd argue that a substantial rework of a complex patch is not that
> far from a completely new patch. Sure, the reworks may be based on a
> thorough review, so there's a lot of nuance. But still, is it possible
> to properly review if it gets reworked at the very end of the CF?

Possible? Probably, if there is a good shared understanding of why the
previous patch version's approach didn't work well, and if the next
approach is well understood as well. But it's not likely, that I'll
agree with.

But the main issue I have with your suggestion is that the March
commitfest effectively contains all new patches starting from January,
and the leftovers not committed by February. If we start banning all
new patches and those with significant reworks from the March
commitfest, then combined with the lack of CF in May there wouldn't be
any chance for new patches in the first half of the year, and an
effective block on rewrites for 6 months- the next CF is only in July.
Sure, there is a bit of leeway there, as some patches get committed
before the commitfest they're registered in is marked active, but our
development workflow is already quite hostile to incidental
contributors' patches [^0], and increasing the periods in which
authors shouldn't expect their patches to be reviewed due to a major
release that's planned in >5 months is probably not going to help the
case.

> > So, I think that'd be counter-productive, as this would get the
> > perverse incentive to band-aid over (architectural) issues to limit
> > churn inside the patch, rather than fix those issues in a way that's
> > appropriate for the project as a whole.
> >
>
> Surely those architectural shortcomings should be identified in a review
> - which however requires time to do properly, and thus is an argument
> for ensuring there's enough time for such review (which is in direct
> conflict with the last-minute crush, IMO).
>
> Once again, I 100% agree we need to do better in handling patches from
> non-committers, absolutely no argument there. But does it require this
> last-minute crush? I don't think so, it can't be at the expense of
> proper review and getting it right.

Agreed on this, 100%, but as mentioned above, the March commitfest
contains more than just "last minute crushes" [^1]. I don't think we
should throw out the baby with the bathwater here.

> A complex patch needs to be
> submitted early in the cycle, not in the last CF. If it's submitted
> early, but does not receive enough interest/reviews, I think we need to
> fix & improve that - not to rework/push it at the very end.

Agree on all this, too.

-Matthias


[^0] (see e.g. the EXPLAIN SERIALIZE patch thread [0], where the
original author did not have the time capacity to maintain the
patchset over months:
https://www.postgres

Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Matthias van de Meent
On Mon, 8 Apr 2024 at 17:21, Tomas Vondra  wrote:
>
>
>
> On 4/8/24 16:59, Tom Lane wrote:
> > Heikki Linnakangas  writes:
> >> On 08/04/2024 16:43, Tom Lane wrote:
> >>> I was just about to pen an angry screed along the same lines.
> >>> The commit flux over the past couple days, and even the last
> >>> twelve hours, was flat-out ridiculous.  These patches weren't
> >>> ready a week ago, and I doubt they were ready now.
> >
> >> Can you elaborate, which patches you think were not ready? Let's make
> >> sure to capture any concrete concerns in the Open Items list.
> >
> > [ shrug... ]  There were fifty-some commits on the last day,
> > some of them quite large, and you want me to finger particular ones?
> > I can't even have read them all yet.
> >
> >> Yeah, I should have done that sooner, but realistically, there's nothing
> >> like a looming deadline as a motivator. One idea to avoid the mad rush
> >> in the future would be to make the feature freeze deadline more
> >> progressive. For example:
> >> April 1: If you are still working on a feature that you still want to
> >> commit, you must explicitly flag it in the commitfest as "I plan to
> >> commit this very soon".
> >> April 4: You must give a short status update about anything that you
> >> haven't committed yet, with an explanation of how you plan to proceed
> >> with it.
> >> April 5-8: Mandatory daily status updates, explicit approval by the
> >> commitfest manager needed each day to continue working on it.
> >> April 8: Hard feature freeze deadline
> >
> >> This would also give everyone more visibility, so that we're not all
> >> surprised by the last minute flood of commits.
> >
> > Perhaps something like that could help, but it seems like a lot
> > of mechanism.  I think the real problem is just that committers
> > need to re-orient their thinking a little.  We must get *less*
> > willing to commit marginal patches, not more so, as the deadline
> > approaches.
> >
>
> For me the main problem with the pre-freeze crush is that it leaves
> pretty much no practical chance to do meaningful review/testing, and
> some of the patches likely went through significant changes (at least
> judging by the number of messages and patch versions in the associated
> threads). That has to have a cost later ...
>
> That being said, I'm not sure the "progressive deadline" proposed by
> Heikki would improve that materially, and it seems like a lot of effort
> to maintain etc. And even if someone updates the CF app with all the
> details, does it even give others sufficient opportunity to review the
> new patch versions? I don't think so. (It anything, it does not seem
> fair to expect others to do last-minute reviews under pressure.)
>
> Maybe it'd be better to start by expanding the existing rule about not
> committing patches introduced for the first time in the last CF.

I don't think adding more hurdles about what to include into the next
release is a good solution. Why the March CF, and not earlier? Or
later? How about unregistered patches? Changes to the docs? Bug fixes?
The March CF already has a submission deadline of "before march", so
that already puts a soft limit on the patches considered for the april
deadline.

> What if
> we said that patches in the last CF must not go through significant
> changes, and if they do it'd mean the patch is moved to the next CF?

I also think there is already a big issue with a lack of interest in
getting existing patches from non-committers committed, reducing the
set of patches that could be considered is just cheating the numbers
and discouraging people from contributing. For one, I know I have
motivation issues keeping up with reviewing other people's patches
when none (well, few, as of this CF) of my patches get reviewed
materially and committed. I don't see how shrinking the window of
opportunity for significant review from 9 to 7 months is going to help
there.

So, I think that'd be counter-productive, as this would get the
perverse incentive to band-aid over (architectural) issues to limit
churn inside the patch, rather than fix those issues in a way that's
appropriate for the project as a whole.

-Matthias




Re: Add bump memory context type and use it for tuplesorts

2024-04-06 Thread Matthias van de Meent
On Sun, 7 Apr 2024, 01:59 David Rowley,  wrote:

> On Sun, 7 Apr 2024 at 05:45, Matthias van de Meent
>  wrote:
> > Malloc's docs specify the minimum chunk size at 4*sizeof(void*) and
> > itself uses , so using powers of 2 for chunks would indeed fail to detect
> > 1s in the 4th bit. I suspect you'll get different results when you check
> > the allocation patterns of multiples of 8 bytes, starting from 40,
> > especially on 32-bit arm (where MALLOC_ALIGNMENT is 8 bytes, rather than
> > the 16 bytes on i386 and 64-bit architectures, assuming  [0] is accurate)


> I'd hazard a guess that
> there are more instances of Postgres running on Windows today than on
> 32-bit CPUs and we don't seem too worried about the bit-patterns used
> for Windows.
>

Yeah, that is something I had some thoughts about too, but didn't check if
there was historical context around. I don't think it's worth bothering
right now though.

> >> Another reason not to make it 5 bits is that I believe that would make
> >> the mcxt_methods[] array 2304 bytes rather than 576 bytes.  4 bits
> >> makes it 1152 bytes, if I'm counting correctly.
> >
> >
> > I don't think I understand why this would be relevant when only 5 of the
> > contexts are actually in use (thus in caches). Is that size concern about
> > TLB entries then?
>
> It's a static const array. I don't want to bloat the binary with
> something we'll likely never need.  If we one day need it, we can
> reserve another bit using the same overlapping method.
>

Fair points.

> >> I revised the patch to simplify hdrmask logic.  This started with me
> >> having trouble finding the best set of words to document that the
> >> offset is "half the bytes between the chunk and block".  So, instead
> >> of doing that, I've just made it so these two fields effectively
> >> overlap. The lowest bit of the block offset is the same bit as the
> >> high bit of what MemoryChunkGetValue returns.
> >
> >
> > Works for me, I suppose.
>
> hmm. I don't detect much enthusiasm for it.
>

I had a tiring day leaving me short on enthusiasm, after which I realised
there were some things in this patch that would need fixing.

I could've worded this better, but nothing against this code.

-Matthias


Re: Add bump memory context type and use it for tuplesorts

2024-04-06 Thread Matthias van de Meent
On Sat, 6 Apr 2024, 14:36 David Rowley,  wrote:

> On Sat, 6 Apr 2024 at 02:30, Matthias van de Meent
>  wrote:
> >
> > On Thu, 4 Apr 2024 at 22:43, Tom Lane  wrote:
> > >
> > > Matthias van de Meent  writes:
> > > > It extends memory context IDs to 5 bits (32 values), of which
> > > > - 8 have glibc's malloc pattern of 001/010;
> > > > - 1 is unused memory's 0
> > > > - 1 is wipe_mem's 1
> > > > - 4 are used by existing contexts
> (Aset/Generation/Slab/AlignedRedirect)
> > > > - 18 are newly available.
> > >
> > > This seems like it would solve the problem for a good long time
> > > to come; and if we ever need more IDs, we could steal one more bit
> > > by requiring the offset to the block header to be a multiple of 8.
> > > (Really, we could just about do that today at little or no cost ...
> > > machines with MAXALIGN less than 8 are very thin on the ground.)
> >
> > Hmm, it seems like a decent idea, but I didn't want to deal with the
> > repercussions of that this late in the cycle when these 2 bits were
> > still relatively easy to get hold of.
>
> Thanks for writing the patch.
>
> I think 5 bits is 1 too many. 4 seems fine. I also think you've
> reserved too many slots in your patch as I disagree that we need to
> reserve the glibc malloc pattern anywhere but in the 1 and 2 slots of
> the mcxt_methods[] array.  I looked again at the 8 bytes prior to a
> glibc malloc'd chunk and I see the lowest 4 bits of the headers
> consistently set to 0001 for all powers of 2 starting at 8 up to
> 65536.


Malloc's docs specify the minimum chunk size at 4*sizeof(void*), so using
powers of 2 for chunk sizes would indeed fail to detect 1s in the 4th
bit. I suspect you'll get different results when you check the
allocation patterns of multiples of 8 bytes, starting from 40, especially
on 32-bit arm (where MALLOC_ALIGNMENT is 8 bytes, rather than the 16 bytes
on i386 and 64-bit architectures, assuming [0] is accurate)
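To make the alignment effect concrete, here's a small arithmetic sketch of
glibc's chunk-size rounding as described on that wiki page. The SIZE_SZ,
MALLOC_ALIGNMENT and minimum-chunk-size values are assumptions taken from
[0], not measured on real systems:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Chunk-size arithmetic per the glibc MallocInternals wiki: the stored
 * size is the request plus one size_t of overhead, rounded up to
 * MALLOC_ALIGNMENT, with a platform-specific minimum chunk size.
 */
static size_t
chunksize(size_t req, size_t size_sz, size_t align, size_t minsz)
{
	size_t		sz = (req + size_sz + align - 1) & ~(align - 1);

	return sz < minsz ? minsz : sz;
}
```

With 16-byte alignment (i386/64-bit), every chunk size is a multiple of 16,
so bit 3 (value 8) of the size field can never be set; with 8-byte alignment
(32-bit arm), bit 3 varies with the request size.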

> 131072 seems to vary and beyond that, they seem to be set to
> 0010.
>
>

In your updated 0001, you don't seem to fill the RESERVED_GLIBC memctx
array entries with BOGUS_MCTX().

With that, there's no increase in the number of reserved slots from
> what we have reserved today. Still 4.  So having 4 bits instead of 3
> bits gives us a total of 12 slots rather than 4 slots.  Having 3x
> slots seems enough.  We might need an extra bit for something else
> sometime. I think keeping it up our sleeve is a good idea.
>
> Another reason not to make it 5 bits is that I believe that would make
> the mcxt_methods[] array 2304 bytes rather than 576 bytes.  4 bits
> makes it 1152 bytes, if I'm counting correctly.
>

I don't think I understand why this would be relevant when only 5 of the
contexts are actually in use (thus in caches). Is that size concern about
TLB entries then?
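For what it's worth, the quoted sizes do work out under the assumption of
8-byte function pointers and a methods struct of 9 members; the struct below
is a simplified stand-in for PostgreSQL's MemoryContextMethods, used only to
check the arithmetic:

```c
#include <assert.h>

/*
 * Simplified stand-in for MemoryContextMethods: 9 function pointers,
 * i.e. 72 bytes on a typical 64-bit platform.
 */
typedef struct MethodsSim
{
	void		(*fn[9]) (void);
} MethodsSim;
```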


> I revised the patch to simplify hdrmask logic.  This started with me
> having trouble finding the best set of words to document that the
> offset is "half the bytes between the chunk and block".  So, instead
> of doing that, I've just made it so these two fields effectively
> overlap. The lowest bit of the block offset is the same bit as the
> high bit of what MemoryChunkGetValue returns.


Works for me, I suppose.
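The overlap David describes can be illustrated with a toy encoding. The
field widths here are made up for illustration and this is not the actual
MemoryChunk layout: an 8-bit word where the value occupies bits 0-3 and the
block offset bits 3-7, sharing bit 3, which is only consistent when the
offset's low bit equals the value's high bit:

```c
#include <assert.h>

/* Pack two overlapping fields into one word; valid only when the
 * offset's low bit equals the value's high bit. */
static unsigned
pack(unsigned offset, unsigned value)
{
	assert((offset & 1) == (value >> 3));
	return (offset << 3) | (value & 7);
}

static unsigned
unpack_value(unsigned word)
{
	return word & 0xF;			/* bits 0-3 */
}

static unsigned
unpack_offset(unsigned word)
{
	return (word >> 3) & 0x1F;	/* bits 3-7 */
}
```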

I also updated src/backend/utils/mmgr/README to explain this and
> adjust the mentions of 3-bits and 61-bits to 4-bits and 60-bits.  I
> also explained the overlapping part.
>

Thanks!

[0]
https://sourceware.org/glibc/wiki/MallocInternals#Platform-specific_Thresholds_and_Constants

>


Re: Add bump memory context type and use it for tuplesorts

2024-04-05 Thread Matthias van de Meent
On Thu, 4 Apr 2024 at 22:43, Tom Lane  wrote:
>
> Matthias van de Meent  writes:
> > On Mon, 25 Mar 2024 at 22:44, Tom Lane  wrote:
> >> Basically, I'm not happy with consuming the last reasonably-available
> >> pattern for a memory context type that has little claim to being the
> >> Last Context Type We Will Ever Want.  Rather than making a further
> >> dent in our ability to detect corrupted chunks, we should do something
> >> towards restoring the expansibility that existed in the original
> >> design.  Then we can add bump contexts and whatever else we want.
>
> > So, would something like the attached make enough IDs available so
> > that we can add the bump context anyway?
>
> > It extends memory context IDs to 5 bits (32 values), of which
> > - 8 have glibc's malloc pattern of 001/010;
> > - 1 is unused memory's 0
> > - 1 is wipe_mem's 1
> > - 4 are used by existing contexts (Aset/Generation/Slab/AlignedRedirect)
> > - 18 are newly available.
>
> This seems like it would solve the problem for a good long time
> to come; and if we ever need more IDs, we could steal one more bit
> by requiring the offset to the block header to be a multiple of 8.
> (Really, we could just about do that today at little or no cost ...
> machines with MAXALIGN less than 8 are very thin on the ground.)

Hmm, it seems like a decent idea, but I didn't want to deal with the
repercussions of that this late in the cycle when these 2 bits were
still relatively easy to get hold of.

> The only objection I can think of is that perhaps this would slow
> things down a tad by requiring more complicated shifting/masking.
> I wonder if we could redo the performance checks that were done
> on the way to accepting the current design.

I didn't do very extensive testing, but the light performance tests
that I did with the palloc performance benchmark patch & script shared
above didn't measure an observable negative effect.
An adapted version of the test that uses repalloc() to check
performance differences in MCXT_METHOD() doesn't show a significant
performance difference from master either. That test case is attached
as repalloc-performance-test-function.patch.txt.

The full set of patches would then accumulate to the attached v5 of
the patchset.
0001 is an update of my patch from yesterday, in which I update
MemoryContextMethodID infrastructure for more IDs, and use a new
naming scheme for unused/reserved IDs.
0002 and 0003 are David's patches, with minor changes to work with
0001 (rebasing, and I moved the location around to keep function
declaration in order with memctx ids)

Kind regards,

Matthias van de Meent
From b65320736cf63684b4710b3d63e243a0862d37c6 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 5 Apr 2024 15:13:52 +0200
Subject: [PATCH] repalloc performance test function

---
 src/backend/utils/adt/mcxtfuncs.c | 231 ++
 src/include/catalog/pg_proc.dat   |  12 ++
 2 files changed, 243 insertions(+)

diff --git a/src/backend/utils/adt/mcxtfuncs.c 
b/src/backend/utils/adt/mcxtfuncs.c
index 4d4a70915b..88de4bbcb5 100644
--- a/src/backend/utils/adt/mcxtfuncs.c
+++ b/src/backend/utils/adt/mcxtfuncs.c
@@ -15,8 +15,11 @@
 
 #include "postgres.h"
 
+#include 
+
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -185,3 +188,231 @@ pg_log_backend_memory_contexts(PG_FUNCTION_ARGS)
 
PG_RETURN_BOOL(true);
 }
+
+typedef struct AllocateTestNext
+{
+   struct AllocateTestNext *next;  /* ptr to the next allocation */
+} AllocateTestNext;
+
+/* #define ALLOCATE_TEST_DEBUG */
+/*
+ * pg_allocate_memory_test
+ * Used to test the performance of a memory context types
+ */
+Datum
+pg_allocate_memory_test(PG_FUNCTION_ARGS)
+{
+   int32   chunk_size = PG_GETARG_INT32(0);
+   int64   keep_memory = PG_GETARG_INT64(1);
+   int64   total_alloc = PG_GETARG_INT64(2);
+   text   *context_type_text = PG_GETARG_TEXT_PP(3);
+   char   *context_type;
+   int64   curr_memory_use = 0;
+   int64   remaining_alloc_bytes = total_alloc;
+   MemoryContext context;
+   MemoryContext oldContext;
+   AllocateTestNext   *next_free_ptr = NULL;
+   AllocateTestNext   *last_alloc = NULL;
+   clock_t start, end;
+
+   if (chunk_size < sizeof(AllocateTestNext))
+   elog(ERROR, "chunk_size (%d) must be at least %ld bytes", 
chunk_size,
+sizeof(AllocateTestNext));
+   if (keep_memory > total_alloc)
+   elog(ERROR, "keep_memory (" INT64_FORMAT ") must be less than 
total_alloc (" IN

Re: Add bump memory context type and use it for tuplesorts

2024-04-04 Thread Matthias van de Meent
On Mon, 25 Mar 2024 at 22:44, Tom Lane  wrote:
>
> David Rowley  writes:
> > On Tue, 26 Mar 2024 at 03:53, Tom Lane  wrote:
> >> Could we move the knowledge of exactly which context type it is out
> >> of the per-chunk header and keep it in the block header?
>
> > I wasn't 100% clear on your opinion about using 010 vs expanding the
> > bit-space. Based on the following it sounded like you were not
> > outright rejecting the idea of consuming the 010 pattern.
>
> What I said earlier was that 010 was the least bad choice if we
> fail to do any expansibility work; but I'm not happy with failing
> to do that.

Okay.

> Basically, I'm not happy with consuming the last reasonably-available
> pattern for a memory context type that has little claim to being the
> Last Context Type We Will Ever Want.  Rather than making a further
> dent in our ability to detect corrupted chunks, we should do something
> towards restoring the expansibility that existed in the original
> design.  Then we can add bump contexts and whatever else we want.

So, would something like the attached make enough IDs available so
that we can add the bump context anyway?

It extends memory context IDs to 5 bits (32 values), of which
- 8 have glibc's malloc pattern of 001/010;
- 1 is unused memory's 0
- 1 is wipe_mem's 1
- 4 are used by existing contexts (Aset/Generation/Slab/AlignedRedirect)
- 18 are newly available.
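That accounting can be sketched as a small counting exercise, under the
assumption that the glibc malloc patterns occupy the low three bits
(001/010) of each 5-bit ID, all-zeros marks unused (zeroed) memory, and
all-ones marks wipe_mem'd memory:

```c
#include <assert.h>

/* Count the 5-bit IDs that remain usable after excluding the
 * reserved bit patterns described above. */
static int
count_available_ids(void)
{
	int			available = 0;

	for (int id = 0; id < 32; id++)
	{
		int			low3 = id & 7;

		if (low3 == 1 || low3 == 2)
			continue;			/* 8 glibc malloc patterns */
		if (id == 0)
			continue;			/* unused (zeroed) memory */
		if (id == 31)
			continue;			/* wipe_mem'd (all-ones) memory */
		available++;
	}
	return available;
}
```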

Kind regards,

Matthias
From 4c0b4b1af98fcfecf80a30ea1834668b59d543a5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Thu, 4 Apr 2024 21:34:46 +0200
Subject: [PATCH v0] Add bitspace for more memory context types in
 MemoryChunk's hdrmask

Assuming we don't want to use patterns from unused memory, common
glibc malloc patterns, and wipe_mem-ed memory, we now have 18 new IDs
to work with again, from 0.

In passing, simplify/clean up initialization of unused memory contexts
in the mcxt_methods array.

Inspiration: 
https://postgr.es/m/CAApHDvqGSpCU95TmM%3DBp%3D6xjL_nLys4zdZOpfNyWBk97Xrdj2w%40mail.gmail.com
---
 src/backend/utils/mmgr/mcxt.c| 55 +++-
 src/include/utils/memutils_internal.h| 35 ---
 src/include/utils/memutils_memorychunk.h | 15 ---
 3 files changed, 73 insertions(+), 32 deletions(-)

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 5d426795d9..9373f782ce 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -102,26 +102,41 @@ static const MemoryContextMethods mcxt_methods[] = {
 * seems sufficient to provide routines for the methods that might get
 * invoked from inspection of a chunk (see MCXT_METHOD calls below).
 */
-
-   [MCTX_UNUSED1_ID].free_p = BogusFree,
-   [MCTX_UNUSED1_ID].realloc = BogusRealloc,
-   [MCTX_UNUSED1_ID].get_chunk_context = BogusGetChunkContext,
-   [MCTX_UNUSED1_ID].get_chunk_space = BogusGetChunkSpace,
-
-   [MCTX_UNUSED2_ID].free_p = BogusFree,
-   [MCTX_UNUSED2_ID].realloc = BogusRealloc,
-   [MCTX_UNUSED2_ID].get_chunk_context = BogusGetChunkContext,
-   [MCTX_UNUSED2_ID].get_chunk_space = BogusGetChunkSpace,
-
-   [MCTX_UNUSED3_ID].free_p = BogusFree,
-   [MCTX_UNUSED3_ID].realloc = BogusRealloc,
-   [MCTX_UNUSED3_ID].get_chunk_context = BogusGetChunkContext,
-   [MCTX_UNUSED3_ID].get_chunk_space = BogusGetChunkSpace,
-
-   [MCTX_UNUSED4_ID].free_p = BogusFree,
-   [MCTX_UNUSED4_ID].realloc = BogusRealloc,
-   [MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext,
-   [MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace,
+#define BOGUS_MCTX(id) \
+   [id].free_p = BogusFree, \
+   [id].realloc = BogusRealloc, \
+   [id].get_chunk_context = BogusGetChunkContext, \
+   [id].get_chunk_space = BogusGetChunkSpace
+
+   BOGUS_MCTX(MCTX_UNUSED1_ID),
+   BOGUS_MCTX(MCTX_UNUSED2_ID),
+   BOGUS_MCTX(MCTX_UNUSED3_ID),
+   BOGUS_MCTX(MCTX_UNUSED4_ID),
+   BOGUS_MCTX(MCTX_UNUSED5_ID),
+   BOGUS_MCTX(MCTX_UNUSED6_ID),
+   BOGUS_MCTX(MCTX_UNUSED7_ID),
+   BOGUS_MCTX(MCTX_UNUSED8_ID),
+   BOGUS_MCTX(MCTX_UNUSED9_ID),
+   BOGUS_MCTX(MCTX_UNUSED10_ID),
+   BOGUS_MCTX(MCTX_UNUSED11_ID),
+   BOGUS_MCTX(MCTX_UNUSED12_ID),
+   BOGUS_MCTX(MCTX_UNUSED13_ID),
+   BOGUS_MCTX(MCTX_UNUSED14_ID),
+   BOGUS_MCTX(MCTX_UNUSED15_ID),
+   BOGUS_MCTX(MCTX_UNUSED16_ID),
+   BOGUS_MCTX(MCTX_UNUSED17_ID),
+   BOGUS_MCTX(MCTX_UNUSED18_ID),
+   BOGUS_MCTX(MCTX_UNUSED19_ID),
+   BOGUS_MCTX(MCTX_UNUSED20_ID),
+   BOGUS_MCTX(MCTX_UNUSED21_ID),
+   BOGUS_MCTX(MCTX_UNUSED22_ID),
+   BOGUS_MCTX(MCTX_UNUSED23_ID),
+   BOGUS_MCTX(MCTX_UNUSED24_ID),
+   BOGUS_MCTX(MCTX_UNUSED25_ID),
+   BOGUS_MCTX(MCTX_UNUSED26_ID),
+   BOGUS_MCTX(MCTX_UNUSED27_ID),
+   BOGUS_MCTX(MCTX_UNUSED28_ID)
+#undef BOGUS_MCTX

Re: Experiments with Postgres and SSL

2024-04-04 Thread Matthias van de Meent
On Thu, 28 Mar 2024, 13:37 Heikki Linnakangas,  wrote:
>
> On 28/03/2024 13:15, Matthias van de Meent wrote:
> > On Tue, 5 Mar 2024 at 15:08, Heikki Linnakangas  wrote:
> >>
> >> I hope I didn't joggle your elbow reviewing this, Jacob, but I spent
> >> some time rebase and fix various little things:
> >
> > With the recent changes to backend startup committed by you, this
> > patchset has gotten major apply failures.
> >
> > Could you provide a new version of the patchset so that it can be
> > reviewed in the context of current HEAD?
>
> Here you are.

Sorry for the delay. I've run some tests and didn't find any specific
issues in the patchset.

I did get sidetracked on trying to further improve the test suite,
where I was trying to find out how to use Test::More::subtests, but
have now decided it's not worth the time lost versus adding this as a
feature in 17.

Some remaining comments:

patches 0001/0002: not reviewed in detail.

Patch 0003:

The read size in secure_raw_read is capped to port->raw_buf_remaining
if the raw buf has any data. While the user will probably call into
this function again, I think that's a waste of cycles.

pq_buffer_has_data now doesn't have any protections against
desynchronized state between PqRecvLength and PqRecvPointer. An
Assert(PqRecvLength >= PqRecvPointer) to that value would be
appreciated.

(in backend_startup.c)
> + elog(LOG, "Detected direct SSL handshake");

I think this should be gated at a lower log level, or behind a GUC, as this
would easily DOS a logfile through bulk sending of SSL handshake bytes.

0004:

backend_startup.c
> +if (!ssl_enable_alpn)
> +{
> +elog(WARNING, "Received direct SSL connection without 
> ssl_enable_alpn enabled");

This is too verbose, too.

> +   if (!port->alpn_used)
> +{
> +ereport(COMMERROR,
> +(errcode(ERRCODE_PROTOCOL_VIOLATION),
> + errmsg("Received direct SSL connection request without 
> required ALPN protocol negotiation extension")));

If ssl_enable_alpn is disabled, we shouldn't report a COMMERROR when
the client does indeed not have ALPN enabled.

0005:

As mentioned above, I'd have loved to use subtests here for the cube()
of tests, but I got in too much of a rabbit hole to get that done.

0006:

In CONNECTION_FAILED, we use connection_failed() to select whether we
need a new connection or stop trying altogether, but that function's
description states:

> + * Out-of-line portion of the CONNECTION_FAILED() macro
> + *
> + * Returns true, if we should retry the connection with different encryption 
> method.

Which to me reads like we should reuse the connection, and try a
different method on that same connection. Maybe we can improve the
wording to something like
+ * Returns true, if we should reconnect with a different encryption method.
to make the reconnect part more clear.

In select_next_encryption_method, there are several copies of this pattern:

if ((remaining_methods & ENC_METHOD) != 0)
{
conn->current_enc_method = ENC_METHOD;
    return true;
}

I think a helper macro would reduce the verbosity of the scaffolding,
like in the attached SELECT_NEXT_METHOD.diff.txt.


Kind regards,

Matthias van de Meent
 src/interfaces/libpq/fe-connect.c | 51 +++
 1 file changed, 20 insertions(+), 31 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c 
b/src/interfaces/libpq/fe-connect.c
index edc324dad0..ef95b07978 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4344,6 +4344,15 @@ select_next_encryption_method(PGconn *conn, bool 
have_valid_connection)
 {
int remaining_methods;
 
+#define SELECT_NEXT_METHOD(method) \
+   do { \
+   if ((remaining_methods & method) != 0) \
+   { \
+   conn->current_enc_method = method; \
+   return true; \
+   } \
+   } while (false)
+
remaining_methods = conn->allowed_enc_methods & 
~conn->failed_enc_methods;
 
/*
@@ -4373,20 +4382,14 @@ select_next_encryption_method(PGconn *conn, bool 
have_valid_connection)
}
}
}
-   if ((remaining_methods & ENC_GSSAPI) != 0)
-   {
-   conn->current_enc_method = ENC_GSSAPI;
-   return true;
-   }
}
+
+   SELECT_NEXT_METHOD(ENC_GSSAPI);
 #endif
 
/* With sslmode=allow, try plaintext connection before SSL. */
-   if (conn->sslmode[0] == 'a' && (remaining_methods & ENC_PLAINTEXT) != 0)
-   {
-   conn->current_enc_method = ENC_PLAINTEXT;
-   retur

Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-04 Thread Matthias van de Meent
On Wed, 3 Apr 2024 at 23:50, Tom Lane  wrote:
>
> Matthias van de Meent  writes:
>> On Tue, 2 Apr 2024 at 17:47, Tom Lane  wrote:
>>> IIUC, it's not possible to use the SERIALIZE option when explaining
>>> CREATE TABLE AS, because you can't install the instrumentation tuple
>>> receiver when the IntoRel one is needed.  I think that's fine because
>>> no serialization would happen in that case anyway, but should we
>>> throw an error if that combination is requested?  Blindly reporting
>>> that zero bytes were serialized seems like it'd confuse people.
>
>> I think it's easily explained as no rows were transfered to the
>> client. If there is actual confusion, we can document it, but
>> confusing disk with network is not a case I'd protect against. See
>> also: EXPLAIN (ANALYZE, SERIALIZE) INSERT without the RETURNING
>> clause.
>
> Fair enough.  There were a couple of spots in the code where I thought
> this was important to comment about, though.

Yeah, I'll agree with that.

>>> However, isn't the bottom half of serializeAnalyzeStartup doing
>>> exactly what the comment above it says we don't do, that is accounting
>>> for the RowDescription message?  Frankly I agree with the comment that
>>> it's not worth troubling over, so I'd just drop that code rather than
>>> finding a solution for the magic-number problem.
>
>> I'm not sure I agree with not including the size of RowDescription
>> itself though, as wide results can have a very large RowDescription
>> overhead; up to several times the returned data in cases where few
>> rows are returned.
>
> Meh --- if we're rounding off to kilobytes, you're unlikely to see it.
> In any case, if we start counting overhead messages, where shall we
> stop?  Should we count the eventual CommandComplete or ReadyForQuery,
> for instance?  I'm content to say that this measures data only; that
> seems to jibe with other aspects of EXPLAIN's behavior.

Fine with me.

> > Removed. I've instead added buffer usage, as I realised that wasn't
> > covered yet, and quite important to detect excessive detoasting (it's
> > not covered at the top-level scan).
>
> Duh, good catch.
>
> I've pushed this after a good deal of cosmetic polishing -- for
> example, I spent some effort on making serializeAnalyzeReceive
> look as much like printtup as possible, in hopes of making it
> easier to keep the two functions in sync in future.

Thanks for the review, and for pushing!

-Matthias




Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-03 Thread Matthias van de Meent
On Tue, 2 Apr 2024 at 17:47, Tom Lane  wrote:
>
> Matthias van de Meent  writes:
> > Attached is v9, which is rebased on master to handle 24eebc65's
> > changed signature of pq_sendcountedtext.
> > It now also includes autocompletion, and a second patch which adds
> > documentation to give users insights into this new addition to
> > EXPLAIN.
>
> I took a quick look through this.  Some comments in no particular
> order:

Thanks!

> Documentation is not optional, so I don't really see the point of
> splitting this into two patches.

I've seen the inverse several times, but I've merged them in the
attached version 10.

> IIUC, it's not possible to use the SERIALIZE option when explaining
> CREATE TABLE AS, because you can't install the instrumentation tuple
> receiver when the IntoRel one is needed.  I think that's fine because
> no serialization would happen in that case anyway, but should we
> throw an error if that combination is requested?  Blindly reporting
> that zero bytes were serialized seems like it'd confuse people.

I think it's easily explained as no rows were transfered to the
client. If there is actual confusion, we can document it, but
confusing disk with network is not a case I'd protect against. See
also: EXPLAIN (ANALYZE, SERIALIZE) INSERT without the RETURNING
clause.

> I'd lose the stuff about measuring memory consumption.  Nobody asked
> for that and the total is completely misleading, because in reality
> we'll reclaim the memory used after each row.  It would allow cutting
> the text-mode output down to one line, too, instead of having your
> own format that's not like anything else.

Done.

> I thought the upthread agreement was to display the amount of
> data sent rounded to kilobytes, so why is the code displaying
> an exact byte count?

Probably it was because the other explain code I referenced was using
bytes in the json/yaml format. Fixed.

> I don't especially care for magic numbers like these:
>
> +   /* see printtup.h why we add 18 bytes here. These are the 
> infos
> +* needed for each attribute plus the attribute's name */
> +   receiver->metrics.bytesSent += (int64) namelen + 1 + 18;
>
> If the protocol is ever changed in a way that invalidates this,
> there's about 0 chance that somebody would remember to touch
> this code.
> However, isn't the bottom half of serializeAnalyzeStartup doing
> exactly what the comment above it says we don't do, that is accounting
> for the RowDescription message?  Frankly I agree with the comment that
> it's not worth troubling over, so I'd just drop that code rather than
> finding a solution for the magic-number problem.

In the comment above I intended to explain that it takes negligible
time to serialize the RowDescription message (when compared to all
other tasks of explain), so skipping the actual writing of the message
would be fine.
I'm not sure I agree with not including the size of RowDescription
itself though, as wide results can have a very large RowDescription
overhead; up to several times the returned data in cases where few
rows are returned.

Either way, I've removed that part of the code.

> Don't bother with duplicating valgrind-related logic in
> serializeAnalyzeReceive --- that's irrelevant to actual users.

Removed. I've instead added buffer usage, as I realised that wasn't
covered yet, and quite important to detect excessive detoasting (it's
not covered at the top-level scan).

> This seems like cowboy coding:
>
> +   self->destRecevier.mydest = DestNone;
>
> You should define a new value of the CommandDest enum and
> integrate this receiver type into the support functions
> in dest.c.

Done.

> BTW, "destRecevier" is misspelled...

Thanks, fixed.


Kind regards,

Matthias van de Meent.


v10-0001-Explain-Add-SERIALIZE-option.patch
Description: Binary data


Re: Experiments with Postgres and SSL

2024-03-28 Thread Matthias van de Meent
On Tue, 5 Mar 2024 at 15:08, Heikki Linnakangas  wrote:
>
> I hope I didn't joggle your elbow reviewing this, Jacob, but I spent
> some time rebase and fix various little things:

With the recent changes to backend startup committed by you, this
patchset has gotten major apply failures.

Could you provide a new version of the patchset so that it can be
reviewed in the context of current HEAD?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: DRAFT: Pass sk_attno to consistent function

2024-03-21 Thread Matthias van de Meent
On Tue, 19 Mar 2024 at 17:00, Michał Kłeczek  wrote:
>
> Hi All,
>
> Since it looks like there is not much interest in the patch I will try to 
> provide some background to explain why I think it is needed.
>
[...]
> What we really need is for Gist to support “= ANY (…)”.
> As Gist index is extensible in terms of queries it supports it is quite easy 
> to implement an operator class/family with operator:
>
> ||= (text, text[])
>
> that has semantics the same as “= ANY (…)”

I've had a similar idea while working on BRIN, and was planning on
overloading `@>` for @>(anyarray, anyelement) or using a new
`@>>(anyarray, anyelement)` operator. No implementation yet, though.

> With this operator we can write our queries like:
>
> account_number ||= [list of account numbers] AND
> account_number = ANY ([list of account numbers]) — redundant for partition 
> pruning as it does not understand ||=
>
> and have optimal plans:
>
> Limit
> — Merge Append
> —— Index scan of relevant partitions
>
> The problem is that now each partition scan is for the same list of accounts.
> The “consistent” function cannot assume anything about contents of the table 
> so it has to check all elements of the list
> and that in turn causes reading unnecessary index pages for accounts not in 
> this partition.

You seem to already be using your own operator class, so you may want
to look into CREATE FUNCTION's support_function parameter; which would
handle SupportRequestIndexCondition and/or SupportRequestSimplify.
I suspect a support function that emits multiple clauses that each
apply to only a single partition at a time should get you quite far if
combined with per-partition constraints that filter all but one of
those. Also, at plan-time you can get away with much more than at
runtime.
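As a sketch of the pre-filtering idea itself, independent of GiST and of
any PostgreSQL API (all names here are hypothetical): dropping query-array
elements that cannot fall inside one partition's bounds before the scan
starts could look like this:

```c
#include <assert.h>

/*
 * Hypothetical helper, not PostgreSQL code: keep only the elements of
 * a query array that can fall inside one partition's range [lo, hi).
 */
static int
filter_for_partition(const int *keys, int nkeys, int lo, int hi, int *out)
{
	int			nout = 0;

	for (int i = 0; i < nkeys; i++)
	{
		if (keys[i] >= lo && keys[i] < hi)
			out[nout++] = keys[i];
	}
	return nout;
}
```

Each partition's scan then only has to consider the surviving elements,
avoiding index pages for accounts that can't be in that partition.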

> What we need is a way for the “consistent” function to be able to pre-process 
> input query array and remove elements
> that are not relevant for this scan. To do that it is necessary to have 
> enough information to read necessary metadata from the catalog.

> The proposed patch addresses this need and seems (to me) largely 
> uncontroversial as it does not break any existing extensions.

I don't think "my index operator class will go into the table
definition and check if parts of the scankey are consistent with the
table constraints" is a good reason to expose the index column to
operator classes.
Note that operator classes were built specifically so that they're
independent of their column position. It doesn't really make sense
to expose this. Maybe my suggestion up above helps?

Kind regards,

Matthias van de Meent




Re: 13dev failed assert: comparetup_index_btree(): ItemPointer values should never be equal

2024-03-21 Thread Matthias van de Meent
On Thu, 21 Mar 2024 at 07:17, Andrey M. Borodin  wrote:
>
> > On 29 Jun 2022, at 17:43, Robins Tharakan  wrote:
>
> Sorry to bump ancient thread, I have some observations that might or might 
> not be relevant.
> Recently we noticed a corruption on one of clusters. The corruption at hand 
> is not in system catalog, but in user indexes.
> The cluster was correctly configured: checksums, fsync, FPI etc.
> The cluster never was restored from a backup. It’s a single-node cluster, so 
> it was not ever promoted, pg_rewind-ed etc. VM had never been rebooted.
>
> But, the cluster had been experiencing 10 OOMs a day. There were no torn 
> pages, no checsum erros at log at all. Yet, B-tree indexes became corrupted.

Would you happen to have a PostgreSQL version number (or commit hash)
to help debugging? Has it always had that PG version, or has the
version been upgraded?

Considering the age of this thread, and thus potential for v14 pre-.4:
Did this cluster use REINDEX (concurrently) for the relevant indexes?


Kind regards,

Matthias van de Meent




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-03-20 Thread Matthias van de Meent
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan  wrote:
>
> On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent
>  wrote:
> > I've attached v14, where 0001 is v13, 0002 is a patch with small
> > changes + some large comment suggestions, and 0003 which contains
> > sorted merge join code for _bt_merge_arrays.
>
> This is part of my next revision, v15, which I've attached (along with
> a test case that you might find useful, explained below).
>
> v15 makes the difference between the non-required scan key trigger and
> required scan key trigger cases clearer within _bt_advance_array_keys.

OK, here's a small additional review, with a suggestion for additional
changes to _bt_preprocess:

> @@ -1117,6 +3160,46 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey 
> op,
>  /*
>  * The opfamily we need to worry about is identified by the index column.
>  */
> Assert(leftarg->sk_attno == rightarg->sk_attno);
>
> +/*
> + * If either leftarg or rightarg are equality-type array scankeys, we 
> need
> + * specialized handling (since by now we know that IS NULL wasn't used)
> + */
> [...]
> +}
> +
> opcintype = rel->rd_opcintype[leftarg->sk_attno - 1];

Here, you insert your code between the comment about which opfamily to
choose and the code assigning the opfamily. I think this needs some
cleanup.

> + * Don't call _bt_preprocess_array_keys_final in this fast path
> + * (we'll miss out on the single value array transformation, but
> + * that's not nearly as important when there's only one scan key)

Why is it OK to ignore it? Or, why don't we apply it here?

---

Attached 2 patches for further optimization of the _bt_preprocess_keys
path (on top of your v15), according to the following idea:

Right now, we do "expensive" processing with xform merging for all
keys when we have more than 1 keys in the scan. However, we only do
per-attribute merging of these keys, so if there is only one key for
any attribute, the many cycles spent in that loop are mostly wasted.
By checking for single-scankey attributes, we can short-track many
multi-column index scans because they usually have only a single scan
key per attribute.
The first implements that idea, the second reduces the scope of
various variables so as to improve compiler optimizability.

I'll try to work a bit more on merging the _bt_preprocess steps into a
single main iterator, but this is about as far as I got with clean
patches. Merging the steps for array preprocessing with per-key
processing and post-processing is proving a bit more complex than I'd
anticipated, so I don't think I'll be able to finish that before the
feature freeze, especially with other things that keep distracting me.

Matthias van de Meent
From 16808e85f5fae7b16fd52d8a2be8437e4cff8640 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Tue, 19 Mar 2024 16:39:29 +0100
Subject: [PATCH v1 1/2] nbtree: Optimize preprocessing for single ScanKey per
 column

Before, there was only a single fast path: single-ScanKey index scans.
With this optimization, we can fast-track processing of attributes with
only a single scankey, significantly reducing overhead for common scans
like "a = 10 and b < 8".
---
 src/backend/access/nbtree/nbtutils.c | 529 ++-
 1 file changed, 278 insertions(+), 251 deletions(-)

diff --git a/src/backend/access/nbtree/nbtutils.c 
b/src/backend/access/nbtree/nbtutils.c
index b401b31191..8cd6270408 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2511,7 +2511,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
ScanKey inkeys;
ScanKey outkeys;
ScanKey cur;
-   BTScanKeyPreproc xform[BTMaxStrategyNumber];
booltest_result;
int i,
j;
@@ -2619,327 +2618,355 @@ _bt_preprocess_keys(IndexScanDesc scan)
 * xform[i] points to the currently best scan key of strategy type i+1; 
it
 * is NULL if we haven't yet found such a key for this attr.
 */
-   attno = 1;
-   memset(xform, 0, sizeof(xform));
 
-   /*
-* Loop iterates from 0 to numberOfKeys inclusive; we use the last pass 
to
-* handle after-last-key processing.  Actual exit from the loop is at 
the
-* "break" statement below.
-*/
-   for (i = 0;; cur++, i++)
+   for (i = 0; i < numberOfKeys;)
{
-   if (i < numberOfKeys)
+   BTScanKeyPreproc xform[BTMaxStrategyNumber];
+   int priorNumberOfEqualCols = 
numberOfEqualCols;
+   boolfast = true;
+
+   /* initialize for this attno */
+   attno = cur->sk_attno;
+
+   

Re: Why is parula failing?

2024-03-20 Thread Matthias van de Meent
On Wed, 20 Mar 2024 at 11:50, Matthias van de Meent
 wrote:
>
> On Tue, 19 Mar 2024 at 20:58, Tom Lane  wrote:
> >
> > For the last few days, buildfarm member parula has been intermittently
> > failing the partition_prune regression test, due to unexpected plan
> > changes [1][2][3][4].  The symptoms can be reproduced exactly by
> > inserting a "vacuum" of one or another of the partitions of table
> > "ab", so we can presume that the underlying cause is an autovacuum run
> > against one of those tables.  But why is that happening?  None of
> > those tables receive any insertions during the test, so I don't
> > understand why autovacuum would trigger on them.

> This may be purely coincidental, but yesterday I also did have a
> seemingly random failure in the recovery test suite locally, in
> t/027_stream_regress.pl, where it changed the join order of exactly
> one of the queries (that uses the tenk table multiple times, iirc 3x
> or so).
[...]
> Sadly, I did not save the output of that test run, so this is just
> about all the information I have. The commit I was testing on was
> based on ca108be7, and system config is available if needed.

It looks like Cirrus CI reproduced my issue with recent commit
a0390f6c [0] and previously also with 4665cebc [1], 875e46a0 [2],
449e798c [3], and other older commits, on both the Windows Server 2019
build and Debian for a39f1a36 (with slightly different plan changes on
this Debian run).  That should rule out most of the environments.

After analyzing the logs produced by the various primaries, I can't
find a good explanation why they would have this issue.  The table is
vacuum analyzed before the regression tests start, and in this run
autovacuum/autoanalyze doesn't seem to kick in until (at least)
seconds after this query was run.


Kind regards,

Matthias van de Meent

[0] https://cirrus-ci.com/task/6295909005262848
[1] https://cirrus-ci.com/task/5229745516838912
[2] https://cirrus-ci.com/task/5098544567156736
[3] https://cirrus-ci.com/task/4783906419900416




Re: Reducing output size of nodeToString

2024-03-20 Thread Matthias van de Meent
On Wed, 20 Mar 2024 at 12:49, Peter Eisentraut  wrote:
>
> On 19.03.24 17:13, Peter Eisentraut wrote:
> > On 11.03.24 21:52, Matthias van de Meent wrote:
> >>> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
> >>>
> >>> This looks sensible, but maybe making Location a global type is a bit
> >>> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
> >>> keep it under 12 characters.
> >> I've gone with ParseLoc in the attached v8 patchset.
> >
> > I have committed this one.
>
> Next, I was looking at
> v8-0003-pg_node_tree-Don-t-store-query-text-locations-in-.patch.

[...]

> So anyway, my idea was that we should turn this around and make
> nodeToString() always drop location information, and instead add
> nodeToStringWithLocations() for the few debugging uses.  And this would
> also be nice because then it matches exactly with the existing
> stringToNodeWithLocations().

That seems reasonable, yes.

-Matthias




Re: Why is parula failing?

2024-03-20 Thread Matthias van de Meent
On Tue, 19 Mar 2024 at 20:58, Tom Lane  wrote:
>
> For the last few days, buildfarm member parula has been intermittently
> failing the partition_prune regression test, due to unexpected plan
> changes [1][2][3][4].  The symptoms can be reproduced exactly by
> inserting a "vacuum" of one or another of the partitions of table
> "ab", so we can presume that the underlying cause is an autovacuum run
> against one of those tables.  But why is that happening?  None of
> those tables receive any insertions during the test, so I don't
> understand why autovacuum would trigger on them.
>
> I suppose we could attach "autovacuum=off" settings to these tables,
> but it doesn't seem to me that that should be necessary.  These test
> cases are several years old and haven't given trouble before.
> Moreover, if that's necessary then there are a lot of other regression
> tests that would presumably need the same treatment.
>
> I'm also baffled why this wasn't happening before.  I scraped the
> buildfarm logs for 3 months back and confirmed my impression that
> this is a new failure mode.  But one of these four runs is on
> REL_14_STABLE, eliminating the theory that the cause is a recent
> HEAD-only change.
>
> Any ideas?

This may be purely coincidental, but yesterday I also did have a
seemingly random failure in the recovery test suite locally, in
t/027_stream_regress.pl, where it changed the join order of exactly
one of the queries (that uses the tenk table multiple times, iirc 3x
or so). As the work I was doing wasn't related to join order-related
problems, this surprised me. After checking for test file changes
(none), I re-ran the tests without recompilation and the failure went
away, so I attributed this to an untimely autoanalyze. However, as
this was also an unexpected plan change in the tests this could be
related.

Sadly, I did not save the output of that test run, so this is just
about all the information I have. The commit I was testing on was
based on ca108be7, and system config is available if needed.

-Matthias




Re: Reducing output size of nodeToString

2024-03-19 Thread Matthias van de Meent
On Tue, 19 Mar 2024 at 17:13, Peter Eisentraut  wrote:
>
> On 11.03.24 21:52, Matthias van de Meent wrote:
> >> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
> >>
> >> This looks sensible, but maybe making Location a global type is a bit
> >> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
> >> keep it under 12 characters.
> > I've gone with ParseLoc in the attached v8 patchset.
>
> I have committed this one.

Thanks!

> I moved the typedef to nodes/nodes.h, where we already had similar
> typedefs (Cardinality, etc.).  The fields stmt_location and stmt_len in
> PlannedStmt were not converted, so I fixed that.  Also, between you
> writing your patch and now, at least one new node type was added, so I
> fixed that one up, too.

Good points, thank you for fixing that.

> (I diffed the generated node support functions
> to check.)  Hopefully, future hackers will apply the new type when
> appropriate.

Are you also planning on committing some of the other patches later,
or should I rebase the set to keep CFBot happy?

-Matthias




Re: documentation structure

2024-03-18 Thread Matthias van de Meent
On Mon, 18 Mar 2024 at 15:12, Robert Haas  wrote:

I won't go into detail about the other docs comments; I don't have
much of an opinion either way on the mentioned sections. You make good
arguments, but I don't usually use those sections of the docs myself;
I tend to do code searches instead.

> I don't know what to do about "I. SQL commands". It's obviously
> impractical to promote that to a top-level section, because it's got a
> zillion sub-pages which I don't think we want in the top-level
> documentation index. But having it as one of several unnumbered
> chapters interposed between 51 and 52 doesn't seem great either.

Could "SQL Commands" be a top-level construct, with subsections for
SQL/DML, SQL/DDL, SQL/Transaction management, and PG's
extensions/administrative/misc features? I sometimes find myself
trying to mentally organize what SQL commands users can use vs those
accessible to database owners and administrators, which is not
currently organized as such in the SQL Commands section.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-03-18 Thread Matthias van de Meent
On Sat, 16 Mar 2024 at 01:12, Peter Geoghegan  wrote:
>
> On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent
>  wrote:
> > I've attached v14, where 0001 is v13, 0002 is a patch with small
> > changes + some large comment suggestions, and 0003 which contains
> > sorted merge join code for _bt_merge_arrays.
>
> I have accepted your changes from 0003. Agree that it's better that
> way. It's at least a little faster, but not meaningfully more
> complicated.

Thanks.

> > I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I
> > can make of it.
>
> That's been the big focus of this new v15, which now goes all out with
> teaching _bt_preprocess_keys with how to deal with arrays. We now do
> comprehensive normalization of even very complicated combinations of
> arrays and non-array scan keys in this version.

I was thinking about a more unified processing model, where
_bt_preprocess_keys would iterate over all keys, including processing
of array keys (by merging and reduction to normal keys) if and when
found. This would also reduce the effort expended when there are
contradictory scan keys, as array preprocessing is relatively more
expensive than other scankeys and contradictions are then found before
processing of later keys.
As I wasn't very far into the work yet it seems I can reuse a lot of
your work here.

> For example, consider this query:
[...]
> This has a total of 6 input scankeys -- 3 of which are arrays. But by
> the time v15's _bt_preprocess_keys is done with it, it'll have only 1
> scan key -- which doesn't even have an array (not anymore). And so we
> won't even need to advance the array keys one single time -- there'll
> simply be no array left to advance. In other words, it'll be just as
> if the query was written this way from the start:
>
> select *
> from multi_test
> where
>   a = 3::int;
>
> (Though of course the original query will spend more cycles on
> preprocessing, compared to this manually simplified variant.)

That's a good improvement, much closer to an optimal access pattern.

> It turned out to not be terribly difficult to teach
> _bt_preprocess_keys everything it could possibly need to know about
> arrays, so that it can operate on them directly, as a variant of the
> standard equality strategy (we do still need _bt_preprocess_array_keys
> for basic preprocessing of arrays, mostly just merging them). This is
> better overall (in that it gets every little subtlety right), but it
> also simplified a number of related issues. For example, there is no
> longer any need to maintain a mapping between so->keyData[]-wise scan
> keys (output scan keys), and scan->keyData[]-wise scan keys (input
> scan keys). We can just add a step to fix-up the references to the end
> of _bt_preprocess_keys, to make life easier within
> _bt_advance_array_keys.
>
> This preprocessing work should all be happening during planning, not
> during query execution -- that's the design that makes the most sense.
> This is something we've discussed in the past in the context of skip
> scan (see my original email to this thread for the reference).

Yes, but IIRC we also agreed that it's impossible to do this fully in
planning, amongst others due to joins on array fields.

> It
> would be especially useful for the very fancy kinds of preprocessing
> that are described by the MDAM paper, like using an index scan for a
> NOT IN() list/array (this can actually make sense with a low
> cardinality index).

Yes, indexes such as those on enums. Though, in those cases the NOT IN
() could be transformed into IN()-lists by the planner, but not the
index.

> The structure for preprocessing that I'm working towards (especially
> in v15) sets the groundwork for making those shifts in the planner,
> because we'll no longer treat each array constant as its own primitive
> index scan during preprocessing.

I hope that's going to be a fully separate patch. I don't think I can
handle much more complexity in this one.

> Right now, on HEAD, preprocessing
> with arrays kinda treats each array constant like the parameter of an
> imaginary inner index scan, from an imaginary nested loop join. But
> the planner really needs to operate on the whole qual, all at once,
> including any arrays. An actual nestloop join's inner index scan
> naturally cannot do that, and so might still require runtime/dynamic
> preprocessing in a world where that mostly happens in the planner --
> but that clearly not appropriate for arrays ("WHERE a = 5" and "WHERE
> a in(4, 5)" are almost the same thing, and so should be handled in
> almost the same way by preprocessing).

Yeah, if the planner could handle some of this that'd be great. At the
same time, I think that this might need to be

Re: Disabling Heap-Only Tuples

2024-03-15 Thread Matthias van de Meent
On Wed, 13 Mar 2024 at 14:27, Laurenz Albe  wrote:
>
> On Thu, 2023-09-21 at 16:18 -0700, Andres Freund wrote:
> > I think a minimal working approach could be to have the configuration be 
> > based
> > on the relation size vs space known to the FSM. If the target block of an
> > update is higher than ((relation_size - fsm_free_space) *
> > new_reloption_or_guc), try finding the target block via the FSM, even if
> > there's space on the page.
>
> That sounds like a good way forward.
>
> The patch is in state "needs review", but it got review.  I'll change it to
> "waiting for author".

Then I'll withdraw this patch as I don't currently have (nor expect to
have anytime soon) the bandwidth or expertise to rewrite this patch to
include a system that calculates the free space available in a
relation.

I've added a TODO item in the UPDATE section with a backlink to this
thread so the discussion isn't lost.

-Matthias




Re: Detoasting optionally to make Explain-Analyze less misleading

2024-03-12 Thread Matthias van de Meent
On Mon, 26 Feb 2024 at 21:54, stepan rutz  wrote:
>
> Hi Matthias, thanks for picking it up. I still believe this is valuable
> to a lot of people out there. Thanks for dealing with my proposal.
> Matthias, Tom, Tomas everyone.
>
> Two (more or less) controversial remarks from side.
>
> 1. Actually serialization should be the default for "analyze" in
> explain, as current analyze doesn't detoast and thus distorts the result
> in extreme (but common) cases easily by many order of magnitude (see my
> original video on that one). So current "explain analyze" only works for
> some queries and since detoasting is really transparent, it is quite
> something to leave detoasting out of explain analyze. This surprises
> people all the time, since explain analyze suggests it "runs" the query
> more realistically.

I'm not sure about this, but it could easily be a mid-beta decision
(if this is introduced before the feature freeze of 17, whenever that
is).

> 2. The bandwidth I computed in one of the previous versions of the patch
> was certainly cluttering up the explain output and it is misleading yes,
> but then again it performs a calculation people will now do in their
> head. The "bandwidth" here is how much data your query gets out of
> backend by means of the query and the deserialization. So of course if
> you do id-lookups you get a single row and such querries do have a lower
> data-retrieval bandwidth compared to bulk querries.

I think that's a job for post-processing the EXPLAIN output by the
user. If we don't give users the raw data, they won't be able to do
their own cross-referenced processing: "5MB/sec" doesn't tell you how
much time or data was actually spent.

> However having some
> measure of how fast data is delivered from the backend especially on
> larger joins is still a good indicator of one aspect of a query.

I'm not sure about that. Network speed is a big limiting factor that
we can't measure here, and the size on disk is often going to be
smaller than the data size when transferred across the network.

> Sorry for the remarks. Both are not really important, just restating my
> points here. I understand the objections and reasons that speak against
> both points and believe the current scope is just right.

No problem. Remarks from users (when built on solid arguments) are
always welcome, even if we may not always agree on the specifics.

-->8--

Attached is v9, which is rebased on master to handle 24eebc65's
changed signature of pq_sendcountedtext.
It now also includes autocompletion, and a second patch which adds
documentation to give users insights into this new addition to
EXPLAIN.


Kind regards,

Matthias van de Meent


v9-0002-Add-EXPLAIN-SERIALIZE-docs.patch
Description: Binary data


v9-0001-Explain-Add-SERIALIZE-option.patch
Description: Binary data


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-03-12 Thread Matthias van de Meent
On Thu, 7 Mar 2024 at 19:37, Michail Nikolaev
 wrote:
>
> Hello!
>
> > I'm not a fan of this approach. Changing visibility and cleanup
> > semantics to only benefit R/CIC sounds like a pain to work with in
> > essentially all visibility-related code. I'd much rather have to deal
> > with another index AM, even if it takes more time: the changes in
> > semantics will be limited to a new plug in the index AM system and a
> > behaviour change in R/CIC, rather than behaviour that changes in all
> > visibility-checking code.
>
> Technically, this does not affect the visibility logic, only the
> clearing semantics.
> All visibility related code remains untouched.

Yeah, correct. But it still needs to update the table relations'
information after it finishes creating the indexes, which I'd rather
not have to do.

> But yes, still an inelegant and a little strange-looking option.
>
> At the same time, perhaps it can be dressed in luxury
> somehow - for example, add as a first class citizen in 
> ComputeXidHorizonsResult
> a list of blocks to clear some relations.

Not sure what you mean here, but I don't think
ComputeXidHorizonsResult should have anything to do with actual
relations.

> > But regardless of second scan snapshots, I think we can worry about
> > that part at a later moment: The first scan phase is usually the most
> > expensive and takes the most time of all phases that hold snapshots,
> > and in the above discussion we agreed that we can already reduce the
> > time that a snapshot is held during that phase significantly. Sure, it
> > isn't great that we have to scan the table again with only a single
> > snapshot, but generally phase 2 doesn't have that much to do (except
> > when BRIN indexes are involved) so this is likely less of an issue.
> > And even if it is, we would still have reduced the number of
> > long-lived snapshots by half.
>
> Hmm, but it looks like we don't have the infrastructure to "update" xmin
> propagating to the horizon after the first snapshot in a transaction is taken.

We can just release the current snapshot, and get a new one, right? I
mean, we don't actually use the transaction for much else than
visibility during the first scan, and I don't think there is a need
for an actual transaction ID until we're ready to mark the index entry
with indisready.

> One option I know of is to reuse the
> d9d076222f5b94a85e0e318339cfc44b8f26022d (1) approach.
> But if this is the case, then there is no point in re-taking the
> snapshot again during the first
> phase - just apply this "if" only for the first phase - and you're done.

Not a fan of that, as it is too sensitive to abuse. Note that
extensions will also have access to these tools, and I think we should
build a system here that's not easy to break, rather than one that is.

> Do you know any less-hacky way? Or is it a nice way to go?

I suppose we could be resetting the snapshot every so often? Or use
multiple successive TID range scans with a new snapshot each?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Reducing output size of nodeToString

2024-03-11 Thread Matthias van de Meent
On Mon, 11 Mar 2024 at 14:19, Peter Eisentraut  wrote:
>
> On 22.02.24 16:07, Matthias van de Meent wrote:
> > On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent
> >  wrote:
> >>
> >> On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
> >>  wrote:
> >>> Attached the updated version of the patch on top of 5497daf3, which
> >>> incorporates this last round of feedback.
> >>
> >> Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
> >> and an issue in the previous patchset: I attached one too many v3-0001
> >> from a previous patch I worked on.
> >
> > ... and now with a fix for not overwriting newly deserialized location
> > attributes with -1, which breaks test output for
> > READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant
> > changes since the patch of last Monday.
>
> * v7-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
>
> This patch looks much more complicated than I was expecting.  I had
> suggested to model this after stringToNodeWithLocations().  This uses a
> global variable to toggle the mode.  Your patch creates a function
> nodeToStringNoQLocs() -- why the different naming scheme?

It doesn't just exclude .location fields, but also Query.len, a
similar field which contains the length of the query's string. The
name has been further refined to nodeToStringNoParseLocs() in the
attached version, but feel free to replace the names in the patch to
anything else you might want.

>  -- and passes
> the flag down as an argument to all the output functions.  I mean, in a
> green field, avoiding global variables can be sensible, of course, but I
> think in this limited scope here it would really be better to keep the
> code for the two directions read and write the same.

I'm a big fan of _not_ using magic global variables as passed context,
especially when they aren't reset on abnormal exits...
For GUCs their usage is understandable (and there is infrastructure to
reset them, and you're not supposed to manually update them), but IMO
here its usage should be a function-scoped variable or in a
passed-by-reference context struct, not a file-local static.
Regardless, attached is an adapted version with the file-local
variable implementation.

> Attached is a small patch that shows what I had in mind.  (It doesn't
> contain any callers, but your patch shows where all those would go.)

Attached a revised version that does it like stringToNodeInternal's
handling of restore_location_fields.

> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
>
> This looks sensible, but maybe making Location a global type is a bit
> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
> keep it under 12 characters.

I've gone with ParseLoc in the attached v8 patchset.

Kind regards,

Matthias van de Meent


v8-0003-pg_node_tree-Don-t-store-query-text-locations-in-.patch
Description: Binary data


v8-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch
Description: Binary data


v8-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch
Description: Binary data


v8-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch
Description: Binary data


v8-0002-gen_node_support.pl-Mark-location-fields-as-type-.patch
Description: Binary data


v8-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch
Description: Binary data


v8-0008-nodeToString-omit-serializing-0s-in-enum-typed-fi.patch
Description: Binary data


v8-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch
Description: Binary data


Re: btree: downlink right separator/HIKEY optimization

2024-03-11 Thread Matthias van de Meent
On Fri, 8 Mar 2024 at 20:14, Peter Geoghegan  wrote:
>
> On Thu, Feb 22, 2024 at 10:42 AM Matthias van de Meent
>  wrote:
> > I forgot to address this in the previous patch, so here's v3 which
> > fixes the issue warning.
>
> What benchmarking have you done here?

I have benchmarked this atop various versions of master when it was
part of the btree specialization patchset, where it showed a 2-9%
increase in btree insert performance over the previous patch in the
patchset on the various index types in that set.
More recently, on an unlogged pgbench with foreign keys enabled (-s400
-j4 -c8) I can't find any obvious regressions (it gains 0-0.7% on
master across 5-minute runs), while being 4.5% faster on inserting
data on a table with an excessively bad index shape (single index of
10 columns of empty strings with the non-default "nl-BE-x-icu"
collation followed by 1 random uuid column, inserted from a 10M row
dataset). Extrapolation indicates this could indeed get over 7%
improvement when the index shape is 31 nondefault-collated nonnull
text columns and a single random ID index column.

> Have you tried just reordering things in _bt_search() instead? If we
> delay the check until after the binary search, then the result of the
> binary search is usually proof enough that we cannot possibly need to
> move right. That has the advantage of not requiring that we copy
> anything to the stack.

I've not tried that, because it would makes page-level prefix
truncation more expensive by ~50%: With this patch, we need only 2
full tuple _bt_compares per page before we can establish a prefix,
while without this patch (i.e. if we did a binsrch-first approach)
we'd need 3 on average (assuming linearly randomly distributed
accesses). Because full-key compares can be arbitrarily more expensive
than normal attribute compares, I'd rather not have that 50% overhead.

> > On Fri, Mar 8, 2024 at 2:14 PM Peter Geoghegan  wrote:
> > What benchmarking have you done here?
> I think that the memcmp() test is subtly wrong:

Good catch, it's been fixed in the attached version, using a new function.

Kind regards,

Matthias van de Meent.


v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patch
Description: Binary data


Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-03-06 Thread Matthias van de Meent
On Wed, 6 Mar 2024 at 22:46, Matthias van de Meent
 wrote:
>
> On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan  wrote:
> > At one point Heikki suggested that I just get rid of
> > BTScanOpaqueData.arrayKeyData (which has been there for as long as
> > nbtree had native support for SAOPs), and use
> > BTScanOpaqueData.keyData exclusively instead. I've finally got around
> > to doing that now.
>
> I'm not sure if it was worth the reduced churn when the changes for
> the primary optimization are already 150+kB in size; every "small"
> addition increases the context needed to review the patch, and it's
> already quite complex.

To clarify, what I mean here is that merging the changes of both the
SAOPs changes and the removal of arrayKeyData seems to increase the
patch size and increases the maximum complexity of each component
patch's review.
Separate patches may make this more reviewable, or not, but no comment
was given on why it is better to merge the changes into a single
patch.

- Matthias




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-03-06 Thread Matthias van de Meent
On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan  wrote:
>
> On Mon, Mar 4, 2024 at 3:51 PM Matthias van de Meent
>  wrote:
> > > +that searches for multiple values together.  Queries that use certain
> > > +SQL constructs to search for rows matching any value
> > > +out of a list (or an array) of multiple scalar values might perform
> > > +multiple primitive index scans (up to one primitive scan
> > > +per scalar value) at runtime.  See <xref linkend="functions-comparisons"/>
> > > +for details.
> >
> > I don't think the "see <xref linkend="functions-comparisons"/> for details" is
> > correctly worded: The surrounding text implies that it would contain
> > details about in which precise situations multiple primitive index
> > scans would be consumed, while it only contains documentation about
> > IN/NOT IN/ANY/ALL/SOME.
> >
> > Something like the following would fit better IMO:
> >
> > +that searches for multiple values together.  Queries that use certain
> > +SQL constructs to search for rows matching any value
> > +out of a list or array of multiple scalar values (such as those described in
> > +<xref linkend="functions-comparisons"/> might perform multiple primitive
> > +index scans (up to one primitive scan per scalar value) at runtime.
>
> I think that there is supposed to be a closing parenthesis here? So
> "... (such as those described in <xref linkend="functions-comparisons"/>) might
> perform...".

Correct.

> FWM, if that's what you meant.

WFM, yes?

> > Then there is a second issue in the paragraph: Inverted indexes such
> > as GIN might well decide to start counting more than one "primitive
> > scan" per scalar value,
[...]
> I've described the issues in this area (in the docs) in a way that is
> most consistent with historical conventions. That seems to have the
> fewest problems, despite everything I've said about it.

Clear enough, thank you for explaining your thoughts on this.

> > > > All that really remains now is to research how we might integrate this
> > > > work with the recently added continuescanPrechecked/haveFirstMatch
> > > > stuff from Alexander Korotkov, if at all.
> > >
> > > The main change in v12 is that I've integrated both the
> > > continuescanPrechecked and the haveFirstMatch optimizations. Both of
> > > these fields are now page-level state, shared between the _bt_readpage
> > > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they
> > > appear next to the new home for _bt_checkkeys' continuescan field, in
> > > the new page state struct).
> >
> > Cool. I'm planning to review the rest of this patch this
> > week/tomorrow, could you take some time to review some of my btree
> > patches, too?
>
> Okay, I'll take a look again.

Thanks, greatly appreciated.

> At one point Heikki suggested that I just get rid of
> BTScanOpaqueData.arrayKeyData (which has been there for as long as
> nbtree had native support for SAOPs), and use
> BTScanOpaqueData.keyData exclusively instead. I've finally got around
> to doing that now.

I'm not sure if it was worth the reduced churn when the changes for
the primary optimization are already 150+kB in size; every "small"
addition increases the context needed to review the patch, and it's
already quite complex.

> These simplifications were enabled by my new approach within
> _bt_preprocess_keys, described when I posted v12. v13 goes even
> further than v12 did, by demoting _bt_preprocess_array_keys to a
> helper function for _bt_preprocess_keys. That means that we do all of
> our scan key preprocessing at the same time, at the start of _bt_first
> (though only during the first _bt_first, or to be more precise during
> the first per btrescan). If we need fancier
> preprocessing/normalization for arrays, then it ought to be a lot
> easier with this structure.

Agreed.

> Note that we no longer need to have an independent representation of
> so->qual_okay for array keys (the convention of setting
> so->numArrayKeys to -1 for unsatisfiable array keys is no longer
> required). There is no longer any need for a separate pass to carry
> over the contents of BTScanOpaqueData.arrayKeyData to
> BTScanOpaqueData.keyData, which was confusing.

I wasn't very confused by it, but sure.

> Are you still interested in working directly on the preprocessing
> stuff?

If you mean my proposed change to merging two equality AOPs, then yes.
I'll try to fit it in tomorrow with the rest of the review.

>  I have a feeling that I was slightly too optimistic about how
> likely we were to be able to get away with not having certain 

Re: Statistics Import and Export

2024-03-06 Thread Matthias van de Meent
On Wed, 6 Mar 2024 at 11:33, Stephen Frost  wrote:
> On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent 
>  wrote:
>> Or even just one VALUES for the whole statistics loading?
>
>
> I don’t think we’d want to go beyond one relation at a time as then it can be 
> parallelized, we won’t be trying to lock a whole bunch of objects at once, 
> and any failures would only impact that one relation’s stats load.

That also makes sense.

>> I suspect the main issue with combining this into one statement
>> (transaction) is that failure to load one column's statistics implies
>> you'll have to redo all the other statistics (or fail to load the
>> statistics at all), which may be problematic at the scale of thousands
>> of relations with tens of columns each.
>
>
> I’m pretty skeptical that “stats fail to load and lead to a failed 
> transaction” is a likely scenario that we have to spend a lot of effort on.

Agreed on the "don't have to spend a lot of time on it", but I'm not
so sure on the "unlikely" part while the autovacuum deamon is
involved, specifically for non-upgrade pg_restore. I imagine (haven't
checked) that autoanalyze is disabled during pg_upgrade, but
pg_restore doesn't do that, while it would have to be able to restore
statistics of a table if it is included in the dump (and the version
matches).

> What are the cases where we would be seeing stats reloads failing where it 
> would make sense to re-try on a subset of columns, or just generally, if we 
> know that the pg_dump version matches the target server version?

Last time I checked, pg_restore's default is to load data on a
row-by-row basis without --single-transaction or --exit-on-error. Of
course, pg_upgrade uses its own set of flags, but if a user is
restoring stats with pg_restore, I suspect they'd rather have some
column's stats loaded than no stats at all; so I would assume this
requires one separate pg_import_pg_statistic()-transaction for every
column.
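The failure-isolation argument above can be sketched as follows. This is a standalone simulation, not pg_restore code; the import function, its failure mode, and the column names are all invented for illustration:

```python
# Simulation (not pg_restore code) of restoring statistics per column,
# with each column in its own "transaction": one column failing to load
# does not discard the stats already loaded for the other columns.

def import_column_stats(col, stats):
    # stands in for a per-column pg_import_pg_statistic() call
    if stats is None:
        raise ValueError(f"stats for {col} no longer match")

def restore_relation_stats(columns):
    loaded, failed = [], []
    for col, stats in columns:
        try:                        # each iteration: its own transaction
            import_column_stats(col, stats)
            loaded.append(col)
        except ValueError:          # failure only loses this column's stats
            failed.append(col)
    return loaded, failed

loaded, failed = restore_relation_stats(
    [("id", {"ndistinct": -1}), ("name", None), ("ts", {"ndistinct": 42})])
```

With a single combined statement, the failing "name" column would instead abort the whole load.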

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Statistics Import and Export

2024-03-06 Thread Matthias van de Meent
On Fri, 1 Mar 2024, 04:55 Corey Huinker,  wrote:
>> Also per our prior discussion- this makes sense to include in post-data 
>> section, imv, and also because then we have the indexes we may wish to load 
>> stats for, but further that also means it’ll be in the paralleliziable part 
>> of the process, making me a bit less concerned overall about the individual 
>> timing.
>
>
> The ability to parallelize is pretty persuasive. But is that per-statement 
> parallelization or do we get transaction blocks? i.e. if we ended up 
> importing stats like this:
>
> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);

How well would this simplify to the following:

SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Or even just one VALUES for the whole statistics loading?

I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-03-05 Thread Matthias van de Meent
On Wed, 21 Feb 2024 at 12:37, Michail Nikolaev
 wrote:
>
> Hi!
>
> > How do you suppose this would work differently from a long-lived
> > normal snapshot, which is how it works right now?
>
> Difference in the ability to take new visibility snapshot periodically
> during the second phase with rechecking visibility of tuple according
> to the "reference" snapshot (which is taken only once like now).
> It is the approach from (1) but with a workaround for the issues
> caused by heap_page_prune_opt.
>
> > Would it be exclusively for that relation?
> Yes, only for that affected relation. Other relations are unaffected.

I suppose this could work. We'd also need to be very sure that the
toast relation isn't cleaned up either: even though TOAST is currently
DELETE+INSERT only and can't apply HOT, it would be an issue if we
couldn't find the TOAST data of a tuple that is deleted for everyone
else but still visible to us.

Note that disabling cleanup for a relation will also disable cleanup
of tuple versions in that table that are not used for the R/CIC
snapshots, and that'd be an issue, too.

> > How would this be integrated with e.g. heap_page_prune_opt?
> Probably by some flag in RelationData, but not sure here yet.
>
> If the idea looks sane, I could try to extend my POC - it should be
> not too hard, likely (I already have tests to make sure it is
> correct).

I'm not a fan of this approach. Changing visibility and cleanup
semantics to only benefit R/CIC sounds like a pain to work with in
essentially all visibility-related code. I'd much rather have to deal
with another index AM, even if it takes more time: the changes in
semantics will be limited to a new plug in the index AM system and a
behaviour change in R/CIC, rather than behaviour that changes in all
visibility-checking code.

But regardless of second scan snapshots, I think we can worry about
that part at a later moment: The first scan phase is usually the most
expensive and takes the most time of all phases that hold snapshots,
and in the above discussion we agreed that we can already reduce the
time that a snapshot is held during that phase significantly. Sure, it
isn't great that we have to scan the table again with only a single
snapshot, but generally phase 2 doesn't have that much to do (except
when BRIN indexes are involved) so this is likely less of an issue.
And even if it is, we would still have reduced the number of
long-lived snapshots by half.

-Matthias




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-03-04 Thread Matthias van de Meent
On Sat, 2 Mar 2024 at 02:30, Peter Geoghegan  wrote:
>
> On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan  wrote:
> > Attached is v11, which now says something like that in the commit
> > message.
>
> Attached is v12.

Some initial comments on the documentation:

> +that searches for multiple values together.  Queries that use certain
> +SQL constructs to search for rows matching any value
> +out of a list (or an array) of multiple scalar values might perform
> +multiple primitive index scans (up to one primitive scan
> +per scalar value) at runtime.  See <xref linkend="functions-comparisons"/>
> +for details.

I don't think the "see  for details" is
correctly worded: The surrounding text implies that it would contain
details about in which precise situations multiple primitive index
scans would be consumed, while it only contains documentation about
IN/NOT IN/ANY/ALL/SOME.

Something like the following would fit better IMO:

+that searches for multiple values together.  Queries that use certain
+SQL constructs to search for rows matching any value
+out of a list or array of multiple scalar values (such as those
+described in <xref linkend="functions-comparisons"/>) might perform
+multiple primitive
+index scans (up to one primitive scan per scalar value) at runtime.

Then there is a second issue in the paragraph: Inverted indexes such
as GIN might well decide to start counting more than one "primitive
scan" per scalar value, because they may need to go through their
internal structure more than once to produce results for a single
scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a
trigram index would likely use at least 4 descents here: one for each
of "som", "ome", "wor", "ord".
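The four-descent count above can be illustrated with a simplified trigram extraction. Real pg_trgm differs in details (e.g. it pads words with spaces and generates boundary trigrams); this sketch only models the interior trigrams named in the example:

```python
# Simplified model of the descent-count argument: split the LIKE pattern
# on '%' wildcards and extract the interior trigrams of each word. This
# is not pg_trgm's actual extraction, which also space-pads words.

def interior_trigrams(pattern):
    # words shorter than 3 characters cannot yield a trigram
    words = [w for w in pattern.split("%") if len(w) >= 3]
    return [w[i:i + 3] for w in words for i in range(len(w) - 2)]

trigrams = interior_trigrams("%some%word%")
```

Each extracted trigram corresponds to (at least) one descent of the index structure, even though the query contains no scalar array at all.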

> > All that really remains now is to research how we might integrate this
> > work with the recently added continuescanPrechecked/haveFirstMatch
> > stuff from Alexander Korotkov, if at all.
>
> The main change in v12 is that I've integrated both the
> continuescanPrechecked and the haveFirstMatch optimizations. Both of
> these fields are now page-level state, shared between the _bt_readpage
> caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they
> appear next to the new home for _bt_checkkeys' continuescan field, in
> the new page state struct).

Cool. I'm planning to review the rest of this patch this
week/tomorrow, could you take some time to review some of my btree
patches, too?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Avoid is possible a expensive function call (src/backend/utils/adt/varlena.c)

2024-03-04 Thread Matthias van de Meent
On Mon, 4 Mar 2024 at 18:39, Ranier Vilela  wrote:
>
> Hi,
>
> The function var_strcmp is a critical function.
> Inside the function, there is a shortcut condition,
> which allows for a quick exit.
>
> Unfortunately, the current code calls a very expensive function beforehand, 
> which if the test was true, all the call time is wasted.
> So, IMO, it's better to postpone the function call until when it is actually 
> necessary.

Thank you for your contribution.

I agree it would be better, but your current patch is incorrect,
because we need to check if the user has access to the locale (and
throw an error if not) before we return that the two strings are
equal.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2024-03-01 Thread Matthias van de Meent
On Wed, 24 Jan 2024 at 13:02, Matthias van de Meent
 wrote:
> > 1.
> > Commit message refers to a non-existing reference '(see [0])'.
>
> Noted, I'll update that.
>
> > 2.
> > +When we do a binary search on a sorted set (such as a BTree), we know that 
> > a
> > +tuple will be smaller than its left neighbour, and larger than its right
> > +neighbour.
> >
> > I think this should be 'larger than left neighbour and smaller than
> > right neighbour' instead of the other way around.
>
> Noted, will be fixed, too.

Attached is version 15 of this patch, with the above issues fixed.
It's also rebased on top of 655dc310 of this morning, so that should
keep good for some time again.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v15-0001-btree-Implement-dynamic-prefix-compression.patch
Description: Binary data


Re: src/include/miscadmin.h outdated comments

2024-03-01 Thread Matthias van de Meent
On Fri, 1 Mar 2024 at 12:19, jian he  wrote:
>
> hi.
>
> /*****************************************************************************
>  *	  globals.h --                                                           *
>  *****************************************************************************/
>
> The above comment src/include/miscadmin.h is not accurate?
> we don't have globals.h file?

The header of the file describes the following:

 * miscadmin.h
 *  This file contains general postgres administration and initialization
 *  stuff that used to be spread out between the following files:
 *        globals.h        global variables
 *        pdir.h           directory path crud
 *        pinit.h          postgres initialization
 *        pmod.h           processing modes
 *  Over time, this has also become the preferred place for widely known
 *  resource-limitation stuff, such as work_mem and check_stack_depth().

So, presumably that section is what once was globals.h.

As to whether the comment should remain now that it's been 17 years
since those files were merged, I'm not sure: while the comment has
mostly historic value, there is something to be said about it
delineating sections of the file.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Regardign RecentFlushPtr in WalSndWaitForWal()

2024-03-01 Thread Matthias van de Meent
On Mon, 26 Feb 2024 at 12:46, shveta malik  wrote:
>
> Hi hackers,
>
> I would like to understand why we have code [1] that retrieves
> RecentFlushPtr in WalSndWaitForWal() outside of the loop. We utilize
> RecentFlushPtr later within the loop, but prior to that, we already
> have [2]. Wouldn't [2] alone be sufficient?
>
> Just to check the impact, I ran 'make check-world' after removing [1],
> and did not see any issue exposed by the test at-least.

Yeah, that seems accurate.

> Any thoughts?
[...]
> [2]:
> /* Update our idea of the currently flushed position. */
> else if (!RecoveryInProgress())

I can't tell where the "else" of this "else if" clause came from, as
this piece of code hasn't changed in years. But apart from that, your
observation seems accurate, yes.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Experiments with Postgres and SSL

2024-02-28 Thread Matthias van de Meent
On Thu, 22 Feb 2024 at 18:02, Heikki Linnakangas  wrote:
>
> On 22/02/2024 01:43, Matthias van de Meent wrote:
>> On Wed, 10 Jan 2024 at 09:31, Heikki Linnakangas  wrote:
>>> 4. The number of combinations of sslmode, gssencmode and sslnegotiation
>>> settings is scary. And we have very few tests for them.
>>
>> Yeah, it's not great. We could easily automate this better though. I
>> mean, can't we run the tests using a "cube" configuration, i.e. test
>> every combination of parameters? We would use a mapping function of
>> (psql connection parameter values -> expectations), which would be
>> along the lines of the attached pl testfile. I feel it's a bit more
>> approachable than the lists of manual option configurations, and makes
>> it a bit easier to program the logic of which connection security
>> option we should have used to connect.
>> The attached file would be a drop-in replacement; it's tested to work
>> with SSL only - without GSS - because I've been having issues getting
>> GSS working on my machine.
>
> +1 testing all combinations. I don't think the 'mapper' function
> approach in your version is much better than the original though. Maybe
> it would be better with just one 'mapper' function that contains all the
> rules, along the lines of: (This isn't valid perl, just pseudo-code)
>
> sub expected_outcome
> {
[...]
> }
>
> Or maybe a table that lists all the combinations and the expected
> outcome. Something like this:
[...]
>
> The problem is that there are more than two dimensions. So maybe an
> exhaustive list like this:
>
> usersslmode gssmode outcome
>
> nossluser   require disable fail
> ...

> I'm just throwing around ideas here, can you experiment with different
> approaches and see what looks best?

One issue with exhaustive tables is that they would require a product
of all options to be listed, and that'd require at least 432 rows to
manage: server_ssl 2 * server_gss 2 * users 3 * client_ssl 4 *
client_gss 3 * client_ssldirect 3 = 432 different states. I think the
expected_outcome version is easier in that regard.
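Multiplying out the dimensions listed above gives the size of an exhaustive table. A quick sketch, where the dimension names and values are illustrative stand-ins rather than the actual test-script parameters:

```python
# Enumerate the full cross product of connection-test dimensions; the
# names/values are placeholders, not the real 001_negotiate_encryption
# parameters.
from itertools import product

dimensions = [
    [False, True],                              # server_ssl
    [False, True],                              # server_gss
    ["nossluser", "ssluser", "gssuser"],        # users
    ["disable", "allow", "prefer", "require"],  # client sslmode
    ["disable", "prefer", "require"],           # client gssencmode
    ["disable", "prefer", "require"],           # client ssldirect
]

combinations = list(product(*dimensions))
total = len(combinations)
```

A mapper function only has to encode the decision rules once, rather than one row per combination.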

Attached an updated version using a single unified connection type
validator using an approach similar to yours. Note that it does fail 8
tests, all of which are attributed to the current handling of
`sslmode=require gssencmode=prefer`: right now, we allow GSS in that
case, even though the user require-d sslmode.

An alternative check that does pass tests with the code of the patch
is commented out, at lines 209-216.

>>> ALPN
>>
>> Does the TLS ALPN spec allow protocol versions in the protocol tag? It
>> would be very useful to detect clients with new capabilities at the
>> first connection, rather than having to wait for one round trip, and
>> would allow one avenue for changing the protocol version.
>
> Looking at the list of registered ALPN tags [0], I can see "http/0.9";
> "http/1.0" and "http/1.1".

Ah, nice.

> I think we'd want to changing the major
> protocol version in a way that would introduce a new roundtrip, though.

I don't think I understand what you meant here, could you correct the
sentence or expand why we want to do that?
Note that with ALPN you could negotiate postgres/3.0 or postgres/4.0
during the handshake, which could save round-trips.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


001_negotiate_encryption.pl
Description: Binary data


Re: Detoasting optionally to make Explain-Analyze less misleading

2024-02-26 Thread Matthias van de Meent
Hi,

I've taken the liberty to update this patch, and register it in the
commitfest app to not lose track of progress [0].

The attached v8 patch measures scratch memory allocations (with MEMORY
option), total time spent in serialization (with TIMING on, measures
are inclusive of unseparated memcpy to the message buffer), and a
count of produced bytes plus the output format used (text or binary).
It's a light rework of the earlier 0007 patch, I've reused tests and
some infrastructure, while the implementation details and comments
have been updated significantly.

I think we can bikeshed on format and names, but overall I think the
patch is in a very decent shape.

Stepan, thank you for your earlier work, and feel free to check it out
or pick it up again if you want to; else I'll try to get this done.

Kind regards,

Matthias van de Meent

[0] https://commitfest.postgresql.org/47/4852/


v8-0001-Explain-Add-SERIALIZE-option.patch
Description: Binary data


Re: Sequence Access Methods, round two

2024-02-26 Thread Matthias van de Meent
On Mon, 26 Feb 2024 at 09:11, Michael Paquier  wrote:
>
> On Thu, Feb 22, 2024 at 05:36:00PM +0100, Tomas Vondra wrote:
> > 0002, 0003
> > 
> > seems fine, cosmetic changes
>
> Thanks, I've applied these two for now.  I'll reply to the rest
> tomorrow or so.

Huh, that's surprising to me. I'd expected this to get at least a
final set of patches before they'd get committed. After a quick check
6e951bf seems fine, but I do have some nits on 449e798c:

> +/* ----------------
> + *        validate_relation_kind - check the relation's kind
> + *
> + *        Make sure relkind is from an index

Shouldn't this be "... from a sequence"?

> + * ----------------
> + */
> +static inline void
> +validate_relation_kind(Relation r)

Shouldn't this be a bit more descriptive than just
"validate_relation_kind"? I notice this is no different from how this
is handled in index.c and table.c, but I'm not a huge fan of shadowing
names, even with static inline functions.

> -ERROR:  "serialtest1" is not a sequence
> +ERROR:  cannot open relation "serialtest1"
> +DETAIL:  This operation is not supported for tables.

We seem to lose some details here: We can most definitely open tables.
We just can't open them while treating them as sequences, which is not
mentioned in the error message.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: btree: downlink right separator/HIKEY optimization

2024-02-22 Thread Matthias van de Meent
On Sat, 6 Jan 2024 at 16:40, vignesh C  wrote:
>
> CFBot shows the following compilation error at [1]:
> [16:56:22.153] FAILED:
> src/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj
> [...]
> ../src/backend/access/nbtree/nbtsearch.c
> [16:56:22.153] ../src/backend/access/nbtree/nbtsearch.c(112): error
> C2143: syntax error: missing ';' before 'type'
> [16:56:22.280] ../src/backend/access/nbtree/nbtsearch.c(112): warning
> C4091: ' ': ignored on left of 'int' when no variable is declared

I forgot to address this in the previous patch, so here's v3, which
fixes the compilation error and warning.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patch
Description: Binary data


Re: Reducing output size of nodeToString

2024-02-22 Thread Matthias van de Meent
On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent
 wrote:
>
> On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
>  wrote:
> > Attached the updated version of the patch on top of 5497daf3, which
> > incorporates this last round of feedback.
>
> Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
> and an issue in the previous patchset: I attached one too many v3-0001
> from a previous patch I worked on.

... and now with a fix for not overwriting newly deserialized location
attributes with -1, which breaks test output for
READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant
changes since the patch of last Monday.

Sorry for the noise,

-Matthias


v7-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch
Description: Binary data


v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
Description: Binary data


v7-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
Description: Binary data


v7-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch
Description: Binary data


v7-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch
Description: Binary data


v7-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch
Description: Binary data


v7-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch
Description: Binary data


v7-0008-nodeToString-omit-serializing-0s-in-enum-typed-fi.patch
Description: Binary data


Re: btree: downlink right separator/HIKEY optimization

2024-02-22 Thread Matthias van de Meent
On Tue, 5 Dec 2023 at 08:43, Heikki Linnakangas  wrote:
>
> On 01/11/2023 00:08, Matthias van de Meent wrote:
> > By caching the right separator index tuple in _bt_search, we can
> > compare the downlink's right separator and the HIKEY, and when they
> > are equal (memcmp() == 0) we don't have to call _bt_compare - the
> > HIKEY is known to be larger than the scan key, because our key is
> > smaller than the right separator, and thus transitively also smaller
> > than the HIKEY because it contains the same data. As _bt_compare can
> > call expensive user-provided functions, this can be a large
> > performance boon, especially when there are only a small number of
> > column getting compared on each page (e.g. index tuples of many 100s
> > of bytes, or dynamic prefix truncation is enabled).
>
> What would be the worst case scenario for this? One situation where the
> memcmp() would not match is when there is a concurrent page split. I
> think it's OK to pessimize that case. Are there any other situations?

There is also concurrent page deletion which can cause downlinked
pages to get removed from the set of accessible pages, but that's
quite rare, too: arguably even more rare than page splits.

> When the memcmp() matches, I think this is almost certainly not slower
> than calling the datatype's comparison function.
>
> >   if (offnum < PageGetMaxOffsetNumber(page))
> > [...]
> >   else if (!P_RIGHTMOST(opaque))
> > [...]
> >   }
>
> This could use a one-line comment above this, something like "Remember
> the right separator of the downlink we follow, to speed up the next
> _bt_moveright call".

Done.

> Should there be an "else rightsep = NULL;" here? Is it possible that we
> follow the non-rightmost downlink on a higher level and rightmost
> downlink on next level? Concurrent page deletion?

While possible, the worst this could do is be less efficient in those
fringe cases: The cached right separator is a key that is known to
compare larger than the search key and thus always correct to use as
an optimization for "is this HIKEY larger than my search key", as long
as we don't clobber the data in that cache (which we don't).
Null-ing the argument, while not incorrect, could be argued to be
worse than useless here, as the only case where NULL may match an
actual highkey is on the rightmost page, which we already special-case
in _bt_moveright before hitting the 'compare the highkey' code.
Removal of the value would thus remove any chance of using the
optimization after hitting the rightmost page in a layer below.
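The transitivity argument behind the cached right separator can be sketched as follows. This is an illustrative model, not nbtree code; the function names are invented:

```python
# Model of the shortcut: following the downlink already established that
# the search key is smaller than the downlink's right separator. When
# the child page's HIKEY is byte-equal to that cached separator
# (memcmp() == 0), "key < HIKEY" follows transitively and the possibly
# expensive comparison can be skipped.

compare_calls = 0

def bt_compare(key, tup):
    # stands in for _bt_compare, which may call user-provided functions
    global compare_calls
    compare_calls += 1
    return (key > tup) - (key < tup)

def key_is_below_hikey(key, hikey, cached_right_sep):
    if cached_right_sep is not None and hikey == cached_right_sep:
        return True                 # fast path: no comparison needed
    return bt_compare(key, hikey) < 0

fast = key_is_below_hikey(b"g", b"m", cached_right_sep=b"m")
slow = key_is_below_hikey(b"g", b"z", cached_right_sep=b"m")  # e.g. split
```

Only the mismatch case (such as after a concurrent page split) falls back to the comparator, which is why a stale cached separator is merely less efficient, never incorrect.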

I've added a comment to explain this in an empty else block in the
attached version 2 of the patch.

> Please update the comment above _bt_moveright to describe the new
> argument. Perhaps the text from README should go there, this feels like
> a detail specific to _bt_search and _bt_moveright.

Done.

Thank you for the review.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v2-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patch
Description: Binary data


Re: Reducing output size of nodeToString

2024-02-22 Thread Matthias van de Meent
On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
 wrote:
> Attached the updated version of the patch on top of 5497daf3, which
> incorporates this last round of feedback.

Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
and an issue in the previous patchset: I attached one too many v3-0001
from a previous patch I worked on.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v6-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
Description: Binary data


v6-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
Description: Binary data


v6-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch
Description: Binary data


v6-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch
Description: Binary data


v6-0008-nodeToString-omit-serializing-0s-in-enum-typed-fi.patch
Description: Binary data


v6-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch
Description: Binary data


v6-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch
Description: Binary data


v6-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch
Description: Binary data


Re: Experiments with Postgres and SSL

2024-02-21 Thread Matthias van de Meent
> + ereport(FATAL,
> +(errcode(ERRCODE_PROTOCOL_VIOLATION),
> + errmsg("received unencrypted data after SSL 
> request"),
> + errdetail("This could be either a client-software 
> bug or evidence of an attempted man-in-the-middle attack.")));

We currently don't support 0-RTT SSL connections because (among other
reasons) we haven't yet imported many features from TLS1.3, but it
seems reasonable that clients may want to use 0RTT (or, session
resumption in 0 round trips), which would allow encrypted data after
the SSL startup packet.
It seems wise to add a note about this to these comments in
ProcessStartupPacket.

> ALPN

Does the TLS ALPN spec allow protocol versions in the protocol tag? It
would be very useful to detect clients with new capabilities at the
first connection, rather than having to wait for one round trip, and
would allow one avenue for changing the protocol version.

Apart from this, I didn't really find any serious problems in the sum
of these patches. The intermediate states were not great though, with
various broken states in between.

Kind regards,

Matthias van de Meent


001_negotiate_encryption.pl
Description: Binary data


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-21 Thread Matthias van de Meent
On Wed, 21 Feb 2024 at 09:35, Michail Nikolaev
 wrote:
>
> One more idea - is just forbid HOT prune while the second phase is
> running. It is not possible anyway currently because of snapshot held.
>
> Possible enhancements:
> * we may apply restriction only to particular tables
> * we may apply restrictions only to part of the tables (not yet
> scanned by R/CICs).
>
> Yes, it is not an elegant solution, limited, not reliable in terms of
> architecture, but a simple one.

How do you suppose this would work differently from a long-lived
normal snapshot, which is how it works right now?
Would it be exclusively for that relation? How would this be
integrated with e.g. heap_page_prune_opt?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-21 Thread Matthias van de Meent
On Wed, 21 Feb 2024 at 00:33, Michail Nikolaev
 wrote:
>
> Hello!
> > I think the best way for this to work would be an index method that
> > exclusively stores TIDs, and of which we can quickly determine new
> > tuples, too. I was thinking about something like GIN's format, but
> > using (generation number, tid) instead of ([colno, colvalue], tid) as
> > key data for the internal trees, and would be unlogged (because the
> > data wouldn't have to survive a crash)
>
> Yeah, this seems to be a reasonable approach, but there are some
> doubts related to it - it needs new index type as well as unlogged
> indexes to be introduced - this may make the patch too invasive to be
> merged.

I suppose so, though persistence is usually just to keep things
correct in case of crashes, and this "index" is only there to support
processes that don't expect to survive crashes.

> Also, some way to remove the index from the catalog in case of
> a crash may be required.

That's less of an issue though, we already accept that a crash during
CIC/RIC leaves unusable indexes around, so "needs more cleanup" is not
exactly a blocker.

> A few more thoughts:
> * it is possible to go without generation number - we may provide a
> way to do some kind of fast index lookup (by TID) directly during the
> second table scan phase.

While possible, I don't think this would outperform the combined
approach, and it could incur much more random IO when the table is
being aggressively updated.

> * one more option is to maintain a Tuplesorts (instead of an index)
> with TIDs as changelog and merge with index snapshot after taking a
> new visibility snapshot. But it is not clear how to share the same
> Tuplesort with multiple inserting backends.

Tuplesort requires the leader process to wait for concurrent backends
to finish their sort before it can start consuming their runs. This
would make it a very bad alternative to the "changelog index" as the
CIC process would require on-demand actions from concurrent backends
(flush of sort state). I'm not convinced that's somehow easier.

> * crazy idea - what is about to do the scan in the index we are
> building? We have tuple, so, we have all the data indexed in the
> index. We may try to do an index scan using that data to get all
> tuples and find the one with our TID :)

We can't rely on that, because we have no guarantee we can find the
tuple quickly enough. Equality-based indexing is very much optional,
and so are TID-based checks (outside the current vacuum-related APIs),
so finding one TID can (and probably will) take O(indexsize) when the
tuple is not in the index, which is one reason for ambulkdelete() to
exist.

Kind regards,

Matthias van de Meent




Re: Add bump memory context type and use it for tuplesorts

2024-02-20 Thread Matthias van de Meent
On Tue, 20 Feb 2024 at 11:02, David Rowley  wrote:
> On Fri, 26 Jan 2024 at 01:29, Matthias van de Meent
>  wrote:
> > >> +allocSize = MAXALIGN(sizeof(BumpContext)) + Bump_BLOCKHDRSZ +
> > >> +if (minContextSize != 0)
> > >> +allocSize = Max(allocSize, minContextSize);
> > >> +else
> > >> +allocSize = Max(allocSize, initBlockSize);
> >
> > Shouldn't this be the following, considering the meaning of "initBlockSize"?
>
> No, we want to make the blocksize exactly initBlockSize if we can. Not
> initBlockSize plus all the header stuff.  We do it that way for all
> the other contexts and I agree that it's a good idea as it keeps the
> malloc request sizes powers of 2.

Part of the reason for my comment was that initBlockSize is ignored
in favour of minContextSize whenever the latter is configured,
regardless of the value of initBlockSize. Is it deliberately ignored
when minContextSize is set?
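The sizing rule being discussed can be restated as a small sketch. The header-size constants below are made-up placeholders, not the real struct sizes; the point is only that initBlockSize has no effect once minContextSize is nonzero:

```python
# Pure-Python sketch of the quoted allocSize logic from the bump context
# patch; constants are illustrative placeholders.

CONTEXT_HDRSZ = 64   # stands in for MAXALIGN(sizeof(BumpContext))
BLOCK_HDRSZ = 32     # stands in for Bump_BLOCKHDRSZ

def first_block_alloc_size(min_context_size, init_block_size):
    alloc_size = CONTEXT_HDRSZ + BLOCK_HDRSZ
    if min_context_size != 0:
        # init_block_size is ignored entirely on this path
        alloc_size = max(alloc_size, min_context_size)
    else:
        alloc_size = max(alloc_size, init_block_size)
    return alloc_size

with_min = first_block_alloc_size(8192, 1024)
without_min = first_block_alloc_size(0, 1024)
```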

> > >> + * BumpFree
> > >> + *Unsupported.
> > >> [...]
> > >> + * BumpRealloc
> > >> + *Unsupported.
> >
> > Rather than the error, can't we make this a no-op (potentially
> > optionally, or in a different memory context?)
>
> Unfortunately not.  There are no MemoryChunks on bump chunks so we've
> no way to determine the context type a given pointer belongs to.  I've
> left the MemoryChunk on there for debug builds so we can get the
> ERRORs to allow us to fix the broken code that is pfreeing these
> chunks.
>
> > I understand that allowing pfree/repalloc in bump contexts requires
> > each allocation to have a MemoryChunk prefix in overhead, but I think
> > it's still a valid use case to have a very low overhead allocator with
> > no-op deallocator (except context reset). Do you have performance
> > comparison results between with and without the overhead of
> > MemoryChunk?
>
> Oh right, you've taken this into account.  I was hoping not to have
> the headers otherwise the only gains we see over using generation.c is
> that of the allocation function being faster.
>
> [...]  The smaller the tuple, the
> more that will be noticeable as the chunk header is a larger portion
> of the overall allocation with those.

I see. Thanks for the explanation.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Add bump memory context type and use it for tuplesorts

2024-02-20 Thread Matthias van de Meent
On Tue, 20 Feb 2024 at 10:41, David Rowley  wrote:
> On Tue, 7 Nov 2023 at 07:55, Matthias van de Meent
>  wrote:
> > > +++ b/src/backend/utils/mmgr/bump.c
> > > +BumpBlockIsEmpty(BumpBlock *block)
> > > +{
> > > +/* it's empty if the freeptr has not moved */
> > > +return (block->freeptr == (char *) block + Bump_BLOCKHDRSZ);
> > > [...]
> > > +static inline void
> > > +BumpBlockMarkEmpty(BumpBlock *block)
> > > +{
> > > +#if defined(USE_VALGRIND) || defined(CLOBBER_FREED_MEMORY)
> > > +char   *datastart = ((char *) block) + Bump_BLOCKHDRSZ;
> >
> > These two use different definitions of the start pointer. Is that 
> > deliberate?
>
> hmm, I'm not sure if I follow what you mean.  Are you talking about
> the "datastart" variable and the assignment of block->freeptr (which
> you've not quoted?)

What I meant was that

> (char *) block + Bump_BLOCKHDRSZ
vs
>  ((char *) block) + Bump_BLOCKHDRSZ

, when combined with my little experience with pointer addition and
precedence, and a lack of compiler at the ready at that point in time,
I was afraid that "(char *) block + Bump_BLOCKHDRSZ" would be parsed
as "(char *) (block + Bump_BLOCKHDRSZ)", which would get different
offsets across the two statements.
Godbolt has since helped me understand that both are equivalent.

Kind regards,

Matthias van de Meent




Re: Proposal: Adjacent B-Tree index

2024-02-19 Thread Matthias van de Meent
On Mon, 19 Feb 2024 at 18:48, Dilshod Urazov  wrote:
>
> - Motivation
>
> A regular B-tree index provides efficient mapping of key values to tuples 
> within a table. However, if you have two tables connected in some way, a 
> regular B-tree index may not be efficient enough. In this case, you would 
> need to create an index for each table. The purpose will become clearer if we 
> consider a simple example which is the main use-case as I see it.

I'm not sure why two indexes are not sufficient here? PostgreSQL can
already do merge joins, which would have the same effect of hitting
the same location in both indexes at the same time, without the
additional overhead of having to scan two tables' worth of indexes in
VACUUM.

> During the vacuum of A an index tuple pointing to a dead tuple in A should be 
> cleaned as well as all index tuples for the same key.

This is definitely not correct. If I have this schema

table test (id int primary key, b text unique)
table test_ref(test_id int references test(id))

and if an index would contain entries for both test and test_ref, it
can't just remove all test_ref entries because an index entry with the
same id was removed: The entry could've been removed because (e.g.)
test's b column was updated thus inserting a new index entry for the
new HOT-chain's TID.

> would suffice for this new semantics.

With the provided explanation I don't think this is a great idea.

Kind regards,

Matthias van de Meent.




Re: Experiments with Postgres and SSL

2024-02-19 Thread Matthias van de Meent
I've been asked to take a look at this thread and review some patches,
and the subject looks interesting enough, so here I am.

On Thu, 19 Jan 2023 at 04:16, Greg Stark  wrote:
> I had a conversation a while back with Heikki where he expressed that
> it was annoying that we negotiate SSL/TLS the way we do since it
> introduces an extra round trip. Aside from the performance
> optimization I think accepting standard TLS connections would open the
> door to a number of other opportunities that would be worth it on
> their own.

I agree that this would be very nice.

> Other things it would open the door to in order from least
> controversial to most
>
> * Hiding Postgres behind a standard SSL proxy terminating SSL without
> implementing the Postgres protocol.

I think there is also the option "hiding Postgres behind a standard
SNI-based SSL router that does not terminate SSL", as that's arguably
a more secure way to deploy any SSL service than SSL-terminating
proxies.

> * "Service Mesh" type tools that hide multiple services behind a
> single host/port ("Service Mesh" is just a new buzzword for "proxy").

People proxying PostgreSQL seems fine, and enabling better proxying
seems reasonable.

> * Browser-based protocol implementations using websockets for things
> like pgadmin or other tools to connect directly to postgres using
> Postgres wire protocol but using native SSL implementations.
>
> * Postgres could even implement an HTTP based version of its protocol
> and enable things like queries or browser based tools using straight
> up HTTP requests so they don't need to use websockets.
>
> * Postgres could implement other protocols to serve up data like
> status queries or monitoring metrics, using HTTP based standard
> protocols instead of using our own protocol.

I don't think we should be trying to serve anything HTTP-like, even
with a ten-foot pole, on a port that we serve the PostgreSQL wire
protocol on.

If someone wants to multiplex the PostgreSQL wire protocol on the same
port that serves HTTPS traffic, they're welcome to do so with their
own proxy, but I'd rather we keep the PostgreSQL server's socket
handling fundamentally incapable of servicing protocols primarily used
in web browsers on the same socket that handles normal psql data
connections.

PostgreSQL may have its own host-based authentication with HBA, but
I'd rather not have to depend on it to filter incoming connections
between valid psql connections and people trying to grab the latest
monitoring statistics at some http endpoint - I'd rather use my trusty
firewall that can already limit access to specific ports very
efficiently without causing undue load on the database server.

Matthias van de Meent
Neon (https://neon.tech)




Re: PGC_SIGHUP shared_buffers?

2024-02-18 Thread Matthias van de Meent
On Sun, 18 Feb 2024 at 02:03, Andres Freund  wrote:
>
> Hi,
>
> On 2024-02-17 23:40:51 +0100, Matthias van de Meent wrote:
> > > 5. Re-map the shared_buffers when needed.
> > >
> > > Between transactions, a backend should not hold any buffer pins. When
> > > there are no pins, you can munmap() the shared_buffers and mmap() it at
> > > a different address.
>
> I hadn't quite realized that we don't seem to rely on shared_buffers having a
> specific address across processes. That does seem to make it a more viable to
> remap mappings in backends.
>
>
> However, I don't think this works with mmap(MAP_ANONYMOUS) - as long as we are
> using the process model. To my knowledge there is no way to get the same
> mapping in multiple already existing processes. Even mmap()ing /dev/zero after
> sharing file descriptors across processes doesn't work, if I recall correctly.
>
> We would have to use sysv/posix shared memory or such (or mmap() if files in
> tmpfs) for the shared buffers allocation.
>
>
>
> > This can quite realistically fail to find an unused memory region of
> > sufficient size when the heap is sufficiently fragmented, e.g. through
> > ASLR, which would make it difficult to use this dynamic
> > single-allocation shared_buffers in security-hardened environments.
>
> I haven't seen anywhere close to this bad fragmentation on 64bit machines so
> far - have you?

No.

> Most implementations of ASLR randomize mmap locations across multiple runs of
> the same binary, not within the same binary. There are out-of-tree linux
> patches that make mmap() randomize every single allocation, but I am not sure
> that we ought to care about such things.

After looking into ASLR a bit more, I realise I was under the mistaken
impression that ASLR would imply randomized mmap()s, too.
Apparently, that's wrong: ASLR only randomizes the initial process
memory layout, not the process's later allocations.

> Even if we were to care, on 64bit platforms it doesn't seem likely that we'd
> run out of space that quickly.  AMD64 had 48bits of virtual address space from
> the start, and on recent CPUs that has grown to 57bits [1], that's a lot of
> space.

Yeah, that's a lot of space, but it seems to me it's also easily
consumed; one only needs to place one small allocation in every 4GB of
address space to make contiguous allocations of 8GB impossible; a
utilization of ~1 byte/MiB. Applying this to 48 bits of virtual
address space, a process only needs to use ~256MB of memory across the
address space to block out any 8GB allocations; for 57 bits that's
still "only" 128GB.
But after looking at ASLR a bit more, it is unrealistic that a normal
OS and process stack would get to allocating memory in such a pattern.

> And if you do run out of VM space, wouldn't that also affect lots of other
> things, like mmap() for malloc?

Yes. But I would usually expect that the main shared memory allocation
would be the single largest uninterrupted allocation, so I'd also
expect it to see more such issues than any current user of memory if
we were to start moving (reallocating) that allocation.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: automating RangeTblEntry node support

2024-02-17 Thread Matthias van de Meent
On Fri, 16 Feb 2024 at 21:36, Paul Jungwirth
 wrote:
>
> On 1/15/24 02:37, Peter Eisentraut wrote:
> > In this updated patch set, I have also added the treatment of the 
> > Constraint type.  (I also noted
> > that the manual read/write functions for the Constraint type are 
> > out-of-sync again, so simplifying
> > this would be really helpful.)  I have also added commit messages to each 
> > patch.
> >
> > The way I have re-ordered the patch series now, I think patches 0001 
> > through 0003 are candidates for
> > inclusion after review, patch 0004 still needs a bit more analysis and 
> > testing, as described therein.
>
> Re bloating the serialization output, we could leave this last patch until 
> after the work on that
> other thread is done to skip default-valued items.

I'm not sure that the cleanup done when changing an RTE's rtekind is
complete enough for this purpose.
Things like inline_cte_walker change node->rtekind, which can leave
residual junk data in fields that are currently dropped during
serialization (the serializer specifically ignores fields that don't
apply to the rtekind). If we instead rely on default omission to
handle those fields, that junk would add overhead again. It looks like
there is some care about zeroing now-unused fields, but I haven't
checked that it covers all cases and fields to the extent required, so
I can't say that removing this specialized serializer would have zero
impact on size once the default omission patch is committed.

An additional patch with a single function that clears junk fields
from RTEs that changed kind would be appreciated: this cleanup is
currently hand-coded at the locations where the kind changes, which is
more sensitive to programmer error.

Kind regards,

Matthias van de Meent




Re: PGC_SIGHUP shared_buffers?

2024-02-17 Thread Matthias van de Meent
On Fri, 16 Feb 2024 at 21:24, Heikki Linnakangas  wrote:
>
> On 16/02/2024 06:28, Robert Haas wrote:
> > 3. Reserve lots of address space and then only use some of it. I hear
> > rumors that some forks of PG have implemented something like this. The
> > idea is that you convince the OS to give you a whole bunch of address
> > space, but you try to avoid having all of it be backed by physical
> > memory. If you later want to increase shared_buffers, you then get the
> > OS to back more of it by physical memory, and if you later want to
> > decrease shared_buffers, you hopefully have some way of giving the OS
> > the memory back. As compared with the previous two approaches, this
> > seems less likely to be noticeable to most PG code. Problems include
> > (1) you have to somehow figure out how much address space to reserve,
> > and that forms an upper bound on how big shared_buffers can grow at
> > runtime and (2) you have to figure out ways to reserve address space
> > and back more or less of it with physical memory that will work on all
> > of the platforms that we currently support or might want to support in
> > the future.
>
> A variant of this approach:
>
> 5. Re-map the shared_buffers when needed.
>
> Between transactions, a backend should not hold any buffer pins. When
> there are no pins, you can munmap() the shared_buffers and mmap() it at
> a different address.

This can quite realistically fail to find an unused memory region of
sufficient size when the heap is sufficiently fragmented, e.g. through
ASLR, which would make it difficult to use this dynamic
single-allocation shared_buffers in security-hardened environments.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-17 Thread Matthias van de Meent
On Thu, 1 Feb 2024, 17:06 Michail Nikolaev,  wrote:
>
> > > > I just realised there is one issue with this design: We can't cheaply
> > > > reset the snapshot during the second table scan:
> > > > It is critically important that the second scan of R/CIC uses an index
> > > > contents summary (made with index_bulk_delete) that was created while
> > > > the current snapshot was already registered.

I think the best way for this to work would be an index method that
exclusively stores TIDs, and in which we can quickly find newly added
tuples, too. I was thinking of something like GIN's format, but using
(generation number, tid) instead of ([colno, colvalue], tid) as key
data for the internal trees; the index would be unlogged (because the
data wouldn't have to survive a crash). Then we could do something
like this for the second table scan phase:

0. index->indisready is set
[...]
1. Empty the "changelog index", resetting storage and the generation number.
2. Take index contents snapshot of new index, store this.
3. Loop until completion:
4a. Take visibility snapshot
4b. Update generation number of the changelog index, store this.
4c. Take an index snapshot of the "changelog index" for data up to,
but not including, the current stored generation number: we only need
to scan the part of the index that was added before we created our
visibility snapshot, i.e. TIDs labeled with generation numbers between
the previous iteration's generation number (incl.) and this
iteration's generation (excl.).
4d. Combine the current index snapshot with that of the "changelog"
index, and save this.
Note that this needs to take care to remove duplicates.
4e. Scan segment of table (using the combined index snapshot) until we
need to update our visibility snapshot or have scanned the whole
table.

This should give similar, if not the same, behaviour as what we have
when we R/CIC a table with several small indexes, without requiring
us to scan a full index of data several times.

An attempt at proving this approach's correctness:
In phase 3, after each step 4b:
All matching tuples of the table that are in the visibility snapshot:
* Were created before scan 1's snapshot, thus in the new index's snapshot, or
* Were created after scan 1's snapshot but before index->indisready,
thus not in the new index's snapshot, nor in the changelog index, or
* Were created after the index was set as indisready, and committed
before the previous iteration's visibility snapshot, thus in the
combined index snapshot, or
* Were created after the index was set as indisready, after the
previous visibility snapshot was taken, but before the current
visibility snapshot was taken, and thus definitely included in the
changelog index.

Because we hold a snapshot, no data in the table that we should see is
removed, so we don't have a chance of broken HOT chains.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Reducing output size of nodeToString

2024-02-15 Thread Matthias van de Meent
On Thu, 15 Feb 2024 at 13:59, Peter Eisentraut  wrote:
>
> Thanks, this patch set is a good way to incrementally work through these
> changes.
>
> I have looked at
> v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch today.
> Here are my thoughts:
>
> I believe we had discussed offline to not omit enum fields with value 0
> (WRITE_ENUM_FIELD).  This is because the values of enum fields are
> implementation artifacts, and this could be confusing for readers.

Thanks for reminding me, I didn't remember this when I worked on
updating the patchset. I'll update this soon.

> I have some concerns about the round-trippability of float values.  If
> we do, effectively, if (node->fldname != 0.0), then I think this would
> also match negative zero, but when we read it back it, it would get
> assigned positive zero.  Maybe there are other edge cases like this.
> Might be safer to not mess with this.

That's a good point. Would an additional check that the sign of the
field equals the default's sign be enough for this? As for other
cases, I'm not sure we currently want to support non-normal floats,
even if it is technically possible to do the round-trip in the current
format.

> On the reading side, the macro nesting has gotten a bit out of hand. :)
> We had talked earlier in the thread about the _DIRECT macros and you
> said there were left over from something else you want to try, but I see
> nothing else in this patch set uses this.  I think this could all be
> much simpler, like (omitting required punctuation)
[...]
> Not only is this simpler, but it might also have better performance,
> because we don't have separate pg_strtok_next() and pg_strtok() calls in
> sequence.

Good points. I'll see what I can do here.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Reducing output size of nodeToString

2024-02-12 Thread Matthias van de Meent
On Mon, 12 Feb 2024 at 20:32, Matthias van de Meent
 wrote:
>
> On Mon, 12 Feb 2024 at 19:03, Matthias van de Meent
>  wrote:
> > Attached is patchset v2, which contains the improvements from these patches:
>
> Attached v3, which fixes an out-of-bounds read in pg_strtoken_next,
> detected by asan, that was a likely cause of the problems in CFBot's
> FreeBSD regression tests.

Apparently that was caused by issues in my updated bitmapset
serializer; where I used bms_next_member(..., x=0) as first iteration
thus skipping the first bit. This didn't show up earlier because that
bit is not exercised in PG's builtin views, but is exercised when
WRITE_READ_PARSE_PLAN_TREES is defined (as on the FreeBSD CI job).

Trivial fix in the attached v4 of the patchset, with some fixes for
other assertions that'd get some exercise in non-pg_node_tree paths in
the WRITE_READ configuration.


v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch
Description: Binary data


v4-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
Description: Binary data


v4-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch
Description: Binary data


v4-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch
Description: Binary data


v4-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
Description: Binary data


v4-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch
Description: Binary data


v4-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch
Description: Binary data


Re: Reducing output size of nodeToString

2024-02-12 Thread Matthias van de Meent
On Mon, 12 Feb 2024 at 19:03, Matthias van de Meent
 wrote:
> Attached is patchset v2, which contains the improvements from these patches:

Attached v3, which fixes an out-of-bounds read in pg_strtoken_next,
detected by asan, that was a likely cause of the problems in CFBot's
FreeBSD regression tests.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v3-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
Description: Binary data


v3-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch
Description: Binary data


v3-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch
Description: Binary data


v3-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
Description: Binary data


v3-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch
Description: Binary data


v3-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch
Description: Binary data


v3-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch
Description: Binary data


Re: Reducing output size of nodeToString

2024-02-12 Thread Matthias van de Meent
On Wed, 31 Jan 2024 at 18:47, Robert Haas  wrote:
>
> On Wed, Jan 31, 2024 at 11:17 AM Matthias van de Meent
>  wrote:
> > I was also thinking about smaller per-attribute expression storage, for 
> > index attribute expressions, table default expressions, and functions. 
> > Other than that, less memory overhead for the serialized form of these 
> > constructs also helps for catalog cache sizes, etc.
> > People complained about the size of a fresh initdb, and I agreed with them, 
> > so I started looking at low-hanging fruits, and this is one.
> >
> > I've not done any tests yet on whether it's more performant in general. I'd 
> > expect the new code to do a bit better given the extremely verbose nature 
> > of the data and the rather complex byte-at-a-time token read method used, 
> > but this is currently just a hypothesis.
> > I do think that serialization itself may be slightly slower, but given that 
> > this generally happens only in DDL, and that we have to grow the output 
> > buffer less often, this too may still be a net win (but, again, this is an 
> > untested hypothesis).
>
> I think we're going to have to have separate formats for debugging and
> storage if we want to get very far here. The current format sucks for
> readability because it's so verbose, and tightening that up where we
> can makes sense to me. For me, that can include things like emitting
> unset location fields for sure, but delta-encoding of bitmap sets is
> more questionable. Turning 1 2 3 4 5 6 7 8 9 10 into 1-10 would be
> fine with me because that is both shorter and more readable, but
> turning 2 4 6 8 10 into 2 2 2 2 2 is way worse for a human reader.
> Such optimizations might make sense in a format that is designed for
> computer processing only but not one that has to serve multiple
> purposes.

I suppose so, yes. I've removed the delta-encoding from the
serialization of bitmapsets in the attached patchset.

Peter E. and I spoke about this patchset at FOSDEM PGDay, too. I said
to him that I wouldn't mind if this patchset was only partly applied:
The gains for most of the changes are definitely worth it even if some
others don't get in.

I think it'd be a nice QoL and storage improvement if even only (say)
the first two patches were committed, though the typmod default
markings (or alternatively, using a typedef-ed TypMod and one more
type-specific serialization handler) would also be a good improvement
without introducing too many "common value = default = omitted"
considerations that would reduce debugability.

Attached is patchset v2, which contains the improvements from these patches:

0001 has the "omit defaults" for the current types.
-23.5%pt / -35.1%pt (toasted / raw)
0002+0003 has new #defined type "Location" for those fields in Nodes
that point into (or have sizes of) query texts, and adds
infrastructure to conditionally omit them at all (see previous
discussions)
-3.5%pt / -6.3%pt
0004 has new #defined type TypeMod as alias for int32, that uses a
default value of -1 for (de)serialization purposes.
-3.0%pt / -6.1%pt
0005 updates Const node serialization to omit `:constvalue` if the
value is null.
+0.1%pt / -0.1%pt [^0]
0006 does run-length encoding for bitmaps and the various typed
integer lists, using "+int" as indicators of a run of a certain
length, excluding the start value.
 Bitmaps, IntLists and XidLists are based on runs with increments
of 1 (so, the notation (i 1 +3) means (i 1 2 3 4)), while OidLists are
based on runs with no increments (so, (o 1 +3) means (o 1 1 1 1)).
-2.5%pt / -0.6%pt
0007 does add some select custom 'default' values, in that the
varnosyn and varattnosyn fields now treat the value of varno and
varattno as their default values.
This reduces the size of lists of Vars significantly and has a
very meaningful impact on the size of the compressed data (the default
pg_rewrite dataset contains some 10.8k Var nodes).
-10.4%pt / -9.7%pt

Total for the full applied patchset:
55.5% smaller data in pg_rewrite.ev_action before TOAST
45.7% smaller data in pg_rewrite.ev_action after applying TOAST

Toast relation size, as fraction of the main pg_rewrite table:
select pg_relation_size(2838) *1.0 / pg_relation_size('pg_rewrite');
  master: 4.7
  0007: 1.3

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[^0]: The small difference in size for patch 0005 is presumably due to
low occurrance of NULL-valued Const nodes. Additionally, the inline vs
out-of-line TOASTed data and data (not) fitting on the last blocks of
each relation are likely to cause the change in total compression
ratio. If we had more null-valued Const nodes in pg_rewrite, the ratio
would presumably have been better than this.

PS: query I used for my data collection, + combined data:

select 'master' as "ve

Re: Reducing output size of nodeToString

2024-01-31 Thread Matthias van de Meent
On Wed, 31 Jan 2024, 09:16 Peter Eisentraut,  wrote:

> On 30.01.24 12:26, Matthias van de Meent wrote:
> >> Most of the other defaults I'm doubtful about.  First, we are colliding
> >> here between the goals of minimizing the storage size and making the
> >> debug output more readable.
> > I've never really wanted to make the output "more readable". The
> > current one is too verbose, yes.
>
> My motivations at the moment to work in this area are (1) to make the
> output more readable, and (2) to reduce maintenance burden of node
> support functions.
>
> There can clearly be some overlap with your goals.  For example, a less
> verbose and less redundant output can ease readability.  But it can also
> go the opposite direction; a very minimalized output can be less readable.
>
> I would like to understand your target more.  You have shown some
> figures how these various changes reduce storage size in pg_rewrite.
> But it's a few hundred kilobytes, if I read this correctly, maybe some
> megabytes if you add a lot of user views.  Does this translate into any
> other tangible benefits, like you can store more views, or processing
> views is faster, or something like that?


I was also thinking about smaller per-attribute expression storage, for
index attribute expressions, table default expressions, and functions.
Other than that, less memory overhead for the serialized form of these
constructs also helps for catalog cache sizes, etc.
People complained about the size of a fresh initdb, and I agreed with them,
so I started looking at low-hanging fruits, and this is one.

I've not done any tests yet on whether it's more performant in general. I'd
expect the new code to do a bit better given the extremely verbose nature
of the data and the rather complex byte-at-a-time token read method used,
but this is currently just a hypothesis.
I do think that serialization itself may be slightly slower, but given that
this generally happens only in DDL, and that we have to grow the output
buffer less often, this too may still be a net win (but, again, this is an
untested hypothesis).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


Re: Reducing output size of nodeToString

2024-01-30 Thread Matthias van de Meent
On Tue, 9 Jan 2024, 09:23 Peter Eisentraut,  wrote:
>
> On 04.01.24 00:23, Matthias van de Meent wrote:
> > Something like the attached? It splits out into the following
> > 0001: basic 'omit default values'
>
>   /* Write an integer field (anything written as ":fldname %d") */
> -#define WRITE_INT_FIELD(fldname) \
> +#define WRITE_INT_FIELD_DIRECT(fldname) \
>  appendStringInfo(str, " :" CppAsString(fldname) " %d",
> node->fldname)
> +#define WRITE_INT_FIELD_DEFAULT(fldname, default) \
> +   ((node->fldname == default) ? (0) : WRITE_INT_FIELD_DIRECT(fldname))
> +#define WRITE_INT_FIELD(fldname) \
> +   WRITE_INT_FIELD_DEFAULT(fldname, 0)
>
> Do we need the _DIRECT macros at all?  Could we not combine that into
> the _DEFAULT ones?

I was planning on using them to reduce the size of generated code for
select fields that we know we will always serialize, but then later
decided against doing that in this patch as it'd add even more
arbitrary annotations to nodes. This is a leftover from that.

> I think the way the conditional operator (?:) is written is not
> technically correct C,
> [...]
> I think it would be better to write this
> in a more straightforward way like
>
> #define WRITE_INT_FIELD_DEFAULT(fldname, default) \
> do { \
> [...]
> while (0)
>
> Relatedly, this
>
> +/* a scaffold function to read an optionally-omitted field */
> [...]
> would need to be written with a do { } while (0) wrapper around it.

I'll fix that.

> > 0002: reset location and other querystring-related node fields for all
> > catalogs of type pg_node_tree.
>
> This goal makes sense, but I think it can be done in a better way.  If
> you look into the area of stringToNode(), stringToNodeWithLocations(),
> and stringToNodeInternal(), there already is support for selectively
> resetting or omitting location fields.  Notably, this works with the
> existing automated knowledge of where the location fields are and
> doesn't require a new hand-maintained table.  I think the way forward
> here would be to apply a similar toggle to nodeToString() (the reverse).

I'll try to work something out for that.

> > 0003: add default marking on typmod fields.
> > 0004 & 0006: various node fields marked with default() based on
> > observed common or initial values of those fields
>
> I think we could get about half the benefit here more automatically, by
> creating a separate type for typmods, like
>
> typedef int32 TypMod;
>
> and then having the node support automatically generate the
> serialization support with a -1 default.

Hm, I suspect that the code churn for that would be significant. I'd
also be confused when the type in storage (pg_attribute, pg_type's
typtypmod) is still int32 when it would be TypMod only in nodes.

> (A similar thing could be applied to the location fields, which would
> allow us to get rid of the current hack of parsing out the name.)

I suppose so.

> Most of the other defaults I'm doubtful about.  First, we are colliding
> here between the goals of minimizing the storage size and making the
> debug output more readable.

I've never really wanted to make the output "more readable". The
current one is too verbose, yes.

> If a Query dump would now omit the
> commandType field if it is CMD_SELECT, I think that would be widely
> confusing, and one would need to check the source code to identify the
> reason.

AFAIK, SELECT is the only command type you can possibly store in a
view (insert/delete/update/utility are all invalid there, and while
I'm not fully certain about MERGE, I'd say it's certainly a niche).

> Also, what if we later decide to change a "default" for a
> field.  Then output between version would differ.  Of course, node
> output does change between versions in general, but these kinds of
> differences would be confusing.

I've not heard of anyone trying to read and compare the contents of
pg_node_tree manually where they're not trying to debug some
deep-nested issue. Note

> Second, this relies on hand-maintained
> annotations that were created by you presumably through a combination of
> intuition and testing, based on what is in the template database.  Do we
> know whether this matches real-world queries created by users later?

No, or at least I don't know this for certain. But I think it's a good start.

> Also, my experience dealing with the node support over the last little
> while is that these manually maintained exceptions get ossified and
> outdated and create a maintenance headache for the future.

I'm not sure what headache this would become. nodeToString is a fairly
straightforward API with (AFAIK) no external dependencies, where only
nodes go in and out. The metadata on t

Re: Add bump memory context type and use it for tuplesorts

2024-01-25 Thread Matthias van de Meent
On Mon, 6 Nov 2023 at 19:54, Matthias van de Meent
 wrote:
>
> On Tue, 11 Jul 2023 at 01:51, David Rowley  wrote:
>>
>> On Tue, 27 Jun 2023 at 21:19, David Rowley  wrote:
>>> I've attached the bump allocator patch and also the script I used to
>>> gather the performance results in the first 2 tabs in the attached
>>> spreadsheet.
>>
>> I've attached a v2 patch which changes the BumpContext a little to
>> remove some of the fields that are not really required.  There was no
>> need for the "keeper" field as the keeper block always comes at the
>> end of the BumpContext as these are allocated in a single malloc().
>> The pointer to the "block" also isn't really needed. This is always
>> the same as the head element in the blocks dlist.

>> +++ b/src/backend/utils/mmgr/bump.c
>> [...]
>> +MemoryContext
>> +BumpContextCreate(MemoryContext parent,
>> +  const char *name,
>> +  Size minContextSize,
>> +  Size initBlockSize,
>> +  Size maxBlockSize)
>> [...]
>> +/* Determine size of initial block */
>> +allocSize = MAXALIGN(sizeof(BumpContext)) + Bump_BLOCKHDRSZ +
>> +if (minContextSize != 0)
>> +allocSize = Max(allocSize, minContextSize);
>> +else
>> +allocSize = Max(allocSize, initBlockSize);

Shouldn't this be the following, considering the meaning of "initBlockSize"?

+allocSize = MAXALIGN(sizeof(BumpContext)) + Bump_BLOCKHDRSZ +
+Bump_CHUNKHDRSZ + initBlockSize;
+if (minContextSize != 0)
+allocSize = Max(allocSize, minContextSize);

>> + * BumpFree
>> + *Unsupported.
>> [...]
>> + * BumpRealloc
>> + *Unsupported.

Rather than raising an error, can't we make these no-ops (potentially
optionally, or in a different memory context)?

What I mean is, I get that it is an easy validator check that the code
that uses this context doesn't accidentally leak memory through
assumptions about pfree, but this does make this memory context
unusable for more generic operations where leaking a little memory is
preferred over the overhead of other memory contexts, as
MemoryContextReset is quite cheap in the grand scheme of things.

E.g. using a bump context in btree descent could speed up queries when
we use compare operators that do allocate memory (e.g. numeric, text),
because btree operators must not leak memory and thus always have to
manually keep track of all allocations, which can be expensive.

I understand that allowing pfree/repalloc in bump contexts requires
each allocation to have a MemoryChunk prefix in overhead, but I think
it's still a valid use case to have a very low overhead allocator with
no-op deallocator (except context reset). Do you have performance
comparison results between with and without the overhead of
MemoryChunk?
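For illustration, here's a toy model of the kind of allocator I have in mind (a sketch, not the patch's actual BumpContext): pointer-bump allocation over a single block, with pfree as a deliberate no-op so that memory is only reclaimed at context reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAXALIGN(x)  (((size_t) (x) + 7) & ~(size_t) 7)

/* toy bump context: one backing block, no per-chunk headers */
typedef struct ToyBumpContext
{
    char   *base;               /* start of the backing block */
    size_t  used;               /* bump offset */
    size_t  size;               /* block capacity */
} ToyBumpContext;

static void *toy_bump_alloc(ToyBumpContext *cxt, size_t sz)
{
    sz = MAXALIGN(sz);          /* keep returned pointers aligned */
    if (cxt->used + sz > cxt->size)
        return NULL;            /* a real context would chain a new block */
    void *p = cxt->base + cxt->used;
    cxt->used += sz;
    return p;
}

static void toy_bump_pfree(void *p)
{
    (void) p;                   /* deliberate no-op: "leak" until reset */
}

static void toy_bump_reset(ToyBumpContext *cxt)
{
    cxt->used = 0;              /* reclaim everything at once */
}
```

The per-allocation cost is a bounds check and an addition; the trade-off is that individual deallocation does nothing, which is fine for callers (like btree descent) that reset the whole context at once.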



Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2024-01-24 Thread Matthias van de Meent
On Fri, 19 Jan 2024 at 05:55, Dilip Kumar  wrote:
>
> On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent
>  wrote:
> >
> > Hi,
> >
> > Currently, nbtree code compares each and every column of an index
> > tuple during the binary search on the index page. With large indexes
> > that have many duplicate prefix column values (e.g. an index on (bool,
> > bool, uuid) ) that means a lot of wasted time getting to the right
> > column.
> >
> > The attached patch improves on that by doing per-page dynamic prefix
> > truncation: If we know that on both the right and left side there are
> > index tuples where the first two attributes are equal to the scan key,
> > we skip comparing those attributes at the current index tuple and
> > start with comparing attribute 3, saving two attribute compares. We
> > gain performance whenever comparing prefixing attributes is expensive
> > and when there are many tuples with a shared prefix - in unique
> > indexes this doesn't gain much, but we also don't lose much in this
> > case.
> >
> > This patch was originally suggested at [0], but it was mentioned that
> > it could be pulled out into its own thread. Earlier, the
> > performance gains were not clearly there for just this patch, but
> > after further benchmarking this patch stands on its own for
> > performance: it sees no obvious degradation of performance, while
> > gaining 0-5% across various normal indexes on the cc-complete sample
> > dataset, with the current worst-case index shape getting a 60%+
> > improved performance on INSERTs in the tests at [0].
>
> +1 for the idea, I have some initial comments while reading through the patch.

Thank you for the review.

> 1.
> Commit message refers to a non-existing reference '(see [0])'.

Noted, I'll update that.

> 2.
> +When we do a binary search on a sorted set (such as a BTree), we know that a
> +tuple will be smaller than its left neighbour, and larger than its right
> +neighbour.
>
> I think this should be 'larger than left neighbour and smaller than
> right neighbour' instead of the other way around.

Noted, will be fixed, too.

> 3.
> +With the above optimizations, dynamic prefix truncation improves the worst
> +case complexity of indexing from O(tree_height * natts * log(tups_per_page))
> +to O(tree_height * (3*natts + log(tups_per_page)))
>
> Where do the 3*natts come from?  Is it related to setting up the
> dynamic prefix at each level?

Yes: We need to establish prefixes for both a tuple that's ahead of
the to-be-compared tuple, and one that's after the to-be-compared
tuple. Assuming homogeneous random distribution of scan key accesses
across the page (not always the case, but IMO a reasonable starting
point), this would average to 3 unprefixed compares before you have
established both a higher and a lower prefix.
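To sketch the "sandwich" argument in code (a toy model over int-array tuples, not the patch itself): once the tuples at both the left and right fence of the search range are known to share a prefix with the key, every tuple between them must share it too, so compares can start past that prefix.

```c
#include <assert.h>

#define NATTS 3

/* Compare tuple vs. key starting at attribute *skip; on return, *skip
 * holds the number of leading attributes found equal to the key. */
static int tuple_cmp(const int *tup, const int *key, int *skip)
{
    for (int att = *skip; att < NATTS; att++)
    {
        if (tup[att] != key[att])
        {
            *skip = att;
            return tup[att] < key[att] ? -1 : 1;
        }
    }
    *skip = NATTS;
    return 0;
}

/* Find the first tuple >= key.  Attributes known equal at both the
 * left and right fence need not be compared again for tuples between
 * the fences, since the rows are lexicographically sorted. */
static int bsearch_dynamic_prefix(int rows[][NATTS], int nrows, const int *key)
{
    int lo = 0, hi = nrows;             /* searching [lo, hi) */
    int lo_prefix = 0, hi_prefix = 0;   /* established prefixes */

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        int skip = lo_prefix < hi_prefix ? lo_prefix : hi_prefix;
        int cmp = tuple_cmp(rows[mid], key, &skip);

        if (cmp < 0)
        {
            lo = mid + 1;
            lo_prefix = skip;           /* mid becomes the left fence */
        }
        else
        {
            hi = mid;
            hi_prefix = skip;           /* mid becomes the right fence */
        }
    }
    return lo;
}
```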

> 4.
> + /*
> + * All tuple attributes are equal to the scan key, only later attributes
> + * could potentially not equal the scan key.
> + */
> + *compareattr = ntupatts + 1;
>
> Can you elaborate on this more? If all tuple attributes are equal to
> the scan key then what do those 'later attributes' mean?

In inner pages, tuples may not have all key attributes, as some may
have been truncated away in page splits. So, tuples that have at least
the same prefix as this (potentially truncated) tuple will need to be
compared starting at the first missing attribute of this tuple, i.e.
ntupatts + 1.

Kind regards,

Matthias van de Meent




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-01-22 Thread Matthias van de Meent
On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan  wrote:
>

Thank you for your replies so far.

> On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent
>  wrote:
> > I would agree with you if this was about new APIs and features, but
> > here existing APIs are being repurposed without changing them. A
> > maintainer of index AMs would not look at the commit title 'Enhance
> > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure
> > my existing amsearcharray+amcanorder handling actually supports
> > non-prefix arrays keys and return data in index order".
>
> This is getting ridiculous.
>
> It is quite likely that there are exactly zero affected out-of-core
> index AMs. I don't count pgroonga as a counterexample (I don't think
> that it actually fulfills the definition of a ). Basically,
> "amcanorder" index AMs more or less promise to be compatible with
> nbtree, down to having the same strategy numbers. So the idea that I'm
> going to upset amsearcharray+amcanorder index AM authors is a
> completely theoretical problem. The planner code evolved with nbtree,
> hand-in-glove.

And this is where I disagree with your (perceived) implicit argument
that this should be and always stay this way. I don't mind changes in
the planner for nbtree per se, but as I've mentioned before in other
places, I really don't like how we handle amcanorder as if it were
amisbtree. But it's not called that, so we shouldn't expect that to
remain the case; and definitely not keep building on those
expectations that it always is going to be the nbtree when amcanorder
is set (or amsearcharray is set, or ..., or any combination of those
that is currently used by btree). By keeping that expectation alive,
this becomes a self-fulfilling prophecy, and I really don't like such
restrictive self-fulfilling prophecies. It's nice that we have index
AM feature flags, but if we only effectively allow one implementation
for this set of flags by ignoring potential other users when changing
behavior, then what is the API good for (apart from abstraction
layering, which is nice)?

/rant


> I'm more than happy to add a reminder to the commit message about
> adding a proforma listing to the compatibility section of the Postgres
> 17 release notes. Can we actually agree on which theoretical index AM
> types are affected first, though? Isn't it actually
> amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I
> have that right?

I think that may be right, but I could very well have built a
btree-lite that doesn't have the historical baggage of having to deal
with pages from before v12 (version 4) and some improvements that
haven't made it to core yet.

> BTW, how should we phrase this compatibility note, so that it can be
> understood? It's rather awkward.

Something like the following, maybe?

"""
Compatibility: The planner will now generate paths with array scan
keys in any column for ordered indexes, rather than on only a prefix
of the index columns. The planner still expects fully ordered data
from those indexes.
Historically, the btree AM couldn't output correctly ordered data for
suffix array scan keys, which was "fixed" by prohibiting the planner
from generating array scan keys without an equality prefix scan key up
to that attribute. In this commit, the issue has been fixed, and the
planner restriction has thus been removed as the only in-core IndexAM
that reports amcanorder now supports array scan keys on any column
regardless of what prefix scan keys it has.
"""

> > > As I said to Heikki, thinking about this some more is on my todo list.
> > > I mean the way that this worked had substantial revisions on HEAD
> > > right before I posted v9. v9 was only to fix the bit rot that that
> > > caused.
> >
> > Then I'll be waiting for that TODO item to be resolved.
>
> My TODO item is to resolve the question of whether and to what extent
> these two optimizations should be combined. It's possible that I'll
> decide that it isn't worth it, for whatever reason.

That's fine, this decision (and any work related to it) is exactly
what I was referring to with that mention of the TODO.

> > > Even single digit
> > > thousand of array elements should be considered huge. Plus this code
> > > path is only hit with poorly written queries.
> >
> > Would you accept suggestions?
>
> You mean that you want to have a go at it yourself? Sure, go ahead.
>
> I cannot promise that I'll accept your suggested revisions, but if
> they aren't too invasive/complicated compared to what I have now, then
> I will just accept them. I maintain that this isn't really necessary,
> but it might be the path of least resistance at this point.


Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-01-18 Thread Matthias van de Meent
On Tue, 16 Jan 2024 at 03:03, Peter Geoghegan  wrote:
>
> On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent
>  wrote:
> > Can you pull these planner changes into their own commit(s)?
> > As mentioned upthread, it's a significant change in behavior that
> > should have separate consideration and reference in the commit log. I
> > really don't think it should be buried in the 5th paragraph of an
> > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the
> > changes of btree are arguably independent of the planner changes, as
> > the btree changes improve performance even if we ignore that it
> > implements strict result ordering.
>
> I'm not going to break out the planner changes, because they're *not*
> independent in any way.

The planner changes depend on the btree changes, that I agree with.
However, I don't think that the btree changes depend on the planner
changes.

> You could say the same thing about practically
> any work that changes the planner. They're "buried" in the 5th
> paragraph of the commit message. If an interested party can't even
> read that far to gain some understanding of a legitimately complicated
> piece of work such as this, I'm okay with that.

I would agree with you if this was about new APIs and features, but
here existing APIs are being repurposed without changing them. A
maintainer of index AMs would not look at the commit title 'Enhance
nbtree ScalarArrayOp execution' and think "oh, now I have to make sure
my existing amsearcharray+amcanorder handling actually supports
non-prefix arrays keys and return data in index order".
There are also no changes in amapi.h that would signal any index AM
author that expectations have changed. I really don't think you can
just ignore all that, and I believe this to also be the basis of
Heikki's first comment.

> As I said to Heikki, thinking about this some more is on my todo list.
> I mean the way that this worked had substantial revisions on HEAD
> right before I posted v9. v9 was only to fix the bit rot that that
> caused.

Then I'll be waiting for that TODO item to be resolved.

> > > +++ b/src/backend/access/nbtree/nbtutils.c
> > > +/*
> > > + * _bt_merge_arrays() -- merge together duplicate array keys
> > > + *
> > > + * Both scan keys have array elements that have already been sorted and
> > > + * deduplicated.
> > > + */
> >
> > As I mentioned upthread, I find this function to be very wasteful, as
> > it uses N binary searches to merge join two already sorted arrays,
> > resulting in a O(n log(m)) complexity, whereas a normal merge join
> > should be O(n + m) once the input datasets are sorted.
>
> And as I mentioned upthread, I think that you're making a mountain out
> of a molehill here. This is not a merge join.

We're merging two sorted sets and want to retain only the matching
entries, while retaining the order of the data. AFAIK the best
algorithm available for this is a sort-merge join.
Sure, it isn't a MergeJoin plan node, but that's not what I was trying to argue.

> Even single digit
> thousand of array elements should be considered huge. Plus this code
> path is only hit with poorly written queries.

Would you accept suggestions?

> > Please fix this, as it shows up in profiling of large array merges.
> > Additionally, as it merges two arrays of unique items into one,
> > storing only matching entries, I feel that it is quite wasteful to do
> > this additional allocation here. Why not reuse the original allocation
> > immediately?
>
> Because that's adding more complexity for a code path that will hardly
> ever be used in practice. For something that happens once, during a
> preprocessing step.

Or many times, when we're in a parameterized loop, as was also pointed
out by Heikki. While I do think it is rare, the existence of this path
that merges these arrays implies the need for merging these arrays.

> > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
> > > + IndexTuple tuple, int sktrig, bool 
> > > validtrig)
> >
> > I don't quite understand what the 'validtrig' argument is used for.
> > There is an assertion that it is false under some conditions in this
> > code, but it's not at all clear to me why that would have to be the
> > case - it is called with `true` in one of the three call sites. Could
> > the meaning of this be clarified?
>
> Sure, I'll add a comment.

Thanks!

> > I also feel that this patch includes several optimizations such as
> > this sktrig argument which aren't easy to understand. Could you pull
> > that into a separately reviewable patch?
>

Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Matthias van de Meent
On Thu, 18 Jan 2024 at 13:41, Montana Low  wrote:
>
> The overall trend in machine learning embedding sizes has been growing
> rapidly over the last few years, from 128 up to 4K dimensions, yielding
> additional value and quality improvements. It's not clear when this trend in
> growth will ease. The vectors that the leading text embedding models generate
> now exceed the index storage available in IndexTupleData.t_info.
>
> The current index tuple size is stored in 13 bits of IndexTupleData.t_info, 
> which limits the max size of an index tuple to 2^13 = 8192 bytes. Vectors 
> implemented by pgvector currently use a 32 bit float for elements, which 
> limits vector size to 2K dimensions, which is no longer state of the art.
>
> I've attached a patch that increases  IndexTupleData.t_info from 16bits to 
> 32bits allowing for significantly larger index tuple sizes. I would guess 
> this patch is not a complete implementation that allows for migration from 
> previous versions, but it does compile and initdb succeeds. I'd be happy to 
> continue work if the core team is receptive to an update in this area, and 
> I'd appreciate any feedback the community has on the approach.

I'm not sure why this is needed.
Vector data indexing generally requires bespoke index methods, which
are not currently available in the core PostgreSQL repository, and
indexes are not at all required to utilize the IndexTupleData format
for their data tuples (one example of this being BRIN).
The only hard requirement for AMs which use Postgres' relfile format is
that they follow the page layout and optionally the pd_linp/ItemId
array, which limit the size of page tuples to 2^15-1 bytes (see
ItemIdData.lp_len) and pages to ~2^16 bytes
(PageHeaderData.pd_pagesize_version).
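The 15-bit tuple-length limit can be seen directly in the line pointer layout (the struct below mirrors PostgreSQL's src/include/storage/itemid.h; the asserts assume a typical ABI where the bitfields pack into one 32-bit word):

```c
#include <assert.h>

/* line pointer layout, as in PostgreSQL's src/include/storage/itemid.h */
typedef struct ItemIdData
{
    unsigned lp_off:15,     /* offset to tuple (from start of page) */
             lp_flags:2,    /* state of line pointer */
             lp_len:15;     /* byte length of tuple */
} ItemIdData;
```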

Next, the only non-internal use of IndexTuple is in IndexOnlyScans.
However, here the index may fill the scandesc->xs_hitup with a heap
tuple instead, which has a length stored in uint32, too. So, I don't
quite see why this would be required for all indexes.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Sequence Access Methods, round two

2024-01-18 Thread Matthias van de Meent
On Thu, 18 Jan 2024, 16:06 Peter Eisentraut,  wrote:
>
> On 01.12.23 06:00, Michael Paquier wrote:
> > Please find attached a patch set that aims at implementing sequence
> > access methods, with callbacks following a model close to table and
> > index AMs, with a few cases in mind:
> > - Global sequences (including range-allocation, local caching).
> > - Local custom computations (a-la-snowflake).
>
> That's a lot of code, but the use cases are summarized in two lines?!?
>
> I would like to see a lot more elaboration what these uses cases are (I
> recognize the words, but do we have the same interpretation of them?)
> and how they would be addressed by what you are proposing, and better
> yet an actual implementation of something useful, rather than just a
> dummy test module.

At $prevjob we had a use case for a PRNG to generate small,
non-sequential "random" numbers without the birthday problem that
random sampling hits at around sqrt(option space), as working around
that would have increased the printed length of the numbers beyond a
set limit. The sequence API proposed here would've been a great
alternative to the solution we found, as it would allow a sequence to
be backed by a Linear Congruential Generator directly, rather than by
our own implementation of a transactional random_sequence table.
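For the archives, this is the kind of generator I mean (a toy sketch; the parameters are illustrative, chosen only to satisfy the Hull-Dobell conditions for a power-of-two modulus): a full-period LCG visits every value in the option space exactly once, so the numbers look non-sequential while collisions are impossible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Full-period LCG over [0, 2^16): with m a power of two, the period
 * equals m iff c is odd and a % 4 == 1 (Hull-Dobell theorem).  The
 * parameters below are illustrative, not a recommendation. */
#define LCG_A 25173u
#define LCG_C 13849u
#define LCG_M 65536u

static uint32_t lcg_next(uint32_t x)
{
    return (LCG_A * x + LCG_C) % LCG_M;
}
```

A sequence AM backed by this would only need to store the last value handed out; stepping it is one multiply-add per nextval.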

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: cleanup patches for incremental backup

2024-01-18 Thread Matthias van de Meent
On Wed, 17 Jan 2024 at 21:10, Robert Haas  wrote:
>
> On Wed, Jan 17, 2024 at 1:42 PM Matthias van de Meent
>  wrote:
> > Sure, added in attached.
>
> I think this mostly looks good now but I just realized that I think
> this needs rephrasing:
>
> + To restore incremental backups the tool <xref linkend="app-pgcombinebackup"/>
> + is used, which combines incremental backups with a base backup and
> + <glossterm linkend="glossary-wal">WAL</glossterm> to restore a
> + <glossterm linkend="glossary-db-cluster">database cluster</glossterm> to
> + a consistent state.
>
> The way this is worded, at least to me, it makes it sound like
> pg_combinebackup is going to do the WAL recovery for you, which it
> isn't. Maybe:
>
> To restore incremental backups the tool <xref
> linkend="app-pgcombinebackup"/> is used, which combines incremental
> backups with a base backup. Afterwards, recovery can use <glossterm
> linkend="glossary-wal">WAL</glossterm> to bring the <glossterm
> linkend="glossary-db-cluster">database cluster</glossterm> to a
> consistent state.

Sure, that's fine with me.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: cleanup patches for incremental backup

2024-01-17 Thread Matthias van de Meent
On Tue, 16 Jan 2024 at 21:49, Robert Haas  wrote:
>
> On Tue, Jan 16, 2024 at 3:22 PM Matthias van de Meent
>  wrote:
> + A special <glossterm linkend="glossary-basebackup">base backup</glossterm>
> + that for some WAL-logged relations only contains the pages that were
> + modified since a previous backup, as opposed to the full relation data 
> of
> + normal base backups. Like base backups, it is generated by the tool
> + <xref linkend="app-pgbasebackup"/>.
>
> Could we say "that for some files may contain only those pages that
> were modified since a previous backup, as opposed to the full contents
> of every file"?

Sure, added in attached.

> + To restore incremental backups the tool <xref linkend="app-pgcombinebackup"/>
> + is used, which combines the incremental backups with a base backup and
> + [...]
> I wondered if this needed to be clearer that the chain of backups
> could have length > 2. But on further reflection, I think it's fine,
> unless you feel otherwise.

I removed "the" from the phrasing "the incremental backups", which
makes it a bit less restricted.

> The rest LGTM.

In the latest patch I also fixed the casing of "Incremental Backup" to
"... backup", to be in line with most other multi-word items.

Thanks!

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v3-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-16 Thread Matthias van de Meent
On Tue, 16 Jan 2024 at 16:39, Robert Haas  wrote:
>
> On Mon, Jan 15, 2024 at 3:31 PM Matthias van de Meent
>  wrote:
> > Off-list I was notified that the new WAL summarizer process was not
> > yet added to the glossary, so PFA a patch that does that.
> > In passing, it also adds "incremental backup" to the glossary, and
> > updates the documented types of backends in monitoring.sgml with the
> > new backend type, too.
>
> I wonder if it's possible that you sent the wrong version of this
> patch, because:
>
> (1) The docs don't build with this applied. I'm not sure if it's the
> only problem, but  the closing >.

That's my mistake, I didn't check install-world yet due to unrelated
issues building the docs. I've since sorted out these issues (this was
a good stick to get that done), so this issue is fixed in the attached
patch.

> (2) The changes to monitoring.sgml contain an unrelated change, about
> pg_stat_all_indexes.idx_scan.

Thanks for noticing, fixed in attached.

> Also, I think the "For more information, see  /> bit should have a period after the markup tag, as we seem to do in
> other cases.

Fixed.

> One other thought is that the incremental backup only replaces
> relation files with incremental files, and it never does anything
> about FSM files. So the statement that it only contains data that was
> potentially changed isn't quite correct. It might be better to phrase
> it the other way around i.e. it is like a full backup, except that
> some files can be replaced by incremental files which omit blocks to
> which no WAL-logged changes have been made.

How about the attached?


Kind regards,

Matthias van de Meent


v2-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-15 Thread Matthias van de Meent
On Mon, 15 Jan 2024 at 17:58, Robert Haas  wrote:
>
> On Sat, Jan 13, 2024 at 1:00 PM Alexander Lakhin  wrote:
> > I've found one more typo in the sgml:
> > summarized_pid
> > And one in a comment:
> > sumamry
> >
> > A trivial fix is attached.
>
> Thanks, committed.

Off-list I was notified that the new WAL summarizer process was not
yet added to the glossary, so PFA a patch that does that.
In passing, it also adds "incremental backup" to the glossary, and
updates the documented types of backends in monitoring.sgml with the
new backend type, too.

Kind regards,

Matthias van de Meent.


v1-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-01-15 Thread Matthias van de Meent
I also notice that the merging of values doesn't seem to be applied
optimally with mixed-type array operations: num = int[] AND num =
bigint[] AND num = int[] doesn't seem to merge the first and last
array ops. I'm also concerned about being (un)able to merge

> +/*
> + * _bt_merge_arrays() -- merge together duplicate array keys
> + *
> + * Both scan keys have array elements that have already been sorted and
> + * deduplicated.
> + */

As I mentioned upthread, I find this function to be very wasteful, as
it uses N binary searches to merge join two already sorted arrays,
resulting in a O(n log(m)) complexity, whereas a normal merge join
should be O(n + m) once the input datasets are sorted.
Please fix this, as it shows up in profiling of large array merges.
Additionally, as it merges two arrays of unique items into one,
storing only matching entries, I feel that it is quite wasteful to do
this additional allocation here. Why not reuse the original allocation
immediately?
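Something along these lines is what I have in mind (a sketch, not a drop-in replacement for _bt_merge_arrays): a single O(n + m) pass over both sorted, deduplicated arrays that writes the matches back into the first array in place, avoiding both the repeated binary searches and the extra allocation.

```c
#include <assert.h>

/* O(n + m) merge-intersection of two sorted, deduplicated int arrays,
 * writing the common elements back into a[] in place.  Returns the
 * new length of a[]. */
static int merge_intersect(int *a, int na, const int *b, int nb)
{
    int i = 0, j = 0, out = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;                /* a[i] has no match in b */
        else if (a[i] > b[j])
            j++;                /* b[j] has no match in a */
        else
        {
            a[out++] = a[i];    /* shared element survives */
            i++;
            j++;
        }
    }
    return out;
}
```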

> +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
> + IndexTuple tuple, int sktrig, bool validtrig)

I don't quite understand what the 'validtrig' argument is used for.
There is an assertion that it is false under some conditions in this
code, but it's not at all clear to me why that would have to be the
case - it is called with `true` in one of the three call sites. Could
the meaning of this be clarified?

I also feel that this patch includes several optimizations such as
this sktrig argument which aren't easy to understand. Could you pull
that into a separately reviewable patch?

Additionally, could you try to create a single point of entry for the
array key stuff that covers the new systems? I've been trying to wrap
my head around this, and it's taking a lot of effort.

> _bt_advance_array_keys

Thinking about the implementation here:
We require transitivity for btree opclasses, where A < B implies NOT A
= B, etc. Does this also carry over into cross-type operators? E.g. a
type 'truncatedint4' that compares only the highest 16 bits of an
integer would be strictly sorted, and could compare 0::truncatedint4 =
0::int4 as true, as well as 0::truncatedint4 = 2::int4, while 0::int4
= 2::int4 is false.
Would it be valid to add support methods for truncatedint4 to an int4
btree opclass, or is transitivity also required for all operations?
i.e. all values that one operator class considers unique within an
opfamily must be considered unique for all additional operators in the
opfamily, or is that not required?
If not, then that would pose problems for this patch, as the ordering
of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY
('{0,65536}'::truncatedint4[]) could potentially skip results.
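To make the hazard concrete (toy code; truncatedint4 is hypothetical, as above), here is the non-transitivity in miniature: an "equality" operator that compares only the high 16 bits agrees with int4 equality on some pairs but not others, so the two operators do not induce the same partition of values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* hypothetical cross-type "equality": compares only the high 16 bits */
static bool truncatedint4_eq_int4(int32_t a, int32_t b)
{
    return (a >> 16) == (b >> 16);
}

static bool int4_eq(int32_t a, int32_t b)
{
    return a == b;
}
```

Since 0 ~ 0 and 0 ~ 2 under the truncating operator while 0 = 2 is false under int4 equality, merging the arrays {1,2,3} and {0,65536} yields different survivor sets depending on which notion of equality is used, which is exactly how results could be skipped.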

I'm also no fan of the (tail) recursion. I would agree that this is
unlikely to consume a lot of stack, but it does consume stack space
nonetheless, and I'd prefer it was not done this way.

I notice an assertion error here:
> +Assert(cur->sk_strategy != BTEqualStrategyNumber);
> +Assert(all_required_sk_satisfied);
> +Assert(!foundRequiredOppositeDirOnly);
> +
> +foundRequiredOppositeDirOnly = true;

This assertion can be hit with the following test case:

CREATE TABLE test AS
SELECT i a, i b, i c FROM generate_series(1, 1000) i;
CREATE INDEX ON test(a, b, c); ANALYZE;
SELECT count(*) FROM test
WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1
AND b = ANY ('{1,2,3}');

> +_bt_update_keys_with_arraykeys(IndexScanDesc scan)

I keep getting confused by the mixing of integer increments and
pointer increments. Could you explain why in this code you chose to
increment a pointer for "ScanKey cur", while using array indexing for
other fields? It feels very arbitrary to me, and that makes the code
difficult to follow.

> +++ b/src/test/regress/sql/btree_index.sql
> +-- Add tests to give coverage of various subtle issues.
> +--
> +-- XXX This may not be suitable for commit, due to taking up too many cycles.
> +--
> +-- Here we don't remember the scan's array keys before processing a page, 
> only
> +-- after processing a page (which is implicit, it's just the scan's current
> +-- keys).  So when we move the scan backwards we think that the top-level 
> scan
> +-- should terminate, when in reality it should jump backwards to the leaf 
> page
> +-- that we last visited.

I notice this adds a complex test case that outputs many rows. Can we
do with less rows if we build the index after data insertion, and with
a lower (non-default) fillfactor?

Note: I did not yet do any in-depth review of the planner changes in
indxpath.c/selfuncs.c.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Matthias van de Meent
On Wed, 10 Jan 2024 at 02:44, Andy Fan  wrote:
> Hi,
>
> I want to know if Andres or you have plans
> to do some code review. I don't expect this to happen very soon; I just
> want to make sure we don't end up with each of you thinking the other
> one will do it, while in fact neither does. A commitfest entry
> [1] has been added for this.


> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -5419,6 +5419,7 @@ LockBufHdr(BufferDesc *desc)
> perform_spin_delay();
> }
> finish_spin_delay();
> +START_SPIN_LOCK();
> return old_buf_state | BM_LOCKED;
> }

I think that we need to 'arm' the checks just before we lock the spin
lock, and 'disarm' the checks just after we unlock the spin lock,
rather than after and before, respectively. That way, we won't have a
chance of false negatives: with your current patch it is possible that
an interrupt fires between the acquisition of the lock and the code in
START_SPIN_LOCK() marking the thread as holding a spin lock, which
would cause any check in that signal handler to incorrectly read that
we don't hold any spin locks.
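A toy model of the ordering I'm suggesting (not the patch's code; START_SPIN_LOCK() is modeled as a plain counter increment): arming before the acquisition loop means a badly-timed interrupt can only ever see a false positive (counter set while the lock isn't held yet, which is the safe direction), never a false negative.

```c
#include <assert.h>
#include <stdatomic.h>

/* per-backend in Postgres; a plain counter suffices for this model */
static int spin_lock_count = 0;
static atomic_flag mutex = ATOMIC_FLAG_INIT;

/* a check that a (hypothetical) signal handler could run at any time */
static void assert_no_spinlock_held(void)
{
    assert(spin_lock_count == 0);
}

static void spin_acquire(void)
{
    spin_lock_count++;          /* arm BEFORE acquiring: an interrupt
                                 * between these two steps sees "held",
                                 * which errs on the safe side */
    while (atomic_flag_test_and_set_explicit(&mutex, memory_order_acquire))
        ;                       /* spin */
}

static void spin_release(void)
{
    atomic_flag_clear_explicit(&mutex, memory_order_release);
    spin_lock_count--;          /* disarm AFTER releasing */
}
```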

> +++ b/src/backend/storage/lmgr/lock.c
> @@ -776,6 +776,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
> boolfound_conflict;
> boollog_lock = false;
>
> +Assert(SpinLockCount == 0);
> +

I'm not 100% sure on the policy of this, but theoretically you could
use LockAcquireExtended(dontWait=true) while holding a spin lock, as
that would not have an unknown duration. Then again, this function
also does elog/ereport, which would still cause issues, so this code
may be the better option.

> +elog(PANIC, "stuck spinlock detected at %s, %s:%d after waiting for %u 
> ms",
> + func, file, line, delay_ms);

pg_usleep doesn't actually guarantee that we'll wait for exactly that
duration; depending on signals received while spinning and/or OS
scheduling decisions it may be off by orders of magnitude.

> +++ b/src/common/scram-common.c

This is unrelated to the main patchset.

> +++ b/src/include/storage/spin.h

Minor: I think these changes would be better included in miscadmin.h, or
at least the definition for SpinLockCount should be moved there: the
spin lock system itself shouldn't be needed in places where we need to
make sure that we don't hold any spinlocks, and miscadmin.h already
holds things related to "System interrupt and critical section
handling", which seems quite related.

Kind regards,

Matthias van de Meent




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-01-04 Thread Matthias van de Meent
On Mon, 25 Dec 2023 at 15:12, Michail Nikolaev
 wrote:
>
> Hello!
>
> It seems like the idea of "old" snapshot is still a valid one.
>
> > Should this deal with any potential XID wraparound, too?
>
> As far as I understand, in our case we are not affected by this in any way.
> Vacuum in our table is not possible because of locking, so, nothing
> may be frozen (see below).
> In the case of super long index building, transactional limits will
> stop new connections using current
> regular infrastructure because it is based on relation data (but not
> actual xmin of backends).
>
> > How does this behave when the newly inserted tuple's xmin gets frozen?
> > This would be allowed to happen during heap page pruning, afaik - no
> > rules that I know of which are against that - but it would create
> > issues where normal snapshot visibility rules would indicate it
> > visible to both snapshots regardless of whether it actually was
> > visible to the older snapshot when that snapshot was created...
>
> As I can see, heap_page_prune never freezes any tuples.
> In the case of regular vacuum, it used this way: call heap_page_prune
> and then call heap_prepare_freeze_tuple and then
> heap_freeze_execute_prepared.

Correct, but there are changes being discussed where we would freeze
tuples during pruning as well [0], which would invalidate that
implementation detail. And, if I had to choose between improved
opportunistic freezing and improved R/CIC, I'd probably choose
improved freezing over R/CIC.

As an alternative, we _could_ keep track of concurrent index inserts
using a dummy index (with the same predicate) which only holds the
TIDs of the inserted tuples. We'd keep it as an empty index in phase
1, and every time we reset the visibility snapshot we now only need to
scan that index to know what tuples were concurrently inserted. This
should have a significantly lower IO overhead than repeated full index
bulkdelete scans for the new index in the second table scan phase of
R/CIC. However, in the worst case it could still require another
O(tablesize) of storage.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/caakru_a+g2oe6ahjcbibftnfiy2aib4e31x9qyj_qkjxzmz...@mail.gmail.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Matthias van de Meent
On Thu, 4 Jan 2024 at 08:09, Andy Fan  wrote:
>
> My question is if someone doesn't obey the rule by mistake (everyone
> can make mistake), shall we PANIC on a production environment? IMO I
> think it can be a WARNING on a production environment and be a stuck
> when 'ifdef USE_ASSERT_CHECKING'.
> [...]
> I think a experienced engineer like Thomas can make this mistake and the
> patch was reviewed by 3 peoples, the bug is still there. It is not easy
> to say just don't do it.
>
> the attached code show the prototype in my mind. Any feedback is welcome.

While I understand your point and could maybe agree with the change
itself (a crash isn't great), I don't think it's an appropriate fix
for the problem of holding a spinlock while waiting for a LwLock, as
spin.h specifically mentions the following (and you quoted the same):

"""
Keep in mind the coding rule that spinlocks must not be held for more
than a few instructions.
"""

I suspect that we'd be better off with stronger protections against
waiting for LwLocks while we hold any spin lock. More specifically,
I'm thinking about something like tracking how many spin locks we
hold, and Assert()-ing that we don't hold any such locks when we start
to wait for an LwLock or run CHECK_FOR_INTERRUPTS-related code (with
potential manual contextual overrides in specific areas of code where
specific care has been taken to make it safe to hold spin locks while
doing those operations - although I consider their existence unlikely
I can't rule them out as I've yet to go through all lock-touching
code). This would probably work in a similar manner as
START_CRIT_SECTION/END_CRIT_SECTION.
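To make that tracking idea concrete, here's a minimal sketch. All names here (the counter, the wrapper hooks, the guard) are hypothetical illustrations rather than existing PostgreSQL symbols; a real patch would hide the counter updates inside SpinLockAcquire/SpinLockRelease and place the guard at the entry of LWLockAcquire and the interrupt-processing paths:

```c
#include <assert.h>

/* Hypothetical per-backend counter of currently held spinlocks. */
static int SpinLockCount = 0;

/* Hooks a patch might hide inside SpinLockAcquire/SpinLockRelease. */
static inline void
spin_acquired(void)
{
    SpinLockCount++;
}

static inline void
spin_released(void)
{
    assert(SpinLockCount > 0);
    SpinLockCount--;
}

/*
 * Guard for the top of LWLockAcquire, CHECK_FOR_INTERRUPTS, and other
 * potentially sleeping code paths: holding any spinlock there is a
 * programming error, analogous to the existing checks against
 * allocating memory inside a critical section.
 */
static inline void
assert_no_spinlock_held(void)
{
    assert(SpinLockCount == 0);
}
```

Like CritSectionCount, the counter is backend-local, so no atomics are needed, and the guard compiles away entirely in non-assert builds.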

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Reducing output size of nodeToString

2024-01-03 Thread Matthias van de Meent
On Tue, 2 Jan 2024 at 11:30, Peter Eisentraut  wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
> >
> > The basic techniques used are
> >   - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> >   - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> >   - Reset more fields to their default values before storing the values.
> >   - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
>
> Based on our discussions, my understanding is that you wanted to produce
> an updated patch set that is split up a bit.

I mentioned that I've been working on (but have not yet completed) a
binary serialization format, with an implementation based on Andres'
generated-metadata idea. However, that requires more elaborate
infrastructure than is currently available, so although I said I
expected it to be complete before the Christmas weekend, it'll take
some more time - I'm not sure it'll be ready for PG17.

In the meantime here's an updated version of the v0 patch, formally
keeping the textual format alive while reducing the size significantly
(nearing a 2/3 reduction), taking your comments into account. I think
the gains are worth considering even without the as-of-yet
unimplemented binary format.

> My suggestion is to make incremental patches along these lines:
> [...]

Something like the attached? It splits out into the following
0001: basic 'omit default values'
0002: reset location and other querystring-related node fields for all
catalogs of type pg_node_tree.
0003: add default marking on typmod fields.
0004 & 0006: various node fields marked with default() based on
observed common or initial values of those fields
0005: truncate trailing 0s from outDatum
0007 (new): do run-length + gap coding for bitmapset and the various
integer list types. This saves a surprising amount of bytes.
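As a rough illustration of what 0007 does: a sorted list of integers can be stored as (gap, run-length) pairs, where the gap is the distance from the end of the previous run. This is a sketch of the idea only - the helper names are made up, and the actual patch encodes into the textual nodeToString format rather than into arrays:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode a sorted int list as (gap, runlength) pairs: [10,11,12,20,21]
 * becomes (10,3),(8,2).  Consecutive members collapse into one pair,
 * and gaps stay small, which prints much shorter than the raw values.
 */
static size_t
rle_gap_encode(const int *in, size_t n, int *gaps, int *runs)
{
    size_t npairs = 0;
    size_t i = 0;
    int prev = 0;

    while (i < n)
    {
        int start = in[i];
        size_t len = 1;

        /* extend the run while values stay consecutive */
        while (i + len < n && in[i + len] == start + (int) len)
            len++;

        gaps[npairs] = start - prev;
        runs[npairs] = (int) len;
        npairs++;

        prev = start + (int) len - 1;
        i += len;
    }
    return npairs;
}

/* Inverse transform: expand the pairs back into the member values. */
static size_t
rle_gap_decode(const int *gaps, const int *runs, size_t npairs, int *out)
{
    size_t n = 0;
    int prev = 0;

    for (size_t p = 0; p < npairs; p++)
    {
        int start = prev + gaps[p];

        for (int j = 0; j < runs[p]; j++)
            out[n++] = start + j;
        prev = start + runs[p] - 1;
    }
    return n;
}
```

Bitmapsets and rangetable-index lists are full of consecutive values, which is why this saves a surprising number of bytes on them.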

> The last one I have some doubts about, as previously expressed, but the
> first few seem sensible to me.  By splitting it up we can consider these
> incrementally.

That makes a lot of sense. The numbers for the full patchset do seem
quite positive though: The metrics of the query below show a 40%
decrease in size of a fresh pg_rewrite (standard toast compression)
and a 5% decrease in size of the template0 database. The uncompressed
data of pg_rewrite.ev_action is also 60% smaller.

select pg_database_size('template0') as "template0"
 , pg_total_relation_size('pg_rewrite') as "pg_rewrite"
 , sum(pg_column_size(ev_action)) as "compressed"
 , sum(octet_length(ev_action)) as "raw"
from pg_rewrite;

 version | template0 | pg_rewrite | compressed |   raw
-|---+++-
 master  |   7545359 | 761856 | 573307 | 2998712
 0001|   7365135 | 622592 | 438224 | 1943772
 0002|   7258639 | 573440 | 401660 | 1835803
 0003|   7258639 | 565248 | 386211 | 1672539
 0004|   7176719 | 483328 | 317099 | 1316552
 0005|   7176719 | 483328 | 315556 | 1300420
 0006|   7160335 | 466944 | 302806 | 1208621
 0007|   7143951 | 450560 | 287659 | 1187237

While looking through the data, I noticed that the larger views now
consist to a significant degree of range table entries, specifically
the Alias and Var nodes (which are mostly repeated and/or repetitive
values, but split across nodes). I think column-major storage would be
more efficient to write, but I'm not sure it's worth the effort in
planner code.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v1-0001-pg_node_tree-Don-t-serialize-fields-with-type-def.patch
Description: Binary data


v1-0002-pg_node_tree-reset-node-location-before-catalog-s.patch
Description: Binary data


v1-0005-NodeSupport-Don-t-emit-trailing-0s-in-outDatum.patch
Description: Binary data


v1-0004-NodeSupport-add-some-more-default-markers-for-var.patch
Description: Binary data


v1-0003-Nodesupport-add-support-for-custom-default-values.patch
Description: Binary data


v1-0007-NodeSupport-Apply-RLE-and-differential-encoding-o.patch
Description: Binary data


v1-0006-NodeSupport-Apply-some-more-defaults-serializatio.patch
Description: Binary data


Re: Reducing output size of nodeToString

2024-01-03 Thread Matthias van de Meent
On Wed, 3 Jan 2024 at 03:02, David Rowley  wrote:
>
> On Thu, 14 Dec 2023 at 19:21, Matthias van de Meent
>  wrote:
> >
> > On Thu, 7 Dec 2023 at 13:09, David Rowley  wrote:
> > > We could also easily serialize plans to binary format for copying to
> > > parallel workers rather than converting them to a text-based
> > > serialized format. It would also allow us to do things like serialize
> > > PREPAREd plans into a nicely compact single allocation that we could
> > > just pfree in a single pfree call on DEALLOCATE.
> >
> > I'm not sure what benefit you're refering to. If you mean "it's more
> > compact than the current format" then sure; but the other points can
> > already be covered by either the current nodeToString format, or by
> > nodeCopy-ing the prepared plan into its own MemoryContext, which would
> > allow us to do essentially the same thing.
>
> There's significantly less memory involved in just having a plan
> serialised into a single chunk of memory vs a plan stored in its own
> MemoryContext.  With the serialised plan, you don't have any power of
> 2 rounding up wastage that aset.c does and don't need extra space for
> all the MemoryChunks that would exist for every single palloc'd chunk
> in the MemoryContext version.

I was envisioning this to use the bump memory context you proposed
over in [0], as to the best of my knowledge prepared plans are not
modified, so nodeCopy-ing a prepared plan into a bump context could be
a good use case for those contexts. This should remove the issue of
rounding and MemoryChunk wastage in aset.

> I think it would be nice if one day in the future if a PREPAREd plan
> could have multiple different plans cached. We could then select which
> one to use by looking at statistics for the given parameters and
> choose the plan that's most suitable for the given parameters.   Of
> course, this is a whole entirely different project. I mention it just
> because being able to serialise a plan would make the memory
> management and overhead for such a feature much more manageable.
> There'd likely need to be some eviction logic in such a feature as the
> number of possible plans for some complex query is quite likely to be
> much more than we'd care to cache.

Yeah, that'd be nice, but is also definitely future work.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: 
https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM%3DBp%3D6xjL_nLys4zdZOpfNyWBk97Xrdj2w%40mail.gmail.com




Re: Next step towards 64bit XIDs: Switch to FullTransactionId for PGPROC->xid and XLogRecord->xl_xid

2023-12-29 Thread Matthias van de Meent
On Fri, 29 Dec 2023, 13:49 Maxim Orlov,  wrote:

> Hi!
>
> As were discussed in [0] our overall goal is to make Postgres 64 bit
> XIDs.  It's obvious that such a big patch set
> couldn't possibly be committed "at once".  The SLRU patch set [1] was committed a
> short while ago as a first significant
> step in this direction.
>
> This thread is a next step in this enterprise.  My objective here is to
> commit some changes, which were mandatory,
> as far as I understand, for any type of 64 XIDs implementation. And I'm
> sure there will be points for discussion here.
>
> My original intention was to make PGPROC->xmin, PGPROC->xid and
> PROC_HDR->xids 64bit.  But in reality,
> it turned out to be much more difficult than I expected.  On the one hand,
> the patch became too big and on the other
> hand, it heavily relies on epoch and XID "adjustment" to FXID.  Therefore,
> for now, I decided to limit myself to
> more atomic and independent changes. However, as I said above, these
> changes are required for any implementation
> of 64bit XIDs.
>
> So, PFA patches to make switch PGPROC->xid

I think this could be fine, but ...

> and XLogRecord->xl_xid to FullTransactionId.

I don't think this is an actionable change, as it wastes 4 more bytes (or
8 with alignment) in nearly all WAL records that don't use the
HEAP/HEAP2/XLOG rmgrs, which would then be up to 10 (if not 14, when
64-bit-aligned) bytes per record. Unless something like [0] gets committed
this will add a significant write overhead to all operations, even if they
are not doing anything that needs an XID.

Also, I don't think we need to support transactions that stay alive and
change things for longer than 2^31 concurrently created transactions, so we
could well add an fxid to each WAL segment header (and checkpoint record?)
and calculate the fxid of each record as a relative fxid off of that.
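A sketch of that reconstruction, under the stated assumption that every xid in a segment lies within 2^31 transactions of the base fxid recorded in the segment header (the helper name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Derive a record's 64-bit fxid from its 32-bit on-disk xid plus the
 * base fxid stored once per WAL segment.  The record xid is interpreted
 * as the fxid within +/- 2^31 of the base whose low 32 bits match.
 */
static uint64_t
fxid_from_base(uint64_t base_fxid, uint32_t record_xid)
{
    uint32_t base_low = (uint32_t) base_fxid;
    uint64_t fxid = (base_fxid & ~(uint64_t) 0xFFFFFFFF) | record_xid;

    /* low 32 bits wrapped past the base: bump the epoch... */
    if (record_xid < base_low && base_low - record_xid > UINT32_C(0x80000000))
        fxid += UINT64_C(1) << 32;
    /* ...or record slightly behind a just-wrapped base: back up. */
    else if (record_xid > base_low && record_xid - base_low > UINT32_C(0x80000000))
        fxid -= UINT64_C(1) << 32;
    return fxid;
}
```

This keeps the per-record cost at the current 4 bytes while still letting redo reconstruct full 64-bit xids.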


Kind regards

Matthias van de Meent

[0] https://commitfest.postgresql.org/46/4386/


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-20 Thread Matthias van de Meent
On Wed, 20 Dec 2023 at 10:56, Michail Nikolaev
 wrote:
> > Note that the use of such expressions would be a violation of the
> > function's definition; it would depend on data from other tables which
> > makes the function behave like a STABLE function, as opposed to the
> > IMMUTABLE that is required for index expressions. So, I don't think we
> > should specially care about being correct for incorrectly marked
> > function definitions.
>
> Yes, but such cases could probably cause crashes also...
> So, I think it is better to check them for custom functions. But I
> still not sure -
> if such limitations still required for proposed optimization or not.

I think the contents could be inconsistent, but not more inconsistent
than if the index was filled across multiple transactions using
inserts. Either way, I don't see it breaking anything that isn't
already broken in that way elsewhere - at most it will introduce
another path that exposes the broken state caused by mislabeled
functions.

> > I just realised there is one issue with this design: We can't cheaply
> > reset the snapshot during the second table scan:
> > It is critically important that the second scan of R/CIC uses an index
> > contents summary (made with index_bulk_delete) that was created while
> > the current snapshot was already registered.
>
> > So, the "reset the snapshot every so often" trick cannot be applied in
> > phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> > every time we reset the snapshot. Rescanning might be worth the cost
> > (e.g. when using BRIN), but that is very unlikely.
>
> Hm, I think it is still possible. We could just manually recheck the
> tuples we see
> to the snapshot currently used for the scan. If an "old" snapshot can see
> the tuple also (HeapTupleSatisfiesHistoricMVCC) then search for it in the
> index summary.

That's an interesting method.

How would this deal with tuples not visible to the old snapshot?
Presumably we can assume they're newer than that snapshot (the old
snapshot didn't have it, but the new one does, so it's committed after
the old snapshot, making them newer), so that backend must have
inserted it into the index already, right?

> HeapTupleSatisfiesHistoricMVCC

That function has this comment marker:
   "Only usable on tuples from catalog tables!"
Is that correct even for this?

Should this deal with any potential XID wraparound, too?
How does this behave when the newly inserted tuple's xmin gets frozen?
This would be allowed to happen during heap page pruning, afaik - no
rules that I know of which are against that - but it would create
issues where normal snapshot visibility rules would indicate it
visible to both snapshots regardless of whether it actually was
visible to the older snapshot when that snapshot was created...

Either way, "Historic snapshot" isn't something I've worked with
before, so that goes onto my "figure out how it works" pile.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: pg_waldump

2023-12-19 Thread Matthias van de Meent
On Tue, 19 Dec 2023, 12:27 Fabrice Chapuis,  wrote:
>
> Hi,
> Is it possible to visualize the DDL with the pg_waldump tool. I created a 
> postgres user but I cannot find the creation command in the wals

Not really, no. PostgreSQL does not log DDL or DML as such in WAL.
Essentially all catalog updates are logged only as changes on a
certain page in some file: a new user getting inserted would be
approximately "Insert tuple [user's pg_authid row data] on page X in
file [the file corresponding to the pg_authid table]".

You could likely derive most DDL commands from Heap/Insert,
Heap/Delete, and Heap/Update records (after cross-referencing the
database's relfilemap), as most DDL is "just" a lot of in-memory
operations plus some record insertions/updates/deletes in catalog
tables. You'd also need to keep track of any relfilemap changes while
processing the WAL, as VACUUM FULL on the catalog tables would change
the file numbering of catalog tables...

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Add --check option to pgindent

2023-12-18 Thread Matthias van de Meent
On Mon, 18 Dec 2023 at 16:45, Tristan Partin  wrote:
>
> On Mon Dec 18, 2023 at 6:41 AM CST, Daniel Gustafsson wrote:
> > > On 15 Dec 2023, at 16:43, Tristan Partin  wrote:
> >
> > > Here is a v3.
> >
> > I think this is pretty much ready to go, the attached v4 squashes the 
> > changes
> > and fixes the man-page which also needed an update.  The referenced Wiki 
> > page
> > will need an edit or two after this goes in, but that's easy enough.
>
> I have never edited the Wiki before. How can I do that? More than happy
> to do it.

As mentioned at the bottom of the main page of the wiki:

  NOTE: due to recent spam activity "editor" privileges are granted
manually for the time being.
  If you just created a new community account or if your current
account used to have "editor" privileges ask on either the PostgreSQL
-www Mailinglist or the PostgreSQL IRC Channel (direct your request to
'pginfra', multiple individuals in the channel highlight on that
string) for help. Please include your community account name in those
requests.

After that, it's just a case of logging in on the wiki (link in the
top right corner, which uses the community account), going to your
preferred page, clicking edit, and making your modifications. Don't
forget to save the modifications - I'm not sure whether the wiki
editor's state is locally persisted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



