Re: Minor fixes for a couple of comments around MERGE RETURNING

2024-05-22 Thread David Rowley
On Sun, 19 May 2024 at 15:20, David Rowley  wrote:
>
> I noticed that PlannedStmt.hasReturning and hasModifyingCTE have
> outdated comments now that MERGE supports RETURNING (per commit
> c649fa24a)
>
> i.e. these two:
>
> > bool hasReturning; /* is it insert|update|delete RETURNING? */
>
> > bool hasModifyingCTE; /* has insert|update|delete in WITH? */

I've pushed the fix for that.

David




Re: First draft of PG 17 release notes

2024-05-22 Thread David Rowley
On Thu, 23 May 2024 at 14:01, Bruce Momjian  wrote:
>
> On Thu, May 23, 2024 at 01:34:10PM +1200, David Rowley wrote:
> > What is the best way to communicate this stuff so it's easily
> > identifiable when you parse the commit messages?
>
> This is why I think we need an "Internal Performance" section where we
> can be clear about simple scaling improvements that might have no
> user-visible explanation.  I would suggest putting it after our "Source
> code" section.

hmm, but that does not really answer my question. There's still a
communication barrier if you're parsing the commit messages and
committers don't clearly state performance improvement numbers, for
the reason I stated.

I also don't agree that these should be left to the "Source code"
section. I feel that section is best suited for extension authors who
might care about some internal API change.  I'm talking about stuff
that makes some user queries possibly N times (!) faster. Surely
"Source Code" isn't the place to talk about that?

David




Re: Speed up JSON escape processing with SIMD plus other optimisations

2024-05-22 Thread David Rowley
On Thu, 23 May 2024 at 13:23, David Rowley  wrote:
> Master:
> $ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
> tps = 362.494309 (without initial connection time)
> tps = 363.182458 (without initial connection time)
> tps = 362.679654 (without initial connection time)
>
> Master + 0001 + 0002
> $ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
> tps = 426.456885 (without initial connection time)
> tps = 430.573046 (without initial connection time)
> tps = 431.142917 (without initial connection time)
>
> About 18% faster.
>
> It would be much faster if we could also get rid of the
> escape_json_cstring() call in the switch default case of
> datum_to_json_internal(). row_to_json() would be heaps faster with
> that done. I considered adding a special case for the "text" type
> there, but in the end felt that we should just fix that with some
> hypothetical other patch that changes how output functions work.
> Others may feel it's worthwhile. I certainly could be convinced of it.

Just to turn that into performance numbers, I tried the attached
patch.  The numbers came out better than I thought.

Same test as before:

master + 0001 + 0002 + attached hacks:
$ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
tps = 616.094394 (without initial connection time)
tps = 615.928236 (without initial connection time)
tps = 614.175494 (without initial connection time)

About 70% faster than master.

David
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index a266f60ff3..b15f6c5e64 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -24,6 +24,7 @@
 #include "utils/builtins.h"
 #include "utils/date.h"
 #include "utils/datetime.h"
+#include "utils/fmgroids.h"
 #include "utils/json.h"
 #include "utils/jsonfuncs.h"
 #include "utils/lsyscache.h"
@@ -286,9 +287,15 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result,
             pfree(jsontext);
             break;
         default:
-            outputstr = OidOutputFunctionCall(outfuncoid, val);
-            escape_json_cstring(result, outputstr);
-            pfree(outputstr);
+            /* special case common case for text types */
+            if (outfuncoid == F_TEXTOUT)
+                escape_json_from_text(result, (text *) DatumGetPointer(val));
+            else
+            {
+                outputstr = OidOutputFunctionCall(outfuncoid, val);
+                escape_json_cstring(result, outputstr);
+                pfree(outputstr);
+            }
             break;
     }
 }


Re: First draft of PG 17 release notes

2024-05-22 Thread David Rowley
On Thu, 23 May 2024 at 10:04, Bruce Momjian  wrote:
> You might have seen in this thread, I do record commits that speed up
> workloads that are user-visible, or specifically make new workloads
> possible.  I assume that covers the items above, though I have to
> determine this from the commit message.

It sometimes is hard to write something specific in the commit message
about the actual performance increase.

For example, if a commit removes an O(N log2 N) algorithm and replaces
it with an O(1) one, you can't say there's an X% increase in
performance, as the improvement depends on the value of N.

Jelte did call me out for not mentioning enough detail about the
performance in c4ab7da60, but if I claimed any % of an increase, it
would have been specific to some workload.

What is the best way to communicate this stuff so it's easily
identifiable when you parse the commit messages?

David




Speed up JSON escape processing with SIMD plus other optimisations

2024-05-22 Thread David Rowley
Currently, the escape_json() function takes a cstring and checks each
character in the string, one at a time, up to the NUL, adding the
escape sequence if the character requires it.

Because this function requires a NUL terminated string, we're having
to do a little more work in some places.  For example, in
jsonb_put_escaped_value() we call pnstrdup() on the non-NUL-terminated
string to make a NUL-terminated string to pass to escape_json().

To make this faster, we can just have a version of escape_json which
takes a 'len' and stops after processing that many chars rather than
stopping when the NUL char is reached. Then there's no need to
pnstrdup(), which saves some palloc()/memcpy() work.

There are also a few places where we call escape_json() with a "text"
typed Datum, converting the text to a NUL-terminated cstring so we can
pass that along to escape_json().  That's wasteful as we could just
pass the payload of the text Datum directly, and only allocate memory
if the text Datum needs to be de-toasted.  That saves a useless
palloc/memcpy/pfree cycle.

Now, to make this more interesting, since we have a version of
escape_json which takes a 'len', we could start looking at more than 1
character at a time. If you look closely at escape_json(), all the
special chars apart from " and \ are below the space character.
pg_lfind8() and pg_lfind8_le() allow processing of 16 bytes at a time,
so we only need to search the 16 bytes 3 times to ensure that no
special chars exist within the chunk. When that test fails, just go
into byte-at-a-time processing, first copying over the portion of the
string that passed the vector test up until that point.
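
To make that a little more concrete, here's a rough standalone sketch
of the chunked scanning idea (plain C; the scalar helpers below just
stand in for pg_lfind8() and pg_lfind8_le(), and the real code appends
to a StringInfo rather than a caller-sized buffer, so treat this as an
illustration of the approach, not what the patch does):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* stand-in for pg_lfind8(): is 'key' present in the chunk? */
static bool
chunk_has(unsigned char key, const unsigned char *base, size_t nelem)
{
    for (size_t i = 0; i < nelem; i++)
        if (base[i] == key)
            return true;
    return false;
}

/* stand-in for pg_lfind8_le(): is any byte <= 'key' in the chunk? */
static bool
chunk_has_le(unsigned char key, const unsigned char *base, size_t nelem)
{
    for (size_t i = 0; i < nelem; i++)
        if (base[i] <= key)
            return true;
    return false;
}

/*
 * Append the JSON-escaped form of 'str' (of length 'len') to 'out', which
 * the caller must size generously.  Chunks with nothing to escape are
 * copied verbatim; only chunks containing '"', '\' or a control character
 * fall back to byte-at-a-time processing.  (The real escape_json also uses
 * the short \n, \t, etc. escapes; \uXXXX is used here to keep it short.)
 */
static void
escape_json_chunked(char *out, const char *str, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        size_t      chunk = (len - i >= 16) ? 16 : (len - i);
        const unsigned char *p = (const unsigned char *) str + i;

        if (chunk == 16 &&
            !chunk_has('"', p, 16) &&
            !chunk_has('\\', p, 16) &&
            !chunk_has_le(0x1F, p, 16))
        {
            /* no special chars in these 16 bytes; copy them as-is */
            memcpy(out, str + i, 16);
            out += 16;
            i += 16;
            continue;
        }

        /* slow path: escape this chunk byte-at-a-time */
        for (size_t j = 0; j < chunk; j++, i++)
        {
            unsigned char c = (unsigned char) str[i];

            if (c == '"' || c == '\\')
            {
                *out++ = '\\';
                *out++ = (char) c;
            }
            else if (c < ' ')
                out += sprintf(out, "\\u%04x", (unsigned) c);
            else
                *out++ = (char) c;
        }
    }
    *out = '\0';
}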

I've attached 2 patches:

0001 does everything I've described aside from SIMD.
0002 does SIMD

I've not personally done too much work in the area of JSON, so I don't
have any canned workloads to throw at this.  I did try the following:

create table j1 (very_long_column_name_to_test_json_escape text);
insert into j1 select repeat('x', x) from generate_series(0,1024)x;
vacuum freeze j1;

bench.sql:
select row_to_json(j1)::jsonb from j1;

Master:
$ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
tps = 362.494309 (without initial connection time)
tps = 363.182458 (without initial connection time)
tps = 362.679654 (without initial connection time)

Master + 0001 + 0002
$ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
tps = 426.456885 (without initial connection time)
tps = 430.573046 (without initial connection time)
tps = 431.142917 (without initial connection time)

About 18% faster.

It would be much faster if we could also get rid of the
escape_json_cstring() call in the switch default case of
datum_to_json_internal(). row_to_json() would be heaps faster with
that done. I considered adding a special case for the "text" type
there, but in the end felt that we should just fix that with some
hypothetical other patch that changes how output functions work.
Others may feel it's worthwhile. I certainly could be convinced of it.

I did add a new regression test. I'm not sure I'd want to keep that,
but felt it's worth leaving in there for now.

Another thing I considered was whether doing 16 bytes at a time is too
much, as it puts quite a bit of work into byte-at-a-time processing if
just 1 special char exists in a 16-byte chunk. I considered doing SWAR
[1] processing to do the job of vector8_has_le() and vector8_has(), but
maybe with just uint32s.  It might be worth doing that. However, I've
not done it yet as it raises the bar for this patch quite a bit.  SWAR
vector processing is pretty much write-only code. Imagine trying to
write comments for the code in [2] so that the average person could
understand what's going on!?
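
For what it's worth, here's roughly the shape the uint64 flavour of
those helpers could take, with the sort of comments such code would
need (names are invented; this is only an illustration, not from either
patch):

#include <stdbool.h>
#include <stdint.h>

/*
 * SWAR helpers, along the lines of the "Bit Twiddling Hacks" tricks.
 * Each works on 8 bytes at once.  The expressions produce a word with
 * the high bit set in at least one byte position iff the condition
 * holds for some byte, so testing for non-zero gives the answer.
 */
#define REP8(b)     (UINT64_C(0x0101010101010101) * (b))
#define HIGH_BITS   UINT64_C(0x8080808080808080)

/* true if any byte of 'chunk' is zero */
static inline bool
swar_has_zero(uint64_t chunk)
{
    /*
     * Subtracting 1 from each byte only sets a byte's high bit (while the
     * original high bit was clear) when that byte was zero.  Borrows from
     * lower bytes can cause false positives in higher bytes, but only when
     * a lower byte really was zero, so the boolean answer stays correct.
     */
    return ((chunk - REP8(0x01)) & ~chunk & HIGH_BITS) != 0;
}

/* SWAR version of vector8_has(): does any byte equal 'c'? */
static inline bool
swar_has(uint64_t chunk, unsigned char c)
{
    /* XOR turns bytes equal to 'c' into zero bytes */
    return swar_has_zero(chunk ^ REP8(c));
}

/* SWAR version of vector8_has_le(): is any byte <= 'c'?  (c must be < 0x80) */
static inline bool
swar_has_le(uint64_t chunk, unsigned char c)
{
    /* any byte <= c is the same as any byte < c + 1 */
    return ((chunk - REP8(c + 1)) & ~chunk & HIGH_BITS) != 0;
}

/* does this 8-byte chunk contain a char that JSON requires us to escape? */
static inline bool
swar_needs_escaping(uint64_t chunk)
{
    return swar_has(chunk, '"') ||
           swar_has(chunk, '\\') ||
           swar_has_le(chunk, 0x1F);
}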

I'd be happy to hear from anyone that can throw these patches at a
real-world JSON workload to see if it runs more quickly.

Parking for July CF.

David

[1] https://en.wikipedia.org/wiki/SWAR
[2] https://dotat.at/@/2022-06-27-tolower-swar.html


v1-0001-Add-len-parameter-to-escape_json.patch
Description: Binary data


v1-0002-Use-SIMD-processing-for-escape_json.patch
Description: Binary data


Re: Sort operation displays more tuples than its subnode contains

2024-05-22 Thread David Rowley
On Thu, 23 May 2024 at 08:48, a.rybakina  wrote:
>->  Sort  (cost=613.48..632.86 rows=7754 width=8) (actual time=7.612..21.214 rows=77531 loops=1)
>  Sort Key: broke_down_course.sno, broke_down_course.cno
>  Sort Method: quicksort  Memory: 374kB
>  ->  Seq Scan on broke_down_course  (cost=0.00..112.54 rows=7754 width=8) (actual time=0.016..1.366 rows=7754 loops=1)


> Maybe, my reproduction looks questionable, sorry for that, but I seriously 
> don't understand why we have so many tuples here in Sort node.

This is because of the "mark and restore" that the Merge Join
performs. This must happen because every tuple matching the join
condition must join to every other tuple that matches the join
condition. That means, if you have 3 tuples with the same key on
either side, you get 9 rows, not 3 rows.

Here's a simple example of the behaviour you describe shrunk down so
that it's more easily understandable:

create table t(a int);
insert into t values(1),(1),(1);
set enable_hashjoin=0;
set enable_nestloop=0;
explain (analyze, costs off) select * from t inner join t t1 on t.a=t1.a;
   QUERY PLAN

 Merge Join (actual time=0.036..0.038 rows=9 loops=1)
   Merge Cond: (t.a = t1.a)
   ->  Sort (actual time=0.025..0.025 rows=3 loops=1)
 Sort Key: t.a
 Sort Method: quicksort  Memory: 25kB
 ->  Seq Scan on t (actual time=0.017..0.018 rows=3 loops=1)
   ->  Sort (actual time=0.007..0.007 rows=7 loops=1)
 Sort Key: t1.a
 Sort Method: quicksort  Memory: 25kB
 ->  Seq Scan on t t1 (actual time=0.003..0.003 rows=3 loops=1)

Note that the Sort has rows=7 while the Seq Scan on t1 has rows=3, and the join outputs 9 rows.

If you look at the code in [1], you can see the restoreMark() calls to
achieve this.

David

[1] https://en.wikipedia.org/wiki/Sort-merge_join




Re: Path to unreverting "Allow planner to use Merge Append to efficiently implement UNION"

2024-05-21 Thread David Rowley
(Thanks for your review. I'm sorry I didn't have time and energy to
respond properly until now)

On Tue, 21 May 2024 at 23:48, Heikki Linnakangas  wrote:
> BTW, could the same machinery be used for INTERSECT as well? There was a
> brief mention of that in the original thread, but I didn't understand
> the details. Not for v17, but I'm curious. I was wondering if
> build_setop_child_paths() should be named build_union_child_paths(),
> since it's only used with UNIONs, but I guess it could be used as is for
> INTERSECT too.

I'd previously thought about that, but what I'd considered was getting
rid of the SetOp Intersect and replacing it with a join.  To do that,
my conclusion was that we'd first need to improve
joins using IS NOT DISTINCT FROM, as that's the behaviour we need for
correct setop NULL handling.  However, on relooking, I see that we
could still use SetOp Intersect with the flags injected into the
targetlist and get sorted results to it via Merge Append rather than
Append.  That might require better Const handling than what's in the
patch today due to the 1/0 flag that gets added to the subquery tlist.
I was unsure how much trouble to go to for INTERSECT. I spent about 7
years in a job writing queries and don't recall ever feeling the need
to use INTERSECT.  I did use EXCEPT, however... like at least once.
I'll probably circle back to it one day. People maybe don't use it
because it's so terribly optimised.

> # Testing
>
> postgres=# begin; create table foo as select i from generate_series(1,
> 100) i; create index on foo (i); commit;
> BEGIN
> SELECT 100
> CREATE INDEX
> COMMIT
> postgres=# set enable_seqscan=off;
> SET
> postgres=# explain (select 1 as i union select i from foo) order by i;
>QUERY PLAN
>
> --
>   Unique  (cost=144370.89..149370.89 rows=101 width=4)
> ->  Sort  (cost=144370.89..146870.89 rows=101 width=4)
>   Sort Key: (1)
>   ->  Append  (cost=0.00..31038.44 rows=101 width=4)
> ->  Result  (cost=0.00..0.01 rows=1 width=4)
> ->  Index Only Scan using foo_i_idx on foo
> (cost=0.42..26038.42 rows=100 width=4)
> (6 rows)
>
> I'm disappointed it couldn't produce a MergeAppend plan. If you replace
> the "union" with "union all" you do get a MergeAppend.
>
> Some more cases where I hoped for a MergeAppend:

I've not looked again in detail, but there was some discussion along
these lines in [1].  I think the problem is down to how we remove
redundant PathKeys when the EquivalenceClass has a Const. There can
only be 1 value, so there's no need for a PathKey to represent that. The
problem with that comes from the lack of equivalence visibility through
subqueries.  The following demonstrates:

create table ab(a int, b int, primary key(a,b));
set enable_seqscan=0;
set enable_bitmapscan=0;

explain (costs off) select * from (select * from ab where a=1 order by
b) order by a,b;
QUERY PLAN
---
 Sort
   Sort Key: ab.a, ab.b
   ->  Index Only Scan using ab_pkey on ab
 Index Cond: (a = 1)
(4 rows)

explain (costs off) select * from (select * from ab where a=1 order by
b) order by b;
 QUERY PLAN
-
 Index Only Scan using ab_pkey on ab
   Index Cond: (a = 1)
(2 rows)

Because the subquery only publishes that it's ordered by "b", the
outer query thinks it needs to sort on "a,b".  That's a wasted effort
since the subquery has an equivalence class for "a" with a constant.
The outer query doesn't know that.

> postgres=# explain (select i, 'foo' from foo union select i, 'foo' from
> foo) order by 1;
>   QUERY PLAN
>
> -
>   Unique  (cost=380767.54..395767.54 rows=200 width=36)
> ->  Sort  (cost=380767.54..385767.54 rows=200 width=36)
>   Sort Key: foo.i, ('foo'::text)
>   ->  Append  (cost=0.42..62076.85 rows=200 width=36)
> ->  Index Only Scan using foo_i_idx on foo
> (cost=0.42..26038.42 rows=100 width=36)
> ->  Index Only Scan using foo_i_idx on foo foo_1
> (cost=0.42..26038.42 rows=100 width=36)
> (6 rows)
>
>
> postgres=# explain (select 'foo', i from foo union select 'bar', i from
> foo) order by 1;
>   QUERY PLAN
>
> -
>   Unique  (cost=380767.54..395767.54 rows=200 width=36)
> ->  Sort  (cost=380767.54..385767.54 rows=200 width=36)
>   Sort Key: ('foo'::text), foo.i
>   ->  Append  (cost=0.42..62076.85 rows=200 width=36)
> ->  Index Only 

Re: Path to unreverting "Allow planner to use Merge Append to efficiently implement UNION"

2024-05-21 Thread David Rowley
On Wed, 22 May 2024 at 05:36, Robert Haas  wrote:
> The consensus on pgsql-release was to unrevert this patch and commit
> the fix now, rather than waiting for the next beta. However, the
> consensus was also to push the un-revert as a separate commit from the
> bug fix, rather than together as suggested by Álvaro. Since time is
> short due to the impending release and it's very late where you are,
> I've taken care of this. Hope that's OK.

Thanks for handling that. It's much appreciated.

David




Re: Path to unreverting "Allow planner to use Merge Append to efficiently implement UNION"

2024-05-21 Thread David Rowley
On Wed, 22 May 2024 at 00:35, Alvaro Herrera  wrote:
>
> On 2024-May-21, David Rowley wrote:
>
> > I've attached 2 patches.
> >
> > 0001 is a simple revert of Tom's revert (7204f3591).
> > 0002 fixes the issue reported by Hubert.
>
> I would like to request that you don't keep 0001's message as you have
> it here.  It'd be more readable to take 66c0185a3d14's whole commit
> message with a small suffix like "try 2" in the commit title, and add an
> additional second paragraph stating it was transiently reverted by
> 7204f35919b7.  Otherwise it's harder to make sense of the commit on its
> own later.

Thanks for having a look.  I was planning to have the commit message
as per attached. I'd only split the patch for ease of review per
request of Tom. I should have mentioned that here.

I would adjust the exact wording in the final paragraph as required
depending on what plan materialises.

This also fixes up the comment stuff that Heikki mentioned.

David


v2-0001-Allow-planner-to-use-Merge-Append-to-efficiently-.patch
Description: Binary data


Path to unreverting "Allow planner to use Merge Append to efficiently implement UNION"

2024-05-20 Thread David Rowley
Earlier today in [1], a bug was reported regarding a problem with the
code added in 66c0185a3, where I'd failed to correctly handle the case
where the UNION's targetlist has columns which are not sortable.  For
pg_class, that's relfrozenxid, relminmxid and relacl.

The most minimal reproducer prior to the revert is:

set enable_hashagg=0;
explain (costs off) select '123'::xid union select '123'::xid;

There is still some ongoing discussion about this on the release
mailing list as per mentioned by Tom in the commit message in
7204f3591.

At some point that discussion is going to need to circle back onto
-hackers again, and since I've already written a patch to fix the
issue and un-revert Tom's revert, I wanted a place on -hackers to
allow that code to be viewed and discussed.  I did also post a patch
on [2], but that no longer applies to master due to the revert.

I'll leave it to the RMT to decide where the outcome of that
discussion goes.  Let this thread be for at least the coding portion
of this, or let it become my thread for this patch in the v18 cycle if
the RMT rules in favour of keeping that code reverted for v17.

I've attached 2 patches.

0001 is a simple revert of Tom's revert (7204f3591).
0002 fixes the issue reported by Hubert.

If anyone wants to have a look, I'd be grateful for that.  Tom did
call for further review, given that this was the 4th issue reported
for 66c0185a3.

David

[1] https://postgr.es/message-id/Zktzf926vslR35Fv%40depesz.com
[2] 
https://www.postgresql.org/message-id/CAApHDvpDQh1NcL7nAsd3YAKj4vgORwesB3GYuNPnEXXRfA2g4w%40mail.gmail.com


v2-0001-Revert-Revert-commit-66c0185a3-and-follow-on-patc.patch
Description: Binary data


v2-0002-Fix-UNION-planner-bug-and-add-regression-test.patch
Description: Binary data


Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-20 Thread David Rowley
On Mon, 20 May 2024 at 23:16, Thom Brown  wrote:
>
> On Mon, 20 May 2024 at 00:24, David Rowley  wrote:
>>
>> On Mon, 20 May 2024 at 09:35, Jonathan S. Katz  wrote:
>> > Thanks for all the feedback to date. Please see the next revision.
>> > Again, please provide feedback no later than 2024-05-22 18:00 UTC.
>>
>> Thanks for the updates.
>>
>> > [`COPY`](https://www.postgresql.org/docs/17/sql-copy.html) is used to 
>> > efficiently bulk load data into PostgreSQL, and with PostgreSQL 17 shows a 
>> > 2x performance improvement when loading large rows.
>>
>> The 2x thing mentioned by Jelte is for COPY TO rather than COPY FROM.
>> So I think "exporting" or "sending large rows to the client"  rather
>> than "loading".
>>
>> There's also a stray "with" in that sentence.
>
>
> Are you referring to the "with" in "and with PostgreSQL 17"? If so, it looks 
> valid to me.

Yes that one.  It sounds wrong to me, but that's from a British
English point of view. I'm continuing to learn the subtle differences
with American English. Maybe this is one.

It would make sense to me if it was "and with PostgreSQL 17, a 2x
...". From my point of view either "with" shouldn't be there or
"shows" could be replaced with a comma. However, if you're ok with it,
I'll say no more. I know this is well into your territory.

David




Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-20 Thread David Rowley
On Mon, 20 May 2024 at 22:11, Alvaro Herrera  wrote:
>
> On 2024-May-16, David Rowley wrote:
>
> > On Thu, 16 May 2024 at 17:37, zaidagilist  wrote:
> > > I am trying to open the 17 docs but it looks removed. Getting
> > > following message "Page not found"
> > >
> > > https://www.postgresql.org/docs/17/
> >
> > It's called "devel" for "development" until we branch sometime before July:
> >
> > https://www.postgresql.org/docs/devel/
>
> Hmm, but that would mean that the Beta1 announce would ship full of
> links that will remain broken until July.  I'm not sure what the
> workflow for this is, but I hope the /17/ URLs would become valid with
> beta1, later this week.

It didn't quite click for me that it was Jonathan's links that were
being complained about.

I don't know how the website picks up where to link the doc page for a
given version.  I see from e0b82fc8e8 that the PACKAGE_VERSION was
changed from 16devel to 16beta1. Does the website have something that
extracts "devel" from the former and "16" from the latter?  I see the
release announcement for 16beta1 had /16/ links per [1].  So, I guess
it works. I just don't know how.

David

[1] https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/




Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-19 Thread David Rowley
On Mon, 20 May 2024 at 09:35, Jonathan S. Katz  wrote:
> Thanks for all the feedback to date. Please see the next revision.
> Again, please provide feedback no later than 2024-05-22 18:00 UTC.

Thanks for the updates.

> [`COPY`](https://www.postgresql.org/docs/17/sql-copy.html) is used to 
> efficiently bulk load data into PostgreSQL, and with PostgreSQL 17 shows a 2x 
> performance improvement when loading large rows.

The 2x thing mentioned by Jelte is for COPY TO rather than COPY FROM.
So I think "exporting" or "sending large rows to the client"  rather
than "loading".

There's also a stray "with" in that sentence.

David




Re: First draft of PG 17 release notes

2024-05-18 Thread David Rowley
On Sun, 19 May 2024 at 02:40, Bruce Momjian  wrote:
>
> On Thu, May 16, 2024 at 03:35:17PM +1200, David Rowley wrote:
> > "Additionally, vacuum no longer silently imposes a 1GB tuple reference
> > limit even when maintenance_work_mem or autovacuum_work_mem are set to
> > higher values"

> Slightly adjusted wording patch attached and applied.

Thanks for adjusting.

It's a minor detail, but I'll mention it because you went to the
effort to adjust it away from what I'd written...

I didn't make a random choice to use "or" between the two GUCs.
Changing it to "and", IMO, isn't an improvement.  Using "and" implies
that the silent limit was only imposed when both of these GUCs were
set >= 1GB. That's not true. For the case we're talking about here, if
autovacuum_work_mem is set to anything apart from -1 then the value of
maintenance_work_mem does not matter.

David




Minor fixes for a couple of comments around MERGE RETURNING

2024-05-18 Thread David Rowley
I noticed that PlannedStmt.hasReturning and hasModifyingCTE have
outdated comments now that MERGE supports RETURNING (per commit
c649fa24a)

i.e. these two:

> bool hasReturning; /* is it insert|update|delete RETURNING? */

> bool hasModifyingCTE; /* has insert|update|delete in WITH? */

transformWithClause() has:

/* must be a data-modifying statement */
Assert(IsA(cte->ctequery, InsertStmt) ||
   IsA(cte->ctequery, UpdateStmt) ||
   IsA(cte->ctequery, DeleteStmt) ||
   IsA(cte->ctequery, MergeStmt));

pstate->p_hasModifyingCTE = true;

which eventually makes it into PlannedStmt.hasModifyingCTE.

The attached trivial patch fixes these.

David


merge_returning_comments.patch
Description: Binary data


Re: generic plans and "initial" pruning

2024-05-18 Thread David Rowley
On Sun, 19 May 2024 at 13:27, Tom Lane  wrote:
>
> David Rowley  writes:
> > 1. No ability to control the order that the locks are obtained. The
> > order in which the locks are taken will be at the mercy of the plan
> > the planner chooses.
>
> I do not think I buy this argument, because plancache.c doesn't
> provide any "ability to control the order" today, and never has.
> The order in which AcquireExecutorLocks re-gets relation locks is only
> weakly related to the order in which the parser/planner got them
> originally.  The order in which AcquirePlannerLocks re-gets the locks
> is even less related to the original.  This doesn't cause any big
> problems that I'm aware of, because these locks are fairly weak.

It may not bite many people, it's just that if it does, I don't see
what we could do to help those people. At the moment we could tell
them to adjust their DDL script to obtain the locks in the same order
as their query.  With your idea that cannot be done as the order could
change when the planner switches the join order.

> I think we do have a guarantee that for partitioned tables, parents
> will be locked before children, and that's probably valuable.
> But an executor-driven lock order could preserve that property too.

I think you'd have to lock the parent before the child. That would
remain true and consistent anyway when taking locks during a
breadth-first plan traversal.

> > For #3, I've been thinking about what improvements we can do to make
> > the executor more efficient. In [1], Andres talks about some very
> > interesting things. In particular, in his email items 3) and 5) are
> > relevant here. If we did move lots of executor startup code into the
> > planner, I think it would be possible to one day get rid of executor
> > startup and have the plan record how much memory is needed for the
> > non-readonly part of the executor state and tag each plan node with
> > the offset in bytes they should use for their portion of the executor
> > working state.
>
> I'm fairly skeptical about that idea.  The entire reason we have an
> issue here is that we want to do runtime partition pruning, which
> by definition can't be done at plan time.  So I doubt it's going
> to play nice with what we are trying to accomplish in this thread.

I think we could have both, providing there was a way to still
traverse the executor state tree in EXPLAIN. We'd need a way to skip
portions of the plan that are not relevant or could be invalid for the
current execution, e.g. we can't show an Index Scan because its index
has been dropped.

> > I think what Amit had before your objection was starting to turn into
> > something workable and we should switch back to working on that.
>
> The reason I posted this idea was that I didn't think the previously
> existing patch looked promising at all.

Ok.  It would be good if you could expand on that so we could
determine if there's some fundamental reason it can't work or if
that's because you were blinded by your epiphany and didn't give that
any thought after thinking of the alternative idea.

I've gone to effort to point out things that I think are concerning
with your idea. It would be good if you could do the same for the
previous patch other than "it didn't look promising". It's pretty hard
for me to argue with that level of detail.

David




Re: generic plans and "initial" pruning

2024-05-18 Thread David Rowley
On Fri, 20 Jan 2023 at 08:39, Tom Lane  wrote:
> I spent some time re-reading this whole thread, and the more I read
> the less happy I got.  We are adding a lot of complexity and introducing
> coding hazards that will surely bite somebody someday.  And after awhile
> I had what felt like an epiphany: the whole problem arises because the
> system is wrongly factored.  We should get rid of AcquireExecutorLocks
> altogether, allowing the plancache to hand back a generic plan that
> it's not certain of the validity of, and instead integrate the
> responsibility for acquiring locks into executor startup.  It'd have
> to be optional there, since we don't need new locks in the case of
> executing a just-planned plan; but we can easily add another eflags
> bit (EXEC_FLAG_GET_LOCKS or so).  Then there has to be a convention
> whereby the ExecInitNode traversal can return an indicator that
> "we failed because the plan is stale, please make a new plan".

I also reread the entire thread up to this point yesterday. I've also
been thinking about this recently as Amit has mentioned it to me a few
times over the past few months.

With the caveat of not yet having looked at the latest patch, my
thoughts are that having the executor startup responsible for taking
locks is a bad idea and I don't think we should go down this path. My
reasons are:

1. No ability to control the order that the locks are obtained. The
order in which the locks are taken will be at the mercy of the plan
the planner chooses.
2. It introduces lots of complexity regarding how to clean up cleanly
after a failed executor startup, which is likely to make executor
startup slower and the code more complex.
3. It puts us even further down the path of actually needing an
executor startup phase.

For #1, the locks taken for SELECT queries are less likely to conflict
with other locks obtained by PostgreSQL, but at least at the moment if
someone is getting deadlocks with a DDL type operation, they can
change their query or DDL script so that locks are taken in the same
order.  If we allowed executor startup to do this, then if someone
comes complaining that PG18 deadlocks when PG17 didn't, we'd just have
to tell them to live with it.  There's a comment at the bottom of
find_inheritance_children_extended() just above the qsort() which
explains about the deadlocking issue.

I don't have much extra to say about #2.  As mentioned, I've not
looked at the patch. On paper, it sounds possible, but it also sounds
bug-prone and ugly.

For #3, I've been thinking about what we can do to make the executor
more efficient. In [1], Andres talks about some very
interesting things. In particular, in his email items 3) and 5) are
relevant here. If we did move lots of executor startup code into the
planner, I think it would be possible to one day get rid of executor
startup and have the plan record how much memory is needed for the
non-readonly part of the executor state and tag each plan node with
the offset in bytes they should use for their portion of the executor
working state. This would be a single memory allocation for the entire
plan.  The exact details are not important here, but I feel like if we
load up executor startup with more responsibilities, it'll just make
doing something like this harder.  The init run-time pruning code that
I worked on likely already has done that, but I don't think it's
closed the door on it as it might just mean allocating more executor
state memory than we need to. Providing the plan node records the
offset into that memory, I think it could be made to work, just with
the inefficiency of having a (possibly) large unused hole in that
state memory.

As far as I understand it, your objection to the original proposal is
just on the grounds of concerns about introducing hazards that could
turn into bugs.  I think we could come up with some way to make the
prior method of doing pruning before executor startup work. I think
what Amit had before your objection was starting to turn into
something workable and we should switch back to working on that.

David

[1] 
https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de




Re: Incorrect Assert in BufFileSize()?

2024-05-17 Thread David Rowley
On Thu, 16 May 2024 at 07:20, Peter Geoghegan  wrote:
>
> On Tue, May 14, 2024 at 6:58 AM David Rowley  wrote:
> > On Fri, 3 May 2024 at 16:03, David Rowley  wrote:
> > > I'm trying to figure out why BufFileSize() Asserts that file->fileset
> > > isn't NULL, per 1a990b207.
> >
> > I was hoping to get some feedback here.
>
> Notice that comments above BufFileSize() say "Return the current
> fileset based BufFile size". There are numerous identical assertions
> at the start of several other functions within the same file.

hmm, unfortunately the comment and the existence of numerous other
assertions do not answer my question. It just leads to more.  The
only Assert I see that looks like it might be useful is the one in
BufFileExportFileSet(), as fileset is looked at inside extendBufFile().
It kinda looks to me like these are left-over fragments from the
development of a patch that was written some other way.

Looking at the other similar Asserts in BufFileAppend(), I can't
figure out what those ones are for either.

> I'm not sure what it would take for this function to support
> OpenTemporaryFile()-based BufFiles -- possibly not very much at all. I
> assume that that's what you're actually interested in doing here (you
> didn't say). If it is, you'll need to update the function's contract
> -- just removing the assertion isn't enough.

I should have explained, I just wasn't quite done with what I was
working on when I sent the original email. In [1] I was working on
adding additional output in EXPLAIN for the "Material" node to show
the memory or disk space used by the tuplestore.  The use case there
uses a BufFile with no fileset.  Calling BufFileSize(state->myfile)
from tuplestore.c results in an Assert failure.

David

[1] https://commitfest.postgresql.org/48/4968/




Re: Docs: Always use true|false instead of sometimes on|off for the subscription options

2024-05-16 Thread David Rowley
On Thu, 16 May 2024 at 19:05, Peter Smith  wrote:
>
> On Thu, May 16, 2024 at 3:11 PM David Rowley  wrote:
> > If you want to do this, what's the reason to limit it to just this one
> > page of the docs?
>
> Yeah, I had a vested interest in this one place because I've been
> reviewing the other thread [1] that was mentioned above. If that other
> thread chooses "true|false" then putting "true|false" adjacent to
> another "on|off" will look silly.

OK, looking a bit further I see this option is new to v17.  After a
bit more study of the sgml, I agree that it's worth changing this.

I've pushed your patch.

David




Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-16 Thread David Rowley
On Thu, 16 May 2024 at 17:37, zaidagilist  wrote:
> I am trying to open the 17 docs but it looks removed. Getting
> following message "Page not found"
>
> https://www.postgresql.org/docs/17/

It's called "devel" for "development" until we branch sometime before July:

https://www.postgresql.org/docs/devel/

David




Re: Docs: Always use true|false instead of sometimes on|off for the subscription options

2024-05-15 Thread David Rowley
On Thu, 16 May 2024 at 12:29, Peter Smith  wrote:
> There are lots of subscription options listed on the CREATE
> SUBSCRIPTION page [1].
>
> Although these boolean options are capable of accepting different
> values like "1|0", "on|off", "true|false", here they are all described
> only using values "true|false".

If you want to do this, what's the reason to limit it to just this one
page of the docs?

If the following is anything to go by, it doesn't seem we're very
consistent about this over the entire documentation.

doc$ git grep "on" | wc -l
122

doc$ git grep "true" | wc -l
222

And:

doc$ git grep "off" | wc -l
102

doc$ git grep "false" | wc -l
162

I think unless we're going to standardise on something then there's
not much point in adjusting individual cases. IMO, there could be an
endless stream of follow-on patches as a result of accepting this.

David




Re: First draft of PG 17 release notes

2024-05-15 Thread David Rowley
On Thu, 16 May 2024 at 14:48, Bruce Momjian  wrote:
>
> On Wed, May 15, 2024 at 09:13:14AM -0400, Melanie Plageman wrote:
> > Also +1 on the Sawada/Naylor change being on the highlight section of
> > the release (as David suggested upthread).
>
> Agreed, I went with the attached applied patch.

+Allow vacuum to more efficiently store tuple references and remove
its memory limit (Masahiko Sawada, John Naylor)
+

I don't want it to seem like I'm splitting hairs, but I'd drop the "
and remove its memory limit"

+
+Specifically, maintenance_work_mem and autovacuum_work_mem can now be
configured to use more than one gigabyte of memory.  WAL traffic
caused by vacuum is also more compact.

I'd say the first sentence above should be written as:

"Additionally, vacuum no longer silently imposes a 1GB tuple reference
limit even when maintenance_work_mem or autovacuum_work_mem are set to
higher values"

It's not "Specifically" as the "more efficiently store tuple
references" isn't the same thing as removing the 1GB cap. Also, there
was never a restriction on configuring maintenance_work_mem or
autovacuum_work_mem to values higher than 1GB. The restriction was
that vacuum was unable to utilize anything more than that.

David




Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-15 Thread David Rowley
Thanks for writing that up. It's always exciting to see this each year.

On Thu, 16 May 2024 at 13:45, Jonathan S. Katz  wrote:
> * Please indicate if you believe there's a notable omission, or if we
> should omit any of these callouts

I'd say the streaming read stuff added in b5a9b18cd and subsequent
commits like b7b0f3f27 and 041b96802 are worth a mention. I'd be happy
to see this over the IS NOT NULL qual stuff that I worked on in there
or even the AVX512 bit counting. Speeding up a backwater aggregate
function is nice, but IMO, not comparable with reducing the number of
reads.

There's some benchmarking in a youtube video:
https://youtu.be/QAYzWAlxCYc?si=L0UT6Lrf067ZBv46=237

> * Please indicate if a description is confusing - I'm happy to rewrite
> to ensure it's clearer.
>
> Please provide feedback no later than Wed 2024-05-22 18:00 UTC.

The only other thing I saw from a quick read was a stray "the" in "the
copy proceed even if the there is an error inserting a row."

David




Re: explain format json, unit for serialize and memory are different.

2024-05-15 Thread David Rowley
On Wed, 15 May 2024 at 13:23, Tom Lane  wrote:
>
> Michael Paquier  writes:
> > Perhaps Alvaro and Tom would like to chime in, as committers of
> > respectively 5de890e3610d and 06286709ee06?
>
> No objection here.  In a green field I might argue for
> round-to-nearest instead of round-up, but it looks like we
> have several precedents for round-up, so let's avoid changing
> that existing behavior.

Thanks. I've pushed the patch now.

David




Re: Fixup a few 2023 copyright years

2024-05-15 Thread David Rowley
On Wed, 15 May 2024 at 17:32, Michael Paquier  wrote:
> While running src/tools/copyright.pl, I have noticed that that a
> newline was missing at the end of index_including.sql, as an effect of
> the test added by you in a63224be49b8.  I've cleaned up that while on
> it, as it was getting added automatically, and we tend to clean these
> like in 3f1197191685 or more recently c2df2ed90a82.

Thanks for fixing that.  I'm a little surprised that pgindent does not
fix that sort of thing.

David




Re: First draft of PG 17 release notes

2024-05-15 Thread David Rowley
On Wed, 15 May 2024 at 20:38, Alvaro Herrera  wrote:
>
> On 2024-May-14, Bruce Momjian wrote:
> > I don't think users really care about these details, just that it is
> > faster so they will not be surprised if there is a change.  I was
> > purposely vague to group multiple commits into the single item.  By
> > grouping them together, I got enough impact to warrant listing it.  If
> > you split it apart, it is harder to justify mentioning them.
>
> I disagree with this.  IMO the impact of the Sawada/Naylor change is
> likely to be enormous for people with large tables and large numbers of
> tuples to clean up (I know we've had a number of customers in this
> situation, I can't imagine any Postgres service provider that doesn't).
> The fact that maintenance_work_mem is no longer capped at 1GB is very
> important and I think we should mention that explicitly in the release
> notes, as setting it higher could make a big difference in vacuum run
> times.

I very much agree with Alvaro here. IMO, this should be on the
highlight feature list for v17. Prior to this, having to do multiple
index scans because of filling maintenance_work_mem was a performance
tragedy. If there were enough dead tuples to have filled
maintenance_work_mem, then the indexes are large. Having to scan
multiple large indexes multiple times isn't good use of I/O and CPU.
As far as I understand it, this work means it'll be unlikely that a
well-configured server will ever have to do multiple index passes. I
don't think "enormous impact" is an exaggeration here.

David




Re: explain format json, unit for serialize and memory are different.

2024-05-14 Thread David Rowley
On Wed, 15 May 2024 at 15:40, jian he  wrote:
> I am looking for an example where this information under json key
> "Serialization" is not zero.
> So far I have tried:

Something that requires detoasting.

> create table s(a text);
> insert into s select repeat('a', 1024) from generate_series(1,1024);
> explain (format json, analyze, wal, buffers, memory, serialize, timing
> off)  select * from s;

Something bigger than 1024 bytes or use SET STORAGE EXTERNAL or EXTENDED.

create table s(a text);
insert into s select repeat('a', 1024*1024) from generate_series(1,10);
explain (format text, analyze, buffers, serialize, timing off)  select * from s;

 Serialization: output=10241kB  format=text
   Buffers: shared hit=36

David




Re: Fixup a few 2023 copyright years

2024-05-14 Thread David Rowley
On Tue, 9 Apr 2024 at 15:26, Michael Paquier  wrote:
> I would suggest to also wait until we're clearer with the situation
> for all these mechanical changes, which I suspect is going to take 1~2
> weeks at least.

Since the Oid resequencing and pgindent run is now done, I've pushed this patch.

David




Re: First draft of PG 17 release notes

2024-05-14 Thread David Rowley
On Wed, 15 May 2024 at 14:06, Bruce Momjian  wrote:
> I can use your wording, but how much prefetching to we enable now?

I believe the read stream API is used for Seq Scan, ANALYZE and
pg_prewarm().  fadvise() is used when the next buffer that's required
is not in shared buffers on any build that has defined
HAVE_DECL_POSIX_FADVISE.

David




Re: explain format json, unit for serialize and memory are different.

2024-05-14 Thread David Rowley
On Wed, 15 May 2024 at 13:44, jian he  wrote:
>"Shared Hit Blocks": 0,
>"Shared Read Blocks": 0,
>"Shared Dirtied Blocks": 0,
>"Shared Written Blocks": 0,
>"Local Hit Blocks": 0,
>"Local Read Blocks": 0,
>"Local Dirtied Blocks": 0,
>"Local Written Blocks": 0,
>"Temp Read Blocks": 0,
>"Temp Written Blocks": 0
>
> these information duplicated for json key "Serialization" and json key
> "Planning"
> i am not sure this is intended?

Looks ok to me.  Buffers used during planning are independent from the
buffers used when outputting rows to the client.

David




Re: First draft of PG 17 release notes

2024-05-14 Thread David Rowley
On Wed, 15 May 2024 at 13:00, Bruce Momjian  wrote:
>
> On Tue, May 14, 2024 at 03:39:26PM -0400, Melanie Plageman wrote:
> > "Reduce system calls by automatically merging reads up to io_combine_limit"
>
> Uh, as I understand it, the reduced number of system calls is not the
> value of the feature, but rather the ability to request a larger block
> from the I/O subsystem.  Without it, you have to make a request and wait
> for each request to finish.  I am open to new wording, but I am not sure
> your new wording is accurate.

I think you have the cause and effect backwards. There's no advantage
to reading 128KB if you only need 8KB.  It's the fact that doing
*larger* reads allows *fewer* reads that allows it to be more
efficient.  There are also the efficiency gains from fadvise
POSIX_FADV_WILLNEED. I'm unsure how to jam that into a short sentence.
Maybe; "Optimize reading of tables by allowing pages to be prefetched
and read in chunks up to io_combine_limit", or a bit more buzzy;
"Optimize reading of tables by allowing pages to be prefetched and
performing vectored reads in chunks up to io_combine_limit".

David




Re: explain format json, unit for serialize and memory are different.

2024-05-14 Thread David Rowley
On Wed, 15 May 2024 at 01:18, jian he  wrote:
> else
> {
> ExplainPropertyInteger("Memory Used", "bytes",
>mem_counters->totalspace - mem_counters->freespace,
>es);
> ExplainPropertyInteger("Memory Allocated", "bytes",
>mem_counters->totalspace, es);
> }
> }
>
> the "else" branch, also need to apply BYTES_TO_KILOBYTES marco?

Yeah, I missed that. Here's another patch.

Thanks for looking.

David


v2-0001-Minor-fixes-to-EXPLAIN.patch
Description: Binary data


Re: Incorrect Assert in BufFileSize()?

2024-05-14 Thread David Rowley
On Fri, 3 May 2024 at 16:03, David Rowley  wrote:
> I'm trying to figure out why BufFileSize() Asserts that file->fileset
> isn't NULL, per 1a990b207.

I was hoping to get some feedback here.

Here's a patch to remove the Assert().  Changing it to
Assert(file->files != NULL); doesn't do anything useful.

David


remove_bogus_Assert_from_BufFileSize.patch
Description: Binary data


Re: explain format json, unit for serialize and memory are different.

2024-05-14 Thread David Rowley
On Tue, 14 May 2024 at 18:16, David Rowley  wrote:
> I think for v17, we should consider adding a macro to explain.c to
> calculate the KB from bytes.  There are other inconsistencies that it
> would be good to address. We normally round up to the nearest kilobyte
> with (bytes + 1023) / 1024, but if you look at what 06286709e did, it
> seems to be rounding to the nearest without any apparent justification
> as to why.  It does (metrics->bytesSent + 512) / 1024.
>
> show_memory_counters() could be modified to use the macro and show
> kilobytes rather than bytes.

Here's a patch for that.

I checked the EXPLAIN SERIALIZE thread and didn't see any mention of
the + 512 thing. It seems Tom added it just before committing and no
patch ever made it to the mailing list with + 512. The final patch on
the list is in [1].

For the EXPLAIN MEMORY part, the bytes vs kB wasn't discussed.  The
closest the thread came to that was what Abhijit mentioned in [2].

I also adjusted some inconsistencies around spaces between the digits
and kB.  In other places in EXPLAIN we write "100kB" not "100 kB".  I
see we print times with a space ("Execution Time: 1.719 ms"), so we're
not very consistent overall, but since the EXPLAIN MEMORY is new, it
makes sense to change it now to be aligned to the other kB stuff in
explain.c

The patch does change a long into an int64 in show_hash_info(). I
wondered if that part should be backpatched.  It does not seem very
robust to me to divide a Size by 1024 and expect it to fit into a
long. With MSVC 64 bit, sizeof(Size) == 8 and sizeof(long) == 4.  I
understand work_mem is limited to 2GB on that platform, but it does
not seem like a good reason to use a long.

David

[1] 
https://www.postgresql.org/message-id/CAEze2WhopAFRS4xdNtma6XtYCRqydPWAg83jx8HZTowpeXzOyg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/ZaF1fB_hMqycSq-S%40toroid.org


v1-0001-Minor-fixes-to-EXPLAIN.patch
Description: Binary data


Re: JIT compilation per plan node

2024-05-14 Thread David Rowley
On Tue, 14 May 2024 at 19:56, Jelte Fennema-Nio  wrote:
> I'm not saying I'd prefer the extra walk, but I don't think you'd need
> to do this extra walk for all plans. Afaict you could skip the extra
> walk when top_plan->total_cost < jit_above_cost. i.e. only doing the
> extra walk to determine which exact nodes to JIT for cases where we
> currently JIT all nodes. That would limit the extra walk overhead to
> cases where we currently already spend significant resources on JITing
> stuff.

You could do that, but wouldn't it just cause us to sometimes miss
doing JIT for plan nodes that have a total cost above the top node's?
To me, it seems like a shortcut that someone might complain about one
day, and fixing it might require removing the shortcut, which would
lead to traversing the whole plan tree.

Here's a plan where the total cost of a subnode is greater than the
total cost of the top node:

set max_parallel_workers_per_gather=0;
create table t0 as select a from generate_Series(1,100)a;
analyze t0;
explain select * from t0 order by a limit 1;
   QUERY PLAN

 Limit  (cost=19480.00..19480.00 rows=1 width=4)
   ->  Sort  (cost=19480.00..21980.00 rows=100 width=4)
 Sort Key: a
 ->  Seq Scan on t0  (cost=0.00..14480.00 rows=100 width=4)

Anyway, I don't think it's worth talking in detail about the
implementation specifics of the total-cost-of-the-node idea when the
whole replacement costing model design is still undecided. It feels
like we're trying to decide what colour to paint the bathroom when we
haven't even come up with a design for the house yet.

I'd be interested to hear your thoughts on using the estimated number
of invocations of the function to drive the JIT flags on a
per-expression level.

David




Re: explain format json, unit for serialize and memory are different.

2024-05-14 Thread David Rowley
On Tue, 14 May 2024 at 17:40, Michael Paquier  wrote:
>
> On Mon, May 13, 2024 at 11:22:08AM +0200, Daniel Gustafsson wrote:
> > Since json (and yaml/xml) is intended to be machine-readable I think we use 
> > a
> > single unit for all values, and document this fact.
>
> Agreed with the documentation gap.  Another thing that could be worth
> considering is to add the units aside with the numerical values, say:
> "Memory Used": {"value": 23432, "units": "bytes"}
>
> That would require changing ExplainProperty() so as the units are
> showed in some shape, while still being readable when parsed.  I
> wouldn't recommend doing that in v17, but perhaps v18 could do better?

I think for v17, we should consider adding a macro to explain.c to
calculate the KB from bytes.  There are other inconsistencies that it
would be good to address. We normally round up to the nearest kilobyte
with (bytes + 1023) / 1024, but if you look at what 06286709e did, it
seems to be rounding to the nearest without any apparent justification
as to why.  It does (metrics->bytesSent + 512) / 1024.

show_memory_counters() could be modified to use the macro and show
kilobytes rather than bytes.
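
For the record, a minimal round-up helper along those lines might be as
simple as the following (assuming we keep the (bytes + 1023) / 1024
behaviour used elsewhere):

/* round up to the nearest kilobyte, matching the rest of explain.c */
#define BYTES_TO_KILOBYTES(b)   (((b) + 1023) / 1024)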

David




Re: JIT compilation per plan node

2024-05-13 Thread David Rowley
On Fri, 15 Mar 2024 at 10:13, Tomas Vondra
 wrote:
> To clarify, I think the patch is a step in the right direction, and a
> meaningful improvement. It may not be the perfect solution we imagine
> (but who knows how far we are from that), but AFAIK moving these
> decisions to the node level is something the ideal solution would need
> to do too.

I got thinking about this patch again while working on [1]. I want to
write this down as I don't quite have time to get fully back into this
right now...

Currently, during execution, ExecCreateExprSetupSteps() traverses the
Node tree of the Expr to figure out the max varattno of for each slot.
That's done so all of the tuple deforming happens at once rather than
incrementally. Figuring out the max varattno is a price we have to pay
for every execution of the query. I think we'd be better off doing
that in the planner.

To do this, I thought that setrefs.c could do this processing in
fix_join_expr / fix_upper_expr and wrap up the expression in a new
Node type that stores the max varattno for each special var type.

This idea is related to this discussion because another thing that
could be stored in the very same struct is the "num_exec" value.   I
feel the number of executions of an ExprState is a better gauge of how
useful JIT will be than the cost of the plan node. Now, looking at
set_join_references(), the execution estimates are not exactly
perfect. For example:

#define NUM_EXEC_QUAL(parentplan)   ((parentplan)->plan_rows * 2.0)

that's not a great estimate for a Nested Loop's joinqual, but we could
easily make efforts to improve those and that could likely be done
independently and concurrently with other work to make JIT more
granular.
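
Putting those two things together, the kind of wrapper node I have in
mind might look something like this (struct and field names are purely
illustrative):

typedef struct WrappedExpr
{
    NodeTag     type;
    Expr       *expr;           /* the wrapped expression tree */

    /* max varattno needed from each slot, computed during setrefs.c */
    AttrNumber  max_outer_attno;    /* OUTER_VAR Vars */
    AttrNumber  max_inner_attno;    /* INNER_VAR Vars */
    AttrNumber  max_scan_attno;     /* Vars from the scan tuple */

    /* planner's estimate of how many times the expression is evaluated */
    double      num_exec;
} WrappedExpr;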

The problem with doing this is that there's just a huge amount of code
churn in the executor.  I am keen to come up with a prototype so I can
get a better understanding of whether this is a practical solution.  I
don't want to go down that path if it's just me that thinks the number
of times an ExprState is evaluated is a better measure to go on for
JIT vs no JIT than the plan node's total cost.

Does anyone have any thoughts on that idea?

David

[1] 
https://postgr.es/m/CAApHDvoexAxgQFNQD_GRkr2O_eJUD1-wUGm=m0L+Gc=T=ke...@mail.gmail.com




Re: I have an exporting need...

2024-05-13 Thread David Rowley
On Tue, 14 May 2024 at 06:18, Juan Hernández  wrote:
> Do you consider useful to add a parameter (for example, --separatetables) so 
> when used the exporting file process can create a different tablename.sql 
> file for each table in database automatically?
>
> Example...
>
> PGHOST="/tmp" PGPASSWORD="mydbpass" pg_dump -U dbusername --separatetables 
> -Fp --inserts dbname > "/route/dbname.sql"
>
> And if this database has tables table1...table10, then 10 files are created...

pg_dump has code to figure out the dependencies between objects in the
database so that the dump file produced can be restored.  If one file
was output per table, how would you know in which order to restore
them? For example, if you have a table with a FOREIGN KEY referencing
some other table, you need to restore the referenced table first.
That's true for both restoring the schema and restoring the data.

David




Re: Why is parula failing?

2024-05-13 Thread David Rowley
On Thu, 21 Mar 2024 at 13:53, David Rowley  wrote:
>
> On Thu, 21 Mar 2024 at 12:36, Tom Lane  wrote:
> > So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
> > that would yield a clue.
>
> FWIW, I agree with your earlier statement about it looking very much
> like auto-vacuum has run on that table, but equally, if something like
> the pg_index record was damaged we could get the same plan change.
>
> We could also do something like the attached just in case we're
> barking up the wrong tree.

I've not seen any recent failures from Parula that relate to this
issue.  The last one seems to have been about 4 weeks ago.

I'm now wondering if it's time to revert the debugging code added in
1db689715.  Does anyone think differently?

David




Re: 039_end_of_wal: error in "xl_tot_len zero" test

2024-05-12 Thread David Rowley
On Mon, 6 May 2024 at 15:06, Tom Lane  wrote:
> My best guess is that that changed the amount of WAL generated by
> initdb just enough to make the problem reproduce on this animal.
> However, why's it *only* happening on this animal?  The amount of
> WAL we generate isn't all that system-specific.

I'd say that's a good theory as it's now passing again [1] after the
recent system_views.sql change done in 521a7156ab.

David

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skimmer=2024-05-06%2017%3A43%3A38




Re: First draft of PG 17 release notes

2024-05-10 Thread David Rowley
On Sat, 11 May 2024 at 17:32, Andy Fan  wrote:
> Do you think we need to add the following 2 items?
>
> - 9f133763961e280d8ba692bcad0b061b861e9138 this is an optimizer
>   transform improvement.

I think this should be in the release notes.

Suggest:

* Allow correlated IN subqueries to be transformed into joins (Andy
Fan, Tom Lane)

> - a8a968a8212ee3ef7f22795c834b33d871fac262 this is an optimizer costing
>   improvement.
>
> Both of them can generate a better plan on some cases.

I think this should be present too.

Suggest:

* Improve optimizer's ability to use cheap startup plans when querying
partitioned tables, inheritance parents and for UNION ALL (Andy Fan,
David Rowley)

Both under "E.1.3.1.1. Optimizer"

David




Re: Support tid range scan in parallel?

2024-05-10 Thread David Rowley
On Fri, 10 May 2024 at 05:16, Cary Huang  wrote:
> I understand that the regression tests for parallel ctid range scan is a
> bit lacking now. It only has a few EXPLAIN clauses to ensure parallel
> workers are used when tid ranges are specified. They are added as
> part of select_parallel.sql test. I am not sure if it is more appropriate
> to have them as part of tidrangescan.sql test instead. So basically
> re-run the same test cases in tidrangescan.sql but in parallel?

I think tidrangescan.sql is a more suitable location than
select_parallel.sql.  I don't think you need to repeat all the tests,
as many of them are testing the tid qual processing, which is the same
code as in the parallel version.

You should add a test that creates a table with a very low fillfactor,
low enough so only 1 tuple can fit on each page and insert a few dozen
tuples. The test would do SELECT COUNT(*) to ensure you find the
correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
there too for extra coverage to ensure that the correct tuples are
being counted.  Just make sure and EXPLAIN (COSTS OFF) the query first
in the test to ensure that it's always testing the plan you're
expecting to test.
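
Roughly the sort of thing I mean (names, the padding width, the row
count and the expected plan are just a sketch, not a worked-out test
case):

-- low fillfactor plus a padded column so only one tuple fits per page
CREATE TABLE parallel_tidrange_test (a int, b text) WITH (fillfactor = 10);
INSERT INTO parallel_tidrange_test
  SELECT x, repeat('x', 700) FROM generate_series(1, 48) x;

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 2;

-- make sure the plan being tested is the parallel one
EXPLAIN (COSTS OFF)
SELECT count(*) FROM parallel_tidrange_test
WHERE ctid >= '(5,1)' AND ctid < '(30,1)';

-- and that it returns the correct subset of tuples
SELECT count(*), min(ctid), max(ctid) FROM parallel_tidrange_test
WHERE ctid >= '(5,1)' AND ctid < '(30,1)';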

David




Re: Use generation memory context for tuplestore.c

2024-05-10 Thread David Rowley
Thanks for having a look at this.

On Sat, 11 May 2024 at 04:34, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Do I understand correctly, that the efficiency of generation memory
> context could be measured directly via counting number of malloc/free
> calls? In those experiments I've conducted the numbers were indeed
> visibly lower for the patched version (~30%), but I would like to
> confirm my interpretation of this difference.

For the test case I had in the script, I imagine the reduction in
malloc/free is just because of the elimination of the power-of-2
roundup.  If the test case had resulted in tuplestore_trim() being
used, e.g if Material was used to allow mark and restore for a Merge
Join or WindowAgg, then the tuplestore_trim() would result in
pfree()ing of tuples and further reduce malloc of new blocks.  This is
true because AllocSet records pfree'd non-large chunks in a freelist
element and makes no attempt to free() blocks.

In the tuplestore_trim() scenario with the patched version, the
pfree'ing of chunks is in a first-in-first-out order which allows an
entire block to become empty of palloc'd chunks which allows it to
become the freeblock and be reused rather than malloc'ing another
block when there's not enough space on the current block.  If the
scenario is that tuplestore_trim() always manages to keep the number
of tuples down to something that can fit on one GenerationBlock, then
we'll only malloc() 2 blocks and the generation code will just
alternate between the two, reusing the empty one when it needs a new
block rather than calling malloc.

> > 5. Higher likelihood of neighbouring tuples being stored consecutively
> > in memory, resulting in better CPU memory prefetching.
>
> I guess this roughly translates into better CPU cache utilization.
> Measuring cache hit ratio for unmodified vs patched versions in my case
> indeed shows about 10% less cache misses.

Thanks for testing that.  In simple test cases that's likely to come
from removing the power-of-2 round-up behaviour of AllocSet allowing
the allocations to be more dense.  In more complex cases when
allocations are making occasional use of chunks from AllowSet's
freelist[], the access pattern will be more predictable and allow the
CPU to prefetch memory more efficiently, resulting in a better CPU
cache hit ratio.

> The query you use in the benchmark, is it something like a "best-case"
> scenario for generation memory context? I was experimenting with
> different size of the b column, lower values seems to produce less
> difference between generation and aset (although still generation
> context is distinctly faster regarding final query latencies, see the
> attached PDF estimation, ran for 8192 rows). I'm curious what could be a
> "worst-case" workload type for this patch?

It's not purposefully "best-case".  Likely there'd be more improvement
if I'd scanned the Material node more than twice.  However, the tuple
size I picked is close to the best case as it's just over 1024 bytes.
Generation will just round the size up to the next 8-byte alignment
boundary, whereas AllocSet will round that up to 2048 bytes.

A case where the patched version would be even better would be where
the unpatched version spilled to disk but the patched version did not.
I imagine there's room for hundreds of percent speedup for that case.

> I've also noticed the first patch disables materialization in some tests.
>
> --- a/src/test/regress/sql/partition_prune.sql
> +++ b/src/test/regress/sql/partition_prune.sql
>
> +set enable_material = 0;
> +
>  -- UPDATE on a partition subtree has been seen to have problems.
>  insert into ab values (1,2);
>  explain (analyze, costs off, summary off, timing off)
>
> Is it due to the volatility of Maximum Storage values? If yes, could it
> be covered with something similar to COSTS OFF instead?

I don't think not showing this with COSTS OFF is very consistent with
Sort and Memoize's memory stats output.  I opted to disable the
Material node for that plan as it seemed like the easiest way to make
the output stable.  There are other ways that could be done. See
explain_memoize() in memoize.sql.  I thought about doing that, but it
seemed like overkill when there was a much more simple way to get what
I wanted. I'd certainly live to regret that if disabling Material put
the Nested Loop costs dangerously close to the costs of some other
join method and the plan became unstable on the buildfarm.

David




Re: First draft of PG 17 release notes

2024-05-08 Thread David Rowley
On Thu, 9 May 2024 at 16:47, Muhammad Ikram  wrote:
> A minor formatting issue in the start below. Bullet is not required here.

This is a placeholder for the highlight features of v17 will go.
Bruce tends not to decide what those are all by himself.

David




Re: First draft of PG 17 release notes

2024-05-08 Thread David Rowley
On Thu, 9 May 2024 at 16:04, Bruce Momjian  wrote:
> I welcome feedback.  For some reason it was an easier job than usual.

Thanks for working on that.

> +2023-11-02 [cac169d68] Increase DEFAULT_FDW_TUPLE_COST from 0.01 to 0.2

> +Double the default foreign data wrapper tuple cost (David Rowley, Umair 
> Shahid)

That's 20x rather than 2x.

David




Re: Expand applicability of aggregate's sortop optimization

2024-05-08 Thread David Rowley
On Thu, 9 May 2024 at 13:08, David Rowley  wrote:
> One additional thought is that the above method would also help
> eliminate redundant sorting in queries with a GROUP BY clause.
> Whereas, the can_minmax_aggs optimisation is not applied in that case.

Another argument for using this method is that
SupportRequestOptimizeAggref could allow future unrelated
optimisations such as swapping count(col) for count(*),
where col is a NOT NULL column and isn't nullable by
any outer join.  Doing that could speed some queries up quite a bit as
it may mean fewer columns to deform from the tuple. You could imagine
a fact table with many columns and a few dimensions, often the
dimension columns that you'd expect to use in GROUP BY would appear
before the columns you'd aggregate on.
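
To spell out why the NOT NULL requirement matters (a throwaway example,
not from any patch):

CREATE TEMP TABLE fact (id int NOT NULL, payload text);
INSERT INTO fact VALUES (1, 'a'), (2, NULL);

SELECT count(*), count(id), count(payload) FROM fact;
-- count(*) = 2 and count(id) = 2, since id can never be NULL, but
-- count(payload) = 1, so only count(id) would be safe to swap for count(*).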

David




Re: Support tid range scan in parallel?

2024-05-08 Thread David Rowley
On Thu, 9 May 2024 at 10:23, Cary Huang  wrote:
> The v3 patch is attached.

I've not looked at the patch, but please add it to the July CF.  I'll
try and look in more detail then.

David




Re: Expand applicability of aggregate's sortop optimization

2024-05-08 Thread David Rowley
On Thu, 9 May 2024 at 12:26, David Rowley  wrote:
> I wonder if we should also consider as an alternative to this to just
> have an aggregate support function, similar to
> SupportRequestOptimizeWindowClause that just nullifies the aggorder /
> aggdistinct fields for Min/Max aggregates on types where there's no
> possible difference in output when calling the transition function on
> rows in a different order.
>
> Would that apply in enough cases for you?

One additional thought is that the above method would also help
eliminate redundant sorting in queries with a GROUP BY clause.
Whereas, the can_minmax_aggs optimisation is not applied in that case.

David




Re: Expand applicability of aggregate's sortop optimization

2024-05-08 Thread David Rowley
On Wed, 8 May 2024 at 22:13, Matthias van de Meent
 wrote:
> As you may know, aggregates like SELECT MIN(unique1) FROM tenk1; are
> rewritten as SELECT unique1 FROM tenk1 ORDER BY unique1 USING < LIMIT
> 1; by using the optional sortop field in the aggregator.
> However, this optimization is disabled for clauses that in itself have
> an ORDER BY clause such as `MIN(unique1 ORDER BY ), because
>  can cause reordering of distinguishable values like 1.0 and
> 1.00, which then causes measurable differences in the output. In the
> general case, that's a good reason to not apply this optimization, but
> in some cases, we could still apply the index optimization.

I wonder if we should also consider as an alternative to this to just
have an aggregate support function, similar to
SupportRequestOptimizeWindowClause that just nullifies the aggorder /
aggdistinct fields for Min/Max aggregates on types where there's no
possible difference in output when calling the transition function on
rows in a different order.

Would that apply in enough cases for you?

I think it would rule out Min(numeric) and Max(numeric). We were
careful not to affect the number of decimal places in the numeric
output when using the moving aggregate inverse transition
infrastructure for WindowFuncs, so I agree we should maintain an
ability to control the aggregate transition order for numeric. (See
do_numeric_discard's maxScale if check)
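
A minimal illustration of the numeric hazard: equal values can carry
different display scales, so whichever input the transition function
happens to keep is visible in the result.

SELECT '1.0'::numeric = '1.00'::numeric AS equal,
       '1.0'::numeric AS scale_1,
       '1.00'::numeric AS scale_2;
-- equal | scale_1 | scale_2
-- t     | 1.0     | 1.00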

I don't think floating point types have the same issues here. At least
+1.0 is greater than -1.0.

Are there any strange collation rules that would cause issues if we
did this with the text types?

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-08 Thread David Rowley
On Thu, 9 May 2024 at 06:24, Tom Lane  wrote:
> I like alternative-2.patch a lot better, not least because it
> only adds cycles when join removal actually fires.  Basically
> this is putting the onus on the data structure modifier to
> cope with shared bitmapsets, rather than trying to say that
> sharing is disallowed.
>
> Thoughts?

I'm fine with this one as it's the same as what I already mentioned
earlier.  I had imagined doing bms_del_member(bms_copy ... but maybe
the compiler is able to optimise away the additional store. Likely, it
does not matter much as pallocing memory likely adds far more overhead
anyway.

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-08 Thread David Rowley
On Thu, 9 May 2024 at 06:49, Tom Lane  wrote:
> BTW, now that I've wrapped my head around what's happening here,
> I believe that -DREALLOCATE_BITMAPSETS is introducing a bug where
> there was none before.  The changes that left-join removal makes
> won't cause any of these sets to go to empty, so the bms_del_member
> calls won't free the sets but just modify them in-place.  And the
> same change will/should be made in every relevant relid set, so
> the fact that the sets may be shared isn't hurting anything.

FWIW, it just feels like we're willing to accept that the
bms_del_member() is not updating all copies of the set in this case as
that particular behaviour is ok for this particular case. I know
you're not proposing this, but I don't think that would warrant
relaxing REALLOCATE_BITMAPSETS to not reallocate Bitmapsets on
bms_del_member() and bms_del_members().

If all we have to do to make -DREALLOCATE_BITMAPSETS builds happy in
make check-world is to add a bms_copy inside the bms_del_member()
calls in remove_rel_from_query(), then I think that's a small price to
pay to allow us to maintain the additional coverage that
REALLOCATE_BITMAPSETS provides, especially when the gain is removing
an entire join.

> This conclusion also reinforces my previously-vague feeling that
> we should not consider making -DREALLOCATE_BITMAPSETS the default in
> debug builds, as was proposed upthread.  It's really a fundamentally
> different behavior, and I strongly suspect that it can mask bugs as
> well as introduce them (by hiding sharing in cases that'd be less
> benign than this turns out to be).  I'd rather not do development on
> top of bitmapset infrastructure that acts entirely different from
> production bitmapset infrastructure.

My primary interest in this feature is using it to catch bugs that
we're unlikely to ever hit in the regression tests.  For example, the
planner works when there are <= 63 RTEs but falls over when there are
64 because some bms_add_member() must reallocate more memory to store
the 64th RTI in a Bitmapset. I'd like to have something to make it
more likely we'll find bugs like this before the release instead of
someone having a crash when they run some obscure query shape
containing > 63 RTEs 2 or 4 years after the release.

I'm happy Andrew added this to prion. Thanks for doing that.

David




Re: 2024-05-09 release announcement draft

2024-05-08 Thread David Rowley
On Thu, 9 May 2024 at 04:17, Jonathan S. Katz  wrote:
> * Fix how
> [`INSERT`](https://www.postgresql.org/docs/current/sql-insert.html)
> handles multiple
> [`VALUES`](https://www.postgresql.org/docs/current/sql-values.html) rows
> into a target column that is a domain over an array or composite type.

Maybe it's only me who thinks the double plural of "VALUES rows" is
hard to parse. If that's the case I'll just drop this as it's not that
important.

FWIW, I'd probably write:

* Fix issue with
[`INSERT`](https://www.postgresql.org/docs/current/sql-insert.html)
with a multi-row
[`VALUES`](https://www.postgresql.org/docs/current/sql-values.html) clause
where a target column is a domain over an array or composite type.

I'll argue no further with this.

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread David Rowley
On Wed, 8 May 2024 at 10:55, Tom Lane  wrote:
>
> David Rowley  writes:
> > REALLOCATE_BITMAPSETS was invented for this and IMO, it found exactly
> > the problem it was invented to find.
>
> Not in a way that gives me any confidence that we found *all* the
> problems.

Here are some statements I believe to be true:
1. If REALLOCATE_BITMAPSETS is defined then modifications to a
Bitmapset will make a copy and free the original.
2. If a query runs successfully without REALLOCATE_BITMAPSETS and
Assert fails due to an invalid Bitmapset when REALLOCATE_BITMAPSETS is
defined, then we have > 1 pointer pointing to the same set and not all
of them are being updated when the members are added/removed.

Given the above, I can't see what Bitmapset sharing problems we won't
find with REALLOCATE_BITMAPSETS.

Can you share the exact scenario you're worried that we won't find so
I can understand your concern?

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread David Rowley
On Wed, 8 May 2024 at 10:40, Tom Lane  wrote:
>
> David Rowley  writes:
> > We could make the policy copy-on-modify.  If you put bms_copy around
> > the bms_del_member() calls in remove_rel_from_query(), does it pass
> > then?
>
> Didn't test, but that route seems awfully invasive and fragile: how
> will we find all the places to modify, or ensure that the policy
> is followed by future patches?

REALLOCATE_BITMAPSETS was invented for this and IMO, it found exactly
the problem it was invented to find.

Copy-on-modify is our policy for node mutation. Why is it ok there but
awfully fragile here?

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread David Rowley
On Wed, 8 May 2024 at 10:35, David Rowley  wrote:
>
> On Wed, 8 May 2024 at 06:20, Tom Lane  wrote:
> > I find that Richard's proposed fix makes the core regression tests
> > pass, but we still fail check-world.  So I'm afraid we need something
> > more aggressive, like the attached which makes make_restrictinfo
> > copy all its input bitmapsets.  Without that, we still have sharing
> > of bitmapsets across different RestrictInfos, which seems pretty
> > scary given what we now see about the effects of 00b41463c.  This
> > seems annoyingly expensive, but maybe there's little choice?
>
> We could make the policy copy-on-modify.  If you put bms_copy around
> the bms_del_member() calls in remove_rel_from_query(), does it pass
> then?

err, I mean bms_copy() the set before passing to bms_del_member().

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread David Rowley
On Wed, 8 May 2024 at 06:20, Tom Lane  wrote:
> I find that Richard's proposed fix makes the core regression tests
> pass, but we still fail check-world.  So I'm afraid we need something
> more aggressive, like the attached which makes make_restrictinfo
> copy all its input bitmapsets.  Without that, we still have sharing
> of bitmapsets across different RestrictInfos, which seems pretty
> scary given what we now see about the effects of 00b41463c.  This
> seems annoyingly expensive, but maybe there's little choice?

We could make the policy copy-on-modify.  If you put bms_copy around
the bms_del_member() calls in remove_rel_from_query(), does it pass
then?

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 17:28, Tom Lane  wrote:
>
> David Rowley  writes:
> > Yeah, before the revert, that did:
> > -   sjinf->syn_lefthand = replace_relid(sjinf->syn_lefthand, relid, 
> > subst);
> > That replace code seems to have always done a bms_copy()
>
> Hmm, not always; see e0477837c.

It was the discussion on that thread that led to the invention of
REALLOCATE_BITMAPSETS

> What I'm trying to figure out here is whether we have a live bug
> in this area in released branches; and if so, why we've not seen
> reports of that.

We could check what portions of REALLOCATE_BITMAPSETS are
backpatchable. It may not be applicable very far back because of v16's
00b41463c. The bms_del_member() would have left a zero set rather than
doing bms_free() prior to that commit.  There could be a bug in v16.

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 15:35, jian he  wrote:
> i also found that not specifying c_args: `-DREALLOCATE_BITMAPSETS`,
> the regress test won't fail.

It would be good to get some build farm coverage of this so we don't
have to rely on manual testing.  I wonder if it's a good idea to just
define REALLOCATE_BITMAPSETS when #ifdef CLOBBER_FREED_MEMORY... or if
we should ask on the buildfarm-members list if someone wouldn't mind
defining it?

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 17:11, David Rowley  wrote:
> sjinf->syn_lefthand = bms_del_member(sjinf->syn_lefthand, relid);
>
> I've not looked, but I assumed the revert must have removed some
> common code that was added and reverted with SJE.

Yeah, before the revert, that did:

-   sjinf->syn_lefthand = replace_relid(sjinf->syn_lefthand, relid, subst);

That replace code seems to have always done a bms_copy()

-static Bitmapset *
-replace_relid(Relids relids, int oldId, int newId)
-{
-   if (oldId < 0)
-   return relids;
-
-   /* Delete relid without substitution. */
-   if (newId < 0)
-   return bms_del_member(bms_copy(relids), oldId);
-
-   /* Substitute newId for oldId. */
-   if (bms_is_member(oldId, relids))
-   return bms_add_member(bms_del_member(bms_copy(relids), oldId), newId);
-
-   return relids;
-}


David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 17:01, Tom Lane  wrote:
>
> Richard Guo  writes:
> > Thank you for the report.  I looked at this a little bit and I think
> > here is what happened.  In deconstruct_distribute_oj_quals we call
> > distribute_quals_to_rels using the uncopied sjinfo->syn_lefthand as
> > outerjoin_nonnullable, which eventually becomes rinfo->outer_relids.
> > Later on, when we remove useless left joins, we modify
> > sjinfo->syn_lefthand using bms_del_member and recycle
> > sjinfo->syn_lefthand.  And that causes the rinfo->outer_relids becomes
> > invalid, and finally triggers this issue in join_clause_is_movable_to.
>
> Hmm, the SJE code didn't really touch any of this logic, so why
> didn't we see the failure before?

The bms_free() occurs in remove_rel_from_query() on:

sjinf->syn_lefthand = bms_del_member(sjinf->syn_lefthand, relid);

I've not looked, but I assumed the revert must have removed some
common code that was added and reverted with SJE.

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 16:47, Richard Guo  wrote:
> --- a/src/backend/optimizer/plan/initsplan.c
> +++ b/src/backend/optimizer/plan/initsplan.c
> @@ -1888,7 +1888,7 @@ deconstruct_distribute_oj_quals(PlannerInfo *root,
> qualscope = bms_union(sjinfo->syn_lefthand, sjinfo->syn_righthand);
> qualscope = bms_add_member(qualscope, sjinfo->ojrelid);
> ojscope = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
> -   nonnullable_rels = sjinfo->syn_lefthand;
> +   nonnullable_rels = bms_copy(sjinfo->syn_lefthand);

I was busy looking at this too and I came to the same conclusion.

David




Re: 2024-05-09 release announcement draft

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 15:48, Tom Lane  wrote:
>
> David Rowley  writes:
> > I know this is the same wording as Tom added in [1], I might just have
> > failed to comprehend something, but if I strip out the links and try
> > to make sense of "Fix INSERT from multiple VALUES rows into", I just
> > can't figure out how to parse it.  I'm pretty sure it means "Fix
> > multiple-row VALUES clauses with INSERT statements when ...", but I'm
> > not sure.
>
> The problem happens in commands like
> INSERT INTO tab VALUES (1,2), (3,4), ...
> We treat this separately from the single-VALUES-row case for
> efficiency reasons.

Yeah, I know about the multi-row VALUES. What I'm mostly struggling to
parse is the "from" and the double plural of "VALUES" and "rows".
Also, why is it "from" and not "with"?  I get that "VALUES" is a
keyword that happens to be plural, but try reading it out loud.

Why not "Fix INSERT with multi-row VALUES clauses ..."?

David




Re: 2024-05-09 release announcement draft

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 15:09, Jonathan S. Katz  wrote:
> I opted for that; and it turned out the other fix was simple, so here's
> an updated draft.

Thanks

> * Fix [`INSERT`](https://www.postgresql.org/docs/current/sql-insert.html) from
> multiple [`VALUES`](https://www.postgresql.org/docs/current/sql-values.html)
> rows into a target column that is a domain over an array or composite type.

I know this is the same wording as Tom added in [1], I might just have
failed to comprehend something, but if I strip out the links and try
to make sense of "Fix INSERT from multiple VALUES rows into", I just
can't figure out how to parse it.  I'm pretty sure it means "Fix
multiple-row VALUES clauses with INSERT statements when ...", but I'm
not sure.

> * Require the [SELECT 
> privilege](https://www.postgresql.org/docs/current/sql-grant.html)
> on the target table when using 
> [`MERGE`](https://www.postgresql.org/docs/current/sql-merge.html)
> when using `MERGE ... DO NOTHING`.

I think the last line should just be "with `DO NOTHING`"

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7155cc4a60e7bfc837233b2dea2563a2edc673fd




Re: 2024-05-09 release announcement draft

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 14:58, Jonathan S. Katz  wrote:
> >> * Throw an error if an index is accessed while it is being reindexed.
> >
>
> Based on this, I'd vote to just remove it from the release announcement.

I'd prefer that over leaving the wording the way it is.  Looking at
the test case in [1], it does not seem like a very likely thing for
people to hit. It basically requires someone to be telling lies about
a function's immutability.

David

[1] https://www.postgresql.org/message-id/18363-e3598a5a572d0...@postgresql.org




Re: Incorrect explain output for updates/delete operations with returning-list on partitioned tables

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 09:18, Tom Lane  wrote:
>
> SAIKIRAN AVULA  writes:
> > I have been working on partitioned tables recently, and I have noticed
> > something that doesn't seem correct with the EXPLAIN output of an
> > update/delete query with a returning list.
>
> What do you think is not right exactly?  The output has to use some
> one of the correlation names for the partitioned table.  I think
> it generally chooses the one corresponding to the first Append arm,
> but really any would be good enough for EXPLAIN's purposes.

Also looks harmless to me.  But just a slight correction, you're
talking about the deparse Append condition that's in
set_deparse_plan().  Whereas the code that controls this for the
returningList is the following in nodeModifyTable.c:

/*
* Initialize result tuple slot and assign its rowtype using the first
* RETURNING list.  We assume the rest will look the same.
*/
mtstate->ps.plan->targetlist = (List *) linitial(node->returningLists);

David




Re: 2024-05-09 release announcement draft

2024-05-06 Thread David Rowley
On Tue, 7 May 2024 at 05:44, Jonathan S. Katz  wrote:
> Please provide feedback no later (and preferably sooner) than 2024-05-09
> 12:00 UTC.

Thanks for the draft.  Here's some feedback.

> * Fix [`INSERT`](https://www.postgresql.org/docs/current/sql-insert.html) from
> multiple [`VALUES`](https://www.postgresql.org/docs/current/sql-values.html)
> rows into a target column that is a domain over an array or composite type.
> including requiring the [SELECT 
> privilege](https://www.postgresql.org/docs/current/sql-grant.html)
> on the target table when using 
> [`MERGE`](https://www.postgresql.org/docs/current/sql-merge.html)
> when using `MERGE ... DO NOTHING`.

Something looks wrong with the above. Are two separate items merged
into one?  52898c63e and a3f5d2056?

> * Fix confusion for SQL-language procedures that returns a single 
> composite-type
> column.

Should "returns" be singular here?

> * Throw an error if an index is accessed while it is being reindexed.

 I know you want to keep these short and I understand the above is the
same wording from release notes, but these words sound like a terrible
oversite that we allow any concurrent query to still use the table
while a reindex is in progress.  Maybe we should give more detail
there so people don't think we made such a silly mistake. The release
note version I think does have enough words to allow the reader to
understand that the mistake is more complex. Maybe we could add
something here to make it sound like less of an embarrassing mistake?

David




Re: On disable_cost

2024-05-04 Thread David Rowley
On Sun, 5 May 2024 at 04:57, Tom Lane  wrote:
>
> David Rowley  writes:
> > That doesn't get you the benefits of fewer CPU cycles, but where did
> > that come from as a motive to change this? There's no shortage of
> > other ways to make the planner faster if that's an issue.
>
> The concern was to not *add* CPU cycles in order to make this area
> better.  But I do tend to agree that we've exhausted all the other
> options.

It really looks to me that Robert was talking about not generating
paths for disabled path types. He did write "just to save CPU cycles"
in the paragraph I quoted.

I think we should concern ourselves with adding overhead to add_path()
*only* when we actually see a patch which slows it down in a way that
we can measure.  I find it hard to imagine that adding a single
comparison for every Path is measurable.  Each of these paths has been
palloced and costed, both of which are significantly more expensive
than adding another comparison to compare_path_costs_fuzzily().  I'm
only willing for benchmarks on an actual patch to prove me wrong on
that. Nothing else. add_path() has become a rat's nest of conditions
over the years and those seem to have made it without concerns about
performance.

> BTW, I looked through costsize.c just now to see exactly what we are
> using disable_cost for, and it seemed like a majority of the cases are
> just wrong.  Where possible, we should implement a plan-type-disable
> flag by not generating the associated Path in the first place, not by
> applying disable_cost to it.  But it looks like a lot of people have
> erroneously copied the wrong logic.  I would say that only these plan
> types should use the disable_cost method:
>
> seqscan
> nestloop join
> sort

I think this oversimplifies the situation.  I only spent 30 seconds
looking and I saw cases where this would cause issues. If
enable_hashagg is false, we could fail to produce some plans where the
type is hashable but not sortable. There's also an issue with nested
loops being unable to FULL OUTER JOIN. However, I do agree that there
are some in there that are adding disable_cost that should be done by
just not creating the Path.  enable_gathermerge is one.
enable_bitmapscan is probably another.

I understand you only talked about the cases adding disable_cost in
costsize.c. But just as a reminder, there are other things we need to
be careful not to break. For example, enable_indexonlyscan=false
should defer to still making an index scan.  Nobody who disables
enable_indexonlyscan without disabling enable_indexscan wants queries
that are eligible to use IOS to use seq scan instead. They'd still
want Index Scan to be considered, otherwise they'd have disabled
enable_indexscan.
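
A quick sketch of that expectation (hypothetical table and index; the
plan shown is only what you'd hope for, not a guaranteed output):

CREATE TABLE t1 (a int, b text);
CREATE INDEX t1_a_idx ON t1 (a);
-- ... populate and ANALYZE ...

SET enable_indexonlyscan = off;   -- enable_indexscan is left on
EXPLAIN (COSTS OFF) SELECT a FROM t1 WHERE a < 10;
-- expected: Index Scan using t1_a_idx on t1, not a Seq Scan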

David




Re: On disable_cost

2024-05-04 Thread David Rowley
On Sat, 4 May 2024 at 08:34, Robert Haas  wrote:
> Another idea is to remove the ERROR mentioned above from
> set_cheapest() and just allow planning to continue even if some
> relations end up with no paths. (This would necessitate finding and
> fixing any code that could be confused by a pathless relation.) Then,
> if you get to the top of the plan tree and you have no paths there,
> redo the join search discarding the constraints (or maybe just some of
> the constraints, e.g. allow sequential scans and nested loops, or
> something).

I don't think you'd need to wait any longer than the point where we
call set_cheapest() and find no paths to know that there's going to be
a problem.

I don't think redoing planning is going to be easy or even useful. I
mean what do you change when you replan? You can't just do
enable_seqscan and enable_nestloop as if there's no index to provide
sorted input and the plan requires some sort, then you still can't
produce a plan.  Adding enable_sort to the list does not give me much
confidence we'll never fail to produce a plan either. It just seems
impossible to know which of the disabled ones caused the RelOptInfo to
have no paths.  Also, you might end up enabling one that caused the
planner to do something different than it would do today.  For
example, a Path that today incurs 2x disable_cost vs a Path that only
receives 1x disable_cost might do something different if you just went
and enabled a bunch of enable* GUCs before replanning.

> Now, I suppose it might be that even if we can't remove disable_cost,
> something along these lines is still worth doing, just to save CPU
> cycles. You could for example try planning with only non-disabled
> stuff and then do it over again with everything if that doesn't work
> out, still keeping disable_cost around so that you avoid disabled
> nodes where you can. But I'm kind of hoping that I'm missing something
> and there's some approach that could both kill disable_cost and save
> some cycles at the same time. If (any of) you have an idea, I'd love
> to hear it!

I think the int Path.disabledness idea is worth coding up to try it.
I imagine that a Path would incur the maximum of its subpaths'
disabledness values, then add_path() just needs to prefer Paths with a
lower disabledness.

That doesn't get you the benefits of fewer CPU cycles, but where did
that come from as a motive to change this? There's no shortage of
other ways to make the planner faster if that's an issue.

David




Re: Use generation memory context for tuplestore.c

2024-05-03 Thread David Rowley
On Sat, 4 May 2024 at 03:51, Matthias van de Meent
 wrote:
> Was a bump context considered? If so, why didn't it make the cut?
> If tuplestore_trim is the only reason why the type of context in patch
> 2 is a generation context, then couldn't we make the type of context
> conditional on state->eflags & EXEC_FLAG_REWIND, and use a bump
> context if we require rewind capabilities (i.e. where _trim is never
> effectively executed)?

I didn't really want to raise all this here, but to answer why I
didn't use bump...

There's a bit more that would need to be done to allow bump to work in
use-cases where no trim support is needed. Namely, if you look at
writetup_heap(), you'll see a heap_free_minimal_tuple(), which is
pfreeing the memory that was allocated for the tuple in either
tuplestore_puttupleslot(), tuplestore_puttuple() or
tuplestore_putvalues().   So basically, what happens is that if we're
still loading the tuplestore and we've already spilled to disk, once
the tuplestore_put* function is called, we allocate memory for the
tuple that might get stored in RAM (we don't know yet), but then call
tuplestore_puttuple_common(), which decides if the tuple goes to RAM or
disk; then, because we're spilling to disk, the write function pfree's
the memory we allocated in the tuplestore_put function after the tuple
is safely written to the disk buffer.

This is a fairly inefficient design.  While, we do need to still form
a tuple and store it somewhere for tuplestore_putvalues(), we don't
need to do that for a heap tuple. I think it should be possible to
write directly from the table's page.

Overall tuplestore.c seems quite confused as to how it's meant to
work.  You have the tuplestore_begin_heap() function, which is the
*only* external function to create a tuplestore.  We then pretend we're
agnostic about how we store tuples that won't fit in memory by
providing callbacks for read/write/copy, but we only have 1 set of
functions for those, and instead of having some other begin method to
use when not dealing with heap tuples, we use some other
tuplestore_put* function.

It seems like another pass is required to fix all this and I think
that should be:

1. Get rid of the function pointers and just hardcode which static
functions we call to read/write/copy.
2. Rename tuplestore_begin_heap() to tuplestore_begin().
3. See if we can rearrange the code so that the copying to the tuple
context is only done when we are in TSS_INMEM. I'm not sure what that
would look like, but it's required before we could use bump so we
don't pfree a bump allocated chunk.

Or maybe there's a way to fix this by adding other begin functions and
making it work more like tuplesort.  I've not really looked into that.

I'd rather tackle these problems independently, and I believe there are
much bigger wins in moving from aset to generation than from generation
to bump, so that's where I've started.

Thanks for having a look at the patch.

David




Re: Support tid range scan in parallel?

2024-05-03 Thread David Rowley
On Sat, 4 May 2024 at 06:55, Cary Huang  wrote:
> With syncscan enabled, the "table_block_parallelscan_nextpage()" would
> return the next block since the end of the first tid rangescan instead of the
> correct start block that should be scanned. I see that single tid rangescan
> does not have SO_ALLOW_SYNC set, so I figure syncscan should also be
> disabled in parallel case. With this change, then it would be okay to call
> heap_setscanlimits() in parallel case, so I added this call back to
> heap_set_tidrange() in both serial and parallel cases.

This now calls heap_setscanlimits() for the parallel version; it's
just that table_block_parallelscan_nextpage() does nothing to obey
those limits.

The only reason the code isn't reading the entire table is due to the
optimisation in heap_getnextslot_tidrange() which returns false when
the ctid goes out of range. i.e, this code:

/*
* When scanning forward, the TIDs will be in ascending order.
* Future tuples in this direction will be higher still, so we can
* just return false to indicate there will be no more tuples.
*/
if (ScanDirectionIsForward(direction))
return false;

If I comment out that line, I see all pages are accessed:

postgres=# explain (analyze, buffers) select count(*) from a where
ctid >= '(0,1)' and ctid < '(11,0)';
QUERY PLAN
---
 Finalize Aggregate  (cost=18.80..18.81 rows=1 width=8) (actual
time=33.530..36.118 rows=1 loops=1)
   Buffers: shared read=4425
   ->  Gather  (cost=18.78..18.79 rows=2 width=8) (actual
time=33.456..36.102 rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 Buffers: shared read=4425
 ->  Partial Aggregate  (cost=18.78..18.79 rows=1 width=8)
(actual time=20.389..20.390 rows=1 loops=3)
   Buffers: shared read=4425
   ->  Parallel Tid Range Scan on a  (cost=0.01..16.19
rows=1035 width=0) (actual time=9.375..20.349 rows=829 loops=3)
 TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
 Buffers: shared read=4425   <-- this is all
pages in the table instead of 11 pages.

With that code still commented out, the non-parallel version still
won't read all pages due to the setscanlimits being obeyed.

postgres=# set max_parallel_workers_per_gather=0;
SET
postgres=# explain (analyze, buffers) select count(*) from a where
ctid >= '(0,1)' and ctid < '(11,0)';
  QUERY PLAN
--
 Aggregate  (cost=45.07..45.08 rows=1 width=8) (actual
time=0.302..0.302 rows=1 loops=1)
   Buffers: shared hit=11
   ->  Tid Range Scan on a  (cost=0.01..38.86 rows=2485 width=0)
(actual time=0.019..0.188 rows=2486 loops=1)
 TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < '(11,0)'::tid))
 Buffers: shared hit=11


If I put that code back in, how many pages are read depends on the
number of parallel workers as workers will keep running with higher
page numbers and heap_getnextslot_tidrange() will just (inefficiently)
filter those out.

max_parallel_workers_per_gather=2;
   ->  Parallel Tid Range Scan on a  (cost=0.01..16.19
rows=1035 width=0) (actual time=0.191..0.310 rows=829 loops=3)
 TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
 Buffers: shared read=17

max_parallel_workers_per_gather=3;
   ->  Parallel Tid Range Scan on a  (cost=0.01..12.54
rows=802 width=0) (actual time=0.012..0.114 rows=622 loops=4)
 TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
 Buffers: shared hit=19

max_parallel_workers_per_gather=4;
   ->  Parallel Tid Range Scan on a  (cost=0.01..9.72
rows=621 width=0) (actual time=0.014..0.135 rows=497 loops=5)
 TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <
'(11,0)'::tid))
 Buffers: shared hit=21

To fix this you need to make table_block_parallelscan_nextpage obey
the limits imposed by heap_setscanlimits().

The equivalent code in the non-parallel version is in
heapgettup_advance_block().

/* check if the limit imposed by heap_setscanlimits() is met */
if (scan->rs_numblocks != InvalidBlockNumber)
{
if (--scan->rs_numblocks == 0)
return InvalidBlockNumber;
}

I've not studied exactly how you'd get the rs_numblocks information
down to the parallel scan descriptor.  But when you figure that out,
just remember that you can't do the --scan->rs_numblocks from
table_block_parallelscan_nextpage() as that's not parallel safe. You
might be able to add an or condition to: "if (nallocated >=
pbscan->phs_nblocks)" to make it "if (nallocated >=
pbscan->phs_nblocks || nallocated >= 

Re: Typos in the code and README

2024-05-03 Thread David Rowley
On Fri, 3 May 2024 at 00:00, Alexander Lakhin  wrote:
> (some of them are located in doc/, so it's not a code-only change)
> I've attached the patch for your convenience, though maybe some
> of the suggestions are to be discarded.

Thanks. I was hoping you'd do that.

I pushed the patch after only adjusting the path in the docs which had
"module" rather than "modules".

David




Use generation memory context for tuplestore.c

2024-05-03 Thread David Rowley
(40af10b57 did this for tuplesort.c, this is the same, but for tuplestore.c)

I was looking at the tuplestore.c code a few days ago and noticed that
it allocates tuples in the memory context that tuplestore_begin_heap()
is called in, which for nodeMaterial.c, is ExecutorState.

I didn't think this was great because:
1. Allocating many chunks in ExecutorState can bloat the context with
many blocks worth of free'd chunks, stored on freelists that might
never be reused for anything.
2. To clean up the memory, pfree must be meticulously called on each
allocated tuple
3. ExecutorState is an aset.c context which isn't the most efficient
allocator for this purpose.

I've attached 2 patches:

0001:  Adds memory tracking to Materialize nodes, which looks like:

 ->  Materialize (actual time=0.033..9.157 rows=1 loops=2)
   Storage: Memory  Maximum Storage: 10441kB

0002: Creates a Generation MemoryContext for storing tuples in tuplestore.

Using generation has the following advantages:

1. It does not round allocations up to the next power of 2.  Using
generation will save an average of 25% memory for tuplestores or allow
an average of 25% more tuples before going to disk.
2. Allocation patterns in tuplestore.c are FIFO, which is exactly what
generation was designed to handle best.
3. Generation is faster to palloc/pfree than aset. (See [1]. Compare
the 4-bit times between aset_palloc_pfree.png and
generation_palloc_pfree.png)
4. tuplestore_clear() and tuplestore_end() can reset or delete the
tuple context instead of pfreeing every tuple one by one.
5. Higher likelihood of neighbouring tuples being stored consecutively
in memory, resulting in better CPU memory prefetching.
6. Generation has a page-level freelist, so is able to reuse pages
instead of freeing and mallocing another if tuplestore_trim() is used
to continually remove no longer needed tuples. aset.c can only
efficiently do this if the tuples are all in the same size class.

The attached bench.sh.txt tests the performance of this change and
result_chart.png shows the results I got when running on an AMD 3990x
master @ 8f0a97dff vs patched.
The script runs benchmarks for various tuple counts stored in the
tuplestore -- 1 to 8192 in power-2 increments.

The script does output the memory consumed by the tuplestore for each
query.  Here are the results for the 8192 tuple test:

master @ 8f0a97dff
Storage: Memory  Maximum Storage: 16577kB

patched:
Storage: Memory  Maximum Storage: 8577kB

Which is roughly half, but I did pad the tuple to just over 1024
bytes, so the alloc set allocations would have rounded up to 2048
bytes.

Some things I've *not* done:

1. Gone over other executor nodes which use tuplestore to add the same
additional EXPLAIN output.  CTE Scan, Recursive Union, Window Agg
could get similar treatment.
2. Given much consideration for the block sizes to use for
GenerationContextCreate(). (Maybe using ALLOCSET_SMALL_INITSIZE for
the start size is a good idea.)
3. A great deal of testing.

I'll park this here until we branch for v18.

David

[1] 
https://www.postgresql.org/message-id/CAApHDvqUWhOMkUjYXzq95idAwpiPdJLCxxRbf8kV6PYcW5y=c...@mail.gmail.com
#!/bin/bash

db=postgres

psql -c "drop table if exists t1;" $db
psql -c "create table t1 (a int, b text);" $db
psql -c "insert into t1 select x,repeat('a',1024) from 
generate_series(1,1)x;" $db
psql -c "create index on t1(a);" $db
psql -c "vacuum freeze analyze t1;" $db
psql -c "select pg_prewarm('t1');" $db

for r in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192
do
echo -n "$r rows "
psql -c "explain (analyze) select count(t1.b) from (values(1),(2)) 
t2(x) left join (select * from t1 where a <= $r) t1 on true;" $db | grep 
"Maximum Storage"
done

for r in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192
do
echo -n "Testing with $r rows "
echo "select count(t1.b) from (values(1),(2)) t2(x) left join 
(select * from t1 where a <= $r) t1 on true;" > bench.sql
pgbench -n -T 10 -f bench.sql -M prepared $db | grep latency
done



v1-0001-Add-memory-disk-usage-for-Material-in-EXPLAIN-ANA.patch
Description: Binary data


v1-0002-Don-t-use-ExecutorState-memory-context-for-tuples.patch
Description: Binary data


Incorrect Assert in BufFileSize()?

2024-05-02 Thread David Rowley
I'm trying to figure out why BufFileSize() Asserts that file->fileset
isn't NULL, per 1a990b207.

The discussion for that commit is in [1], but I don't see any
explanation of the Assert in the discussion or commit message and
there's no comment explaining why it's there.

The code that comes after the Assert does not look at the fileset
field.  With the code as it is, it doesn't seem possible to get the
file size of a non-shared BufFile and I don't see any reason for that.

Should the Assert be checking file->files != NULL?

David

[1] 
https://postgr.es/m/CAH2-Wzn0ZNLZs3DhCYdLMv4xn1fnM8ugVHPvWz67dSUh1s_%3D2Q%40mail.gmail.com




Re: enhance the efficiency of migrating particularly large tables

2024-05-02 Thread David Rowley
On Fri, 3 May 2024 at 09:33, David Zhang  wrote:
> Is there a simple way to get the min of ctid faster than using min(), but 
> similar to get the max of ctid using pg_relation_size?

The equivalent approximation is always '(0,1)'.
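
For illustration (the table name is just an example), both bounds can
be had without scanning the table at all:

-- '(0,1)' is the lowest possible tuple position; the upper bound can be
-- approximated from the relation's size in blocks.
SELECT '(0,1)'::tid AS approx_min_ctid,
       format('(%s,0)',
              pg_relation_size('some_table') /
              current_setting('block_size')::bigint)::tid AS approx_max_ctid;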

David




Re: Type and CAST error on lowest negative integer values for smallint, int and bigint

2024-05-02 Thread David Rowley
On Thu, 2 May 2024 at 23:25, Hans Buschmann  wrote:
> postgres=# select -32768::smallint;
> ERROR:  smallint out of range

The precedence order of operations applies the cast before the unary
minus operator.

Any of the following will work:

postgres=# select cast(-32768 as smallint), (-32768)::smallint,
'-32768'::smallint;
  int2  |  int2  |  int2
--------+--------+--------
 -32768 | -32768 | -32768
(1 row)

David




Re: Document NULL

2024-05-01 Thread David Rowley
On Thu, 2 May 2024 at 03:12, David G. Johnston
 wrote:
> Attached is a very rough draft attempting this, based on my own thoughts and 
> those expressed by Tom in [1], which largely align with mine.

Thanks for picking this up. I agree that we should have something to
improve this.

It would be good to see some subtitles in this e.g "Three-valued
boolean logic" and document about NULL being unknown, therefore false.
Giving a few examples would be good too, which I think is useful as it
at least demonstrates a simple way of testing these things using a
simple FROMless SELECT, e.g. "SELECT NULL = NULL;".  You could link to
this section from where we document WHERE clauses.

Maybe another subtitle would be "GROUP BY / DISTINCT clauses with NULL
values", and then explain that including some other examples using
"SELECT 1 IS NOT DISTINCT FROM NULL;" to allow the reader to
experiment and learn by running queries.
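
For example, the sort of self-contained examples I have in mind
(expected results shown in the comments):

SELECT NULL = NULL;                  -- NULL, not true
SELECT NULL IS DISTINCT FROM NULL;   -- false: IS DISTINCT FROM treats NULLs as "equal"
SELECT 1 IS NOT DISTINCT FROM NULL;  -- false

SELECT count(*), count(x), array_agg(DISTINCT x)
FROM (VALUES (1), (NULL), (NULL)) v(x);
-- count(*) is 3, count(x) is 1, and DISTINCT folds the two NULLs into one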

You likely skipped them due to draft status, but if not, references
back to other sections likely could do with links back to that
section, e.g "amount of precipitation Hayward" is not on that page.
Without that you're assuming the reader is reading the documents
linearly.

Another section might briefly explain about disallowing NULLs in
columns with NOT NULL constraints, then link to wherever we properly
document those.

typo:

+ Handling Unkowns (NULL)

Maybe inject "Values" after Unknown.

Let's bash it into shape a bit more before going any further on actual wording.

David




Re: New GUC autovacuum_max_threshold ?

2024-05-01 Thread David Rowley
On Sat, 27 Apr 2024 at 02:13, Robert Haas  wrote:
> Let's compare the current situation to the situation post-patch with a
> cap of 500k. Consider a table 1024 times larger than the one I
> mentioned above, so pgbench scale factor 25600, size on disk 320GB.
> Currently, that table will be vacuumed for bloat when the number of
> dead tuples exceeds 20% of the table size, because that's the default
> value of autovacuum_vacuum_scale_factor. The table has 2.56 billion
> tuples, so that means that we're going to vacuum it when there are
> more than 510 million dead tuples. Post-patch, we will vacuum when we
> have 500 thousand dead tuples. Suppose a uniform workload that slowly
> updates rows in the table. If we were previously autovacuuming the
> table once per day (1440 minutes) we're now going to try to vacuum it
> almost every minute (1440 minutes / 1024 = 84 seconds).

I've not checked your maths, but if that's true, that's not going to work.

I think there are fundamental problems with the parameters that drive
autovacuum that need to be addressed before we can consider a patch
like this one.

Here are some of the problems that I know about:

1. Autovacuum has exactly zero forward vision and operates reactively
rather than proactively.  This "blind operating" causes tables to
either not need vacuuming or to suddenly need vacuuming, without any
consideration of how busy autovacuum is at that moment.
2. There is no prioritisation for the order in which tables are autovacuumed.
3. With the default scale factor, the larger a table becomes, the more
infrequent the autovacuums.
4. Autovacuum is more likely to trigger when the system is busy
because more transaction IDs are being consumed and there is more DML
occurring. This results in autovacuum having less work to do during
quiet periods when there are more free resources to be doing the
vacuum work.

In my opinion, the main problem with Frédéric's proposed GUC/reloption
is that it increases the workload that autovacuum is responsible for
and, because of #2, it becomes more likely that autovacuum works on
some table that isn't the highest-priority table to work on, which can
result in autovacuum starvation of tables that are more important to
vacuum now.

I think we need to do a larger overhaul of autovacuum to improve
points 1-4 above.  I also think that there's some work coming up that
might force us into this sooner than we think.  As far as I understand
it, AIO will break vacuum_cost_page_miss because everything (providing
IO keeps up) will become vacuum_cost_page_hit. Maybe that's not that
important as that costing is quite terrible anyway.

Here's a sketch of an idea that's been in my head for a while:

Round 1:
1a) Give autovacuum forward vision (#1 above) and instead of vacuuming
a table when it (atomically) crosses some threshold, use the existing
scale_factors and autovacuum_freeze_max_age to give each table an
autovacuum "score", which could be a number from 0-100, where 0 means
do nothing and 100 means nuclear meltdown.  Let's say a table gets 10
points for the dead tuples meeting the current scale_factor and maybe
an additional point for each 10% of the database's size that the table
accounts for (giving some weight to space
recovery for larger tables).  For relfrozenxid, make the score the
maximum of the dead tuple score vs the percentage that
age(relfrozenxid) is of 2 billion. Use a similar maximum score calc for
age(relminmxid) vs 2 billion.
1b) Add a new GUC that defines the minimum score a table must reach
before autovacuum will consider it.
1c) Change autovacuum to vacuum the tables with the highest scores first.

Round 2:
2a) Have autovacuum monitor the score of the highest scoring table
over time with buckets for each power of 2 seconds in history from
now. Let's say 20 buckets, about 12 days of history. Migrate scores
into older buckets to track the score over time.
2b) Have autovacuum cost limits adapt according to the history so that
if the maximum score of any table is trending upwards, that autovacuum
speeds up until the score buckets trend downwards towards the present.
2c) Add another GUC to define the minimum score that autovacuum will
be "proactive". Must be less than the minimum score to consider
autovacuum (or at least, ignored unless it is.).  This GUC would not
cause an autovacuum speedup due to 2b) as we'd only consider tables
which meet the GUC added in 1b) in the score history array in 2a).
This stops autovacuum running faster than autovacuum_cost_limit when
trying to be proactive.
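
To make the scoring slightly more concrete, here is a rough SQL
approximation of the kind of calculation described above (the weights,
the 10%-of-database-size bonus and the column choices are only the
examples from this sketch, not a worked-out proposal):

SELECT c.oid::regclass AS table_name,
       GREATEST(
           -- dead-tuple component: 10 points once the current threshold /
           -- scale_factor is crossed, plus up to 10 more for the table's
           -- share of the database size
           CASE WHEN s.n_dead_tup >
                     current_setting('autovacuum_vacuum_threshold')::bigint
                     + current_setting('autovacuum_vacuum_scale_factor')::float8
                       * c.reltuples
                THEN 10 ELSE 0 END
           + 10 * pg_table_size(c.oid)::numeric
                / NULLIF(pg_database_size(current_database()), 0),
           -- wraparound components: percentage of the way to 2 billion
           100.0 * age(c.relfrozenxid) / 2e9,
           100.0 * mxid_age(c.relminmxid) / 2e9) AS autovacuum_score
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind = 'r'
ORDER BY autovacuum_score DESC;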

The above isn't a well-baked idea; the exact way to calculate the
scores certainly isn't well thought through. However, I do think it's
an idea that we should consider and improve upon.  I
believe 2c) helps solve the problem of large tables becoming bloated
as autovacuum could get to these sooner when the workload is low
enough for it to run proactively.

I think we need at least 1a) before we can give 

Re: Support tid range scan in parallel?

2024-05-01 Thread David Rowley
On Wed, 1 May 2024 at 07:10, Cary Huang  wrote:
> Yes of course. These numbers were obtained earlier this year on master with 
> the patch applied most likely without the read stream code you mentioned. The 
> patch attached here is rebased to commit 
> dd0183469bb779247c96e86c2272dca7ff4ec9e7 on master, which is quite recent and 
> should have the read stream code for v17 as I can immediately tell that the 
> serial scans run much faster now in my setup. I increased the records on the 
> test table from 40 to 100 million because serial scans are much faster now. 
> Below is the summary and details of my test. Note that I only include the 
> EXPLAIN ANALYZE details of round1 test. Round2 is the same except for 
> different execution times.

It would be good to see the EXPLAIN (ANALYZE, BUFFERS) with SET
track_io_timing = 1;

Here's a quick review

1. Does not produce correct results:

-- serial plan
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
 count
---
  2260
(1 row)

-- parallel plan
postgres=# set max_parallel_workers_per_gather=2;
SET
postgres=# select count(*) from t where ctid >= '(0,0)' and ctid < '(10,0)';
 count
---
 0
(1 row)

I've not really looked into why, but I see you're not calling
heap_setscanlimits() in parallel mode. You need to somehow restrict
the block range of the scan to the range specified in the quals. You
might need to do more work to make the scan limits work with parallel
scans.

If you look at heap_scan_stream_read_next_serial(), it's calling
heapgettup_advance_block(), where there's  "if (--scan->rs_numblocks
== 0)".  But no such equivalent code in
table_block_parallelscan_nextpage() called by
heap_scan_stream_read_next_parallel().  To make Parallel TID Range
work, you'll need heap_scan_stream_read_next_parallel() to abide by
the scan limits.

2. There's a 4 line comment you've added to cost_tidrangescan() which
is just a copy and paste from cost_seqscan().  If you look at the
seqscan costing, the comment is true in that scenario, but not true in
where you've pasted it.  The I/O cost is all tied in to run_cost.

+ /* The CPU cost is divided among all the workers. */
+ run_cost /= parallel_divisor;
+
+ /*
+ * It may be possible to amortize some of the I/O cost, but probably
+ * not very much, because most operating systems already do aggressive
+ * prefetching.  For now, we assume that the disk run cost can't be
+ * amortized at all.
+ */

3. Calling TidRangeQualFromRestrictInfoList() once for the serial path
and again for the partial path isn't great. It would be good to just
call that function once and use the result for both path types.

4. create_tidrangescan_subpaths() seems like a weird name for a
function.  That seems to imply that scans have subpaths. Scans are
always leaf paths and have no subpaths.

This isn't a complete review. It's just that this seems enough to keep
you busy for a while. I can look a bit harder when the patch is
working correctly. I think you should have enough feedback to allow
that now.

David




Re: Streaming I/O, vectored I/O (WIP)

2024-04-30 Thread David Rowley
On Wed, 24 Apr 2024 at 14:32, David Rowley  wrote:
> I've attached a patch with a few typo fixes and what looks like an
> incorrect type for max_ios. It's an int16 and I think it needs to be
> an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
> anything when max_ios is int16.

No feedback, so I'll just push this in a few hours unless anyone has anything.

David




Re: Support tid range scan in parallel?

2024-04-29 Thread David Rowley
On Tue, 30 Apr 2024 at 10:36, Cary Huang  wrote:
> In one of our migration scenarios, we rely on tid range scan to migrate huge 
> table from one database to another once the lower and upper ctid bound is 
> determined. With the support of parallel ctid range scan, this process could 
> be done much quicker.

I would have thought that the best way to migrate would be to further
divide the TID range into N segments and run N queries, one per
segment to get the data out.
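
For example, with made-up block boundaries, each of these could be run
in its own session (or wrapped in a COPY) against the source server:

select * from t where ctid >= '(0,0)' and ctid < '(250000,0)';
select * from t where ctid >= '(250000,0)' and ctid < '(500000,0)';
select * from t where ctid >= '(500000,0)' and ctid < '(750000,0)';
select * from t where ctid >= '(750000,0)';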

From a CPU point of view, I find it hard to imagine that a SELECT *
query with nothing in the WHERE clause other than the TID range quals
would run faster with multiple workers than with 1.  The problem is
that the overhead of pushing tuples to the main process often
outweighs the benefits of the parallelism.  However, from an I/O point of view
on a server with slow enough disks, I can imagine there'd be a
speedup.

> The attached patch is my approach to add parallel ctid range scan to 
> PostgreSQL's planner and executor. In my tests, I do see an increase in 
> performance using parallel tid range scan over the single worker tid range 
> scan and it is also faster than parallel sequential scan covering similar 
> ranges. Of course, the table needs to be large enough to reflect the 
> performance increase.
>
> below is the timing to complete a select query covering all the records in a 
> simple 2-column table with 40 million records,
>
>  - tid range scan takes 10216ms
>  - tid range scan with 2 workers takes 7109ms
>  - sequential scan with 2 workers takes 8499ms

Can you share more details about this test? i.e. the query, what the
times are that you've measured (EXPLAIN ANALYZE, or SELECT, COPY?).
Also, which version/commit did you patch against?  I was wondering if
the read stream code added in v17 would result in the serial case
running faster because the parallelism just resulted in more I/O
concurrency.

Of course, it may be beneficial to have parallel TID Range for other
cases when more row filtering or aggregation is being done as that
requires pushing fewer tuples over from the parallel worker to the
main process. It just would be good to get to the bottom of if there's
still any advantage to parallelism when no filtering other than the
ctid quals is being done now that we've less chance of having to wait
for I/O coming from disk with the read streams code.
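
For example (table and range are placeholders again), the first of
these has to push every matching tuple from the workers to the leader,
whereas the second only pushes a partial aggregate state per worker:

select * from t where ctid >= '(0,0)' and ctid < '(500000,0)';
select count(*) from t where ctid >= '(0,0)' and ctid < '(500000,0)';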

David




Re: Typos in the code and README

2024-04-28 Thread David Rowley
On Sat, 20 Apr 2024 at 16:09, David Rowley  wrote:
> Here are a few more to see if it motivates anyone else to do a more
> thorough search for another batch.

I've pushed these now.

David




Re: Add memory context type to pg_backend_memory_contexts view

2024-04-23 Thread David Rowley
On Tue, 16 Apr 2024 at 13:30, David Rowley  wrote:
> In [1] Andres mentioned that there's no way to determine the memory
> context type in pg_backend_memory_contexts. This is a bit annoying as
> I'd like to add a test to exercise BumpStats().
>
> Having the context type in the test's expected output helps ensure we
> are exercising BumpStats() and any future changes to the choice of
> context type in tuplesort.c gets flagged up by the test breaking.

bea97cd02 added a new regression test in sysviews.sql to call
pg_backend_memory_contexts to test the BumpStats() function.

The attached updates the v1 patch to add the new type column to the
new call to pg_backend_memory_contexts() to ensure that the type = "Bump".

No other changes.
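
For reference, the updated sysviews test amounts to roughly the
following, now also expecting type = "Bump" for the sort's
'Caller tuples' context:

begin;
declare cur cursor for select left(a,10), b
from (values(repeat('a', 512 * 1024),1),(repeat('b', 512),2)) v(a,b)
order by v.a desc;
fetch 1 from cur;
select type, name, parent, total_nblocks, free_chunks
from pg_backend_memory_contexts where name = 'Caller tuples';
rollback;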

David
From 632be6de363e8f9975722debbd620665f3a0ea71 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Tue, 16 Apr 2024 13:05:34 +1200
Subject: [PATCH v2] Add context type field to pg_backend_memory_contexts

---
 doc/src/sgml/system-views.sgml |  9 +
 src/backend/utils/adt/mcxtfuncs.c  | 50 ++
 src/include/catalog/pg_proc.dat|  6 ++--
 src/test/regress/expected/rules.out|  5 +--
 src/test/regress/expected/sysviews.out | 16 -
 src/test/regress/sql/sysviews.sql  |  4 +--
 6 files changed, 61 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 7ed617170f..18ae5b9138 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -463,6 +463,15 @@
 
 
 
+ 
+  
+   type text
+  
+  
+   Type of the memory context
+  
+ 
+
  
   
name text
diff --git a/src/backend/utils/adt/mcxtfuncs.c 
b/src/backend/utils/adt/mcxtfuncs.c
index 4d4a70915b..fe52d57fd4 100644
--- a/src/backend/utils/adt/mcxtfuncs.c
+++ b/src/backend/utils/adt/mcxtfuncs.c
@@ -36,12 +36,13 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
 TupleDesc 
tupdesc, MemoryContext context,
 const char 
*parent, int level)
 {
-#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS 9
+#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS 10
 
Datum   values[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS];
boolnulls[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS];
MemoryContextCounters stat;
MemoryContext child;
+   const char *type;
const char *name;
const char *ident;
 
@@ -67,10 +68,31 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
memset(values, 0, sizeof(values));
memset(nulls, 0, sizeof(nulls));
 
+   switch (context->type)
+   {
+   case T_AllocSetContext:
+   type = "AllocSet";
+   break;
+   case T_GenerationContext:
+   type = "Generation";
+   break;
+   case T_SlabContext:
+   type = "Slab";
+   break;
+   case T_BumpContext:
+   type = "Bump";
+   break;
+   default:
+   type = "???";
+   break;
+   }
+
+   values[0] = CStringGetTextDatum(type);
+
if (name)
-   values[0] = CStringGetTextDatum(name);
+   values[1] = CStringGetTextDatum(name);
else
-   nulls[0] = true;
+   nulls[1] = true;
 
if (ident)
{
@@ -86,22 +108,22 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
 
memcpy(clipped_ident, ident, idlen);
clipped_ident[idlen] = '\0';
-   values[1] = CStringGetTextDatum(clipped_ident);
+   values[2] = CStringGetTextDatum(clipped_ident);
}
else
-   nulls[1] = true;
+   nulls[2] = true;
 
if (parent)
-   values[2] = CStringGetTextDatum(parent);
+   values[3] = CStringGetTextDatum(parent);
else
-   nulls[2] = true;
-
-   values[3] = Int32GetDatum(level);
-   values[4] = Int64GetDatum(stat.totalspace);
-   values[5] = Int64GetDatum(stat.nblocks);
-   values[6] = Int64GetDatum(stat.freespace);
-   values[7] = Int64GetDatum(stat.freechunks);
-   values[8] = Int64GetDatum(stat.totalspace - stat.freespace);
+   nulls[3] = true;
+
+   values[4] = Int32GetDatum(level);
+   values[5] = Int64GetDatum(stat.totalspace);
+   values[6] = Int64GetDatum(stat.nblocks);
+   values[7] = Int64GetDatum(stat.freespace);
+   values[8] = Int64GetDatum(stat.freechunks);
+   values[9] = Int64GetDatum(stat.totalspace - stat.freespace);
tuplestore_putvalues(tupstore, tupdesc, values, nulls);

Re: Streaming I/O, vectored I/O (WIP)

2024-04-23 Thread David Rowley
I've attached a patch with a few typo fixes and what looks like an
incorrect type for max_ios. It's an int16 and I think it needs to be
an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
anything when max_ios is int16.

David
diff --git a/src/backend/storage/aio/read_stream.c 
b/src/backend/storage/aio/read_stream.c
index 634cf4f0d1..74b9bae631 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -26,12 +26,12 @@
  *
  * B) I/O is necessary, but fadvise is undesirable because the access is
  * sequential, or impossible because direct I/O is enabled or the system
- * doesn't support advice.  There is no benefit in looking ahead more than
- * io_combine_limit, because in this case only goal is larger read system
+ * doesn't support fadvise.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case the only goal is larger read system
  * calls.  Looking further ahead would pin many buffers and perform
  * speculative work looking ahead for no benefit.
  *
- * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * C) I/O is necessary, it appears random, and this system supports fadvise.
  * We'll look further ahead in order to reach the configured level of I/O
  * concurrency.
  *
@@ -418,7 +418,7 @@ read_stream_begin_relation(int flags,
ReadStream *stream;
size_t  size;
int16   queue_size;
-   int16   max_ios;
+   int max_ios;
int strategy_pin_limit;
uint32  max_pinned_buffers;
Oid tablespace_id;
@@ -447,6 +447,8 @@ read_stream_begin_relation(int flags,
max_ios = 
get_tablespace_maintenance_io_concurrency(tablespace_id);
else
max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+   /* Cap to INT16_MAX to avoid overflowing below */
max_ios = Min(max_ios, PG_INT16_MAX);
 
/*


Re: Minor document typo

2024-04-23 Thread David Rowley
On Wed, 24 Apr 2024 at 00:11, Tatsuo Ishii  wrote:
> Just out of a curiosity, is it possible to say "low a wal_level on the
> primary"? (just "too" is removed)

Prefixing the adjective with "too" means it's beyond the acceptable
range.  "This coffee is too hot".

https://dictionary.cambridge.org/grammar/british-grammar/too

David




Re: Minor document typo

2024-04-23 Thread David Rowley
On Tue, 23 Apr 2024 at 23:17, Tatsuo Ishii  wrote:
>Number of uses of logical slots in this database that have been
>canceled due to old snapshots or too low a  linkend="guc-wal-level"/>
>on the primary
>
> I think "too low a" should be "too low" ('a' is not
> necessary). Attached is the patch.

The existing text looks fine to me.  The other form would use "of a"
and become "too low of a wal_level on the primary".

"too low wal_level on the primary" sounds wrong to my native
English-speaking ear.

There's some discussion in [1] that might be of interest to you.

David

[1] 
https://www.reddit.com/r/grammar/comments/qr9z6e/i_need_help_with_sothat_adj_of_a_sing_noun/?ref=share_source=link




Re: Typos in the code and README

2024-04-19 Thread David Rowley
On Fri, 19 Apr 2024 at 20:13, Daniel Gustafsson  wrote:
> Thanks, I incorporated these into 0001 before pushing.  All the commits in 
> this
> patchset are now applied.

Here are a few more to see if it motivates anyone else to do a more
thorough search for another batch.

Fixes duplicate words spanning multiple lines plus an outdated
reference to "streaming read" which was renamed to "read stream" late
in that patch's development.

duplicate words found using:
ag "\s([a-zA-Z]{2,})[\s*]*\n\1\b"
ag "\s([a-zA-Z]{2,})\n(\s*\*\s*)\1\b"

David
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4a4cf76269..32e7d3c146 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1122,7 +1122,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
/*
 * Set up a read stream for sequential scans and TID range scans. This
 * should be done after initscan() because initscan() allocates the
-* BufferAccessStrategy object passed to the streaming read API.
+* BufferAccessStrategy object passed to the read stream API.
 */
if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
diff --git a/src/backend/access/heap/vacuumlazy.c 
b/src/backend/access/heap/vacuumlazy.c
index de109acc89..0c5379666b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -13,7 +13,7 @@
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  If the
  * TID store is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
  * the pages that we've pruned). This frees up the memory space dedicated to
- * to store dead TIDs.
+ * store dead TIDs.
  *
  * In practice VACUUM will often complete its initial pass over the target
  * heap relation without ever running out of space to store TIDs.  This means
diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 4548a91723..a2510cf80c 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -108,7 +108,7 @@
  * (if one exists).
  *
  * activeSearchPath is always the actually active path; it points to
- * to baseSearchPath which is the list derived from namespace_search_path.
+ * baseSearchPath which is the list derived from namespace_search_path.
  *
  * If baseSearchPathValid is false, then baseSearchPath (and other derived
  * variables) need to be recomputed from namespace_search_path, or retrieved
diff --git a/src/backend/replication/logical/slotsync.c 
b/src/backend/replication/logical/slotsync.c
index cb39adcd0e..ed8104ccb4 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -601,7 +601,7 @@ update_and_persist_local_synced_slot(RemoteSlot 
*remote_slot, Oid remote_dbid)
  * metadata of the slot as per the data received from the primary server.
  *
  * The slot is created as a temporary slot and stays in the same state until 
the
- * the remote_slot catches up with locally reserved position and local slot is
+ * remote_slot catches up with locally reserved position and local slot is
  * updated. The slot is then persisted and is considered as sync-ready for
  * periodic syncs.
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 100f6454e5..a691aed1f4 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -244,7 +244,7 @@ smgropen(RelFileLocator rlocator, ProcNumber backend)
 
 /*
  * smgrpin() -- Prevent an SMgrRelation object from being destroyed at end of
- * of transaction
+ * transaction
  */
 void
 smgrpin(SMgrRelation reln)


Re: Stability of queryid in minor versions

2024-04-19 Thread David Rowley
On Tue, 16 Apr 2024 at 15:16, Michael Paquier  wrote:
>
> On Tue, Apr 16, 2024 at 02:04:22PM +1200, David Rowley wrote:
> > It makes sense to talk about the hashing variations closer to the
> > object identifier part.
>
> Mostly what I had in mind.  A separate  for the new part you are
> adding at the end of the first part feels a bit more natural split
> here.  Feel free to discard my comment if you think that's not worth
> it.

Thanks for the review. I've now pushed this, backpatching to 12.

David




Re: Why is parula failing?

2024-04-16 Thread David Rowley
On Tue, 16 Apr 2024 at 18:58, Robins Tharakan  wrote:
> The last 25 consecutive runs have passed [1] after switching
> REL_12_STABLE to -O0 ! So I am wondering whether that confirms that
> the compiler version is to blame, and while we're still here,
> is there anything else I could try?

I don't think it's conclusive proof that it's a compiler bug.  -O0
could equally just have changed the timing sufficiently to not trigger
the issue.

It might be a long shot, but I wonder if it's worth running a
workload such as:

psql -c "create table a (a int primary key, b text not null, c int not
null); insert into a values(1,'',0);" postgres
echo "update a set b = repeat('a', random(1,10)), c=c+1 where a = 1;"
> bench.sql
pgbench -n -t 12500 -c 8 -j 8 -f bench.sql postgres
psql -c "table a;" postgres

On a build with the original CFLAGS.  I expect the value of "c" to be
100k after running that. If it's not then something bad has happened.
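
A direct check of that (same table as in the script above) could be:

select c = 100000 as ok from a where a = 1;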

That would exercise the locking code heavily and would show us if any
updates were missed due to locks not being correctly respected or seen
by other backends.

David




Re: Add bump memory context type and use it for tuplesorts

2024-04-16 Thread David Rowley
On Mon, 8 Apr 2024 at 00:37, David Rowley  wrote:
> I've now pushed all 3 patches.   Thank you for all the reviews on
> these and for the extra MemoryContextMethodID bit, Matthias.

I realised earlier today when working on [1] that bump makes a pretty
brain-dead move when adding dedicated blocks to the blocks list.  The
problem is that I opted to not have a current block field in
BumpContext and just rely on the head pointer of the blocks list to be
the "current" block.  The head block is the block we look at to see if
we've any space left when new allocations come in.  The problem there
is when adding a dedicated block in BumpAllocLarge(), the code adds
this to the head of the blocks list so that when a new allocation
comes in that's normal-sized, the block at the top of the list is full
and we have to create a new block for the allocation.

The attached fixes this by pushing these large/dedicated blocks to the
*tail* of the blocks list.  This means the partially filled block
remains at the head and is available for any new allocation which will
fit.  This behaviour is evident from the regression test change that I
added earlier today when working on [1].  The 2nd and smaller
allocation in that test goes onto the keeper block rather than a new
block.

I plan to push this tomorrow.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=bea97cd02ebb347ab469b78673c2b33a72109669
From 39a60420a83e5a059de50869c0981be9732909ab Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Tue, 16 Apr 2024 22:26:00 +1200
Subject: [PATCH v1] Push dedicated BumpBlocks to the tail of the blocks list

BumpContext relies on using the head block from its 'blocks' field to
use as the current block to allocate new chunks to.  When we receive an
allocation request larger than allocChunkLimit, we place these chunks on
a new dedicated block and push this block onto the head of the 'blocks'
list.  This behavior caused the previous bump block to no longer be
available for new normal-sized (non-large) allocations and would result
in wasting space on blocks when a large request arrived before the
current block was full.

Here adjust the code to push these dedicated blocks onto the *tail* of
the blocks list so that the head block remains intact and available to
be used by normal allocation request sizes.

Discussion: 
https://postgr.es/m/caaphdvp9___r-ayjj0nz6gd3mecgwgz0_6zptwpwj+zqhtm...@mail.gmail.com
---
 src/backend/utils/mmgr/bump.c  | 8 ++--
 src/test/regress/expected/sysviews.out | 2 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/mmgr/bump.c b/src/backend/utils/mmgr/bump.c
index 449bd29344..83798e2245 100644
--- a/src/backend/utils/mmgr/bump.c
+++ b/src/backend/utils/mmgr/bump.c
@@ -342,8 +342,12 @@ BumpAllocLarge(MemoryContext context, Size size, int flags)
randomize_mem((char *) MemoryChunkGetPointer(chunk), size);
 #endif
 
-   /* add the block to the list of allocated blocks */
-   dlist_push_head(&set->blocks, &block->node);
+   /*
+* Add the block to the tail of allocated blocks list.  The current 
block
+* is left at the head of the list as it may still have space for
+* non-large allocations.
+*/
+   dlist_push_tail(&set->blocks, &block->node);
 
 #ifdef MEMORY_CONTEXT_CHECKING
/* Ensure any padding bytes are marked NOACCESS. */
diff --git a/src/test/regress/expected/sysviews.out 
b/src/test/regress/expected/sysviews.out
index 634dc8d8b8..2f3eb4e7f1 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -47,7 +47,7 @@ select name, parent, total_bytes > 0, total_nblocks, 
free_bytes > 0, free_chunks
 from pg_backend_memory_contexts where name = 'Caller tuples';
  name  | parent | ?column? | total_nblocks | ?column? | 
free_chunks 
 
---++--+---+--+-
- Caller tuples | TupleSort sort | t| 3 | t|
   0
+ Caller tuples | TupleSort sort | t| 2 | t|
   0
 (1 row)
 
 rollback;
-- 
2.40.1



Re: Add bump memory context type and use it for tuplesorts

2024-04-16 Thread David Rowley
On Tue, 16 Apr 2024 at 17:13, Amul Sul  wrote:
> Attached is a small patch adding the missing BumpContext description to the
> README.

Thanks for noticing and working on the patch.

There were a few things that were not quite accurate or were misleading:

1.

> +These three memory contexts aim to free memory back to the operating system

That's not true for bump. It's the worst of the 4.  Worse than aset.
It only returns memory when the context is reset/deleted.

2.

"These memory contexts were initially developed for ReorderBuffer, but
may be useful elsewhere as long as the allocation patterns match."

The above isn't true for bump.  It was written for tuplesort.  I think
we can just remove that part now.  Slab and generation are both old
enough not to care why they were conceived.

Also since adding bump, I think the choice of which memory context to
use is about 33% harder than it used to be when there were only 3
context types.  I think this warrants giving more detail on what these
3 special-purpose memory allocators are good for.  I've added more
details in the attached patch.  This includes details about the
freeing of malloc'd blocks.

I've tried to detail out enough of the specialities of the context
type without going into extensive detail. My hope is that there will
be enough detail for someone to choose the most suitable looking one
and head over to the corresponding .c file to find out more.

Is that about the right level of detail?

David
diff --git a/src/backend/utils/mmgr/README b/src/backend/utils/mmgr/README
index f484f7d6f5..695088bb66 100644
--- a/src/backend/utils/mmgr/README
+++ b/src/backend/utils/mmgr/README
@@ -471,25 +471,32 @@ thrashing.
 Alternative Memory Context Implementations
 --
 
-aset.c is our default general-purpose implementation, working fine
-in most situations. We also have two implementations optimized for
-special use cases, providing either better performance or lower memory
-usage compared to aset.c (or both).
-
-* slab.c (SlabContext) is designed for allocations of fixed-length
-  chunks, and does not allow allocations of chunks with different size.
-
-* generation.c (GenerationContext) is designed for cases when chunks
-  are allocated in groups with similar lifespan (generations), or
-  roughly in FIFO order.
-
-Both memory contexts aim to free memory back to the operating system
-(unlike aset.c, which keeps the freed chunks in a freelist, and only
-returns the memory when reset/deleted).
-
-These memory contexts were initially developed for ReorderBuffer, but
-may be useful elsewhere as long as the allocation patterns match.
-
+aset.c (AllocSetContext) is our default general-purpose allocator.  Three other
+allocator types also exist which are special-purpose:
+
+* slab.c (SlabContext) is designed for allocations of fixed-sized
+  chunks.  The fixed chunk size must be specified when creating the context.
+  New chunks are allocated to the fullest block, keeping used chunks densely
+  packed together to avoid memory fragmentation.  This also increases the
+  chances that pfree'ing a chunk will result in a block becoming empty of all
+  chunks and allow it to be free'd back to the operating system.
+
+* generation.c (GenerationContext) is best suited for cases when chunks are
+  allocated in groups with similar lifespan (generations), or roughly in FIFO
+  order.  No attempt is made to reuse space left by pfree'd chunks.  Blocks
+  are returned to the operating system when all chunks on them have been
+  pfree'd.
+
+* bump.c (BumpContext) is best suited for use cases that require densely
+  allocated chunks of memory that never need to be individually pfree'd or
+  repalloc'd.  These operations are unsupported due to BumpContext chunks
+  having no chunk header.  No chunk header means more densely packed chunks,
+  which is especially useful for workloads that perform lots of small
+  allocations.  Blocks are only free'd back to the operating system when the
+  context is reset or deleted.
+
+For further details, please read the header comment in the corresponding .c
+file.
 
 Memory Accounting
 -


Re: Differential code coverage between 16 and HEAD

2024-04-15 Thread David Rowley
On Tue, 16 Apr 2024 at 14:29, Andres Freund  wrote:
> I think total_nblocks might also not be entirely stable?

I think it is stable for this test.  However, I'll let the buildfarm
make the final call on that.

The reason I want to include it is that I'd like to push the large
allocations to the tail of the block list and make this workload use 2
blocks rather than 3.  If I fix that and update the test then it's a
bit of coverage to help ensure that doesn't get broken again.

> How about just
> checking if total_bytes, total_nblocks, free_bytes and used_bytes are bigger
> than 0?

Seems like a good idea.  I've done it that way and pushed.

Thanks

David




Re: Stability of queryid in minor versions

2024-04-15 Thread David Rowley
On Tue, 16 Apr 2024 at 12:10, Michael Paquier  wrote:
> Not sure that this is an improvement in clarity.  There are a few
> bullet points that treat about the instability of the query ID, and
> your patch is now mixing the query ID being different for two
> mostly-identical queries on the same host with larger conditions like
> the environment involved.  Perhaps it would be better to move the last
> sentence of the first  ("Furthermore, it is not safe..") with
> the part you are adding about replication in this paragraph.

Yeah, I think this is better.  I think the attached is what you mean.

It makes sense to talk about the hashing variations closer to the
object identifier part.

David
diff --git a/doc/src/sgml/pgstatstatements.sgml 
b/doc/src/sgml/pgstatstatements.sgml
index 44dd4db7ce..38c4ce3241 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -636,18 +636,21 @@
machine architecture and other facets of the platform.
Furthermore, it is not safe to assume that 
queryid
will be stable across major versions of 
PostgreSQL.
+   Two servers participating in replication based on physical WAL replay can
+   be expected to have identical queryid values for
+   the same query.  However, logical replication schemes do not promise to
+   keep replicas identical in all relevant details, so
+   queryid will not be a useful identifier for
+   accumulating costs across a set of logical replicas.
+   If in doubt, direct testing is recommended.
   
 
   
-   As a rule of thumb, queryid values can be 
assumed to be
-   stable and comparable only so long as the underlying server version and
-   catalog metadata details stay exactly the same.  Two servers
-   participating in replication based on physical WAL replay can be expected
-   to have identical queryid values for the same 
query.
-   However, logical replication schemes do not promise to keep replicas
-   identical in all relevant details, so queryid 
will
-   not be a useful identifier for accumulating costs across a set of logical
-   replicas.  If in doubt, direct testing is recommended.
+   Generally, it can be assumed that queryid values
+   are stable between minor version releases of 
PostgreSQL,
+   providing that instances are running on the same machine architecture and
+   the catalog metadata details match.  Compatibility will only be broken
+   between minor versions as a last resort.
   
 
   


Re: Differential code coverage between 16 and HEAD

2024-04-15 Thread David Rowley
On Tue, 16 Apr 2024 at 10:57, Andres Freund  wrote:
> I guess was thinking more about BumpIsEmpty() and BumpStats() then the "bogus"
> cases. But BumpIsEmpty() likely is unreachable as well.

The only call to MemoryContextIsEmpty() I see is AtSubCommit_Memory()
and it's on an aset.c context type. I see generation.c has the same
issue per [1].

> BumpStats() is
> reachable, but perhaps it's not worth it?
>
> BEGIN;
> DECLARE foo CURSOR FOR SELECT LEFT(a,10),b FROM (VALUES(REPEAT('a', 512 * 
> 1024),1),(REPEAT('b', 512 * 1024),2)) v(a,b) ORDER BY v.a DESC;
> FETCH 1 FROM foo;
> SELECT * FROM pg_backend_memory_contexts WHERE name = 'Caller tuples';

I think primarily it's good to exercise that code just to make sure it
does not crash.  Looking at the output of the above on my machine:

 name  | ident | parent | level | total_bytes |
total_nblocks | free_bytes | free_chunks | used_bytes
---+---++---+-+---++-+
 Caller tuples |   | TupleSort sort | 6 | 1056848 |
 3 |   8040 |   0 |1048808
(1 row)

I'd say:

Name: stable
ident: stable
parent: stable
level: could change from a refactor of code
total_bytes: could be different on other platforms or dependent on
MEMORY_CONTEXT_CHECKING
total_nblocks: stable enough
free_bytes: could be different on other platforms or dependent on
MEMORY_CONTEXT_CHECKING
free_chunks: always 0
used_bytes: could be different on other platforms or dependent on
MEMORY_CONTEXT_CHECKING

I've attached a patch which includes your test with unstable columns
stripped out.

I cut the 2nd row down to just 512 bytes as I didn't see the need to
add two large datums.  Annoyingly it still uses 3 blocks as I've opted
to do dlist_push_head(&set->blocks, &block->node); in BumpAllocLarge(),
which is the block that's picked up again in BumpAlloc() per block =
dlist_container(BumpBlock, node, dlist_head_node(&set->blocks));  I
wonder if the large blocks should push tail instead.

> Hm, independent of this, seems a bit odd that we don't include the memory
> context type in pg_backend_memory_contexts?

That seems like useful information to include.  It sure would be
useful to have in there to verify that I'm testing BumpStats().  I've
written a patch [2].

David

[1] 
https://coverage.postgresql.org/src/backend/utils/mmgr/generation.c.gcov.html#997
[2] 
https://postgr.es/m/CAApHDvrXX1OR09Zjb5TnB0AwCKze9exZN=9nxxg1zcvv8w-...@mail.gmail.com
diff --git a/src/test/regress/expected/sysviews.out 
b/src/test/regress/expected/sysviews.out
index 9be7aca2b8..f537a1aeee 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -28,6 +28,26 @@ select name, ident, parent, level, total_bytes >= free_bytes
  TopMemoryContext |   || 0 | t
 (1 row)
 
+-- We can exercise some MemoryContext type stats functions.  Most of the
+-- column values are too platform-dependant to display.
+begin;
+declare cur cursor for select left(a,10), b
+from (values(repeat('a', 512 * 1024),1),(repeat('b', 512),2)) v(a,b)
+order by v.a desc;
+fetch 1 from cur;
+left| b 
++---
+ bb | 2
+(1 row)
+
+select name, parent, total_nblocks, free_chunks
+from pg_backend_memory_contexts where name = 'Caller tuples';
+ name  | parent | total_nblocks | free_chunks 
+---++---+-
+ Caller tuples | TupleSort sort | 3 |   0
+(1 row)
+
+rollback;
 -- At introduction, pg_config had 23 entries; it may grow
 select count(*) > 20 as ok from pg_config;
  ok 
diff --git a/src/test/regress/expected/tuplesort.out 
b/src/test/regress/expected/tuplesort.out
index 0e8b5bf4a3..6dd97e7427 100644
--- a/src/test/regress/expected/tuplesort.out
+++ b/src/test/regress/expected/tuplesort.out
@@ -343,6 +343,19 @@ ORDER BY ctid DESC LIMIT 5;
 (5 rows)
 
 ROLLBACK;
+
+-- test sorting of large datums VALUES
+
+-- Ensure the order is correct and values look intact
+SELECT LEFT(a,10),b FROM
+(VALUES(REPEAT('a', 512 * 1024),1),(REPEAT('b', 512 * 1024),2)) v(a,b)
+ORDER BY v.a DESC;
+left| b 
++---
+ bb | 2
+ aa | 1
+(2 rows)
+
 
 -- test forward and backward scans for in-memory and disk based tuplesort
 
diff --git a/src/test/regress/sql/sysviews.sql 
b/src/test/regress/sql/sysviews.sql
index 6b4e24601d..000caaea21 100644
--- a/src/test/regress/sql/sysviews.sql
+++ b/src/test/regress/sql/sysviews.sql
@@ -17,6 +17,17 @@ select count(*) >= 0 as ok from pg_available_extensions;
 select name, ident, parent, level, total_bytes >= free_bytes
   from pg_backend_memory_contexts where level = 0;
 
+-- We can exercise some MemoryContext type stats functions.  Most of the
+-- column values are too platform-dependant to display.
+begin;
+declare cur cursor for select left(a,10), b
+from (values(repeat('a', 512 * 1024),1),(repeat('b', 512),2)) v(a,b)
+order 

Add memory context type to pg_backend_memory_contexts view

2024-04-15 Thread David Rowley
In [1] Andres mentioned that there's no way to determine the memory
context type in pg_backend_memory_contexts. This is a bit annoying as
I'd like to add a test to exercise BumpStats().

Having the context type in the test's expected output helps ensure we
are exercising BumpStats() and any future changes to the choice of
context type in tuplesort.c gets flagged up by the test breaking.

It's probably too late for PG17, but I'll leave this here for the July CF.

David

[1] 
https://www.postgresql.org/message-id/20240415225749.xg7uq2hwuq2qm...@awork3.anarazel.de
From 0a58697a4b88bc3ac80f09ed78b56ebe903a2aae Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Tue, 16 Apr 2024 13:05:34 +1200
Subject: [PATCH v1] Add context type field to pg_backend_memory_contexts

---
 doc/src/sgml/system-views.sgml |  9 +
 src/backend/utils/adt/mcxtfuncs.c  | 50 ++
 src/include/catalog/pg_proc.dat|  6 ++--
 src/test/regress/expected/rules.out|  5 +--
 src/test/regress/expected/sysviews.out |  8 ++---
 src/test/regress/sql/sysviews.sql  |  2 +-
 6 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 7ed617170f..18ae5b9138 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -463,6 +463,15 @@
 
 
 
+ 
+  
+   type text
+  
+  
+   Type of the memory context
+  
+ 
+
  
   
name text
diff --git a/src/backend/utils/adt/mcxtfuncs.c 
b/src/backend/utils/adt/mcxtfuncs.c
index 4d4a70915b..fe52d57fd4 100644
--- a/src/backend/utils/adt/mcxtfuncs.c
+++ b/src/backend/utils/adt/mcxtfuncs.c
@@ -36,12 +36,13 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
 TupleDesc 
tupdesc, MemoryContext context,
 const char 
*parent, int level)
 {
-#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS 9
+#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS 10
 
Datum   values[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS];
boolnulls[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS];
MemoryContextCounters stat;
MemoryContext child;
+   const char *type;
const char *name;
const char *ident;
 
@@ -67,10 +68,31 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
memset(values, 0, sizeof(values));
memset(nulls, 0, sizeof(nulls));
 
+   switch (context->type)
+   {
+   case T_AllocSetContext:
+   type = "AllocSet";
+   break;
+   case T_GenerationContext:
+   type = "Generation";
+   break;
+   case T_SlabContext:
+   type = "Slab";
+   break;
+   case T_BumpContext:
+   type = "Bump";
+   break;
+   default:
+   type = "???";
+   break;
+   }
+
+   values[0] = CStringGetTextDatum(type);
+
if (name)
-   values[0] = CStringGetTextDatum(name);
+   values[1] = CStringGetTextDatum(name);
else
-   nulls[0] = true;
+   nulls[1] = true;
 
if (ident)
{
@@ -86,22 +108,22 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
 
memcpy(clipped_ident, ident, idlen);
clipped_ident[idlen] = '\0';
-   values[1] = CStringGetTextDatum(clipped_ident);
+   values[2] = CStringGetTextDatum(clipped_ident);
}
else
-   nulls[1] = true;
+   nulls[2] = true;
 
if (parent)
-   values[2] = CStringGetTextDatum(parent);
+   values[3] = CStringGetTextDatum(parent);
else
-   nulls[2] = true;
-
-   values[3] = Int32GetDatum(level);
-   values[4] = Int64GetDatum(stat.totalspace);
-   values[5] = Int64GetDatum(stat.nblocks);
-   values[6] = Int64GetDatum(stat.freespace);
-   values[7] = Int64GetDatum(stat.freechunks);
-   values[8] = Int64GetDatum(stat.totalspace - stat.freespace);
+   nulls[3] = true;
+
+   values[4] = Int32GetDatum(level);
+   values[5] = Int64GetDatum(stat.totalspace);
+   values[6] = Int64GetDatum(stat.nblocks);
+   values[7] = Int64GetDatum(stat.freespace);
+   values[8] = Int64GetDatum(stat.freechunks);
+   values[9] = Int64GetDatum(stat.totalspace - stat.freespace);
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
 
for (child = context->firstchild; child != NULL; child = 
child->nextchild)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 134e3b22fd..adda675887 100644
--- 

Re: Differential code coverage between 16 and HEAD

2024-04-15 Thread David Rowley
On Mon, 15 Apr 2024 at 10:33, Andres Freund  wrote:
> - The new bump allocator has a fair amount of uncovered functionality:
>   
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/utils/mmgr/bump.c.gcov.html#L293

The attached adds a test to tuplesort to exercise BumpAllocLarge()

>   
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/utils/mmgr/bump.c.gcov.html#L613

I don't see a way to exercise those. They're meant to be "can't
happen" ERRORs.  I could delete them and use BogusFree, BogusRealloc,
BogusGetChunkContext, BogusGetChunkSpace instead, but the ERROR
message would be misleading. I think it's best just to leave this.

David
diff --git a/src/test/regress/expected/tuplesort.out 
b/src/test/regress/expected/tuplesort.out
index 0e8b5bf4a3..6dd97e7427 100644
--- a/src/test/regress/expected/tuplesort.out
+++ b/src/test/regress/expected/tuplesort.out
@@ -343,6 +343,19 @@ ORDER BY ctid DESC LIMIT 5;
 (5 rows)
 
 ROLLBACK;
+
+-- test sorting of large datums VALUES
+
+-- Ensure the order is correct and values look intact
+SELECT LEFT(a,10),b FROM
+(VALUES(REPEAT('a', 512 * 1024),1),(REPEAT('b', 512 * 1024),2)) v(a,b)
+ORDER BY v.a DESC;
+left| b 
++---
+ bb | 2
+ aa | 1
+(2 rows)
+
 
 -- test forward and backward scans for in-memory and disk based tuplesort
 
diff --git a/src/test/regress/sql/tuplesort.sql 
b/src/test/regress/sql/tuplesort.sql
index 658fe98dc5..8476e594e6 100644
--- a/src/test/regress/sql/tuplesort.sql
+++ b/src/test/regress/sql/tuplesort.sql
@@ -146,6 +146,15 @@ FROM abbrev_abort_uuids
 ORDER BY ctid DESC LIMIT 5;
 ROLLBACK;
 
+
+-- test sorting of large datums VALUES
+
+
+-- Ensure the order is correct and values look intact
+SELECT LEFT(a,10),b FROM
+(VALUES(REPEAT('a', 512 * 1024),1),(REPEAT('b', 512 * 1024),2)) v(a,b)
+ORDER BY v.a DESC;
+
 
 -- test forward and backward scans for in-memory and disk based tuplesort
 


Re: Why is parula failing?

2024-04-14 Thread David Rowley
On Mon, 15 Apr 2024 at 16:10, Robins Tharakan  wrote:
> - I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
> - I'll keep them (stuck) for this week, in case there's more we can get
> from them (and to see how long they take)
> - Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt - 
> REL_16_STABLE)

Thanks for getting those.

#4  0x0090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
delay = <optimized out>
delay_ms = <optimized out>
endtime = 0

This endtime looks like a problem. It seems unlikely to be caused by
gettimeofday's timeval fields being zeroed given that the number of
seconds should have been added to that.

I can't quite make sense of how we end up sleeping at all with a zero
endtime. Assuming the subsequent GetNowFloat() worked, "delay =
endtime - GetNowFloat();" would result in a negative sleep duration
and we'd break out of the sleep loop.

If GetNowFloat() somehow was returning a negative number then we could
end up with a large delay.  But if gettimeofday() was so badly broken
then wouldn't there be some evidence of this in the log timestamps on
failing runs?

I'm not that familiar with the buildfarm config, but I do see some
Valgrind related setting in there. Is PostgreSQL running under
Valgrind on these runs?

David




Re: Stability of queryid in minor versions

2024-04-14 Thread David Rowley
On Mon, 15 Apr 2024 at 14:09, Michael Paquier  wrote:
>
> On Mon, Apr 15, 2024 at 01:31:47PM +1200, David Rowley wrote:
> > I'm unsure if "Rule of thumb" is the correct way to convey that. We
> > can't really write "We endeavour to", as who is "We".  Maybe something
> > like "Generally, it can be assumed that queryid is stable between all
> > minor versions of a major version of ..., providing that <insert reasons>".
>
> It sounds to me that the term "best-effort" is adapted here?  Like in
> "The compatibility of query IDs is preserved across minor versions on
> a best-effort basis.  It is possible that the post-parse-analysis tree
> changes across minor releases, impacting the value of queryid for the
> same query run across two different minor versions.".

I had another try and ended up pushing the logical / physical replica
details up to the paragraph above. It seems more relevant to mention
this in the section which details reasons why the queryid can be
unstable due to metadata variations.  I think keeping the 2nd
paragraph for reasons it's stable is a good separation of
responsibility.  I didn't include the "best-effort" word, but here's
what I did come up with.

David
diff --git a/doc/src/sgml/pgstatstatements.sgml 
b/doc/src/sgml/pgstatstatements.sgml
index 44dd4db7ce..302421306e 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -632,22 +632,25 @@
pg_stat_statements will consider two 
apparently-identical
queries to be distinct, if they reference a table that was dropped
and recreated between the executions of the two queries.
+   Two servers participating in replication based on physical WAL replay can
+   be expected to have identical queryid values for
+   the same query.  However, logical replication schemes do not promise to
+   keep replicas identical in all relevant details, so
+   queryid will not be a useful identifier for
+   accumulating costs across a set of logical replicas.
The hashing process is also sensitive to differences in
machine architecture and other facets of the platform.
Furthermore, it is not safe to assume that 
queryid
will be stable across major versions of 
PostgreSQL.
+   If in doubt, direct testing is recommended.
   
 
   
-   As a rule of thumb, queryid values can be 
assumed to be
-   stable and comparable only so long as the underlying server version and
-   catalog metadata details stay exactly the same.  Two servers
-   participating in replication based on physical WAL replay can be expected
-   to have identical queryid values for the same 
query.
-   However, logical replication schemes do not promise to keep replicas
-   identical in all relevant details, so queryid 
will
-   not be a useful identifier for accumulating costs across a set of logical
-   replicas.  If in doubt, direct testing is recommended.
+   Generally, it can be assumed that queryid values
+   are stable between minor version releases of 
PostgreSQL,
+   providing that instances are running on the same machine architecture and
+   the catalog metadata details match.  Compatibility will only be broken
+   between minor versions as a last resort.
   
 
   


Re: Fix out-of-bounds in the function GetCommandTagName

2024-04-14 Thread David Rowley
On Mon, 15 Apr 2024 at 12:12, Ranier Vilela  wrote:
>
> Em dom., 14 de abr. de 2024 às 20:38, David Rowley  
> escreveu:
>> You seem to have forgotten to attach it, but my comments above were
>> written with the assumption that the patch is what I've attached here.
>
> Yes, I actually forgot.
>
> +1 for your patch.

I've added a CF entry under your name for this:
https://commitfest.postgresql.org/48/4927/

If it was code new to PG17 I'd be inclined to go ahead with it now,
but it does not seem to align with making the release more stable.
I'd bet others will feel differently about that.  Delaying seems a
better default choice at least.

David




Re: Stability of queryid in minor versions

2024-04-14 Thread David Rowley
On Mon, 15 Apr 2024 at 13:37, David G. Johnston
 wrote:
> Seems we can improve things by simply removing the "rule of thumb" sentence 
> altogether.  The prior paragraph states the things the queryid depends upon 
> at the level of detail the reader needs.

I don't think that addresses the following, which I mentioned earlier:

> but not stable across *major* versions does *not* mean stable across
> *minor* versions. The reader is just left guessing if that's true.

David




Re: Stability of queryid in minor versions

2024-04-14 Thread David Rowley
On Mon, 15 Apr 2024 at 13:19, Tom Lane  wrote:
>
> Michael Paquier  writes:
> > On Mon, Apr 15, 2024 at 11:20:16AM +1200, David Rowley wrote:
> >> 1. We cannot change Node enums in minor versions
> >> 2. We're *unlikely* to add fields to Node types in minor versions, and
> >> if we did we'd likely be leaving them out of the jumble calc, plus it
> >> seems highly unlikely any new field we wedged into the padding would
> >> relate at all to the parsed query.
>
> > Since 16 these new fields would be added by default unless the node
> > attribute query_jumble_ignore is appended to it.
>
> They'd also be written/read by outfuncs/readfuncs, thereby breaking
> stored views/rules if the Node is one that can appear in a parsetree.
> So the bar to making such a change in a stable branch would be very
> high.

I think a soft guarantee in the docs for it being stable in minor
versions would be ok then.

I'm unsure if "Rule of thumb" is the correct way to convey that. We
can't really write "We endeavour to", as who is "We".  Maybe something
like "Generally, it can be assumed that queryid is stable between all
minor versions of a major version of ..., providing that <insert reasons>".

David



