Regarding candidate_restart_lsn in logical decoding.

2024-09-01 Thread reddy manjunath
Hi,
I have created my own output plugin for logical decoding.
I am storing the decoded data in Apache Kafka via the pg_recvlogical utility.
Using pg_recvlogical, I am updating the slot's confirmed_flush_lsn based on the
value that I'm storing in Kafka; this is done every 10 seconds.
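
For reference, the setup looks roughly like the commands below (slot, database,
and plugin names are placeholders; in my case the confirmed_flush_lsn feedback is
driven by what has safely reached Kafka rather than only the default status
interval):

# create a slot bound to the custom output plugin (names are placeholders)
pg_recvlogical -d mydb --slot=kafka_slot --create-slot --plugin=my_output_plugin

# stream changes; --status-interval controls how often feedback (which can
# advance confirmed_flush_lsn) is sent back to the server, in seconds
pg_recvlogical -d mydb --slot=kafka_slot --start -f - --status-interval=10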

In the case of a walsender shutdown I found that candidate_restart_lsn is not
cleared from shared memory. I also found that, just after restarting,
candidate_restart_lsn is sometimes far greater than the slot's actual
restart_lsn; because of this, frequent checkpoints are deleting the required
WAL files.

Can I clear candidate_restart_lsn in the plugin's startup callback? Are there
any consequences to doing this?

Thanks and Regards,
G R MANJUNATH.


Invalid Assert while validating REPLICA IDENTITY?

2024-09-01 Thread Dilip Kumar
While working on some other code I noticed that in
FindReplTupleInLocalRel() there is an assert [1] that seems to be
passing the index relation to GetRelationIdentityOrPK(), whereas it
should be passing the normal (heap) relation.

[1]
if (OidIsValid(localidxoid))
{
#ifdef USE_ASSERT_CHECKING
Relation idxrel = index_open(localidxoid, AccessShareLock);

/* Index must be PK, RI, or usable for REPLICA IDENTITY FULL tables */
Assert(GetRelationIdentityOrPK(idxrel) == localidxoid ||
IsIndexUsableForReplicaIdentityFull(BuildIndexInfo(idxrel),
edata->targetRel->attrmap));
index_close(idxrel, AccessShareLock);
#endif

In the above code, we are passing idxrel to GetRelationIdentityOrPK(),
whereas we should be passing localrel.

The fix should be:


@@ -2929,7 +2929,7 @@ FindReplTupleInLocalRel(ApplyExecutionData *edata, Relation localrel,

 			Relation	idxrel = index_open(localidxoid,
 											AccessShareLock);

 			/* Index must be PK, RI, or usable for REPLICA IDENTITY FULL tables */
-			Assert(GetRelationIdentityOrPK(idxrel) == localidxoid ||
+			Assert(GetRelationIdentityOrPK(localrel) == localidxoid ||
 				   IsIndexUsableForReplicaIdentityFull(BuildIndexInfo(idxrel),
 													   edata->targetRel->attrmap));

 			index_close(idxrel, AccessShareLock);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-09-01 Thread Michail Nikolaev
Hello, Matthias!

Just wanted to update you with some information about the next steps in
work.

> In heapam_index_build_range_scan, it seems like you're popping the
> snapshot and registering a new one while holding a tuple from
> heap_getnext(), thus while holding a page lock. I'm not so sure that's
> OK, expecially when catalogs are also involved (specifically for
> expression indexes, where functions could potentially be updated or
> dropped if we re-create the visibility snapshot)

I have returned to the solution with a dedicated catalog_xmin for backends
[1].
Additionally, I have added catalog_xmin to pg_stat_activity [2].

> In heapam_index_build_range_scan, you pop the snapshot before the
> returned heaptuple is processed and passed to the index-provided
> callback. I think that's incorrect, as it'll change the visibility of
> the returned tuple before it's passed to the index's callback. I think
> the snapshot manipulation is best added at the end of the loop, if we
> add it at all in that function.

Now it's fixed, and the snapshot is reset between pages [3].

Additionally, I resolved the issue with potential duplicates in unique
indexes. It looks a bit clunky, but it works for now [4].

A single commit from [5] is also included, just for stable stress testing.

The full diff is available at [6].

Best regards,
Mikhail.

[1]:
https://github.com/michail-nikolaev/postgres/commit/01a47623571592c52c7a367f85b1cff9d8b593c0
[2]:
https://github.com/michail-nikolaev/postgres/commit/d3345d60bd51fe2e0e4a73806774b828f34ba7b6
[3]:
https://github.com/michail-nikolaev/postgres/commit/7d1dd4f971e8d03f38de95f82b730635ffe09aaf
[4]:
https://github.com/michail-nikolaev/postgres/commit/4ad56e14dd504d5530657069068c2bdf172e482d
[5]: https://commitfest.postgresql.org/49/5160/
[6]:
https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach?diff=split&w=


Re: scalability bottlenecks with (many) partitions (and more)

2024-09-01 Thread Robert Haas
On Sun, Sep 1, 2024 at 3:30 PM Tomas Vondra  wrote:
> I don't think that's possible with hard-coded size of the array - that
> allocates the memory for everyone. We'd need to make it variable-length,
> and while doing those benchmarks I think we actually already have a GUC
> for that - max_locks_per_transaction tells us exactly what we need to
> know, right? I mean, if I know I'll need ~1000 locks, why not to make
> the fast-path array large enough for that?

I really like this idea. I'm not sure about exactly how many fast path
slots you should get for what value of max_locks_per_transaction, but
coupling the two things together in some way sounds smart.

> Of course, the consequence of this would be making PGPROC variable
> length, or having to point to a memory allocated separately (I prefer
> the latter option, I think). I haven't done any experiments, but it
> seems fairly doable - of course, not sure if it might be more expensive
> compared to compile-time constants.

I agree that this is a potential problem but it sounds like the idea
works well enough that we'd probably still come out quite far ahead
even with a bit more overhead.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: query_id, pg_stat_activity, extended query protocol

2024-09-01 Thread Michael Paquier
On Sat, Aug 31, 2024 at 09:47:41AM +0800, jian he wrote:
> /* test \bind queryid exists */
> select query_id is not null as query_id_exist
> from pg_stat_activity where pid = pg_backend_pid() \bind \g
> 
> /* test that \parse \bind_named queryid exists */
> select pg_backend_pid() as current_pid \gset pref01_
> select query_id is not null as query_id_exist from pg_stat_activity
> where pid = $1 \parse stmt11
> \bind_named stmt11 :pref01_current_pid \g

I need to spend a bit more time with my head down for this thread, but
couldn't we use these commands with various query patterns in
pg_stat_statements and look at the shmem counters reported through its
view?
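
As a rough sketch of what I mean (assuming pg_stat_statements is installed and
its tracking enabled), for each query pattern we could compare the counters
before and after, along these lines:

select pg_stat_statements_reset();
select $1::int + $2::int \bind 1 2 \g
select calls, queryid is not null as has_queryid
  from pg_stat_statements
 where query like 'select $1::int + $2::int%';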
--
Michael


signature.asc
Description: PGP signature


Re: Re: bt Scankey in another contradictory case

2024-09-01 Thread bigbro...@hotmail.com
Hi,
I have reanalysed the code of the function _bt_first. I notice that, when
using a multi-attribute index, if we can't identify the starting boundaries
and the following attributes are marked as not required, that means we need
to start at the first or last page of the index and examine every tuple to
see whether it satisfies the qual; meanwhile, the scan will be stopped once
the first attribute fails to evaluate.

For instance:
create table c_s( x int, y int);
insert into c_s select generate_series(1, 2), generate_series(1, 2);
create index my_c_s_idx on c_s using btree(x,y);
explain (analyze, buffers) select * from c_s where x >4000 and y >10 and y <10 order by x desc;

 Index Only Scan Backward using my_c_s_idx on c_s  (cost=0.29..384.31 rows=1 width=8) (actual time=1.302..1.304 rows=0 loops=1)
   Index Cond: ((x > 4000) AND (y > 10) AND (y < 10))
   Heap Fetches: 0
   Buffers: shared read=46
 Planning:
   Buffers: shared hit=51 read=15
 Planning Time: 1.311 ms
 Execution Time: 1.435 ms
(8 rows)

The example is a little different from the description above due to the
implied NOT NULL scankey, but that has no effect on the overall situation.

What's more, if I change the number 4000 to 1000:
------------------------------------------------------------------------------------------------
 Sort  (cost=441.01..441.01 rows=1 width=8) (actual time=2.974..2.975 rows=0 loops=1)
   Sort Key: x DESC
   Sort Method: quicksort  Memory: 25kB
   Buffers: shared hit=89
   ->  Seq Scan on c_s  (cost=0.00..441.00 rows=1 width=8) (actual time=2.971..2.971 rows=0 loops=1)
         Filter: ((x > 1000) AND (y > 10) AND (y < 10))
         Rows Removed by Filter: 2
         Buffers: shared hit=89
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.113 ms
 Execution Time: 2.990 ms
(12 rows)

The planner chose the Seq Scan, and the executor did the unnecessary work
twice.

Let's not confine ourselves to plain attributes versus row comparisons, or
Seq Scan versus Index Scan. We can treat a row comparison as a
multi-attribute comparison. The qual is in implicit-AND form, which means
that once one attribute is self-contradictory, we can abandon the whole qual
immediately.
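
As a point of comparison (a hypothetical example, not part of the patch): the
planner already folds a qual that is contradictory without looking at any
column, but not the column-level case discussed here:

-- planned as a Result node with "One-Time Filter: false"; the table is never scanned
explain select * from c_s where 1 > 2;

-- still planned as a full scan that evaluates the contradictory qual as an
-- ordinary filter against every row, as in the plans above
explain select * from c_s where y > 10 and y < 10;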

Maybe we can do a more efficient job at planning time. For example, at the
phase where the function deconstruct_recurse is invoked, we could identify a
qual that is self-contradictory and flag it. With this information we would
know which relations are what I'll arbitrarily call dummy relations.

For instance:

explain (analyze, buffers) select * from c_s a , c_s b where a.y >10 and a.y<10 and a.x=b.x;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..393.31 rows=1 width=16) (actual time=1.858..1.858 rows=0 loops=1)
   Buffers: shared hit=89
   ->  Seq Scan on c_s a  (cost=0.00..389.00 rows=1 width=8) (actual time=1.857..1.857 rows=0 loops=1)
         Filter: ((y > 10) AND (y < 10))
         Rows Removed by Filter: 2
         Buffers: shared hit=89
   ->  Index Only Scan using my_c_s_idx on c_s b  (cost=0.29..4.30 rows=1 width=8) (never executed)
         Index Cond: (x = a.x)
         Heap Fetches: 0
 Planning:
   Buffers: shared hit=12
 Planning Time: 0.236 ms
 Execution Time: 1.880 ms
(13 rows)

After deconstruct_recurse is invoked, the qual (a.y > 10 AND a.y < 10) is
distributed to relation "a", and relation "a" is a dummy relation. When "a"
and "b" are inner-joined, "a" will supply nothing, which means the inner-join
rel is a dummy relation too. If there is an outer join above the inner join
and something that can commute with it, we can do an even more efficient job,
and so on.
This would also benefit the path-competition phase, since the feature doesn't
cost anything.

Whether the idea is feasible needs discussion, and it will take a lot of
effort to complete this feature. Anyway, we can first fix the issues we have
already encountered.



bigbro...@hotmail.com
 
From: Peter Geoghegan
Date: 2024-08-30 22:32
To: b ro
CC: pgsql-hackers
Subject: Re: bt Scankey in another contradictory case
On Fri, Aug 30, 2024 at 7:36 AM b ro  wrote:
>   this is the patch attachment.
 
We discussed this recently:
 
https://www.postgresql.org/message-id/80384.1715458896%40sss.pgh.pa.us
 
I think that we should do this.
 
It doesn't make a huge difference in practice, because we'll still end
the scan once the leaf level is reached. But it matters more when
array keys are involved, where there might be more than one descent to
the leaf level. Plus we might as well just be thorough about this
stuff.
 
-- 
Peter Geoghegan


Re: Collect statistics about conflicts in logical replication

2024-09-01 Thread Peter Smith
On Fri, Aug 30, 2024 at 4:24 PM shveta malik  wrote:
>
> On Fri, Aug 30, 2024 at 10:53 AM Peter Smith  wrote:
> >
...
> > 2. Arrange all the counts into an intuitive/natural order
> >
> > There is an intuitive/natural ordering for these counts. For example,
> > the 'confl_*' count fields are in the order insert -> update ->
> > delete, which LGTM.
> >
> > Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not
> > in a good order.
> >
> > IMO it makes more sense if everything is ordered as:
> > 'sync_error_count', then 'apply_error_count', then all the 'confl_*'
> > counts.
> >
> > This comment applies to lots of places, e.g.:
> > - docs (doc/src/sgml/monitoring.sgml)
> > - function pg_stat_get_subscription_stats (pg_proc.dat)
> > - view pg_stat_subscription_stats (src/backend/catalog/system_views.sql)
> > - TAP test SELECTs (test/subscription/t/026_stats.pl)
> >
> > As all those places are already impacted by this patch, I think it
> > would be good if (in passing) we (if possible) also swapped the
> > sync/apply counts so everything  is ordered intuitively top-to-bottom
> > or left-to-right.
>
> Not sure about this though. It does not seem to belong to the current patch.
>

Fair enough. But, besides being inappropriate to include in the
current patch, do you think the suggestion to reorder them made sense?
If it has some merit, then I will propose it again as a separate
thread.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: JIT: Remove some unnecessary instructions.

2024-09-01 Thread Xing Guo
On Fri, Aug 30, 2024 at 8:50 PM Andreas Karlsson  wrote:
>
> On 8/30/24 5:55 AM, Xing Guo wrote:
> > I find there are some unnecessary load/store instructions being
> > emitted by the JIT compiler.
>
> Well spotted! All of these are obvious dead instructions and while LLVM
> might be able to optimize them away there is no reason to create extra
> work for the optimizer.
>
> The patch looks good, applies and the tests passes.

Thanks for testing it! I spotted another unnecessary store instruction
and removed it in my V2 patch as well.

Best Regards,
Xing
From ca30976ebf70ebed12ede7999d4ab637a9851641 Mon Sep 17 00:00:00 2001
From: Xing Guo 
Date: Mon, 2 Sep 2024 10:21:37 +0800
Subject: [PATCH v2] JIT: Remove unnecessary load/store instructions.

---
 src/backend/jit/llvm/llvmjit_expr.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 48ccdb942a..7ece2e3d2e 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -720,11 +720,6 @@ llvm_compile_expr(ExprState *state)
 	v_boolnull = l_load(b, TypeStorageBool, v_resnullp, "");
 	v_boolvalue = l_load(b, TypeSizeT, v_resvaluep, "");
 
-	/* set resnull to boolnull */
-	LLVMBuildStore(b, v_boolnull, v_resnullp);
-	/* set revalue to boolvalue */
-	LLVMBuildStore(b, v_boolvalue, v_resvaluep);
-
 	/* check if current input is NULL */
 	LLVMBuildCondBr(b,
 	LLVMBuildICmp(b, LLVMIntEQ, v_boolnull,
@@ -816,11 +811,6 @@ llvm_compile_expr(ExprState *state)
 	v_boolnull = l_load(b, TypeStorageBool, v_resnullp, "");
 	v_boolvalue = l_load(b, TypeSizeT, v_resvaluep, "");
 
-	/* set resnull to boolnull */
-	LLVMBuildStore(b, v_boolnull, v_resnullp);
-	/* set revalue to boolvalue */
-	LLVMBuildStore(b, v_boolvalue, v_resvaluep);
-
 	LLVMBuildCondBr(b,
 	LLVMBuildICmp(b, LLVMIntEQ, v_boolnull,
   l_sbool_const(1), ""),
@@ -875,10 +865,8 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_BOOL_NOT_STEP:
 {
 	LLVMValueRef v_boolvalue;
-	LLVMValueRef v_boolnull;
 	LLVMValueRef v_negbool;
 
-	v_boolnull = l_load(b, TypeStorageBool, v_resnullp, "");
 	v_boolvalue = l_load(b, TypeSizeT, v_resvaluep, "");
 
 	v_negbool = LLVMBuildZExt(b,
@@ -887,8 +875,7 @@ llvm_compile_expr(ExprState *state)
 			l_sizet_const(0),
 			""),
 			  TypeSizeT, "");
-	/* set resnull to boolnull */
-	LLVMBuildStore(b, v_boolnull, v_resnullp);
+
 	/* set revalue to !boolvalue */
 	LLVMBuildStore(b, v_negbool, v_resvaluep);
 
@@ -1615,7 +1602,6 @@ llvm_compile_expr(ExprState *state)
 	LLVMPositionBuilderAtEnd(b, b_argsequal);
 	LLVMBuildStore(b, l_sbool_const(1), v_resnullp);
 	LLVMBuildStore(b, l_sizet_const(0), v_resvaluep);
-	LLVMBuildStore(b, v_retval, v_resvaluep);
 
 	LLVMBuildBr(b, opblocks[opno + 1]);
 	break;
-- 
2.46.0



Re: ANALYZE ONLY

2024-09-01 Thread torikoshia

On 2024-09-01 10:31, Michael Harris wrote:

Hello Atsushi & Melih

Thank you both for your further feedback.

On Thu, 29 Aug 2024 at 21:31, Melih Mutlu  
wrote:
I believe moving "[ ONLY ]" to the definitions of table_and_columns 
below, as you did with "[ * ]", might be better to be consistent with 
other places (see [1])


Agreed. I have updated this.

+ if ((options & VACOPT_VACUUM) && is_partitioned_table && ! 
include_children)


There are also some issues with coding conventions in some places 
(e.g. the space between "!" and "include_children" above). I think 
running pgindent would resolve such issues in most places.


I fixed that extra space, and then ran pgindent. It did not report any
more problems.

On Fri, 30 Aug 2024 at 16:45, torikoshia  
wrote:

-- https://www.postgresql.org/docs/devel/progress-reporting.html
> Note that when ANALYZE is run on a partitioned table, all of its
> partitions are also recursively analyzed.

Should we also note this is the default, i.e. not using ONLY option
behavior here?



-- https://www.postgresql.org/docs/devel/ddl-partitioning.html
> If you are using manual VACUUM or ANALYZE commands, don't forget that
> you need to run them on each child table individually. A command like:
>
> ANALYZE measurement;
> will only process the root table.

This part also should be modified, shouldn't it?


Agreed. I have updated both of these.


Thanks!

When running ANALYZE VERBOSE ONLY on a partitioned table, the INFO
message is like this:

   =# ANALYZE VERBOSE ONLY only_parted;
   INFO:  analyzing "public.only_parted" inheritance tree

I may be wrong, but 'inheritance tree' makes me feel it includes child
tables.
Removing 'inheritance tree' and just describing the table name as below
might be better:

   INFO:  analyzing "public.only_parted"


I'm not sure about that one. If I understand the source correctly,
that particular progress message tells the user that the system is
gathering stats from the inheritance tree in order to update the stats
of the given table, not that it is actually updating the stats of the
descendant tables.


That makes sense.


When analyzing an inheritance structure with the ONLY you see
something like this:

=> ANALYZE VERBOSE ONLY only_inh_parent;
INFO:  analyzing "public.only_inh_parent"
INFO:  "only_inh_parent": scanned 0 of 0 pages, containing 0 live rows
and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO:  analyzing "public.only_inh_parent" inheritance tree
INFO:  "only_inh_child": scanned 1 of 1 pages, containing 3 live rows
and 0 dead rows; 3 rows in sample, 3 estimated total rows
ANALYZE

The reason you don't see the first one for partitioned tables is that
it corresponds
to sampling the contents of the parent table itself, which in the case
of a partitioned
table is guaranteed to be empty, so it is skipped.

I agree the text could be confusing, and in fact is probably confusing
even today
without the ONLY keyword,


Yeah, it seems this isn't dependent on your proposal.
(BTW I'm also wondering if the expression “inheritance" is appropriate 
when the target is a partitioned table, but this should be discussed in 
another thread)



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: optimizing pg_upgrade's once-in-each-database steps

2024-09-01 Thread Nathan Bossart
On Sat, Aug 31, 2024 at 01:18:10AM +0100, Ilya Gladyshev wrote:
> LGTM in general, but here are some final nitpicks:

Thanks for reviewing.

> + if (maxFd != 0)
> + (void) select(maxFd + 1, &input_mask, &output_mask, 
> &except_mask, NULL);
> 
> It's a good idea to check for the return value of select, in case it
> returns any errors.

Done.

> + dbs_complete++;
> + (void) PQgetResult(slot->conn);
> + PQfinish(slot->conn);
> 
> Perhaps it's rather for me to understand, what does PQgetResult call do
> here?

I believe I was trying to follow the guidance that you should always call
PQgetResult() until it returns NULL, but looking again, I don't see any
need to call it since we free the connection immediately afterwards.

> + /* Check for connection failure. */
> + if (PQconnectPoll(slot->conn) == PGRES_POLLING_FAILED)
> + pg_fatal("connection failure: %s", 
> PQerrorMessage(slot->conn));
> +
> + /* Check whether the connection is still establishing. 
> */
> + if (PQconnectPoll(slot->conn) != PGRES_POLLING_OK)
> + return;
> 
> Are the two consecutive calls of PQconnectPoll intended here? Seems a bit
> wasteful, but maybe that's ok.

I think we can actually just use PQstatus() here.  But furthermore, I think
the way I was initiating connections was completely bogus.  IIUC before
calling PQconnectPoll() the first time, we should wait for a write
indicator from select(), and then we should only call PQconnectPoll() after
subsequent indicators from select().  After spending quite a bit of time
staring at the PQconnectPoll() code, I'm quite surprised I haven't seen any
problems thus far.  If I had to guess, I'd say that either PQconnectPoll()
is more resilient than I think it is, or I've just gotten lucky because
pg_upgrade establishes connections quickly.
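
For the archives, here's a stripped-down, stand-alone sketch of the polling
dance as I read the documented PQconnectPoll() contract (illustrative only,
not the actual pg_upgrade code; error handling is simplified to exit()):

/*
 * Right after PQconnectStart(), behave as if the poll last returned
 * PGRES_POLLING_WRITING, wait for the socket state the poll asks for,
 * and repeat until the status is OK or FAILED.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

#include "libpq-fe.h"

static PGconn *
connect_async(const char *conninfo)
{
	PGconn	   *conn = PQconnectStart(conninfo);
	PostgresPollingStatusType st = PGRES_POLLING_WRITING;

	if (conn == NULL || PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "connection failure: %s\n",
				conn ? PQerrorMessage(conn) : "out of memory");
		exit(1);
	}

	while (st != PGRES_POLLING_OK && st != PGRES_POLLING_FAILED)
	{
		int			sock = PQsocket(conn);
		fd_set		rmask,
					wmask;

		FD_ZERO(&rmask);
		FD_ZERO(&wmask);
		if (st == PGRES_POLLING_READING)
			FD_SET(sock, &rmask);
		else
			FD_SET(sock, &wmask);

		/* wait for the indicator PQconnectPoll() asked for */
		if (select(sock + 1, &rmask, &wmask, NULL, NULL) < 0)
		{
			fprintf(stderr, "select() failed\n");
			exit(1);
		}

		st = PQconnectPoll(conn);
	}

	if (st == PGRES_POLLING_FAILED)
	{
		fprintf(stderr, "connection failure: %s\n", PQerrorMessage(conn));
		exit(1);
	}

	return conn;				/* PQstatus(conn) == CONNECTION_OK here */
}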

Anyway, to fix this, I've added some more fields to the slot struct to
keep track of the information we need to properly establish connections,
and we now pay careful attention to the return value of select() so that we
know which slots are ready for processing.  This seemed like a nice little
optimization independent of fixing connection establishment.  I was worried
this was going to require a lot more code, but I think it only added ~50
lines or so.

> We should probably mention this change in the docs as well, I found these
> two places that I think can be improved:

I've adjusted the documentation in v11.

-- 
nathan
>From 02a9eb5bb035f5fdaf1c165161d96695dec06d45 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 28 Aug 2024 10:45:59 -0500
Subject: [PATCH v11 01/11] Introduce framework for parallelizing various
 pg_upgrade tasks.

A number of pg_upgrade steps require connecting to every database
in the cluster and running the same query in each one.  When there
are many databases, these steps are particularly time-consuming,
especially since these steps are performed sequentially in a single
process.

This commit introduces a new framework that makes it easy to
parallelize most of these once-in-each-database tasks.
Specifically, it manages a simple state machine of slots and uses
libpq's asynchronous APIs to establish the connections and run the
queries.  The --jobs option is used to determine the number of
slots to use.  To use this new task framework, callers simply need
to provide the query and a callback function to process its
results, and the framework takes care of the rest.  A more complete
description is provided at the top of the new task.c file.

None of the eligible once-in-each-database tasks are converted to
use this new framework in this commit.  That will be done via
several follow-up commits.

Reviewed-by: Jeff Davis, Robert Haas, Daniel Gustafsson, Ilya Gladyshev, Corey 
Huinker
Discussion: https://postgr.es/m/20240516211638.GA1688936%40nathanxps13
---
 doc/src/sgml/ref/pgupgrade.sgml  |   6 +-
 src/bin/pg_upgrade/Makefile  |   1 +
 src/bin/pg_upgrade/meson.build   |   1 +
 src/bin/pg_upgrade/pg_upgrade.h  |  21 ++
 src/bin/pg_upgrade/task.c| 491 +++
 src/tools/pgindent/typedefs.list |   5 +
 6 files changed, 522 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_upgrade/task.c

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 9877f2f01c..fc2d0ff845 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -118,7 +118,7 @@ PostgreSQL documentation
  
   -j njobs
   --jobs=njobs
-  number of simultaneous processes or threads to use
+  number of simultaneous connections and processes/threads 
to use
   
  
 
@@ -587,8 +587,8 @@ NET STOP postgresql-&majorversion;
 
 
  The --jobs option allows multiple CPU cores to be used
- for copying/linking of files and to dump and restor

Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-09-01 Thread shveta malik
On Mon, Sep 2, 2024 at 5:47 AM Peter Smith  wrote:
>
> Hi hackers. While reviewing another thread I had cause to look at the
> docs for the pg_replication_slot 'inactive_since' field [1] for the
> first time.
>
> I was confused by the description, which is as follows:
> 
> inactive_since timestamptz
> The time since the slot has become inactive. NULL if the slot is
> currently being used.
> 
>
> Then I found the github history for the patch [2], and the
> accompanying long thread discussion [3] about the renaming of that
> field. I have no intention to re-open that can-of-worms, but OTOH I
> feel the first sentence of the field description is wrong and needs
> fixing.
>
> Specifically, IMO describing something as "The time since..." means
> some amount of elapsed time since some occurrence, but that is not the
> correct description for this timestamp field.
>
> This is not just a case of me being pedantic. For example, here is
> what Chat-GPT had to say:
> 
> I asked:
> What does "The time since the slot has become inactive." mean?
>
> ChatGPT said:
> "The time since the slot has become inactive" refers to the duration
> that has passed from the moment a specific slot (likely a database
> replication slot or a similar entity) stopped being active. In other
> words, it measures how much time has elapsed since the slot
> transitioned from an active state to an inactive state.
>
> For example, if a slot became inactive 2 hours ago, "the time since
> the slot has become inactive" would be 2 hours.
> 
>
> To summarize, the current description wrongly describes the field as a
> time duration:
> "The time since the slot has become inactive."
>
> I suggest replacing it with:
> "The slot has been inactive since this time."
>

+1 for the change. If I had read the document without knowing about
the patch, I too would have interpreted it as a duration.

thanks
Shveta




Re: Improving tracking/processing of buildfarm test failures

2024-09-01 Thread Alexander Lakhin

Hello hackers,

Please take a look at the August report on buildfarm failures:
# SELECT br, count(*) FROM failures WHERE dt >= '2024-08-01' AND
 dt < '2024-09-01' GROUP BY br;
REL_12_STABLE: 2
REL_13_STABLE: 2
REL_14_STABLE: 12
REL_15_STABLE: 3
REL_16_STABLE: 5
REL_17_STABLE: 17
HEAD: 38
-- Total: 79
(Counting test failures only, excluding indent-check, Configure, Build
errors.)

# SELECT COUNT(*) FROM (SELECT DISTINCT issue_link FROM failures WHERE
 dt >= '2024-08-01' AND dt < '2024-09-01');
21

# SELECT issue_link, count(*) FROM failures WHERE dt >= '2024-08-01' AND
 dt < '2024-09-01' GROUP BY issue_link ORDER BY 2 DESC LIMIT 6;
https://www.postgresql.org/message-id/8ce8261a-bf3a-25e6-b473-4808f50a6ea7%40gmail.com:
 13
-- An environmental issue; fixed

https://www.postgresql.org/message-id/a9a97e83-9ec8-5de5-bf69-80e9560f5...@gmail.com:
 9
-- An environmental issue?; probably fixed

https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319...@gmail.com:
 7
-- Fixed

https://www.postgresql.org/message-id/c720cdc3-5ce0-c410-4854-70788175c...@gmail.com:
 6
-- Expected to be fixed with Release 18 of the buildfarm client

https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986b...@gmail.com:
 5

https://www.postgresql.org/message-id/3618203.1722473...@sss.pgh.pa.us: 4
-- Fixed

# SELECT count(*) FROM failures WHERE dt >= '2024-08-01' AND
 dt < '2024-09-01' AND issue_link IS NULL; -- Unsorted/unhelpful failures
13

Short-lived failures: 21

There were also two mysterious never-before-seen failures, both occurred on
POWER animals:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kingsnake&dt=2024-08-19%2019%3A17%3A59
 - REL_17_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57
 - REL_15_STABLE

(I'm not sure yet, whether they should be considered "unhelpful". I'll
wait for more information from these animals/buildfarm in general to
determine what to do with these failures.)

Best regards,
Alexander




Re: ANALYZE ONLY

2024-09-01 Thread David Rowley
On Sun, 1 Sept 2024 at 13:41, Michael Harris  wrote:
>   https://commitfest.postgresql.org/49/5226/
>
> I was not sure what to put for some of the fields - for 'reviewer' should
> I have put the people who have provided feedback on this thread?

I think it's fairly common that only reviewers either add themselves
or don't bother.  I don't think it's common that patch authors add
reviewers. Some reviewers like their reviews to be more informal and
putting themselves as a reviewer can put other people off reviewing as
they think it's already handled by someone else. That may result in
the patch being neglected by reviewers.

> Is there anything else I need to do?

I see you set the target version as 17. That should be blank or "18".
PG17 is about to go into release candidate, so it is too late for
that. It's been fairly consistent that we open for new patches in
early July and close before the 2nd week of April the following year.
We're currently 2 months into PG18 dev. We don't back-patch new
features.

Nothing else aside from continuing to address reviews, as you are already.

David




Re: In-placre persistance change of a relation

2024-09-01 Thread Michael Paquier
On Sun, Sep 01, 2024 at 10:15:00PM +0300, Heikki Linnakangas wrote:
> I wonder if the twophase state files and undo log files should be merged
> into one file. They're similar in many ways: there's one file per
> transaction, named using the XID. I haven't thought this fully through, just
> a thought..

Hmm.  It could be possible to extract some of this knowledge out of
twophase.c and design some APIs that could be used for both, but would
that be really necessary?  The 2PC data and the LSNs used by the files
to check if things are replayed or on disk rely on
GlobalTransactionData that has its own idea of things and timings at
recovery.

Or perhaps your point is actually to do that and add one layer for the
file handlings and their flush timings?  I am not sure, TBH, what this
thread is trying to fix is complicated enough that it may be better to
live with two different code paths.  But perhaps my gut feeling is
just wrong reading your paragraph.
--
Michael


signature.asc
Description: PGP signature


Re: Make printtup a bit faster

2024-09-01 Thread Andy Fan
Andy Fan  writes:

> The attached is PoC of this idea, not matter which method are adopted
> (rewrite all the outfunction or a optional print function), I think the
> benefit will be similar. In the blew test case, it shows us 10%+
> improvements. (0.134ms vs 0.110ms)

After working on more {type}_print functions, I'm thinking this is pretty
much like a third I/O function, which adds confusing maintenance
effort. So I agree that refactoring the existing out functions is the better
idea. I'd like to work on the _print function bodies first for easy review and
testing; after all, if some common issues exist in these changes,
it is better to know that before we work on the 700+ out functions.

-- 
Best Regards
Andy Fan





Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-01 Thread Michael Paquier
On Fri, Aug 30, 2024 at 04:10:32PM -0400, Andrew Dunstan wrote:
> 
> On 2024-08-29 Th 4:44 PM, Jacob Champion wrote:
> > As for the other patches, I'll ping Andrew about 0001,
> 
> 
> Patch 0001 looks sane to me.

So does 0002 to me.  I'm not much of a fan of the addition of
pgstat_bestart_pre_auth(), which is just a shortcut to set a different
state in the backend entry to tell that it is authenticating.  Is
"authenticating" the right term for this stage of process startup,
actually?  Could it be more transparent to use a "startup" or
"starting" state instead, which would also be used by pgstat_bestart()
in the case of the patch where !pre_auth?

The addition of the new wait event states in 0004 is a good idea,
indeed, and these can be seen in pg_stat_activity once we get out of
PGSTAT_END_WRITE_ACTIVITY() (err.. Right?).
--
Michael


signature.asc
Description: PGP signature


Re: Collect statistics about conflicts in logical replication

2024-09-01 Thread Peter Smith
On Fri, Aug 30, 2024 at 6:36 PM shveta malik  wrote:
>
> On Fri, Aug 30, 2024 at 12:15 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> >
> > Here is V5 patch which addressed above and Shveta's[1] comments.
> >
>
> The patch looks good to me.
>

Patch v5 LGTM.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Improving tracking/processing of buildfarm test failures

2024-09-01 Thread sia kc
Hello everyone

I am a developer interested in this project. I had a little involvement with
MariaDB and now I would like to work on Postgres. I have never worked with
mailing lists, so I am not sure if this is the way I should interact. I would
like to be pointed to some tasks and documents to get started.


Add ExprState hashing for GROUP BY and hashed SubPlans

2024-09-01 Thread David Rowley
adf97c156 added support to allow ExprStates to support hashing and
adjusted Hash Join to make use of that. That allowed a speedup in hash
value generation as it allowed JIT compilation of hash values. It also
allowed more efficient tuple deforming as all required attributes are
deformed in one go rather than on demand when hashing each join key.

The attached does the same for GROUP BY and hashed SubPlans. The win
for the tuple deformation does not exist here, but there does seem to
be some gains still to be had from JIT compilation.

Using a scale=1 TPC-H lineitem table, I ran the attached script.

The increase is far from impressive, but likely worth migrating these
over to use ExprState too.
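
For reference, the two plan shapes this touches are plain hash aggregation and
hashed SubPlans; something like the following (reusing the hjtbl table from the
attached script, queries purely illustrative) exercises both:

-- hash aggregation: the GROUP BY keys are hashed
explain (costs off) select a, b, count(*) from hjtbl group by a, b;

-- hashed SubPlan: the subquery result is loaded into a hash table
explain (costs off) select count(*) from hjtbl where (a, b) not in (select c, d from hjtbl);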

master:

alter system set jit = 0;
latency average = 1509.116 ms
latency average = 1502.496 ms
latency average = 1507.560 ms
alter system set jit = 1;
latency average = 1396.015 ms
latency average = 1392.138 ms
latency average = 1396.476 ms
alter system set jit_optimize_above_cost = 0;
latency average = 1290.463 ms
latency average = 1293.364 ms
latency average = 1290.366 ms
alter system set jit_inline_above_cost = 0;
latency average = 1294.540 ms
latency average = 1300.970 ms
latency average = 1302.181 ms

patched:

alter system set jit = 0;
latency average = 1500.183 ms
latency average = 1500.911 ms
latency average = 1504.150 ms (+0.31%)
alter system set jit = 1;
latency average = 1367.427 ms
latency average = 1367.329 ms
latency average = 1366.473 ms (+2.03%)
alter system set jit_optimize_above_cost = 0;
latency average = 1273.453 ms
latency average = 1265.348 ms
latency average = 1272.598 ms (+1.65%)
alter system set jit_inline_above_cost = 0;
latency average = 1264.657 ms
latency average = 1272.661 ms
latency average = 1273.179 ms (+2.29%)

David
#!/bin/bash

nloops=3
rows=1000
dbname=postgres
port=5432
seconds=30

psql -c "drop table if exists hjtbl;" -p $port $dbname
psql -c "create table hjtbl (a int not null, b int not null, c int not null, d 
int not null, e int not null, f int not null);" -p $port $dbname
psql -c "insert into hjtbl select a,a,a,a,a,a from generate_series(1,$rows) a;" 
-p $port $dbname
psql -c "vacuum freeze analyze hjtbl;" -p $port $dbname
psql -c "alter system set jit_above_cost = 0;" -p $port $dbname
psql -c "alter system set jit_optimize_above_cost = 10;" -p $port 
$dbname
psql -c "alter system set jit_inline_above_cost = 10;" -p $port $dbname
psql -c "select pg_reload_conf();" -p $port $dbname

for alt_sys in "alter system set jit = 0;" "alter system set jit = 1;" "alter system set jit_optimize_above_cost = 0;" "alter system set jit_inline_above_cost = 0;"
do
echo "$alt_sys"
psql -c "$alt_sys" -p $port $dbname > /dev/null
psql -c "select pg_Reload_conf();" -p $port $dbname > /dev/null
q=1
for sql in "select count(*) c from (select a,b,c,d,e,f from hjtbl cross 
join generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a);" 
"select count(*) c from (select a,b,c,d,e,f from hjtbl cross join 
generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b);" "select 
count(*) c from (select a,b,c,d,e,f from hjtbl cross join 
generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b,c);" "select 
count(*) c from (select a,b,c,d,e,f from hjtbl cross join 
generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b,c,d);" 
"select count(*) c from (select a,b,c,d,e,f from hjtbl cross join 
generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b,c,d,e);" 
"select count(*) c from (select a,b,c,d,e,f from hjtbl cross join 
generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b,c,d,e,f);"
do
echo $sql > bench.sql
echo -n "Q$q "
pgbench -n -f bench.sql -p $port -M prepared -T $seconds $dbname | grep latency
q=$((q+1))
done
done




v1-0001-Use-ExprStates-for-hashing-in-GROUP-BY-and-SubPla.patch
Description: Binary data


Re: Test 041_checkpoint_at_promote.pl faild in installcheck due to missing injection_points

2024-09-01 Thread Michael Paquier
On Fri, Aug 23, 2024 at 06:45:02PM -0400, Tom Lane wrote:
> It exports it twice, though, which is pretty confusing.

Right.  I am not sure what was my state of mind back then, but this
pattern has spread a bit.

REL_17_STABLE is frozen for a few more days, so I'll address all the
items of this thread that once the release of this week is tagged: the
export duplicates and the installcheck issue.  These are staged on a
local branch for now.
--
Michael


signature.asc
Description: PGP signature


Re: Collect statistics about conflicts in logical replication

2024-09-01 Thread shveta malik
On Mon, Sep 2, 2024 at 4:20 AM Peter Smith  wrote:
>
> On Fri, Aug 30, 2024 at 4:24 PM shveta malik  wrote:
> >
> > On Fri, Aug 30, 2024 at 10:53 AM Peter Smith  wrote:
> > >
> ...
> > > 2. Arrange all the counts into an intuitive/natural order
> > >
> > > There is an intuitive/natural ordering for these counts. For example,
> > > the 'confl_*' count fields are in the order insert -> update ->
> > > delete, which LGTM.
> > >
> > > Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not
> > > in a good order.
> > >
> > > IMO it makes more sense if everything is ordered as:
> > > 'sync_error_count', then 'apply_error_count', then all the 'confl_*'
> > > counts.
> > >
> > > This comment applies to lots of places, e.g.:
> > > - docs (doc/src/sgml/monitoring.sgml)
> > > - function pg_stat_get_subscription_stats (pg_proc.dat)
> > > - view pg_stat_subscription_stats (src/backend/catalog/system_views.sql)
> > > - TAP test SELECTs (test/subscription/t/026_stats.pl)
> > >
> > > As all those places are already impacted by this patch, I think it
> > > would be good if (in passing) we (if possible) also swapped the
> > > sync/apply counts so everything  is ordered intuitively top-to-bottom
> > > or left-to-right.
> >
> > Not sure about this though. It does not seem to belong to the current patch.
> >
>
> Fair enough. But, besides being inappropriate to include in the
> current patch, do you think the suggestion to reorder them made sense?
> If it has some merit, then I will propose it again as a separate
> thread.
>

 Yes, I think it makes sense. With respect to the internal code, it might
still be okay as is, but when it comes to pg_stat_subscription_stats,
I think it is better if the user finds the columns in the below order:
 subid | subname | sync_error_count | apply_error_count | confl_*

 rather than the existing one:
 subid | subname | apply_error_count | sync_error_count | confl_*

thanks
Shveta




Re: long-standing data loss bug in initial sync of logical replication

2024-09-01 Thread Amit Kapila
On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal  wrote:
>
> Next I am planning to test solely on the logical decoding side and
> will share the results.
>

Thanks, the next set of proposed tests makes sense to me. It will also
be useful to generate some worst-case scenarios where the number of
invalidations is higher, to see the distribution cost in such cases. For
example, Truncate/Drop a table with 100 or 1000 partitions.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-01 Thread Amit Kapila
On Thu, Aug 29, 2024 at 11:31 AM Bharath Rupireddy
 wrote:
>
> Thanks for looking into this.
>
> On Mon, Aug 26, 2024 at 4:35 PM Amit Kapila  wrote:
> >
> > Few comments on 0001:
> > 1.
> > @@ -651,6 +651,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
> >
> > + /*
> > + * Skip the sync if the local slot is already invalidated. We do this
> > + * beforehand to avoid slot acquire and release.
> > + */
> >
> > I was wondering why you have added this new check as part of this
> > patch. If you see the following comments in the related code, you will
> > know why we haven't done this previously.
>
> Removed. Can deal with optimization separately.
>
> > 2.
> > + */
> > + InvalidateObsoleteReplicationSlots(RS_INVAL_INACTIVE_TIMEOUT,
> > +0, InvalidOid, InvalidTransactionId);
> >
> > Why do we want to call this for shutdown case (when is_shutdown is
> > true)? I understand trying to invalidate slots during regular
> > checkpoint but not sure if we need it at the time of shutdown.
>
> Changed it to invalidate only for non-shutdown checkpoints. inactive_timeout 
> invalidation isn't critical for shutdown unlike wal_removed which can help 
> shutdown by freeing up some disk space.
>
> > The
> > other point is can we try to check the performance impact with 100s of
> > slots as mentioned in the code comments?
>
> I first checked how much does the wal_removed invalidation check add to the 
> checkpoint (see 2nd and 3rd column). I then checked how much inactive_timeout 
> invalidation check adds to the checkpoint (see 4th column), it is not more 
> than wal_remove invalidation check. I then checked how much the wal_removed 
> invalidation check adds for replication slots that have already been 
> invalidated due to inactive_timeout (see 5th column), looks like not much.
>
> | # of slots | HEAD (no invalidation) ms | HEAD (wal_removed) ms | PATCHED (inactive_timeout) ms | PATCHED (inactive_timeout+wal_removed) ms |
> |------------|---------------------------|-----------------------|-------------------------------|-------------------------------------------|
> | 100        | 18.591                    | 370.586               | 359.299                       | 373.882                                   |
> | 1000       | 15.722                    | 4834.901              | 5081.751                      | 5072.128                                  |
> | 1          | 19.261                    | 59801.062             | 61270.406                     | 60270.099                                 |
>
> Having said that, I'm okay to implement the optimization specified. Thoughts?
>

The other possibility is to try invalidating due to timeout along with
wal_removed case during checkpoint. The idea is that if the slot can
be invalidated due to WAL then fine, otherwise check if it can be
invalidated due to timeout. This can avoid looping the slots and doing
similar work multiple times during the checkpoint.

-- 
With Regards,
Amit Kapila.




Re: Minor refactor: Use more consistent names for the labels of PG_Locale_Strategy

2024-09-01 Thread Michael Paquier
On Thu, Aug 29, 2024 at 10:06:32AM +0900, Michael Paquier wrote:
> +1 for your suggestion, as you are suggesting.  The original intention
> when PG_Locale_Strategy got introduced was to have everything named as
> PG_REGEX_LOCALE_*, but with the built-in part coming in play in this
> code adding "STRATEGY" is cleaner than just "LOCALE".

Applied this one as 23138284cde4.
--
Michael


signature.asc
Description: PGP signature


Re: Track the amount of time waiting due to cost_delay

2024-09-01 Thread Bertrand Drouvot
Hi,

On Tue, Aug 20, 2024 at 12:48:29PM +, Bertrand Drouvot wrote:
> As it looks like we have a consensus not to wait on [0] (as reducing the 
> number
> of interrupts makes sense on its own), then please find attached v4, a rebase
> version (that also makes clear in the doc that that new field might show 
> slightly
> old values, as mentioned in [1]).

Please find attached v5, a mandatory rebase.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 1a14b708e0ee74c2f38835968d828c54022a5526 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 24 Jun 2024 08:43:26 +
Subject: [PATCH v5] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

In the case of a parallel worker, to avoid the leader being interrupted too
frequently (while it might be sleeping for cost delay), the report is done only
if the last report was made more than 1 second ago.

Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.

Indeed when deciding about the throttling:

1. The number of parallel workers should not come into play:

 1.1) the more parallel workers are used, the smaller the leader's impact on
 the vacuum index phase duration/workload is (because the work is spread
 across more processes).

 1.2) the fewer parallel workers there are, the less the leader will be
 interrupted (fewer parallel workers would report their delayed time).

2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The cost delay does not come into play, as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more interested
in the frequency of the interrupts.

4. A 1 second reporting "throttling" looks like a reasonable threshold as:

 4.1 the idea is to have a significant impact when the leader could have been
interrupted, say, hundreds/thousands of times per second.

 4.2 it does not make that much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).

Bump catversion because this changes the definition of pg_stat_progress_vacuum.
---
 doc/src/sgml/monitoring.sgml | 13 
 src/backend/catalog/system_views.sql |  2 +-
 src/backend/commands/vacuum.c| 49 
 src/include/catalog/catversion.h |  2 +-
 src/include/commands/progress.h  |  1 +
 src/test/regress/expected/rules.out  |  3 +-
 6 files changed, 67 insertions(+), 3 deletions(-)
  23.5% doc/src/sgml/
   4.2% src/backend/catalog/
  63.4% src/backend/commands/
   4.6% src/include/
   4.0% src/test/regress/expected/

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 55417a6fa9..d87604331a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6307,6 +6307,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
cleaning up indexes.
   
  
+
+ 
+  
+   time_delayed bigint
+  
+  
+   Total amount of time spent in milliseconds waiting due to vacuum_cost_delay
+   or autovacuum_vacuum_cost_delay. In case of parallel
+   vacuum the reported time is across all the workers and the leader. This
+   column is updated at a 1 Hz frequency (one time per second) so could show
+   slightly old values.
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 19cabc9a47..875df7d0e4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1218,7 +1218,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
 S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
 S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
 S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
-S.param10 AS indexes_processed
+S.param10 AS indexes_processed, S.param11 AS time_delayed
 FROM pg_stat_get_progress_info('VACUUM') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7d8e9d2045..5bf2e37d3f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_inherits.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -60,6 +61,12 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/*
+ * Minimum amount of time (in ms)

DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-09-01 Thread Peter Smith
Hi hackers. While reviewing another thread I had cause to look at the
docs for the pg_replication_slot 'inactive_since' field [1] for the
first time.

I was confused by the description, which is as follows:

inactive_since timestamptz
The time since the slot has become inactive. NULL if the slot is
currently being used.


Then I found the github history for the patch [2], and the
accompanying long thread discussion [3] about the renaming of that
field. I have no intention to re-open that can-of-worms, but OTOH I
feel the first sentence of the field description is wrong and needs
fixing.

Specifically, IMO describing something as "The time since..." means
some amount of elapsed time since some occurrence, but that is not the
correct description for this timestamp field.

This is not just a case of me being pedantic. For example, here is
what Chat-GPT had to say:

I asked:
What does "The time since the slot has become inactive." mean?

ChatGPT said:
"The time since the slot has become inactive" refers to the duration
that has passed from the moment a specific slot (likely a database
replication slot or a similar entity) stopped being active. In other
words, it measures how much time has elapsed since the slot
transitioned from an active state to an inactive state.

For example, if a slot became inactive 2 hours ago, "the time since
the slot has become inactive" would be 2 hours.


To summarize, the current description wrongly describes the field as a
time duration:
"The time since the slot has become inactive."

I suggest replacing it with:
"The slot has been inactive since this time."

The attached patch makes this suggested change.

==
[1] docs - https://www.postgresql.org/docs/devel/view-pg-replication-slots.html
[2] thread - 
https://www.postgresql.org/message-id/ca+tgmob_ta-t2ty8qrkhbgnnlrf4zycwhghgfsuuofraedw...@mail.gmail.com
[3] push - 
https://github.com/postgres/postgres/commit/6d49c8d4b4f4a20eb5b4c501d78cf894fa13c0ea

Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-fix-description-for-inactive_since.patch
Description: Binary data


Re: scalability bottlenecks with (many) partitions (and more)

2024-09-01 Thread Tomas Vondra
Hi,

While discussing this patch with Robert off-list, one of the questions
he asked was whether there's some size threshold after which it starts to
have a negative impact. I didn't have a good answer to that - I did have
some intuition (that making it too large would not hurt), but I haven't
done any tests with "extreme" sizes of the fast-path structs.

So I ran some more tests, with up to 4096 "groups" (which means 64k
fast-path slots). And no matter how I slice the results, there's no
clear regression point beyond which the performance would start to
decline (even just slowly). It's the same for all benchmarks, client
counts, query modes, and so on.

I'm attaching two PDFs with results for the "join" benchmark I described
earlier (query with a join on many partitions) from EPYC 7763 (64/128c).
The first one is with "raw" data (throughput = tps), the second one is
relative throughput to the first column (which is pretty much current
master, with no optimizations applied).

The complete results including some nice .odp spreadsheets and scripts
are available here:

  https://github.com/tvondra/pg-lock-scalability-results

There's often a very clear point where the performance significantly
improves - this is usually when all the relation locks start to fit into
the fast-path array. With 1000 relations that's ~64 groups, and so on.
But there's no point where it would start declining.

My explanation is that the PGPROC (where the fast-path array is) is so
large already (close to 1kB) that making it larger does not really cause
any additional cache misses, etc. And if it does, that's far outweighed
by the cost of accessing (or not having to access) the shared lock table.

So I don't think there's any point at which point we'd start to regress,
at least not because of cache misses, CPU etc. It stops improving, but
that's just a sign that we've hit some other bottleneck - that's not a
fault of this patch.


But that's not the whole story, of course. Because if there were no
issues, why not just make the fast-path array insanely large? In
another off-list discussion Andres asked me about the memory this would
need, and after looking at the numbers I think that's a strong argument
to keep the numbers reasonable.

I did a quick experiment to see the per-connection memory requirements,
and how would it be affected by this patch. I simply logged the amount
of shared memory CalculateShmemSize(), started the server with 100 and
1000 connections, and did a bit of math to calculate how much memory we
need "per connection".

For master and different numbers of fast-path groups I got this:

    master        64      1024     32765
  ------------------------------------------
     47668     52201    121324   2406892

So by default we need ~48kB / connection, with 64 groups we need ~52kB
(which makes sense because that's 1024 x 4B slots), and then with 1024
groups we get to ~120kB, and with 32k groups to ~2.5MB.

I guess those higher values seem a bit insane - we don't want to just
increase the per-connection memory requirements 50x for everyone, right?

But what about the people who actually want this many locks? Let's bump
max_locks_per_transaction from 64 to 1024, and we get this:

    master        64      1024     32765
  ------------------------------------------
    419367    423909    493022   2778590

Suddenly, the differences are much smaller, especially for the 64
groups, which is roughly the same number of fast-path slots as the max
locks per transaction. That shrunk to a ~1% difference. And even for 1024
groups it's now just ~20%, which I think is well worth the benefits.

And likely something the system should have available - with 1000
connections that's ~80MB. And if you run with 1000 connections, 80MB
should be rounding error, IMO.

Of course, it does not seem great to force everyone to pay this price,
even if they don't need that many locks (and so there's no benefit). So
how would we improve that?

I don't think that's possible with hard-coded size of the array - that
allocates the memory for everyone. We'd need to make it variable-length,
and while doing those benchmarks I think we actually already have a GUC
for that - max_locks_per_transaction tells us exactly what we need to
know, right? I mean, if I know I'll need ~1000 locks, why not to make
the fast-path array large enough for that?

Of course, the consequence of this would be making PGPROC variable
length, or having to point to a memory allocated separately (I prefer
the latter option, I think). I haven't done any experiments, but it
seems fairly doable - of course, not sure if it might be more expensive
compared to compile-time constants.
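
To make that a bit more concrete, here's a hypothetical sketch of the
startup-time sizing; the names and the 1024-group cap are made up and not part
of the patch, only the 16-slot group size matches what the benchmarks above use:

/*
 * Hypothetical sketch only: derive the number of 16-slot fast-path groups
 * per backend from max_locks_per_transaction at startup, instead of using
 * a compile-time constant.
 */
#define FP_LOCK_SLOTS_PER_GROUP 16

static int
FastPathLockGroups(int max_locks_per_transaction)
{
	/* round up to whole groups */
	int			groups = (max_locks_per_transaction +
						  FP_LOCK_SLOTS_PER_GROUP - 1) / FP_LOCK_SLOTS_PER_GROUP;

	/* keep at least one group, and cap the shared-memory cost (arbitrary cap) */
	if (groups < 1)
		groups = 1;
	if (groups > 1024)
		groups = 1024;

	return groups;
}

The per-backend slot arrays themselves (groups * 16 relation OIDs plus the
lock-mode bits) would then live in a separately allocated piece of shared
memory sized with MaxBackends at startup, with PGPROC keeping only a pointer
into it, i.e. the "allocate separately" option mentioned above.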


At this point I think it's fairly clear we have significant bottlenecks
when having to lock many relations - and that won't go away, thanks to
partitioning etc. We're already fixing various other bottlenecks for
these workloads, which will just increase pressure on locking.

Fundamentally, I think we'll need to either evolve the fast-path sys

Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch

2024-09-01 Thread Michael Paquier
On Fri, Aug 30, 2024 at 04:06:03PM -0500, Nathan Bossart wrote:
> I suppose it would be difficult to argue that it is actually useful, given
> it hasn't worked since v11 and apparently nobody noticed until recently.
> If we're content to leave it unsupported, then sure, let's just remove the
> "relkind == RELKIND_SEQUENCE" check in pgstat_relation().  But I also don't
> have a great reason to _not_ support it.  It used to work (which appears to
> have been intentional, based on the code), it was unintentionally broken,
> and it'd work again with a ~1 line change.  "SELECT count(*) FROM
> my_sequence" probably doesn't provide a lot of value, but I have no
> intention of proposing a patch that removes support for that.

IMO, it can be useful to check the state of the page used by a
sequence.  We have a few tweaks in sequence.c like manipulations of
t_infomask, and it can be good to check its status on corrupted
systems.
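
For context, the usage being discussed is presumably just something like this,
once the relkind check is relaxed (illustrative only):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
CREATE SEQUENCE my_sequence;
-- a sequence is a single-page relation, so this reports on that one page
SELECT * FROM pgstattuple('my_sequence');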
--
Michael


signature.asc
Description: PGP signature


Re: In-placre persistance change of a relation

2024-09-01 Thread Heikki Linnakangas

On 31/08/2024 19:09, Kyotaro Horiguchi wrote:

- UNDO log(0002): This handles file deletion during transaction aborts,
   which was previously managed, in part, by the commit XLOG record at
   the end of a transaction.

- Prevent orphan files after a crash (0005): This is another use-case
   of the UNDO log system.


Nice, I'm very excited if we can fix that long-standing issue! I'll try 
to review this properly later, but at a quick 5 minute glance, one thing 
caught my eye:


This requires fsync()ing the per-xid undo log file every time a relation 
is created. I fear that can be a pretty big performance hit for 
workloads that repeatedly create and drop small tables. Especially if 
they're otherwise running with synchronous_commit=off. Instead of 
flushing the undo log file after every write, I'd suggest WAL-logging 
the undo log like regular relations and SLRUs. So before writing the 
entry to the undo log, WAL-log it. And with a little more effort, you 
could postpone creating the files altogether until a checkpoint happens, 
similar to how twophase state files are checkpointed nowadays.


I wonder if the twophase state files and undo log files should be merged 
into one file. They're similar in many ways: there's one file per 
transaction, named using the XID. I haven't thought this fully through, 
just a thought..



+static void
+undolog_set_filename(char *buf, TransactionId xid)
+{
+   snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid);
+}


I'd suggest using FullTransactionId. Doesn't matter much, but seems like 
a good future-proofing.


--
Heikki Linnakangas
Neon (https://neon.tech)