Re: grouping pushdown

2023-01-04 Thread Antonin Houska
David Rowley  wrote:

> On Wed, 4 Jan 2023 at 23:21, Spring Zhong  wrote:
> > The plan is apparently inefficient, since the hash aggregate goes after the 
> > Cartesian product. We could expect the query's performance to improve
> > considerably if the HashAggregate node can be pushed down to the SCAN node.
> 
> > Does someone have suggestions on this?
> 
> I think this is being worked on. See [1].

Well, the current version of that patch requires the query to contain at least
one aggregate. It shouldn't be a big deal to modify it. However, note that this
feature pushes the aggregate/grouping only to one side of the join (a "fake"
count(*) aggregate is added to the query):

SET enable_agg_pushdown TO on;

EXPLAIN select i1,i2, count(*) from t1, t2 group by i1,i2;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=440.02..440.04 rows=1 width=16)
   Group Key: t1.i1, t2.i2
   ->  Sort  (cost=440.02..440.02 rows=1 width=16)
         Sort Key: t1.i1, t2.i2
         ->  Nested Loop  (cost=195.00..440.01 rows=1 width=16)
               ->  Partial HashAggregate  (cost=195.00..195.01 rows=1 width=12)
                     Group Key: t1.i1
                     ->  Seq Scan on t1  (cost=0.00..145.00 rows=1 width=4)
               ->  Seq Scan on t2  (cost=0.00..145.00 rows=1 width=4)

If both sides should be grouped, finalization of the partial aggregates would
be more difficult, and I'm not sure it'd be worth the effort.
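For the cross-join example above, the two-sided version can be written out by
hand, which also hints at why finalization is harder: the partial counts have
to be multiplied rather than summed. A sketch using the t1/t2 names from the
example (this is an illustration, not the patch's actual mechanism):

```sql
-- Group each side separately, then combine.  For a Cartesian product,
-- the count for an (i1, i2) pair is the product of the per-side counts.
SELECT p1.i1, p2.i2, p1.cnt * p2.cnt AS count
FROM  (SELECT i1, count(*) AS cnt FROM t1 GROUP BY i1) AS p1
CROSS JOIN
      (SELECT i2, count(*) AS cnt FROM t2 GROUP BY i2) AS p2;
```

A real join qual referencing non-grouped columns would break this simple
rewrite, which is roughly where the extra finalization effort would go.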

> [1] https://commitfest.postgresql.org/41/3764/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread Ankit Kumar Pandey



On 05/01/23 12:53, David Rowley wrote:

> We *can* reuse Sorts where a more strict or equivalent sort order is
> available.  The question is how do we get the final WindowClause to do
> something slightly more strict to save having to do anything for the
> ORDER BY.  One way you might think would be to adjust the
> WindowClause's orderClause to add the additional clauses, but that
> cannot be done because that would cause are_peers() in nodeWindowAgg.c
> to not count some rows as peers when they maybe should be given a less
> strict orderClause in the WindowClause.

Okay, now I see the issue in my approach.

> It might be possible to adjust create_one_window_path() so that when
> processing the final WindowClause that it looks at the DISTINCT or
> ORDER BY clause to see if we can sort on a few extra columns to save
> having to do any further sorting.  We just *cannot* make any
> adjustments to the WindowClause's orderClause.

This is a much better solution. I will check create_one_window_path for
the same.
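A sketch of the kind of query this would help, reusing the abcd example from
elsewhere in the thread (hypothetical; the exact plan shape depends on the
eventual patch):

```sql
-- The window only needs its input sorted by (a, b), but the final
-- ORDER BY wants (a, b, c).  Sorting once on (a, b, c) below the
-- WindowAgg would satisfy both, provided the WindowClause's
-- orderClause still lists only (a, b), so that are_peers() keeps
-- treating rows equal on (a, b) as peers.
EXPLAIN SELECT row_number() OVER (ORDER BY a, b)
FROM abcd
ORDER BY a, b, c;
```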

--
Regards,
Ankit Kumar Pandey





Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread David Rowley
On Thu, 5 Jan 2023 at 16:12, Tom Lane  wrote:
>
> David Rowley  writes:
> > Additionally, it's also not that clear to me that sorting by more
> > columns in the sort below the WindowAgg would always be a win over
> > doing the final sort for the ORDER BY.  What if the WHERE clause (that
> > could not be pushed down before a join) filtered out the vast majority
> > of the rows before the ORDER BY. It might be cheaper to do the sort
> > then than to sort by the additional columns earlier.
>
> That's certainly a legitimate question to ask, but I don't quite see
> where you figure we'd be sorting more rows?  WHERE filtering happens
> before window functions, which never eliminate any rows.  So it seems
> like a sort just before the window functions must sort the same number
> of rows as one just after them.

Yeah, I didn't think the WHERE clause thing out carefully enough. I
think it's only the WindowClause's runCondition that could possibly
filter any rows between the Sort below the WindowAgg and before the
ORDER BY is evaluated.

David




Re: Resolve UNKNOWN type to relevant type instead of text type while bulk update using values

2023-01-04 Thread David G. Johnston
Please don’t top-post

On Wednesday, January 4, 2023, Sayyid Ali Sajjad Rizavi 
wrote:

> Breaking working queries for this is not acceptable.
>
>
> Good point, let's exclude Option 2.
>
>
>> This happens when possible so any remaining cases are not possible.  Or,
>> at least apparently not worth the effort it would take to make work.
>
>
> Actually this doesn't happen when all of the values in that position are
> null. Or maybe I don't understand what you mean.
> If we don't consider the effort it would take to make it work, do you
> think Option 1 would be good to have? Because when I
> have an integer column in that position, I wouldn't want the unknown
> (null) values I supply to be resolved to `text` type.
>
>>
>>
The VALUES subquery has to produce its tabular output without being aware
of how the outer query is going to use it.  The second column of your
values subquery lacks type information so the system chooses a default -
text.

Dealing with types is one of the harder medium-hard problems in computer
science…encountering this problem in real life has never seen me motivated
enough to gripe about it rather than just add an explicit cast and move
on.  And I’ve been around long enough to know that the project is, and long
has been, aware of the dull pain points in this area.
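For the example from the original message, the explicit-cast workaround looks
like this (a sketch reusing the table and column names posted there):

```sql
update persons p
set age = t.age
from (
    values
        ('uuid1', null::integer),  -- cast the unknown literal in place
        ('uuid2', null::integer)
) as t(id, age)
where p.id = t.id;
```

With the second column typed explicitly, the VALUES subquery no longer has to
fall back to text.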

David J.


Re: An oversight in ExecInitAgg for grouping sets

2023-01-04 Thread Richard Guo
On Thu, Jan 5, 2023 at 6:18 AM David Rowley  wrote:

> On Tue, 3 Jan 2023 at 10:25, Tom Lane  wrote:
> > The thing that I find really distressing here is that it's been
> > like this for years and none of our automated testing caught it.
> > You'd have expected valgrind testing to do so ... but it does not,
> > because we've never marked that word NOACCESS.  Maybe we should
> > rethink that?  It'd require making mcxt.c do some valgrind flag
> > manipulations so it could access the hdrmask when appropriate.
>
> Yeah, that probably could have been improved during the recent change.
> Here's a patch for it.


Thanks for the patch.  With it Valgrind is able to catch the invalid
read discussed in the initial email of this thread.

 VALGRINDERROR-BEGIN
 Invalid read of size 8
at 0x4DB056: ExecInitAgg
by 0x4C486A: ExecInitNode
by 0x4B92B7: InitPlan
by 0x4B81D7: standard_ExecutorStart
by 0x4B7F1B: ExecutorStart

I reviewed this patch and have some comments.

It seems that for MemoryContextMethods in alignedalloc.c the access to
the chunk header is not wrapped by VALGRIND_MAKE_MEM_DEFINED and
VALGRIND_MAKE_MEM_NOACCESS.  Should we do that?

In GenerationFree, I think the VALGRIND_MAKE_MEM_DEFINED should be moved
to the start of this function, before we call MemoryChunkIsExternal.

In SlabFree, we should call MemoryChunkGetBlock after the call of
VALGRIND_MAKE_MEM_DEFINED, just like how we do in SlabRealloc.

In AllocSetStats, we have a call of MemoryChunkGetValue in Assert.  I
think we should wrap it with VALGRIND_MAKE_MEM_DEFINED and
VALGRIND_MAKE_MEM_NOACCESS.

Thanks
Richard


Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Amit Kapila
On Wed, Jan 4, 2023 at 11:30 PM Peter Geoghegan  wrote:
>
> On Wed, Jan 4, 2023 at 7:03 AM Robert Haas  wrote:
> > But that having been said, I'm kind of astonished that you didn't know
> > about this already. The freezing behavior is in general extremely hard
> > to get right, and I guess I feel if you don't understand how the
> > underlying functions work, including things like performance
> > considerations
>
> I was the one that reported the issue with CLOG lookups in the first place.
>
> > and which functions return fully reliable results, I do
> > not think you should be committing your own patches in this area.
>
> My mistake here had nothing to do with my own goals. I was trying to
> be diligent by hardening an existing check in passing, and it
> backfired.
>
> > There is probably a lot of potential benefit in improving the way this
> > stuff works, but there is also a heck of a lot of danger of creating
> > subtle data corrupting bugs that could easily take years to find.
>
> It's currently possible for VACUUM to set the all-frozen bit while
> unsetting the all-visible bit, due to a race condition [1]. This is
> your long-standing bug. So apparently nobody is qualified to commit
> patches in this area.
>
> About a year ago, there was a massive argument over some earlier work
> in the same general area, by me. Being the subject of a pile-on on
> this mailing list is something that I find deeply upsetting and
> demoralizing. I just cannot take much more of it. At the same time,
> I've made quite an investment in the pending patches, and think that
> it's something that I have to see through.
>
> If I am allowed to finish what I've started, then I will stop all new
> work on VACUUM.
>

+1 for you to continue your work in this area. Personally, I don't
feel you need to stop working on VACUUM, especially now that you have
built up good knowledge in this area and have a grip on the areas for
improvement. AFAICS, the main takeaway is to get one's own work
reviewed, which, in your case, Jeff is already doing for the main
project work. So, continuing with that and having some more reviews
should avoid such complaints. It is always possible that you, me, or
anyone can miss something important even after detailed reviews by
others, but I think the chances will be much lower.

You are an extremely valuable person for this project and I wish that
you continue working with the same enthusiasm.

-- 
With Regards,
Amit Kapila.




Re: Resolve UNKNOWN type to relevant type instead of text type while bulk update using values

2023-01-04 Thread Sayyid Ali Sajjad Rizavi
>
> Breaking working queries for this is not acceptable.


Good point, let's exclude Option 2.


> This happens when possible so any remaining cases are not possible.  Or,
> at least apparently not worth the effort it would take to make work.


Actually this doesn't happen when all of the values in that position are
null. Or maybe I don't understand what you mean.
If we don't consider the effort it would take to make it work, do you think
Option 1 would be good to have? Because when I
have an integer column in that position, I wouldn't want the unknown (null)
values I supply to be resolved to `text` type.


On Thu, Jan 5, 2023 at 11:23 AM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Wednesday, January 4, 2023, Sayyid Ali Sajjad Rizavi <
> sasriz...@gmail.com> wrote:
>>
>>
>> *Option 1:* Cast to the relevant column type in that position (to
>> `integer` in this case), whenever we have an unknown type.
>>
>
> This happens when possible so any remaining cases are not possible.  Or,
> at least apparently not worth the effort it would take to make work.
>
>
>> *Option 2:* Always give error if unknown type is not casted to desired
>> type (`null::integer` will be necessary).
>>
>
> Breaking working queries for this is not acceptable.
>
> David J.
>
>


Re: Fix showing XID of a spectoken lock in an incorrect field of pg_locks view.

2023-01-04 Thread Masahiko Sawada
On Wed, Jan 4, 2023 at 6:42 PM Amit Kapila  wrote:
>
> On Wed, Jan 4, 2023 at 12:16 PM Masahiko Sawada  wrote:
> >
> > It seems to be confusing and the user won't get the result even if
> > they search it by transactionid = 741. So I've attached the patch to
> > fix it. With the patch, the pg_locks view shows something like:
> >
> >  locktype  | database | relation | page | tuple | virtualxid |
> > transactionid | classid | objid | objsubid | virtualtransaction |  pid
> >   | mode  | granted | fastpath | waitstart
> > ---+--+--+--+---++---+-+---+--+++---+-+--+---
> >  spectoken |  |  |  |   ||
> >   746 | | 1 |  | 3/4| 535618 |
> > ExclusiveLock | t   | f|
> > (1 row)
> >
>
> Is it a good idea to display the spec token as objid? If so, how will
> users know? Currently, for advisory locks, we display values in
> classid, objid, and objsubid differently than the original meaning of
> those fields, but that is explained in the docs [1]. Wouldn't it be
> better to mention this in the docs as well?

Agreed. Attached is the updated patch.
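For reference, with the patch applied the token would be looked up via objid
rather than transactionid (a sketch; the locktype string matches the output
quoted above):

```sql
-- speculative-insertion token locks: the token is reported in objid
SELECT locktype, objid, virtualtransaction, mode, granted
FROM pg_locks
WHERE locktype = 'spectoken';
```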

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-Fix-showing-transaction-id-of-a-spectoken-in-an-i.patch
Description: Binary data


Re: Resolve UNKNOWN type to relevant type instead of text type while bulk update using values

2023-01-04 Thread David G. Johnston
On Wednesday, January 4, 2023, Sayyid Ali Sajjad Rizavi 
wrote:
>
>
> *Option 1:* Cast to the relevant column type in that position (to
> `integer` in this case), whenever we have an unknown type.
>

This happens when possible so any remaining cases are not possible.  Or, at
least apparently not worth the effort it would take to make work.


> *Option 2:* Always give error if unknown type is not casted to desired
> type (`null::integer` will be necessary).
>

Breaking working queries for this is not acceptable.

David J.


Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread Ankit Kumar Pandey



On 05/01/23 07:48, Vik Fearing wrote:

> On 1/4/23 13:07, Ankit Kumar Pandey wrote:
>
>> Also, one thing, consider the following query:
>>
>> explain analyze select row_number() over (order by a,b), count(*) over
>> (order by a) from abcd order by a,b,c;
>>
>> In this case, sorting is done on (a,b) followed by an incremental sort
>> on c at the final stage.
>>
>> If we do just one sort on a,b,c at the first stage then there won't be
>> a need to do another sort (the incremental one).
>
> This could give incorrect results.  Consider the following query:
>
> postgres=# select a, b, c, rank() over (order by a, b)
> from (values (1, 2, 1), (1, 2, 2), (1, 2, 1)) as abcd (a, b, c)
> order by a, b, c;
>
>  a | b | c | rank
> ---+---+---+------
>  1 | 2 | 1 |    1
>  1 | 2 | 1 |    1
>  1 | 2 | 2 |    1
> (3 rows)
>
> If you change the window's ordering like you suggest, you get this
> different result:
>
> postgres=# select a, b, c, rank() over (order by a, b, c)
> from (values (1, 2, 1), (1, 2, 2), (1, 2, 1)) as abcd (a, b, c)
> order by a, b, c;
>
>  a | b | c | rank
> ---+---+---+------
>  1 | 2 | 1 |    1
>  1 | 2 | 1 |    1
>  1 | 2 | 2 |    3
> (3 rows)



We are already doing something like I mentioned.

Consider this example:

explain SELECT rank() OVER (ORDER BY a), count(*) OVER (ORDER BY a,b)
FROM abcd;

                            QUERY PLAN
------------------------------------------------------------------
 WindowAgg  (cost=83.80..127.55 rows=1250 width=24)
   ->  WindowAgg  (cost=83.80..108.80 rows=1250 width=16)
         ->  Sort  (cost=83.80..86.92 rows=1250 width=8)
               Sort Key: a, b
               ->  Seq Scan on abcd  (cost=0.00..19.50 rows=1250 width=8)
(5 rows)

If it is okay to do an extra sort for the first window function (rank)
here, why would it be any different in the case I mentioned?

My suggestion rests on the assumption that for a window function, say
rank() OVER (ORDER BY a), the ordering of columns (other than column
'a') shouldn't matter.


--
Regards,
Ankit Kumar Pandey





Resolve UNKNOWN type to relevant type instead of text type while bulk update using values

2023-01-04 Thread Sayyid Ali Sajjad Rizavi
Hi!
I discovered an interesting behavior in PostgreSQL bulk update queries
using the `from (values %s)` syntax.

Let's see an example;
```
update persons p
set age = t.age
from  (
values
('uuid1', null),
('uuid2', null)
) as t(id, age)
where p.id = t.id;
```
The `age` column is of type integer. The above query will give this
error: *column "age" is of type integer but expression is of type text.*
(PostgreSQL resolves the type as text.)

But if we change the values to these;
```
values
('uuid1', 21),
('uuid2', null)
```
We won't get any error, because PostgreSQL detects that at least one
integer value exists in the second position and resolves that column to
`integer`.

The issue here is that this is very unexpected behavior, which might
succeed in most cases and fail in others. The behavior can be seen in the
`parser/parse_coerce.c` file.
```
 /*
  * If all the inputs were UNKNOWN type --- ie, unknown-type literals
---
  * then resolve as type TEXT.  This situation comes up with constructs
  * like SELECT (CASE WHEN foo THEN 'bar' ELSE 'baz' END); SELECT 'foo'
  * UNION SELECT 'bar'; It might seem desirable to leave the construct's
  * output type as UNKNOWN, but that really doesn't work, because we'd
  * probably end up needing a runtime coercion from UNKNOWN to something
  * else, and we usually won't have it.  We need to coerce the unknown
  * literals while they are still literals, so a decision has to be made
  * now.
  */
 if (ptype == UNKNOWNOID)
 ptype = TEXTOID;
```

So here are the 2 options I suggest:
*Option 1:* Cast to the relevant column type in that position (to `integer`
in this case), whenever we have an unknown type.
*Option 2:* Always give error if unknown type is not casted to desired type
(`null::integer` will be necessary).


Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-04 Thread Amit Kapila
On Thu, Jan 5, 2023 at 10:16 AM Nathan Bossart  wrote:
>
> On Wed, Jan 04, 2023 at 08:12:37PM -0800, Nathan Bossart wrote:
> > On Thu, Jan 05, 2023 at 09:09:12AM +0530, Amit Kapila wrote:
> >> But there doesn't appear to be any guarantee that the result for
> >> AllTablesyncsReady() will change between the time it is invoked
> >> earlier in the function and at the place you have it in the patch.
> >> This is because the value of 'table_states_valid' may not have
> >> changed. So, how is this supposed to work?
> >
> > The call to CommandCounterIncrement() should set table_states_valid to
> > false if needed.
>
> In v12, I moved the restart for two_phase mode to the end of
> process_syncing_tables_for_apply() so that we don't need to rely on another
> iteration of the loop.
>

This should work but it is better to add a comment before calling
CommandCounterIncrement() to indicate that this is for making changes
to the relation state visible.

Thinking along similar lines, won't the apply worker need to be notified
of the SUBREL_STATE_SYNCWAIT state change by the tablesync worker?

-- 
With Regards,
Amit Kapila.




Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-04 Thread Amit Kapila
On Thu, Jan 5, 2023 at 6:19 AM Nathan Bossart  wrote:
>
> On Wed, Jan 04, 2023 at 10:12:19AM -0800, Nathan Bossart wrote:
> > From the discussion thus far, it sounds like the alternatives are to 1) add
> > a global flag that causes wal_retrieve_retry_interval to be bypassed for
> > all workers or to 2) add a hash map in the launcher and a
> > restart_immediately flag in each worker slot.  I'll go ahead and create a
> > patch for 2 since it seems like the most complete solution, and we can
> > evaluate whether the complexity seems appropriate.
>
> Here is a first attempt at adding a hash table to the launcher and a
> restart_immediately flag in each worker slot.  This provides a similar
> speedup to lowering wal_retrieve_retry_interval to 1ms.  I've noted a
> couple of possible race conditions in comments, but none of them seemed
> particularly egregious.  Ideally, we'd put the hash table in shared memory
> so that other backends could adjust it directly, but IIUC that requires it
> to be a fixed size, and the number of subscriptions is virtually unbounded.
>

True, if we want, we can use dshash for this. The garbage collection
mechanism used in the patch seems odd to me, as it will remove/add
entries in the hash table even when the corresponding subscription is
never dropped. Also, doing this garbage collection each time seems
like an overhead, especially for small values of
wal_retrieve_retry_interval and a large number of subscriptions.

Another point: immediately after cleaning up the worker info, trying to
find it again seems of no use. In logicalrep_worker_launch(), using
both in_use and restart_immediately to find an unused slot doesn't
look neat to me; we could probably keep the in_use flag intact if we
want to reuse the worker. But then again, after freeing the worker,
keeping its associated slot allocated sounds odd to me.

-- 
With Regards,
Amit Kapila.




Re: fix and document CLUSTER privileges

2023-01-04 Thread Nathan Bossart
On Wed, Jan 04, 2023 at 11:27:05PM +0100, Gilles Darold wrote:
> Got it, this is patch add_cluster_skip_messages.patch. IMHO this patch
> should be part of this commitfest, as it is directly based on this one. You
> could create a second patch here that adds the warning message so that
> committers can decide whether it should be applied.

That's fine with me.  I added the warning messages in v4.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 145101e6a5..749b410e16 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -67,7 +67,8 @@ CLUSTER [VERBOSE]
   
 
   
-   CLUSTER without any parameter reclusters all the
+   CLUSTER without a
+   table_name reclusters all the
previously-clustered tables in the current database that the calling user
owns or has the MAINTAIN privilege for, or all such tables
if called by a superuser or a role with privileges of the
@@ -134,6 +135,15 @@ CLUSTER [VERBOSE]
  
   Notes
 
+   
+To cluster a table, one must have the MAINTAIN privilege
+on the table or be the table's owner, a superuser, or a role with
+privileges of the
+pg_maintain
+role.  Database-wide clusters and clusters on partitioned tables will skip
+over any tables that the calling user does not have permission to cluster.
+   
+

 In cases where you are accessing single rows randomly
 within a table, the actual order of the data in the
@@ -202,7 +212,8 @@ CLUSTER [VERBOSE]

 Clustering a partitioned table clusters each of its partitions using the
 partition of the specified partitioned index.  When clustering a partitioned
-table, the index may not be omitted.
+table, the index may not be omitted.  CLUSTER on a
+partitioned table cannot be executed inside a transaction block.

 
  
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index f11691aff7..55b699ef05 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -80,6 +80,7 @@ static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
 static List *get_tables_to_cluster(MemoryContext cluster_context);
 static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
 			   Oid indexOid);
+static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
 
 
 /*---
@@ -366,8 +367,7 @@ cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params)
 	if (recheck)
 	{
 		/* Check that the user still has privileges for the relation */
-		if (!object_ownercheck(RelationRelationId, tableOid, save_userid) &&
-			pg_class_aclcheck(tableOid, save_userid, ACL_MAINTAIN) != ACLCHECK_OK)
+		if (!cluster_is_permitted_for_relation(tableOid, save_userid))
 		{
 			relation_close(OldHeap, AccessExclusiveLock);
 			goto out;
@@ -1645,8 +1645,7 @@ get_tables_to_cluster(MemoryContext cluster_context)
 
 		index = (Form_pg_index) GETSTRUCT(indexTuple);
 
-		if (!object_ownercheck(RelationRelationId, index->indrelid, GetUserId()) &&
-			pg_class_aclcheck(index->indrelid, GetUserId(), ACL_MAINTAIN) != ACLCHECK_OK)
+		if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
 			continue;
 
 		/* Use a permanent memory context for the result list */
@@ -1695,10 +1694,7 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 			continue;
 
 		/* Silently skip partitions which the user has no access to. */
-		if (!object_ownercheck(RelationRelationId, relid, GetUserId()) &&
-			pg_class_aclcheck(relid, GetUserId(), ACL_MAINTAIN) != ACLCHECK_OK &&
-			(!object_ownercheck(DatabaseRelationId, MyDatabaseId, GetUserId()) ||
-			 IsSharedRelation(relid)))
+		if (!cluster_is_permitted_for_relation(relid, GetUserId()))
 			continue;
 
 		/* Use a permanent memory context for the result list */
@@ -1714,3 +1710,16 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 
 	return rtcs;
 }
+
+static bool
+cluster_is_permitted_for_relation(Oid relid, Oid userid)
+{
+	if (object_ownercheck(RelationRelationId, relid, userid) ||
+		pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
+		return true;
+
+	ereport(WARNING,
+			(errmsg("permission denied to cluster \"%s\", skipping it",
+					get_rel_name(relid))));
+	return false;
+}
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 542c2e098c..27a5dff5d4 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -352,7 +352,9 @@ INSERT INTO clstr_3 VALUES (1);
 -- this user can only cluster clstr_1 and clstr_3, but the latter
 -- has not been clustered
 SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR;  -- order of "skipping" warnings may vary
 CLUSTER;
+RESET client_min_messages;
 SELECT * FROM 

Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-04 Thread Nathan Bossart
On Wed, Jan 04, 2023 at 08:12:37PM -0800, Nathan Bossart wrote:
> On Thu, Jan 05, 2023 at 09:09:12AM +0530, Amit Kapila wrote:
>> But there doesn't appear to be any guarantee that the result for
>> AllTablesyncsReady() will change between the time it is invoked
>> earlier in the function and at the place you have it in the patch.
>> This is because the value of 'table_states_valid' may not have
>> changed. So, how is this supposed to work?
> 
> The call to CommandCounterIncrement() should set table_states_valid to
> false if needed.

In v12, I moved the restart for two_phase mode to the end of
process_syncing_tables_for_apply() so that we don't need to rely on another
iteration of the loop.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 57bfee9615e25d62dc975325695fdf445c002a65 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 21 Nov 2022 16:01:01 -0800
Subject: [PATCH v12 1/2] wake up logical workers as needed instead of relying
 on periodic wakeups

---
 src/backend/access/transam/xact.c   |  3 ++
 src/backend/commands/alter.c|  7 +++
 src/backend/commands/subscriptioncmds.c |  4 ++
 src/backend/replication/logical/tablesync.c | 49 -
 src/backend/replication/logical/worker.c| 46 +++
 src/include/replication/logicalworker.h |  3 ++
 6 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 24221542e7..54145bf805 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -47,6 +47,7 @@
 #include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
 #include "replication/syncrep.h"
@@ -2360,6 +2361,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true, is_parallel_worker);
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
+	AtEOXact_LogicalRepWorkers(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2860,6 +2862,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false, is_parallel_worker);
 		AtEOXact_ApplyLauncher(false);
+		AtEOXact_LogicalRepWorkers(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 70d359eb6a..4e8102b59f 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -59,6 +59,7 @@
 #include "commands/user.h"
 #include "miscadmin.h"
 #include "parser/parse_func.h"
+#include "replication/logicalworker.h"
 #include "rewrite/rewriteDefine.h"
 #include "tcop/utility.h"
 #include "utils/builtins.h"
@@ -279,6 +280,12 @@ AlterObjectRename_internal(Relation rel, Oid objectId, const char *new_name)
 		if (strncmp(new_name, "regress_", 8) != 0)
 			elog(WARNING, "subscriptions created by regression test cases should have names starting with \"regress_\"");
 #endif
+
+		/*
+		 * Wake up the logical replication workers to handle this change
+		 * quickly.
+		 */
+		LogicalRepWorkersWakeupAtCommit(objectId);
 	}
 	else if (nameCacheId >= 0)
 	{
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index b9c5df796f..b9bbb2cf4e 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -34,6 +34,7 @@
 #include "nodes/makefuncs.h"
 #include "pgstat.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/origin.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
@@ -1362,6 +1363,9 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 	InvokeObjectPostAlterHook(SubscriptionRelationId, subid, 0);
 
+	/* Wake up the logical replication workers to handle this change quickly. */
+	LogicalRepWorkersWakeupAtCommit(subid);
+
 	return myself;
 }
 
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 2fdfeb5b4c..85e2a2861f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -415,6 +415,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
+	bool		should_exit = false;
 
 	Assert(!IsTransactionState());
 
@@ -446,28 +447,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		last_start_times = NULL;
 	}
 
-	/*
-	 * Even when the two_phase mode is requested by the user, it remains as
-	 * 'pending' until all tablesyncs have reached READY state.
-	 *
-	 * When this happens, we restart the apply worker and (if the conditions
-	 * are still ok) then the two_phase tri-state will become 'enabled' at
-	 * that time.
-	 *
-	 * Note: If the subscription has no tables then leave the state as
-	 * PENDING, which allows ALTER 

Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-04 Thread Nathan Bossart
On Thu, Jan 05, 2023 at 09:09:12AM +0530, Amit Kapila wrote:
> On Wed, Jan 4, 2023 at 11:03 PM Nathan Bossart  
> wrote:
>> On Wed, Jan 04, 2023 at 09:41:47AM +0530, Amit Kapila wrote:
>> > If so, we probably also need to
>> > ensure that table_states_valid is marked false probably via
>> > invalidations so that we can get the latest state and then perform
>> > this check. I guess if we can do that then we can directly move the
>> > restart logic to the end.
>>
>> IMO this shows the advantage of just waking up the worker.  It doesn't
>> change the apply worker's behavior besides making it more responsive.
> 
> But there doesn't appear to be any guarantee that the result for
> AllTablesyncsReady() will change between the time it is invoked
> earlier in the function and at the place you have it in the patch.
> This is because the value of 'table_states_valid' may not have
> changed. So, how is this supposed to work?

The call to CommandCounterIncrement() should set table_states_valid to
false if needed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-04 Thread Amit Kapila
On Wed, Jan 4, 2023 at 11:03 PM Nathan Bossart  wrote:
>
> On Wed, Jan 04, 2023 at 09:41:47AM +0530, Amit Kapila wrote:
> > I am not sure if I understand the problem you are trying to solve with
> > this part of the patch. Are you worried that after we mark some of the
> > relation's state as READY, all the table syncs are in the READY state
> > but we will not immediately try to check the two_phase stuff and
> > probably the apply worker may sleep before the next time it invokes
> > process_syncing_tables_for_apply()?
>
> Yes.
>
> > If so, we probably also need to
> > ensure that table_states_valid is marked false probably via
> > invalidations so that we can get the latest state and then perform
> > this check. I guess if we can do that then we can directly move the
> > restart logic to the end.
>
> IMO this shows the advantage of just waking up the worker.  It doesn't
> change the apply worker's behavior besides making it more responsive.
>

But there doesn't appear to be any guarantee that the result for
AllTablesyncsReady() will change between the time it is invoked
earlier in the function and at the place you have it in the patch.
This is because the value of 'table_states_valid' may not have
changed. So, how is this supposed to work?

-- 
With Regards,
Amit Kapila.




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread Tom Lane
David Rowley  writes:
> Additionally, it's also not that clear to me that sorting by more
> columns in the sort below the WindowAgg would always be a win over
> doing the final sort for the ORDER BY.  What if the WHERE clause (that
> could not be pushed down before a join) filtered out the vast majority
> of the rows before the ORDER BY. It might be cheaper to do the sort
> then than to sort by the additional columns earlier.

That's certainly a legitimate question to ask, but I don't quite see
where you figure we'd be sorting more rows?  WHERE filtering happens
before window functions, which never eliminate any rows.  So it seems
like a sort just before the window functions must sort the same number
of rows as one just after them.

regards, tom lane




Re: pg_upgrade test failure

2023-01-04 Thread Thomas Munro
On Wed, Dec 7, 2022 at 7:15 AM Andres Freund  wrote:
> On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > So [1] on its own didn't fix this.  My next guess is that the attached
> > might help.

> What is our plan here? This afaict is the most common "false positive" for
> cfbot in the last weeks.

That branch hasn't failed on cfbot[1], except once due to one of the
other known flapping races we have to fix.  Which doesn't prove
anything, of course, but it is encouraging.  I wish we knew why the
test does this, though.

Here's a better version that works harder to avoid opening more than
one fd at a time (like the pgfnames()-based code it replaces), and
also uses fd.c facilities in the backend version (unlike pgfnames(),
which looks like it could leak a descriptor if palloc() threw, and
also doesn't know how to handle file descriptor pressure).

[1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/41/4011
From 3c9ed22ee207f2a5c008a971f8b19561d2e0ccfa Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 5 Jan 2023 14:15:37 +1300
Subject: [PATCH v2] Refactor rmtree() to use get_dirent_type().

Switch to get_dirent_type() instead of lstat() while traversing a
directory graph, to see if that fixes intermittent ENOTEMPTY failures
while cleaning up after the pg_upgrade tests, on our Windows CI build.

Our CI image uses Windows Server 2019, a version known not to have POSIX
unlink semantics enabled by default yet, unlike typical Windows 10 and
11 systems.

The theory is that the offending directory entries must be
STATUS_DELETE_PENDING.  With this change, rmtree() should not skip them,
and try to unlink again.  Our unlink() wrapper should either wait a
short time for them to go away (because other processes close the
handle) or log a message to tell us the path of the problem file if not,
so we can dig further.

Discussion: https://postgr.es/m/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de
---
 src/common/rmtree.c | 120 +++-
 1 file changed, 64 insertions(+), 56 deletions(-)

diff --git a/src/common/rmtree.c b/src/common/rmtree.c
index d445e21ba2..a3e536149b 100644
--- a/src/common/rmtree.c
+++ b/src/common/rmtree.c
@@ -20,13 +20,21 @@
 #include 
 #include 
 
+#include "common/file_utils.h"
+
 #ifndef FRONTEND
+#include "storage/fd.h"
 #define pg_log_warning(...) elog(WARNING, __VA_ARGS__)
+#define LOG_LEVEL WARNING
+#define OPENDIR(x) AllocateDir(x)
+#define CLOSEDIR(x) FreeDir(x)
 #else
 #include "common/logging.h"
+#define LOG_LEVEL PG_LOG_WARNING
+#define OPENDIR(x) opendir(x)
+#define CLOSEDIR(x) closedir(x)
 #endif
 
-
 /*
  *	rmtree
  *
@@ -41,82 +49,82 @@
 bool
 rmtree(const char *path, bool rmtopdir)
 {
-	bool		result = true;
 	char		pathbuf[MAXPGPATH];
-	char	  **filenames;
-	char	  **filename;
-	struct stat statbuf;
-
-	/*
-	 * we copy all the names out of the directory before we start modifying
-	 * it.
-	 */
-	filenames = pgfnames(path);
+	DIR		   *dir;
+	struct dirent *de;
+	bool		result = true;
+	size_t		dirnames_size = 0;
+	size_t		dirnames_capacity = 8;
+	char	  **dirnames = palloc(sizeof(char *) * dirnames_capacity);
 
-	if (filenames == NULL)
+	dir = OPENDIR(path);
+	if (dir == NULL)
+	{
+		pg_log_warning("could not open directory \"%s\": %m", path);
 		return false;
+	}
 
-	/* now we have the names we can start removing things */
-	for (filename = filenames; *filename; filename++)
+	while (errno = 0, (de = readdir(dir)))
 	{
-		snprintf(pathbuf, MAXPGPATH, "%s/%s", path, *filename);
-
-		/*
-		 * It's ok if the file is not there anymore; we were just about to
-		 * delete it anyway.
-		 *
-		 * This is not an academic possibility. One scenario where this
-		 * happens is when bgwriter has a pending unlink request for a file in
-		 * a database that's being dropped. In dropdb(), we call
-		 * ForgetDatabaseSyncRequests() to flush out any such pending unlink
-		 * requests, but because that's asynchronous, it's not guaranteed that
-		 * the bgwriter receives the message in time.
-		 */
-		if (lstat(pathbuf, &statbuf) != 0)
-		{
-			if (errno != ENOENT)
-			{
-pg_log_warning("could not stat file or directory \"%s\": %m",
-			   pathbuf);
-result = false;
-			}
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
 			continue;
-		}
-
-		if (S_ISDIR(statbuf.st_mode))
-		{
-			/* call ourselves recursively for a directory */
-			if (!rmtree(pathbuf, true))
-			{
-/* we already reported the error */
-result = false;
-			}
-		}
-		else
+		snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
+		switch (get_dirent_type(pathbuf, de, false, LOG_LEVEL))
 		{
-			if (unlink(pathbuf) != 0)
-			{
-if (errno != ENOENT)
+			case PGFILETYPE_ERROR:
+/* already logged, press on */
+break;
+			case PGFILETYPE_DIR:
+
+/*
+ * Defer recursion until after we've closed this directory, to
+ * avoid using more than one file descriptor at a time.
+ */
+if (dirnames_size == 

Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread David Rowley
On Thu, 5 Jan 2023 at 15:18, Vik Fearing  wrote:
>
> On 1/4/23 13:07, Ankit Kumar Pandey wrote:
> > Also, one thing, consider the following query:
> >
> > explain analyze select row_number() over (order by a,b),count(*) over
> > (order by a) from abcd order by a,b,c;
> >
> > In this case, sorting is done on (a,b) followed by incremental sort on c
> > at final stage.
> >
> > If we do just one sort: a,b,c at first stage then there won't be need to
> > do another sort (incremental one).
>
>
> This could give incorrect results.

Yeah, this seems to be what I warned against in [1].

If we wanted to make that work we'd need to do it without adjusting
the WindowClause's orderClause so that the peer row checks still
worked correctly in nodeWindowAgg.c.

Additionally, it's also not that clear to me that sorting by more
columns in the sort below the WindowAgg would always be a win over
doing the final sort for the ORDER BY.  What if the WHERE clause (that
could not be pushed down before a join) filtered out the vast majority
of the rows before the ORDER BY. It might be cheaper to do the sort
then than to sort by the additional columns earlier.

David

[1] 
https://www.postgresql.org/message-id/CAApHDvp=r1lnekcmwcyarumpl-jp4j_sdc8yefywadt1ac5...@mail.gmail.com




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread Tom Lane
Vik Fearing  writes:
> On 1/4/23 13:07, Ankit Kumar Pandey wrote:
>> Also, one thing, consider the following query:
>> explain analyze select row_number() over (order by a,b),count(*) over 
>> (order by a) from abcd order by a,b,c;
>> In this case, sorting is done on (a,b) followed by incremental sort on c 
>> at final stage.
>> If we do just one sort: a,b,c at first stage then there won't be need to 
>> do another sort (incremental one).

> This could give incorrect results.

Mmmm ... your counterexample doesn't really prove that.  Yes,
the "rank()" step must consider only two ORDER BY columns while
deciding which rows are peers, but I don't see why it wouldn't
be okay if the rows happened to already be sorted by "c" within
those peer groups.

I don't recall the implementation details well enough to be sure
how hard it would be to keep that straight.

regards, tom lane
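Tom's observation can be modelled outside the server. In the sketch below (hypothetical standalone C; none of these names come from the PostgreSQL source), rank() is computed with peer checks on (a, b) only, the way the window's ORDER BY defines peers. Feeding it input pre-sorted by the stricter (a, b, c) yields the same ranks as sorting by (a, b) alone, because the extra c ordering only permutes rows within peer groups.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy row type mirroring the abcd example from this thread. */
typedef struct Row { int a, b, c; } Row;

/* Sort by the stricter (a, b, c) order that the final ORDER BY wants. */
static int
cmp_abc(const void *l, const void *r)
{
	const Row *x = l;
	const Row *y = r;

	if (x->a != y->a) return (x->a > y->a) - (x->a < y->a);
	if (x->b != y->b) return (x->b > y->b) - (x->b < y->b);
	return (x->c > y->c) - (x->c < y->c);
}

/*
 * rank() with peer detection over (a, b) only: a new rank starts at the
 * 1-based row number whenever the current row is not a peer of the
 * previous one; peers share the previous row's rank.
 */
static void
rank_over_ab(const Row *rows, int n, int *rank)
{
	for (int i = 0; i < n; i++)
	{
		if (i == 0 || rows[i].a != rows[i - 1].a || rows[i].b != rows[i - 1].b)
			rank[i] = i + 1;	/* not a peer of the previous row */
		else
			rank[i] = rank[i - 1];	/* peer: same rank */
	}
}
```

With Vik's rows (1,2,1), (1,2,2), (1,2,1), sorting by (a,b,c) up front and then ranking with peers decided by (a,b) still produces 1, 1, 1 — the "already sorted by c within those peer groups" case — whereas it was widening the peer check itself to (a,b,c) that changed the answer in his second query.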




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread Vik Fearing

On 1/4/23 13:07, Ankit Kumar Pandey wrote:

Also, one thing, consider the following query:

explain analyze select row_number() over (order by a,b),count(*) over 
(order by a) from abcd order by a,b,c;


In this case, sorting is done on (a,b) followed by incremental sort on c 
at final stage.


If we do just one sort: a,b,c at first stage then there won't be need to 
do another sort (incremental one).



This could give incorrect results.  Consider the following query:

postgres=# select a, b, c, rank() over (order by a, b)
from (values (1, 2, 1), (1, 2, 2), (1, 2, 1)) as abcd (a, b, c)
order by a, b, c;

 a | b | c | rank
---+---+---+--
 1 | 2 | 1 |1
 1 | 2 | 1 |1
 1 | 2 | 2 |1
(3 rows)


If you change the window's ordering like you suggest, you get this 
different result:



postgres=# select a, b, c, rank() over (order by a, b, c)
from (values (1, 2, 1), (1, 2, 2), (1, 2, 1)) as abcd (a, b, c)
order by a, b, c;

 a | b | c | rank
---+---+---+--
 1 | 2 | 1 |1
 1 | 2 | 1 |1
 1 | 2 | 2 |3
(3 rows)


--
Vik Fearing





Re: New strategies for freezing, advancing relfrozenxid early

2023-01-04 Thread Matthias van de Meent
On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan  wrote:
>
> Attached is v14.

Some reviews (untested; only code review so far) on these versions of
the patches:

> [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

> +/*
> + * Threshold cutoff point (expressed in # of physical heap rel blocks in
> + * rel's main fork) that triggers VACUUM's eager freezing strategy
> + */

I don't think the mention of 'cutoff point' is necessary when it has
'Threshold'.

> +intfreeze_strategy_threshold;/* threshold to use eager
> [...]
> +BlockNumber freeze_strategy_threshold;

Is there a way to disable the 'eager' freezing strategy? `int` cannot
hold the maximum BlockNumber...

> +lazy_scan_strategy(vacrel);
>  if (verbose)

I'm slightly surprised you didn't update the message for verbose vacuum
to indicate whether we used the eager strategy: there are several GUCs
for tuning this behaviour, so you'd expect to want direct confirmation
that the configuration is effective.
(looks at further patches) I see that the message for verbose vacuum
sees significant changes in patch 2 instead.

---

> [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

General comments:

I don't see anything regarding scan synchronization in the vmsnap scan
system. I understand that VACUUM is a load that is significantly
different from normal SEQSCANs, but are there good reasons to _not_
synchronize the start of VACUUM?

Right now, we don't use syncscan to determine a startpoint. I can't
find the reason why in the available documentation: [0] discusses the
issue, but without clearly explaining why syncscan wouldn't be
interesting from a 'nothing lost' perspective.

In addition, I noticed that progress reporting of blocks scanned
("heap_blocks_scanned", duh) includes skipped pages. Now that we have
a solid grasp of how many blocks we're planning to scan, we can update
the reported stats to how many blocks we're planning to scan (and have
scanned), increasing the user value of that progress view.

[0] https://www.postgresql.org/message-id/flat/19398.1212328662%40sss.pgh.pa.us#17b2feb0fde6a41779290632d8c70ef1

> +doubletableagefrac;

I think this can use some extra info on the field itself, that it is
the fraction of how "old" the relfrozenxid and relminmxid fields are,
as a fraction between 0 (latest values; nextXID and nextMXID), and 1
(values that are old by at least freeze_table_age and
multixact_freeze_table_age (multi)transaction ids, respectively).


> -#define VACOPT_DISABLE_PAGE_SKIPPING 0x80/* don't skip any pages */
> +#define VACOPT_DISABLE_PAGE_SKIPPING 0x80/* don't skip using VM */

I'm not super happy with this change. I don't think we should touch
the VM using snapshots _at all_ when disable_page_skipping is set:

> + * Decide vmsnap scanning strategy.
>   *
> - * This test also enables more frequent relfrozenxid advancement during
> - * non-aggressive VACUUMs.  If the range has any all-visible pages then
> - * skipping makes updating relfrozenxid unsafe, which is a real downside.
> + * First acquire a visibility map snapshot, which determines the number of
> + * pages that each vmsnap scanning strategy is required to scan for us in
> + * passing.

I think we should not take disk-backed vm snapshots when
force_scan_all is set. We need VACUUM to be able to run on very
resource-constrained environments, and this does not do that - it adds
a disk space requirement for the VM snapshot for all but the smallest
relation sizes, which is bad when you realize that we need VACUUM when
we want to clean up things like CLOG.

Additionally, it took me several reads of the code and comments to
understand what the decision-making process for lazy vs eager is, and
why. The comments are interspersed with the code, with no single place
that describes it from a bird's eyes' view. I think something like the
following would be appreciated by other readers of the code:

+ We determine whether we choose the eager or lazy scanning strategy
based on how many extra pages the eager strategy would take over the
lazy strategy, and how "old" the table is (as determined in
tableagefrac):
+ When a table is still "young" (tableagefrac <
TABLEAGEFRAC_MIDPOINT), the eager strategy is accepted if we need to
scan 5% (MAX_PAGES_YOUNG_TABLEAGE) more of the table.
+ As the table gets "older" (tableagefrac between MIDPOINT and
HIGHPOINT), the threshold for eager scanning is relaxed linearly from
this 5% to 70% (MAX_PAGES_OLD_TABLEAGE) of the table being scanned
extra (over what would be scanned by the lazy strategy).
+ Once the tableagefrac passes HIGHPOINT, we stop considering the lazy
strategy, and always eagerly scan the table.

> @@ -1885,6 +1902,30 @@ retry:
> tuples_frozen = 0;/* avoid miscounts in instrumentation */
>  }
>
> /*
> + * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
> + * set.  

Re: [PATCH] Simple code cleanup in tuplesort.c.

2023-01-04 Thread John Naylor
On Fri, Sep 16, 2022 at 1:43 PM Richard Guo  wrote:
>
>
> On Wed, Jul 27, 2022 at 5:10 PM Xing Guo  wrote:
>>
>> The bounded heap sorting status flag is set twice in sort_bounded_heap()
>> and tuplesort_performsort(). This patch helps remove one of them.
>
>
> Revisiting this patch I think maybe it's better to remove the setting of
> Tuplesort status from tuplesort_performsort() for the TSS_BOUNDED case.
> Thus we keep the heap manipulation routines, make_bounded_heap and
> sort_bounded_heap, consistent in that they update their status
> accordingly inside the function.

The label TSS_BUILDRUNS has a similar style and also the following comment,
so I will push this patch with a similar comment added unless someone wants
to make a case for doing otherwise.

* Note that mergeruns sets the correct state->status.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-04 Thread Tom Lane
Thomas Munro  writes:
> On Thu, Jan 5, 2023 at 11:55 AM Tom Lane  wrote:
>> ... But if that is the direction
>> we're going to go in, we should probably revise these APIs to make them
>> less odd.  I'm not sure why we'd keep the REG_CANCEL error code at all.

> Ah, OK.  I had the impression from the way the code is laid out with a
> wall between "PostgreSQL" bits and "vendored library" bits that we
> might have some reason to want to keep that callback interface the
> same (ie someone else is using this code and we want to stay in
> sync?), but your reactions are a clue that maybe I imagined a
> requirement that doesn't exist.

The rcancelrequested API is something that I devised out of whole cloth
awhile ago.  It's not in Tcl's copy of the code, which AFAIK is the
only other project using this regex engine.  I do still have vague
hopes of someday seeing the engine as a standalone project, which is
why I'd prefer to keep a bright line between the engine and Postgres.
But there's no very strong reason to think that any hypothetical future
external users who need a cancel API would want this API as opposed to
one that requires exit() or longjmp() to get out of the engine.  So if
we're changing the way we use it, I think it's perfectly reasonable to
redesign that API to make it simpler and less of an impedance mismatch.

regards, tom lane




Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-04 Thread Nathan Bossart
On Wed, Jan 04, 2023 at 10:12:19AM -0800, Nathan Bossart wrote:
> From the discussion thus far, it sounds like the alternatives are to 1) add
> a global flag that causes wal_retrieve_retry_interval to be bypassed for
> all workers or to 2) add a hash map in the launcher and a
> restart_immediately flag in each worker slot.  I'll go ahead and create a
> patch for 2 since it seems like the most complete solution, and we can
> evaluate whether the complexity seems appropriate.

Here is a first attempt at adding a hash table to the launcher and a
restart_immediately flag in each worker slot.  This provides a similar
speedup to lowering wal_retrieve_retry_interval to 1ms.  I've noted a
couple of possible race conditions in comments, but none of them seemed
particularly egregious.  Ideally, we'd put the hash table in shared memory
so that other backends could adjust it directly, but IIUC that requires it
to be a fixed size, and the number of subscriptions is virtually unbounded.
There might still be problems with the patch, but I'm hoping it at least
helps further the discussion about which approach to take.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 2c4c3c552ca83eeb8a4a7cca724ce455ba98ac0f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 21 Nov 2022 16:01:01 -0800
Subject: [PATCH v11 1/2] wake up logical workers as needed instead of relying
 on periodic wakeups

---
 src/backend/access/transam/xact.c   |  3 ++
 src/backend/commands/alter.c|  7 
 src/backend/commands/subscriptioncmds.c |  4 ++
 src/backend/replication/logical/tablesync.c | 10 +
 src/backend/replication/logical/worker.c| 46 +
 src/include/replication/logicalworker.h |  3 ++
 6 files changed, 73 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 24221542e7..54145bf805 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -47,6 +47,7 @@
 #include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
 #include "replication/syncrep.h"
@@ -2360,6 +2361,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true, is_parallel_worker);
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
+	AtEOXact_LogicalRepWorkers(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2860,6 +2862,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false, is_parallel_worker);
 		AtEOXact_ApplyLauncher(false);
+		AtEOXact_LogicalRepWorkers(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 70d359eb6a..4e8102b59f 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -59,6 +59,7 @@
 #include "commands/user.h"
 #include "miscadmin.h"
 #include "parser/parse_func.h"
+#include "replication/logicalworker.h"
 #include "rewrite/rewriteDefine.h"
 #include "tcop/utility.h"
 #include "utils/builtins.h"
@@ -279,6 +280,12 @@ AlterObjectRename_internal(Relation rel, Oid objectId, const char *new_name)
 		if (strncmp(new_name, "regress_", 8) != 0)
 			elog(WARNING, "subscriptions created by regression test cases should have names starting with \"regress_\"");
 #endif
+
+		/*
+		 * Wake up the logical replication workers to handle this change
+		 * quickly.
+		 */
+		LogicalRepWorkersWakeupAtCommit(objectId);
 	}
 	else if (nameCacheId >= 0)
 	{
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index b9c5df796f..b9bbb2cf4e 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -34,6 +34,7 @@
 #include "nodes/makefuncs.h"
 #include "pgstat.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/origin.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
@@ -1362,6 +1363,9 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 	InvokeObjectPostAlterHook(SubscriptionRelationId, subid, 0);
 
+	/* Wake up the logical replication workers to handle this change quickly. */
+	LogicalRepWorkersWakeupAtCommit(subid);
+
 	return myself;
 }
 
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 2fdfeb5b4c..fcadc1e98e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -105,6 +105,7 @@
 #include "pgstat.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
+#include "replication/logicalworker.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 #include "replication/slot.h"
@@ -619,6 +620,15 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 
 	if 

Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-04 Thread Thomas Munro
On Thu, Jan 5, 2023 at 11:55 AM Tom Lane  wrote:
> Andres Freund  writes:
> > Hm. Seems confusing for this to continue being called rcancelrequested() and
> > to be called via if(CANCEL_REQUESTED()), if we're not even documenting that
> > it's intended to be usable that way?
>
> Yeah.  I'm not very happy with this line of development at all,
> because I think we are painting ourselves into a corner by not allowing
> code to detect whether a cancel is pending without having it happen
> immediately.  (That is, I do not believe that backend/regex/ is the
> only code that will ever wish for that.)  But if that is the direction
> we're going to go in, we should probably revise these APIs to make them
> less odd.  I'm not sure why we'd keep the REG_CANCEL error code at all.

Ah, OK.  I had the impression from the way the code is laid out with a
wall between "PostgreSQL" bits and "vendored library" bits that we
might have some reason to want to keep that callback interface the
same (ie someone else is using this code and we want to stay in
sync?), but your reactions are a clue that maybe I imagined a
requirement that doesn't exist.




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-05 13:21:54 +1300, Thomas Munro wrote:
> Right, I contemplated variations on that theme.  I'd be willing to
> code something like that to kick the tyres, but it seems like it would
> make back-patching more painful?  We're trying to fix bugs here...

I think we need to accept that this mess can't be fixed in the back
branches. I'd rather get a decent fix sometime in PG16 than a crufty fix in PG
17 that we then backpatch a while later.


> Deciding to proceed with #6 (palloc) wouldn't mean we can't eventually
> also implement two phase/soft CFI() when we have a potential user, so
> I don't really get the painted-into-a-corner argument.

I think that's a fair point.


> However, it's all moot if the #6 isn't good enough on its own merits
> independent of other hypothetical future users (eg if the per regex_t
> MemoryContext overheads are considered too high and can't be tuned
> acceptably).

I'm not too worried about that, particularly because it looks like it'd not be
too hard to lower the overhead further. Arguably allocating memory outside of
mcxt.c is actually a bad thing independent of error handling, because it's
effectively "invisible" to our memory-usage-monitoring facilities.

Greetings,

Andres Freund




Re: Split index and table statistics into different types of stats

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-03 15:19:18 +0100, Drouvot, Bertrand wrote:
> diff --git a/src/backend/access/common/relation.c b/src/backend/access/common/relation.c
> index 4017e175e3..fca166a063 100644
> --- a/src/backend/access/common/relation.c
> +++ b/src/backend/access/common/relation.c
> @@ -73,7 +73,10 @@ relation_open(Oid relationId, LOCKMODE lockmode)
>   if (RelationUsesLocalBuffers(r))
>   MyXactFlags |= XACT_FLAGS_ACCESSEDTEMPNAMESPACE;
>  
> - pgstat_init_relation(r);
> + if (r->rd_rel->relkind == RELKIND_INDEX)
> + pgstat_init_index(r);
> + else
> + pgstat_init_table(r);
>  
>   return r;
>  }
> @@ -123,7 +126,10 @@ try_relation_open(Oid relationId, LOCKMODE lockmode)
>   if (RelationUsesLocalBuffers(r))
>   MyXactFlags |= XACT_FLAGS_ACCESSEDTEMPNAMESPACE;
>  
> - pgstat_init_relation(r);
> + if (r->rd_rel->relkind == RELKIND_INDEX)
> + pgstat_init_index(r);
> + else
> + pgstat_init_table(r);
>  
>   return r;
>  }

Not this patch's fault, but the functions in relation.c have gotten duplicated
to an almost ridiculous degree :(


> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 3fb38a25cf..98bb230b95 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -776,11 +776,19 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
>* Read the buffer, and update pgstat counters to reflect a cache hit or
>* miss.
>*/
> - pgstat_count_buffer_read(reln);
> + if (reln->rd_rel->relkind == RELKIND_INDEX)
> + pgstat_count_index_buffer_read(reln);
> + else
> + pgstat_count_table_buffer_read(reln);
>   buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
>   forkNum, blockNum, mode, strategy, &hit);
>   if (hit)
> - pgstat_count_buffer_hit(reln);
> + {
> + if (reln->rd_rel->relkind == RELKIND_INDEX)
> + pgstat_count_index_buffer_hit(reln);
> + else
> + pgstat_count_table_buffer_hit(reln);
> + }
>   return buf;
>  }

Not nice to have additional branches here :(.

I think going forward we should move buffer stats to a "per-relfilenode" stats
entry (which would also allow tracking writes), and then this branch would be
removed again.


> +/* -
> + *
> + * pgstat_index.c
> + * Implementation of index statistics.

This is a fair bit of duplicated code. Perhaps it'd be worth keeping
pgstat_relation with code common to table/index stats?


> +bool
> +pgstat_index_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
> +{
> + static const PgStat_IndexCounts all_zeroes;
> + Oid dboid;
> +
> + PgStat_IndexStatus *lstats; /* pending stats entry  */
> + PgStatShared_Index *shrelcomstats;

What does "com" stand for in shrelcomstats?


> + PgStat_StatIndEntry *indentry;  /* index entry of shared stats */
> + PgStat_StatDBEntry *dbentry;/* pending database entry */
> +
> + dboid = entry_ref->shared_entry->key.dboid;
> + lstats = (PgStat_IndexStatus *) entry_ref->pending;
> + shrelcomstats = (PgStatShared_Index *) entry_ref->shared_stats;
> +
> + /*
> +  * Ignore entries that didn't accumulate any actual counts, such as
> +  * indexes that were opened by the planner but not used.
> +  */
> + if (memcmp(&lstats->i_counts, &all_zeroes,
> +sizeof(PgStat_IndexCounts)) == 0)
> + {
> + return true;
> + }

I really need to propose pg_memiszero().
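For what it's worth, a minimal sketch of such a helper (hypothetical name and placement; just the obvious portable byte loop, not anything optimized):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: true if the first len bytes at ptr are all zero.
 * Avoids declaring a static all-zeroes copy of a struct just to memcmp
 * against it, as the hunk quoted above does.
 */
static inline bool
pg_memiszero(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;

	while (len-- > 0)
	{
		if (*p++ != 0)
			return false;
	}
	return true;
}
```

The quoted check could then read `if (pg_memiszero(&lstats->i_counts, sizeof(PgStat_IndexCounts))) return true;` with no all_zeroes variable. A real version would presumably compare word-sized chunks first.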



>  Datum
> -pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
> +pg_stat_get_tab_xact_numscans(PG_FUNCTION_ARGS)
>  {
>   Oid relid = PG_GETARG_OID(0);
>   int64   result;
> @@ -1360,17 +1413,32 @@ pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
>   PG_RETURN_INT64(result);
>  }
>  
> +Datum
> +pg_stat_get_ind_xact_numscans(PG_FUNCTION_ARGS)
> +{
> + Oid relid = PG_GETARG_OID(0);
> + int64   result;
> + PgStat_IndexStatus *indentry;
> +
> + if ((indentry = find_indstat_entry(relid)) == NULL)
> + result = 0;
> + else
> + result = (int64) (indentry->i_counts.i_numscans);
> +
> + PG_RETURN_INT64(result);
> +}

Why didn't all these get converted to the same macro based approach as the
!xact versions?


>  Datum
>  pg_stat_get_xact_tuples_returned(PG_FUNCTION_ARGS)
>  {
>   Oid relid = PG_GETARG_OID(0);
>   int64   result;
> - PgStat_TableStatus *tabentry;
> + PgStat_IndexStatus *indentry;
>  
> - if ((tabentry = find_tabstat_entry(relid)) == NULL)
> + if ((indentry = find_indstat_entry(relid)) == NULL)
>

Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-04 Thread Thomas Munro
On Thu, Jan 5, 2023 at 12:33 PM Andres Freund  wrote:
> What about using a version of errsave() that can save FATALs too? We could
> have something roughly like the ProcessInterrupts() in the proposed patch that
> is used from within rcancelrequested(). But instead of actually throwing the
> error, we'd just remember the to-be-thrown-later error, that the next
> "real" CFI would throw.

Right, I contemplated variations on that theme.  I'd be willing to
code something like that to kick the tyres, but it seems like it would
make back-patching more painful?  We're trying to fix bugs here...
Deciding to proceed with #6 (palloc) wouldn't mean we can't eventually
also implement two phase/soft CFI() when we have a potential user, so
I don't really get the painted-into-a-corner argument.  However, it's
all moot if the #6 isn't good enough on its own merits independent of
other hypothetical future users (eg if the per regex_t MemoryContext
overheads are considered too high and can't be tuned acceptably).




Re: Some compiling warnings

2023-01-04 Thread David Rowley
On Wed, 4 Jan 2023 at 20:11, Richard Guo  wrote:
>
> When trying Valgrind I came across some compiling warnings with
> USE_VALGRIND defined and --enable-cassert not configured.  This is
> mainly because in this case we have MEMORY_CONTEXT_CHECKING defined
> while USE_ASSERT_CHECKING not defined.

> Attach a trivial patch for the fix.

I've just pushed that. Thanks.

David




Re: CI and test improvements

2023-01-04 Thread Justin Pryzby
On Tue, Nov 22, 2022 at 04:57:44PM -0600, Justin Pryzby wrote:
> I shuffled my branch around and am now sending the current "docs" patches,
> but I suppose this is waiting on the "convert CompilerWarnings task to
> meson" patch.

In case it's not, here's a version to do that now.
>From 59bbbce620fdf3c4f228c0569b1fac29beb06988 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 25 May 2022 21:53:22 -0500
Subject: [PATCH 1/7] cirrus/windows: add compiler_warnings_script

I'm not sure how to write this test in windows shell; it's also easy to
write something that doesn't work in posix sh, since windows shell is
interpreting && and ||...

https://www.postgresql.org/message-id/20220212212310.f645c6vw3njkgxka%40alap3.anarazel.de

See also:
8a1ce5e54f6d144e4f8e19af7c767b026ee0c956
https://cirrus-ci.com/task/6241060062494720
https://cirrus-ci.com/task/6496366607204352

ci-os-only: windows
---
 .cirrus.yml | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 69837bcd5ad..fd461048372 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -564,12 +564,21 @@ task:
 
   build_script: |
 vcvarsall x64
-ninja -C build
+ninja -C build |tee build.txt
+REM Since pipes lose the exit status of the preceding command, rerun the compilation
+REM without the pipe, exiting now if it fails, to avoid trying to run checks
+ninja -C build > nul
 
   check_world_script: |
 vcvarsall x64
 meson test %MTEST_ARGS% --num-processes %TEST_JOBS%
 
+  # This should be last, so check_world is run even if there are warnings
+  always:
+compiler_warnings_script:
+  # this avoids using metachars which would be interpretted by the windows shell
+  - sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0'
+
   on_failure:
 <<: *on_failure_meson
 crashlog_artifacts:
-- 
2.25.1

>From 4edbdf0337c75bd69e536b542a4cd06dac87c33d Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 4 Nov 2022 15:43:56 -0500
Subject: [PATCH 2/7] cirrus/macos: update to macos ventura

ci-os-only: macos
---
 .cirrus.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index fd461048372..858454a3d08 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -404,7 +404,7 @@ task:
 
 
 task:
-  name: macOS - Monterey - Meson
+  name: macOS - Ventura - Meson
 
   env:
 CPUS: 4 # always get that much for cirrusci macOS instances
@@ -429,7 +429,7 @@ task:
   only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*(macos|darwin|osx).*'
 
   macos_instance:
-image: ghcr.io/cirruslabs/macos-monterey-base:latest
+image: ghcr.io/cirruslabs/macos-ventura-base:latest
 
   sysinfo_script: |
 id
-- 
2.25.1

>From 2f6a7102f717d7fe5e291d6e439fd32b122180a1 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 24 Jun 2022 00:09:12 -0500
Subject: [PATCH 3/7] cirrus/freebsd: run with more CPUs+RAM and do not
 repartition

There was some historic problem where tests under freebsd took 8+ minutes (and
before 4a288a37f took 15 minutes).

This reduces test time from 10min to 3min.
4 CPUs 4 tests https://cirrus-ci.com/task/4880240739614720
4 CPUs 6 tests https://cirrus-ci.com/task/4664440120410112 https://cirrus-ci.com/task/4586784884523008
4 CPUs 8 tests https://cirrus-ci.com/task/5001995491737600

6 CPUs https://cirrus-ci.com/task/6678321684545536
8 CPUs https://cirrus-ci.com/task/6264854121021440

See also:
https://www.postgresql.org/message-id/flat/20220310033347.hgxk4pyarzq4h...@alap3.anarazel.de#f36c0b17e33e31e7925e7e5812998686
8 jobs 7min https://cirrus-ci.com/task/6186376667332608

//-os-only: freebsd
---
 .cirrus.yml | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 858454a3d08..2ba6b7cc2d8 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -131,11 +131,9 @@ task:
   name: FreeBSD - 13 - Meson
 
   env:
-# FreeBSD on GCP is slow when running with larger number of CPUS /
-# jobs. Using one more job than cpus seems to work best.
-CPUS: 2
-BUILD_JOBS: 3
-TEST_JOBS: 3
+CPUS: 4
+BUILD_JOBS: 4
+TEST_JOBS: 6
 
 CCACHE_DIR: /tmp/ccache_dir
 CPPFLAGS: -DRELCACHE_FORCE_RELEASE -DCOPY_PARSE_PLAN_TREES -DWRITE_READ_PARSE_PLAN_TREES -DRAW_EXPRESSION_COVERAGE_TEST
@@ -160,8 +158,6 @@ task:
 
   ccache_cache:
 folder: $CCACHE_DIR
-  # Work around performance issues due to 32KB block size
-  repartition_script: src/tools/ci/gcp_freebsd_repartition.sh
   create_user_script: |
 pw useradd postgres
 chown -R postgres:postgres .
-- 
2.25.1

>From b55c042ec467a03984a5e2132dc4e7fe0f49fef3 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 9 Dec 2022 15:32:50 -0600
Subject: [PATCH 4/7] cirrus/freebsd: define
 ENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

See also: 54100f5c6052404f68de9ce7310ceb61f1c291f8

ci-os-only: freebsd
---
 .cirrus.yml | 2 +-
 1 file changed, 1 insertion(+), 1 

Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-04 17:55:43 -0500, Tom Lane wrote:
> I'm not very happy with this line of development at all, because I think we
> are painting ourselves into a corner by not allowing code to detect whether
> a cancel is pending without having it happen immediately.  (That is, I do
> not believe that backend/regex/ is the only code that will ever wish for
> that.)

I first wrote that this is hard to make work without introducing overhead
(like a PG_TRY in rcancelrequested()), for a bunch of reasons discussed
upthread (see [1]).

But now I wonder if we didn't recently introduce most of the framework to make
this less hard / expensive.

What about using a version of errsave() that can save FATALs too? We could
have something roughly like the ProcessInterrupts() in the proposed patch that
is used from within rcancelrequested(). But instead of actually throwing the
error, we'd just remember the to-be-thrown-later error, which the next
"real" CFI would throw.

That still leaves us with some increased likelihood of erroring out within the
regex machinery, e.g. if there's an out-of-memory error within elog.c
processing. But I'd not be too worried about leaking memory in that corner
case. Which also could be closed using the approach in Thomas' patch, except
that it normally would still return in rcancelrequested().

Insane?

Greetings,

Andres Freund

[1] https://postgr.es/m/CA%2BhUKG%2BqtNxDQAzC20AnUxuigKYb%3D7shtmsuSyMekjni%3Dik6BA%40mail.gmail.com
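
In outline, the deferral scheme described above could look like the following
(a standalone sketch only; every name below is invented, and errsave() in
PostgreSQL does not currently capture cancel/FATAL conditions this way):

```c
#include <stdbool.h>

/* Stand-ins for the real interrupt flags checked by ProcessInterrupts(). */
static bool QueryCancelPending = false;

/* Hypothetical "saved error" slot, set instead of throwing immediately. */
static bool pending_saved_error = false;

/* Called from within the regex engine: note the cancel, but keep going. */
static int
rcancelrequested_sketch(void)
{
	if (QueryCancelPending)
	{
		QueryCancelPending = false;
		pending_saved_error = true;	/* errsave()-style deferral */
	}
	return 0;					/* never fail the regex operation itself */
}

/* The next "real" CHECK_FOR_INTERRUPTS() would throw the saved error. */
static bool
next_cfi_would_throw(void)
{
	bool		result = pending_saved_error;

	pending_saved_error = false;
	return result;
}
```

Since only the later "real" CFI throws, the regex machinery itself never exits
non-locally, which is the property being discussed in this thread.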




Re: CI and test improvements

2023-01-04 Thread Justin Pryzby
On Mon, Nov 21, 2022 at 02:45:42PM -0800, Andres Freund wrote:
> On 2022-11-13 17:53:04 -0600, Justin Pryzby wrote:
> > > > From: Justin Pryzby 
> > > > Date: Tue, 26 Jul 2022 20:30:02 -0500
> > > > Subject: [PATCH 6/8] cirrus/ccache: add explicit cache keys..
> > > >
> > > > Since otherwise, building with ci-os-only will probably fail to use the
> > > > normal cache, since the cache key is computed using both the task name
> > > > and its *index* in the list of caches (internal/executor/cache.go:184).
> > > 
> > > Seems like this would potentially better addressed by reporting a bug to 
> > > the
> > > cirrus folks?
> > 
> > You said that before, but I don't think so - since they wrote code to do
> > that, it's odd to file a bug that says that the behavior is wrong.  I am
> > curious why, but it seems deliberate.
> > 
> > https://www.postgresql.org/message-id/20220828171029.GO2342%40telsasoft.com
> 
> I suspect this is just about dealing with unnamed tasks and could be
> handled by just mixing in CI_NODE_INDEX if the task name isn't set.

I suppose it was their way of dealing with this:

|Cache artifacts are shared between tasks, so two caches with the same
|name on e.g. Linux containers and macOS VMs will share the same set of
|files. This may introduce binary incompatibility between caches. To
|avoid that, add echo $CIRRUS_OS into fingerprint_script or use
|$CIRRUS_OS in fingerprint_key, which will distinguish caches based on
|OS.

To make caches work automatically, without having to know to name them
differently.

-- 
Justin




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-04 Thread Tom Lane
Andres Freund  writes:
> Hm. Seems confusing for this to continue being called rcancelrequested() and
> to be called via if(CANCEL_REQUESTED()), if we're not even documenting that
> it's intended to be usable that way?

Yeah.  I'm not very happy with this line of development at all,
because I think we are painting ourselves into a corner by not allowing
code to detect whether a cancel is pending without having it happen
immediately.  (That is, I do not believe that backend/regex/ is the
only code that will ever wish for that.)  But if that is the direction
we're going to go in, we should probably revise these APIs to make them
less odd.  I'm not sure why we'd keep the REG_CANCEL error code at all.

> I think it might be nicer to create this below CacheMemoryContext?

Meh ... CacheMemoryContext might not exist yet, especially for the
use-cases in the login logic.

regards, tom lane




Re: meson oddities

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-04 23:17:30 +0100, Peter Eisentraut wrote:
> I meant the latter, which I see is already in there, but it doesn't actually
> fully work.  It only looks at the subdirectory (like "lib"), not the whole
> path (like "/usr/local/pgsql/lib").  With the attached patch I have it
> working and I get the same installation layout from both build systems.

Oh, oops. I tested this at some point, but I guess I over-simplified it
later on.

Then I have zero objections to this. One question below though.



>  dir_data = get_option('datadir')
> -if not (dir_data.contains('pgsql') or dir_data.contains('postgres'))
> +if not ((dir_prefix/dir_data).contains('pgsql') or 
> (dir_prefix/dir_data).contains('postgres'))
>dir_data = dir_data / pkg
>  endif

Hm. Perhaps we should just test once whether prefix contains pgsql/postgres,
and then just otherwise leave the test as is? There afaict can't be a
dir_prefix/dir_* that matches postgres/pgsql that won't also match either of
the components.

Greetings,

Andres Freund
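
Stated as code, the rule under discussion is roughly the following helper
(illustrative only; the real checks live in configure and meson.build):
append the "postgresql" package subdirectory unless the combined
prefix-plus-subdir path already mentions PostgreSQL.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Decide whether an installation subdirectory needs the extra package
 * suffix.  Checking the combined path once covers both the case where
 * the prefix identifies PostgreSQL ("/usr/local/pgsql") and the case
 * where the subdirectory does ("lib/postgresql").
 */
static bool
needs_pkg_suffix(const char *prefix, const char *subdir)
{
	char		full[1024];

	snprintf(full, sizeof(full), "%s/%s", prefix, subdir);
	return strstr(full, "pgsql") == NULL &&
		strstr(full, "postgres") == NULL;
}
```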




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-04 16:46:05 +1300, Thomas Munro wrote:
> postgres=# select 'x' ~ 'hello world .*';
> -[ RECORD 1 ]
> ?column? | f
> 
> postgres=# select * from pg_backend_memory_contexts where name =
> 'RegexpMemoryContext';
> -[ RECORD 1 ]-+-
> name  | RegexpMemoryContext
> ident | hello world .*
> parent| RegexpCacheMemoryContext
> level | 2
> total_bytes   | 13376
> total_nblocks | 5

Hm, if a trivial re uses 13kB, using ALLOCSET_SMALL_SIZES might actually
increase memory usage by increasing the number of blocks.


> free_bytes| 5144
> free_chunks   | 8
> used_bytes| 8232

Hm. So we actually have a bunch of temporary allocations in here. I assume
that's all the stuff from the "non-compact" representation that
src/backend/regex/README talks about?

It doesn't immediately look trivial to use a separate memory context for the
"final" representation and scratch memory though.


> There's some more memory allocated in regc_pg_locale.c with raw
> malloc() that could probably benefit from a pallocisation just to be
> able to measure it, but I didn't touch that here.

It might also effectively reduce the overhead of using palloc, by filling the
context up further.



> diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
> index bb8c240598..c0f8e77b49 100644
> --- a/src/backend/regex/regcomp.c
> +++ b/src/backend/regex/regcomp.c
> @@ -2471,17 +2471,17 @@ rfree(regex_t *re)
>  /*
>   * rcancelrequested - check for external request to cancel regex operation
>   *
> - * Return nonzero to fail the operation with error code REG_CANCEL,
> - * zero to keep going
> - *
> - * The current implementation is Postgres-specific.  If we ever get around
> - * to splitting the regex code out as a standalone library, there will need
> - * to be some API to let applications define a callback function for this.
> + * The current implementation always returns 0, if CHECK_FOR_INTERRUPTS()
> + * doesn't exit non-locally via ereport().  Memory allocated while compiling is
> + * expected to be cleaned up by virtue of being allocated using palloc in a
> + * suitable memory context.
>   */
>  static int
>  rcancelrequested(void)
>  {
> - return InterruptPending && (QueryCancelPending || ProcDiePending);
> + CHECK_FOR_INTERRUPTS();
> +
> + return 0;
>  }

Hm. Seems confusing for this to continue being called rcancelrequested() and
to be called via if(CANCEL_REQUESTED()), if we're not even documenting that
it's intended to be usable that way?

Seems at the minimum we ought to keep more of the old comment, to explain the
somewhat odd API?


> + /* Set up the cache memory on first go through. */
> + if (unlikely(RegexpCacheMemoryContext == NULL))
> + RegexpCacheMemoryContext =
> + AllocSetContextCreate(TopMemoryContext,
> +   "RegexpCacheMemoryContext",
> +   ALLOCSET_SMALL_SIZES);

I think it might be nicer to create this below CacheMemoryContext? Just so the
"memory context tree" stays nicely ordered.

Greetings,

Andres Freund




Re: Optimizing Node Files Support

2023-01-04 Thread Tom Lane
vignesh C  writes:
> The patch does not apply on top of HEAD as in [1], please post a rebased 
> patch:

Yeah.  The way that I'd been thinking of optimizing the copy functions
was more or less as attached: continue to write all the COPY_SCALAR_FIELD
macro calls, but just make them expand to no-ops after an initial memcpy
of the whole node.  This preserves flexibility to do something else while
still getting the benefit of substituting memcpy for retail field copies.
Having said that, I'm not very sure it's worth doing, because I do not
see any major reduction in code size:

HEAD:
   text    data     bss     dec     hex filename
  53601       0       0   53601    d161 copyfuncs.o
w/patch:
   text    data     bss     dec     hex filename
  49850       0       0   49850    c2ba copyfuncs.o

I've not looked at the generated assembly code, but I suspect that
my compiler is converting the memcpy's into inlined code that's
hardly any smaller than field-by-field assignment.  Also, it's
rather debatable that it'd be faster, especially for node types
that are mostly pointer fields, where the memcpy is going to be
largely wasted effort.

I tend to agree with John that the rest of the changes proposed
in the v1 patch are not useful improvements, especially with
no evidence offered that they'd make the code smaller or faster.

regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f2568ff5e6..1a8df587ce 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -27,8 +27,9 @@
  */
 
 /* Copy a simple scalar field (int, float, bool, enum, etc) */
+/* We now assume that scalar fields are flat-copied via initial memcpy */
 #define COPY_SCALAR_FIELD(fldname) \
-	(newnode->fldname = from->fldname)
+	((void) 0)
 
 /* Copy a field that is a pointer to some kind of Node or Node tree */
 #define COPY_NODE_FIELD(fldname) \
@@ -43,8 +44,9 @@
 	(newnode->fldname = from->fldname ? pstrdup(from->fldname) : (char *) NULL)
 
 /* Copy a field that is an inline array */
+/* These too should be taken care of by initial memcpy */
 #define COPY_ARRAY_FIELD(fldname) \
-	memcpy(newnode->fldname, from->fldname, sizeof(newnode->fldname))
+	((void) 0)
 
 /* Copy a field that is a pointer to a simple palloc'd object of size sz */
 #define COPY_POINTER_FIELD(fldname, sz) \
@@ -59,7 +61,7 @@
 
 /* Copy a parse location field (for Copy, this is same as scalar case) */
 #define COPY_LOCATION_FIELD(fldname) \
-	(newnode->fldname = from->fldname)
+	((void) 0)
 
 
 #include "copyfuncs.funcs.c"
@@ -72,8 +74,9 @@
 static Const *
 _copyConst(const Const *from)
 {
-	Const	   *newnode = makeNode(Const);
+	Const	   *newnode = palloc_object(Const);
 
+	memcpy(newnode, from, sizeof(Const));
 	COPY_SCALAR_FIELD(consttype);
 	COPY_SCALAR_FIELD(consttypmod);
 	COPY_SCALAR_FIELD(constcollid);
@@ -107,8 +110,9 @@ _copyConst(const Const *from)
 static A_Const *
 _copyA_Const(const A_Const *from)
 {
-	A_Const*newnode = makeNode(A_Const);
+	A_Const*newnode = palloc_object(A_Const);
 
+	memcpy(newnode, from, sizeof(A_Const));
 	COPY_SCALAR_FIELD(isnull);
 	if (!from->isnull)
 	{
@@ -150,8 +154,10 @@ _copyExtensibleNode(const ExtensibleNode *from)
 	const ExtensibleNodeMethods *methods;
 
 	methods = GetExtensibleNodeMethods(from->extnodename, false);
-	newnode = (ExtensibleNode *) newNode(methods->node_size,
-		 T_ExtensibleNode);
+
+	newnode = (ExtensibleNode *) palloc(methods->node_size);
+	memcpy(newnode, from, methods->node_size);
+
 	COPY_STRING_FIELD(extnodename);
 
 	/* copy the private fields */
diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index b3c1ead496..4f62fa09cb 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -671,8 +671,9 @@ foreach my $n (@node_types)
 static $n *
 _copy${n}(const $n *from)
 {
-\t${n} *newnode = makeNode($n);
+\t${n} *newnode = palloc_object($n);
 
+\tmemcpy(newnode, from, sizeof($n));
 " unless $struct_no_copy;
 
 	print $eff "


Re: Add a test to ldapbindpasswd

2023-01-04 Thread Andrew Dunstan

On 2023-01-04 We 16:26, Andrew Dunstan wrote:
> On 2023-01-02 Mo 09:45, Andrew Dunstan wrote:
>> On 2023-01-01 Su 18:31, Andrew Dunstan wrote:
>>> On 2023-01-01 Su 14:02, Thomas Munro wrote:
 On Mon, Jan 2, 2023 at 3:04 AM Andrew Dunstan  wrote:
> On 2022-12-19 Mo 11:16, Andrew Dunstan wrote:
>> There is currently no test for the use of ldapbindpasswd in the
>> pg_hba.conf file. This patch, mostly the work of John Naylor, remedies 
>> that.
>>
>>
> This currently has failures on the cfbot for meson builds on FBSD13 and
> Debian Bullseye, but it's not at all clear why. In both cases it fails
> where the ldap server is started.
 I think it's failing when using meson.  I guess it fails to fail on
 macOS only because you need to add a new path for Homebrew/ARM like
 commit 14d63dd2, so it's skipping (it'd be nice if we didn't need
 another copy of all that logic).  Trying locally... it looks like
 slapd is failing silently, and with some tracing I can see it's
 sending an error message to my syslog daemon, which logged:

 2023-01-02T07:50:20.853019+13:00 x1 slapd[153599]: main: TLS init def
 ctx failed: -1

 Ah, it looks like this test is relying on "slapd-certs", which doesn't 
 exist:

 tmunro@x1:~/projects/postgresql/build$ ls testrun/ldap/001_auth/data/
 ldap.conf  ldappassword  openldap-data  portlock  slapd-certs  slapd.conf
 tmunro@x1:~/projects/postgresql/build$ ls testrun/ldap/002_bindpasswd/data/
 portlock  slapd.conf

 I didn't look closely, but apparently there is something wrong in the
 part that copies certs from the ssl test?  Not sure why it works for
 autoconf...
>>> Let's see how we fare with this patch.
>>>
>>>
>> Not so well :-(. This version tries to make the tests totally
>> independent, as they should be. That's an attempt to get the cfbot to go
>> green, but I am intending to refactor this code substantially so the
>> common bits are in a module each test file will load.
>>
>>
> This version factors out the creation of the LDAP server into a separate
> perl Module. That makes both the existing test script and the new test
> script a lot shorter, and will be useful for the nearby patch for a hook
> for the ldapbindpassword.
>
>

Looks like I fat fingered this. Here's a version that works.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/test/ldap/LdapServer.pm b/src/test/ldap/LdapServer.pm
new file mode 100644
index 00..0b538defbf
--- /dev/null
+++ b/src/test/ldap/LdapServer.pm
@@ -0,0 +1,319 @@
+
+
+#
+# LdapServer.pm
+#
+# Module to set up an LDAP server for testing pg_hba.conf ldap authentication
+#
+# Copyright (c) 2023, PostgreSQL Global Development Group
+#
+
+
+=pod
+
+=head1 NAME
+
+LdapServer - class for an LDAP server for testing pg_hba.conf authentication
+
+=head1 SYNOPSIS
+
+  use LdapServer;
+
+  # have we found openldap binaries suitable for setting up a server?
+  my $ldap_binaries_found = $LdapServer::setup;
+
+  # create a server with the given root password and auth type
+  # (users or anonymous)
+  my $server = LdapServer->new($root_password, $auth_type);
+
+  # Add the contents of an LDIF file to the server
+  $server->ldapadd_file ($path_to_ldif_data);
+
+  # set the Ldap password for a user
+  $server->ldapsetpw($user, $password);
+
+  # get details of some settings for the server
+  my @properties = $server->prop($propname1, $propname2, ...);
+
+=head1 DESCRIPTION
+
+  LdapServer tests in its INIT phase for the presence of suitable openldap
+  binaries. Its constructor method sets up and runs an LDAP server, and any
+  servers that are set up are terminated during its END phase.
+
+=cut
+
+package LdapServer;
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use File::Copy;
+use File::Basename;
+
+# private variables
+my ($slapd, $ldap_schema_dir, @servers);
+
+# visible variable
+our ($setup);
+
+INIT
+{
+	$setup = 1;
+	if ($^O eq 'darwin' && -d '/opt/homebrew/opt/openldap')
+	{
+		# typical paths for Homebrew on ARM
+		$slapd   = '/opt/homebrew/opt/openldap/libexec/slapd';
+		$ldap_schema_dir = '/opt/homebrew/etc/openldap/schema';
+	}
+	elsif ($^O eq 'darwin' && -d '/usr/local/opt/openldap')
+	{
+		# typical paths for Homebrew on Intel
+		$slapd   = '/usr/local/opt/openldap/libexec/slapd';
+		$ldap_schema_dir = '/usr/local/etc/openldap/schema';
+	}
+	elsif ($^O eq 'darwin' && -d '/opt/local/etc/openldap')
+	{
+		# typical paths for MacPorts
+		$slapd   = '/opt/local/libexec/slapd';
+		$ldap_schema_dir = '/opt/local/etc/openldap/schema';
+	}
+	elsif ($^O eq 'linux')
+	{
+		$slapd   = '/usr/sbin/slapd';
+		$ldap_schema_dir = '/etc/ldap/schema' if -d '/etc/ldap/schema';
+		

Re: fix and document CLUSTER privileges

2023-01-04 Thread Gilles Darold

Le 04/01/2023 à 19:18, Nathan Bossart a écrit :

On Wed, Jan 04, 2023 at 02:25:13PM +0100, Gilles Darold wrote:

This is the current behavior of the CLUSTER command and current patch adds a
sentence about the silent behavior in the documentation. This is good but I
just want to ask if we could want to fix this behavior too or just keep
things like that with the lack of noise.

I've proposed something like what you are describing in another thread [0].
I intended to simply fix and document the current behavior in this thread
and to take up any new changes in the other one.

[0] https://commitfest.postgresql.org/41/4070/



Got it, this is patch add_cluster_skip_messages.patch. IMHO that patch 
should be part of this commitfest, as it is directly based on this one. 
You could create a second patch here that adds the warning message, so 
that committers can decide whether it should be applied.



--
Gilles Darold





Re: An oversight in ExecInitAgg for grouping sets

2023-01-04 Thread David Rowley
On Tue, 3 Jan 2023 at 10:25, Tom Lane  wrote:
> The thing that I find really distressing here is that it's been
> like this for years and none of our automated testing caught it.
> You'd have expected valgrind testing to do so ... but it does not,
> because we've never marked that word NOACCESS.  Maybe we should
> rethink that?  It'd require making mcxt.c do some valgrind flag
> manipulations so it could access the hdrmask when appropriate.

Yeah, that probably could have been improved during the recent change.
Here's a patch for it.

I'm just doing a final Valgrind run on it now to check for errors.

David
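
The bracketing pattern the patch extends to the whole chunk header looks like
this in isolation (the Valgrind client requests are stubbed out as no-ops
here, as they are in non-Valgrind builds; real builds take them from
valgrind/memcheck.h, and the struct and sizes below are invented):

```c
#include <stdlib.h>

/* No-op stand-ins; USE_VALGRIND builds get the real client requests. */
#define VALGRIND_MAKE_MEM_NOACCESS(addr, len)	((void) 0)
#define VALGRIND_MAKE_MEM_DEFINED(addr, len)	((void) 0)

typedef struct ChunkHeader
{
	size_t		size;			/* whole header is off-limits to callers */
} ChunkHeader;

#define CHUNKHDRSZ	sizeof(ChunkHeader)

/* Allocate: fill in the header, then forbid all access to it. */
static void *
chunk_alloc(size_t size)
{
	ChunkHeader *chunk = malloc(CHUNKHDRSZ + size);

	chunk->size = size;
	VALGRIND_MAKE_MEM_NOACCESS(chunk, CHUNKHDRSZ);
	return (char *) chunk + CHUNKHDRSZ;
}

/* Free: re-permit access to the header before touching it. */
static void
chunk_free(void *pointer)
{
	ChunkHeader *chunk = (ChunkHeader *) ((char *) pointer - CHUNKHDRSZ);

	VALGRIND_MAKE_MEM_DEFINED(chunk, CHUNKHDRSZ);
	free(chunk);
}
```

Under Valgrind, any touch of the header between those two calls gets reported
as an invalid access, which is the kind of report that would have caught the
original bug.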
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index ef10bb1690..30151d57d6 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -188,14 +188,6 @@ typedef struct AllocBlockData
char   *endptr; /* end of space in this block */
 }  AllocBlockData;
 
-/*
- * Only the "hdrmask" field should be accessed outside this module.
- * We keep the rest of an allocated chunk's header marked NOACCESS when using
- * valgrind.  But note that chunk headers that are in a freelist are kept
- * accessible, for simplicity.
- */
-#define ALLOCCHUNK_PRIVATE_LEN offsetof(MemoryChunk, hdrmask)
-
 /*
  * AllocPointerIsValid
  * True iff pointer is valid allocation pointer.
@@ -777,8 +769,8 @@ AllocSetAlloc(MemoryContext context, Size size)
VALGRIND_MAKE_MEM_NOACCESS((char *) MemoryChunkGetPointer(chunk) + size,
   chunk_size - size);
 
-   /* Disallow external access to private part of chunk header. */
-   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
+   /* Disallow access to the chunk header. */
+   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOC_CHUNKHDRSZ);
 
return MemoryChunkGetPointer(chunk);
}
@@ -823,8 +815,8 @@ AllocSetAlloc(MemoryContext context, Size size)
VALGRIND_MAKE_MEM_NOACCESS((char *) MemoryChunkGetPointer(chunk) + size,
   GetChunkSizeFromFreeListIdx(fidx) - size);
 
-   /* Disallow external access to private part of chunk header. */
-   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
+   /* Disallow access to the chunk header. */
+   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOC_CHUNKHDRSZ);
 
return MemoryChunkGetPointer(chunk);
}
@@ -989,8 +981,8 @@ AllocSetAlloc(MemoryContext context, Size size)
VALGRIND_MAKE_MEM_NOACCESS((char *) MemoryChunkGetPointer(chunk) + size,
   chunk_size - size);
 
-   /* Disallow external access to private part of chunk header. */
-   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
+   /* Disallow access to the chunk header. */
+   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOC_CHUNKHDRSZ);
 
return MemoryChunkGetPointer(chunk);
 }
@@ -1005,8 +997,8 @@ AllocSetFree(void *pointer)
AllocSetset;
MemoryChunk *chunk = PointerGetMemoryChunk(pointer);
 
-   /* Allow access to private part of chunk header. */
-   VALGRIND_MAKE_MEM_DEFINED(chunk, ALLOCCHUNK_PRIVATE_LEN);
+   /* Allow access to the chunk header. */
+   VALGRIND_MAKE_MEM_DEFINED(chunk, ALLOC_CHUNKHDRSZ);
 
if (MemoryChunkIsExternal(chunk))
{
@@ -1117,8 +1109,8 @@ AllocSetRealloc(void *pointer, Size size)
Sizeoldsize;
int fidx;
 
-   /* Allow access to private part of chunk header. */
-   VALGRIND_MAKE_MEM_DEFINED(chunk, ALLOCCHUNK_PRIVATE_LEN);
+   /* Allow access to the chunk header. */
+   VALGRIND_MAKE_MEM_DEFINED(chunk, ALLOC_CHUNKHDRSZ);
 
if (MemoryChunkIsExternal(chunk))
{
@@ -1166,8 +1158,8 @@ AllocSetRealloc(void *pointer, Size size)
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
-   /* Disallow external access to private part of chunk header. */
-   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
+   /* Disallow access to the chunk header. */
+   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOC_CHUNKHDRSZ);
return NULL;
}
 
@@ -1223,8 +1215,8 @@ AllocSetRealloc(void *pointer, Size size)
/* Ensure any padding bytes are marked NOACCESS. */
VALGRIND_MAKE_MEM_NOACCESS((char *) pointer + size, chksize - size);
 
-   /* Disallow external access to private part of chunk header. */
-   VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
+   /* Disallow access to the chunk header. */
+   

Re: meson oddities

2023-01-04 Thread Peter Eisentraut

On 04.01.23 20:35, Andres Freund wrote:

Unless someone comes up with a proposal to address the above broader issues,
also taking into account current packaging practices etc., then I think we
should do a short-term solution to either port the subdir-appending to the
meson scripts or remove it from the makefiles (or maybe a bit of both).

Just to be clear, with 'subdir-appending' you mean libdir defaulting to
'lib/x86_64-linux-gnu' (or similar)? Or do you mean adding 'postgresql' into
various dirs when the path doesn't already contain postgres?

I did try to mirror the 'postgresql' adding bit in the meson build.


I meant the latter, which I see is already in there, but it doesn't 
actually fully work.  It only looks at the subdirectory (like "lib"), 
not the whole path (like "/usr/local/pgsql/lib").  With the attached 
patch I have it working and I get the same installation layout from both 
build systems.
From 579411b5c443ce4c6f45f30627bbf891a53c8f77 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 4 Jan 2023 23:14:10 +0100
Subject: [PATCH] meson: Fix installation path computation

---
 meson.build | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/meson.build b/meson.build
index 8999170b4d..634dc7d167 100644
--- a/meson.build
+++ b/meson.build
@@ -467,19 +467,19 @@ dir_prefix = get_option('prefix')
 dir_bin = get_option('bindir')
 
 dir_data = get_option('datadir')
-if not (dir_data.contains('pgsql') or dir_data.contains('postgres'))
+if not ((dir_prefix/dir_data).contains('pgsql') or (dir_prefix/dir_data).contains('postgres'))
   dir_data = dir_data / pkg
 endif
 
 dir_sysconf = get_option('sysconfdir')
-if not (dir_sysconf.contains('pgsql') or dir_sysconf.contains('postgres'))
+if not ((dir_prefix/dir_sysconf).contains('pgsql') or (dir_prefix/dir_sysconf).contains('postgres'))
   dir_sysconf = dir_sysconf / pkg
 endif
 
 dir_lib = get_option('libdir')
 
 dir_lib_pkg = dir_lib
-if not (dir_lib_pkg.contains('pgsql') or dir_lib_pkg.contains('postgres'))
+if not ((dir_prefix/dir_lib_pkg).contains('pgsql') or (dir_prefix/dir_lib_pkg).contains('postgres'))
   dir_lib_pkg = dir_lib_pkg / pkg
 endif
 
@@ -489,7 +489,7 @@ dir_include = get_option('includedir')
 
 dir_include_pkg = dir_include
 dir_include_pkg_rel = ''
-if not (dir_include_pkg.contains('pgsql') or dir_include_pkg.contains('postgres'))
+if not ((dir_prefix/dir_include_pkg).contains('pgsql') or (dir_prefix/dir_include_pkg).contains('postgres'))
   dir_include_pkg = dir_include_pkg / pkg
   dir_include_pkg_rel = pkg
 endif
-- 
2.39.0



Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-04 Thread Andres Freund
Hi,

On 2022-12-29 00:40:52 -0800, Noah Misch wrote:
> Incidentally, the affected test contains comment "# DROP TABLE containing
> block which standby has in a pinned buffer".  The standby holds no pin at
> that moment; the LOCK TABLE pins system catalog pages, but it drops every
> pin it acquires.

I guess that comment survived from an earlier version of that test (or another
test where it was copied from).

I'm inclined to just delete it.

Greetings,

Andres Freund




Re: Logical replication - schema change not invalidating the relation cache

2023-01-04 Thread Tom Lane
vignesh C  writes:
> [ v3-0001-Fix-for-invalidating-logical-replication-relation.patch ]

(btw, please don't send multiple patch versions with the same number,
it's very confusing.)

I looked briefly at this patch.  I wonder why you wrote a whole new
callback function instead of just using rel_sync_cache_publication_cb
for this case too.

The bigger picture here though is that in examples such as the one
you gave at the top of the thread, it's not very clear to me that
there's *any* principled behavior.  If the connection between publisher
and subscriber tables is only the relation name, fine ... but exactly
which relation name applies?  If you've got a transaction that is both
inserting some data and renaming the table, it's really debatable which
insertions should be sent under which name(s).  So how much should we
actually care about such cases?  Do we really want to force a cache flush
any time somebody changes a (possibly unrelated) pg_namespace entry?
We could be giving up significant performance and not accomplishing much
except changing from one odd behavior to a different one.

regards, tom lane




Re: Rework of collation code, extensibility

2023-01-04 Thread Peter Eisentraut

On 22.12.22 06:40, Jeff Davis wrote:

On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2]
in
a new thread given its new scope.


Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

The libc hook support is still experimental, but what's working is
passing in CI, even on windows. The challenges with libc hook support
are:

  * It obviously doesn't replace all of libc, so the separation is not
as clean and there are a number of callers throughout the code that
don't necessarily care about specific collations.

  * libc relies on setlocale() / uselocale(), which is global state and
not as easy to track.

  * More platform issues (obviously) and harder to test.


I'm confused by this patch set.

It combines some refactoring that was previously posted with partial 
support for multiple ICU libraries and partial support for some new 
hooks.  Shouldn't those be three separate threads?  I think the multiple 
ICU libraries already does have a separate thread; how does this relate 
to that work?  I don't know what the hooks are supposed to be for?  What 
other locale libraries are you thinking about using this way?  How can 
we assess whether these interfaces are sufficient for that?  The 
refactoring patches don't look convincing just by looking at the numbers:


 3 files changed, 406 insertions(+), 247 deletions(-)
 6 files changed, 481 insertions(+), 150 deletions(-)
 12 files changed, 400 insertions(+), 323 deletions(-)

My sense is this is trying to do too many things at once, and those 
things are each not fully developed yet.






Re: meson oddities

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-04 16:18:38 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > If we don't do as Peter suggests, then any difference between the
> > results of one build system and the other could either be a bug or an
> > intentional deviation. There will be no easy way to know which it is.
> > And if or when people switch build systems, stuff will be randomly
> > different, and they won't understand why.

Given the difference is "localized", I think calling this out in the docs
would contain the confusion.


> > I hear your point too. It's unpleasant for you to spend a lot of
> > effort overriding meson's behavior if the result is arguably worse
> > than the default, and it has the effect of carrying forward in
> > perpetuity hacks that may not have been a good idea in the first
> > place, or may not be a good idea any more. Those seem like valid
> > concerns, too.

This specific instance luckily is trivial to change from code POV.


> Yeah.  I think the way forward probably needs to be to decide that
> we are (or are not) going to make changes to the installation tree
> layout, and then make both build systems conform to that.  I don't
> really buy the argument that it's okay to let them install different
> layouts.  I *am* prepared to listen to arguments that "this is dumb
> and we shouldn't do it anymore".

What exactly shouldn't we do anymore?

I just want to re-iterate that, in my understanding, what we're talking about
here is just whether libdir defaults to just "lib" or whether it adapts to the
platform default (so we end up with libdir as 'lib64' or
'lib/x86_64-linux-gnu' etc). And *not* whether we should continue to force
"postgresql" into the paths.

Greetings,

Andres Freund




Re: Add a test to ldapbindpasswd

2023-01-04 Thread Andrew Dunstan

On 2023-01-02 Mo 09:45, Andrew Dunstan wrote:
> On 2023-01-01 Su 18:31, Andrew Dunstan wrote:
>> On 2023-01-01 Su 14:02, Thomas Munro wrote:
>>> On Mon, Jan 2, 2023 at 3:04 AM Andrew Dunstan  wrote:
 On 2022-12-19 Mo 11:16, Andrew Dunstan wrote:
> There is currently no test for the use of ldapbindpasswd in the
> pg_hba.conf file. This patch, mostly the work of John Naylor, remedies 
> that.
>
>
 This currently has failures on the cfbot for meson builds on FBSD13 and
 Debian Bullseye, but it's not at all clear why. In both cases it fails
 where the ldap server is started.
>>> I think it's failing when using meson.  I guess it fails to fail on
>>> macOS only because you need to add a new path for Homebrew/ARM like
>>> commit 14d63dd2, so it's skipping (it'd be nice if we didn't need
>>> another copy of all that logic).  Trying locally... it looks like
>>> slapd is failing silently, and with some tracing I can see it's
>>> sending an error message to my syslog daemon, which logged:
>>>
>>> 2023-01-02T07:50:20.853019+13:00 x1 slapd[153599]: main: TLS init def
>>> ctx failed: -1
>>>
>>> Ah, it looks like this test is relying on "slapd-certs", which doesn't 
>>> exist:
>>>
>>> tmunro@x1:~/projects/postgresql/build$ ls testrun/ldap/001_auth/data/
>>> ldap.conf  ldappassword  openldap-data  portlock  slapd-certs  slapd.conf
>>> tmunro@x1:~/projects/postgresql/build$ ls testrun/ldap/002_bindpasswd/data/
>>> portlock  slapd.conf
>>>
>>> I didn't look closely, but apparently there is something wrong in the
>>> part that copies certs from the ssl test?  Not sure why it works for
>>> autoconf...
>>
>> Let's see how we fare with this patch.
>>
>>
> Not so well :-(. This version tries to make the tests totally
> independent, as they should be. That's an attempt to get the cfbot to go
> green, but I am intending to refactor this code substantially so the
> common bits are in a module each test file will load.
>
>

This version factors out the creation of the LDAP server into a separate
perl Module. That makes both the existing test script and the new test
script a lot shorter, and will be useful for the nearby patch for a hook
for the ldapbindpassword.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/test/ldap/LdapServer.pm b/src/test/ldap/LdapServer.pm
new file mode 100644
index 00..a7198e06e5
--- /dev/null
+++ b/src/test/ldap/LdapServer.pm
@@ -0,0 +1,318 @@
+
+
+#
+# LdapServer.pm
+#
+# Module to set up an LDAP server for testing pg_hba.conf ldap authentication
+#
+# Copyright (c) 2023, PostgreSQL Global Development Group
+#
+
+
+=pod
+
+=head1 NAME
+
+LdapServer - class for an LDAP server for testing pg_hba.conf authentication
+
+=head1 SYNOPSIS
+
+  use LdapServer;
+
+  # have we found openldap binaries suitable for setting up a server?
+  my $ldap_binaries_found = $LdapServer::setup;
+
+  # create a server with the given root password and auth type
+  # (users or anonymous)
+  my $server = LdapServer->new($root_password, $auth_type);
+
+  # Add the contents of an LDIF file to the server
+  $server->ldapadd_file ($path_to_ldif_data);
+
+  # set the Ldap password for a user
+  $server->ldapsetpw($user, $password);
+
+  # get details of some settings for the server
+  my @properties = $server->prop($propname1, $propname2, ...);
+
+=head1 DESCRIPTION
+
+  LdapServer tests in its INIT phase for the presence of suitable openldap
+  binaries. Its constructor method sets up and runs an LDAP server, and any
+  servers that are set up are terminated during its END phase.
+
+=cut
+
+package LdapServer;
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use File::Copy;
+use File::Basename;
+
+# private variables
+my ($slapd, $ldap_schema_dir, @servers);
+
+# visible variable
+our ($setup);
+
+INIT
+{
+	$setup = 1;
+	if ($^O eq 'darwin' && -d '/opt/homebrew/opt/openldap')
+	{
+		# typical paths for Homebrew on ARM
+		$slapd   = '/opt/homebrew/opt/openldap/libexec/slapd';
+		$ldap_schema_dir = '/opt/homebrew/etc/openldap/schema';
+	}
+	elsif ($^O eq 'darwin' && -d '/usr/local/opt/openldap')
+	{
+		# typical paths for Homebrew on Intel
+		$slapd   = '/usr/local/opt/openldap/libexec/slapd';
+		$ldap_schema_dir = '/usr/local/etc/openldap/schema';
+	}
+	elsif ($^O eq 'darwin' && -d '/opt/local/etc/openldap')
+	{
+		# typical paths for MacPorts
+		$slapd   = '/opt/local/libexec/slapd';
+		$ldap_schema_dir = '/opt/local/etc/openldap/schema';
+	}
+	elsif ($^O eq 'linux')
+	{
+		$slapd   = '/usr/sbin/slapd';
+		$ldap_schema_dir = '/etc/ldap/schema' if -d '/etc/ldap/schema';
+		$ldap_schema_dir = '/etc/openldap/schema' if -d '/etc/openldap/schema';
+	}
+	elsif ($^O eq 'freebsd')
+	{
+		$slapd   = '/usr/local/libexec/slapd';
+		

Re: verbose mode for pg_input_error_message?

2023-01-04 Thread Andrew Dunstan

On 2023-01-02 Mo 10:44, Tom Lane wrote:
> Andrew Dunstan  writes:
>> I've been wondering if it might be a good idea to have a third parameter
>> for pg_input_error_message() which would default to false, but which if
>> true would cause it to emit the detail and hint fields, if any, as well
>> as the message field from the error_data.
> I don't think that just concatenating those strings would make for a
> pleasant API.  More sensible, perhaps, to have a separate function
> that returns a record.  Or we could redefine the existing function
> that way, but I suspect that "just the primary error" will be a
> principal use-case.
>
> Being able to get the SQLSTATE is likely to be interesting too.
>
>   


OK, here's a patch along those lines.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3bf8d021c3..d44d78fa67 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24768,6 +24768,40 @@ SELECT collation for ('foo' COLLATE "de_DE");
  numeric field overflow

   
+  
+   
+
+ pg_input_error_detail
+
+pg_input_error_detail (
+  string text,
+  type text
+)
+record
+( message text,
+detail text,
+hint text,
+sql_error_code text )
+   
+   
+Tests whether the given string is valid
input for the specified data type; if not, returns the details of
+the error that would have been thrown.  If the input is valid, the
+results are NULL.  The inputs are the same as
+for pg_input_is_valid.
+   
+   
+This function will only work as desired if the data type's input
+function has been updated to report invalid input as
+a soft error.  Otherwise, invalid input will abort
+the transaction, just as if the string had been cast to the type
+directly.
+
+
+ to_json(pg_input_error_detail('{1,2', 'integer[]'))
+ {"message":"malformed array literal: \"{1,2\"","detail":"Unexpected end of input.","hint":null,"sql_error_code":"22P02"}
+
+  
  
 

diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 220ddb8c01..622b534532 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -688,6 +688,63 @@ pg_input_error_message(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(escontext.error_data->message));
 }
 
+/*
+ * pg_input_error_detail - test whether string is valid input for datatype.
+ *
+ * Returns NULL data if OK, else the primary message, detail message,
+ * hint message and sql error code from the error.
+ *
+ * This will only work usefully if the datatype's input function has been
+ * updated to return "soft" errors via errsave/ereturn.
+ */
+Datum
+pg_input_error_detail(PG_FUNCTION_ARGS)
+{
+	text	   *txt = PG_GETARG_TEXT_PP(0);
+	text	   *typname = PG_GETARG_TEXT_PP(1);
+	ErrorSaveContext escontext = {T_ErrorSaveContext};
+	TupleDesc   tupdesc;
+	Datum   values[4];
+	bool		isnull[4];
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/* Enable details_wanted */
+	escontext.details_wanted = true;
+
+	if (pg_input_is_valid_common(fcinfo, txt, typname,
+								 &escontext))
+	{
+		memset(isnull, true, sizeof(isnull));
+	}
+	else
+	{
+		Assert(escontext.error_occurred);
+		Assert(escontext.error_data != NULL);
+		Assert(escontext.error_data->message != NULL);
+
+		memset(isnull, false, sizeof(isnull));
+
+		values[0] = CStringGetTextDatum(escontext.error_data->message);
+
+		if (escontext.error_data->detail != NULL)
+			values[1] = CStringGetTextDatum(escontext.error_data->detail);
+		else
+			isnull[1] = true;
+
+		if (escontext.error_data->hint != NULL)
+			values[2] = CStringGetTextDatum(escontext.error_data->hint);
+		else
+			isnull[2] = true;
+
+		values[3] = CStringGetTextDatum(
+			unpack_sql_state(escontext.error_data->sqlerrcode));
+	}
+
+	return HeapTupleGetDatum(heap_form_tuple(tupdesc, values, isnull));
+}
+
 /* Common subroutine for the above */
 static bool
 pg_input_is_valid_common(FunctionCallInfo fcinfo,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7be9a50147..dfc0846f6f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7081,6 +7081,14 @@
   descr => 'get error message if string is not valid input for data type',
   proname => 'pg_input_error_message', provolatile => 's', prorettype => 'text',
   proargtypes => 'text text', prosrc => 'pg_input_error_message' },
+{ oid => '8052',
+  descr => 'get error details if string is not valid input for data type',
+  proname => 'pg_input_error_detail', provolatile => 's',
+  prorettype => 'record', proargtypes => 'text text',
+  proallargtypes => '{text,text,text,text,text,text}',
+  

Re: meson oddities

2023-01-04 Thread Tom Lane
Robert Haas  writes:
> If we don't do as Peter suggests, then any difference between the
> results of one build system and the other could either be a bug or an
> intentional deviation. There will be no easy way to know which it is.
> And if or when people switch build systems, stuff will be randomly
> different, and they won't understand why.

> I hear your point too. It's unpleasant for you to spend a lot of
> effort overriding meson's behavior if the result is arguably worse
> than the default, and it has the effect of carrying forward in
> perpetuity hacks that may not have been a good idea in the first
> place, or may not be a good idea any more. Those seem like valid
> concerns, too.

Yeah.  I think the way forward probably needs to be to decide that
we are (or are not) going to make changes to the installation tree
layout, and then make both build systems conform to that.  I don't
really buy the argument that it's okay to let them install different
layouts.  I *am* prepared to listen to arguments that "this is dumb
and we shouldn't do it anymore".

regards, tom lane




Re: Getting an error if we provide --enable-tap-tests switch on SLES 12

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-04 17:27:55 +0530, tushar wrote:
> We found that if we provide the *--enable-tap-tests* switch at the time of PG
> source configuration, we get the error below:
> "
> checking for Perl modules required for TAP tests... Can't locate IPC/Run.pm
> in @INC (you may need to install the IPC::Run module) (@INC contains:
> /usr/lib/perl5/site_perl/5.18.2/x86_64-linux-thread-multi
> /usr/lib/perl5/site_perl/5.18.2
> /usr/lib/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
> /usr/lib/perl5/vendor_perl/5.18.2
> /usr/lib/perl5/5.18.2/x86_64-linux-thread-multi /usr/lib/perl5/5.18.2
> /usr/lib/perl5/site_perl .) at ./config/check_modules.pl line 11.
> 
> BEGIN failed--compilation aborted at ./config/check_modules.pl line 11.
> 
> configure: error: Additional Perl modules are required to run TAP tests
> "
> 
> looks like this is happening because the Perl-IPC-Run package is not
> available on SLES 12, whereas Perl-IPC-Run3 is available.

Hm. It's available in newer suse versions:
https://scc.suse.com/packages/22892843


> Srinu (my teammate) found that IPC::Run is hard-coded in config/
> check_modules.pl, and if we replace Run with Run3 it works (patch is attached,
> created by Srinu)

I don't think that can work. The patch changes what configure tests, but none
of the many uses of IPC::Run in the tests. And I don't think IPC::Run3
actually provides all the features of IPC::Run we use.

Have you actually tested running the tests with the patch applied?


> Do we have any better option to work without this workaround?

You could install the module via cpan :/.

Greetings,

Andres Freund




Re: meson oddities

2023-01-04 Thread Robert Haas
On Wed, Jan 4, 2023 at 2:35 PM Andres Freund  wrote:
> > I think we should get the two build systems to produce the same installation
> > layout when given equivalent options.
>
> I'm not convinced that that's the right thing to do. Distributions have
> helper infrastructure for buildsystems - why should we make it harder for them
> by deviating further from the buildsystem defaults?

If we don't do as Peter suggests, then any difference between the
results of one build system and the other could either be a bug or an
intentional deviation. There will be no easy way to know which it is.
And if or when people switch build systems, stuff will be randomly
different, and they won't understand why.

I hear your point too. It's unpleasant for you to spend a lot of
effort overriding meson's behavior if the result is arguably worse
than the default, and it has the effect of carrying forward in
perpetuity hacks that may not have been a good idea in the first
place, or may not be a good idea any more. Those seem like valid
concerns, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Peter Geoghegan
On Wed, Jan 4, 2023 at 10:41 AM Andres Freund  wrote:
> > It's currently possible for VACUUM to set the all-frozen bit while
> > unsetting the all-visible bit, due to a race condition [1]. This is
> > your long standing bug. So apparently nobody is qualified to commit
> > patches in this area.
>
> That's a non-sequitur. Bugs are a fact of programming.

I agree.

> > About a year ago, there was a massive argument over some earlier work
> > in the same general area, by me. Being the subject of a pile-on on
> > this mailing list is something that I find deeply upsetting and
> > demoralizing. I just cannot take much more of it. At the same time,
> > I've made quite an investment in the pending patches, and think that
> > it's something that I have to see through.
>
> I'm, genuinely!, sorry that you feel piled on. That wasn't, and isn't, my
> goal.

Apology accepted.

I am making a simple, practical point here, too: I'm much too selfish
a person to continue to put myself in this position. I have nothing to
prove, and have little to gain over what I'd get out of working in
various other areas. I wasn't hired by my current employer to work on
VACUUM in particular. In the recent past I have found ways to be very
productive in other areas, without any apparent risk of protracted,
stressful fights -- which is something that I plan on getting back to
soon. I just don't have the stomach for this. It just isn't worth it
to me.

> I think the area of code desperately needs work. I complained because I
> didn't like the process and was afraid of the consequences and the perceived
> need on my part to do post-commit reviews.

The work that I did in 15 (in particular commit 0b018fab, the "oldest
extant XID" commit) really isn't very useful without the other patches
in place -- it was always supposed to be one piece of a larger whole.
It enables the freezing stuff because VACUUM now "gets credit" for
proactive freezing in a way that it didn't before. The motivating
examples wiki page shows examples of this [1].

Once the later patches are in place, the 15/16 work on VACUUM will be
complete, and I can walk away from working on VACUUM having delivered
a very useful improvement to performance stability -- a good outcome
for everybody. If you and Robert can find a way to accommodate that,
then in all likelihood we won't need to have any more heated and
protracted arguments like the one from early in 2022. I will be quite
happy to get back to working on B-Tree, likely the skip scan work.

 [1] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
--
Peter Geoghegan




Re: allowing for control over SET ROLE

2023-01-04 Thread Robert Haas
On Tue, Jan 3, 2023 at 5:03 PM Noah Misch  wrote:
> I'd start with locations where the patch already added documentation.  In the
> absence of documentation otherwise, a reasonable person could think WITH SET
> controls just SET ROLE.  The documentation of WITH SET is a good place to list
> what else you opted for it to control.  If the documentation can explain the
> set of principles that would be used to decide whether WITH SET should govern
> another thing in the future, that would provide extra value.

From the point of view of the code, we currently have four different
functions that make inquiries about role membership:
has_privs_of_role, is_member_of_role, is_member_of_role_nosuper, and
member_can_set_role.

I spent a while looking at how has_privs_of_role() is used. Basically,
there are three main patterns. First, in some places, you must have
the privileges of a certain role (typically, either a predefined role
or the role that owns some object) or the operation will fail with an
error indicating that you don't have sufficient permissions. Second,
there are places where having the privileges of a certain role exempts
you from some other permissions check; if you have neither, you'll get
an error. An example is that having the permissions of
pg_read_all_data substitutes for a select privilege. And third, there
are cases where you definitely won't get an error, but the behavior
will vary depending on whether you have the privileges of some role.
For instance, you can see more data in pg_stat_replication,
pg_stat_wal_receiver, and other stats views if you have
pg_read_all_stats. The GUC values reported in EXPLAIN output will
exclude superuser-only values unless you have pg_read_all_settings. It
looks like some maintenance commands like CLUSTER and VACUUM
completely skip over, or just warn about, cases where permission is
lacking. And weirdest of all, having the privileges of a role means
that the RLS policies applied to that role also apply to you. That's
odd because it makes permissions not strictly additive.

member_can_set_role() controls (a) whether you can SET ROLE to some
other role, (b) whether you can alter the owner of an existing object
to that role, and (c) whether you can create an object owned by some
other user in cases where the CREATE command has an option for that,
like CREATE DATABASE ... OWNER.

is_member_of_role_nosuper() is used to prevent creation of role
membership loops, and for pg_hba.conf matching.

The only remaining call to is_member_of_role() is in
pg_role_aclcheck(), which just supports the SQL-callable
pg_has_role(). has_privs_of_role() and member_can_set_role() are used
here, too.
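
A minimal sketch of how these checks diverge in SQL terms, using hypothetical
roles alice and bob (a sketch under the WITH INHERIT/SET grant options
discussed here, not a definitive description of the behavior):

```sql
CREATE ROLE bob;
CREATE ROLE alice LOGIN;

-- With INHERIT TRUE, alice passes has_privs_of_role() checks for bob
-- (e.g. a SELECT privilege bob holds) without any SET ROLE -- and,
-- per the oddity noted above, bob's RLS policies also apply to her.
GRANT bob TO alice WITH INHERIT TRUE, SET FALSE;

-- With SET TRUE instead, alice would pass member_can_set_role():
-- she could SET ROLE bob or ALTER ... OWNER TO bob, but would not
-- automatically exercise bob's privileges.
-- GRANT bob TO alice WITH INHERIT FALSE, SET TRUE;

-- pg_has_role() with 'MEMBER' reflects plain membership:
SELECT pg_has_role('alice', 'bob', 'MEMBER');
```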

How much of this should we document, do you think? If we're going to
go into the details, I sort of feel like it would be good to somehow
contrast what is attached to membership with what is attached to the
INHERIT option or the SET option. I think it would be slightly
surprising not to mention the way that RLS rules are triggered by
privilege inheritance yet include the fact that the SET option affects
ALTER ... OWNER TO, but maybe I've got the wrong idea. An exhaustive
concordance of what depends on what might be too much, or maybe it
isn't, but it's probably good if the level of detail is pretty
uniform.

Your thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: moving extraUpdatedCols out of RangeTblEntry (into ModifyTable)

2023-01-04 Thread Tom Lane
I wrote:
> After further thought: maybe we should get radical and postpone this
> work all the way to executor startup.  The downside of that is having
> to do it over again on each execution of a prepared plan.  But the
> upside is that when the UPDATE targets a many-partitioned table,
> we would have a chance at not doing the work at all for partitions
> that get pruned at runtime.  I'm not sure if that win would emerge
> immediately or if we still have executor work to do to manage pruning
> of the target table.  I'm also not sure that this'd be a net win
> overall.  But it seems worth considering.

Here's a draft patch that does it like that.  This seems like a win
for more reasons than just pruning, because I was able to integrate
the calculation into runtime setup of the expressions, so that we
aren't doing an extra stringToNode() on them.

There's still a code path that does such a calculation at plan time
(get_rel_all_updated_cols), but it's only used by postgres_fdw which
has some other rather-inefficient behaviors in the same area.

I've not looked into what it'd take to back-patch this.  We can't
add a field to ResultRelInfo in released branches (cf 4b3e37993),
but we might be able to repurpose RangeTblEntry.extraUpdatedCols.

regards, tom lane

diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 8bcf4784eb..8c82c71675 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -1342,25 +1342,16 @@ ExecGetUpdatedCols(ResultRelInfo *relinfo, EState *estate)
 Bitmapset *
 ExecGetExtraUpdatedCols(ResultRelInfo *relinfo, EState *estate)
 {
-	if (relinfo->ri_RangeTableIndex != 0)
-	{
-		RangeTblEntry *rte = exec_rt_fetch(relinfo->ri_RangeTableIndex, estate);
+#ifdef USE_ASSERT_CHECKING
+	/* Verify that ExecInitStoredGenerated has been called if needed. */
+	Relation	rel = relinfo->ri_RelationDesc;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
 
-		return rte->extraUpdatedCols;
-	}
-	else if (relinfo->ri_RootResultRelInfo)
-	{
-		ResultRelInfo *rootRelInfo = relinfo->ri_RootResultRelInfo;
-		RangeTblEntry *rte = exec_rt_fetch(rootRelInfo->ri_RangeTableIndex, estate);
-		TupleConversionMap *map = ExecGetRootToChildMap(relinfo, estate);
+	if (tupdesc->constr && tupdesc->constr->has_generated_stored)
+		Assert(relinfo->ri_GeneratedExprs != NULL);
+#endif
 
-		if (map != NULL)
-			return execute_attr_map_cols(map->attrMap, rte->extraUpdatedCols);
-		else
-			return rte->extraUpdatedCols;
-	}
-	else
-		return NULL;
+	return relinfo->ri_extraUpdatedCols;
 }
 
 /* Return columns being updated, including generated columns */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 56398c399c..687a5422ea 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -54,6 +54,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/optimizer.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -352,6 +353,93 @@ ExecCheckTIDVisible(EState *estate,
 	ExecClearTuple(tempSlot);
 }
 
+/*
+ * Initialize to compute stored generated columns for a tuple
+ *
+ * This fills the resultRelInfo's ri_GeneratedExprs and ri_extraUpdatedCols
+ * fields.  (Currently, ri_extraUpdatedCols is consulted only in UPDATE,
+ * but we might as well fill it for INSERT too.)
+ */
+static void
+ExecInitStoredGenerated(ResultRelInfo *resultRelInfo,
+		EState *estate,
+		CmdType cmdtype)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	int			natts = tupdesc->natts;
+	Bitmapset  *updatedCols;
+	MemoryContext oldContext;
+
+	/* Don't call twice */
+	Assert(resultRelInfo->ri_GeneratedExprs == NULL);
+
+	/* Nothing to do if no generated columns */
+	if (!(tupdesc->constr && tupdesc->constr->has_generated_stored))
+		return;
+
+	/*
+	 * In an UPDATE, we can skip computing any generated columns that do not
+	 * depend on any UPDATE target column.  But if there is a BEFORE ROW
+	 * UPDATE trigger, we cannot skip because the trigger might change more
+	 * columns.
+	 */
+	if (cmdtype == CMD_UPDATE &&
+		!(rel->trigdesc && rel->trigdesc->trig_update_before_row))
+		updatedCols = ExecGetUpdatedCols(resultRelInfo, estate);
+	else
+		updatedCols = NULL;
+
+	/*
+	 * Make sure these data structures are built in the per-query memory
+	 * context so they'll survive throughout the query.
+	 */
+	oldContext = MemoryContextSwitchTo(estate->es_query_cxt);
+
+	resultRelInfo->ri_GeneratedExprs =
+		(ExprState **) palloc0(natts * sizeof(ExprState *));
+	resultRelInfo->ri_NumGeneratedNeeded = 0;
+
+	for (int i = 0; i < natts; i++)
+	{
+		if (TupleDescAttr(tupdesc, i)->attgenerated == ATTRIBUTE_GENERATED_STORED)
+		{
+			Expr	   *expr;
+
+			/* Fetch the GENERATED AS expression tree */
+			expr = (Expr *) build_column_default(rel, i + 1);
+			if (expr 

Re: grouping pushdown

2023-01-04 Thread David Rowley
On Wed, 4 Jan 2023 at 23:21, Spring Zhong  wrote:
> The plan is apparently inefficient, since the hash aggregate goes after the 
> Cartesian product. We could expect the query's performance to improve greatly 
> if the HashAggregate node can be pushed down to the SCAN node.

> Does anyone have suggestions on this?

I think this is being worked on. See [1].

David

[1] https://commitfest.postgresql.org/41/3764/
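
A minimal sketch of the manual rewrite such a pushdown would automate,
assuming tables t1(i1) and t2(i2) as in the quoted example (an
illustration only, not the patch's implementation):

```sql
-- Instead of grouping after the Cartesian product:
--   SELECT i1, i2 FROM t1, t2 GROUP BY i1, i2;
-- group each side first, then join the (much smaller) distinct sets:
SELECT d1.i1, d2.i2
FROM   (SELECT DISTINCT i1 FROM t1) AS d1,
       (SELECT DISTINCT i2 FROM t2) AS d2;
```

For this aggregate-free case the two forms return the same rows, which
is why pushing the grouping below the join can pay off so dramatically.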




Re: meson oddities

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-04 12:35:35 +0100, Peter Eisentraut wrote:
> On 16.11.22 18:07, Andres Freund wrote:
> > > > If I just want to install postgres into a prefix without 'postgresql' 
> > > > added in
> > > > a bunch of directories, e.g. because I already have pg-$version to be 
> > > > in the
> > > > prefix, there's really no good way to do so - you can't even specify
> > > > --sysconfdir or such, because we just override that path.
> > >
> > > At least for the libraries, the point of the 'postgresql' subdir IMO
> > > is to keep backend-loadable extensions separate from random libraries.
> > > It's not great that we may fail to do that depending on what the
> > > initial part of the library path is.
> >
> > Agreed, extensions really should never be in a path searched by the dynamic
> > linker, even if the prefix contains 'postgres'.
> >
> > To me that's a separate thing from adding postgresql to datadir, sysconfdir,
> > includedir, docdir... On a green field I'd say the 'extension library'
> > directory should just always be extensions/ or such.
>
> I think we should get the two build systems to produce the same installation
> layout when given equivalent options.

I'm not convinced that that's the right thing to do. Distributions have
helper infrastructure for buildsystems - why should we make it harder for them
by deviating further from the buildsystem defaults?

I have yet to hear an argument why installing libraries below
/usr/[local]/lib/{x86_64,i386,...}-linux-{gnu,musl,...}/ is the wrong thing to
do on Debian based systems (or similar, choosing lib64 over lib on RH based
systems).  But at the same time I haven't heard of an argument why we should
break existing scripts building with autoconf for this. To me a different
buildsystem is a convenient point to adapt to the build-path conventions of the last decade.


> Unless someone comes up with a proposal to address the above broader issues,
> also taking into account current packaging practices etc., then I think we
> should do a short-term solution to either port the subdir-appending to the
> meson scripts or remove it from the makefiles (or maybe a bit of both).

Just to be clear, with 'subdir-appending' you mean libdir defaulting to
'lib/x86_64-linux-gnu' (or similar)? Or do you mean adding 'postgresql' into
various dirs when the path doesn't already contain postgres?

I did try to mirror the 'postgresql' adding bit in the meson build.

Greetings,

Andres Freund




Re: Add index scan progress to pg_stat_progress_vacuum

2023-01-04 Thread Imseih (AWS), Sami
Thanks for the review!

Addressed the comments.

> "Increment the indexes completed." (dot at the end) instead?

Used the commenting format used in other places in this
file, with the inclusion of a double dash, i.e.
/* Wraparound emergency -- end current index scan */

> It seems to me that "#define REPORT_PARALLEL_VACUUM_EVERY_PAGES 
> ((BlockNumber) (1024 * 1024 * 1024 / BLCKSZ))" would be fine too.

I kept this the same as it matches what we are doing in other places such
as FAILSAFE_EVERY_PAGES

v20 attached.


Regards,

--
Sami Imseih
Amazon Web Services: https://aws.amazon.com



v20-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch
Description: v20-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch


Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-04 09:59:37 -0800, Peter Geoghegan wrote:
> On Wed, Jan 4, 2023 at 7:03 AM Robert Haas  wrote:
> > and which functions return fully reliable results, I do
> > not think you should be committing your own patches in this area.
> 
> My mistake here had nothing to do with my own goals. I was trying to
> be diligent by hardening an existing check in passing, and it
> backfired.

When moving code around I strongly suggest to make as much of a diff to be
"move only". I find
  git diff --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space
quite helpful for that.

Being able to see just the "really changed" lines makes it a lot easier to see
the crucial parts of a change.


> > There is probably a lot of potential benefit in improving the way this
> > stuff works, but there is also a heck of a lot of danger of creating
> > subtle data corrupting bugs that could easily take years to find.
> 
> It's currently possible for VACUUM to set the all-frozen bit while
> unsetting the all-visible bit, due to a race condition [1]. This is
> your long standing bug. So apparently nobody is qualified to commit
> patches in this area.

That's a non-sequitur. Bugs are a fact of programming.


> About a year ago, there was a massive argument over some earlier work
> in the same general area, by me. Being the subject of a pile-on on
> this mailing list is something that I find deeply upsetting and
> demoralizing. I just cannot take much more of it. At the same time,
> I've made quite an investment in the pending patches, and think that
> it's something that I have to see through.

I'm, genuinely!, sorry that you feel piled on. That wasn't, and isn't, my
goal. I think the area of code desperately needs work. I complained because I
didn't like the process and was afraid of the consequences and the perceived
need on my part to do post-commit reviews.

Greetings,

Andres Freund




Re: fix and document CLUSTER privileges

2023-01-04 Thread Nathan Bossart
On Wed, Jan 04, 2023 at 02:25:13PM +0100, Gilles Darold wrote:
> This is the current behavior of the CLUSTER command, and the current patch
> adds a sentence about the silent behavior to the documentation. This is good,
> but I just want to ask whether we want to fix this behavior too, or keep
> things as they are, without the noise.

I've proposed something like what you are describing in another thread [0].
I intended to simply fix and document the current behavior in this thread
and to take up any new changes in the other one.

[0] https://commitfest.postgresql.org/41/4070/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-04 Thread Nathan Bossart
On Wed, Jan 04, 2023 at 10:57:43AM +0530, Amit Kapila wrote:
> On Tue, Jan 3, 2023 at 11:40 PM Nathan Bossart  
> wrote:
>> My approach was to add a variable to LogicalRepWorker that indicated
>> whether a worker needed to be restarted immediately.  While this is a
>> little weird because the workers array is treated as slots, it worked
>> nicely for ALTER SUBSCRIPTION.
> 
> So, are you planning to keep its in_use and subid flag as it is in
> logicalrep_worker_cleanup()? Otherwise, without that it could be
> reused for some other subscription.

I believe I did something like this in my proof-of-concept.  I might have
used the new flag as another indicator that the slot was still "in use".
In any case, you are right that we need to prevent the slot from being
reused.

> What if we maintain a hash table similar to 'last_start_times'
> maintained in tablesync.c? It won't have entries for new
> subscriptions, so for those we may not need to wait till
> wal_retrieve_retry_interval.

I proposed this upthread [0].  I still think it is a worthwhile change.
Right now, if a worker needs to be restarted but another unrelated worker
was restarted less than wal_retrieve_retry_interval milliseconds ago, the
launcher waits to restart it.  I think it makes more sense for each worker
to have its own restart interval tracked.
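
A sketch of that per-worker bookkeeping (illustrative Python, not PostgreSQL code; the class and method names are invented for this sketch):

```python
import time

WAL_RETRIEVE_RETRY_INTERVAL = 5.0  # seconds; stands in for the GUC's role here

class Launcher:
    """Tracks a last-start time per subscription instead of one global one."""

    def __init__(self):
        self.last_start = {}  # subid -> monotonic timestamp

    def may_start_worker(self, subid, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_start.get(subid)
        if last is not None and now - last < WAL_RETRIEVE_RETRY_INTERVAL:
            return False  # this subscription's worker restarted too recently
        self.last_start[subid] = now
        return True

    def restart_immediately(self, subid):
        # e.g. after ALTER SUBSCRIPTION: clear the throttle for this worker only
        self.last_start.pop(subid, None)
```

With per-subscription entries, restarting one worker never delays an unrelated one, and ALTER SUBSCRIPTION can clear just its own entry.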

>> IIUC you are suggesting just one variable that would bypass
>> wal_retrieve_retry_interval for all subscriptions, not just those newly
>> altered or created.  This definitely seems like it would prevent delays,
>> but it would also cause wal_retrieve_retry_interval to be incorrectly
>> bypassed for the other workers in some cases.
>
> Right, but I guess it would be rare in practical cases that someone
> Altered/Created a subscription, and also some workers are restarted
> due to errors/crashes as only in those cases launcher can restart the
> worker when it shouldn't. However, in that case, also, it won't
> restart the apply worker again and again unless there are concurrent
> Create/Alter Subscription operations going on. IIUC, currently also it
> can always first time restart the worker immediately after ERROR/CRASH
> because we don't maintain last_start_time for each worker. I think
> this is probably okay as we want to avoid repeated restarts after the
> ERROR.

This line of thinking is why I felt that lowering
wal_retrieve_retry_interval for the tests might be sufficient.  Besides the
fact that it revealed multiple bugs, I don't see the point in adding much
more complexity here.  In practice, workers will usually start right away,
unless of course there are other worker starts happening around the same
time.  This consistently causes testing delays because the tests stress
these code paths, but I don't think what the tests are doing is a typical
use-case.

From the discussion thus far, it sounds like the alternatives are to 1) add
a global flag that causes wal_retrieve_retry_interval to be bypassed for
all workers or to 2) add a hash map in the launcher and a
restart_immediately flag in each worker slot.  I'll go ahead and create a
patch for 2 since it seems like the most complete solution, and we can
evaluate whether the complexity seems appropriate.

[0] https://postgr.es/m/20221214171023.GA689106%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




pgbench - adding pl/pgsql versions of tests

2023-01-04 Thread Hannu Krosing
Hello Hackers,

The attached patch adds pl/pgsql versions of "tpcb-like" and
"simple-update" internal test scripts

The tests are functionally identical, but generally faster, as they avoid
most client-server latency.

The reasons I'd like to have them as part of pgbench are twofold:

1. so I don't have to create the script and function manually each
time I want to test mainly the database (instead of the
client-database system)

2. so that new users of PostgreSQL can easily see how much better OLTP
workloads perform when packaged up as a server-side function

The new user-visible functionality is two new built-in scripts, listed by -b list:

$ pgbench -b list
Available builtin scripts:
  tpcb-like: 
  plpgsql-tpcb-like: 
  simple-update: 
  plpgsql-simple-update: 
  select-only: 

which one can run  using the -b / --builtin= option

pgbench -b plpgsql-tpcb-like ...
or
pgbench -b plpgsql-simple-update ...

And a flag --no-functions which skips creating the functions at init.

There are also character flags for -I / --init:
-- Y to drop the functions and
-- y to create the functions. Creating is the default behaviour, but it can
be disabled via the long flag --no-functions.

I selected Yy as they were unused and can be thought of as "inverted
lambda symbol" :)

If there are no strong objections, I'll add it to the commitfest as well.

-
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index c4a44debeb..1edcec2f5c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -172,8 +172,8 @@ typedef struct socket_set
 /
  * some configurable parameters */
 
-#define DEFAULT_INIT_STEPS "dtgvp"	/* default -I setting */
-#define ALL_INIT_STEPS "dtgGvpf"	/* all possible steps */
+#define DEFAULT_INIT_STEPS "dYtgvpy"	/* default -I setting */
+#define ALL_INIT_STEPS "dYtgGvpfy"	/* all possible steps */
 
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
@@ -800,6 +800,15 @@ static const BuiltinScript builtin_script[] =
 		"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
 		"END;\n"
 	},
+	{
+		"plpgsql-tpcb-like",
+		"",
+		"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+		"\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+		"\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+		"\\set delta random(-5000, 5000)\n"
+		"SELECT 1 FROM pgbench_tpcb_like(:aid, :bid, :tid, :delta);\n"
+	},
 	{
 		"simple-update",
 		"",
@@ -813,6 +822,15 @@ static const BuiltinScript builtin_script[] =
 		"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
 		"END;\n"
 	},
+	{
+		"plpgsql-simple-update",
+		"",
+		"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+		"\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+		"\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+		"\\set delta random(-5000, 5000)\n"
+		"SELECT 1 FROM pgbench_simple_update(:aid, :bid, :tid, :delta);\n"
+	},
 	{
 		"select-only",
 		"",
@@ -885,6 +903,7 @@ usage(void)
 		   "  -q, --quiet  quiet logging (one message each 5 seconds)\n"
 		   "  -s, --scale=NUM  scaling factor\n"
 		   "  --foreign-keys   create foreign key constraints between tables\n"
+		   "  --no-functions   do not create pl/pgsql functions for internal scripts\n"
 		   "  --index-tablespace=TABLESPACE\n"
 		   "   create indexes in the specified tablespace\n"
 		   "  --partition-method=(range|hash)\n"
@@ -4836,6 +4855,54 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
+/*
+ * Create the functions needed for plpgsql-* builtin scripts
+ */
+static void
+initCreateFuntions(PGconn *con)
+{
+	fprintf(stderr, "creating functions...\n");
+
+	executeStatement(con, 
+		"CREATE FUNCTION pgbench_tpcb_like(_aid int, _bid int, _tid int, _delta int)\n"
+		"RETURNS void\n"
+		"LANGUAGE plpgsql\n"
+		"AS $plpgsql$\n"
+		"BEGIN\n"
+		"UPDATE pgbench_accounts SET abalance = abalance + _delta WHERE aid = _aid;\n"
+		"PERFORM abalance FROM pgbench_accounts WHERE aid = _aid;\n"
+		"UPDATE pgbench_tellers SET tbalance = tbalance + _delta WHERE tid = _tid;\n"
+		"UPDATE pgbench_branches SET bbalance = bbalance + _delta WHERE bid = _bid;\n"
+		"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (_tid, _bid, _aid, _delta, CURRENT_TIMESTAMP);\n"
+		"END;\n"
+		"$plpgsql$;\n");
+	executeStatement(con,
+		"CREATE FUNCTION pgbench_simple_update(_aid int, _bid int, _tid int, _delta int)\n"
+		"RETURNS void\n"
+		"LANGUAGE plpgsql\n"
+		"AS $plpgsql$\n"
+		"BEGIN\n"
+		"UPDATE pgbench_accounts SET 

Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Peter Geoghegan
On Wed, Jan 4, 2023 at 7:03 AM Robert Haas  wrote:
> But that having been said, I'm kind of astonished that you didn't know
> about this already. The freezing behavior is in general extremely hard
> to get right, and I guess I feel if you don't understand how the
> underlying functions work, including things like performance
> considerations

I was the one that reported the issue with CLOG lookups in the first place.

> and which functions return fully reliable results, I do
> not think you should be committing your own patches in this area.

My mistake here had nothing to do with my own goals. I was trying to
be diligent by hardening an existing check in passing, and it
backfired.

> There is probably a lot of potential benefit in improving the way this
> stuff works, but there is also a heck of a lot of danger of creating
> subtle data corrupting bugs that could easily take years to find.

It's currently possible for VACUUM to set the all-frozen bit while
unsetting the all-visible bit, due to a race condition [1]. This is
your long standing bug. So apparently nobody is qualified to commit
patches in this area.

About a year ago, there was a massive argument over some earlier work
in the same general area, by me. Being the subject of a pile-on on
this mailing list is something that I find deeply upsetting and
demoralizing. I just cannot take much more of it. At the same time,
I've made quite an investment in the pending patches, and think that
it's something that I have to see through.

If I am allowed to finish what I've started, then I will stop all new
work on VACUUM. I'll go back to working on B-Tree indexing. Nobody is
asking me to focus on VACUUM, and there are plenty of other things
that I could be doing that don't seem to lead to these situations.

[1] 
https://postgr.es/m/cah2-wznungszf8v6osgjac5aysb3cz6hw6mlm30x0d65cms...@mail.gmail.com
--
Peter Geoghegan




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-04 Thread Robert Haas
On Wed, Jan 4, 2023 at 11:36 AM Tom Lane  wrote:
> As you well know, psql's FETCH_COUNT mechanism is far older than
> single-row mode.  I don't think anyone's tried to transpose it
> onto that.  I agree that it seems like a good idea to try.
> There will be more per-row overhead, but the increase in flexibility
> is likely to justify that.

Yeah, I was vaguely worried that there might be more per-row overhead,
not that I know a lot about this topic. I wonder if there's a way to
mitigate that. I'm a bit suspicious that what we want here is really
more of an incremental mode than a single-row mode i.e. yeah, you want
to fetch rows without materializing the whole result, but maybe not in
batches of exactly size one.
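
The "incremental mode" idea — fetch without materializing the whole result, but in batches larger than one — boils down to the following loop (a language-neutral sketch in Python; `paginate` and its arguments are invented for illustration, not an existing API):

```python
def paginate(rows, batch_size):
    """Yield rows in fixed-size batches so only batch_size rows
    are held in memory at once (cf. psql's FETCH_COUNT)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final short batch
        yield batch
```

With batch_size set to FETCH_COUNT, this gives cursor-like memory behavior on top of row-at-a-time retrieval, amortizing the per-row overhead across a batch.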

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] CF app: add "Returned: Needs more interest"

2023-01-04 Thread Jacob Champion
On Tue, Jan 3, 2023 at 8:56 PM vignesh C  wrote:
> I'm not sure if this should be included in commitfest as we generally
> include the postgres repository patches in the commitfest. I felt we
> could have the discussion in the thread and remove the entry from
> commitfest.

Is there a good way to remind people that, hey, this exists as a
patchset? (Other than me pinging the list every so often.)

--Jacob




Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-04 Thread Nathan Bossart
On Wed, Jan 04, 2023 at 09:41:47AM +0530, Amit Kapila wrote:
> I am not sure if I understand the problem you are trying to solve with
> this part of the patch. Are you worried that after we mark some of the
> relation's state as READY, all the table syncs are in the READY state
> but we will not immediately try to check the two_phase stuff and
> probably the apply worker may sleep before the next time it invokes
> process_syncing_tables_for_apply()? 

Yes.

> If so, we probably also need to
> ensure that table_states_valid is marked false probably via
> invalidations so that we can get the latest state and then perform
> this check. I guess if we can do that then we can directly move the
> restart logic to the end.

IMO this shows the advantage of just waking up the worker.  It doesn't
change the apply worker's behavior besides making it more responsive.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Infinite Interval

2023-01-04 Thread jian he
On Tue, Jan 3, 2023 at 6:14 AM Joseph Koshakow  wrote:

> I have another patch, this one adds validations to operations that
> return intervals and updated error messages. I tried to give all of the
> error messages meaningful text, but I'm starting to think that almost all
> of them should just say "interval out of range". The current approach
> may reveal some implementation details and lead to confusion. For
> example, some subtractions are converted to additions which would lead
> to an error message about addition.
>
> SELECT date 'infinity' - interval 'infinity';
> ERROR:  cannot add infinite values with opposite signs
>
> I've also updated the commit message to include the remaining TODOs,
> which I've copied below
>
>   1. Various TODOs in code.
>   2. Correctly implement interval_part for infinite intervals.
>   3. Test consolidation.
>   4. Should we just use the months field to test for infinity?
>


3. Test consolidation.
I used the DO command, which reduced a lot of the test SQL code.
I don't know how to generate an interval.out file.
I hope the format is ok. I used https://sqlformat.darold.net/ to format the
SQL code.
Then I saw on the internet that one line should be no more than 80 chars,
so I slightly changed the format.

-- 
 I recommend David Deutsch's <>

  Jian
diff --git a/src/test/regress/sql/interval.sql b/src/test/regress/sql/interval.sql
index 82f3180221..1fd99c53d4 100644
--- a/src/test/regress/sql/interval.sql
+++ b/src/test/regress/sql/interval.sql
@@ -652,30 +652,31 @@ SELECT timetz '11:27:42' + interval '-infinity';
 SELECT timetz '11:27:42' - interval 'infinity';
 SELECT timetz '11:27:42' - interval '-infinity';
 
-SELECT interval 'infinity' < interval 'infinity';
-SELECT interval 'infinity' < interval '-infinity';
-SELECT interval '-infinity' < interval 'infinity';
-SELECT interval '-infinity' < interval '-infinity';
-SELECT interval 'infinity' <= interval 'infinity';
-SELECT interval 'infinity' <= interval '-infinity';
-SELECT interval '-infinity' <= interval 'infinity';
-SELECT interval '-infinity' <= interval '-infinity';
-SELECT interval 'infinity' > interval 'infinity';
-SELECT interval 'infinity' > interval '-infinity';
-SELECT interval '-infinity' > interval 'infinity';
-SELECT interval '-infinity' > interval '-infinity';
-SELECT interval 'infinity' >= interval 'infinity';
-SELECT interval 'infinity' >= interval '-infinity';
-SELECT interval '-infinity' >= interval 'infinity';
-SELECT interval '-infinity' >= interval '-infinity';
-SELECT interval 'infinity' = interval 'infinity';
-SELECT interval 'infinity' = interval '-infinity';
-SELECT interval '-infinity' = interval 'infinity';
-SELECT interval '-infinity' = interval '-infinity';
-SELECT interval 'infinity' <> interval 'infinity';
-SELECT interval 'infinity' <> interval '-infinity';
-SELECT interval '-infinity' <> interval 'infinity';
-SELECT interval '-infinity' <> interval '-infinity';
+DO $$
+DECLARE
+intv interval;
+intv1 interval;
+intvs interval[] := '{+infinity,-infinity}';
+OPERATOR text[] := '{<,<=,=, >,>=,<>}';
+opr text;
+result boolean;
+BEGIN
+FOREACH intv IN ARRAY intvs LOOP
+FOREACH opr IN ARRAY OPERATOR LOOP
+FOREACH intv1 IN ARRAY intvs LOOP
+EXECUTE 'select interval ' || quote_literal(intv) || ' ' 
+  || opr || ' interval ' || quote_literal(intv1) INTO result;
+RAISE NOTICE '%'
+  ,(format('%10s %2s %10s %2s'
+  ,intv
+  ,opr
+  ,intv1
+  ,result));
+END LOOP;
+END LOOP;
+END LOOP;
+END
+$$;
 
 SELECT -interval 'infinity';
 SELECT -interval '-infinity';
@@ -709,58 +710,48 @@ SELECT date_bin('-infinity', timestamp '2001-02-16 20:38:40', timestamp '2001-02
 SELECT date_trunc('hour', interval 'infinity');
 SELECT date_trunc('hour', interval '-infinity');
 
-SELECT date_part('us', interval 'infinity');
-SELECT date_part('us', interval '-infinity');
-SELECT date_part('ms', interval 'infinity');
-SELECT date_part('ms', interval '-infinity');
-SELECT date_part('second', interval 'infinity');
-SELECT date_part('second', interval '-infinity');
-SELECT date_part('minute', interval 'infinity');
-SELECT date_part('minute', interval '-infinity');
-SELECT date_part('hour', interval 'infinity');
-SELECT date_part('hour', interval '-infinity');
-SELECT date_part('day', interval 'infinity');
-SELECT date_part('day', interval '-infinity');
-SELECT date_part('month', interval 'infinity');
-SELECT date_part('month', interval '-infinity');
-SELECT date_part('quarter', interval 'infinity');
-SELECT date_part('quarter', interval '-infinity');
-SELECT date_part('year', interval 'infinity');
-SELECT date_part('year', interval '-infinity');
-SELECT date_part('decade', interval 'infinity');
-SELECT date_part('decade', interval '-infinity');
-SELECT date_part('century', interval 'infinity');
-SELECT date_part('century', interval 

Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-04 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jan 4, 2023 at 10:22 AM Daniel Verite  wrote:
>> A solution would be for psql to use PQsetSingleRowMode() to retrieve
>> results row-by-row, as opposed to using a cursor, and then allocate
>> memory for only FETCH_COUNT rows at a time.

> Is there any reason that someone hasn't, like, already done this?

As you well know, psql's FETCH_COUNT mechanism is far older than
single-row mode.  I don't think anyone's tried to transpose it
onto that.  I agree that it seems like a good idea to try.
There will be more per-row overhead, but the increase in flexibility
is likely to justify that.

regards, tom lane




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-04 Thread Robert Haas
On Wed, Jan 4, 2023 at 10:22 AM Daniel Verite  wrote:
> A solution would be for psql to use PQsetSingleRowMode() to retrieve
> results row-by-row, as opposed to using a cursor, and then allocate
> memory for only FETCH_COUNT rows at a time. Incidentally it solves
> other problems like queries containing multiple statements, that also
> fail to work properly with cursors, or UPDATE/INSERT... RETURNING.. on
> large number of rows that could also benefit from pagination in
> memory.

Is there any reason that someone hasn't, like, already done this?

Because if there isn't, we should really do this. And if there is,
like say that it would hurt performance or something, then we should
come up with a fix for that problem and then do something like this.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-04 Thread Daniel Verite
Jakub Wartak wrote:

> It might be a not so well known fact (?) that CTEs are not executed
> with cursor when asked to do so, but instead silently executed with
> potential huge memory allocation going on. Patch is attached. My one
> doubt is that not every statement starting with "WITH" is WITH(..)
> SELECT of course.

Yes, that's why WITH queries are currently filtered out by the
FETCH_COUNT feature.

Case in point:

test=> begin;
BEGIN

test=> create table tbl(i int);
CREATE TABLE

test=> declare psql_cursor cursor for
 with r(i) as (values (1))
 insert into tbl(i) select i from r;
ERROR:  syntax error at or near "insert"
LINE 3: insert into tbl(i) select i from r;


So the fix you're proposing would fail on that kind of queries.

A solution would be for psql to use PQsetSingleRowMode() to retrieve
results row-by-row, as opposed to using a cursor, and then allocate
memory for only FETCH_COUNT rows at a time. Incidentally it solves
other problems like queries containing multiple statements, that also
fail to work properly with cursors, or UPDATE/INSERT... RETURNING.. on
large number of rows that could also benefit from pagination in
memory.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread Ankit Kumar Pandey

Attaching test cases for this (+ small change in doc).

Tested this in one of WIP branch where I had modified 
select_active_windows and it failed


as expected.

Please let me know if something can be improved in this.


Regards,
Ankit Kumar Pandey
From 7647759eb92e1a0560bcff73b4169be8694f83d8 Mon Sep 17 00:00:00 2001
From: Ankit Kumar Pandey 
Date: Wed, 4 Jan 2023 19:57:57 +0530
Subject: [PATCH] Add test cases for optimal ordering of window function order
 by clauses

In case of multiple orderby clauses for window function, it is desired
to have minimum sort operations. This is already implemented in code but
corresponding test cases were missing.
This patch adds test cases covering various cases where order by clause
get optimized and some additional comments in function which implements
this feature.
---
 src/backend/optimizer/plan/planner.c |   5 +
 src/test/regress/expected/window.out | 140 +++
 src/test/regress/sql/window.sql  |  51 ++
 3 files changed, 196 insertions(+)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d6ba7589f3..e3989a7b44 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5592,6 +5592,11 @@ optimize_window_clauses(PlannerInfo *root, WindowFuncLists *wflists)
  * select_active_windows
  *		Create a list of the "active" window clauses (ie, those referenced
  *		by non-deleted WindowFuncs) in the order they are to be executed.
+ *		Window clauses are ordered with the highest tleSortGroupRef first, so
+ *		the clauses with the lowest tleSortGroupRefs are processed by the last
+ *		WindowAggs. This reduces the sorting required for the lower-order
+ *		WindowAggs, since the required columns are quite likely to be sorted
+ *		already and, if not, an incremental sort may be possible.
  */
 static List *
 select_active_windows(PlannerInfo *root, WindowFuncLists *wflists)
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index 5505e9a2da..71d0cff587 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -4733,3 +4733,143 @@ SELECT * FROM pg_temp.f(2);
  {5}
 (5 rows)
 
+-- test optimal ordering for minimal sort operations
+CREATE temp TABLE abcd (a int, b int, c int, d int);
+INSERT INTO abcd
+  SELECT p,q,r,s FROM
+generate_series(1,5) p,
+generate_Series(1,5) q,
+generate_Series(1,5) r,
+generate_Series(1,5) s;
+EXPLAIN (COSTS OFF)
+  SELECT row_number() OVER (ORDER BY b),
+  count(*) OVER (ORDER BY a,b)
+FROM abcd ORDER BY a,b,c;
+   QUERY PLAN   
+
+ Incremental Sort
+   Sort Key: a, b, c
+   Presorted Key: a, b
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: a, b
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: b
+   ->  Seq Scan on abcd
+(10 rows)
+
+EXPLAIN (COSTS OFF)
+  SELECT row_number() OVER (ORDER BY a, b),
+  count(*) OVER (ORDER BY a)
+FROM abcd ORDER BY a,b,c;
+QUERY PLAN
+--
+ Incremental Sort
+   Sort Key: a, b, c
+   Presorted Key: a, b
+   ->  WindowAgg
+ ->  WindowAgg
+   ->  Sort
+ Sort Key: a, b
+ ->  Seq Scan on abcd
+(8 rows)
+
+EXPLAIN (COSTS OFF)
+  SELECT row_number() OVER (ORDER BY a),
+  count(*) OVER (ORDER BY a,b,c)
+FROM abcd ORDER BY a,b;
+ QUERY PLAN 
+
+ WindowAgg
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: a, b, c
+   ->  Seq Scan on abcd
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+  SELECT row_number() OVER (ORDER BY a),
+  count(*) OVER (ORDER BY a,c),
+  count(*) OVER (ORDER BY a,b)
+FROM abcd ORDER BY a,d;
+  QUERY PLAN  
+--
+ Incremental Sort
+   Sort Key: a, d
+   Presorted Key: a
+   ->  WindowAgg
+ ->  WindowAgg
+   ->  Incremental Sort
+ Sort Key: a, c
+ Presorted Key: a
+ ->  WindowAgg
+   ->  Sort
+ Sort Key: a, b
+ ->  Seq Scan on abcd
+(12 rows)
+
+EXPLAIN (COSTS OFF)
+  SELECT row_number() OVER (ORDER BY a),
+  count(*) OVER (ORDER BY a,b),
+  count(*) OVER (ORDER BY a,c)
+FROM abcd ORDER BY a,d;
+  QUERY PLAN  
+--
+ Incremental Sort
+   Sort Key: a, d
+   Presorted Key: a
+   ->  WindowAgg
+ ->  WindowAgg
+   ->  Incremental Sort
+ Sort Key: a, b
+ Presorted Key: a
+ ->  WindowAgg
+ 

Re: explain analyze rows=%.0f

2023-01-04 Thread Ibrar Ahmed
On Sun, Nov 6, 2022 at 10:12 AM Tom Lane  wrote:

> Robert Haas  writes:
> > On Fri, Jul 22, 2022 at 6:47 AM Amit Kapila  wrote:
> >> I feel the discussion has slightly deviated which makes it unclear
> >> whether this patch is required or not?
>
> > My opinion is that showing some fractional digits at least when
> > loops>1 would be better than what we have now. It might not be the
> > best thing we could do, but it would be better than doing nothing.
>
> Yeah, I think that is a reasonable compromise.
>
>
Thanks, I have modified everything as suggested, except one point.


> I took a brief look through the patch, and I have some review
> comments:
>
> * Code like this is pretty awful:
>
> appendStringInfo(es->str,
>  (nloops == 1 || !HAS_DECIMAL(rows)) ?
>  " rows=%.0f loops=%.0f)" : " rows=%.2f loops=%.0f)",
>  rows, nloops);
>
> Don't use variable format strings.  They're hard to read and they
> probably defeat compile-time checks that the arguments match the
> format string.  You could use a "*" field width instead, ie
>
> appendStringInfo(es->str,
>  " rows=%.*f loops=%.0f)",
>  (nloops == 1 || !HAS_DECIMAL(rows)) ? 2 : 0,
>  rows, nloops);
>
> That'd also allow you to reduce the code churn you've added by
> splitting some appendStringInfo calls.
>
> * I'm fairly concerned about how stable this'll be in the buildfarm,
> in particular I fear HAS_DECIMAL() is not likely to give consistent
> results across platforms.  I gather that an earlier version of the patch
> tried to check whether the fractional part would be zero to two decimal
> places, rather than whether it's exactly zero.  Probably want to put
> back something like that.
>
> * Another thought is that the non-text formats tend to prize output
> consistency over readability, so maybe we should just always use 2
> fractional digits there, rather than trying to minimize visible changes.
>
> In that, we need to adjust a lot of test case outputs.

> * We need some doc adjustments, surely, to explain what the heck this
> means.
>



> regards, tom lane
>


-- 
Ibrar Ahmed
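
The "*" field width in the review above is standard printf behavior: the precision is taken from an argument at run time instead of being baked into the format string. Python's printf-style formatting accepts the same "*", so the effect can be sketched outside C (`format_rows` and its arguments are invented for this sketch; the real C code would pass the same argument list to appendStringInfo):

```python
def format_rows(rows, nloops, has_decimal):
    # Precision is chosen at run time via the '*' field width,
    # instead of selecting between two whole format strings.
    precision = 2 if (nloops != 1 and has_decimal) else 0
    return " rows=%.*f loops=%.0f)" % (precision, rows, nloops)
```

This keeps a single, compile-time-checkable format string while still printing fractional digits only when loops > 1 and the value has a fractional part.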


Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Robert Haas
On Wed, Jan 4, 2023 at 1:53 AM Peter Geoghegan  wrote:
> I think that we should definitely have a comment directly over
> TransactionIdDidAbort(). Though I wouldn't mind reorganizing these
> other comments, or making the comment over TransactionIdDidAbort()
> mostly just point to the other comments.

Yeah, I think it would be good to have a comment there. As Andres
says, it is almost always wrong to use this function, and we should
make that more visible. Possibly we should even rename the function,
like TransactionIdKnownToHaveAborted().

But that having been said, I'm kind of astonished that you didn't know
about this already. The freezing behavior is in general extremely hard
to get right, and I guess I feel if you don't understand how the
underlying functions work, including things like performance
considerations and which functions return fully reliable results, I do
not think you should be committing your own patches in this area.
There is probably a lot of potential benefit in improving the way this
stuff works, but there is also a heck of a lot of danger of creating
subtle data corrupting bugs that could easily take years to find.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add a new pg_walinspect function to extract FPIs from WAL records

2023-01-04 Thread Drouvot, Bertrand

Hi,

On 12/27/22 12:48 PM, Bharath Rupireddy wrote:

Hi,

Here's a patch that implements the idea of extracting full page images
from WAL records [1] [2] with a function in pg_walinspect. This new
function accepts start and end lsn and returns full page image info
such as WAL record lsn, tablespace oid, database oid, relfile number,
block number, fork name and the raw full page (as bytea). I'll
register this in the next commitfest.

Thoughts?



I think it makes sense to align the pg_walinspect functions with the
pg_waldump "features", and since [1] added FPI "extraction", +1 for the
proposed patch in this thread.


[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d497093cbecccf6df26365e06a5f8f8614b591c8
[2] 
https://postgr.es/m/CAOxo6XKjQb2bMSBRpePf3ZpzfNTwjQUc4Tafh21=jzjx6bx...@mail.gmail.com


I just have a few comments:

+
+/*
+ * Get full page images and their info associated with a given WAL record.
+ */

What about adding a few words about compression? (like "Decompression is applied if 
necessary"?)


+   /* Full page exists, so let's output it. */
+   if (!RestoreBlockImage(record, block_id, page))

"Full page exists, so let's output its info and content." instead?


+ 
+  Gets raw full page images and their information associated with all the
+  valid WAL records between start_lsn and
+  end_lsn. Returns one row per full page image.

Worth to add a few words about decompression too?


I'm also wondering if it would make sense to extend the test coverage of it
(and pg_waldump) to "validate" that both extracted images are the same and
match the one modified right after the checkpoint.

What do you think? (could be done later in another patch though).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-01-04 Thread Alexander Korotkov
Hi, Pavel!

On Wed, Jan 4, 2023 at 3:43 PM Pavel Borisov  wrote:
> On Wed, 4 Jan 2023 at 12:52, Pavel Borisov  wrote:
> > On Wed, 4 Jan 2023 at 12:41, vignesh C  wrote:
> > >
> > > On Fri, 1 Jul 2022 at 16:49, Alexander Korotkov  
> > > wrote:
> > > >
> > > > Hackers,
> > > >
> > > > When working in the read committed transaction isolation mode
> > > > (default), we have the following sequence of actions when
> > > > tuple_update() or tuple_delete() find concurrently updated tuple.
> > > >
> > > > 1. tuple_update()/tuple_delete() returns TM_Updated
> > > > 2. tuple_lock()
> > > > 3. Re-evaluate plan qual (recheck if we still need to update/delete
> > > > and calculate the new tuple for update)
> > > > 4. tuple_update()/tuple_delete() (this time should be successful,
> > > > since we've previously locked the tuple).
> > > >
> > > > I wonder if we should merge steps 1 and 2. We could save some of the effort
> > > > already spent during tuple_update()/tuple_delete() on locking the
> > > > tuple. In the heap table access method, we have to start tuple_lock() with
> > > > the first tuple in the chain, but tuple_update()/tuple_delete() has
> > > > already visited it. For undo-based table access methods,
> > > > tuple_update()/tuple_delete() should start from the last version, so why
> > > > not place the tuple lock immediately once a concurrent update is
> > > > detected? I think this patch should have some performance benefits under
> > > > high concurrency.
> > > >
> > > > Also, the patch simplifies code in nodeModifyTable.c getting rid of
> > > > the nested case. I also get rid of extra
> > > > table_tuple_fetch_row_version() in ExecUpdate. Why re-fetch the old
> > > > tuple, when it should be exactly the same tuple we've just locked?
> > > >
> > > > I'm going to check the performance impact. Thoughts and feedback are 
> > > > welcome.
> > >
> > > The patch does not apply on top of HEAD as in [1], please post a rebased 
> > > patch:
> > > === Applying patches on top of PostgreSQL commit ID
> > > eb5ad4ff05fd382ac98cab60b82f7fd6ce4cfeb8 ===
> > > === applying patch
> > > ./0001-Lock-updated-tuples-in-tuple_update-and-tuple_del-v1.patch
> > > patching file src/backend/executor/nodeModifyTable.c
> > > ...
> > > Hunk #3 FAILED at 1376.
> > > ...
> > > 1 out of 15 hunks FAILED -- saving rejects to file
> > > src/backend/executor/nodeModifyTable.c.rej
> > >
> > > [1] - http://cfbot.cputube.org/patch_41_4099.log
> >
> > The rebased patch is attached. It's just a change in formatting, no
> > changes in code though.
>
> One more update of a patchset to avoid compiler warnings.

Thank you for your help.  I'm going to provide the revised version of
patch with comments and commit message in the next couple of days.

--
Regards,
Alexander Korotkov




Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

2023-01-04 Thread Amit Kapila
On Tue, Jan 3, 2023 at 8:50 PM Michail Nikolaev
 wrote:
>
> > Does that by any chance mean you are using a non-community version of
> > Postgres which has some other changes?
>
> It is a managed Postgres service in the general cloud. Usually, such
> providers apply some custom minor patches.
> The only one I know about - about forbidding of canceling queries
> while waiting for synchronous replication acknowledgement.
>

Okay, but it would be better to know all of the other changes they have.

> > It is possible but ideally, in that case, the client should request
> > such a transaction again.
>
> I am not sure I get you here.
>
> I'll try to explain what I mean:
>
> The patch I'm referring to does not allow canceling a query while it
> is waiting for the ACK for the COMMIT message in case of synchronous
> replication.
> If the synchronous standby is down, the query and connection just get
> stuck until a server restart (or until the standby becomes available to
> process the ACK).
> Tuples changed by such a hanging transaction are not visible to other
> transactions. This is all done to prevent seeing spurious tuples in case
> of a network split.
>
> So, it seems like we had such a situation during that story because of
> our synchronous standby's downtime (before the server restart).
> My thought is just about the possibility that such transactions
> (waiting for the ACK for COMMIT) are handled somehow incorrectly by
> the logical replication engine.
>

I understood this point yesterday but we do have handling for such
cases. Say, if the subscriber is down during the time of such
synchronous transactions, after the restart, it will request to
restart the replication from a point which is prior to such
transactions. We ensure this by replication origins. See docs [1] for
more information about the same. Now, it is possible that there is a
bug in that mechanism but it is difficult to find it without some
hints from LOGs or a reproducible test. It is also possible that there
is another area that has a bug in the Postgres code. But, OTOH, we
can't rule out the possibility that it is because of some features
added by managed service unless you can reproduce it on the Postgres
build.
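
To picture the mechanism, here is a deliberately tiny toy model of origin-based progress tracking (all names are mine, not PostgreSQL's): the subscriber advances its origin's remote LSN in the same atomic unit as the local commit, so after a crash it asks the publisher to restart the stream just past the last recorded LSN and never skips or double-applies a remote transaction.

```python
class ApplySketch:
    """Toy model of replication-origin progress tracking (illustrative only)."""

    def __init__(self):
        self.origin_lsn = 0   # durably committed together with each local txn
        self.applied = []     # stands in for the subscriber's table data

    def apply_commit(self, remote_lsn, changes):
        # A transaction replayed after a restart with an LSN at or below the
        # recorded origin progress was already applied: skip it.
        if remote_lsn <= self.origin_lsn:
            return
        # Locally, the data change and the origin advance commit atomically.
        self.applied.extend(changes)
        self.origin_lsn = remote_lsn

    def restart_point(self):
        # After a crash, replication is requested from here onwards.
        return self.origin_lsn
```

Even if the publisher resends the whole stream after a restart, each remote transaction ends up applied exactly once in this model.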

[1] - https://www.postgresql.org/docs/devel/replication-origins.html

-- 
With Regards,
Amit Kapila.




Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-04 Thread Dilip Kumar
On Wed, Jan 4, 2023 at 6:40 PM Amit Kapila  wrote:
>
> On Wed, Jan 4, 2023 at 4:52 PM Dilip Kumar  wrote:
> >
> > 2.
> > + * Since the database structure (schema of subscription tables, 
> > constraints,
> > + * etc.) of the publisher and subscriber could be different, applying
> > + * transactions in parallel mode on the subscriber side can cause some
> > + * deadlocks that do not occur on the publisher side.
> >
> > I think this paragraph needs to be rephrased a bit.  It is saying that
> > some deadlock can occur on subscribers which did not occur on the
> > publisher.  I think what it should be conveying is that the deadlock
> > can occur due to concurrently applying the conflicting/dependent
> > transactions which are not conflicting/dependent on the publisher due
> > to .  Because if we create the same schema on the
> > publisher it might not have ended up in a deadlock instead it would
> > have been executed in sequence (due to lock waiting). So the main
> > point we are conveying is that the transaction which was independent
> > of each other on the publisher could be dependent on the subscriber
> > and they can end up in deadlock due to parallel apply.
> >
>
> How about changing it to: "We have a risk of deadlock due to
> parallelly applying the transactions that were independent on the
> publisher side but became dependent on the subscriber side due to the
> different database structures (like schema of subscription tables,
> constraints, etc.) on each side.

I think this looks good to me.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: fix and document CLUSTER privileges

2023-01-04 Thread Gilles Darold

Le 16/12/2022 à 05:57, Nathan Bossart a écrit :

Here is a new version of the patch.  I've moved the privilege checks to a
new function, and I added a note in the docs about clustering partitioned
tables in a transaction block (it's not allowed).



Getting into the review of this patch, I wonder why the CLUSTER command 
does not react like the VACUUM FULL command when there are insufficient 
privileges. For example, with a partitioned table (ptnowner) and two 
partitions (ptnowner1 and ptnowner2), the second partition being owned 
by another user, let's say usr2, we get the following report when 
executing VACUUM as usr2:


testdb=> VACUUM FULL ptnowner;
WARNING:  permission denied to vacuum "ptnowner", skipping it
WARNING:  permission denied to vacuum "ptnowner1", skipping it
VACUUM

Here only ptnowner2 has been vacuumed, which is correct and expected.

For the cluster command:

testdb=> CLUSTER;
CLUSTER


I would have expected something like:

testdb=> CLUSTER;
WARNING:  permission denied to cluster "ptnowner1", skipping it
CLUSTER

I mean that the silent behavior is not very helpful.


This is the current behavior of the CLUSTER command, and the current patch 
adds a sentence about the silent behavior to the documentation. This is 
good, but I just want to ask whether we would want to fix this behavior 
too, or keep things as they are, without the noise.
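
To make the suggestion concrete, the warn-and-skip behaviour could be sketched like this (a toy model only; the function and predicate names are hypothetical, not PostgreSQL code):

```python
def cluster_all(rels, has_priv, warn=print):
    # Mimic VACUUM's behaviour: emit a warning for each relation the
    # current user may not process, and cluster only the remaining ones.
    clustered = []
    for rel in rels:
        if not has_priv(rel):
            warn(f'WARNING:  permission denied to cluster "{rel}", skipping it')
            continue
        clustered.append(rel)
    return clustered
```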



Best regards,

--
Gilles Darold


Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-04 Thread Amit Kapila
On Wed, Jan 4, 2023 at 4:52 PM Dilip Kumar  wrote:
>
> 2.
> + * Since the database structure (schema of subscription tables, constraints,
> + * etc.) of the publisher and subscriber could be different, applying
> + * transactions in parallel mode on the subscriber side can cause some
> + * deadlocks that do not occur on the publisher side.
>
> I think this paragraph needs to be rephrased a bit.  It is saying that
> some deadlock can occur on subscribers which did not occur on the
> publisher.  I think what it should be conveying is that the deadlock
> can occur due to concurrently applying the conflicting/dependent
> transactions which are not conflicting/dependent on the publisher due
> to .  Because if we create the same schema on the
> publisher it might not have ended up in a deadlock instead it would
> have been executed in sequence (due to lock waiting). So the main
> point we are conveying is that the transaction which was independent
> of each other on the publisher could be dependent on the subscriber
> and they can end up in deadlock due to parallel apply.
>

How about changing it to: "We have a risk of deadlock due to
parallelly applying the transactions that were independent on the
publisher side but became dependent on the subscriber side due to the
different database structures (like schema of subscription tables,
constraints, etc.) on each side.
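
As an aside, the dependency this comment talks about can be pictured as a wait-for graph that is acyclic on the publisher but gains a cycle on the subscriber. A tiny sketch of detecting such a cycle (purely illustrative, not apply-worker code; all names are mine):

```python
def has_deadlock(waits_for):
    # waits_for maps a worker to the set of workers it is blocked on;
    # a cycle in this graph is a deadlock.
    def reachable(start, target, seen):
        for nxt in waits_for.get(start, ()):
            if nxt == target or (nxt not in seen
                                 and reachable(nxt, target, seen | {nxt})):
                return True
        return False

    return any(reachable(w, w, {w}) for w in waits_for)
```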

-- 
With Regards,
Amit Kapila.




Re: GSOC2023

2023-01-04 Thread Jesper Pedersen

Hi,

On 12/28/22 21:10, diaa wrote:

*Hi Sir.*
I'm a computer engineering student from Egypt, interested in database management
systems.
Could I get details about the list of ideas for the 2023 projects, or advice on
how to prepare myself with the required knowledge?



Thanks for your interest in GSoC 2023 !


The GSoC timeline is described here [1], and the PostgreSQL projects - 
if approved - will be listed here [2].



So, check back in February and see which projects have been proposed.


[1] https://developers.google.com/open-source/gsoc/timeline

[2] https://wiki.postgresql.org/wiki/GSoC_2023


Best regards,

 Jesper






Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-01-04 Thread Pavel Borisov
On Wed, 4 Jan 2023 at 12:52, Pavel Borisov  wrote:
>
> Hi, Vignesh!
>
> On Wed, 4 Jan 2023 at 12:41, vignesh C  wrote:
> >
> > On Fri, 1 Jul 2022 at 16:49, Alexander Korotkov  
> > wrote:
> > >
> > > Hackers,
> > >
> > > When working in the read committed transaction isolation mode
> > > (default), we have the following sequence of actions when
> > > tuple_update() or tuple_delete() finds a concurrently updated tuple.
> > >
> > > 1. tuple_update()/tuple_delete() returns TM_Updated
> > > 2. tuple_lock()
> > > 3. Re-evaluate plan qual (recheck if we still need to update/delete
> > > and calculate the new tuple for update)
> > > 4. tuple_update()/tuple_delete() (this time should be successful,
> > > since we've previously locked the tuple).
> > >
> > > I wonder if we should merge steps 1 and 2. We could save some of the effort
> > > already spent during tuple_update()/tuple_delete() on locking the
> > > tuple. In the heap table access method, we have to start tuple_lock() with
> > > the first tuple in the chain, but tuple_update()/tuple_delete() has
> > > already visited it. For undo-based table access methods,
> > > tuple_update()/tuple_delete() should start from the last version, so why
> > > not place the tuple lock immediately once a concurrent update is
> > > detected? I think this patch should have some performance benefits under
> > > high concurrency.
> > >
> > > Also, the patch simplifies code in nodeModifyTable.c getting rid of
> > > the nested case. I also get rid of extra
> > > table_tuple_fetch_row_version() in ExecUpdate. Why re-fetch the old
> > > tuple, when it should be exactly the same tuple we've just locked?
> > >
> > > I'm going to check the performance impact. Thoughts and feedback are 
> > > welcome.
> >
> > The patch does not apply on top of HEAD as in [1], please post a rebased 
> > patch:
> > === Applying patches on top of PostgreSQL commit ID
> > eb5ad4ff05fd382ac98cab60b82f7fd6ce4cfeb8 ===
> > === applying patch
> > ./0001-Lock-updated-tuples-in-tuple_update-and-tuple_del-v1.patch
> > patching file src/backend/executor/nodeModifyTable.c
> > ...
> > Hunk #3 FAILED at 1376.
> > ...
> > 1 out of 15 hunks FAILED -- saving rejects to file
> > src/backend/executor/nodeModifyTable.c.rej
> >
> > [1] - http://cfbot.cputube.org/patch_41_4099.log
>
> The rebased patch is attached. It's just a change in formatting, no
> changes in code though.

One more update of a patchset to avoid compiler warnings.

Regards,
Pavel Borisov


v3-0001-Lock-updated-tuples-in-tuple_update-and-tuple_del.patch
Description: Binary data


Re: [PoC] Implementation of distinct in Window Aggregates

2023-01-04 Thread Ankit Kumar Pandey


On 29/12/22 20:58, Ankit Kumar Pandey wrote:


On 24/12/22 18:22, Ankit Pandey wrote:

Hi,

This is a PoC patch which implements the distinct operation in window 
aggregates (without ORDER BY and for single-column aggregation; the 
final version may vary wrt these limitations). The purpose of this PoC 
is to get feedback on the approach used and the corresponding 
implementation, and any nitpicking as deemed reasonable.


The distinct operation is mirrored from the implementation in nodeAgg. 
Existing partitioning logic determines if a row is in the partition, and 
when distinct is required, all tuples for the aggregate column are 
stored in a tuplesort. When finalize_windowaggregate gets called, 
tuples are sorted and duplicates are removed, followed by calling the 
transition function on each tuple.
When distinct is not required, the above process is skipped and the 
transition function gets called directly; nothing gets inserted 
into the tuplesort.
Note: for each partition, tuplesort_begin and tuplesort_end are invoked 
to rinse the tuplesort, so at any time the maximum number of tuples in 
the tuplesort equals the number of tuples in a particular partition.
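
A language-agnostic sketch of that flow (Python used for illustration only; the names are mine, not the patch's): per partition, buffer the aggregated column, sort it, skip duplicates, and feed each surviving value to the transition function.

```python
from itertools import groupby

def avg_distinct_over_partition(rows, part_key, value):
    # rows: iterable of records; part_key/value: extractor functions.
    out = []
    for _, grp in groupby(sorted(rows, key=part_key), key=part_key):
        grp = list(grp)
        buffered = sorted(value(r) for r in grp)   # stand-in for the tuplesort
        total = count = 0
        last = object()                            # sentinel: matches nothing
        for v in buffered:
            if v != last:                          # duplicate elimination
                total += v                         # transition step
                count += 1
                last = v
        out.extend([total / count] * len(grp))     # one result per input row
    return out
```

With the sample table below, this yields 2.0 four times for partition 'A' and 5.0 once for 'B', matching the first query's output.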


I have verified it for integer and interval column aggregates (to 
rule out obvious issues related to data types).


Sample cases:

create table mytable(id int, name text);
insert into mytable values(1, 'A');
insert into mytable values(1, 'A');
insert into mytable values(5, 'B');
insert into mytable values(3, 'A');
insert into mytable values(1, 'A');

select avg(distinct id) over (partition by name) from mytable;
        avg
--------------------
 2.0000000000000000
 2.0000000000000000
 2.0000000000000000
 2.0000000000000000
 5.0000000000000000

select avg(id) over (partition by name) from mytable;
        avg
--------------------
 1.5000000000000000
 1.5000000000000000
 1.5000000000000000
 1.5000000000000000
 5.0000000000000000

select avg(distinct id) over () from mytable;
        avg
--------------------
 3.0000000000000000
 3.0000000000000000
 3.0000000000000000
 3.0000000000000000
 3.0000000000000000

select avg(distinct id)  from mytable;
        avg
--------------------
 3.0000000000000000

This is my first-time contribution. Please let me know if anything can 
be improved, as I'm eager to learn.

Regards,
Ankit Kumar Pandey


Hi all,

I know everyone is busy with the holidays (well, Happy Holidays!), but I 
will be glad if someone can take a quick look at this PoC and share 
thoughts.


This is my first-time contribution, so I am pretty sure there will be 
some very obvious feedback (which will help me move forward with 
this change).




Updated the patch against the latest master; the last patch was a year old.

--
Regards,
Ankit Kumar Pandey
From cf56d545bb837fc8f1a7630ea4417680256eddd4 Mon Sep 17 00:00:00 2001
From: Ankit Kumar Pandey 
Date: Wed, 23 Nov 2022 00:38:01 +0530
Subject: [PATCH] Implement distinct in Window Aggregates.

---
 src/backend/executor/nodeWindowAgg.c | 227 +++
 src/backend/optimizer/util/clauses.c |   2 +
 src/backend/parser/parse_agg.c   |  45 ++
 src/backend/parser/parse_func.c  |  19 +--
 src/include/nodes/execnodes.h|   1 +
 src/include/nodes/primnodes.h|   2 +
 6 files changed, 257 insertions(+), 39 deletions(-)

diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index d61d57e9a8..2d685f3be4 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -154,6 +154,14 @@ typedef struct WindowStatePerAggData
 
 	int64		transValueCount;	/* number of currently-aggregated rows */
 
+	/* For DISTINCT in Aggregates */
+	Datum		lastdatum;		/* used for single-column DISTINCT */
+	FmgrInfo	equalfnOne; /* single-column comparisons*/
+
+	Oid			*eq_ops;  /* used for equality check in DISTINCT */
+	Oid			*sort_ops; /* used for sorting distinct columns */
+	bool 		sort_in; /* FLAG set true if data is stored in tuplesort */
+
 	/* Data local to eval_windowaggregates() */
 	bool		restart;		/* need to restart this agg in this cycle? */
 } WindowStatePerAggData;
@@ -163,7 +171,7 @@ static void initialize_windowaggregate(WindowAggState *winstate,
 	   WindowStatePerAgg peraggstate);
 static void advance_windowaggregate(WindowAggState *winstate,
 	WindowStatePerFunc perfuncstate,
-	WindowStatePerAgg peraggstate);
+	WindowStatePerAgg peraggstate, Datum value, bool isNull);
 static bool advance_windowaggregate_base(WindowAggState *winstate,
 		 WindowStatePerFunc perfuncstate,
 		 WindowStatePerAgg peraggstate);
@@ -173,6 +181,9 @@ static void finalize_windowaggregate(WindowAggState *winstate,
 	 Datum *result, bool *isnull);
 
 static void eval_windowaggregates(WindowAggState *winstate);
+static void process_ordered_windowaggregate_single(WindowAggState *winstate, 
+			 WindowStatePerFunc perfuncstate,
+ 			 WindowStatePerAgg peraggstate);
 static void eval_windowfunction(WindowAggState *winstate,
 

Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-04 Thread Ankit Kumar Pandey


On 04/01/23 09:32, David Rowley wrote:


It looks like that works by accident. I see no mention of this either
in the comments or in [1].


This kind of troubles me, because the function name 
select_active_windows doesn't tell me that its only job is to reorder 
window clauses for optimizing the sort. From the code, I don't see it 
doing anything else either.




If we don't have one already, then we should likely add a regression
test that ensures that this remains true.  Since it does not seem to
be documented in the code anywhere, it seems like something that could
easily be overlooked if we were to ever refactor that code.

I don't see any tests for window functions that are specific to the 
sorting operation (and the order in which it happens). I will add those.



Also, one more thing: consider the following query:

explain analyze select row_number() over (order by a,b),count(*) over 
(order by a) from abcd order by a,b,c;

In this case, sorting is done on (a,b), followed by an incremental sort 
on c at the final stage.

If we do just one sort on (a,b,c) at the first stage, then there won't 
be a need to do the second (incremental) sort.

Now, I am not sure which one would be faster: sorting (a,b,c) vs 
sort(a,b) + incremental sort(c), because even though datum sort is 
fast, there can be any number of combinations where we won't be doing 
that.

I might be looking at extreme corner cases, but I still wanted to share.
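
For intuition, an incremental sort only re-sorts the suffix columns within each run of equal prefix values, which is what makes the sort(a,b) + incremental sort(c) plan attractive when the (a,b) groups are small. A rough sketch of the idea (illustrative only, not the executor's code):

```python
def incremental_sort(rows, prefix_len):
    # rows must already be sorted on their first `prefix_len` columns;
    # sort each run of equal prefixes on the remaining columns only.
    result, group, prev = [], [], None
    for row in rows:
        prefix = row[:prefix_len]
        if group and prefix != prev:
            result.extend(sorted(group, key=lambda r: r[prefix_len:]))
            group = []
        group.append(row)
        prev = prefix
    result.extend(sorted(group, key=lambda r: r[prefix_len:]))
    return result
```

For input presorted on (a,b), the output is identical to a full (a,b,c) sort, but each call to sorted() only sees one small group.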




--
Regards,
Ankit Kumar Pandey


Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit

2023-01-04 Thread vignesh C
On Tue, 1 Nov 2022 at 15:54, Etsuro Fujita  wrote:
>
> Hi David,
>
> On Sat, Oct 1, 2022 at 5:54 AM David Zhang  wrote:
> > After rebase the file `postgres_fdw.out` and applied to master branch,
> > make and make check are all ok for postgres_fdw.
>
> Thanks for testing!  Attached is a rebased version of the patch.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
b37a0832396414e8469d4ee4daea33396bde39b0 ===
=== applying patch ./v10-postgres-fdw-Add-support-for-parallel-abort.patch
patching file contrib/postgres_fdw/connection.c
patching file contrib/postgres_fdw/expected/postgres_fdw.out
Hunk #1 succeeded at 11704 (offset 203 lines).
Hunk #2 FAILED at 11576.
1 out of 2 hunks FAILED -- saving rejects to file
contrib/postgres_fdw/expected/postgres_fdw.out.rej
patching file contrib/postgres_fdw/option.c
Hunk #2 succeeded at 272 (offset 17 lines).
patching file contrib/postgres_fdw/sql/postgres_fdw.sql
Hunk #1 succeeded at 3894 (offset 150 lines).
Hunk #2 FAILED at 3788.
1 out of 2 hunks FAILED -- saving rejects to file
contrib/postgres_fdw/sql/postgres_fdw.sql.rej

[1] - http://cfbot.cputube.org/patch_41_3392.log

Regards,
Vignesh




Re: making relfilenodes 56 bits

2023-01-04 Thread vignesh C
On Fri, 21 Oct 2022 at 11:31, Michael Paquier  wrote:
>
> On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > any actual checks on the sanity of the output?  I realize that the
> > output's far from static, but still ...
>
> Honestly, checking all the fields is not that exciting, but the
> maximum I can think of that would be portable enough is something like
> the attached.  No arithmetic operators for xid limits things a bit,
> but at least that's something.
>
> Thoughts?

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
33ab0a2a527e3af5beee3a98fc07201e555d6e45 ===
=== applying patch ./controldata-regression-2.patch
patching file src/test/regress/expected/misc_functions.out
Hunk #1 succeeded at 642 with fuzz 2 (offset 48 lines).
patching file src/test/regress/sql/misc_functions.sql
Hunk #1 FAILED at 223.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/sql/misc_functions.sql.rej

[1] - http://cfbot.cputube.org/patch_41_3711.log

Regards,
Vignesh




Re: Using AF_UNIX sockets always for tests on Windows

2023-01-04 Thread vignesh C
On Fri, 2 Dec 2022 at 18:08, Andrew Dunstan  wrote:
>
>
> On 2022-12-01 Th 21:10, Andres Freund wrote:
> > Hi,
> >
> > On 2022-12-01 20:56:18 -0500, Tom Lane wrote:
> >> Andres Freund  writes:
> >>> On 2022-12-01 20:30:36 -0500, Tom Lane wrote:
>  If we remove that, won't we have a whole lot of code that's not
>  tested at all on any platform, ie all the TCP-socket code?
> >>> There's some coverage via the auth and ssl tests. But I agree it's an
> >>> issue. But to me the fix for that seems to be to add a dedicated test for
> >>> that, rather than relying on windows to test our socket code - that's 
> >>> quite a
> >>> few separate code paths from the tcp support of other platforms.
> >> IMO that's not the best way forward, because you'll always have
> >> nagging questions about whether a single-purpose test covers
> >> everything that needs coverage.
> > Still seems better than not having any coverage in our development
> > environments...
> >
> >
> >> I think the best place to be in would be to be able to run the whole test
> >> suite using either TCP or UNIX sockets, on any platform (with stuff like 
> >> the
> >> SSL test overriding the choice as needed).
> > I agree that that's useful. But it seems somewhat independent from the
> > majority of the proposed changes. To be able to test force-tcp-everywhere we
> > don't need e.g.  code for setting sspi auth in pg_regress etc - it's afaik
> > just needed so there's a secure way of running tests at all on windows.
> >
> > I think 0003 should be "trimmed" to only change the default for
> > $use_unix_sockets on windows and to remove PG_TEST_USE_UNIX_SOCKETS. Whoever
> > wants to, can then add a new environment variable to force tap tests to use
> > tcp.
> >
>
> Not sure if it's useful here, but a few months ago I prepared patches to
> remove the config-auth option of pg_regress, which struck me as more
> than odd, and replace it with a perl module. I didn't get around to
> finishing them, but the patches as of then are attached.
>
> I agree that having some switch that says "run everything with TCP" or
> "run (almost) everything with Unix sockets" would be good.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
bf03cfd162176d543da79f9398131abc251ddbb9 ===
=== applying patch
./0001-Do-config_auth-in-perl-code-for-TAP-tests-and-vcregr.patch
patching file contrib/basebackup_to_shell/t/001_basic.pl
Hunk #1 FAILED at 21.
1 out of 1 hunk FAILED -- saving rejects to file
contrib/basebackup_to_shell/t/001_basic.pl.rej
patching file src/bin/pg_basebackup/t/010_pg_basebackup.pl
Hunk #1 FAILED at 29.
1 out of 1 hunk FAILED -- saving rejects to file
src/bin/pg_basebackup/t/010_pg_basebackup.pl.rej
Hunk #3 FAILED at 461.
1 out of 3 hunks FAILED -- saving rejects to file
src/test/perl/PostgreSQL/Test/Cluster.pm.rej

[1] - http://cfbot.cputube.org/patch_41_4033.log

Regards,
Vignesh




psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-04 Thread Jakub Wartak
Hi -hackers,

I've spent some time fighting "out of memory" errors coming
out of psql when trying to use the cursor via FETCH_COUNT. It might be
a not-so-well-known fact (?) that CTEs are not executed with a cursor
when asked to do so, but are instead silently executed with potentially
huge memory allocation going on. A patch is attached. My one doubt is
that not every statement starting with "WITH" is a WITH(..) SELECT, of course.
Demo (one might also get the "out of memory for query result"):

postgres@hive:~$ psql -Ant --variable='FETCH_COUNT=100' -c "WITH data
AS (SELECT  generate_series(1, 2000) as Total) select repeat('a',
100) || data.Total || repeat('b', 800) as total_pat from data;"
Killed
postgres@hive:~$ tail -4 /var/log/postgresql/postgresql-14-main.log
[..]
2023-01-04 12:46:20.193 CET [32936] postgres@postgres LOG:  could not
send data to client: Broken pipe
[..]
2023-01-04 12:46:20.195 CET [32936] postgres@postgres FATAL:
connection to client lost

With the patch:
postgres@hive:~$ /tmp/psql16-with-patch -Ant
--variable='FETCH_COUNT=100' -c "WITH data AS (SELECT
generate_series(1, 2000) as Total) select repeat('a', 100) ||
data.Total || repeat('b', 800) as total_pat from data;" | wc -l
2000
postgres@hive:~$

Regards,
-Jakub Wartak.


0001-psql-allow-CTE-queries-to-be-executed-also-using-cur.patch
Description: Binary data


Re: Understanding, testing and improving our Windows filesystem code

2023-01-04 Thread vignesh C
On Tue, 25 Oct 2022 at 09:42, Thomas Munro  wrote:
>
> I pushed the bug fixes from this series, without their accompanying
> tests.  Here's a rebase of the test suite, with all those tests now
> squashed into the main test patch, and also the
> tell-Windows-to-be-more-like-Unix patch.  Registered in the
> commitfest.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
5212d447fa53518458cbe609092b347803a667c5 ===
=== applying patch
./v3-0002-meson-Add-infrastructure-for-TAP-tests-written-in.patch
patching file meson.build
Hunk #5 FAILED at 3000.
Hunk #6 FAILED at 3035.
2 out of 6 hunks FAILED -- saving rejects to file meson.build.rej

[1] - http://cfbot.cputube.org/patch_41_3951.log

Regards,
Vignesh




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-04 Thread David Geier

Hi,


CFBot shows some compilation errors as in [1], please post an updated
version for the same:
09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] 
/tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] 
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh


I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.
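
For readers following along, the core idea of the first patch — storing instr_time as a single 64-bit nanosecond count instead of a struct timespec — reduces the macros to plain integer arithmetic with no normalize loops; in sketch form (Python used purely for illustration):

```python
NS_PER_S = 1_000_000_000

def to_ns(tv_sec, tv_nsec):
    # flatten a timespec-style (sec, nsec) pair into one nanosecond count
    return tv_sec * NS_PER_S + tv_nsec

def accum_diff(acc, end, start):
    # INSTR_TIME_ACCUM_DIFF becomes a single add/subtract: no carry or
    # normalize handling of a separate nanosecond field is needed
    return acc + (end - start)
```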

--
David Geier
(ServiceNow)
From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
 cheaper.

---
 src/include/portability/instr_time.h | 62 
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd9..c64f21eb53 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)
 
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)
 
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+	((double) (t) / NS_PER_S)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+	((double) (t) / NS_PER_MS)
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+	((double) (t) / NS_PER_US)
 
 #else			/* WIN32 */
 
-- 
2.34.1

From 190ca09566eabb017ed25b1512225173ca47fb88 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for
 instrumentation.

For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU 

Getting an error if we provide --enable-tap-tests switch on SLES 12

2023-01-04 Thread tushar
Hi,

We found that if we provide the *--enable-tap-tests* switch at the time of
configuring the PG sources, we get the error below:
"
checking for Perl modules required for TAP tests... Can't locate IPC/Run.pm
in @INC (you may need to install the IPC::Run module) (@INC contains:
/usr/lib/perl5/site_perl/5.18.2/x86_64-linux-thread-multi
/usr/lib/perl5/site_perl/5.18.2
/usr/lib/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.18.2
/usr/lib/perl5/5.18.2/x86_64-linux-thread-multi /usr/lib/perl5/5.18.2
/usr/lib/perl5/site_perl .) at ./config/check_modules.pl line 11.

BEGIN failed--compilation aborted at ./config/check_modules.pl line 11.

configure: error: Additional Perl modules are required to run TAP tests
"

It looks like this is happening because the Perl-IPC-Run package is not
available on SLES 12, whereas Perl-IPC-Run3 is.

Srinu (my teammate) found that IPC::Run is hard-coded in config/
check_modules.pl, and if we replace Run with Run3 it works (patch attached,
created by Srinu).

Do we have any better option to work without this workaround?

regards,


perl.patch
Description: Binary data


Re: meson oddities

2023-01-04 Thread Peter Eisentraut

On 16.11.22 18:07, Andres Freund wrote:

If I just want to install postgres into a prefix without 'postgresql' added in
a bunch of directories, e.g. because the prefix already contains pg-$version,
there's really no good way to do so - you can't even specify
--sysconfdir or such, because we just override that path.


At least for the libraries, the point of the 'postgresql' subdir IMO
is to keep backend-loadable extensions separate from random libraries.
It's not great that we may fail to do that depending on what the
initial part of the library path is.


Agreed, extensions really should never be in a path searched by the dynamic
linker, even if the prefix contains 'postgres'.

To me that's a separate thing from adding postgresql to datadir, sysconfdir,
includedir, docdir... On a green field I'd say the 'extension library'
directory should just always be extensions/ or such.


I think we should get the two build systems to produce the same 
installation layout when given equivalent options.


Unless someone comes up with a proposal to address the above broader 
issues, also taking into account current packaging practices etc., then 
I think we should do a short-term solution to either port the 
subdir-appending to the meson scripts or remove it from the makefiles 
(or maybe a bit of both).






Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-04 Thread Dilip Kumar
On Wed, Jan 4, 2023 at 4:25 PM houzj.f...@fujitsu.com
 wrote:
>

> Attach the new patch set.
> Apart from addressing Sawada-San's comments, I also did some other minor
> changes in the patch:

I have done a high-level review of 0001; later I will do a detailed review
while reading through the patch. I think some of the comments need changes:

1.
+ The deadlock can happen in
+ * the following ways:
+ *

+ * 4) Lock types
+ *
+ * Both the stream lock and the transaction lock mentioned above are
+ * session-level locks because both locks could be acquired outside the
+ * transaction, and the stream lock in the leader needs to persist across
+ * transaction boundaries i.e. until the end of the streaming transaction.

I think "Lock types" should not be listed as item 4), because points 1), 2)
and 3) explain the ways deadlocks can happen, whereas 4) doesn't fall under
that category.


2.
+ * Since the database structure (schema of subscription tables, constraints,
+ * etc.) of the publisher and subscriber could be different, applying
+ * transactions in parallel mode on the subscriber side can cause some
+ * deadlocks that do not occur on the publisher side.

I think this paragraph needs to be rephrased a bit.  It says that some
deadlocks can occur on the subscriber that did not occur on the publisher.
What it should be conveying is that a deadlock can occur when we
concurrently apply transactions that are conflicting/dependent on the
subscriber but were not conflicting/dependent on the publisher, due to the
difference in database structure. If we created the same schema on the
publisher, it would not have ended up in a deadlock; the transactions would
have been executed in sequence (due to lock waiting). So the main point we
are conveying is that transactions which were independent of each other on
the publisher can become dependent on the subscriber, and they can end up
in a deadlock due to parallel apply.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG

2023-01-04 Thread Ankit Kumar Pandey



On 04/01/23 06:27, David Rowley wrote:

I think it's better you leave this then. I think if someone comes
along and demonstrates the feature's usefulness and can sell us having
it so we can easily enable it by GUC then maybe that's the time to
consider it. I don't think ticking off a TODO item is reason enough.


Makes sense; I'm not going further with this.

--
Regards,
Ankit Kumar Pandey





GSOC2023

2023-01-04 Thread diaa
Hi Sir,

I'm a computer engineering student from Egypt, interested in database management systems. Can I know details about the list of ideas for 2023 projects, or how to prepare myself to be ready with the required knowledge? Please, if you can help me, don't ignore my email.

Sincerely,
Diaa Badr.




grouping pushdown

2023-01-04 Thread Spring Zhong
Hi hackers,

I came across the problem of how to improve the performance of queries with a GROUP 
BY clause when the grouping columns contain much duplicate data. For example:

create table t1(i1) as select 1 from generate_series(1,1);
create table t2(i2) as select 2 from generate_series(1,1);

select i1,i2 from t1, t2 group by i1,i2;
 i1 | i2 
----+----
  1 |  2

                  QUERY PLAN
------------------------------------------------
 HashAggregate
   Group Key: t1.i1, t2.i2
   Batches: 1  Memory Usage: 24kB
   ->  Nested Loop
         ->  Seq Scan on t1
         ->  Materialize
               ->  Seq Scan on t2
 Planning Time: 0.067 ms
 Execution Time: 15864.585 ms


The plan is apparently inefficient, since the hash aggregate runs after the 
Cartesian product. We could expect the query's performance to improve considerably if 
the HashAggregate node could be pushed down to the SCAN nodes. For example, the 
plan may look like:

              expected QUERY PLAN
------------------------------------------------
 Group
   Group Key: t1.i1, t2.i2
   ->  Sort
         Sort Key: t1.i1, t2.i2
         ->  Nested Loop
               ->  HashAggregate
                     Group Key: t1.i1
                     ->  Seq Scan on t1
               ->  HashAggregate
                     Group Key: t2.i2
                     ->  Seq Scan on t2

Moreover, queries with expressions as GROUP BY columns may also take advantage 
of this feature, e.g.

select i1+i2 from t1, t2 group by i1+i2;
 ?column? 
----------
        3

              expected QUERY PLAN
------------------------------------------------
 Group
   Group Key: ((t1.i1 + t2.i2))
   ->  Sort
         Sort Key: ((t1.i1 + t2.i2))
         ->  Nested Loop
               ->  HashAggregate
                     Group Key: t1.i1
                     ->  Seq Scan on t1
               ->  HashAggregate
                     Group Key: t2.i2
                     ->  Seq Scan on t2

Does anyone have suggestions on this?



Re: on placeholder entries in view rule action query's range table

2023-01-04 Thread vignesh C
On Fri, 9 Dec 2022 at 12:20, Amit Langote  wrote:
>
> On Fri, Dec 9, 2022 at 3:07 PM Amit Langote  wrote:
> > On Thu, Dec 8, 2022 at 6:12 PM Alvaro Herrera  
> > wrote:
> > > On 2022-Dec-07, Amit Langote wrote:
> > > > However, this
> > > > approach of not storing the placeholder in the stored rule would lead
> > > > to a whole lot of regression test output changes, because the stored
> > > > view queries of many regression tests involving views would now end up
> > > > with only 1 entry in the range table instead of 3, causing ruleutils.c
> > > > to no longer qualify the column names in the deparsed representation
> > > > of those queries appearing in those regression test expected outputs.
> > > >
> > > > To avoid that churn (not sure if really a goal to strive for in this
> > > > case!), I thought it might be better to keep the OLD entry in the
> > > > stored action query while getting rid of the NEW entry.
> > >
> > > If the *only* argument for keeping the RTE for OLD is to avoid
> > > regression test churn, then definitely it is not worth doing and it
> > > should be ripped out.
> > >
> > > > Other than avoiding the regression test output churn, this also makes
> > > > the changes of ApplyRetrieveRule unnecessary.
> > >
> > > But do these changes mean the code is worse afterwards?  Changing stuff,
> > > per se, is not bad.  Also, since you haven't posted the "complete" patch
> > > since Nov 7th, it's not easy to tell what those changes are.
> > >
> > > Maybe you should post both versions of the patch -- one that removes
> > > just NEW, and one that removes both OLD and NEW, so that we can judge.
> >
> > OK, I gave the previous approach another try to see if I can change
> > ApplyRetrieveRule() in a bit more convincing way this time around, now
> > that the RTEPermissionInfo patch is in.
> >
> > I would say I'm more satisfied with how it turned out this time.  Let
> > me know what you think.
> >
> > > > Actually, as I was addressing Alvaro's comments on the now-committed
> > > > patch, I was starting to get concerned about the implications of the
> > > > change in position of the view relation RTE in the query's range table
> > > > if ApplyRetrieveRule() adds one from scratch instead of simply
> > > > recycling the OLD entry from stored rule action query, even though I
> > > > could see that there are no *user-visible* changes, especially after
> > > > decoupling permission checking from the range table.
> > >
> > > Hmm, I think I see the point, though I don't necessarily agree that
> > > there is a problem.
> >
> > Yeah, I'm not worried as much with the new version.  That is helped by
> > the fact that I've made ApplyRetrieveRule() now do basically what
> > UpdateRangeTableOfViewParse() would do with the stored rule query.
> > Also, our making add_rtes_to_flat_rtable() add perminfos in the RTE
> > order helped find the bug with the last version.
> >
> > Attaching both patches.
>
> Looks like I forgot to update some expected output files.

The patch does not apply on top of HEAD as in [1]; please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
54afdcd6182af709cb0ab775c11b90decff166eb ===
=== applying patch
./v1-0001-Do-not-add-the-NEW-entry-to-view-rule-action-s-ra.patch
Hunk #1 succeeded at 1908 (offset 1 line).
=== applying patch ./v2-0001-Remove-UpdateRangeTableOfViewParse.patch
patching file contrib/postgres_fdw/expected/postgres_fdw.out
Hunk #1 FAILED at 2606.
Hunk #2 FAILED at 2669.
2 out of 4 hunks FAILED -- saving rejects to file
contrib/postgres_fdw/expected/postgres_fdw.out.rej

[1] - http://cfbot.cputube.org/patch_41_4048.log

Regards,
Vignesh




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-04 Thread vignesh C
On Tue, 3 Jan 2023 at 14:08, David Geier  wrote:
>
> Hi Lukas,
>
> On 1/2/23 20:50, Lukas Fittl wrote:
> > Thanks for continuing to work on this patch, and my apologies for
> > silence on the patch.
>
> It would be great if you could review it.
> Please also share your thoughts around exposing the used clock source as
> GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().
>
> I rebased again on master because of [1]. Patches attached.
>
> >
> > Its been hard to make time, and especially so because I typically
> > develop on an ARM-based macOS system where I can't test this directly
> > - hence my tests with virtualized EC2 instances, where I ran into the
> > timing oddities.
> That's good and bad. Bad to do the development and good to test the
> implementation on more virtualized setups; given that I also encountered
> "interesting" behavior on VMWare (see my previous mails).
> >
> > On Mon, Jan 2, 2023 at 5:28 AM David Geier  wrote:
> >
> > The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
> > variants
> > return double. This seems error prone. What about renaming the
> > function
> > or also have the function return a double and cast where necessary at
> > the call site?
> >
> >
> > Minor note, but in my understanding using a uint64 (where we can) is
> > faster for any simple arithmetic we do with the values.
>
> That's true. So the argument could be that for seconds and milliseconds
> we want the extra precision while microseconds are precise enough.
> Still, we could also make the seconds and milliseconds conversion code
> integer only and e.g. return two integers with the value before and
> after the comma. FWICS, the functions are nowhere used in performance
> critical code, so it doesn't really make a difference performance-wise.
>
> >
> > +1, and feel free to carry this patch forward - I'll try to make an
> > effort to review my earlier testing issues again, as well as your
> > later improvements to the patch.
> Moved to the current commit fest. Will you become reviewer?
> >
> > Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
> > ANALYZE with timing over in [0], using a sampling-based approach to
> > reduce the timing overhead.
>
> Interesting idea. I'll reply with some thoughts on the corresponding thread.
>
> [1]
> https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com

CFBot shows some compilation errors as in [1]; please post an updated
version:
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] 
/tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] 
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh




Re: WIP: Aggregation push-down - take2

2023-01-04 Thread vignesh C
On Thu, 17 Nov 2022 at 16:34, Antonin Houska  wrote:
>
> Tomas Vondra  wrote:
>
> > Hi,
> >
> > I did a quick initial review of the v20 patch series. I plan to do a
> > more thorough review over the next couple days, if time permits. In
> > general I think the patch is in pretty good shape.
>
> Thanks.
>
> > I've added a bunch of comments in a number of places - see the "review
> > comments" parts for each of the original parts. That should make it
> > easier to deal with all the items. I'll go through the main stuff here:
>
> Unless I miss something, all these items are covered in context below, except
> for this one:
>
> > 7) when I change enable_agg_pushdown to true and run regression tests, I
> > get a bunch of failures like
> >
> >ERROR:  WindowFunc found where not expected
> >
> > Seems we don't handle window functions correctly somewhere, or maybe
> > setup_aggregate_pushdown should check/reject hasWindowFuncs too?
>
> We don't need to reject window functions, window functions are processed after
> grouping/aggregation. The problem I noticed in the regression tests was that a
> window function referenced a (non-window) aggregate. We just need to ensure
> that pull_var_clause() recurses into that window function in such cases:
>
> Besides the next version, v21-fixes.patch file is attached. It tries to
> summarize all the changes between v21 and v22. (I wonder if this attachment
> makes the cfbot fail.)
>
>
> diff --git a/src/backend/optimizer/plan/initsplan.c 
> b/src/backend/optimizer/plan/initsplan.c
> index 8e913c92d8..8dc39765f2 100644
> --- a/src/backend/optimizer/plan/initsplan.c
> +++ b/src/backend/optimizer/plan/initsplan.c
> @@ -355,7 +355,8 @@ create_aggregate_grouped_var_infos(PlannerInfo *root)
> Assert(root->grouped_var_list == NIL);
>
> tlist_exprs = pull_var_clause((Node *) root->processed_tlist,
> -                             PVC_INCLUDE_AGGREGATES);
> +                             PVC_INCLUDE_AGGREGATES |
> +                             PVC_RECURSE_WINDOWFUNCS);
>
> /*
>  * Although GroupingFunc is related to root->parse->groupingSets, this
>
>
> > ---
> >  src/backend/optimizer/util/relnode.c | 11 +++
> >  src/include/nodes/pathnodes.h|  3 +++
> >  2 files changed, 14 insertions(+)
> >
> > diff --git a/src/backend/optimizer/util/relnode.c 
> > b/src/backend/optimizer/util/relnode.c
> > index 94720865f47..d4367ba14a5 100644
> > --- a/src/backend/optimizer/util/relnode.c
> > +++ b/src/backend/optimizer/util/relnode.c
> > @@ -382,6 +382,12 @@ find_base_rel(PlannerInfo *root, int relid)
> >  /*
> >   * build_rel_hash
> >   * Construct the auxiliary hash table for relation specific data.
> > + *
> > + * XXX Why is this renamed, leaving out the "join" part? Are we going to 
> > use
> > + * it for other purposes?
>
> Yes, besides join relation, it's used to find the "grouped relation" by
> Relids. This change tries to follow the suggestion "Maybe an appropriate
> preliminary patch ..." in [1], but I haven't got any feedback whether my
> understanding was correct.
>
> > + * XXX Also, why change the API and not pass PlannerInfo? Seems pretty 
> > usual
> > + * for planner functions.
>
> I think that the reason was that, with the patch applied, PlannerInfo contains
> multiple fields of the RelInfoList type, so build_rel_hash() needs an
> information which one it should process. Passing the exact field is simpler
> than passing PlannerInfo plus some additional information.
>
> >   */
> >  static void
> >  build_rel_hash(RelInfoList *list)
> > @@ -422,6 +428,11 @@ build_rel_hash(RelInfoList *list)
> >  /*
> >   * find_rel_info
> >   * Find a base or join relation entry.
> > + *
> > + * XXX Why change the API and not pass PlannerInfo? Seems pretty usual
> > + * for planner functions.
>
> For the same reason that build_rel_hash() receives the list explicitly, see
> above.
>
> > + * XXX I don't understand why we need both this and find_join_rel.
>
> Perhaps I just wanted to keep the call sites of find_join_rel() untouched. I
> think that
>
> find_join_rel(root, relids);
>
> is a little bit easier to read than
>
> (RelOptInfo *) find_rel_info(root->join_rel_list, relids);
>
> >   */
> >  static void *
> >  find_rel_info(RelInfoList *list, Relids relids)
> > diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
> > index 0ca7d5ab51e..018ce755720 100644
> > --- a/src/include/nodes/pathnodes.h
> > +++ b/src/include/nodes/pathnodes.h
> > @@ -88,6 +88,9 @@ typedef enum UpperRelationKind
> >   * present and valid when rel_hash is not NULL.  Note that we still 
> > maintain
> >   * the list even when using the hash table for lookups; this simplifies 
> > life
> >   * for GEQO.
> > + *
> > + * XXX I wonder why we actually need a separate node, merely wrapping 
> > fields
> > + * that already existed 

Re: Prefetch the next tuple's memory during seqscans

2023-01-04 Thread vignesh C
On Wed, 23 Nov 2022 at 03:28, David Rowley  wrote:
>
> On Thu, 3 Nov 2022 at 06:25, Andres Freund  wrote:
> > Attached is an experimental patch/hack for that. It ended up being more
> > beneficial to make the access ordering more optimal than prefetching the 
> > tuple
> > contents, but I'm not at all sure that's the be-all-end-all.
>
> Thanks for writing that patch. I've been experimenting with it.
>
> I tried unrolling the loop (patch 0003) as you mentioned in:
>
> + * FIXME: Worth unrolling so that we don't fetch the same cacheline
> + * over and over, due to line items being smaller than a cacheline?
>
> but didn't see any gains from doing that.
>
> I also adjusted your patch a little so that instead of doing:
>
> - OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
> + OffsetNumber *rs_vistuples;
> + OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage]; /* their offsets */
>
> to work around the issue of having to populate rs_vistuples_d in
> reverse, I added a new field called rs_startindex to mark where the
> first element in the rs_vistuples array is. The way you wrote it seems
> to require fewer code changes, but per the FIXME comment you left, I
> get the idea you just did it the way you did to make it work enough
> for testing.
>
> I'm quite keen to move forward in committing the 0001 patch to add the
> pg_prefetch_mem macro. What I'm a little undecided about is what the
> best patch is to commit first to make use of the new macro.
>
> I did some tests on the attached set of patches:
>
> alter system set max_parallel_workers_per_gather = 0;
> select pg_reload_conf();
>
> create table t as select a from generate_series(1,1000)a;
> alter table t set (autovacuum_enabled=false);
>
> $ cat bench.sql
> select * from t where a = 0;
>
> psql -c "select pg_prewarm('t');" postgres
>
> -- Test 1 no frozen tuples in "t"
>
> Master (@9c6ad5eaa):
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 383.332 ms
> latency average = 375.747 ms
> latency average = 376.090 ms
>
> Master + 0001 + 0002:
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 370.133 ms
> latency average = 370.149 ms
> latency average = 370.157 ms
>
> Master + 0001 + 0005:
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 372.662 ms
> latency average = 371.034 ms
> latency average = 372.709 ms
>
> -- Test 2 "select count(*) from t" with all tuples frozen
>
> $ cat bench1.sql
> select count(*) from t;
>
> psql -c "vacuum freeze t;" postgres
> psql -c "select pg_prewarm('t');" postgres
>
> Master (@9c6ad5eaa):
> $ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 406.238 ms
> latency average = 407.029 ms
> latency average = 406.962 ms
>
> Master + 0001 + 0005:
> $ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 345.470 ms
> latency average = 345.775 ms
> latency average = 345.354 ms
>
> My current thoughts are that it might be best to go with 0005 to start
> with.  I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.

The patch does not apply on top of HEAD as in [1]; please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
5212d447fa53518458cbe609092b347803a667c5 ===
=== applying patch ./v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch
=== applying patch ./v2-0002-Perform-memory-prefetching-in-heapgetpage.patch
patching file src/backend/access/heap/heapam.c
Hunk #1 FAILED at 451.
1 out of 6 hunks FAILED -- saving rejects to file
src/backend/access/heap/heapam.c.rej

[1] - http://cfbot.cputube.org/patch_41_3978.log

Regards,
Vignesh




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Peter Geoghegan
On Tue, Jan 3, 2023 at 4:54 PM Andres Freund  wrote:
> There's some changes from TransactionIdDidCommit() to !TransactionIdDidAbort()
> that don't look right to me. If the server crashed while xid X was
> in-progress, TransactionIdDidCommit(X) will return false, but so will
> TransactionIdDidAbort(X). So besides moving when the check happens you also
> changed what's being checked in a more substantial way.

I did point this out on the thread. I made this change with the
intention of making the check more robust. Apparently this was
misguided.

Where is the behavior that you describe documented, if anywhere?

> Also, why did you change when MarkBufferDirty() happens? Previously it
> happened before we modify the page contents, now after. That's probably fine
> (it's the order suggested in transam/README), but seems like a mighty subtle
> thing to change at the same time as something unrelated, particularly without
> even mentioning it?

I changed it because the new order is idiomatic. I didn't think that
this was particularly worth mentioning, or even subtle. The logic from
heap_execute_freeze_tuple() only performs simple in-place
modifications.

-- 
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Andres Freund
Hi,

On 2023-01-03 19:23:41 +, Peter Geoghegan wrote:
> Delay commit status checks until freezing executes.
> 
> pg_xact lookups are relatively expensive.  Move the xmin/xmax commit
> status checks from the point that freeze plans are prepared to the point
> that they're actually executed.  Otherwise we'll repeat many commit
> status checks whenever multiple successive VACUUM operations scan the
> same pages and decide against freezing each time, which is a waste of
> cycles.
> 
> Oversight in commit 1de58df4, which added page-level freezing.
> 
> Author: Peter Geoghegan 
> Discussion: 
> https://postgr.es/m/cah2-wzkzpe4k6qmfet8h4qyjckc2r7tpvksbva7jc9w7igx...@mail.gmail.com

There's some changes from TransactionIdDidCommit() to !TransactionIdDidAbort()
that don't look right to me. If the server crashed while xid X was
in-progress, TransactionIdDidCommit(X) will return false, but so will
TransactionIdDidAbort(X). So besides moving when the check happens you also
changed what's being checked in a more substantial way.


Also, why did you change when MarkBufferDirty() happens? Previously it
happened before we modify the page contents, now after. That's probably fine
(it's the order suggested in transam/README), but seems like a mighty subtle
thing to change at the same time as something unrelated, particularly without
even mentioning it?


Greetings,

Andres Freund




  1   2   >