Re: plpgsql_plugin's stmt_end() is not called when an error is caught

2022-12-14 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 08:41:21 +0100, Pavel Stehule  
wrote in 
> On Thu, Dec 15, 2022 at 8:25 AM Masahiko Sawada 
> wrote:
> > Is this a bug in plpgsql?
> >
> 
> I think it is by design.  There is no callback that is called after an
> exception.
> 
> That is true, so some callbacks on statement error and function error could
> be nice. They would help me implement profilers or tracers more simply and
> more robustly.
> 
> But I am not sure about the performance impact. This is on a critical path.

I haven't checked, but I guess the end-side callbacks of all
begin/end-style callback pairs are skipped on exception. An additional
PG_TRY level wouldn't be acceptable for performance reasons.

What we (the pg_hint_plan people) want is some means of knowing that the
top-level function has exited, so that the function nest level can be
reset. That would be simpler than calling the end callback at every nest
level.
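
To illustrate what that would buy (a hypothetical sketch -- the
"func_end_toplevel" callback does not exist in plpgsql today, and
"plpgsql_nest_level" is this sketch's own bookkeeping):

static int	plpgsql_nest_level = 0;

static void
my_func_beg(PLpgSQL_execstate *estate, PLpgSQL_function *func)
{
	plpgsql_nest_level++;
}

static void
my_func_end_toplevel(PLpgSQL_execstate *estate, PLpgSQL_function *func)
{
	/*
	 * Invoked whenever the outermost function exits, error or not, so a
	 * single reset here replaces paired bookkeeping at every nest level.
	 */
	plpgsql_nest_level = 0;
}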

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: plpgsql_plugin's stmt_end() is not called when an error is caught

2022-12-14 Thread Pavel Stehule
On Thu, Dec 15, 2022 at 8:25 AM Masahiko Sawada 
wrote:

> Hi,
>
> While investigating the issue reported on pg_hint_plan[1], I realized
> that the stmt_end() callback is not called if an error raised during
> statement execution is caught. I've attached a patch to check when
> stmt_beg() and stmt_end() are called. Here is an example:
>
> postgres(1:3220232)=# create or replace function testfn(a text) returns
> int as
> $$
> declare
>   x int;
> begin
>   select a::int into x;
>   return x;
>   exception when others then return 99;
> end;
> $$
> language plpgsql;
> CREATE FUNCTION
>
> postgres(1:3220232)=# select testfn('1');
> NOTICE:  stmt_beg toplevel_block
> NOTICE:  stmt_beg stmt SQL statement
> NOTICE:  stmt_end stmt SQL statement
> NOTICE:  stmt_beg stmt RETURN
> NOTICE:  stmt_end stmt RETURN
> NOTICE:  stmt_end toplevel_block
>  testfn
> --------
>       1
> (1 row)
>
> postgres(1:3220232)=# select testfn('x');
> NOTICE:  stmt_beg toplevel_block
> NOTICE:  stmt_beg stmt SQL statement
> NOTICE:  stmt_beg stmt RETURN
> NOTICE:  stmt_end stmt RETURN
> NOTICE:  stmt_end toplevel_block
>  testfn
> --------
>      99
> (1 row)
>
> In exec_stmt_block(), we call exec_stmts() in a PG_TRY() block and
> call the stmt_beg() and stmt_end() callbacks for each statement executed
> there. However, if an error is caught while executing a statement, we
> jump to the PG_CATCH() block in exec_stmt_block(), so we don't call the
> stmt_end() callback that is supposed to be called in exec_stmts(). To
> fix it, I think we can call the stmt_end() callback in the PG_CATCH() block.
>
> pg_hint_plan increments and decrements a counter in the stmt_beg() and
> stmt_end() callbacks, respectively[2]. It resets the counter when an
> ERROR is raised (and not caught). But if an ERROR is caught, the counter
> can be left with an invalid value.
>
> Is this a bug in plpgsql?
>

I think it is by design.  There is no callback that is called after an
exception.

That is true, so some callbacks on statement error and function error could
be nice. They would help me implement profilers or tracers more simply and
more robustly.

But I am not sure about the performance impact. This is on a critical path.

Regards

Pavel





> Regards,
>
> [1] https://github.com/ossc-db/pg_hint_plan/issues/93
> [2]
> https://github.com/ossc-db/pg_hint_plan/blob/master/pg_hint_plan.c#L4870
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
>


plpgsql_plugin's stmt_end() is not called when an error is caught

2022-12-14 Thread Masahiko Sawada
Hi,

While investigating the issue reported on pg_hint_plan[1], I realized
that the stmt_end() callback is not called if an error raised during
statement execution is caught. I've attached a patch to check when
stmt_beg() and stmt_end() are called. Here is an example:

postgres(1:3220232)=# create or replace function testfn(a text) returns int as
$$
declare
  x int;
begin
  select a::int into x;
  return x;
  exception when others then return 99;
end;
$$
language plpgsql;
CREATE FUNCTION

postgres(1:3220232)=# select testfn('1');
NOTICE:  stmt_beg toplevel_block
NOTICE:  stmt_beg stmt SQL statement
NOTICE:  stmt_end stmt SQL statement
NOTICE:  stmt_beg stmt RETURN
NOTICE:  stmt_end stmt RETURN
NOTICE:  stmt_end toplevel_block
 testfn
--------
      1
(1 row)

postgres(1:3220232)=# select testfn('x');
NOTICE:  stmt_beg toplevel_block
NOTICE:  stmt_beg stmt SQL statement
NOTICE:  stmt_beg stmt RETURN
NOTICE:  stmt_end stmt RETURN
NOTICE:  stmt_end toplevel_block
 testfn
--------
     99
(1 row)

In exec_stmt_block(), we call exec_stmts() in a PG_TRY() block and
call the stmt_beg() and stmt_end() callbacks for each statement executed
there. However, if an error is caught while executing a statement, we
jump to the PG_CATCH() block in exec_stmt_block(), so we don't call the
stmt_end() callback that is supposed to be called in exec_stmts(). To
fix it, I think we can call the stmt_end() callback in the PG_CATCH() block.
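
For illustration, that would amount to something like the following in
exec_stmt_block()'s error path (a sketch following pl_exec.c's existing
plugin-call pattern, not the attached patch; "stmt" stands for however
the failed statement would have to be tracked):

	PG_CATCH();
	{
		/* ... existing error cleanup ... */

		/*
		 * Sketch: mirror the stmt_end() call that exec_stmts() would
		 * have made if the statement had completed normally.
		 */
		if (*plpgsql_plugin_ptr && (*plpgsql_plugin_ptr)->stmt_end)
			((*plpgsql_plugin_ptr)->stmt_end) (estate, stmt);

		/* ... existing exception matching ... */
	}
	PG_END_TRY();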

pg_hint_plan increments and decrements a counter in the stmt_beg() and
stmt_end() callbacks, respectively[2]. It resets the counter when an
ERROR is raised (and not caught). But if an ERROR is caught, the counter
can be left with an invalid value.

Is this a bug in plpgsql?

Regards,

[1] https://github.com/ossc-db/pg_hint_plan/issues/93
[2] https://github.com/ossc-db/pg_hint_plan/blob/master/pg_hint_plan.c#L4870

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


test_plpgsql.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-14 Thread Michael Paquier
On Wed, Dec 14, 2022 at 04:44:34PM -0600, David Christensen wrote:
> I can get one sent in tomorrow.

-XLogRecordHasFPW(XLogReaderState *record)
+XLogRecordHasFPI(XLogReaderState *record)
This still refers to an FPW, so let's leave that out, as well as any
renamings of this kind.

+   if (config.save_fpi_path != NULL)
+   {
+   /* Create the dir if it doesn't exist */
+   if (pg_mkdir_p(config.save_fpi_path, pg_dir_create_mode) < 0)
+   {
+   pg_log_error("could not create output directory \"%s\": %m",
+config.save_fpi_path);
+   goto bad_argument;
+   }
+   }
It seems to me that you could allow things to work even if the
directory exists and is empty.  See for example
verify_dir_is_empty_or_create() in pg_basebackup.c.
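
A rough sketch of that, assuming src/common's pg_check_dir() return
conventions (0 = does not exist, 1 = empty; other values mean occupied,
a mount point, or an error):

	if (config.save_fpi_path != NULL)
	{
		switch (pg_check_dir(config.save_fpi_path))
		{
			case 0:
				/* Does not exist: create it. */
				if (pg_mkdir_p(config.save_fpi_path, pg_dir_create_mode) < 0)
				{
					pg_log_error("could not create output directory \"%s\": %m",
								 config.save_fpi_path);
					goto bad_argument;
				}
				break;
			case 1:
				/* Exists and is empty: nothing to do. */
				break;
			default:
				/* Exists and is not empty, or cannot be accessed. */
				pg_log_error("directory \"%s\" exists but is not empty",
							 config.save_fpi_path);
				goto bad_argument;
		}
	}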

+my $file_re =
+  
qr/^([0-9A-F]{8})-([0-9A-F]{8})[.][0-9]+[.][0-9]+[.][0-9]+[.][0-9]+(?:_vm|_init|_fsm|_main)?$/;
This is hard to parse for people not used to regexps (I am, a
little).  Perhaps this could use a small comment with an example file
name, or a reference describing this format?

+# Set umask so test directories and files are created with default permissions
+umask(0077);
Incorrect copy-paste coming from elsewhere, like the TAP tests of
pg_basebackup with group permissions?  Doesn't
PostgreSQL::Test::Utils::tempdir already give enough protection in
terms of umask() and permissions?

+   if (config.save_fpi_path != NULL)
+   {
+   /* We fsync our output directory only; since these files are not part
+* of the production database we do not require the performance hit
+* that fsyncing every FPI would entail, so are doing this as a
+* compromise. */
+
+   fsync_fname(config.save_fpi_path, true);
+   }
Why is it necessary to flush anything at all, then?

+my $relation = $node->safe_psql('postgres',
+   q{SELECT format('%s/%s/%s', CASE WHEN reltablespace = 0 THEN
dattablespace ELSE reltablespace END, pg_database.oid,
pg_relation_filenode(pg_class.oid)) FROM pg_class, pg_database WHERE
relname = 'test_table' AND datname = current_database()}
Could you rewrite that to be on multiple lines?

+diag "using walfile: $walfile";
You should avoid the use of "diag", as this would cause extra output
noise with a simple make check.

+$node->safe_psql('postgres', "SELECT 
pg_drop_replication_slot('regress_pg_waldump_slot')")
That's not really necessary; the nodes are wiped out at the end of the
test.

+$node->safe_psql('postgres', <

signature.asc
Description: PGP signature


Re: Error-safe user functions

2022-12-14 Thread Amul Sul
On Thu, Dec 15, 2022 at 11:16 AM Tom Lane  wrote:
>
> Amul Sul  writes:
> > There are other a bunch of hard errors from get_multirange_io_data(),
> > get_range_io_data() and its subroutine can hit, shouldn't we care
> > about those?
>
> I think those are all "internal" errors, ie not reachable as a
> consequence of bad input data.  Do you see a reason to think
> differently?

Makes sense. I was worried about the internal errors as well as errors
that the user can cause while declaring a multirange (e.g. a shell
type), but realized that case gets checked when creating that multirange
type.

Regards,
Amul




Re: Use get_call_result_type() more widely

2022-12-14 Thread Michael Paquier
On Wed, Dec 14, 2022 at 11:14:59AM +0530, Bharath Rupireddy wrote:
> AFAICS, most of these functions have no direct source code callers,
> they're user-facing functions and not in a hot code path. I measured
> the test times of these functions and I don't see much difference [1].

Thanks for the summary.  It looks like your tests involve single
runs.  What is the difference in run-time when invoking these
repeatedly, with a large generate_series() for example, when few or no
tuples are returned?  Do you see a difference in the perf profile?  Some
of the functions could have their time mostly eaten by looking at
the syscache on repeated calls, but you could see the actual work this
involves with a dummy function that returns a large number of
attributes in a single record, as the worst case possible.
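
(For context, the conversion being benchmarked replaces hand-built tuple
descriptors with the funcapi.c lookup below; the syscache cost comes from
this call.  A sketch of the canonical pattern, not a quote from the patch:)

	TupleDesc	tupdesc;

	/* Resolve the function's declared result row type from the catalogs. */
	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");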

Separating things into two buckets..

> [1]
> pg_old_snapshot_time_mapping() - an extension function with no
> internal source code callers, no test coverage.
> pg_visibility_map_summary() - an extension function with no internal
> source code callers, test coverage exists, test times on HEAD:25 ms
> PATCHED:25 ms
> pg_last_committed_xact() and pg_xact_commit_timestamp_origin() - no
> internal source code callers, test coverage exists, test times on
> HEAD:10 ms PATCHED:10 ms
> pg_get_multixact_members() - no internal source code callers, no test
> coverage.
> pg_control_recovery() and pg_control_init() - test coverage exists,
> test times on HEAD:42 ms PATCHED:44 ms
> pg_identify_object() - no internal source code callers, test coverage
> exists, test times on HEAD:17 ms PATCHED:18 ms
> pg_identify_object_as_address() - no internal source code callers,
> test coverage exists, test times on HEAD:66 ms PATCHED:60 ms
> pg_get_object_address() - no internal source code callers, test
> coverage exists, test times on HEAD:66 ms PATCHED:60 ms
> pg_sequence_parameters() - no internal source code callers, test
> coverage exists, test times on HEAD:96 ms PATCHED:98 ms
> ts_token_type_byid(), ts_token_type_byname(), ts_parse_byid() and
> ts_parse_byname() - internal source code callers exists, test coverage
> exists, test times on HEAD:195 ms, pg_dump 10 wallclock secs
> PATCHED:186 ms, pg_dump 10 wallclock secs
> pg_get_keywords() - no internal source code callers, test coverage
> exists, test times on HEAD:129 ms PATCHED:130 ms
> pg_get_catalog_foreign_keys() - no internal source code callers, test
> coverage exists, test times on HEAD:114 ms PATCHED:111 ms
> tsvector_unnest() - no internal source code callers, test coverage
> exists, test times on HEAD:26 ms PATCHED:26 ms
> ts_setup_firstcall() - test coverage exists, test times on HEAD:195 ms
> PATCHED:186 ms
> pg_partition_tree() - no internal source code callers, test coverage
> exists, test times on HEAD:30 ms PATCHED:32 ms
> pg_timezone_abbrevs() - no internal source code callers, test coverage
> exists, test times on HEAD:40 ms PATCHED:36 ms

These ones don't worry me much, TBH.

> pg_stat_get_wal(), pg_stat_get_archiver() and
> pg_stat_get_replication_slot() - no internal source code callers, test
> coverage exists, test times on HEAD:479 ms PATCHED:483 ms
> pg_prepared_xact() - no internal source code callers, test coverage
> exists, test times on HEAD:50 ms, subscription 108 wallclock secs,
> recovery 111 wallclock secs PATCHED:48 ms, subscription 110 wallclock
> secs, recovery 112 wallclock secs
> show_all_settings(), pg_control_system(), pg_control_checkpoint(),
> test_predtest() - no internal source code callers, test coverage
> exists, test times on HEAD:18 ms PATCHED:18 ms
> pg_walfile_name_offset() - no internal source code callers, no test coverage.
> aclexplode() - internal callers exists information_schema.sql,
> indirect test coverage exists.
> pg_stat_file() - no internal source code callers, test coverage
> exists, test times on HEAD:42 ms PATCHED:46 ms
> pg_get_publication_tables() - internal source code callers exist, test
> coverage exists, test times on HEAD:159 ms, subscription 108 wallclock
> secs PATCHED:167 ms, subscription 110 wallclock secs
> pg_lock_status() - no internal source code callers, test coverage
> exists, test times on HEAD:16 ms PATCHED:22 ms
> pg_stat_get_subscription_stats() - no internal source code callers,
> test coverage exists, test times on HEAD:subscription 108 wallclock
> secs PATCHED:subscription 110 wallclock secs

These ones could be involved in monitoring queries run on a periodic
basis.
--
Michael


signature.asc
Description: PGP signature


Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 10:29:17 +0530, Amit Kapila  wrote 
in 
> On Thu, Dec 15, 2022 at 10:11 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila  
> > wrote in
> > > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi
> > >  wrote:
> > > subscriber was busy enough that it doesn't need to add an additional
> > > delay before applying a particular transaction(s), but adding a delay
> > > to such a transaction on the publisher will actually make it take much
> > > longer to reflect than expected. We probably need to name this
> >
> > Isn't the name min_apply_delay implying the same behavior, even though
> > the delay time will be a bit prolonged?
> >
> 
> Sorry, I don't understand what you intend to say on this point. Above,
> I meant that the currently proposed patch won't have such a problem,
> but if we apply the delay on the publisher the problem can happen.

Are you saying that the sender-side delay lets the whole transaction
(if it has not been streamed out) stay on the sender side?  If so... yeah,
I agree that it is undesirable.

> > > parameter as min_send_delay if we want to do what you are saying and
> > > then I don't know if it serves the actual need and also it will be
> > > different from what we do in physical standby.
> >
> > In the first place, physical and logical replication work differently,
> > and the mechanism for delaying "apply" differs even in the current
> > state, in that the logrep delay chokes the stream.
> >
> 
> I think the first preference is to make it work in a similar way (as
> much as possible) to how this parameter works in physical standby, and
> if that is not at all possible then we may consider other approaches.

I understood that. However, I still think choking the stream on the
receiver side alone is kind of ugly, since it breaks the protocol
assumption that in-band maintenance packets are processed in a timely
manner on the peer under normal operation (even though some delays
occur for natural reasons).  In this regard, I'm inclined to be in
favor of Kuroda-san's proposal.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Error-safe user functions

2022-12-14 Thread Tom Lane
Amul Sul  writes:
> There are a bunch of other hard errors that get_multirange_io_data(),
> get_range_io_data() and their subroutines can hit; shouldn't we care
> about those?

I think those are all "internal" errors, ie not reachable as a
consequence of bad input data.  Do you see a reason to think
differently?

regards, tom lane




The drop-index-concurrently-1 isolation test no longer tests what it was meant to

2022-12-14 Thread David Rowley
I'm in the middle of working on making some adjustments to the costs
of Incremental Sorts and I see the patch I wrote changes the plan in
the drop-index-concurrently-1 isolation test.

The particular plan changed currently expects:

---
Sort
  Sort Key: id, data
  ->  Index Scan using test_dc_pkey on test_dc
Filter: ((data)::text = '34'::text)

It seems fairly clear from reading the test spec that this plan is
really meant to be a seq scan plan, and the change made in [1] adjusted
it without any regard for that intent.

That seems to have come around because of how the path generation of
incremental sorts work.  The current incremental sort path generation
will put a Sort path atop of the cheapest input path, even if that
cheapest input path has presorted keys. The test_dc_pkey index
provides presorted input for the required sort order.  Prior to
incremental sort, we did not consider paths which only provided
presorted input to be useful paths, hence we used to get a seq scan
plan.

I propose the attached, which gets rid of the not-so-great casting
method that was originally added to this test to try to force the seq
scan.  It seems a little dangerous to put in hacks like that to force
a particular plan when the resulting plan ends up penalized with a
(1.0e10) disable_cost.  The planner is just not going to be stable
when the plan includes such a large penalty. To force the planner,
I've added another test step that sets enable_seqscan to true, and
adjusted the permutations to run that just before preparing the seq
scan query.

I also tried to make it more clear that we want to be running the
query twice, once with an index scan and again with a seq scan. I'm
hoping the changes to the prepared query names and the extra comments
will help reduce the chances of this getting broken again in the
future.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/test/isolation/expected/drop-index-concurrently-1.out;h=8e6adb66bb1479b8d7db2fcf5f70b89acd3af577;hp=75dff56bc46d40aa8eb012543044b7c10d516b7e;hb=d2d8a229bc5;hpb=3c8553547b1493c4afdb80393f4a47dbfa019a79
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out 
b/src/test/isolation/expected/drop-index-concurrently-1.out
index 97e1a6e779..1cb2250891 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -1,17 +1,17 @@
 Parsed test spec with 3 sessions
 
-starting permutation: noseq chkiso prepi preps begin explaini explains select2 
drop insert2 end2 selecti selects end
-step noseq: SET enable_seqscan = false;
+starting permutation: chkiso prepi preps begin disableseq explaini enableseq 
explains select2 drop insert2 end2 selecti selects end
 step chkiso: SELECT (setting in ('read committed','read uncommitted')) AS 
is_read_committed FROM pg_settings WHERE name = 'default_transaction_isolation';
 is_read_committed
 -
 t
 (1 row)
 
-step prepi: PREPARE getrow_idx AS SELECT * FROM test_dc WHERE data=34 ORDER BY 
id,data;
-step preps: PREPARE getrow_seq AS SELECT * FROM test_dc WHERE 
data::text=34::text ORDER BY id,data;
+step prepi: PREPARE getrow_idxscan AS SELECT * FROM test_dc WHERE data = 34 
ORDER BY id,data;
+step preps: PREPARE getrow_seqscan AS SELECT * FROM test_dc WHERE data = 34 
ORDER BY id,data;
 step begin: BEGIN;
-step explaini: EXPLAIN (COSTS OFF) EXECUTE getrow_idx;
+step disableseq: SET enable_seqscan = false;
+step explaini: EXPLAIN (COSTS OFF) EXECUTE getrow_idxscan;
 QUERY PLAN
 --
 Sort  
@@ -20,16 +20,17 @@ Sort
 Index Cond: (data = 34)   
 (4 rows)
 
-step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seq;
-QUERY PLAN
---
-Sort  
-  Sort Key: id, data  
-  ->  Index Scan using test_dc_pkey on test_dc
-Filter: ((data)::text = '34'::text)   
+step enableseq: SET enable_seqscan = true;
+step explains: EXPLAIN (COSTS OFF) EXECUTE getrow_seqscan;
+QUERY PLAN 
+---
+Sort   
+  Sort Key: id 
+  ->  Seq Scan on test_dc  
+Filter: (data = 34)
 (4 rows)
 
-step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
+step select2: SELECT * FROM test_dc WHERE data = 34 ORDER BY id,data;
 id|data
 --+
 34|  34
@@ -38,14 +39,14 @@ id|data
 step drop: DROP INDEX CONCURRENTLY test_dc_data; 
 step insert2: INSERT INTO test_dc(data) SELECT * FROM generate_series(1, 100);
 step end2: COMMIT;
-step selecti: EXECUTE getrow_idx;
+step selecti: EXECUTE getrow_idxscan;
  id|data
 ---+
  34|  34
 134|  34
 (2 rows)
 
-step selects: EXECUTE getrow_seq;
+step selects: EXECUTE 

Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-14 Thread Masahiko Sawada
On Wed, Dec 14, 2022 at 3:48 PM Amit Kapila  wrote:
>
> On Wed, Dec 14, 2022 at 9:50 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada 
> >  wrote:
> > >
> > > Here are comments on v59 0001, 0002 patches:
> >
> > Thanks for the comments!
> >
> > > +void
> > > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
> > > +{
> > > +while (1)
> > > +{
> > > +SpinLockAcquire(&wshared->mutex);
> > > +
> > > +/*
> > > + * Don't try to increment the count if the parallel
> > > apply worker is
> > > + * taking the stream lock. Otherwise, there would be
> > > a race condition
> > > + * that the parallel apply worker checks there is no
> > > pending streaming
> > > + * block and before it actually starts waiting on a
> > > lock, the leader
> > > + * sends another streaming block and take the stream
> > > lock again. In
> > > + * this case, the parallel apply worker will start
> > > waiting for the next
> > > + * streaming block whereas there is actually a
> > > pending streaming block
> > > + * available.
> > > + */
> > > +if (!wshared->pa_wait_for_stream)
> > > +{
> > > +wshared->pending_stream_count++;
> > > +SpinLockRelease(&wshared->mutex);
> > > +break;
> > > +}
> > > +
> > > +SpinLockRelease(&wshared->mutex);
> > > +}
> > > +}
> > >
> > > I think we should add an assertion to check that we don't hold the stream
> > > lock.
> > >
> > > I think that waiting for pa_wait_for_stream to be false in a busy loop is 
> > > not a
> > > good idea. It's not interruptible and there is no guarantee that we can 
> > > break
> > > from this loop in a short time. For instance, if PA executes
> > > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > > pa_increment_stream_block(), LA has to wait for PA to acquire and release 
> > > the
> > > stream lock in a busy loop. It should not be long in normal cases but the
> > > duration LA needs to wait for PA depends on PA, which could be long. Also
> > > what if PA raises an error in
> > > pa_lock_stream() due to some reasons? I think LA won't be able to detect 
> > > the
> > > failure.
> > >
> > > I think we should at least make it interruptible, and maybe add some
> > > sleep. Or perhaps we can use a condition variable for this case.
> >
>
> Or we can leave this while (true) logic as it is for the first
> version and have a comment to explain this race. Anyway, after
> restarting, it will probably be solved. We can always change this part
> of the code later if this really turns out to be problematic.
>

+1. Thank you Hou-san for adding this comment in the latest version (v61) patch!
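
For the record, the interruptible variant suggested above could have
looked roughly like this (a sketch against the v59 naming, with an
illustrative 1ms nap; not what ended up in v61):

	for (;;)
	{
		bool		pa_waiting;

		CHECK_FOR_INTERRUPTS();

		SpinLockAcquire(&wshared->mutex);
		pa_waiting = wshared->pa_wait_for_stream;
		if (!pa_waiting)
			wshared->pending_stream_count++;
		SpinLockRelease(&wshared->mutex);

		if (!pa_waiting)
			break;

		/* Don't burn CPU while PA is acquiring the stream lock. */
		pg_usleep(1000L);
	}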

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Error-safe user functions

2022-12-14 Thread Amul Sul
On Thu, Dec 15, 2022 at 9:03 AM Tom Lane  wrote:
>
> Here are some proposed patches for converting range_in and multirange_in.
>
> 0001 tackles the straightforward part, which is trapping syntax errors
> and called-input-function errors.  The only thing that I think might
> be controversial here is that I chose to change the signatures of
> the exposed functions range_serialize and make_range rather than
> inventing xxx_safe variants.  I think this is all right, because
> AFAIK the only likely reason for extensions to call either of those
> is that custom types' canonical functions would need to call
> range_serialize --- and those will need to be touched anyway,
> see 0002.
>
> What 0001 does not cover is trapping errors occurring in range
> canonicalize functions.  I'd first thought maybe doing that wasn't
> worth the trouble, but it's not really very hard to fix the built-in
> canonicalize functions, as shown in 0002.  Probably extensions would
> not find it much harder, and in any case they're not really required
> to make their errors soft.
>
> Any objections?
>

There are a bunch of other hard errors that get_multirange_io_data(),
get_range_io_data() and their subroutines can hit; shouldn't we care
about those?


Regards,
Amul




Re: [PATCH] random_normal function

2022-12-14 Thread Michael Paquier
On Tue, Dec 13, 2022 at 03:51:11PM -0800, Paul Ramsey wrote:
> Clearing up one CI failure.

+-- normal values converge on stddev == 2.0
+SELECT round(stddev(random_normal(2, 2)))
+  FROM generate_series(1, 1);

I am not sure that it is a good idea to make a test based on random
behavior that should tend toward a normalized value.  This is costly in
cycles, requiring a lot of work just for generate_series().  You could
do the same kind of thing as random() does a few lines above?

+SELECT bool_and(random_string(16) != random_string(16)) AS same
+  FROM generate_series(1,8);
That should be fine in terms of impossible chances :)

+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("return size must be non-negative")))
This could have a test, same for 0.

+#ifndef M_PI
+#define M_PI 3.14159265358979323846
+#endif
Postgres' float.h includes one version of that.
--
Michael


signature.asc
Description: PGP signature


Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Amit Kapila
On Thu, Dec 15, 2022 at 10:11 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila  
> wrote in
> > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila  
> > > wrote in
> > > > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu)
> > > >  wrote:
> > > > > One idea to avoid that is to send the min_apply_delay subscriber
> > > > > option to the publisher and compare them, but it may not be
> > > > > sufficient, because XXX_timeout GUC parameters could be modified
> > > > > later.
> > > > >
> > > >
> > > > How about restarting the apply worker if min_apply_delay changes? Will
> > > > that be sufficient?
> > >
> > > Mmm. If publisher knows that value, isn't it albe to delay *sending*
> > > data in the first place? This will resolve many known issues including
> > > walsender's un-terminatability, possible buffer-full and status packet
> > > exchanging.
> > >
> >
> > Yeah, but won't it change the meaning of this parameter? Say the
>
> It changes internally, but does not change on its face.  The difference is
> only in where the choking point exists. If ".._apply_delay" should
> work literally, we should go the way Kuroda-san proposed, namely
> "the apply worker has received the data, but will delay applying it".  If
> we were to name the current behavior technically correctly, it would be
> "min_receive_delay" or "min_choking_interval".
>
> > subscriber was busy enough that it doesn't need to add an additional
> delay before applying a particular transaction(s), but adding a delay
> > to such a transaction on the publisher will actually make it take much
> > longer to reflect than expected. We probably need to name this
>
> Isn't the name min_apply_delay implying the same behavior, even though
> the delay time will be a bit prolonged?
>

Sorry, I don't understand what you intend to say on this point. Above,
I meant that the currently proposed patch won't have such a problem,
but if we apply the delay on the publisher the problem can happen.

> > parameter as min_send_delay if we want to do what you are saying and
> > then I don't know if it serves the actual need and also it will be
> > different from what we do in physical standby.
>
> In the first place, physical and logical replication work differently,
> and the mechanism for delaying "apply" differs even in the current
> state, in that the logrep delay chokes the stream.
>

I think the first preference is to make it work in a similar way (as
much as possible) to how this parameter works in physical standby, and
if that is not at all possible then we may consider other approaches.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila  wrote 
in 
> On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila  
> > wrote in
> > > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu)
> > >  wrote:
> > > > One idea to avoid that is to send the min_apply_delay subscriber option
> > > > to the publisher and compare them, but it may not be sufficient,
> > > > because XXX_timeout GUC parameters could be modified later.
> > > >
> > >
> > > How about restarting the apply worker if min_apply_delay changes? Will
> > > that be sufficient?
> >
> > Mmm. If the publisher knows that value, isn't it able to delay *sending*
> > the data in the first place? This would resolve many known issues,
> > including the walsender's un-terminatability, possible buffer-full
> > conditions, and status packet exchanges.
> >
> 
> Yeah, but won't it change the meaning of this parameter? Say the

It changes internally, but does not change on its face.  The difference is
only in where the choking point exists. If ".._apply_delay" should
work literally, we should go the way Kuroda-san proposed, namely
"the apply worker has received the data, but will delay applying it".  If
we were to name the current behavior technically correctly, it would be
"min_receive_delay" or "min_choking_interval".

> subscriber was busy enough that it doesn't need to add an additional
> delay before applying a particular transaction(s), but adding a delay
> to such a transaction on the publisher will actually make it take much
> longer to reflect than expected. We probably need to name this

Isn't the name min_apply_delay implying the same behavior, even though
the delay time will be a bit prolonged?

> parameter as min_send_delay if we want to do what you are saying and
> then I don't know if it serves the actual need and also it will be
> different from what we do in physical standby.

In the first place, physical and logical replication work differently,
and the mechanism for delaying "apply" differs even in the current
state, in that the logrep delay chokes the stream.

I guess they cannot be different in terms of normal operation. But I'm
not sure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 09:23:12 +0530, Amit Kapila  wrote 
in 
> On Thu, Dec 15, 2022 at 7:16 AM Kyotaro Horiguchi
>  wrote:
> > Allowing walsender to finish while ignoring replication status
> > wouldn't be great.
> >
> 
> Yes, that would be ideal. But do you know why that is a must?

I believe a graceful shutdown (fast and smart) of a replication set is
expected to end in sync.  Of course we can change the policy to allow
walsender to stop before confirming that all WAL has been applied. However,
walsender has no idea whether the peer is intentionally delaying or not.

> >  One idea is to let logical workers send delaying
> > status.
> >
> 
> How can that help?

If the logical worker notifies "I'm intentionally pausing replication for
now, so if you want to shut down, please go ahead and ignore me", the
publisher can legitimately run a (kind of) dirty shutdown.

# It looks a bit too much, though..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Amit Kapila
On Thu, Dec 15, 2022 at 7:16 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 14 Dec 2022 10:46:17 +, "Hayato Kuroda (Fujitsu)" 
>  wrote in
> > I have implemented and tested that workers wake up per 
> > wal_receiver_timeout/2
> > and send keepalive. Basically it works well, but I found two problems.
> > Do you have any good suggestions about them?
> >
> > 1)
> >
> > With this PoC at present, workers calculate sending intervals based on its
> > wal_receiver_timeout, and it is suppressed when the parameter is set to 
> > zero.
> >
> > This means that there is a possibility that walsender times out when
> > wal_sender_timeout on the publisher and wal_receiver_timeout on the
> > subscriber are different.
> > Supposing that wal_sender_timeout is 2min, wal_receiver_timeout is 5min,
>
> It seems to me wal_receiver_status_interval is better for this use.
> It's enough for us to document that "wal_r_s_interval should be
> shorter than wal_sender_timeout/2, especially when a logical replication
> connection is using min_apply_delay; otherwise you will suffer
> repeated termination of walsender".
>

This sounds reasonable to me.

> > and min_apply_delay is 10min. The worker on the subscriber will wake up
> > every 2.5min and send keepalives, but walsender exits before the message
> > arrives at the publisher.
> >
> > One idea to avoid that is to send the min_apply_delay subscriber option
> > to the publisher and compare them, but it may not be sufficient,
> > because XXX_timeout GUC parameters could be modified later.
>
> # Anyway, I don't think such asymmetric setup is preferable.
>
> > 2)
> >
> > The issue reported by Vignesh-san[1] still remains. I have already
> > analyzed it[2]; the root cause is that the flushed WAL position is not
> > updated and sent to the publisher. Even if workers send keepalive
> > messages to the publisher during the delay, the flushed position
> > cannot be advanced.
>
> I didn't look closely, but I guess the cause is that walsender doesn't die
> until all WAL has been sent, while the logical delay chokes the replication
> stream.
>

Right, I also think so.

> Allowing walsender to finish while ignoring replication status
> wouldn't be great.
>

Yes, that would be ideal. But do you know why that is a must?

>  One idea is to let logical workers send delaying
> status.
>

How can that help?

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Amit Kapila
On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila  
> wrote in
> > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > > One idea to avoid that is to send the min_apply_delay subscriber option
> > > to the publisher and compare them, but it may not be sufficient,
> > > because XXX_timeout GUC parameters could be modified later.
> > >
> >
> > How about restarting the apply worker if min_apply_delay changes? Will
> > that be sufficient?
>
> Mmm. If the publisher knows that value, isn't it able to delay *sending*
> the data in the first place? This would resolve many known issues,
> including the walsender's un-terminatability, possible buffer-full
> conditions, and status packet exchanges.
>

Yeah, but won't it change the meaning of this parameter? Say the
subscriber was busy enough that it doesn't need to add an additional
delay before applying a particular transaction(s), but adding a delay
to such a transaction on the publisher will actually make it take much
longer to reflect than expected. We probably need to name this
parameter as min_send_delay if we want to do what you are saying and
then I don't know if it serves the actual need and also it will be
different from what we do in physical standby.

-- 
With Regards,
Amit Kapila.




Re: Error-safe user functions

2022-12-14 Thread Tom Lane
Here are some proposed patches for converting range_in and multirange_in.

0001 tackles the straightforward part, which is trapping syntax errors
and called-input-function errors.  The only thing that I think might
be controversial here is that I chose to change the signatures of
the exposed functions range_serialize and make_range rather than
inventing xxx_safe variants.  I think this is all right, because
AFAIK the only likely reason for extensions to call either of those
is that custom types' canonical functions would need to call
range_serialize --- and those will need to be touched anyway,
see 0002.

What 0001 does not cover is trapping errors occurring in range
canonicalize functions.  I'd first thought maybe doing that wasn't
worth the trouble, but it's not really very hard to fix the built-in
canonicalize functions, as shown in 0002.  Probably extensions would
not find it much harder, and in any case they're not really required
to make their errors soft.

Any objections?

regards, tom lane

diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index 307d087c97..ed26cfba2d 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -120,6 +120,7 @@ multirange_in(PG_FUNCTION_ARGS)
 	char	   *input_str = PG_GETARG_CSTRING(0);
 	Oid			mltrngtypoid = PG_GETARG_OID(1);
 	Oid			typmod = PG_GETARG_INT32(2);
+	Node	   *escontext = fcinfo->context;
 	TypeCacheEntry *rangetyp;
 	int32		ranges_seen = 0;
 	int32		range_count = 0;
@@ -133,6 +134,7 @@ multirange_in(PG_FUNCTION_ARGS)
 	const char *range_str_begin = NULL;
 	int32		range_str_len;
 	char	   *range_str;
+	Datum		range_datum;
 
 	cache = get_multirange_io_data(fcinfo, mltrngtypoid, IOFunc_input);
 	rangetyp = cache->typcache->rngtype;
@@ -144,7 +146,7 @@ multirange_in(PG_FUNCTION_ARGS)
 	if (*ptr == '{')
 		ptr++;
 	else
-		ereport(ERROR,
+		ereturn(escontext, (Datum) 0,
 (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
  errmsg("malformed multirange literal: \"%s\"",
 		input_str),
@@ -157,7 +159,7 @@ multirange_in(PG_FUNCTION_ARGS)
 		char		ch = *ptr;
 
 		if (ch == '\0')
-			ereport(ERROR,
+			ereturn(escontext, (Datum) 0,
 	(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 	 errmsg("malformed multirange literal: \"%s\"",
 			input_str),
@@ -186,7 +188,7 @@ multirange_in(PG_FUNCTION_ARGS)
 	parse_state = MULTIRANGE_AFTER_RANGE;
 }
 else
-	ereport(ERROR,
+	ereturn(escontext, (Datum) 0,
 			(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 			 errmsg("malformed multirange literal: \"%s\"",
 	input_str),
@@ -204,10 +206,14 @@ multirange_in(PG_FUNCTION_ARGS)
 			repalloc(ranges, range_capacity * sizeof(RangeType *));
 	}
 	ranges_seen++;
-	range = DatumGetRangeTypeP(InputFunctionCall(&cache->typioproc,
-												 range_str,
-												 cache->typioparam,
-												 typmod));
+	if (!InputFunctionCallSafe(&cache->typioproc,
+							   range_str,
+							   cache->typioparam,
+							   typmod,
+							   escontext,
+							   &range_datum))
+		PG_RETURN_NULL();
+	range = DatumGetRangeTypeP(range_datum);
 	if (!RangeIsEmpty(range))
 		ranges[range_count++] = range;
 	parse_state = MULTIRANGE_AFTER_RANGE;
@@ -256,7 +262,7 @@ multirange_in(PG_FUNCTION_ARGS)
 else if (ch == '}')
 	parse_state = MULTIRANGE_FINISHED;
 else
-	ereport(ERROR,
+	ereturn(escontext, (Datum) 0,
 			(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 			 errmsg("malformed multirange literal: \"%s\"",
 	input_str),
@@ -280,7 +286,7 @@ multirange_in(PG_FUNCTION_ARGS)
 		ptr++;
 
 	if (*ptr != '\0')
-		ereport(ERROR,
+		ereturn(escontext, (Datum) 0,
 (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
  errmsg("malformed multirange literal: \"%s\"",
 		input_str),
@@ -807,7 +813,7 @@ multirange_get_union_range(TypeCacheEntry *rangetyp,
	multirange_get_bounds(rangetyp, mr, 0, &lower, &tmp);
	multirange_get_bounds(rangetyp, mr, mr->rangeCount - 1, &tmp, &upper);
 
-	return make_range(rangetyp, &lower, &upper, false);
+	return make_range(rangetyp, &lower, &upper, false, NULL);
 }
 
 
@@ -2696,7 +2702,8 @@ range_merge_from_multirange(PG_FUNCTION_ARGS)
		multirange_get_bounds(typcache->rngtype, mr, mr->rangeCount - 1,
							  &tmp, &lastbound);
 
-		result = make_range(typcache->rngtype, &firstbound, &lastbound, false);
+		result = make_range(typcache->rngtype, &firstbound, &lastbound,
+							false, NULL);
 	}
 
 	PG_RETURN_RANGE_P(result);
diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index 919c8889d4..8fd6250e4a 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -221,7 +221,8 @@ multirangesel(PG_FUNCTION_ARGS)
 			upper.val = ((Const *) other)->constvalue;
 			upper.infinite = false;
 			upper.lower = false;
-			constrange = range_serialize(typcache->rngtype, &lower, &upper, false);
+			constrange = 

check_strxfrm_bug()

2022-12-14 Thread Thomas Munro
Hi

While studying Jeff's new crop of collation patches I noticed in
passing that check_strxfrm_bug() must surely by now be unnecessary.
The buffer overrun bugs were fixed a decade ago, and the relevant
systems are way out of support.  If you're worried that the bugs might
come back, then the test is insufficient: modern versions of both OSes
have strxfrm_l(), which we aren't checking.  In any case, we also
completely disable this stuff because of bugs and quality problems in
every other known implementation, via TRUST_STRXFRM (or rather the
lack of it).  So I think it's time to remove that function; please see
attached.

Just by the way, if you like slow motion domino runs, check this out:

* Original pgsql-bugs investigation into strxfrm() inconsistencies
  
https://www.postgresql.org/message-id/flat/111d0e27-a8f3-4a84-a4e0-b0fb70386...@s24.com

* I happened to test that on bleeding-edge FreeBSD 11 (wasn't released
yet), because at that time FreeBSD was in the process of adopting
illumos's new collation code, and reported teething problems:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208266

* FreeBSD, DragonFly and illumos's trees were then partially fixed by
the authors, but our strcolltest.c still showed some remaining
problems in some locales (and it still does on my local FreeBSD
battlestation):
  
https://github.com/freebsd/freebsd-src/commit/c48dc2a193b9befceda8dfc6f894d73251cc00a4
  https://www.illumos.org/rb/r/402/

* The authors traced the remaining problem to flaws in the Unicode
project's CLDR/POSIX data, and the report was accepted:
  https://www.illumos.org/issues/7962
  https://unicode-org.atlassian.net/browse/CLDR-10394

Eventually that'll be fixed, and (I guess) trigger at least a CLDR
minor version bump affecting all downstream consumers (ICU, ...).
Then... maybe... at least FreeBSD will finally pass that test.  I do
wonder whether other consumer libraries are also confused by that
problem source data, and if not, why not; are glibc's problems related
or just random code or data quality problems in different areas?  (I
also don't know why a problem in that data should affect strxfrm() and
strcoll() differently, but I don't plan to find out owing to an acute
shortage of round tuits).

But in the meantime, I still can't recommend turning on TRUST_STRXFRM
on any OS that I know of!  The strcolltest.c program certainly still
finds fault with glibc 2.36 despite the last update on that redhat
bugzilla ticket that suggested that the big resync back in 2.28 was
going to fix it.

To be fair, macOS does actually pass that test for all locales, but
the strxfrm() result is too narrow to be useful, according to comments
in our tree.  I would guess that a couple of other OSes with the old
Berkeley locale code are similar.
From 8a70596a7de680e02fb2a6db59e622a653f097c2 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 15 Dec 2022 13:38:57 +1300
Subject: [PATCH] Remove obsolete defense against strxfrm() bugs.

Old versions of Solaris and illumos had buffer overrun bugs in their
strxfrm() implementations.  The bugs were fixed a decade ago and the
relevant releases are long out of vendor support.  It's time to remove
the defense added by commit be8b06c3.

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 37989ba567..bf6992d9fe 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -137,8 +137,6 @@ main(int argc, char *argv[])
 	 */
 	unsetenv("LC_ALL");
 
-	check_strxfrm_bug();
-
 	/*
 	 * Catch standard options before doing much else, in particular before we
 	 * insist on not being root.
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..aa60008ad7 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1134,64 +1134,6 @@ IsoLocaleName(const char *winlocname)
 #endif			/* WIN32 && LC_MESSAGES */
 
 
-/*
- * Detect aging strxfrm() implementations that, in a subset of locales, write
- * past the specified buffer length.  Affected users must update OS packages
- * before using PostgreSQL 9.5 or later.
- *
- * Assume that the bug can come and go from one postmaster startup to another
- * due to physical replication among diverse machines.  Assume that the bug's
- * presence will not change during the life of a particular postmaster.  Given
- * those assumptions, call this no less than once per postmaster startup per
- * LC_COLLATE setting used.  No known-affected system offers strxfrm_l(), so
- * there is no need to consider pg_collation locales.
- */
-void
-check_strxfrm_bug(void)
-{
-	char		buf[32];
-	const int	canary = 0x7F;
-	bool		ok = true;
-
-	/*
-	 * Given a two-byte ASCII string and length limit 7, 8 or 9, Solaris 10
-	 * 05/08 returns 18 and modifies 10 bytes.  It respects limits above or
-	 * below that range.
-	 *
-	 * The bug is present in Solaris 8 as well; it is absent in Solaris 10
-	 * 01/13 and Solaris 11.2.  Affected locales include is_IS.ISO8859-1,
-	 * 

Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila  wrote 
in 
> On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu)
>  wrote:
> > One idea to avoid that is to send the min_apply_delay subscriber option
> > to the publisher and compare them, but it may not be sufficient,
> > because XXX_timeout GUC parameters could be modified later.
> >
> 
> How about restarting the apply worker if min_apply_delay changes? Will
> that be sufficient?

Mmm. If the publisher knows that value, isn't it able to delay *sending*
the data in the first place? This would resolve many known issues,
including the walsender's un-terminatability, possible buffer-full
conditions, and status packet exchanges.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Wed, 14 Dec 2022 10:46:17 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> I have implemented and tested that workers wake up per wal_receiver_timeout/2
> and send keepalive. Basically it works well, but I found two problems.
> Do you have any good suggestions about them?
> 
> 1)
> 
> With this PoC at present, workers calculate sending intervals based on its
> wal_receiver_timeout, and it is suppressed when the parameter is set to zero.
> 
> This means that there is a possibility that walsender times out when
> wal_sender_timeout on the publisher and wal_receiver_timeout on the
> subscriber are different.
> Supposing that wal_sender_timeout is 2min, wal_receiver_timeout is 5min,

It seems to me wal_receiver_status_interval is better for this use.
It's enough for us to document that "wal_r_s_interval should be
shorter than wal_sender_timeout/2, especially when a logical replication
connection is using min_apply_delay; otherwise you will suffer
repeated termination of walsender".
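
To illustrate the arithmetic (a sketch with illustrative names and
simplified bookkeeping; "send_feedback", "last_received" and
"target_apply_time" stand in for whatever the PoC actually uses):

	while (remaining_delay_ms > 0)
	{
		/*
		 * Nap at most half the status interval so a keepalive always
		 * reaches the sender well within wal_sender_timeout, given the
		 * documented wal_r_s_interval < wal_sender_timeout / 2.
		 */
		long		nap_ms = Min(remaining_delay_ms,
								 wal_receiver_status_interval * 1000L / 2);

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 nap_ms,
						 WAIT_EVENT_LOGICAL_APPLY_MAIN);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();

		send_feedback(last_received, true, false);	/* keepalive */
		remaining_delay_ms =
			TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
											target_apply_time);
	}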

> and min_apply_delay is 10min. The worker on the subscriber will wake up
> every 2.5min and send keepalives, but walsender exits before the message
> arrives at the publisher.
> 
> One idea to avoid that is to send the min_apply_delay subscriber option
> to the publisher and compare them, but it may not be sufficient,
> because XXX_timeout GUC parameters could be modified later.

# Anyway, I don't think such asymmetric setup is preferable.

> 2)
> 
> The issue reported by Vignesh-san[1] still remains. I have already
> analyzed it[2]; the root cause is that the flushed WAL position is not
> updated and sent to the publisher. Even if workers send keepalive
> messages to the publisher during the delay, the flushed position
> cannot be advanced.

I didn't look closely, but I guess the cause is that walsender doesn't die
until all WAL has been sent, while the logical delay chokes the replication
stream. Allowing walsender to finish while ignoring replication status
wouldn't be great.  One idea is to let logical workers send delaying
status.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_upgrade: Make testing different transfer modes easier

2022-12-14 Thread Kyotaro Horiguchi
At Wed, 14 Dec 2022 10:40:45 +0100, Daniel Gustafsson  wrote 
in 
> > On 14 Dec 2022, at 08:04, Peter Eisentraut 
> >  wrote:
> > 
> > On 07.12.22 17:33, Peter Eisentraut wrote:
> >> I think if we want to make this configurable on the fly, and environment 
> >> variable would be much easier, like
> >> my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
> > 
> > Here is an updated patch set that incorporates this idea.

We already have PG_TEST_EXTRA. Shouldn't we use it here as well?

> I would prefer a small note about it in src/bin/pg_upgrade/TESTING to document
> it outside of the code, but otherwise LGTM.
> 
> + $mode,
>   '--check'
>   ],
> 
> ...
> 
> - '-p', $oldnode->port, '-P', $newnode->port
> + '-p', $oldnode->port, '-P', $newnode->port,
> + $mode,
>   ],
> 
> Minor nitpick, but while in there should we take the opportunity to add a
> trailing comma on the other two array declarations which now end with 
> --check?
> It's good Perl practice and will make the code consistent.

Otherwise looks good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Nathan Bossart
On Thu, Dec 15, 2022 at 09:12:26AM +0900, Michael Paquier wrote:
> On Wed, Dec 14, 2022 at 03:29:39PM -0800, Nathan Bossart wrote:
>> On Wed, Dec 14, 2022 at 11:05:13AM -0800, Jeff Davis wrote:
>>> On Wed, 2022-12-14 at 10:16 -0800, Nathan Bossart wrote:
 Okay.  Should all the privileges governed by MAINTAIN apply to a
 relation's
 TOAST table as well?
>>> 
>>> Yes, I agree.
>> 
>> This might be tricky, because AFAICT you have to scan pg_class to find a
>> TOAST table's main relation.
> 
> Ugh, yeah.  Are we talking about a case where we know the toast
> information but need to look back at some information of its parent to
> make a decision?  I don't recall a case where we do that.  CLUSTER,
> REINDEX and VACUUM lock the parent first when working on it, and no
> AEL is taken on the parent if doing a VACUUM or a REINDEX directly on
> the toast table, so that could lead to deadlock scenarios.  Shouldn't
> MAINTAIN be sent down to the toast table as well if that's not done
> this way?

Another option I'm looking at is skipping the privilege checks when VACUUM
recurses to a TOAST table.  This won't allow you to VACUUM the TOAST table
directly, but it would at least address the originally-reported issue [0].

Since you can't ANALYZE, REFRESH, or LOCK TOAST tables, this isn't a
problem for those commands.  CLUSTER and REINDEX seem to process relations'
TOAST tables without extra privilege checks already.  So with the attached
patch applied, you wouldn't be able to VACUUM, CLUSTER, and REINDEX TOAST
tableѕ directly (unless you were given MAINTAIN or pg_maintain), but you
could indirectly process them by specifying the main relation.

I don't know if this is good enough.  It seems like ideally you should be
able to VACUUM a TOAST table directly if you have MAINTAIN on its main
relation.
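
For reference, the pg_class scan alluded to above would look something
like this (a sketch; there is no syscache or index on reltoastrelid, so
it is a full scan, which is what makes it unattractive):

static Oid
get_toast_parent(Oid toastrelid)
{
	Oid			result = InvalidOid;
	Relation	classRel;
	TableScanDesc scan;
	ScanKeyData key;
	HeapTuple	tuple;

	ScanKeyInit(&key,
				Anum_pg_class_reltoastrelid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(toastrelid));

	classRel = table_open(RelationRelationId, AccessShareLock);
	scan = table_beginscan_catalog(classRel, 1, &key);
	tuple = heap_getnext(scan, ForwardScanDirection);
	if (HeapTupleIsValid(tuple))
		result = ((Form_pg_class) GETSTRUCT(tuple))->oid;
	table_endscan(scan);
	table_close(classRel, AccessShareLock);

	return result;
}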

[0] https://postgr.es/m/b572d238-0de2-9cad-5f34-4741dc627834%40postgrespro.ru

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 293b84bbca..8f80e5d8c0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -91,7 +91,8 @@ static void vac_truncate_clog(TransactionId frozenXID,
 			  MultiXactId minMulti,
 			  TransactionId lastSaneFrozenXid,
 			  MultiXactId lastSaneMinMulti);
-static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
+static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
+	   bool skip_privs);
 static double compute_parallel_delay(void);
 static VacOptValue get_vacoptval_from_boolean(DefElem *def);
 static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -470,7 +471,7 @@ vacuum(List *relations, VacuumParams *params,
 
 			if (params->options & VACOPT_VACUUM)
 			{
-if (!vacuum_rel(vrel->oid, vrel->relation, params))
+if (!vacuum_rel(vrel->oid, vrel->relation, params, false))
 	continue;
 			}
 
@@ -1806,7 +1807,7 @@ vac_truncate_clog(TransactionId frozenXID,
  *		At entry and exit, we are not inside a transaction.
  */
 static bool
-vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
+vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs)
 {
 	LOCKMODE	lmode;
 	Relation	rel;
@@ -1893,7 +1894,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 * happen across multiple transactions where privileges could have changed
 	 * in-between.  Make sure to only generate logs for VACUUM in this case.
 	 */
-	if (!vacuum_is_permitted_for_relation(RelationGetRelid(rel),
+	if (!skip_privs &&
+		!vacuum_is_permitted_for_relation(RelationGetRelid(rel),
 		  rel->rd_rel,
 		  params->options & VACOPT_VACUUM))
 	{
@@ -2067,7 +2069,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 * totally unimportant for toast relations.
 	 */
 	if (toast_relid != InvalidOid)
-		vacuum_rel(toast_relid, NULL, params);
+		vacuum_rel(toast_relid, NULL, params, true);
 
 	/*
 	 * Now release the session-level lock on the main table.
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 34c26a804a..5335e45434 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -2901,6 +2901,7 @@ CREATE INDEX ON maintain_test (a);
 GRANT MAINTAIN ON maintain_test TO regress_maintain;
 CREATE MATERIALIZED VIEW refresh_test AS SELECT 1;
 GRANT MAINTAIN ON refresh_test TO regress_maintain;
+GRANT MAINTAIN ON pg_type TO regress_maintain;
 CREATE SCHEMA reindex_test;
 -- negative tests; should fail
 SET ROLE regress_no_maintain;
@@ -2910,6 +2911,8 @@ ANALYZE maintain_test;
 WARNING:  permission denied to analyze "maintain_test", skipping it
 VACUUM (ANALYZE) maintain_test;
 WARNING:  permission denied to vacuum "maintain_test", skipping it
+VACUUM pg_type;
+WARNING:  permission denied to vacuum "pg_type", skipping it
 CLUSTER maintain_test USING maintain_test_a_idx;
 ERROR: 

Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Jeff Davis
On Wed, 2022-12-14 at 16:11 -0600, Justin Pryzby wrote:
> Yeah, but:
> 
> regression=> insert into p1 values (1);
> ERROR:  permission denied for table p1
> regression=> select * from p1;
> ERROR:  permission denied for table p1

Right, that's what I had in mind: a user is only granted operations on
the partitioned table, not the partitions.

It happens that an INSERT or SELECT on the partitioned table flows
through to the partitions, whereas the VACUUM ends up skipping them, so
I guess the analogy could be interpreted either way. Hmmm...

Thinking about it another way: logical partitioning is about making the
table logically one table, but physically many tables. That would imply
that the privileges should apply per-partition. But then that doesn't
make a lot of sense, because what maintenance can you do on the
partitioned table (which itself has no data)?

There's definitely a problem with this patch and partitioning, because
REINDEX affects the partitions, CLUSTER is a no-op, and VACUUM/ANALYZE
skip them.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Michael Paquier
On Wed, Dec 14, 2022 at 03:29:39PM -0800, Nathan Bossart wrote:
> On Wed, Dec 14, 2022 at 11:05:13AM -0800, Jeff Davis wrote:
>> On Wed, 2022-12-14 at 10:16 -0800, Nathan Bossart wrote:
>>> Okay.  Should all the privileges governed by MAINTAIN apply to a
>>> relation's
>>> TOAST table as well?
>> 
>> Yes, I agree.
> 
> This might be tricky, because AFAICT you have to scan pg_class to find a
> TOAST table's main relation.

Ugh, yeah.  Are we talking about a case where we know the toast
information but need to look back at some information of its parent to
make a decision?  I don't recall a case where we do that.  CLUSTER,
REINDEX and VACUUM lock the parent first when working on it, and no
AEL is taken on the parent if doing a VACUUM or a REINDEX directly on
the toast table, so that could lead to deadlock scenarios.  Shouldn't
MAINTAIN be sent down to the toast table as well if that's not done
this way?

FWIW, I have briefly poked at that here:
https://www.postgresql.org/message-id/yzi+anennpbas...@paquier.xyz
--
Michael


signature.asc
Description: PGP signature


Re: Raising the SCRAM iteration count

2022-12-14 Thread Jonathan S. Katz

On 12/14/22 6:52 PM, Michael Paquier wrote:

On Wed, Dec 14, 2022 at 01:59:04PM -0500, Jonathan S. Katz wrote:

Still, the odds are that we are going to see one update to
SCRAM-SHA-256 that we will just need to pick up?



The attached v2 has the GUC rename and a change to GUC_REPORT such that the
frontend can use the real value rather than the default.  I kept it for super
users so far, do you think it should be a user setting being somewhat sensitive?


No, because a user can set the number of iterations today if they build
their own SCRAM secret. I think it's OK if they change it in a session.

If a superuser wants to enforce a minimum iteration count, they can write a
password_check_hook. (Or we could add another GUC to enforce that).


Hm?  check_password_hook does not allow one to recompile the password
given by the user, except if I am missing your point?
My point is you can write a hook to reject the password if the iteration 
count is "too low". Not to re-hash the password.


Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: Raising the SCRAM iteration count

2022-12-14 Thread Michael Paquier
On Wed, Dec 14, 2022 at 01:59:04PM -0500, Jonathan S. Katz wrote:
> On 12/14/22 6:25 AM, Daniel Gustafsson wrote:
>> I was thinking about it but opted for the simpler approach of a GUC name with
>> the algorithm baked into it: scram_sha256_iterations.  It doesn't seem all 
>> that
>> likely that we'll have more than two versions of SCRAM (sha256/sha512) so
>> the additional complexity doesn't seem worth it.
> 
> I would not rule this out. There is an RFC draft for SCRAM-SHA3-512[1].
> 
> I do have mixed feelings on the 'x1=y1,x2=y2' style GUC, but we do have
> machinery to handle it and it gives a bit more flexibility over how many
> SCRAM hash methods get added. I'd like to hear more feedback.

Technically, I would put the logic to parse the GUC in scram-common.c
and let libpq and the backend use it.  Saying that, we are just
talking about what looks like one new hashing method, so a separate
GUC is fine by me.

> (I don't know if there will be a world where we ever let users BYO-hash, but
> that case may force separate GUCs anyway).
> 
> [1] https://datatracker.ietf.org/doc/draft-melnikov-scram-sha3-512/

Still, the odds are that we are going to see one update to
SCRAM-SHA-256 that we will just need to pick up?

>> The attached v2 has the GUC rename and a change to GUC_REPORT such that the
>> frontend can use the real value rather than the default.  I kept it for super
>> users so far, do you think it should be a user setting being somewhat 
>> sensitive?
> 
> No, because a user can set the number of iterations today if they build
> their own SCRAM secret. I think it's OK if they change it in a session.
> 
> If a superuser wants to enforce a minimum iteration count, they can write a
> password_check_hook. (Or we could add another GUC to enforce that).

Hm?  check_password_hook does not allow one to recompile the password
given by the user, except if I am missing your point?

> -pg_fe_scram_build_secret(const char *password, const char **errstr)
> +pg_fe_scram_build_secret(const char *password, int iterations, const char **errstr)
> 
> I have mild worry about changing this function definition for downstream
> usage, esp. for drivers. Perhaps it's not that big of a deal, and perhaps
> this will end up being needed for the work we've discussed around
> "\password" but I do want to note that this could be a breaking change.

FWIW, an extension would be required to enforce the type of hash
used, which is an extra parameter on top of the iteration number when
building the SCRAM verifier.

> + else if (strcmp(name, "scram_sha256_iterations") == 0)
> + {
> + conn->scram_iterations = atoi(value);
> + }
> 
> Maybe out of scope for this patch based on what else is in the patch, but I
> was wondering why we don't use a "strncmp" here?

What would that change?  This needs an equal match.

conn->in_hot_standby = PG_BOOL_UNKNOWN;
+   conn->scram_iterations = SCRAM_DEFAULT_ITERATIONS;

s/SCRAM_DEFAULT_ITERATIONS/SCRAM_SHA_256_DEFAULT_ITERATIONS/ and
s/scram_iterations/scram_sha_256_iterations/ perhaps?  It does not
look like we'd have the same default across the various SHA variations
if we stick with the RFC definitions.

+#ifndef FRONTEND
+/*
+ * Number of iterations when generating new secrets.
+ */
+extern PGDLLIMPORT int scram_sha256_iterations;
+#endif

It looks like libpq/scram.h, which is backend-only, would be a better
location.

@@ -692,7 +697,7 @@ mock_scram_secret(const char *username, int *iterations, 
char **salt,
encoded_salt[encoded_len] = '\0';
 
*salt = encoded_salt;
-   *iterations = SCRAM_DEFAULT_ITERATIONS;
+   *iterations = scram_sha256_iterations;

This looks incorrect to me?  The mock authentication is here to
produce a realistic verifier, even though it will still fail.  It seems
to me that we'd better stick to the default in all cases.

(FWIW, extending \password with custom options would have the
advantage to allow older server versions to use a custom iteration
number.  Perhaps that's not worth bothering about, just saying as a
separate thing to consider.)
--
Michael


signature.asc
Description: PGP signature


Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Nathan Bossart
On Wed, Dec 14, 2022 at 11:05:13AM -0800, Jeff Davis wrote:
> On Wed, 2022-12-14 at 10:16 -0800, Nathan Bossart wrote:
>> Okay.  Should all the privileges governed by MAINTAIN apply to a
>> relation's
>> TOAST table as well?
> 
> Yes, I agree.

This might be tricky, because AFAICT you have to scan pg_class to find a
TOAST table's main relation.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Error-safe user functions

2022-12-14 Thread Tom Lane
I wrote:
> Amul Sul  writes:
>> Attaching a complete set of the patches changing function till this
>> except bpcharin, byteain jsonpath_in that Andrew is planning to look
>> in. I have skipped reg* functions.

> I'll take a look at these shortly, unless Andrew is already on it.

I've gone through these now and revised/pushed most of them.

I do not think that we need to touch any unimplemented I/O functions.
They can just as well be unimplemented for soft-error cases too;
I can't see any use-case where it could be useful to have them not
complain.  So I discarded

v1-0004-Change-brin_bloom_summary_in-to-allow-non-throw-e.patch
v1-0005-Change-brin_minmax_multi_summary_in-to-allow-non-.patch
v1-0009-Change-gtsvectorin-to-allow-non-throw-error-repor.patch
v1-0018-Change-pg_mcv_list_in-to-allow-non-throw-error-re.patch
v1-0019-Change-pg_ndistinct_in-to-allow-non-throw-error-r.patch

As for the rest, some were closer to being committable than others.
You need to be more careful about handling error cases in subroutines:
you can't just ereturn from a subroutine and figure you're done,
because the caller will keep plugging along if you don't do something
to teach it not to.  What that would often lead to is the caller
finding what it thinks is a new error condition, and overwriting the
original message with something that's much less on-point.  This is
comparable to cascading errors from a compiler: anybody who's dealt
with those knows that errors after the first one are often just noise.
So we have to be careful to quit after we log the first error.
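To make the failure mode concrete, here is a hypothetical sketch
against the soft-error API (parse_field() and looks_like_integer() are
invented names, not code from these patches):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/nodes.h"

/*
 * Subroutine: logs a soft error into escontext and returns false.
 * ereturn() only returns from *this* function, so the caller has to
 * check the result and stop, or it will keep plugging along and
 * overwrite the message with a less on-point one.
 */
static bool
parse_field(const char *str, Node *escontext, int *result)
{
	if (!looks_like_integer(str))	/* invented helper */
		ereturn(escontext, false,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid input syntax: \"%s\"", str)));

	*result = atoi(str);
	return true;
}

/* ... and in the caller: quit after the first logged error. */
	if (!parse_field(str, escontext, &val))
		PG_RETURN_NULL();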

Also, I ended up ripping out the changes in line_construct, because
as soon as I tried to test them I tripped over the fact that lseg_sl
was still throwing hard errors, before we ever get to line_construct.
Perhaps it is worth fixing all that but I have to deem it very low
priority, because the two-input-points formulation isn't the mainstream
code path.  (I kind of wonder too if there isn't a better, more
numerically robust conversion method ...)  In any case I'm pretty sure
those changes in float.h would have drawn Andres' ire.  We don't want
to be adding arguments to float_overflow_error/float_underflow_error;
if that were acceptable they'd not have looked like that to begin with.

Anyway, thanks for the work!  That moved us a good ways.

I think I'm going to go fix bpcharin and varcharin, because those
are the last of the SQL-spec-defined types.

regards, tom lane




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-14 Thread Nathan Bossart
On Wed, Dec 14, 2022 at 02:02:58PM -0500, Tom Lane wrote:
> Maybe we could have workers that are exiting for that reason set a
> flag saying "please restart me without delay"?

That helps a bit, but there are still delays when starting workers for new
subscriptions.  I think we'd need to create a new array in shared memory
for subscription OIDs that need their workers started immediately.

I'm not totally sure this is worth the effort.  These delays surface in the
tests because the workers are started so frequently.  In normal operation,
this is probably unusual, so the launcher would typically start new workers
immediately.  But if you and/or others feel this is worthwhile, I don't
mind working on the patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.

2022-12-14 Thread Thomas Munro
On Thu, Nov 24, 2022 at 12:08 AM David Rowley  wrote:
> On Wed, 23 Nov 2022 at 23:13, Alex Fan  wrote:
> > I am new to the postgres community and apologise for resending this as the 
> > previous one didn't include the patch properly and didn't cc reviewers (maybe
> > the reason it has been buried in the mailing list for months)
>
> Welcome to the community!

+1

I don't know enough about LLVM or RISCV to have any strong opinions
here, but I have a couple of questions...  It looks like we have two
different things in this patch:

1.  Optionally use JITLink instead of RuntimeDyld for relocation.
From what I can tell from some quick googling, that is necessary for
RISCV because they haven't got around to doing this yet:

https://reviews.llvm.org/D127842

Independently of that, it seems that
https://llvm.org/docs/JITLink.html is the future and RuntimeDyld will
eventually be obsolete, so one question I have is: why should we do
this only for riscv?

You mentioned that this change might be necessary to support COFF and
thus Windows.  I'm not a Windows user and I think it would be beyond
my pain threshold to try to get this working there by using CI alone,
but I'm just curious... wouldn't
https://github.com/llvm/llvm-project/blob/main/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldCOFF.cpp
work for that already?  (I haven't heard about anyone successfully
using PostgreSQL/LLVM on Windows; it would certainly be cool to hear
some more about what would be needed for that.)

2.  Manually adjust the CPU features and ABI/subtarget.

+#if defined(__riscv)
+/* getHostCPUName returns "generic-rv[32|64]", which lacks all features */
+Features.AddFeature("m", true);
+Features.AddFeature("a", true);
+Features.AddFeature("c", true);
+# if defined(__riscv_float_abi_single)
+Features.AddFeature("f", true);
+# endif
+# if defined(__riscv_float_abi_double)
+Features.AddFeature("d", true);
+# endif
+#endif

I'm trying to understand this, and the ABI name selection logic.
Maybe there are two categories of features here?

The ABI bits, "f" and "d" are not just "which instructions can I
use", but they also affect the ABI (I guess something like: where
floats go in the calling convention), and they have to match the ABI
of the main executable to allow linking to succeed, right?  Probably a
stupid question: wouldn't the subtarget/ABI be the same as the one
that the LLVM library itself was compiled for (which must also match
the postgres executable), and doesn't it know that somewhere?  I guess
I'm confused about why we don't need to deal with this kind of manual
subtarget selection on any other architecture: for PPC it
automatically knows whether to be big endian/little endian, 32 or 64
bit, etc.

Then for "m", "a", "c", I guess these are code generation options -- I
think "c" is compressed instructions for example?  Can we get a
comment to say what they are?  Why do you think that all RISCV chips
have these features?  Perhaps these are features that are part of some
kind of server chip profile (ie features not present in a tiny
microcontroller chip found in a toaster, but expected in any system
that would actually run PostgreSQL) -- in which case can we get a
reference to explain that?

I remembered the specific reason why we have that
LLVMGethostCPUFeatures() call: it's because the list of default
features that would apply otherwise based on CPU "name" alone turned
out to assume that all x86 chips had AVX, but some low end parts
don't, so we have to check for AVX etc presence that way.  But your
patch seems to imply that LLVM is not able to get features reliably
for RISCV -- why not, immaturity or technical reason why it can't?

+assert(ES && "ES must not be null");

We use our own Assert() macro (capital A).




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-14 Thread David Christensen
Hi Bharath,

I can get one sent in tomorrow.

Thanks,

David




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Isaac Morland
On Wed, 14 Dec 2022 at 15:57, Jeff Davis  wrote:

> On Wed, 2022-12-14 at 15:32 -0500, Isaac Morland wrote:
>
> > Is there a firm decision on the issue of changing the cluster index
> > of a table? Re-clustering a table on the same index is clearly
> > something that should be granted by MAINTAIN as I imagine it, but
> > changing the cluster index, strictly speaking, changes the schema and
> > could be considered outside of the scope of what should be allowed.
> > On the other hand, I can see simplicity in having CLUSTER check the
> > same permissions whether or not the cluster index is being updated.
>
> In both the case of CLUSTER and REFRESH, I don't have a strong enough
> opinion to break them out into separate privileges. There's some
> argument that can be made; but at the same time it's hard for me to
> imagine someone really making use of the privileges separately.
>

Thanks, that makes a lot of sense. I wanted to make sure the question was
considered. I'm very pleased this is happening and appreciate all the work
you're doing. I have a few places where I want to be able to grant MAINTAIN
so I'll be using this as soon as it's available on our production database.


Re: Error-safe user functions

2022-12-14 Thread Tom Lane
Andrew Dunstan  writes:
> Thanks, I have been looking at jsonpath, but I'm not quite sure how to
> get the escontext argument to the yyerror calls in jsonpath_scan.l. Maybe
> I need to specify a lex-param setting?

You want a parse-param option in jsonpath_gram.y, I think; adding that
will persuade Bison to change the signatures of relevant functions.
Compare the mods I made in contrib/cube in ccff2d20e.
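To illustrate, a sketch patterned after the cube change (parameter
names and types here are illustrative, not the final ones):

%parse-param {JsonPathParseResult **result}
%parse-param {struct Node *escontext}
%lex-param   {JsonPathParseResult **result}
%lex-param   {struct Node *escontext}

/* Bison then adds the parameters to yyparse()'s signature, and they
 * become available to yyerror() (and, via %lex-param, to yylex()), so
 * the scanner's error calls can reach the caller-supplied escontext. */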

regards, tom lane




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Justin Pryzby
On Thu, Dec 15, 2022 at 01:02:39AM +0300, Pavel Luzanov wrote:
> On 14.12.2022 22:46, Jeff Davis wrote:
> > The behavior is that MAINTAIN
> > privileges on the partitioned table does not imply MAINTAIN privileges
> > on the partitions. I believe that's fine and it's consistent with other
> > privileges on partitioned tables, such as SELECT and INSERT.
> 
> Sorry, I may have missed something, but here's what I see:
> 
> postgres@postgres(16.0)=# create table p (id int) partition by list (id);
> postgres@postgres(16.0)=# create table p1 partition of p for values in (1);
> postgres@postgres(16.0)=# create table p2 partition of p for values in (2);
> 
> postgres@postgres(16.0)=# grant select, insert, maintain on p to alice ;
> 
> postgres@postgres(16.0)=# \c - alice
> You are now connected to database "postgres" as user "alice".
> 
> alice@postgres(16.0)=> insert into p values (1);
> INSERT 0 1
> alice@postgres(16.0)=> select * from p;
>  id
> ----
>   1
> (1 row)
> 
> alice@postgres(16.0)=> vacuum p;
> WARNING:  permission denied to vacuum "p1", skipping it
> WARNING:  permission denied to vacuum "p2", skipping it
> VACUUM

Yeah, but:

regression=> insert into p1 values (1);
ERROR:  permission denied for table p1
regression=> select * from p1;
ERROR:  permission denied for table p1




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Pavel Luzanov

On 14.12.2022 22:46, Jeff Davis wrote:

The behavior is that MAINTAIN
privileges on the partitioned table does not imply MAINTAIN privileges
on the partitions. I believe that's fine and it's consistent with other
privileges on partitioned tables, such as SELECT and INSERT.


Sorry, I may have missed something, but here's what I see:

postgres@postgres(16.0)=# create table p (id int) partition by list (id);
postgres@postgres(16.0)=# create table p1 partition of p for values in (1);
postgres@postgres(16.0)=# create table p2 partition of p for values in (2);

postgres@postgres(16.0)=# grant select, insert, maintain on p to alice ;

postgres@postgres(16.0)=# \c - alice
You are now connected to database "postgres" as user "alice".

alice@postgres(16.0)=> insert into p values (1);
INSERT 0 1
alice@postgres(16.0)=> select * from p;
 id
----
  1
(1 row)

alice@postgres(16.0)=> vacuum p;
WARNING:  permission denied to vacuum "p1", skipping it
WARNING:  permission denied to vacuum "p2", skipping it
VACUUM

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com





Re: Inconsistency in reporting checkpointer stats

2022-12-14 Thread Michael Paquier
On Wed, Dec 14, 2022 at 04:54:53PM +0530, Bharath Rupireddy wrote:
> Indeed PendingCheckpointerStats.buf_written_checkpoints needs to count
> buffer writes in SlruInternalWritePage(). However, does it need to be
> done immediately there? The stats will not be visible to the users
> until the next pgstat_report_checkpointer(). Incrementing
> buf_written_checkpoints in BufferSync() makes sense as the
> pgstat_report_checkpointer() gets called in there via
> CheckpointWriteDelay() and it becomes visible to the user immediately.
> Have you checked if pgstat_report_checkpointer() gets called while the
> checkpoint calls SlruInternalWritePage()? If not, then you can just
> assign ckpt_bufs_written to buf_written_checkpoints in
> LogCheckpointEnd() like its other friends
> checkpoint_write_time and checkpoint_sync_time.

    /* If part of a checkpoint, count this as a buffer written. */
    if (fdata)
        CheckpointStats.ckpt_bufs_written++;
+   PendingCheckpointerStats.buf_written_checkpoints++;
Also, the proposed patch would touch PendingCheckpointerStats even
when there is no fdata, aka outside the context of a checkpoint or
shutdown sequence.
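One way to keep the counter scoped to checkpoints (a sketch, leaving
aside the separate question of when the stats become visible):

    /* If part of a checkpoint, count this as a buffer written. */
    if (fdata)
    {
        CheckpointStats.ckpt_bufs_written++;
        PendingCheckpointerStats.buf_written_checkpoints++;
    }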
--
Michael


signature.asc
Description: PGP signature


Re: Error-safe user functions

2022-12-14 Thread Andrew Dunstan


On 2022-12-14 We 11:00, Tom Lane wrote:
> Amul Sul  writes:
>> Attaching a complete set of the patches changing function till this
>> except bpcharin, byteain jsonpath_in that Andrew is planning to look
>> in. I have skipped reg* functions.
> I'll take a look at these shortly, unless Andrew is already on it.
>
>   


Thanks, I have been looking at jsonpath, but I'm not quite sure how to
get the escontext argument to the yyerror calls in jsonpath_scan.l. Maybe
I need to specify a lex-param setting?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Temporary tables versus wraparound... again

2022-12-14 Thread Greg Stark
> You do have to lock a table in order to update its pg_class row,
> though, whether the table is temporary or not. Otherwise, another
> session could drop it while you're doing something with it, after
> which bad things would happen.

I was responding to this from Andres:

> Is that actually true? Don't we skip some locking operations for temporary
> tables, which then also means catalog modifications cannot safely be done in
> other sessions?

I don't actually see this in the code but in any case we're not doing
any catalog modifications here. We're just inspecting values of
relfrozenxid and relminmxid in the struct returned from
SearchSysCache, which I think is no different for temp tables than any
other table.




Re: Temporary tables versus wraparound... again

2022-12-14 Thread Robert Haas
On Wed, Dec 14, 2022 at 1:18 PM Greg Stark  wrote:
> So I don't see any evidence we skip any locking on pg_class when doing
> updates on rows for temporary tables.

I don't know what this means. You don't have to lock pg_class to
update rows in any table, whether temporary or otherwise.

You do have to lock a table in order to update its pg_class row,
though, whether the table is temporary or not. Otherwise, another
session could drop it while you're doing something with it, after
which bad things would happen.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Jeff Davis
On Wed, 2022-12-14 at 15:32 -0500, Isaac Morland wrote:

> Is there a firm decision on the issue of changing the cluster index
> of a table? Re-clustering a table on the same index is clearly
> something that should be granted by MAINTAIN as I imagine it, but
> changing the cluster index, strictly speaking, changes the schema and
> could be considered outside of the scope of what should be allowed.
> On the other hand, I can see simplicity in having CLUSTER check the
> same permissions whether or not the cluster index is being updated.

In both the case of CLUSTER and REFRESH, I don't have a strong enough
opinion to break them out into separate privileges. There's some
argument that can be made; but at the same time it's hard for me to
imagine someone really making use of the privileges separately.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Isaac Morland
On Wed, 14 Dec 2022 at 14:47, Jeff Davis  wrote:

Furthermore, MAINTAIN privileges on the partitioned table do not grant
> the ability to create new partitions. There's a comment in tablecmds.c
> alluding to a possible "UNDER" privilege:
>
>   /*
>* We should have an UNDER permission flag for this, but for now,
>* demand that creator of a child table own the parent.
>*/
>
> Perhaps there's something we want to do there, but it's a different use
> case than the MAINTAIN privilege, so I don't see a reason it should be
> grouped. Also, there's a bit of weirdness to think about in cases where
> another user creates (and owns) a partition of your table (currently
> this is only possible if the other user is a superuser).
>

I strongly agree. MAINTAIN is for actions that leave the schema the same.
Conceptually, running MAINTAIN shouldn't affect the result of pg_dump. That
may not be strictly true, but adding a table is definitely not something
that MAINTAIN should allow.

Is there a firm decision on the issue of changing the cluster index of a
table? Re-clustering a table on the same index is clearly something that
should be granted by MAINTAIN as I imagine it, but changing the cluster
index, strictly speaking, changes the schema and could be considered
outside of the scope of what should be allowed. On the other hand, I can
see simplicity in having CLUSTER check the same permissions whether or not
the cluster index is being updated.


Re: Refactor SCRAM code to dynamically handle hash type and key length

2022-12-14 Thread Michael Paquier
On Wed, Dec 14, 2022 at 02:39:43PM +0100, Peter Eisentraut wrote:
> On 14.12.22 03:38, Michael Paquier wrote:
>> While investigating on what it would take to extend SCRAM to use new
>> hash methods (say like the RFC draft for SCRAM-SHA-512), I have been
>> quickly reminded of the limitations created by SCRAM_KEY_LEN, which is
>> the key length that we use in the HMAC and hash computations when
>> creating a SCRAM verifier or when doing a SASL exchange.
> 
> then the obvious fix there is to change the definition of SCRAM_KEY_LEN to
> PG_SHA512_DIGEST_LENGTH, which would be a much smaller and simpler change.
> We don't have to support arbitrary key sizes, so a fixed-size array seems
> appropriate.

Yeah, I was considering doing that as well for the static arrays, with
something like a Max() to combine them, but perhaps that's not necessary for
the digest lengths anyway.  Perhaps I just over-engineered the
approach.

However, that's only half of the picture.  The key length and the hash
type (or just the hash type, to know which digest/key length to
use, but that's more invasive) still need to be passed through the
internal routines of SCRAM and attached to the state data of the
frontend and the backend or we won't be able to do the hash and HMAC
computations dependent on that.
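As a rough illustration of that second half (type and field names here
are illustrative, not taken from the patch; pg_cryptohash_type and
PG_SHA512_DIGEST_LENGTH come from common/cryptohash.h and common/sha2.h):

#define SCRAM_MAX_KEY_LEN	PG_SHA512_DIGEST_LENGTH

typedef struct
{
	pg_cryptohash_type hash_type;	/* e.g. PG_SHA256 or PG_SHA512 */
	int			key_length;		/* digest length for hash_type */
	uint8		StoredKey[SCRAM_MAX_KEY_LEN];	/* fixed max-size buffers */
	uint8		ServerKey[SCRAM_MAX_KEY_LEN];
} scram_state_sketch;

The HMAC and hash helpers would then take state->hash_type and
state->key_length instead of the compile-time SCRAM_KEY_LEN.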
--
Michael


signature.asc
Description: PGP signature


Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Jeff Davis
On Wed, 2022-12-14 at 12:07 +0300, Pavel Luzanov wrote:
> After a fresh install, including the patch for \dpS [1],
> I found that granting MAINTAIN privilege does not allow the TOAST
> table 
> to be vacuumed.

I wanted to also mention partitioning. The behavior is that MAINTAIN
privileges on the partitioned table does not imply MAINTAIN privileges
on the partitions. I believe that's fine and it's consistent with other
privileges on partitioned tables, such as SELECT and INSERT. In the
case of an admin maintaining users' tables, they'd be a member of
pg_maintain anyway.

Furthermore, MAINTAIN privileges on the partitioned table do not grant
the ability to create new partitions. There's a comment in tablecmds.c
alluding to a possible "UNDER" privilege:

  /*  
   * We should have an UNDER permission flag for this, but for now,   
   * demand that creator of a child table own the parent. 
   */

Perhaps there's something we want to do there, but it's a different use
case than the MAINTAIN privilege, so I don't see a reason it should be
grouped. Also, there's a bit of weirdness to think about in cases where
another user creates (and owns) a partition of your table (currently
this is only possible if the other user is a superuser).

I am not suggesting a change here, just posting in case someone has a
different opinion.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Jeff Davis
On Wed, 2022-12-14 at 10:16 -0800, Nathan Bossart wrote:
> Okay.  Should all the privileges governed by MAINTAIN apply to a
> relation's
> TOAST table as well?

Yes, I agree.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-14 Thread Tom Lane
Nathan Bossart  writes:
> On Wed, Dec 14, 2022 at 01:23:18PM -0500, Tom Lane wrote:
>> Oh.  What in the world is the rationale for that?

> My assumption is that this is meant to avoid starting workers as fast as
> possible if they repeatedly crash.

I can see the point of rate-limiting if the workers are failing to connect
or crashing while trying to process data.  But it's not very sane to
apply the same policy to an intentional worker exit-for-reconfiguration.

Maybe we could have workers that are exiting for that reason set a
flag saying "please restart me without delay"?

A *real* fix would be to not exit at all, at least for reconfigurations
that don't change the connection parameters, but instead cope with
recomputing whatever needs recomputed in the workers' state.  I can
believe that that'd be a lot of work though.

regards, tom lane




Re: Raising the SCRAM iteration count

2022-12-14 Thread Jonathan S. Katz

On 12/14/22 6:25 AM, Daniel Gustafsson wrote:

On 14 Dec 2022, at 02:00, Michael Paquier  wrote:

On Tue, Dec 13, 2022 at 12:17:58PM +0100, Daniel Gustafsson wrote:

It does raise an interesting point though: if we in the future add support for
SCRAM-SHA-512 (which seems reasonable to do) it's not good enough to have a
single GUC for SCRAM iterations; we'd need to be able to set the iteration
count per algorithm. I'll account for that when updating the patch downthread.


So, you mean that the GUC should be named like password_iterations,
taking a grammar with a list like 'scram-sha-256=4096,algo2=5000'?


I was thinking about it but opted for the simpler approach of a GUC name with
the algorithm baked into it: scram_sha256_iterations.  It doesn't seem all that
likely that we'll have more than two versions of SCRAM (sha256/sha512) so
the additional complexity doesn't seem worth it.


I would not rule this out. There is an RFC draft for SCRAM-SHA3-512[1].

I do have mixed feelings on the 'x1=y1,x2=y2' style GUC, but we do have 
machinery to handle it and it gives a bit more flexibility over how many 
SCRAM hash methods get added. I'd like to hear more feedback.


(I don't know if there will be a world where we ever let users BYO-hash, 
but that case may force separate GUCs anyway).


[1] https://datatracker.ietf.org/doc/draft-melnikov-scram-sha3-512/


The attached v2 has the GUC rename and a change to GUC_REPORT such that the
frontend can use the real value rather than the default.  I kept it for super
users so far, do you think it should be a user setting being somewhat sensitive?


No, because a user can set the number of iterations today if they build 
their own SCRAM secret. I think it's OK if they change it in a session.


If a superuser wants to enforce a minimum iteration count, they can 
write a password_check_hook. (Or we could add another GUC to enforce that).



The default in this version is rolled back to 4096 as there was pushback
against raising it, and the lower limit is one in order to potentially assist
situations like the one Andres mentioned where md5 is used.


Reviewing patch as is.

Suggestion on text:

==snip==
The number of computational iterations to perform when generating
a SCRAM-SHA-256 secret. The default is 4096. A
higher number of iterations provides additional protection against
brute-force attacks on stored passwords, but makes authentication
slower. Changing the value has no effect on previously created
SCRAM-SHA-256 secrets, as the iteration count is fixed at the time the
secret is created. A password must be re-hashed to use an updated
iteration value.
==snip==

 /*
- * Default number of iterations when generating secret.  Should be at least
- * 4096 per RFC 7677.
+ * Default number of iterations when generating secret.
  */

I don't think we should remove the RFC 7677 reference entirely. Perhaps:

/*
 * Default number of iterations when generating secret. RFC 7677
 * recommends 4096 for SCRAM-SHA-256, which we set as the default,
 * but we allow users to select their own values.
 */


-pg_fe_scram_build_secret(const char *password, const char **errstr)
+pg_fe_scram_build_secret(const char *password, int iterations, const char **errstr)


I have mild worry about changing this function definition for downstream 
usage, esp. for drivers. Perhaps it's not that big of a deal, and 
perhaps this will end up being needed for the work we've discussed 
around "\password" but I do want to note that this could be a breaking 
change.



+   else if (strcmp(name, "scram_sha256_iterations") == 0)
+   {
+   conn->scram_iterations = atoi(value);
+   }

Maybe out of scope for this patch based on what else is in the patch, 
but I was wondering why we don't use a "strncmp" here?


Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: collect_corrupt_items_vacuum.patch

2022-12-14 Thread Nikita Malakhov
Hi hackers!

Just to bump this thread, because the problem still seems to be present:

Please correct me if I am wrong. I've checked another discussion related to
pg_visibility [1].
According to that discussion: if using the latest completed xid is not
right for checking visibility, then it should be the oldest running
transaction xid? So a different function should be used for these
calculations, not GetOldestNonRemovableTransactionId, which uses
ComputeXidHorizons.

[1]
https://www.postgresql.org/message-id/flat/c0610352-8433-ab4b-986d-0e803c628efe%40postgrespro.ru

On Wed, Oct 12, 2022 at 8:15 AM Michael Paquier  wrote:
>
>> On Wed, Jul 27, 2022 at 09:47:19PM -0400, Robert Haas wrote:
>> > On Wed, Jul 27, 2022 at 5:56 PM Tom Lane  wrote:
>> > > Maybe we need a different function for pg_visibility to call?
>> > > If we want ComputeXidHorizons to serve both these purposes, then it
>> > > has to always deliver exactly the right answer, which seems like
>> > > a definition that will be hard and expensive to achieve.
>> >
>> > Yeah, I was thinking along similar lines.
>> >
>> > I'm also kind of wondering why these calculations use
>> > latestCompletedXid. Is that something we do solely to reduce locking?
>> > The XIDs of running transactions matter, and their snapshots matter,
>> > and the XIDs that could start running in the future matter, but I
>> > don't know why it matters what the latest completed XID is.
>>
>> Daniel, it seems to me that this thread is waiting for some input from
>> you, based on the remarks of Tom and Robert.  Are you planning to do
>> so?  This is marked as a bug fix, so I have moved this item to the
>> next CF for now.
>> --
>> Michael
>>
>
-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-14 Thread Nathan Bossart
On Wed, Dec 14, 2022 at 01:23:18PM -0500, Tom Lane wrote:
> Nathan Bossart  writes:
>> I'm reasonably certain the launcher is already signaled like you describe.
>> It'll just wait to start new workers if it's been less than
>> wal_retrieve_retry_interval milliseconds since the last time it started
>> workers.
> 
> Oh.  What in the world is the rationale for that?

My assumption is that this is meant to avoid starting workers as fast as
possible if they repeatedly crash.  I didn't see much discussion in the
original logical replication thread [0], but I do see follow-up discussion
about creating a separate GUC for this [1] [2].

[0] https://postgr.es/m/b8132323-b577-428c-b2aa-bf41a66b18e7%402ndquadrant.com
[1] 
https://postgr.es/m/CAD21AoAjTTGm%2BOx70b2OGWvb77vPcRdYeRv3gkAWx76nXDo%2BEA%40mail.gmail.com
[2] 
https://postgr.es/m/CAD21AoDCnyRJDUY%3DESVVe68AukvOP2dFomTeBFpAd1TiFbjsGg%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-14 Thread Tom Lane
Nathan Bossart  writes:
> I'm reasonably certain the launcher is already signaled like you describe.
> It'll just wait to start new workers if it's been less than
> wal_retrieve_retry_interval milliseconds since the last time it started
> workers.

Oh.  What in the world is the rationale for that?

regards, tom lane




Re: Temporary tables versus wraparound... again

2022-12-14 Thread Greg Stark
On Sat, 5 Nov 2022 at 15:34, Tom Lane  wrote:
>
> Greg Stark  writes:
> > Simple Rebase
>
> I took a little bit of a look through these.
>
> * I find 0001 a bit scary, specifically that it's decided it's
> okay to apply extract_autovac_opts, pgstat_fetch_stat_tabentry_ext,
> and especially relation_needs_vacanalyze to another session's
> temp table.  How safe is that really?

So I don't see any evidence we skip any locking on pg_class when doing
updates on rows for temporary tables. It's a bit hard to tell because
there are several ways of checking if a table is temporary. Most
places just check relpersistence but there is also an accessing macro
RelationIsPermanent() as well as a relcache field rd_islocaltemp which
could be used. I'm still looking, but so far nearly all the checks I've
found just throw errors for temporary tables and none relate to any
operations on pg_class entries.

In any case we're already using the pg_class struct to look at
relpersistence itself. So... the danger to check for is something
we would already be at risk of. Namely that the pg_class row is
updated without any locking and then vacuumed away while we hold this
struct pointer and we're looking at fields that have since been
overwritten with other data from an unrelated row. But that would
require all kinds of rules to be broken and would be posing a risk for
anyone just running select * from pg_class. So I find it hard to
believe we would be doing this.

extract_autovac_opts looks at a variable sized field so concurrent
updates would be an issue, but obviously there are only mvcc updates
to this field so I don't see how it could be a problem.

pgstat_fetch_stat_tabentry I don't even see what the possible risks
would be. The words persistence and temporary don't appear in pgstat.c
(except "temporary statistics" in some log messages).

And then there's relation_needs_vacanalyze() and it looks at
relfrozenxid and relminmxid (and relname in some debug messages).
Those fields could get updated by a concurrent vacuum or -- after this
patch -- a truncate in an inplace_update. That seems to be the only
real risk here.

But this is not related to temporary tables at all. Any pg_class entry
can get in_place_update'd by plain old vacuum to update the
relfrozenxid and relminmxid. The in_place_update would take an
exclusive lock on the buffer but I think that doesn't actually protect
us since autovacuum would only have a pin? Or does the SysCache
protect us by copying out the whole row while it's locked? This is
worth answering but it's not an issue related to this patch or
temporary tables. Is autovacuum looking at relfrozenxid and relminmxid
in a way that's safely protected against a concurrent manual vacuum
issuing an in_place_update?

-- 
greg




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Nathan Bossart
On Wed, Dec 14, 2022 at 07:05:34PM +0100, Alvaro Herrera wrote:
> On 2022-Dec-14, Nathan Bossart wrote:
>> On Wed, Dec 14, 2022 at 12:07:13PM +0300, Pavel Luzanov wrote:
>> > I found that granting MAINTAIN privilege does not allow the TOAST table to
>> > be vacuumed.
>> 
>> Hm.  My first thought is that this is the appropriate behavior.  WDYT?
> 
> It seems wrong to me.  If you can vacuum a table, surely you can also
> vacuum its toast table.  If you can vacuum all normal tables, you should
> be able to vacuum all toast tables too.

Okay.  Should all the privileges governed by MAINTAIN apply to a relation's
TOAST table as well?  I would think so, but I don't want to assume too much
before writing the patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Alvaro Herrera
On 2022-Dec-14, Nathan Bossart wrote:

> On Wed, Dec 14, 2022 at 12:07:13PM +0300, Pavel Luzanov wrote:
> > I found that granting MAINTAIN privilege does not allow the TOAST table to
> > be vacuumed.
> 
> Hm.  My first thought is that this is the appropriate behavior.  WDYT?

It seems wrong to me.  If you can vacuum a table, surely you can also
vacuum its toast table.  If you can vacuum all normal tables, you should
be able to vacuum all toast tables too.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)




Re: Minimal logical decoding on standbys

2022-12-14 Thread Robert Haas
On Wed, Dec 14, 2022 at 12:35 PM Andres Freund  wrote:
> >  typedef struct xl_heap_prune
> >
> > I think this is unsafe on alignment-picky machines. I think it will
> > cause the offset numbers to be aligned at an odd address.
> > heap_xlog_prune() doesn't copy the data into aligned memory, so I
> > think this will result in a misaligned pointer being passed down to
> > heap_page_prune_execute.
>
> I think the offset numbers are stored separately from the record, even
> though it doesn't quite look like that in the above due to the way the
> 'OFFSET NUMBERS' is embedded in the struct. As they're stored with the
> block reference 0, the added boolean shouldn't make a difference
> alignment wise?
>
> Or am I misunderstanding your point?

Oh, you're right. So this is another case similar to
xl_btree_reuse_page. In heap_xlog_prune(), we access the offset number
data like this:

redirected = (OffsetNumber *)
    XLogRecGetBlockData(record, 0, &datalen);
end = (OffsetNumber *) ((char *) redirected + datalen);
nowdead = redirected + (nredirected * 2);
nowunused = nowdead + ndead;
nunused = (end - nowunused);
heap_page_prune_execute(buffer,
                        redirected, nredirected,
                        nowdead, ndead,
                        nowunused, nunused);

This is only safe if the return value of XLogRecGetBlockData is
guaranteed to be properly aligned, and I think that there is no such
guarantee in general. I think that it happens to come out properly
aligned because both the main body of the record (xl_heap_prune) and
the length of a block header (XLogRecordBlockHeader) happen to be
sufficiently aligned. But we just recently had discussion about trying
to make WAL records smaller by various means, and some of those
proposals involved changing the length of XLogRecordBlockHeader. And
the patch proposed here increases SizeOfHeapPrune by 1. So I think
with the patch, the offset number array would become misaligned.

No?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-14 Thread Nathan Bossart
On Wed, Dec 14, 2022 at 12:42:32PM -0500, Tom Lane wrote:
> Nathan Bossart  writes:
>> My first thought is that the latter two uses should be moved to a new
>> parameter, and the apply launcher should store the last start time for each
>> apply worker like the apply workers do for the table-sync workers.  In any
>> case, it probably makes sense to lower this parameter's value for testing
>> so that tests that restart these workers frequently aren't waiting for so
>> long.
> 
>> I can put a patch together if this seems like a reasonable direction to go.
> 
> No, I'm still of the opinion that waiting for the launcher to timeout
> before doing something is fundamentally wrong design.  We should signal
> it when we want it to do something.  That's not different from what
> you're fixing about the workers; why don't you see that it's appropriate
> for the launcher too?

I'm reasonably certain the launcher is already signaled like you describe.
It'll just wait to start new workers if it's been less than
wal_retrieve_retry_interval milliseconds since the last time it started
workers.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-14 Thread Tom Lane
Nathan Bossart  writes:
> My first thought is that the latter two uses should be moved to a new
> parameter, and the apply launcher should store the last start time for each
> apply worker like the apply workers do for the table-sync workers.  In any
> case, it probably makes sense to lower this parameter's value for testing
> so that tests that restart these workers frequently aren't waiting for so
> long.

> I can put a patch together if this seems like a reasonable direction to go.

No, I'm still of the opinion that waiting for the launcher to timeout
before doing something is fundamentally wrong design.  We should signal
it when we want it to do something.  That's not different from what
you're fixing about the workers; why don't you see that it's appropriate
for the launcher too?

regards, tom lane




Re: Common function for percent placeholder replacement

2022-12-14 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Dec 14, 2022 at 08:31:02AM +0100, Peter Eisentraut wrote:
>> +return replace_percent_placeholders(base_command, "df", (const char 
>> *[]){target_detail, filename});

> This is a "compound literal", which I gather is required by C99.

> But I don't think that's currently being exercised, so I wonder if it's
> going to break some BF members.

It's pretty illegible, whatever it is.  Could we maybe expend a
few more keystrokes in favor of readability?
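For comparison, the two spellings side by side (a sketch reusing the
names from the patch context):

/* C99 compound literal: an unnamed array built at the call site */
return replace_percent_placeholders(base_command, "df",
                                    (const char *[]) {target_detail, filename});

/* a few more keystrokes, arguably easier to read */
const char *values[] = {target_detail, filename};

return replace_percent_placeholders(base_command, "df", values);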

regards, tom lane




Re: Minimal logical decoding on standbys

2022-12-14 Thread Andres Freund
Hi,

On 2022-12-14 10:55:31 -0500, Robert Haas wrote:
> I read through 0001 again and I noticed this:
> 
>  typedef struct xl_heap_prune
>  {
>      TransactionId snapshotConflictHorizon;
>      uint16      nredirected;
>      uint16      ndead;
> +    bool        onCatalogAccessibleInLogicalDecoding;
>      /* OFFSET NUMBERS are in the block reference 0 */
>  } xl_heap_prune;
> 
> I think this is unsafe on alignment-picky machines. I think it will
> cause the offset numbers to be aligned at an odd address.
> heap_xlog_prune() doesn't copy the data into aligned memory, so I
> think this will result in a misaligned pointer being passed down to
> heap_page_prune_execute.

I think the offset numbers are stored separately from the record, even
though it doesn't quite look like that in the above due to the way the
'OFFSET NUMBERS' is embedded in the struct. As they're stored with the
block reference 0, the added boolean shouldn't make a difference
alignment wise?

Or am I misunderstanding your point?

Greetings,

Andres Freund




Re: fix and document CLUSTER privileges

2022-12-14 Thread Nathan Bossart
On Thu, Dec 08, 2022 at 04:08:40PM -0500, Robert Haas wrote:
> On Thu, Dec 8, 2022 at 1:13 PM Nathan Bossart  
> wrote:
>> Currently, CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX (minus REINDEX
>> SCHEMA|DATABASE|SYSTEM) require ownership of the relation or superuser.  In
>> fact, all three use the same RangeVarCallbackOwnsTable() callback function.
>> My current thinking is that this is good enough.  I don't sense any strong
>> demand for allowing database owners to run these commands on all non-shared
>> relations, and there's ongoing work to break out the privileges to GRANT
>> and predefined roles.
> 
> +1.
> 
> I don't see why being the database owner should give you the right to
> run a random subset of commands on any table in the database. Tables
> have their own system for access privileges; we should use that, or
> extend it as required.

Here is a rebased version of the patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 145101e6a5..d6b2651657 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -67,7 +67,8 @@ CLUSTER [VERBOSE]
   
 
   
-   CLUSTER without any parameter reclusters all the
+   CLUSTER without a
+   table_name reclusters all the
previously-clustered tables in the current database that the calling user
owns or has the MAINTAIN privilege for, or all such tables
if called by a superuser or a role with privileges of the
@@ -134,6 +135,16 @@ CLUSTER [VERBOSE]
  
   Notes
 
+   
+To cluster a table, one must have the MAINTAIN privilege
+on the table or be the table's owner, a superuser, or a role with
+privileges of the
+pg_maintain
+role.  Database-wide clusters and clusters on partitioned tables will
+silently skip over any tables that the calling user does not have
+permission to cluster.
+   
+

 In cases where you are accessing single rows randomly
 within a table, the actual order of the data in the
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 8966b75bd1..8140a90699 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1697,9 +1697,7 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 
 		/* Silently skip partitions which the user has no access to. */
 		if (!object_ownercheck(RelationRelationId, relid, GetUserId()) &&
-			pg_class_aclcheck(relid, GetUserId(), ACL_MAINTAIN) != ACLCHECK_OK &&
-			(!object_ownercheck(DatabaseRelationId, MyDatabaseId, GetUserId()) ||
-			 IsSharedRelation(relid)))
+			pg_class_aclcheck(relid, GetUserId(), ACL_MAINTAIN) != ACLCHECK_OK)
 			continue;
 
 		/* Use a permanent memory context for the result list */


Re: Amcheck verification of GiST and GIN

2022-12-14 Thread Robert Haas
On Wed, Dec 14, 2022 at 7:19 AM Jose Arthur Benetasso Villanova
 wrote:
> I'm a bit lost here. I tried your patch again and indeed the
> heapallindexed inside gin_check_parent_keys_consistency has a TODO
> comment, but it's unclear to me if you are going to implement it or if the
> patch "needs review". Right now it's "Waiting on Author".

FWIW, I don't think there's a hard requirement that every index AM
needs to support the same set of amcheck options. Where it makes sense
and can be done in a reasonably straightforward manner, we should. But
sometimes that may not be the case, and that seems fine, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Rework confusing permissions for LOCK TABLE

2022-12-14 Thread Nathan Bossart
On Tue, Dec 13, 2022 at 06:59:48PM -0800, Jeff Davis wrote:
> I can't think of any reason for this behavior, and I didn't find an
> obvious answer in the last commits to touch that (2ad36c4e44,
> fa2642438f).

I can't think of a reason, either.

> Patch attached to simplify it. It uses the philosophy that, if you have
> permissions to lock at a given mode, you should be able to lock at
> strictly less-conflicting modes as well.

+1.  Your patch looks reasonable to me.
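To spell out the inconsistency being fixed (hypothetical table and role
names; behavior as I read the pre-patch checks in lockcmds.c):

GRANT UPDATE ON accounts TO app_writer;        -- note: no SELECT granted
SET ROLE app_writer;
BEGIN;
LOCK TABLE accounts IN ACCESS EXCLUSIVE MODE;  -- allowed today
LOCK TABLE accounts IN ACCESS SHARE MODE;      -- denied today (wants SELECT);
                                               -- allowed under the patch as a
                                               -- strictly less-conflicting mode
COMMIT;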

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Nathan Bossart
On Wed, Dec 14, 2022 at 12:07:13PM +0300, Pavel Luzanov wrote:
> I found that granting MAINTAIN privilege does not allow the TOAST table to
> be vacuumed.

Hm.  My first thought is that this is the appropriate behavior.  WDYT?

> So, the patch for \dpS works as expected and can be committed.

Thanks for reviewing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-14 Thread Nathan Bossart
On Tue, Dec 13, 2022 at 04:41:05PM -0800, Nathan Bossart wrote:
> On Tue, Dec 13, 2022 at 07:20:14PM -0500, Tom Lane wrote:
>> I certainly don't think that "wake the apply launcher every 1ms"
>> is a sane configuration.  Unless I'm missing something basic about
>> its responsibilities, it should seldom need to wake at all in
>> normal operation.
> 
> This parameter appears to control how often the apply launcher starts new
> workers.  If it starts new workers in a loop iteration, it updates its
> last_start_time variable, and it won't start any more workers until another
> wal_retrieve_retry_interval has elapsed.  If no new workers need to be
> started, it only wakes up every 3 minutes.

Looking closer, I see that wal_retrieve_retry_interval is used for three
purposes.  Its main purpose seems to be preventing busy-waiting in
WaitForWALToBecomeAvailable(), as that's what's documented.  But it's also
used for logical replication.  The apply launcher uses it as I've described
above, and the apply workers use it when launching sync workers.  Unlike
the apply launcher, the apply workers store the last start time for each
table's sync worker and use that to determine whether to start a new one.

My first thought is that the latter two uses should be moved to a new
parameter, and the apply launcher should store the last start time for each
apply worker like the apply workers do for the table-sync workers.  In any
case, it probably makes sense to lower this parameter's value for testing
so that tests that restart these workers frequently aren't waiting for so
long.

I can put a patch together if this seems like a reasonable direction to go.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Common function for percent placeholder replacement

2022-12-14 Thread Justin Pryzby
On Wed, Dec 14, 2022 at 08:31:02AM +0100, Peter Eisentraut wrote:
> + return replace_percent_placeholders(base_command, "df", (const char 
> *[]){target_detail, filename});

This is a "compound literal", which I gather is required by C99.

But I don't think that's currently being exercised, so I wonder if it's
going to break some BF members.

-- 
Justin




Re: Common function for percent placeholder replacement

2022-12-14 Thread Robert Haas
On Wed, Dec 14, 2022 at 2:31 AM Peter Eisentraut
 wrote:
> There are a number of places where a shell command is constructed with
> percent-placeholders (like %x).  First, it's obviously cumbersome to
> have to open-code this several times.  Second, each of those pieces of
> code silently encodes some edge case behavior, such as what to do with
> unrecognized placeholders.  (I remember when I last did one of these, I
> stared very hard at the existing code instances to figure out what they
> would do.)  We now also have a newer instance in basebackup_to_shell.c
> that has different behavior in such cases.  (Maybe it's better, but it
> would be good to be explicit and consistent about this.)

Well, OK, I'll tentatively cast a vote in favor of adopting
basebackup_to_shell's approach elsewhere. Or to put that in plain
English: I think that if the input appears to be malformed, it's
better to throw an error than to guess what the user meant. In the
case of basebackup_to_shell there are potentially security
ramifications to the setting involved so it seemed like a bad idea to
take a laissez faire approach. But also, just in general, if somebody
supplies an ssl_passphrase_command or archive_command with % followed
by something we don't recognize, I don't really see why we should treat
that differently than trying to start the server with work_mem=banana.
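To make the strict behavior concrete, here is a self-contained sketch
(not the common function from Peter's patch) of percent-placeholder
replacement that errors out on anything it does not recognize:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Replace %x placeholders in "cmd".  "allowed" lists the recognized
 * letters and values[i] substitutes for allowed[i]; "%%" yields a
 * literal '%'.  Any other use of '%' is an error, in the spirit of
 * basebackup_to_shell.  Returns a malloc'd string, or NULL on error.
 */
static char *
replace_placeholders(const char *cmd, const char *allowed,
					 const char *const *values)
{
	size_t		cap = strlen(cmd) + 1;
	size_t		len = 0;
	char	   *out = malloc(cap);

	for (const char *p = cmd; *p; p++)
	{
		char		one[2] = {*p, '\0'};
		const char *ins = one;

		if (*p == '%')
		{
			const char *hit;

			p++;
			if (*p == '%')
				one[0] = '%';
			else if (*p != '\0' && (hit = strchr(allowed, *p)) != NULL)
				ins = values[hit - allowed];
			else
			{
				free(out);		/* unrecognized placeholder */
				return NULL;
			}
		}

		while (len + strlen(ins) + 1 > cap)
			out = realloc(out, cap *= 2);
		memcpy(out + len, ins, strlen(ins));
		len += strlen(ins);
	}
	out[len] = '\0';
	return out;
}

int
main(void)
{
	const char *values[] = {"mytarget", "00000001000000000000002A"};
	char	   *cmd = replace_placeholders("cp %f /backup/%d/%f", "df", values);

	printf("%s\n", cmd ? cmd : "error: unrecognized placeholder");
	free(cmd);
	return 0;
}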

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Error-safe user functions

2022-12-14 Thread Tom Lane
Amul Sul  writes:
> Attaching a complete set of the patches changing function till this
> except bpcharin, byteain jsonpath_in that Andrew is planning to look
> in. I have skipped reg* functions.

I'll take a look at these shortly, unless Andrew is already on it.

regards, tom lane




Re: Minimal logical decoding on standbys

2022-12-14 Thread Robert Haas
 On Wed, Dec 14, 2022 at 8:06 AM Drouvot, Bertrand
 wrote:
> Please find attached v31 with the changes mentioned above (except that I put 
> your wording into the commit message instead of a README: I think it helps to 
> make
> clear what the "design" for the patch series is).

Thanks, I think that's good clarification.

I read through 0001 again and I noticed this:

 typedef struct xl_heap_prune
 {
     TransactionId snapshotConflictHorizon;
     uint16      nredirected;
     uint16      ndead;
+    bool        onCatalogAccessibleInLogicalDecoding;
     /* OFFSET NUMBERS are in the block reference 0 */
 } xl_heap_prune;

I think this is unsafe on alignment-picky machines. I think it will
cause the offset numbers to be aligned at an odd address.
heap_xlog_prune() doesn't copy the data into aligned memory, so I
think this will result in a misaligned pointer being passed down to
heap_page_prune_execute.

I wonder what the best fix is here. We could (1) have
heap_page_prune_execute copy the data into a newly-palloc'd chunk,
which seems kind of sad on performance grounds, or we could (2) just
make the field here two bytes, or add a second byte as padding, but
that bloats the WAL slightly, or we could (3) try to steal a bit from
nredirected or ndead, if we think that we don't need all the bits. It
seems like the maximum block size is 32kB right now, which means
MaxOffsetNumber can't, I think, be more than 16kB. So maybe we could
think of replacing nredirected and ndead with uint32 flags and then
have accessor macros.

But it looks like we also have a bunch of similar issues elsewhere.
gistxlogDelete looks like it has the same problem. gistxlogPageReuse
is OK because there's no data afterwards. xl_hash_vacuum_one_page
looks like it has the same problem. So does xl_heap_prune.
xl_heap_freeze_page also has the issue: heap_xlog_freeze_page does
memcpy the plans, but not the offsets, and even for the plans, I think
for correctness we would need to treat the "plans" pointer as a void *
or char * because the pointer might be unaligned and the compiler, not
knowing that, could do bad things.

xl_btree_reuse_page is OK because no data follows the main record.
xl_btree_delete appears to have this problem if you just look at the
comments, because it says that offset numbers follow, and thus are
probably presumed aligned. However, they don't really follow, because
commit d2e5e20e57111cca9e14f6e5a99a186d4c66a5b7 moved the data from
the main data to the registered buffer data. However, AIUI, you're not
really supposed to assume that the registered buffer data is aligned.
I think this just happens to work because the length of the main
record is a multiple of the relevant small integer, and the size of
the block header is 4, so the buffer data ends up being accidentally
aligned. That might be worth fixing somehow independently of this
issue.

spgxlogVacuumRedirect is OK because the offsets array is part of the
struct, using FLEXIBLE_ARRAY_MEMBER, which will cause the offsets
field to be aligned properly. It means inserting a padding byte, but
it's not broken. If we don't mind adding padding bytes in some of the
other cases, we could potentially make use of this technique
elsewhere, I think.
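
For illustration, that layout looks like the following sketch (field names
other than the flexible array are made up, not an actual WAL record):

typedef struct xl_example_vacuum
{
	TransactionId snapshotConflictHorizon;
	uint16		noffsets;
	bool		isCatalogRel;	/* compiler pads one byte after this */
	OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} xl_example_vacuum;

Because the offsets array is a genuine struct member rather than data
tacked on after the struct, the compiler keeps it properly aligned, at
the cost of the one padding byte mentioned above.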

Other comments:

+if RelationIsAccessibleInLogicalDecoding(rel)
+xlrec.flags |= VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING;

This is a few parentheses short of where it should be. Hilariously it
still compiles because there are parentheses in the macro definition.

+xlrec.onCatalogAccessibleInLogicalDecoding =
RelationIsAccessibleInLogicalDecoding(relation);

These lines are quite long. I think we should consider (1) picking a
shorter name for the xlrec field and, if such lines are still going to
routinely exceed 80 characters, (2) splitting them into two lines, with
the second one indented to match pgindent's preferences in such cases,
which I think is something like this:

xlrec.onCatalogAccessibleInLogicalDecoding =
	RelationIsAccessibleInLogicalDecoding(relation);

As far as renaming, I think we could at least remove onCatalog part
from the identifier, as that doesn't seem to be adding much. And maybe
we could even think of changing it to something like
logicalDecodingConflict or even decodingConflict, which would shave
off a bunch more characters.

+	if (heapRelation->rd_options)
+		isusercatalog = ((StdRdOptions *)
+						 (heapRelation)->rd_options)->user_catalog_table;

Couldn't you get rid of the if statement here and also the
initialization at the top of the function and just write isusercatalog
= RelationIsUsedAsCatalogTable(heapRelation)? Or even just get rid of
the variable entirely and pass
RelationIsUsedAsCatalogTable(heapRelation) as the argument to
UpdateIndexRelation directly?
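
That is, since RelationIsUsedAsCatalogTable() already checks rd_options
for NULL itself, the whole block could presumably be reduced to:

	isusercatalog = RelationIsUsedAsCatalogTable(heapRelation);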

I think this could use some test cases demonstrating that
indisusercatalog gets set correctly in all the relevant cases: table
is created with user_catalog_table = true/false, reloption is changed,

Re: static assert cleanup

2022-12-14 Thread Peter Eisentraut

On 11.12.22 23:18, Peter Smith wrote:

+StaticAssertDecl(SysCacheSize == (int) lengthof(cacheinfo),
+				 "SysCacheSize does not match syscache.c's array");
+
  static CatCache *SysCache[SysCacheSize];

In almost every example I found of StaticAssertXXX, the lengthof(arr)
part came first in the condition. Since you are modifying this anyway,
should this one also be reversed for consistency?
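
That is, presumably something like:

StaticAssertDecl((int) lengthof(cacheinfo) == SysCacheSize,
				 "SysCacheSize does not match syscache.c's array");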


Makes sense.  I have pushed this separately.




Re: Refactor SCRAM code to dynamically handle hash type and key length

2022-12-14 Thread Peter Eisentraut

On 14.12.22 03:38, Michael Paquier wrote:

This patch passes check-world and the CI is green.  I have tested as
well the patch with SCRAM verifiers coming from a server initially on
HEAD, so it looks pretty solid seen from here, being careful of memory
leaks in the frontend, mainly.


The changes from local arrays to dynamic allocation appear to introduce 
significant complexity.  I would reconsider that.  If we consider your 
reasoning


> While investigating on what it would take to extend SCRAM to use new
> hash methods (say like the RFC draft for SCRAM-SHA-512), I have been
> quickly reminded of the limitations created by SCRAM_KEY_LEN, which is
> the key length that we use in the HMAC and hash computations when
> creating a SCRAM verifier or when doing a SASL exchange.

then the obvious fix there is to change the definition of SCRAM_KEY_LEN 
to PG_SHA512_DIGEST_LENGTH, which would be a much smaller and simpler 
change.  We don't have to support arbitrary key sizes, so a fixed-size 
array seems appropriate.
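
A minimal sketch of that simpler direction (the struct and field names are
illustrative only, not the actual SCRAM code):

#define SCRAM_MAX_KEY_LEN	PG_SHA512_DIGEST_LENGTH

typedef struct
{
	uint8		StoredKey[SCRAM_MAX_KEY_LEN];
	uint8		ServerKey[SCRAM_MAX_KEY_LEN];
	int			key_length;		/* bytes actually used by the chosen hash */
} ScramKeys;

Buffers are sized for the largest supported digest, and the per-exchange
key length simply records how much of each buffer is in use.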






Re: Allow batched insert during cross-partition updates

2022-12-14 Thread Amit Langote
On Wed, Dec 14, 2022 at 6:45 PM Etsuro Fujita  wrote:
> On Thu, Dec 8, 2022 at 8:01 PM Etsuro Fujita  wrote:
> > On Thu, Dec 8, 2022 at 5:00 PM Amit Langote  wrote:
> > > Updated patch attached.
> >
> > I will review the patch a bit more, but I think
> > it would be committable.
>
> One thing I noticed is this bit:
>
>  -- Clean up
> -DROP TABLE batch_table, batch_cp_upd_test, batch_table_p0,
> batch_table_p1 CASCADE;
> +DROP TABLE batch_table, batch_table_p0, batch_table_p1,
> batch_cp_upd_test, cmdlog CASCADE;
>
> This would be nitpicking, but this as-proposed will not remove remote
> tables created for foreign-table partitions of the partitioned table
> ‘batch_cp_upd_test’.  So I modified this a bit further to remove them
> as well.  Also, I split this into two, for readability.  Another thing
> is a typo in a test-case comment: s/a single INSERTs/a single INSERT/.
> I fixed it as well.  Other than that, the patch looks good to me.
> Attached is an updated patch.  If there are no objections, I will
> commit the patch.

Thanks for the changes.  LGTM.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Minimal logical decoding on standbys

2022-12-14 Thread Drouvot, Bertrand

Hi,

On 12/13/22 5:37 PM, Drouvot, Bertrand wrote:

Hi,

On 12/13/22 2:50 PM, Robert Haas wrote:

On Tue, Dec 13, 2022 at 5:49 AM Drouvot, Bertrand



It seems kind of unfortunate to have to add payload to a whole bevy of
record types for this feature. I think it's worth it, both because the
feature is extremely important,


Agree, and I don't think that there is any other option than adding some
payload in some WAL records (at the very beginning the proposal was to
periodically log a new record that announces the current catalog xmin
horizon).


and also because there aren't any
record types that fall into this category that are going to be emitted
so frequently as to make it a performance problem. 


+1

If no objections from your side, I'll submit a patch proposal by tomorrow, 
which:

- get rid of IndexIsAccessibleInLogicalDecoding
- let RelationIsAccessibleInLogicalDecoding deal with the index case
- takes care of the padding where the new bool is added
- convert this new bool to a flag for the xl_heap_visible case (adding a new 
bit to the already existing flag)
- Add my proposed wording above to the commit message
- Add your proposed wording above in a README


Please find attached v31 with the changes mentioned above (except that I put 
your wording into the commit message instead of a README: I think it helps to 
make
clear what the "design" for the patch series is).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From df311456fa6f48ab8ed4f7554ebbd82ad28e6aa9 Mon Sep 17 00:00:00 2001
From: bdrouvotAWS 
Date: Wed, 14 Dec 2022 12:24:22 +
Subject: [PATCH v31 6/6] Fixing Walsender corner case with logical decoding on
 standby.

The problem is that WalSndWaitForWal() waits for the *replay* LSN to
increase, but gets woken up by walreceiver when new WAL has been
flushed. Which means that typically walsenders will get woken up at the
same time that the startup process will be - which means that by the
time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
that the startup process already replayed the record and updated
XLogCtl->lastReplayedEndRecPtr.

Introducing a new condition variable to fix this corner case.
---
 src/backend/access/transam/xlogrecovery.c | 28 
 src/backend/replication/walsender.c   | 31 +--
 src/backend/utils/activity/wait_event.c   |  3 +++
 src/include/access/xlogrecovery.h |  3 +++
 src/include/replication/walsender.h   |  1 +
 src/include/utils/wait_event.h|  1 +
 6 files changed, 59 insertions(+), 8 deletions(-)
  41.2% src/backend/access/transam/
  48.5% src/backend/replication/
   3.6% src/backend/utils/activity/
   3.4% src/include/access/

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index d5a81f9d83..ac8b169ab5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -358,6 +358,9 @@ typedef struct XLogRecoveryCtlData
 	RecoveryPauseState recoveryPauseState;
 	ConditionVariable recoveryNotPausedCV;
 
+	/* Replay state (see getReplayedCV() for more explanation) */
+	ConditionVariable replayedCV;
+
 	slock_t		info_lck;	/* locks shared variables shown above */
 } XLogRecoveryCtlData;
 
@@ -467,6 +470,7 @@ XLogRecoveryShmemInit(void)
 	SpinLockInit(&XLogRecoveryCtl->info_lck);
 	InitSharedLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
 	ConditionVariableInit(&XLogRecoveryCtl->recoveryNotPausedCV);
+	ConditionVariableInit(&XLogRecoveryCtl->replayedCV);
 }
 
 /*
@@ -1916,6 +1920,11 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *replayTLI)
 	XLogRecoveryCtl->lastReplayedTLI = *replayTLI;
 	SpinLockRelease(&XLogRecoveryCtl->info_lck);
 
+	/*
+	 * wake up walsender(s) used by logical decoding on standby.
+	 */
+	ConditionVariableBroadcast(&XLogRecoveryCtl->replayedCV);
+
/*
 * If rm_redo called XLogRequestWalReceiverReply, then we wake up the
 	 * receiver so that it notices the updated lastReplayedEndRecPtr and sends
@@ -4916,3 +4925,22 @@ assign_recovery_target_xid(const char *newval, void *extra)
 	else
 		recoveryTarget = RECOVERY_TARGET_UNSET;
 }
+
+/*
+ * Return the ConditionVariable indicating that a replay has been done.
+ *
+ * This is needed for logical decoding on standby. Indeed the "problem" is that
+ * WalSndWaitForWal() waits for the *replay* LSN to increase, but gets woken up
+ * by walreceiver when new WAL has been flushed. Which means that typically
+ * walsenders will get woken up at the same time that the startup process
+ * will be - which means that by the time the logical walsender checks
+ * GetXLogReplayRecPtr() it's unlikely that the startup process already replayed
+ * the record and updated XLogCtl->lastReplayedEndRecPtr.
+ *
+ * The ConditionVariable XLogRecoveryCtl->replayedCV solves this corner case.
+ */

Re: Amcheck verification of GiST and GIN

2022-12-14 Thread Jose Arthur Benetasso Villanova



On Sun, 27 Nov 2022, Andrey Borodin wrote:


On Sun, Nov 27, 2022 at 1:29 PM Andrey Borodin  wrote:



I was wrong. The GIN check does a similar gin_refind_parent() to lock pages
in a bottom-up manner and truly verify the downlink-child_page invariant.


Does this mean that we need the adjustment in docs?


Here's v17. The only difference is that I added progress reporting to
GiST verification.
I still did not implement heapallindexed for GIN. Existence of pending
lists makes this just too difficult for a weekend coding project :(

Thank you!

Best regards, Andrey Borodin.



I'm a bit lost here. I tried your patch again and indeed the 
heapallindexed inside gin_check_parent_keys_consistency has a TODO 
comment, but it's unclear to me if you are going to implement it or if the 
patch "needs review". Right now it's "Waiting on Author".


--
Jose Arthur Benetasso Villanova





Re: Force streaming every change in logical decoding

2022-12-14 Thread Amit Kapila
On Wed, Dec 14, 2022 at 2:15 PM shiy.f...@fujitsu.com
 wrote:
>
> Please see the attached patch. I also fix Peter's comments[1]. The GUC name 
> and
> design are still under discussion, so I didn't modify them.
>

Let me summarize the discussion on name and design till now. As per my
understanding, we have three requirements: (a) In publisher, stream
each change of transaction instead of waiting till
logical_decoding_work_mem or commit; this will help us to reduce the
test timings of current and future tests for replication of
in-progress transactions; (b) In publisher, serialize each change
instead of waiting till logical_decoding_work_mem; this can help
reduce the test time of tests related to serialization of changes in
logical decoding; (c) In subscriber, during parallel apply for
in-progress transactions (a new feature being discussed at [1]) allow
the system to switch to serialize mode (no more space in shared memory
queue between leader and parallel apply worker either due to a
parallel worker being busy or waiting on some lock) while sending
changes.

Having a GUC that controls these actions/features will allow us to
write tests with shorter duration and better predictability as
otherwise, they require a lot of changes. Apart from tests, these also
help to easily debug the required code. So they fit the Developer
Options category of GUC [2].

We have discussed three different ways to provide GUC for these
features. (1) Have separate GUCs like force_server_stream_mode,
force_server_serialize_mode, force_client_serialize_mode (we can use
different names for these) for each of these; (2) Have two sets of
GUCs for server and client. We can have logical_decoding_mode with
values as 'stream' and 'serialize' for the server and then
logical_apply_serialize = true/false for the client. (3) Have one GUC
like logical_replication_mode with values as 'server_stream',
'server_serialize', 'client_serialize'.

The names used here are tentative mainly to explain each of the
options, we can use different names once we decide among the above.
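
For illustration only, option (3) as a single developer GUC could be wired
up roughly like this; the GUC name and enum values are the tentative ones
from above, and the constants are invented for the sketch:

static const struct config_enum_entry logical_replication_mode_options[] = {
	{"server_stream", LOGICAL_REP_MODE_SERVER_STREAM, false},
	{"server_serialize", LOGICAL_REP_MODE_SERVER_SERIALIZE, false},
	{"client_serialize", LOGICAL_REP_MODE_CLIENT_SERIALIZE, false},
	{NULL, 0, false}
};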

Thoughts?

[1] - 
https://www.postgresql.org/message-id/flat/CAA4eK1%2BwyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw%40mail.gmail.com
[2] - https://www.postgresql.org/docs/devel/runtime-config-developer.html

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-12-14 Thread Bharath Rupireddy
On Thu, Nov 17, 2022 at 10:02 PM David Christensen
 wrote:
>
> On Wed, Nov 16, 2022 at 3:30 AM Bharath Rupireddy
>  wrote:
> > 1.
> > -if (config.filter_by_fpw && !XLogRecordHasFPW(xlogreader_state))
> > +if (config.filter_by_fpw && !XLogRecordHasFPI(xlogreader_state))
> > These changes are not related to this feature, hence renaming those
> > variables/function names must be dealt with separately. If required,
> > proposing another patch can be submitted to change filter_by_fpw to
> > filter_by_fpi and XLogRecordHasFPW() to XLogRecordHasFPI().
>
> Not required; can revert the changes unrelated to this specific patch.
> (I'd written the original ones for it, so didn't really think anything
> of it... :-))
>
> > 2.
> > +/* We fsync our output directory only; since these files are not 
> > part
> > + * of the production database we do not require the performance hit
> > + * that fsyncing every FPI would entail, so are doing this as a
> > + * compromise. */
> > The commenting style doesn't match the standard that we follow
> > elsewhere in postgres, please refer to other multi-line comments.
>
> Will fix.
>
> > 3.
> > +fsync_fname(config.save_fpi_path, true);
> > +}
> > It looks like fsync_fname()/fsync() in general isn't recursive, in the
> > sense that it doesn't fsync the files under the directory, but the
> > directory only. So, the idea of directory fsync doesn't seem worth it.
> > We either 1) get rid of fsync entirely or 2) fsync all the files after
> > they are created and the directory at the end or 3) do option (2) with
> > --no-sync option similar to its friends. Since option (2) is a no go,
> > we can either choose option (1) or option (3). My vote at this point
> > is for option (1).
>
> Agree to remove.
>
> > 4.
> > +($walfile_name, $blocksize) = split '\|' =>
> > $node->safe_psql('postgres',"SELECT pg_walfile_name(pg_switch_wal()),
> > current_setting('block_size')");
> > +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> > I think there's something wrong with this, no? pg_switch_wal() can, at
> > times, return end+1 of the prior segment (see below snippet) and I'm
> > not sure if such a case can happen here.
> >
> >  * The return value is either the end+1 address of the switch record,
> >  * or the end+1 address of the prior segment if we did not need to
> >  * write a switch record because we are already at segment start.
> >  */
> > XLogRecPtr
> > RequestXLogSwitch(bool mark_unimportant)
>
> I think this approach is pretty common to get the walfile name, no?
> While there might be an edge case here, since the rest of the test is
> a controlled environment I'm inclined to just not worry about it; this
> would require the changes prior to this to exactly fill a WAL segment
> which strikes me as extremely unlikely to the point of impossible in
> this specific scenario.
>
> > 5.
> > +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> > +ok(-f $walfile, "Got a WAL file");
> > Is this checking if the WAL file is present or not in PGDATA/pg_wal?
> > If yes, I think this isn't required as pg_switch_wal() ensures that
> > the WAL is written and flushed to disk.
>
> You are correct, probably another artifact of the earlier version.
> That said, not sure I see the harm in keeping it as a sanity-check.
>
> > 6.
> > +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> > Isn't "pgdata" hardcoded here? I think you might need to do the following:
> > $node->data_dir . '/pg_wal/' . $walfile_name;;
>
> Can fix.
>
> > 7.
> > +# save filename for later verification
> > +$files{$file}++;
> >
> > +# validate that we ended up with some FPIs saved
> > +ok(keys %files > 0, 'verify we processed some files');
> > Why do we need to store filenames in an array when we later just check
> > the size of the array? Can't we use a boolean (file_found) or an int
> > variable (file_count) to verify that we found the file.
>
> Another artifact; we were comparing the files output between two
> separate lists of arbitrary numbers of pages being written out and
> verifying the raw/fixup versions had the same lists.
>
> > 8.
> > +$node->safe_psql('postgres', <<EOF);
> > +SELECT 'init' FROM
> > pg_create_physical_replication_slot('regress_pg_waldump_slot', true,
> > false);
> > +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> > +CHECKPOINT; -- required to force FPI for next writes
> > +UPDATE test_table SET a = a + 1;
> > +EOF
> > The EOF with append_conf() is being used in 4 files and elsewhere in
> > the TAP test files (more than 100?) qq[] or quotes is being used. I
> > have no strong opinion here, I'll leave it to the other reviewers or
> > committer.
>
> I'm inclined to leave it just for (personal) readability, but can
> change if there's a strong consensus against.
>
> > > > 11.
> > > > +# verify filename formats matches w/--save-fpi
> > > > +for my $fullpath (glob "$tmp_folder/raw/*")
> > > > Do we 

Re: Raising the SCRAM iteration count

2022-12-14 Thread Daniel Gustafsson
> On 14 Dec 2022, at 02:00, Michael Paquier  wrote:
> 
> On Tue, Dec 13, 2022 at 12:17:58PM +0100, Daniel Gustafsson wrote:
>> It does raise an interesting point though, if we in the future add support for
>> SCRAM-SHA-512 (which seems reasonable to do) it's not good enough to have a
>> single GUC for SCRAM iterations; we'd need to be able to set the iteration
>> count per algorithm. I'll account for that when updating the patch 
>> downthread.
> 
> So, you mean that the GUC should be named like password_iterations,
> taking a grammar with a list like 'scram-sha-256=4096,algo2=5000'?

I was thinking about it but opted for the simpler approach of a GUC name with
the algorithm baked into it: scram_sha256_iterations.  It doesn't seem all that
likely that we'll have more than two versions of SCRAM (sha256/sha512) so
the additional complexity doesn't seem worth it.

The attached v2 has the GUC rename and a change to GUC_REPORT such that the
frontend can use the real value rather than the default.  I kept it
superuser-only so far; do you think it should be a user-settable GUC,
given that it's somewhat sensitive?

The default in this version is rolled back to 4096 as there was pushback
against raising it, and the lower limit is one in order to potentially assist
situations like the one Andres mentioned where md5 is used.
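
Putting those pieces together, the guc_tables entry could look roughly like
this sketch (PGC_SUSET, GUC_REPORT, default 4096, minimum 1); this is an
illustration of the shape, not the actual patch text:

	{
		{"scram_sha256_iterations", PGC_SUSET, CONN_AUTH_AUTH,
			gettext_noop("Sets the iteration count for SCRAM secret generation."),
			NULL,
			GUC_REPORT
		},
		&scram_sha256_iterations,
		SCRAM_DEFAULT_ITERATIONS, 1, INT_MAX,
		NULL, NULL, NULL
	},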

--
Daniel Gustafsson   https://vmware.com/



v2-0001-Make-SCRAM-iteration-count-configurable.patch
Description: Binary data


Re: Inconsistency in reporting checkpointer stats

2022-12-14 Thread Bharath Rupireddy
On Wed, Dec 14, 2022 at 1:02 PM Nitin Jadhav
 wrote:
>
> Hi,
>
> While working on checkpoint related stuff, I have encountered that
> there is some inconsistency while reporting checkpointer stats. When a
> checkpoint gets completed, a checkpoint complete message gets logged.
> This message has a lot of information including the buffers written
> (CheckpointStats.ckpt_bufs_written). This variable gets incremented in
> 2 cases. First is in BufferSync() and the second is in
> SlruInternalWritePage(). On the other hand the checkpointer stats
> exposed using pg_stat_bgwriter contains a lot of information including
> buffers written (PendingCheckpointerStats.buf_written_checkpoints).
> This variable gets incremented in only one place and that is in
> BufferSync(). So there is inconsistent behaviour between these two
> data. Please refer to the sample output below.
>
> In order to fix this, the
> PendingCheckpointerStats.buf_written_checkpoints should be incremented
> in SlruInternalWritePage() similar to
> CheckpointStats.ckpt_bufs_written. I have attached the patch for the
> same. Please share your thoughts.

Indeed PendingCheckpointerStats.buf_written_checkpoints needs to count
buffer writes in SlruInternalWritePage(). However, does it need to be
done immediately there? The stats will not be visible to the users
until the next pgstat_report_checkpointer(). Incrementing
buf_written_checkpoints in BufferSync() makes sense as the
pgstat_report_checkpointer() gets called in there via
CheckpointWriteDelay() and it becomes visible to the user immediately.
Have you checked if pgstat_report_checkpointer() gets called while the
checkpoint calls SlruInternalWritePage()? If not, then you can just
assign ckpt_bufs_written to buf_written_checkpoints in
LogCheckpointEnd() like its other friends
checkpoint_write_time and checkpoint_sync_time.
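
That is, something along these lines in LogCheckpointEnd() (a sketch; whether
a plain assignment is correct, or whether it would double-count writes already
flushed by an earlier pgstat_report_checkpointer(), is the part to verify):

	PendingCheckpointerStats.buf_written_checkpoints =
		CheckpointStats.ckpt_bufs_written;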

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Amit Kapila
On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Horiguchi-san, Amit,
>
> > > On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi
> > >  wrote:
> > > >
> > > > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila
> >  wrote in
> > > Yeah, I think ideally it will timeout but if we have a solution like
> > > during delay, we keep sending ping messages time-to-time, it should
> > > work fine. However, that needs to be verified. Do you see any reasons
> > > why that won't work?
>
> I have implemented and tested that workers wake up per wal_receiver_timeout/2
> and send keepalive. Basically it works well, but I found two problems.
> Do you have any good suggestions about them?
>
> 1)
>
> With this PoC at present, workers calculate sending intervals based on its
> wal_receiver_timeout, and it is suppressed when the parameter is set to zero.
>
> This means that there is a possibility that the walsender times out when
> wal_sender_timeout on the publisher and wal_receiver_timeout on the
> subscriber differ. Supposing that wal_sender_timeout is 2min,
> wal_receiver_timeout is 5min, and min_apply_delay is 10min: the worker on
> the subscriber will wake up every 2.5min and send keepalives, but the
> walsender exits before the message arrives at the publisher.
>
> One idea to avoid that is to send the min_apply_delay subscriber option
> to the publisher and compare them, but it may not be sufficient, because
> XXX_timeout GUC parameters could be modified later.
>

How about restarting the apply worker if min_apply_delay changes? Will
that be sufficient?

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Amit Kapila
On Fri, Dec 9, 2022 at 10:49 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Hi Vignesh,
>
> > In the case of physical replication by setting
> > recovery_min_apply_delay, I noticed that both primary and standby
> > nodes were getting stopped successfully immediately after the stop
> > server command. In case of logical replication, stop server fails:
> > pg_ctl -D publisher -l publisher.log stop -c
> > waiting for server to shut
> > down...
> > failed
> > pg_ctl: server does not shut down
> >
> > In case of logical replication, the server does not get stopped
> > because the walsender process is not able to exit:
> > ps ux | grep walsender
> > vignesh  1950789 75.3  0.0 8695216 22284 ?   Rs   11:51   1:08
> > postgres: walsender vignesh [local] START_REPLICATION
>
> Thanks for reporting the issue. I analyzed about it.
>
>
> This issue has occurred because the apply worker cannot reply during the 
> delay.
> I think we may have to modify the mechanism that delays applying transactions.
>
> When walsender processes are requested to shut down, it can shut down only 
> after
> that all the sent WALs are replicated on the subscriber. This check is done in
> WalSndDone(), and the replicated position will be updated when processes 
> handle
> the reply messages from a subscriber, in ProcessStandbyReplyMessage().
>
> > In the case of physical replication, the walreceiver can receive WALs and
> reply
> even if the application is delayed. It means that the replicated position will
> be transported to the publisher side immediately. So the walsender can exit.
>

I think it is not only the replicated positions but it also checks if
there is any pending send in WalSndDone(). Why is it a must to send
all pending WAL and confirm that it is flushed on standby before the
shutdown for physical standby? Is it because otherwise, we may lose
the required WAL? I am asking because it is better to see if those
conditions apply to logical replication as well.

-- 
With Regards,
Amit Kapila.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san, Amit,

> > On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila
>  wrote in
> > Yeah, I think ideally it will timeout but if we have a solution like
> > during delay, we keep sending ping messages time-to-time, it should
> > work fine. However, that needs to be verified. Do you see any reasons
> > why that won't work?

I have implemented and tested that workers wake up per wal_receiver_timeout/2
and send keepalive. Basically it works well, but I found two problems.
Do you have any good suggestions about them?

1)

With this PoC at present, workers calculate sending intervals based on its
wal_receiver_timeout, and it is suppressed when the parameter is set to zero.

This means that there is a possibility that the walsender times out when
wal_sender_timeout on the publisher and wal_receiver_timeout on the
subscriber differ. Supposing that wal_sender_timeout is 2min,
wal_receiver_timeout is 5min, and min_apply_delay is 10min: the worker on
the subscriber will wake up every 2.5min and send keepalives, but the
walsender exits before the message arrives at the publisher.

One idea to avoid that is to send the min_apply_delay subscriber option
to the publisher and compare them, but it may not be sufficient, because
XXX_timeout GUC parameters could be modified later.

2)

The issue reported by Vignesh-san [1] still remains. As I analyzed
earlier [2], the root cause is that the flushed WAL position is not
updated and sent to the publisher. Even if workers send keepalive
messages to the publisher during the delay, the flushed position cannot
be advanced.

[1]: 
https://www.postgresql.org/message-id/CALDaNm1vT8qNBqHivtAgYur-5-YkwF026VHtw9srd4fsdeaufA%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/TYAPR01MB5866F6BE7399E6343A96E016F51C9%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: slab allocator performance issues

2022-12-14 Thread John Naylor
On Tue, Dec 13, 2022 at 7:50 AM David Rowley  wrote:
>
> Thanks for testing the patch.
>
> On Mon, 12 Dec 2022 at 20:14, John Naylor 
wrote:

> > While allocation is markedly improved, freeing looks worse here. The
proportion is surprising because only about 2% of nodes are freed during
the load, but doing that takes up 10-40% of the time compared to allocating.
>
> I've tried to reproduce this with the v13 patches applied and I'm not
> really getting the same as you are. To run the function 100 times I
> used:
>
> select x, a.* from generate_series(1,100) x(x), lateral (select * from
> bench_load_random_int(500 * 1000 * (1+x-x))) a;

Simply running over a longer period of time like this makes the SlabFree
difference much closer to your results, so it doesn't seem out of line
anymore. Here SlabAlloc seems to take maybe 2/3 of the time of current
slab, with a 5% reduction in total time:

500k ints:

v13-0001-0005
average of 30: 217ms

  47.61%  postgres  postgres [.] rt_set
  20.99%  postgres  postgres [.] SlabAlloc
  10.00%  postgres  postgres [.] rt_node_insert_inner.isra.0
   6.87%  postgres  [unknown][k] 0xbce011b7
   3.53%  postgres  postgres [.] MemoryContextAlloc
   2.82%  postgres  postgres [.] SlabFree

+slab v4
average of 30: 206ms

  51.13%  postgres  postgres [.] rt_set
  14.08%  postgres  postgres [.] SlabAlloc
  11.41%  postgres  postgres [.] rt_node_insert_inner.isra.0
   7.44%  postgres  [unknown][k] 0xbce011b7
   3.89%  postgres  postgres [.] MemoryContextAlloc
   3.39%  postgres  postgres [.] SlabFree

It doesn't look mysterious anymore, but I went ahead and took some more
perf measurements, including for cache misses. My naive impression is that
we're spending a bit more time waiting for data, but having to do less work
with it once we get it, which is consistent with your earlier comments:

perf stat -p $pid sleep 2
v13:
          2,001.55 msec task-clock:u          #    1.000 CPUs utilized
                  0      context-switches:u   #    0.000 /sec
                  0      cpu-migrations:u     #    0.000 /sec
            311,690      page-faults:u        #  155.724 K/sec
      3,128,740,701      cycles:u             #    1.563 GHz
      4,739,333,861      instructions:u       #    1.51  insn per cycle
        820,014,588      branches:u           #  409.690 M/sec
          7,385,923      branch-misses:u      #    0.90% of all branches

+slab v4:
          2,001.09 msec task-clock:u          #    1.000 CPUs utilized
                  0      context-switches:u   #    0.000 /sec
                  0      cpu-migrations:u     #    0.000 /sec
            326,017      page-faults:u        #  162.920 K/sec
      3,016,668,818      cycles:u             #    1.508 GHz
      4,324,863,908      instructions:u       #    1.43  insn per cycle
        761,839,927      branches:u           #  380.712 M/sec
          7,718,366      branch-misses:u      #    1.01% of all branches


perf stat -e LLC-loads,LLC-loads-misses -p $pid sleep 2
min/max of 3 runs:
v13:  LL cache misses: 25.08% - 25.41%
+slab v4: LL cache misses: 25.74% - 26.01%

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Allow batched insert during cross-partition updates

2022-12-14 Thread Etsuro Fujita
On Thu, Dec 8, 2022 at 8:01 PM Etsuro Fujita  wrote:
> On Thu, Dec 8, 2022 at 5:00 PM Amit Langote  wrote:
> > Updated patch attached.
>
> I will review the patch a bit more, but I think
> it would be committable.

One thing I noticed is this bit:

 -- Clean up
-DROP TABLE batch_table, batch_cp_upd_test, batch_table_p0,
batch_table_p1 CASCADE;
+DROP TABLE batch_table, batch_table_p0, batch_table_p1,
batch_cp_upd_test, cmdlog CASCADE;

This would be nitpicking, but this as-proposed will not remove remote
tables created for foreign-table partitions of the partitioned table
‘batch_cp_upd_test’.  So I modified this a bit further to remove them
as well.  Also, I split this into two, for readability.  Another thing
is a typo in a test-case comment: s/a single INSERTs/a single INSERT/.
I fixed it as well.  Other than that, the patch looks good to me.
Attached is an updated patch.  If there are no objections, I will
commit the patch.

Best regards,
Etsuro Fujita


v11-0001-Allow-batching-of-inserts-during-cross-partition-efujita.patch
Description: Binary data


Re: pg_upgrade: Make testing different transfer modes easier

2022-12-14 Thread Daniel Gustafsson
> On 14 Dec 2022, at 08:04, Peter Eisentraut 
>  wrote:
> 
> On 07.12.22 17:33, Peter Eisentraut wrote:
>> I think if we want to make this configurable on the fly, an environment
>> variable would be much easier, like
>> my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
> 
> Here is an updated patch set that incorporates this idea.

I would prefer a small note about it in src/bin/pg_upgrade/TESTING to document
it outside of the code, but otherwise LGTM.

+   $mode,
'--check'
],

...

-   '-p', $oldnode->port, '-P', $newnode->port
+   '-p', $oldnode->port, '-P', $newnode->port,
+   $mode,
],

Minor nitpick, but while in there should we take the opportunity to add a
trailing comma on the other two array declarations which now end with --check?
It's good Perl practice and will make the code consistent.

--
Daniel Gustafsson   https://vmware.com/





Allow tailoring of ICU locales with custom rules

2022-12-14 Thread Peter Eisentraut
This patch exposes the ICU facility to add custom collation rules to a 
standard collation.  This would allow users to customize any ICU 
collation to whatever they want.  A very simple example from the 
documentation/tests:


CREATE COLLATION en_custom
(provider = icu, locale = 'en', rules = '&a < g');

This places "g" after "a" before "b".  Details about the syntax can be 
found at 
.
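
For reference, the ICU side of this can be sketched as follows: a collator
is built directly from tailoring rules with ucol_openRules().  This is a
simplified illustration (error handling omitted, and the patch's actual
combination of the base locale with the rules may differ):

UErrorCode	status = U_ZERO_ERROR;
UChar		urules[128];
int32_t		ulen;
UCollator  *collator;

u_strFromUTF8(urules, sizeof(urules) / sizeof(urules[0]), &ulen,
			  "&a < g", -1, &status);
collator = ucol_openRules(urules, ulen, UCOL_DEFAULT,
			  UCOL_DEFAULT_STRENGTH, NULL, &status);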


The code is pretty straightforward.  It mainly just records these rules 
in the catalog and feeds them to ICU when creating the collator object.

From b0d42407a60e116d3ccb0ed04505aa362f8a6a1d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 14 Dec 2022 10:15:03 +0100
Subject: [PATCH] Allow tailoring of ICU locales with custom rules

This exposes the ICU facility to add custom collation rules to a
standard collation.
---
 doc/src/sgml/catalogs.sgml| 18 +++
 doc/src/sgml/ref/create_collation.sgml| 22 +
 doc/src/sgml/ref/create_database.sgml | 12 +
 src/backend/catalog/pg_collation.c|  5 ++
 src/backend/commands/collationcmds.c  | 21 ++--
 src/backend/commands/dbcommands.c | 49 +--
 src/backend/utils/adt/pg_locale.c | 41 +++-
 src/backend/utils/init/postinit.c | 11 -
 src/include/catalog/pg_collation.h|  2 +
 src/include/catalog/pg_database.h |  3 ++
 src/include/utils/pg_locale.h |  1 +
 .../regress/expected/collate.icu.utf8.out | 30 
 src/test/regress/sql/collate.icu.utf8.sql | 13 +
 13 files changed, 219 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 9316b811ac..afa9f28ef9 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2428,6 +2428,15 @@ pg_collation 
Columns
   
  
 
+ 
+  
+   collicurules text
+  
+  
+   ICU collation rules for this collation object
+  
+ 
+
  
   
collversion text
@@ -3106,6 +3115,15 @@ pg_database 
Columns
   
  
 
+ 
+  
+   daticurules text
+  
+  
+   ICU collation rules for this database
+  
+ 
+
  
   
datcollversion text
diff --git a/doc/src/sgml/ref/create_collation.sgml 
b/doc/src/sgml/ref/create_collation.sgml
index 58f5f0cd63..2c7266107e 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -27,6 +27,7 @@
 [ LC_CTYPE = lc_ctype, ]
 [ PROVIDER = provider, ]
 [ DETERMINISTIC = boolean, ]
+[ RULES = rules, ]
 [ VERSION = version ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] name FROM 
existing_collation
@@ -149,6 +150,19 @@ Parameters
  
 
 
+
+ rules
+
+ 
+  
+   Specifies additional collation rules to customize the behavior of the
+   collation.  This is supported for ICU only.  See https://unicode-org.github.io/icu/userguide/collation/customization/"/>
+   for details on the syntax.
+  
+ 
+
+
 
  version
 
@@ -228,6 +242,14 @@ Examples
 
   
 
+  
+   To create a collation using the ICU provider, based on the English ICU
+   locale, with custom rules:
+
+
+
+  
+
   
To create a collation from an existing collation:
 
diff --git a/doc/src/sgml/ref/create_database.sgml 
b/doc/src/sgml/ref/create_database.sgml
index ea38c64731..aa6f121a81 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -192,6 +192,18 @@ Parameters
   
  
 
+ 
+  icu_rules
+  
+   
+Specifies additional collation rules to customize the behavior of the
+collation.  This is supported for ICU only.  See https://unicode-org.github.io/icu/userguide/collation/customization/"/>
+for details on the syntax.
+   
+  
+ 
+
  
   locale_provider
 
diff --git a/src/backend/catalog/pg_collation.c 
b/src/backend/catalog/pg_collation.c
index aa8fbe1a98..7ed9de3891 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -50,6 +50,7 @@ CollationCreate(const char *collname, Oid collnamespace,
int32 collencoding,
const char *collcollate, const char *collctype,
const char *colliculocale,
+   const char *collicurules,
const char *collversion,
bool if_not_exists,
bool quiet)
@@ -194,6 +195,10 @@ CollationCreate(const char *collname, Oid collnamespace,
values[Anum_pg_collation_colliculocale - 1] = 
CStringGetTextDatum(colliculocale);
else
nulls[Anum_pg_collation_colliculocale - 1] = true;
+   if (collicurules)
+   

Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-14 Thread Masahiko Sawada
On Wed, Dec 14, 2022 at 1:20 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada 
>  wrote:
> >
> > On Sun, Dec 11, 2022 at 8:45 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Friday, December 9, 2022 3:14 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Thu, Dec 8, 2022 at 12:37 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > >
> > > > Review comments
> > >
> > > Thanks for the comments!
> > >
> > > > ==
> > > > 1. Currently, we don't release the stream lock in LA (leader apply
> > > > worker) for "rollback to savepoint" and the reason is mentioned in
> > > > comments of
> > > > apply_handle_stream_abort() in the patch. But, today, while testing,
> > > > I found that can lead to deadlock which otherwise, won't happen on
> > > > the publisher. The key point is rollback to savepoint releases the
> > > > locks acquired by the particular subtransaction, so parallel apply
> > > > worker should also do the same. Consider the following example where
> > > > the transaction in session-1 is being performed by the parallel
> > > > apply worker and the transaction in session-2 is being performed by the
> > leader apply worker. I have simulated it by using GUC force_stream_mode.
> > > > Publisher
> > > > ==
> > > > Session-1
> > > > postgres=# begin;
> > > > BEGIN
> > > > postgres=*# savepoint s1;
> > > > SAVEPOINT
> > > > postgres=*# truncate t1;
> > > > TRUNCATE TABLE
> > > >
> > > > Session-2
> > > > postgres=# begin;
> > > > BEGIN
> > > > postgres=*# insert into t1 values(4);
> > > >
> > > > Session-1
> > > > postgres=*# rollback to savepoint s1; ROLLBACK
> > > >
> > > > Session-2
> > > > Commit;
> > > >
> > > > With or without commit of Session-2, this scenario will lead to
> > > > deadlock on the subscriber because PA (parallel apply worker) is
> > > > waiting for LA to send the next command, and LA is blocked by
> > > > the exclusive lock held by PA. There is no deadlock on the publisher because
> > > > rollback to savepoint will release the lock acquired by truncate.
> > > >
> > > > To solve this, How about if we do three things before sending abort
> > > > of sub-transaction (a) unlock the stream lock, (b) increment
> > > > pending_stream_count,
> > > > (c) take the stream lock again?
> > > >
> > > > Now, if the PA is not already waiting on the stop, it will not wait
> > > > at stream_stop but will wait after applying abort of sub-transaction
> > > > and if it is already waiting at stream_stop, the wait will be
> > > > released. If this works then probably we should try to do (b) before 
> > > > (a) to
> > match the steps with stream_start.
> > >
> > > The solution works for me, I have changed the code as suggested.
> > >
> > >
> > > > 2. There seems to be another general problem in the way the patch
> > > > waits for stream_stop in PA (parallel apply worker). Currently, PA
> > > > checks, if there are no more pending streams then it tries to wait
> > > > for the next stream by waiting on a stream lock. However, it is
> > > > possible after PA checks there is no pending stream and before it
> > > > actually starts waiting on a lock, the LA sends another stream for
> > > > which even stream_stop is sent, in this case, PA will start waiting
> > > > for the next stream whereas there is actually a pending stream
> > > > available. In this case, it won't lead to any problem apart from
> > > > delay in applying the changes in such cases but for the case mentioned 
> > > > in
> > the previous point (Pont 1), it can lead to deadlock even after we 
> > implement the
> > solution proposed to solve it.
> > >
> > > Thanks for reporting, I have introduced another flag in shared memory
> > > and use it to prevent the leader from incrementing the
> > > pending_stream_count if the parallel apply worker is trying to lock the 
> > > stream
> > lock.
> > >
> > >
> > > > 3. The other point to consider is that for
> > > > stream_commit/prepare/abort, in LA, we release the stream lock after
> > > > sending the message whereas for stream_start we release it before
> > > > sending the message. I think for the earlier cases
> > > > (stream_commit/prepare/abort), the patch has done like this because
> > > > pa_send_data() may need to require the lock again when it times out
> > > > and start serializing, so there will be no sense in first releasing
> > > > it, then re-acquiring it, and then again releasing it. Can't we also
> > > > release the lock for stream_start after
> > > > pa_send_data() only if it is not switched to serialize mode?
> > >
> > > Changed.
> > >
> > > Attach the new version patch set which addressed above comments.
> >
> > Here are comments on v59 0001, 0002 patches:
>
> Thanks for the comments!
>
> > +void
> > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
> > +{
> > +    while (1)
> > +    {
> > +        SpinLockAcquire(&wshared->mutex);
> > +
> > +/*
> > + * Don't try to increment the count if the 

Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-14 Thread Pavel Luzanov

After a fresh install, including the patch for \dpS [1],
I found that granting MAINTAIN privilege does not allow the TOAST table 
to be vacuumed.


postgres@postgres(16.0)=# GRANT MAINTAIN ON pg_type TO alice;
GRANT
postgres@postgres(16.0)=# \c - alice
You are now connected to database "postgres" as user "alice".
alice@postgres(16.0)=> \dpS pg_type
                                      Access privileges
   Schema   |  Name   | Type  |     Access privileges      | Column privileges | Policies
------------+---------+-------+----------------------------+-------------------+----------
 pg_catalog | pg_type | table | postgres=arwdDxtm/postgres+|                   |
            |         |       | =r/postgres               +|                   |
            |         |       | alice=m/postgres           |                   |
(1 row)

So, the patch for \dpS works as expected and can be committed.

alice@postgres(16.0)=> VACUUM pg_type;
WARNING:  permission denied to vacuum "pg_toast_1247", skipping it
VACUUM

[1] 
https://www.postgresql.org/message-id/20221206193606.GB3078082%40nathanxps13


--
Pavel Luzanov
Postgres Professional: https://postgrespro.com





RE: Force streaming every change in logical decoding

2022-12-14 Thread shiy.f...@fujitsu.com
On Sat, Dec 10, 2022 2:03 PM Dilip Kumar  wrote:
> 
> On Tue, Dec 6, 2022 at 11:53 AM shiy.f...@fujitsu.com
>  wrote:
> >
> > Hi hackers,
> >
> > In logical decoding, when logical_decoding_work_mem is exceeded, the
> changes are
> > sent to output plugin in streaming mode. But there is a restriction that the
> > minimum value of logical_decoding_work_mem is 64kB. I tried to add a GUC
> to
> > allow sending every change to output plugin without waiting until
> > logical_decoding_work_mem is exceeded.
> >
> > This helps to test streaming mode. For example, to test "Avoid streaming the
> > transaction which are skipped" [1], it needs many
> XLOG_XACT_INVALIDATIONS
> > messages. With the new option, it can be tested with fewer changes and in
> less
> > time. Also, this new option helps to test more scenarios for "Perform
> streaming
> > logical transactions by background workers" [2].
> 
> Some comments on the patch
> 

Thanks for your comments.

> 1. Can you add one test case using this option
> 

I added a simple test to confirm the new option works.

> 2. +  xreflabel="force_stream_mode">
> +  force_stream_mode
> (boolean)
> +  
> +   force_stream_mode configuration
> parameter
> +  
> +  
> 
> This GUC name "force_stream_mode" somehow appears like we are forcing
> this streaming mode irrespective of whether the
> subscriber has requested for this mode or not.  But actually it is not
> that, it is just streaming each change if
> it is enabled.  So we might need to think on the name (at least we
> should avoid using *mode* in the name IMHO).
> 

I think a similar GUC is force_parallel_mode, and if the query is parallel
unsafe or max_worker_processes is exceeded, force_parallel_mode will not work.
This is similar to what we do in this patch. So, maybe it's ok to use "mode". I
didn't change it in the new version of patch. What do you think?

> 3.
> -while (rb->size >= logical_decoding_work_mem * 1024L)
> +while ((!force_stream && rb->size >= logical_decoding_work_mem * 1024L) ||
> +   (force_stream && rb->size > 0))
>  {
> 
> It seems like if force_stream is on then indirectly it is enabling
> force serialization as well.  Because once we enter into the loop
> based on "force_stream" then it will either stream or serialize but I
> guess we do not want to force serialize based on this parameter.
> 

Agreed, I refactored the code and modified this point.

Please see the attached patch. I also fix Peter's comments[1]. The GUC name and
design are still under discussion, so I didn't modify them.

By the way, I noticed that the comment for ReorderBufferCheckMemoryLimit() on
HEAD missed something. I fix it in this patch, too.

[1] 
https://www.postgresql.org/message-id/CAHut%2BPtOjZ_e-KLf26i1XLH2ffPEZGOmGSKy0wDjwyB_uvzxBQ%40mail.gmail.com

Regards,
Shi yu


v2-0001-Allow-streaming-every-change-without-waiting-till.patch
Description:  v2-0001-Allow-streaming-every-change-without-waiting-till.patch


Re: generic plans and "initial" pruning

2022-12-14 Thread Amit Langote
On Tue, Dec 13, 2022 at 2:24 AM Alvaro Herrera  wrote:
> On 2022-Dec-12, Amit Langote wrote:
> > I started feeling like putting all the new logic being added
> > by this patch into plancache.c at the heart of GetCachedPlan() and
> > tweaking its API in kind of unintuitive ways may not have been such a
> > good idea to begin with.  So I started thinking again about your
> > GetRunnablePlan() wrapper idea and thought maybe we could do something
> > with it.  Let's say we name it GetCachedPlanLockPartitions() and put
> > the logic that does initial pruning with the new
> > ExecutorDoInitialPruning() in it, instead of in the normal
> > GetCachedPlan() path.  Any callers that call GetCachedPlan() instead
> > call GetCachedPlanLockPartitions() with either the List ** parameter
> > as now or some container struct if that seems better.  Whether
> > GetCachedPlanLockPartitions() needs to do anything other than return
> > the CachedPlan returned by GetCachedPlan() can be decided by the
> > latter setting, say, CachedPlan.has_unlocked_partitions.  That will be
> > done by AcquireExecutorLocks() when it sees containsInitialPrunnig in
> > any of the PlannedStmts it sees, locking only the
> > PlannedStmt.minLockRelids set (which is all relations where no pruning
> > is needed!), leaving the partition locking to
> > GetCachedPlanLockPartitions().
>
> Hmm.  This doesn't sound totally unreasonable, except to the point David
> was making that perhaps we may want this container struct to accomodate
> other things in the future than just the partition pruning results, so I
> think its name (and that of the function that produces it) ought to be a
> little more generic than that.
>
> (I think this also answers your question on whether a List ** is better
> than a container struct.)

OK, so here's a WIP attempt at that.

I have moved the original functionality of GetCachedPlan() to
GetCachedPlanInternal(), turning the former into a sort of controller
as described shortly.  The latter's CheckCachedPlan() part now only
locks the "minimal" set of, non-prunable, relations, making a note of
whether the plan contains any prunable subnodes and thus prunable
relations whose locking is deferred to the caller, GetCachedPlan().
GetCachedPlan(), as a sort of controller as mentioned before, does the
pruning if needed on the minimally valid plan returned by
GetCachedPlanInternal(), locks the partitions that survive, and redoes
the whole thing if the locking of partitions invalidates the plan.
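
In pseudocode, that controller loop looks roughly like this; only
GetCachedPlanInternal and has_unlocked_partitions come from the patch, while
the helpers marked hypothetical are invented here purely for illustration:

	for (;;)
	{
		CachedPlan *cplan = GetCachedPlanInternal(plansource, ...);

		if (!cplan->has_unlocked_partitions)
			return cplan;		/* no deferred locking to do */

		/* prune, then lock only the surviving partitions */
		DoInitialPruning(cplan, boundParams, extra);	/* hypothetical */
		LockSurvivingPartitions(extra);					/* hypothetical */

		if (CachedPlanStillValid(cplan))				/* hypothetical */
			return cplan;

		/* locking a partition invalidated the plan; redo from scratch */
		ReleaseCachedPlan(cplan, owner);
	}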

The pruning results are returned through the new output parameter of
GetCachedPlan() of type CachedPlanExtra.  I named it so after much
consideration, because all the new logic that produces stuff to put
into it is a part of the plancache module and has to do with
manipulating a CachedPlan.  (I had considered CachedPlanExecInfo to
indicate that it contains information that is to be forwarded to the
executor, though that just didn't seem to fit in plancache.h.)

I have broken out a few things into a preparatory patch 0001.  Mainly,
it invents PlannedStmt.minLockRelids to replace the
AcquireExecutorLocks()'s current loop over the range table to figure
out the relations to lock.  I also threw in a couple of pruning
related non-functional changes in there to make it easier to read the
0002, which is the main patch.



--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v29-0001-Preparatory-refactoring-before-reworking-CachedP.patch
Description: Binary data


v29-0002-In-GetCachedPlan-only-lock-unpruned-partitions.patch
Description: Binary data