Re: Out-of-tree certificate interferes ssltest

2022-03-16 Thread Michael Paquier
On Wed, Mar 16, 2022 at 11:45:39AM +0100, Daniel Gustafsson wrote:
> On 16 Mar 2022, at 08:36, Kyotaro Horiguchi  wrote:
>> The attached fixes that, and make-world successfully finished even if I
>> have a cert file in my home directory.
> 
> Seems correct to me, thanks!

The ultimate test I can think of to stress the robustness of this
test suite is to generate various certs and keys using "make
sslfiles", save them into ~/.postgresql/ (postgresql.crt,
postgresql.key, root.crl and root.crt), and then run the tests to see
how much junk data the SSL scripts would feed on.  With this method, I
have caught a total of 71 failures, many more than reported upthread.

We should really pay more attention to setting invalid default values
for sslcert, sslkey, sslcrl, sslcrldir and sslrootcert, rather than
hardcoding a couple of them in only a few places, which leaves us open
to the same problem each time a new test is added.  The best way I can
think of here is to use a string that includes all the default SSL
settings, prepending it to each $common_connstr.  This takes care of
most of the failures, except two cases related to expected failures
for sslcrldir:
- directory CRL belonging to a different CA
- does not connect with client-side CRL directory

In both cases, forcing sslcrl to a value of "invalid" interferes
with the failure scenario we expect from sslcrldir.  It is possible to
bypass that with something like the attached, but that's kind of an
ugly hack.  Another alternative would be to drop those two tests, and
I am not sure how much we care about these two negative scenarios.

Thoughts?
--
Michael
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index 5c5b16fbe7..55f2a52dee 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -138,8 +138,17 @@ note "running client tests";
 
 switch_server_cert($node, 'server-cn-only');
 
+# Set of default settings for SSL parameters in connection string.  This
+# makes the tests protected against any defaults the environment may have
+# in ~/.postgresql/.  no_crl is a flavor that does not have a default
+# value set for the connection parameter sslcrl.
+my $default_ssl_connstr = "sslkey=invalid sslcert=invalid sslrootcert=invalid sslcrl=invalid sslcrldir=invalid";
+my $default_ssl_connstr_no_crl = "sslkey=invalid sslcert=invalid sslrootcert=invalid sslcrldir=invalid";
+
 $common_connstr =
-  "user=ssltestuser dbname=trustdb sslcert=invalid hostaddr=$SERVERHOSTADDR host=common-name.pg-ssltest.test";
+  "$default_ssl_connstr user=ssltestuser dbname=trustdb hostaddr=$SERVERHOSTADDR host=common-name.pg-ssltest.test";
+my $common_connstr_no_crl =
+  "$default_ssl_connstr_no_crl user=ssltestuser dbname=trustdb hostaddr=$SERVERHOSTADDR host=common-name.pg-ssltest.test";
 
 # The server should not accept non-SSL connections.
 $node->connect_fails(
@@ -150,14 +159,14 @@ $node->connect_fails(
 # Try without a root cert. In sslmode=require, this should work. In verify-ca
 # or verify-full mode it should fail.
 $node->connect_ok(
-	"$common_connstr sslrootcert=invalid sslmode=require",
+	"$common_connstr sslmode=require",
 	"connect without server root cert sslmode=require");
 $node->connect_fails(
-	"$common_connstr sslrootcert=invalid sslmode=verify-ca",
+	"$common_connstr sslmode=verify-ca",
 	"connect without server root cert sslmode=verify-ca",
 	expected_stderr => qr/root certificate file "invalid" does not exist/);
 $node->connect_fails(
-	"$common_connstr sslrootcert=invalid sslmode=verify-full",
+	"$common_connstr sslmode=verify-full",
 	"connect without server root cert sslmode=verify-full",
 	expected_stderr => qr/root certificate file "invalid" does not exist/);
 
@@ -217,8 +226,10 @@ $node->connect_fails(
 	expected_stderr => qr/SSL error: certificate verify failed/);
 
 # The same for CRL directory
+# The default connection string should not include a default value for
+# sslcrl, or this interferes with the result of this test.
 $node->connect_fails(
-	"$common_connstr sslrootcert=ssl/root+server_ca.crt sslmode=verify-ca sslcrldir=ssl/client-crldir",
+	"$common_connstr_no_crl sslrootcert=ssl/root+server_ca.crt sslmode=verify-ca sslcrldir=ssl/client-crldir",
 	"directory CRL belonging to a different CA",
 	expected_stderr => qr/SSL error: certificate verify failed/);
 
@@ -235,7 +246,7 @@ $node->connect_ok(
 # Check that connecting with verify-full fails, when the hostname doesn't
 # match the hostname in the server's certificate.
 $common_connstr =
-  "user=ssltestuser dbname=trustdb sslcert=invalid sslrootcert=ssl/root+server_ca.crt hostaddr=$SERVERHOSTADDR";
+  "$default_ssl_connstr user=ssltestuser dbname=trustdb sslrootcert=ssl/root+server_ca.crt hostaddr=$SERVERHOSTADDR";
 
 $node->connect_ok("$common_connstr sslmode=require host=wronghost.test",
 	"mismatch between host name and server certificate sslmode=require");
@@ -253,7 +264,7 @@ $node->connect_fails(
 

Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 20:49:12 +0530, Ashutosh Sharma  
wrote in 
> I can see that the pg_get_wal_records_info function shows the details
> of the WAL record whose existence is beyond the user specified
> stop/end lsn pointer. See below:
> 
> ashu@postgres=# select * from pg_get_wal_records_info('0/0128',
> '0/0129');
> -[ RECORD 1 
> ]+--
> start_lsn| 0/128
> end_lsn  | 0/19F
> prev_lsn | 0/0
...
> record_length| 114
...
> In this case, the end lsn pointer specified by the user is
> '0/0129'. There is only one WAL record which starts before this
> specified end lsn pointer whose start pointer is at 0128, but that
> WAL record ends at 0/19F which is way beyond the specified end
> lsn. So, how come we are able to display the complete WAL record info?
> AFAIU, end lsn is the lsn pointer where you need to stop reading the
> WAL data. If that is true, then there exists no valid WAL record
> between the start and end lsn in this particular case.

You're right, considering how pg_waldump behaves.  pg_waldump works
almost the way you described above.  The record above actually ends
at 199, and pg_waldump shows that record when specifying -s 0/128
-e 0/19a, but not for -e 0/199.
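
For instance, with the LSNs from the example above (just a sketch of
the comparison, not re-verified here):

  pg_waldump -p $PGDATA/pg_wal -s 0/128 -e 0/19a   # shows the record
  pg_waldump -p $PGDATA/pg_wal -s 0/128 -e 0/199   # does not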

# I personally think the current behavior is fine, though..


It still suggests an unspecifiable end-LSN, though..

> select * from pg_get_wal_records_info('4/4B28EB68', '4/4C60');
> ERROR:  cannot accept future end LSN
> DETAIL:  Last known WAL LSN on the database system is 4/4C60.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Window Function "Run Conditions"

2022-03-16 Thread Corey Huinker
On Tue, Mar 15, 2022 at 5:24 PM Greg Stark  wrote:

> This looks like an awesome addition.
>
> I have one technical question...
>
> Is it possible to actually transform the row_number case into a LIMIT
> clause or make the planner support for this case equivalent to it (in
> which case we can replace the LIMIT clause planning to transform into
> a window function)?
>
> The reason I ask is because the Limit plan node is actually quite a
> bit more optimized than the general window function plan node. It
> calculates cost estimates based on the limit and can support Top-N
> sort nodes.
>
> But the bigger question is whether this patch is ready for a committer
> to look at? Were you able to resolve Andy Fan's bug report? Did you
> resolve the two questions in the original email?
>

+1 to all this

It seems like this effort would aid in implementing what some other
databases implement via the QUALIFY clause, which is to window functions
what HAVING is to aggregate functions.
For example:
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#qualify_clause
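
As a rough sketch of that equivalence (assuming a table t(a); the
QUALIFY form is BigQuery syntax, not something PostgreSQL accepts,
while the subquery form is the shape this run-condition work targets):

  -- BigQuery-style, not valid PostgreSQL:
  --   SELECT * FROM t QUALIFY row_number() OVER (ORDER BY a) <= 10;

  -- PostgreSQL equivalent:
  SELECT *
  FROM (SELECT *, row_number() OVER (ORDER BY a) AS rn FROM t) AS sub
  WHERE rn <= 10;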


Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-16 Thread Thomas Munro
On Thu, Mar 17, 2022 at 3:53 PM Michael Paquier  wrote:
>
> On Wed, Mar 16, 2022 at 05:15:58PM +0900, Kyotaro Horiguchi wrote:
> > I'm not sure that the "of the symbolic link in pg_tblspc/" part is
> > needed.  And allow_in_place_tablespaces alone doesn't create an
> > in-place tablespace.  So this might need a rethink, at least for the
> > second point.
>
> Surely this can be improved.  I was not satisfied with this paragraph
> after re-reading it this morning, so I have just removed it, rewording
> slightly the part for in-place tablespaces that is still necessary.

+   
+A relative path to the data directory is returned for tablespaces
+created when  is
+enabled.
+   
+   

I think what Horiguchi-san was pointing out above is that you need to
enable the GUC *and* say LOCATION '', which the new paragraph doesn't
capture.  What do you think about this:

A path relative to the data directory is returned for in-place
tablespaces (see ).
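
For reference, the behaviour that sentence describes looks roughly like
this on a throwaway cluster (the tablespace name is made up, and the
returned value assumes the fix discussed in this thread):

  SET allow_in_place_tablespaces = on;
  CREATE TABLESPACE regress_inplace LOCATION '';
  SELECT pg_tablespace_location(oid) FROM pg_tablespace
    WHERE spcname = 'regress_inplace';
  -- expected to be a relative path, something like pg_tblspc/<oid>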




Re: Skipping logical replication transactions on subscriber side

2022-03-16 Thread Amit Kapila
On Thu, Mar 17, 2022 at 8:13 AM shiy.f...@fujitsu.com
 wrote:
>
> On Wed, Mar 16, 2022 4:23 PM Masahiko Sawada  wrote:
> >
> > I've attached an updated version patch.
> >
>
> Thanks for updating the patch. Here are some comments for the v15 patch.
>
> 1. src/backend/replication/logical/worker.c
>
> + * to skip applying the changes when starting to apply changes.  The 
> subskiplsn is
> + * cleared after successfully skipping the transaction or applying non-empty
> + * transaction. The latter prevents the mistakenly specified subskiplsn from
>
> Should "applying non-empty transaction" be modified to "finishing a
> transaction"? To be consistent with the description in the
> alter_subscription.sgml.
>

The current wording in the patch seems okay to me, as it is good to
emphasize non-empty transactions.

> 2. src/test/subscription/t/029_on_error.pl
>
> +# Test of logical replication subscription self-disabling feature.
>
> Should we add something about "skip logical replication transactions" in this
> comment?
>

How about: "Tests for disable_on_error and SKIP transaction features."?

I am making some other minor edits in the patch and will take care of
whatever we decide for these comments.

-- 
With Regards,
Amit Kapila.




Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

2022-03-16 Thread Andy Fan
Hi:

I just tested more cases for the estimation issue for this feature, and we
can
find **we get a more accurate/stable estimation than before**. Here is the
test
cases and result (by comparing the master version and patched version).

create table ec_t110 as select i::int as a from generate_series(1, 110) i;
create table ec_t200 as select i::int as a from generate_series(1, 200) i;
create table ec_t500 as select i::int as a from generate_series(1, 500) i;
create table ec_t800 as select i::int as a from generate_series(1, 800) i;
create table ec_t1000 as select i::int as a from generate_series(1, 1000) i;

analyze;

-- 2 table joins.
explain analyze select * from ec_t1000, ec_t110 where ec_t1000.a =
ec_t110.a and ec_t1000.a > 100; -- (0.9)
explain analyze select * from ec_t1000, ec_t110 where ec_t1000.a =
ec_t110.a and ec_t110.a > 100; --  (0.1)

-- 3 table joins.
explain analyze select * from ec_t1000, ec_t110 , ec_t200 where ec_t1000.a
= ec_t110.a and ec_t110.a = ec_t200.a and ec_t1000.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200 where ec_t1000.a
= ec_t110.a and ec_t110.a = ec_t200.a and ec_t110.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200 where ec_t1000.a
= ec_t110.a and ec_t110.a = ec_t200.a and ec_t200.a > 100;

-- 4 table joins.
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500 where
ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a
and ec_t1000.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500 where
ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a
and ec_t110.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500 where
ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a
and ec_t200.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500 where
ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a
and ec_t500.a > 100;

-- 5 table joins.
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500, ec_t800
where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a =
ec_t200.a and ec_t500.a = ec_t800.a and ec_t1000.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500, ec_t800
where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a =
ec_t200.a and ec_t500.a = ec_t800.a and ec_t110.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500, ec_t800
where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a =
ec_t200.a and ec_t500.a = ec_t800.a and ec_t200.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500, ec_t800
where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a =
ec_t200.a and ec_t500.a = ec_t800.a and ec_t500.a > 100;
explain analyze select * from ec_t1000, ec_t110 , ec_t200, ec_t500, ec_t800
where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a =
ec_t200.a and ec_t500.a = ec_t800.a and ec_t800.a > 100;

| Query Id | Real rows | Est. Rows at master | Est. rows with patched | table # |
|----------+-----------+---------------------+------------------------+---------|
|        1 |        10 |                  99 |                     10 |       2 |
|        2 |        10 |                  10 |                     10 |       2 |
|        3 |        10 |                  20 |                     11 |       3 |
|        4 |        10 |                   2 |                     11 |       3 |
|        5 |        10 |                  11 |                     11 |       3 |
|        6 |        10 |                  10 |                      9 |       4 |
|        7 |        10 |                   1 |                      9 |       4 |
|        8 |        10 |                   6 |                      9 |       4 |
|        9 |        10 |                   9 |                      9 |       4 |
|       10 |        10 |                   8 |                      8 |       5 |
|       11 |        10 |                   1 |                      8 |       5 |
|       12 |        10 |                   5 |                      8 |       5 |
|       13 |        10 |                   7 |                      8 |       5 |
|       14 |        10 |                   8 |                      8 |       5 |

Previously, we could only use the qual the user provided to do the
estimation.  Now, since we introduce the CorrectiveQuals design, we
still keep just one qual counted, but we can choose the best one in
CorrectiveQuals no matter which one is provided by the user.  We gain
a better and more stable estimation because of this.

I'm happy with the overall design but not very confident about the
method to "choose the best one to keep".  So I ran as many test cases
as I could to find something wrong; so far so good.

I'm also happy with how to keep only one qual in CorrectiveQuals (not
choose the
best one). Assume we just have 1 EC filter in this query for simplicity. At
the
beginning, all the baserel have been 

Re: BufferAlloc: don't take two simultaneous locks

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov  
wrote in 
> On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:
> > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov  
> > wrote in 
> > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > time of HASH_REUSE).  So in the "REUSE->ENTER(elem exists and
> > returns the stashed)" case the stashed element is returned to its
> > original partition.  But that is not what I mentioned.
> > 
> > On the other hand, once the stashed element is reused by HASH_ENTER,
> > it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
> > from old partition) case.  I suspect that the frequent freelist
> > starvation comes from the latter case.
> 
> Doubtful.  By probability theory, a single partition is unlikely to
> be overflowed too much.  Hence, the freelist.

Yeah.  I think so generally.

> But! With 128kB of shared buffers there are just 32 buffers.  With 32
> entries for 32 freelist partitions, some freelist partition will
> certainly have 0 entries even if all entries are in freelists. 

Anyway, it's an extreme condition and the starvation happens only at a
negligible ratio.

> > RETURNED: 2
> > ALLOCED: 0
> > BORROWED: 435
> > REUSED: 495444
> > ASSIGNED: 495467 (-23)
> > 
> > Now "BORROWED" happens 0.8% of REUSED
> 
> 0.08% actually :)

Mmm.  Doesn't matter:p

> > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > ...
> > > > > Strange thing: both master and patched version has higher
> > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > than in first october version [1]. But lower tps at higher
> > > > > connections number (>= 191 clients).
> > > > > I'll try to bisect on master this unfortunate change.
...
> I've checked. Looks like something had changed on the server, since
> the old master commit now behaves the same as the new one (and differently
> from how it behaved in October).
> I remember maintenance downtime of the server in November/December.
> Probably the kernel was upgraded or some system settings were changed.

One thing I have a little concern about is that the numbers show a
steady 1-2% degradation for connection counts < 17.

I think there are two possible causes of the degradation.

1. The additional branch from consolidating HASH_ASSIGN into HASH_ENTER.
  This might cause degradation for memory-contended use.

2. The nallocs operation might cause degradation on non-shared dynahashes?
  I believe it doesn't, but I'm not sure.

  In a simple benchmark with pgbench on a laptop, dynahash
  allocation (including shared and non-shared) happened about 50
  times per second with 10 processes and 200 times with 100 processes.

> > I don't think nalloced needs to be the same width as long.  For
> > platforms with a 32-bit long, any possible degradation from the
> > 64-bit atomic there doesn't matter anyway.  So shouldn't we always
> > define the atomic as 64-bit and use the pg_atomic_* functions directly?
> 
> Some 32-bit platforms have no native 64-bit atomics.  There they are
> emulated with locks.
> 
> Well, and for a 32-bit platform long is just enough.  Why spend another
> 4 bytes per dynahash?

I don't think the additional bytes matter, but emulated atomic
operations can.  However, I'm not sure which platforms use those
fallback implementations.  (x86 seems to have had __sync_fetch_and_add()
since the P4.)

My opinion in the previous mail was that if that level of degradation
caused by emulated atomic operations matters, we shouldn't use atomics
there at all, since atomic operations are not free even on modern
platforms.
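
Just to illustrate what is being weighed here, a minimal sketch (not
the patch itself) of a 64-bit atomic allocation counter; on platforms
without native 64-bit atomics, port/atomics.h falls back to an
emulated, lock-based implementation, which is where the feared
overhead would come from:

  #include "postgres.h"
  #include "port/atomics.h"

  /* hypothetical stand-in for the per-dynahash allocation counter */
  static pg_atomic_uint64 nalloced;

  static void
  counter_init(void)
  {
      pg_atomic_init_u64(&nalloced, 0);
  }

  static void
  counter_bump(void)
  {
      /* one atomic read-modify-write per element allocation */
      (void) pg_atomic_fetch_add_u64(&nalloced, 1);
  }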

In relation to 2 above, if we observe that the degradation disappears
by (tentatively) using non-atomic operations for nalloced, we should go
back to the previous per-freelist nalloced.

I don't have access to such muscular machines, though..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-16 Thread Michael Paquier
On Wed, Mar 16, 2022 at 05:15:58PM +0900, Kyotaro Horiguchi wrote:
> I'm not sure that the "of the symbolic link in pg_tblspc/" part is
> needed.  And allow_in_place_tablespaces alone doesn't create an in-place
> tablespace.  So this might need a rethink, at least for the second point.

Surely this can be improved.  I was not satisfied with this paragraph
after re-reading it this morning, so I have just removed it, rewording
slightly the part for in-place tablespaces that is still necessary.

> Agreed. And v2 looks cleaner.
> 
> The test detects the lack of the feature.
> It successfully builds and runs on Rocky8 and Windows11.

Thanks for the review.  After a second look, it seemed fine so I have
applied it.  (I'll try to jump on the tablespace patch for recovery
soon-ish-ly if nobody beats me to it.)
--
Michael




RE: Skipping logical replication transactions on subscriber side

2022-03-16 Thread shiy.f...@fujitsu.com
On Wed, Mar 16, 2022 4:23 PM Masahiko Sawada  wrote:
> 
> I've attached an updated version patch.
> 

Thanks for updating the patch. Here are some comments for the v15 patch.

1. src/backend/replication/logical/worker.c

+ * to skip applying the changes when starting to apply changes.  The 
subskiplsn is
+ * cleared after successfully skipping the transaction or applying non-empty
+ * transaction. The latter prevents the mistakenly specified subskiplsn from

Should "applying non-empty transaction" be modified to "finishing a
transaction"? To be consistent with the description in the
alter_subscription.sgml.

2. src/test/subscription/t/029_on_error.pl

+# Test of logical replication subscription self-disabling feature.

Should we add something about "skip logical replication transactions" in this
comment?

Regards,
Shi yu


Re: support for MERGE

2022-03-16 Thread Japin Li


On Thu, 17 Mar 2022 at 03:18, Alvaro Herrera  wrote:
> v16.
> On 2022-Mar-14, Japin Li wrote:
>
>> +   ar_delete_trig_tcs = mtstate->mt_transition_capture;
>> +   if (mtstate->operation == CMD_UPDATE && 
>> mtstate->mt_transition_capture &&
>> +   mtstate->mt_transition_capture->tcs_update_old_table)
>> +   {
>> +   ExecARUpdateTriggers(estate, resultRelInfo, tupleid, 
>> oldtuple,
>> +NULL, NULL, 
>> mtstate->mt_transition_capture);
>> +
>> +   /*
>> +* We've already captured the NEW TABLE row, so make sure 
>> any AR
>> +* DELETE trigger fired below doesn't capture it again.
>> +*/
>> +   ar_delete_trig_tcs = NULL;
>> +   }
>>
>> Maybe we can use ar_delete_trig_tcs in if condition and
>> ExecARUpdateTriggers() to make the code shorter.
>
> ... Well, the code inside the block is about UPDATE triggers, so it's a
> bit nasty to use a variable whose name says it's for DELETE triggers.
> That variable really is just a clever flag that indicates whether or not
> to call the DELETE part.  It's a bit of strange coding, but ...
>
>> +MERGE is a multiple-table, multiple-action command: it specifies a target
>> +table and a source relation, and can contain multiple WHEN MATCHED and
>> +WHEN NOT MATCHED clauses, each of which specifies one UPDATE, INSERT,
>> +DELETE, or DO NOTHING action.  The target table is modified by MERGE,
>> +and the source relation supplies additional data for the actions
>>
>> I think the "source relation" is inaccurate.  How about using "data
>> source" like the documentation does?
>
> I debated this with myself when I started using the term "source
> relation" here.  The result of a query is also a relation, so this term
> is not incorrect; in fact, the glossary entry for "relation" explains
> this:
> https://www.postgresql.org/docs/14/glossary.html#GLOSSARY-RELATION
>
> It's true that it's not the typical use of the term in our codebase.
>
> Overall, I'm -0 for changing this.  (If we decide to change it, there
> are other places that would need to change as well.)

Thanks for the detailed explanation.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Support logical replication of DDLs

2022-03-16 Thread Japin Li


Hi, Zheng Li

On Thu, 17 Mar 2022 at 05:17, Zheng Li  wrote:
> Hi,
>
>>If you don't mind, would you like to share the POC or the branch for this 
>>work?
>
> The POC patch is attached. It currently supports the following 
> functionalities:
> 1. Configure either database level or table level DDL replication via
> the CREATE PUBLICATION command.
>
> 2.Supports replication of DDL of the following types when database
> level DDL replication is turned on. Other less common DDL types could
> be added later.
> T_CreateSchemaStmt
> T_CreateStmt
> T_CreateForeignTableStmt
> T_AlterDomainStmt
> T_DefineStmt
> T_CompositeTypeStmt
> T_CreateEnumStmt
> T_CreateRangeStmt
> T_AlterEnumStmt
> T_ViewStmt
> T_CreateFunctionStmt
> T_AlterFunctionStmt
> T_CreateTrigStmt
> T_CreateDomainStmt
> T_CreateCastStmt
> T_CreateOpClassStmt
> T_CreateOpFamilyStmt
> T_AlterOpFamilyStmt
> T_AlterOperatorStmt
> T_AlterTypeStmt
> T_GrantStmt
> T_AlterCollationStmt
> T_AlterTableStmt
> T_IndexStmt
>
> 3.Supports replication of DDLs of the following types if only table
> level DDL replication is turned on.
> T_AlterTableStmt
> T_IndexStmt
>
> 4.Supports seamless DML replication of new tables created via DDL
> replication without having to manually running “ALTER SUBSCRIPTION ...
> REFRESH PUBLICATION" on the subscriber.
>

Great!  I think we can split the patch into several pieces (at least
implementation and test cases) for easier review.


+ * Simiarl to the generic logical messages, These DDL messages can be either
+ * transactional or non-transactional. Note by default DDLs in PostgreSQL are
+ * transactional.

There is a typo, s/Simiarl/Similar/.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-16 Thread Jacob Champion
On Wed, 2022-03-16 at 15:56 +0900, Kyotaro Horiguchi wrote:
> At Tue, 15 Mar 2022 21:41:49 +, Jacob Champion  
> wrote in 
> > Hmm, the sslinfo tests are failing? I wouldn't have expected that based
> > on the patch changes. Just to confirm -- they pass for you without the
> > patch?
> 
> Mmm.  I'm not sure how come I didn't notice that; master also fails
> for me for the same reason.  In the past that could fail when valid
> client certs existed in the user's ~/.postgresql, but I believe that
> has been fixed.

Good to know; I was worried that I'd messed up something well outside
the code I'd touched.

> > > > > Mmm. after the end of the loop, rc is non-zero only when the loop has
> > > > > exited by the break and otherwise rc is zero. Why isn't it equivalent
> > > > > to setting check_cn to false at the break?
> > > > 
> > > > check_cn can be false if rc is zero, too; it means that we found a SAN
> > > > of the correct type but it didn't match.
> 
> I'm not discussing the meaning, purely the logical equivalence
> of the two ways to decide whether to visit the CN.
> 
> Concretely, the equivalence between this: [snip]

Thank you for the explanation -- the misunderstanding was all on my
end. I thought you were asking me to move the check_cn assignment
instead of copying it to the end. I agree that your suggestion is much
clearer, and I'll make that change tomorrow.

Thanks!
--Jacob


Re: Add index scan progress to pg_stat_progress_vacuum

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 09:47:49PM +, Imseih (AWS), Sami wrote:
> Spoke to Nathan offline and fixed some more comments/nitpicks in the patch.

I don't have any substantial comments for v9, so I think this can be marked
as ready-for-committer.  However, we probably should first see whether
Sawada-san has any comments on the revised approach.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Proposal: Support custom authentication methods using hooks

2022-03-16 Thread Stephen Frost
Greetings,

* samay sharma (smilingsa...@gmail.com) wrote:
> On Wed, Mar 16, 2022 at 8:02 AM Stephen Frost  wrote:
> > How about- if we just added OAUTH support directly into libpq and the
> > backend, would that work with Azure's OIDC provider?  If not, why not?
> 
> Overall, Azure AD implements OIDC, so this could be doable. However, we'll
> have to explore further to see if the current implementation would satisfy
> all the requirements.

Great, would be useful to know what's discovered here and if OIDC
doesn't do what you're looking for, how you're intending to work on
improving it to make it so that it does, in an open and transparent
manner through the RFC process.

> > If it does, then what's the justification for trying to allow custom
> > backend-only authentication methods?
> 
> The only goal of this patch wasn't to enable support for Azure AD. That's
> just one client. Users might have a need to add or change auth methods in
> the future and providing that extensibility so we don't need to have core
> changes for each one of them would be useful. I know there isn't alignment
> on this yet, but if we'd like to move certain auth methods out of core into
> extensions, then this might provide a good framework for that.

Hand-waving at this and saying that other people want it is not
justification for it- who else, exactly, has come along and asked for
it?  I also disagree that we actually want to move authentication
methods out of core- that was an argument brought up in this thread as
justification for why these hooks should exist but I don't see anyone
other than the folks that want the hooks actually saying that's
something they want.  There are some folks who agree that we should drop
support for certain authentication methods but that's not at all the
same thing (and no, moving them to contrib isn't somehow deprecating or
removing them from what we have to support or making it so that we don't
have to deal with them).

Thanks,

Stephen




Re: Proposal: Support custom authentication methods using hooks

2022-03-16 Thread Stephen Frost
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Wed, 2022-03-16 at 11:02 -0400, Stephen Frost wrote:
> > That we're having to extend this quite a bit to work for the proposed
> > OAUTH patches and that it still doesn't do anything for the client
> > side
> > (note that the OAUTHBEARER patches are still patching libpq to add
> > support directly and surely will still be even with this latest patch
> > set ...) makes me seriously doubt that this is the right direction to
> > be
> > going in.
> 
> I don't follow this concern. I assume you're referring to the last two
> bullet points, which I repeat below:
> 
> * Add support for custom auth options to configure provider's
> behavior: 
> 
> Even without OAUTH I think this would have been requested.
> Configuration of some kind is going to be necessary. Without custom
> options, I guess the provider would need to define its own config file?
> Not the end of the world, but not ideal.

Yes, configuration of some kind will be needed and getting that
configuration correct is going to be essential to being able to have a
secure authentication system.

> > How about- if we just added OAUTH support directly into libpq and the
> > backend, would that work with Azure's OIDC provider?  If not, why
> > not?
> > If it does, then what's the justification for trying to allow custom
> > backend-only authentication methods?
> 
> I understand the point, but it also has a flip side: if custom auth
> works, why do we need to block waiting for a complete core OAUTH
> feature?

First, let's be clear- we aren't actually talking about custom or
pluggable authentication here, at least when it comes to PG as a
project.  For it to actually be pluggable, it needs to be supported on
both the client side and the server side, not just the server side.

That this keeps getting swept under the carpet makes me feel like this
isn't actually an argument about the best way to move the PG project
forward but rather has another aim.  I don't think we should even
consider accepting a patch to make this something that works only on the
server side as, in such a case, we wouldn't even be able to have an
extension of our own that leverages it or tests it without bespoke code
for that purpose.  I certainly don't think we should add code to libpq
for some auth method that isn't even available in a default install of
PG, as would happen if we accepted both this patch and the patch to add
OAUTH support to libpq.  What exactly is the thinking on how this would
move forward?  We'd commit this custom support that requires an external
extension to actually work and then add hacked-in code to libpq in order
for folks to be able to use OAUTH?  What happens when, and not if,
something changes in that extension that requires a change on the client
side..?  Are we going to be dealing with CVEs for the libpq side of
this?

> Authentication seems like a good candidate for allowing custom methods.

It's not and this is clear from looking at history.  We have seen this
time and time again- PAM, SASL, RADIUS could be included, EAP, etc.
You'll note that we already support some of these, yet you'd like to now
add some set of hooks in addition to these pluggable options that
already exist because, presumably, they don't actually solve what you're
looking for.  That's actually rather typical when it comes to
authentication methods- everyone has this idea that it can be pluggable
or a few hooks to allow a custom method will work for the next great
thing, but that's not what actually happens.

> New ones are always coming along, being used in new ways, updating to
> newer crypto, or falling out of favor. We've accumulated quite a few.

I agree that we need to get rid of some of the ones that we've got, but
even so, at least the ones we have are maintained and updated to the
extent possible for us to fix issues with them.  The idea that
externally maintained custom auth methods would be taken care of in such
a way is quite far fetched in my view.

I also disagree that they're coming along so quickly and so fast that
it's actually difficult for us to include or support them; we've covered
quite a few, after all, as you seem to agree.

Thanks,

Stephen




Re: Support logical replication of DDLs

2022-03-16 Thread Alvaro Herrera
Hello

I think this is a pretty interesting and useful feature.

Did you see some old code I wrote towards this goal?
https://www.postgresql.org/message-id/20150215044814.gl3...@alvh.no-ip.org
The intention was that DDL would produce some JSON blob that accurately
describes the DDL that was run; the caller can acquire that and use it
to produce working DDL that doesn't depend on runtime conditions.  There
was lots of discussion on doing things this way.  It was ultimately
abandoned, but I think it's valuable.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Porque francamente, si para saber manejarse a uno mismo hubiera que
rendir examen... ¿Quién es el machito que tendría carnet?"  (Mafalda)




Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Andrew Dunstan


On 3/16/22 16:53, Tom Lane wrote:

>> Personally I don't have a problem with the use of SETTING. I think the
>> meaning is pretty plain in context and unlikely to produce any confusion.
> I'm just unhappy about the disconnect with the documentation.  I wonder
> if we could get away with s/configuration parameter/setting/g in the docs.
>
>   


I don't think we need to fix that here though. If you can live with
SETTING for now I will undertake to fix the docs post-feature-freeze if
necessary.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Tom Lane
Andrew Dunstan  writes:
> On 3/16/22 14:47, Tom Lane wrote:
>> I'm also fairly allergic to the way that this patch has decided to assign
>> multi-word names to privilege types (ie SET VALUE, ALTER SYSTEM).  There
>> is no existing precedent for that, and I think it's going to break
>> client-side code that we don't need to break.  It's not coincidental that
>> this forces weird changes in rules about whitespace in the has_privilege
>> functions, for example; and if you think that isn't going to cause
>> problems I think you are wrong.  Perhaps we could just use "SET" and
>> "ALTER", or "SET" and "SYSTEM"?

> That's going to look weird, ISTM. This is less clear about what it's
> granting.
>  GRANT ALTER ON SOMETHING shared_buffers TO myuser;

True.  I think "GRANT SET" is clear enough, and it fits with the custom of
using the name of the SQL statement that the privilege allows you to
invoke.  (I gather from Mark's comments that Bison gave him problems with
that, but maybe that can be dealt with.)  But I concede that "ALTER" by
itself is pretty vague.

> If you don't like that maybe ALTER_SYSTEM and SET_VALUE would work,
> although mostly we have avoided things like that.
> How about MODIFY instead of SET VALUE and CONFIGURE instead of ALTER SYSTEM?

I thought about ALTER_SYSTEM too.  It's not great but maybe the best we
can do.  Not sure that CONFIGURE is better.

> Personally I don't have a problem with the use of SETTING. I think the
> meaning is pretty plain in context and unlikely to produce any confusion.

I'm just unhappy about the disconnect with the documentation.  I wonder
if we could get away with s/configuration parameter/setting/g in the docs.

regards, tom lane




Re: Proposal: Support custom authentication methods using hooks

2022-03-16 Thread Jeff Davis
On Tue, 2022-03-15 at 12:27 -0700, samay sharma wrote:
> This patch-set adds the following:
> 
> * Allow multiple custom auth providers to be registered (Addressing
> feedback from Aleksander and Andrew)
> * Modify the test extension to use SCRAM to exchange secrets (Based
> on Andres's suggestion)
> * Add support for custom auth options to configure provider's
> behavior (by exposing a new hook) (Required by OAUTHBEARER)
> * Allow custom auth methods to use usermaps. (Required by
> OAUTHBEARER)

Review:

* I retract my previous comment that "...it seems to only be useful
with plaintext password authentication over an SSL connection"[0].
  - it can be used with SCRAM, as you've shown, which could be useful
for storing the credentials elsewhere
  - it can be used with one-time auth tokens, or short-lived auth
tokens, obtained from some other service
  - it can be used with SASL to negotiate with special clients that
support some other auth protocol (this is not shown, but having custom
auth would make it interesting to experiment in this area)

* There's a typo in the top comment for the test module, where you say
that it lives in "contrib", but it actually lives in
"src/test/modules".

* Don't specify ".so" in shared_preload_libraries (in the test module).

* Needs docs.

* I'm wondering whether provider initialization/lookup might have a
performance impact. Does it make sense to initialize the
CustomAuthProvider once when parsing the hba.conf, and then cache it in
the HbaLine or somewhere?

* When specifying the provider in pg_hba.conf, I first made a mistake
and specified it as "test_auth_provider" (which is the module name)
rather than "test" (which is the provider name). Maybe the message
could be slightly reworded in this case, like "no authentication
provider registered with name %s"? As is, the emphasis seems to be on
failing to load the library, which confused me at first.

* There should be a check like "if
(!process_shared_preload_libraries_in_progress) ereport(...)" in
RegisterAuthProvider (a sketch follows at the end of this list).

* Could use some fuzz testing on the hba line parsing. For instance, I
was able to specify "provider" twice. And if I specify "allow" first,
then "provider" second, it fails (this is probably fine to require
provider to be first, but needs to be documented/commented somewhere).

* If you are intending for your custom provider to be open source, it
would be helpful to show the code (list, github link, whatever), even
if rough. That would ensure that it's really solving at least one real
use case and we aren't forgetting something.

* In general I like this patch. Trying to catch up on the rest of the
discussion.
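
Here is the kind of guard I had in mind for the RegisterAuthProvider
point above (the function name, its signature and the CustomAuthProvider
type are taken from the patch under review, so treat the details as
assumptions):

  void
  RegisterAuthProvider(const char *name, CustomAuthProvider *provider)
  {
      if (!process_shared_preload_libraries_in_progress)
          ereport(ERROR,
                  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                   errmsg("authentication providers can only be registered "
                          "from shared_preload_libraries")));

      /* ... append the provider to the registered-provider list ... */
  }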

Regards,
Jeff Davis

[0] 
https://www.postgresql.org/message-id/bfc55e8045453659df26cd60035bfbb4b9530052.ca...@j-davis.com







Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-03-16 Thread Jimmy Yih
Zhihong Yu  wrote:
> Hi,
> For RangeVarCallbackForDropRelation():
>
> +   LockRelationOid(indexRelationOid, heap_lockmode);
>
> Since the above is called for both if and else blocks, it can be lifted 
> outside.

Thanks for the feedback.  Attached new v3 patch with feedback addressed.

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)

Attachment: v3-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch


Re: Proposal: Support custom authentication methods using hooks

2022-03-16 Thread samay sharma
Hi,

On Wed, Mar 16, 2022 at 8:02 AM Stephen Frost  wrote:

> Greetings,
>
> * samay sharma (smilingsa...@gmail.com) wrote:
> > On Fri, Mar 4, 2022 at 11:15 AM Jacob Champion 
> wrote:
> > > On Thu, 2022-03-03 at 11:12 +0100, Peter Eisentraut wrote:
> > > > At the moment, it is not possible to judge whether the hook interface
> > > > you have chosen is appropriate.
> > > >
> > > > I suggest you actually implement the Azure provider, then make the
> hook
> > > > interface, and then show us both and we can see what to do with it.
> > >
> > > To add a data point here, I've rebased my OAUTHBEARER experiment [1] on
> > > top of this patchset. (That should work with Azure's OIDC provider, and
> > > if it doesn't, I'd like to know why.)
> >
> > Firstly, thanks for doing this. It helps to have another data point and
> the
> > feedback you provided is very valuable. I've looked to address it with
> the
> > patchset attached to this email.
> >
> > This patch-set adds the following:
> >
> > * Allow multiple custom auth providers to be registered (Addressing
> > feedback from Aleksander and Andrew)
> > * Modify the test extension to use SCRAM to exchange secrets (Based on
> > Andres's suggestion)
> > * Add support for custom auth options to configure provider's behavior
> (by
> > exposing a new hook) (Required by OAUTHBEARER)
> > * Allow custom auth methods to use usermaps. (Required by OAUTHBEARER)
>
> That we're having to extend this quite a bit to work for the proposed
> OAUTH patches and that it still doesn't do anything for the client side
> (note that the OAUTHBEARER patches are still patching libpq to add
> support directly and surely will still be even with this latest patch
> set ...) makes me seriously doubt that this is the right direction to be
> going in.
>

I think I should have clarified this in my previous email. There is nothing
specific to OAUTHBEARER in these changes. Having auth options and using
usermaps is something which any auth method could require. I had not added
them yet, but I'm pretty sure this would have been requested.


>
> > > After the port, here are the changes I still needed to carry in the
> > > backend to get the tests passing:
> > >
> > > - I needed to add custom HBA options to configure the provider.
> >
> > Could you try to rebase your patch to use the options hook and let me
> know
> > if it satisfies your requirements?
> >
> > Please let me know if there's any other feedback.
>
> How about- if we just added OAUTH support directly into libpq and the
> backend, would that work with Azure's OIDC provider?  If not, why not?
>

Overall, Azure AD implements OIDC, so this could be doable. However, we'll
have to explore further to see if the current implementation would satisfy
all the requirements.


> If it does, then what's the justification for trying to allow custom
> backend-only authentication methods?
>

The only goal of this patch wasn't to enable support for Azure AD. That's
just one client. Users might have a need to add or change auth methods in
the future and providing that extensibility so we don't need to have core
changes for each one of them would be useful. I know there isn't alignment
on this yet, but if we'd like to move certain auth methods out of core into
extensions, then this might provide a good framework for that.

Regards,
Samay


>
> Thanks,
>
> Stephen
>


Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Andrew Dunstan


On 3/16/22 14:47, Tom Lane wrote:
> Andrew Dunstan  writes:
>> Generally I think this is now in fairly good shape, I've played with it
>> and it seems to do what I expect in every case, and the things I found
>> surprising are gone.
> Stepping back a bit ... do we really want to institutionalize the
> term "setting" for GUC variables?  I realize that the view pg_settings
> exists, but the documentation generally prefers the term "configuration
> parameters".  Where config.sgml uses "setting" as a noun, it's usually
> talking about a specific concrete value for a parameter, and you can
> argue that the view's name comports with that meaning.  But you can't
> GRANT a parameter's current value.
>
> I don't have a better name to offer offhand --- I surely am not proposing
> that we change the syntax to be "GRANT ... ON CONFIGURATION PARAMETER x",
> because that's way too wordy.  But now is the time to bikeshed if we're
> gonna bikeshed, or else admit that we're not interested in precise
> vocabulary.
>
> I'm also fairly allergic to the way that this patch has decided to assign
> multi-word names to privilege types (ie SET VALUE, ALTER SYSTEM).  There
> is no existing precedent for that, and I think it's going to break
> client-side code that we don't need to break.  It's not coincidental that
> this forces weird changes in rules about whitespace in the has_privilege
> functions, for example; and if you think that isn't going to cause
> problems I think you are wrong.  Perhaps we could just use "SET" and
> "ALTER", or "SET" and "SYSTEM"?


That's going to look weird, ISTM. This is less clear about what it's
granting.

 GRANT ALTER ON SOMETHING shared_buffers TO myuser;

If you don't like that maybe ALTER_SYSTEM and SET_VALUE would work,
although mostly we have avoided things like that.

How about MODIFY instead of SET VALUE and CONFIGURE instead of ALTER SYSTEM?

Personally I don't have a problem with the use of SETTING. I think the
meaning is pretty plain in context and unlikely to produce any confusion.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Tom Lane
Joshua Brindle  writes:
> On Wed, Mar 16, 2022 at 3:06 PM Tom Lane  wrote:
>> It's going to be hard to do anything useful in a hook that (a) does
>> not know which GUC is being assigned to and (b) cannot do catalog
>> accesses for fear that we're not inside a transaction.  (b), in
>> particular, seems like a rather thorough API break; up to now
>> ObjectPostAlter hooks could assume that catalog accesses are OK.

> Can you elaborate on this point?

Hmm, I glossed over too many details there perhaps.  I was thinking
about the restrictions on GUC check_hooks, which can be run outside
a transaction altogether.  But that's not quite relevant here.
ExecSetVariableStmt can assume it's inside a transaction, but what
it *can't* assume is that we've set a transaction snapshot as yet
(cf. PlannedStmtRequiresSnapshot).  If we call an ObjectPostAlter hook
there, and it does a catalog access, that's going to break things for
modifications of GUCs that are supposed to be modifiable without
freezing the transaction snapshot.  So the net result is the same:
catalog access not okay, at least not in general.

Between that and the fact that an OID-based API is largely useless here,
I don't think it's sane to try to use the existing ObjectPostAlter API.
Maybe there is a case for inventing a separate hook API that could pass
the GUC name as a string, instead.  I remain of the opinion that this
patch should not concern itself with that, though.
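
(Purely for illustration, a hypothetical shape for such a string-based
hook -- nothing like this exists in the tree or in the patch:

  typedef void (*SetVariableCheck_hook_type) (const char *name,
                                              const char *value,
                                              bool altersystem);
  extern PGDLLIMPORT SetVariableCheck_hook_type SetVariableCheck_hook;
)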

regards, tom lane




Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-03-16 Thread Zhihong Yu
On Wed, Mar 16, 2022 at 11:20 AM Jimmy Yih  wrote:

> Tom Lane  wrote:
> > 1. RangeVarCallbackForDropRelation can get called more than once
> > during a lookup (in case of concurrent rename and suchlike).
> > If so it needs to be prepared to drop the lock(s) it got last time.
> > You have not implemented any such logic.  This doesn't seem hard
> > to fix, just store the OID list into struct DropRelationCallbackState.
>
> Fixed in attached patch. We added the OID list to the
> DropRelationCallbackState as you suggested.
>
> > 2. I'm not sure you have fully thought through the interaction
> > with the subsequent "is_partition" stanza.   If the target is
> > an intermediate partitioned index, don't we need to acquire lock
> > on its parent index before starting to lock children?  (It may
> > be that that stanza is already in the wrong place relative to
> > the table-locking stanza it currently follows, but not sure.)
>
> It's not required to acquire lock on a possible parent index because
> of the restriction where we can only run DROP INDEX on the top-most
> partitioned index.
>
> > 3. Calling find_all_inheritors with lockmode NoLock, and then
> > locking those relation OIDs later, is both unsafe and code-wasteful.
> > Just tell find_all_inheritors to get the locks you want.
>
> Fixed in attached patch.
>
> > 4. This code will invoke find_all_inheritors on InvalidOid if
> > IndexGetRelation fails.  It needs to be within the if-test
> > just above.
>
> Fixed in attached patch.
>
> > 5. Reading classform again at this point in the function is
> > not merely inefficient, but outright wrong, because we
> > already released the syscache entry.  Use the local variable.
>
> Fixed in attached patch. Added another local variable
> is_partitioned_index to store the classform value. The main reason we
> need the classform is because the existing relkind and
> expected_relkind local variables would only show RELKIND_INDEX whereas
> we needed exactly RELKIND_PARTITIONED_INDEX.
>
> > 6. You've not paid enough attention to updating existing comments,
> > particularly the header comment for RangeVarCallbackForDropRelation.
>
> Fixed in attached patch. Updated the header comment and aggregated our
> introduced comment to another relative comment slightly above the
> proposed locking section.
>
> > Actually though, maybe you *don't* want to do this in
> > RangeVarCallbackForDropRelation.  Because of point 2, it might be
> > better to run find_all_inheritors after we've successfully
> > identified and locked the direct target relation, ie do it back
> > in RemoveRelations.  I've not thought hard about that, but it's
> > attractive if only because it'd mean you don't have to fix point 1.
>
> We think that RangeVarCallbackForDropRelation is probably the most
> correct function to place the fix. It would look a bit out-of-place
> being in RemoveRelations seeing how there's already relative DROP
> INDEX code in RangeVarCallbackForDropRelation. With point 2 explained
> and point 1 addressed, the fix seems to look okay in
> RangeVarCallbackForDropRelation.
>
> Thanks for the feedback.  Attached new patch with feedback addressed.
>
> --
> Jimmy Yih (VMware)
> Gaurab Dey (VMware)


Hi,
For RangeVarCallbackForDropRelation():

+   LockRelationOid(indexRelationOid, heap_lockmode);

Since the above is called for both if and else blocks, it can be lifted
outside.

Cheers


Re: Proposal: Support custom authentication methods using hooks

2022-03-16 Thread Jeff Davis
On Wed, 2022-03-16 at 11:02 -0400, Stephen Frost wrote:
> That we're having to extend this quite a bit to work for the proposed
> OAUTH patches and that it still doesn't do anything for the client
> side
> (note that the OAUTHBEARER patches are still patching libpq to add
> support directly and surely will still be even with this latest patch
> set ...) makes me seriously doubt that this is the right direction to
> be
> going in.

I don't follow this concern. I assume you're referring to the last two
bullet points, which I repeat below:

* Add support for custom auth options to configure provider's
behavior: 

Even without OAUTH I think this would have been requested.
Configuration of some kind is going to be necessary. Without custom
options, I guess the provider would need to define its own config file?
Not the end of the world, but not ideal.

* Allow custom auth methods to use usermaps:

This is really just appending the "custom" method to a list of other
methods that are  allowed to specify a usermap in pg_hba.conf. Again, I
don't see anything specific to OAUTH, and this would likely have been
requested regardless.

> How about- if we just added OAUTH support directly into libpq and the
> backend, would that work with Azure's OIDC provider?  If not, why
> not?
> If it does, then what's the justification for trying to allow custom
> backend-only authentication methods?

I understand the point, but it also has a flip side: if custom auth
works, why do we need to block waiting for a complete core OAUTH
feature?

Authentication seems like a good candidate for allowing custom methods.
New ones are always coming along, being used in new ways, updating to
newer crypto, or falling out of favor. We've accumulated quite a few.

Regards,
Jeff Davis







Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Joshua Brindle
On Wed, Mar 16, 2022 at 3:06 PM Tom Lane  wrote:
>
> Mark Dilger  writes:
> > On Mar 16, 2022, at 11:47 AM, Tom Lane  wrote:
> >> ... I therefore judge the
> >> hook calls added to ExecSetVariableStmt and AlterSystemSetConfigFile
> >> to be 100% useless, in fact probably counterproductive because they
> >> introduce a boatload of worries about whether the right things happen
> >> if the hook errors out or does something guc.c isn't expecting.
>
> > I think Joshua was planning to use these hooks for security purposes.  The 
> > hooks are supposed to check whether the Oid is valid, and if not, still be 
> > able to make choices based on the other information.  Joshua, any comment 
> > on this?
>
> It's going to be hard to do anything useful in a hook that (a) does
> not know which GUC is being assigned to and (b) cannot do catalog
> accesses for fear that we're not inside a transaction.  (b), in
> particular, seems like a rather thorough API break; up to now
> ObjectPostAlter hooks could assume that catalog accesses are OK.
>

Can you elaborate on this point? This is perhaps an area where I don't
know the rules of the road, to test this hook I modified the set_user
extension, which normally uses the parse tree to figure out if someone
is trying to SET log_statement or ALTER SYSTEM log_statement and
replaced it with:

case OAT_POST_ALTER:
    if (classId == SettingAclRelationId)
    {
        ObjectAddress   object;

        object.classId = SettingAclRelationId;
        object.objectId = objectId;
        object.objectSubId = 0;

        /* block SET / ALTER SYSTEM on log_statement */
        if (strcmp(getObjectIdentity(&object, false),
                   "log_statement") == 0)
        {
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                     errmsg("Setting %s blocked",
                            getObjectIdentity(&object, false))));
        }
    }

Is that inherently unsafe?




Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Tom Lane
Mark Dilger  writes:
> On Mar 16, 2022, at 11:47 AM, Tom Lane  wrote:
>> ... I therefore judge the
>> hook calls added to ExecSetVariableStmt and AlterSystemSetConfigFile
>> to be 100% useless, in fact probably counterproductive because they
>> introduce a boatload of worries about whether the right things happen
>> if the hook errors out or does something guc.c isn't expecting.

> I think Joshua was planning to use these hooks for security purposes.  The 
> hooks are supposed to check whether the Oid is valid, and if not, still be 
> able to make choices based on the other information.  Joshua, any comment on 
> this?

It's going to be hard to do anything useful in a hook that (a) does
not know which GUC is being assigned to and (b) cannot do catalog
accesses for fear that we're not inside a transaction.  (b), in
particular, seems like a rather thorough API break; up to now
ObjectPostAlter hooks could assume that catalog accesses are OK.

regards, tom lane




Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Joshua Brindle


> I suggest that what might be saner is to consider that the "objects"
> that the hook calls are concerned with are the pg_setting_acl entries,
> not the underlying GUCs, and thus that the hooks need be invoked only
> when creating, destroying or altering those entries.  If we do have
> a need for a hook editorializing on GUC value settings, that would
> have to be an independent API --- but I have heard no calls for
> the ability to have such a hook, and I don't think that this patch
> is the place to introduce one.

I requested it here:
https://www.postgresql.org/message-id/CAGB%2BVh5pVFAqw8YzeXy4xxmEt_4Hq_8pEUHdCQvv3mCjvC-S-w%40mail.gmail.com

with compromises here:
https://www.postgresql.org/message-id/CAGB%2BVh6wLJ3FKsno62fi54pfg0FDrZRWcpuuCJBkHcCj-G1ndw%40mail.gmail.com
https://www.postgresql.org/message-id/0A3D3CBA-6548-4C9E-9F46-59D5C51A1F31%40enterprisedb.com
https://www.postgresql.org/message-id/CAGB%2BVh65R5vKC4rEt7r2_pK3kMZd-VY0n99RJwcP8Bic7xvOxQ%40mail.gmail.com

and the new version here:
https://www.postgresql.org/message-id/C8DF19D8-C15D-4C2D-91CA-391390F1E421%40enterprisedb.com

for which I wrote a toy module to test, and I was satisfied with it, despite
the limitations.

If we are adding a DAC Grant for a type of object it seems untenable
that it would not come with analogous MAC capable hooks.

Thank you.




Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Mark Dilger



> On Mar 16, 2022, at 11:47 AM, Tom Lane  wrote:
> 
> Stepping back a bit ... do we really want to institutionalize the
> term "setting" for GUC variables?  I realize that the view pg_settings
> exists, but the documentation generally prefers the term "configuration
> parameters".  Where config.sgml uses "setting" as a noun, it's usually
> talking about a specific concrete value for a parameter, and you can
> argue that the view's name comports with that meaning.  But you can't
> GRANT a parameter's current value.
> 
> I don't have a better name to offer offhand --- I surely am not proposing
> that we change the syntax to be "GRANT ... ON CONFIGURATION PARAMETER x",
> because that's way too wordy.  But now is the time to bikeshed if we're
> gonna bikeshed, or else admit that we're not interested in precise
> vocabulary.

Informally, we often use "GUC" on this list, but that isn't used formally, 
leaving "configuration parameter" and "setting" as the two obvious choices.  I 
preferred "configuration parameter" originally and was argued out of it.  My 
take on "setting" was also that it more naturally refers to the choice of 
setting, not the thing being set, such that "work_mem = 8192" means the 
configuration parameter "work_mem" has the setting "8192".  I'm happy to change 
the patch along these lines, but I vaguely recall it being Robert who liked the 
"setting" language.  Robert?  (Sorry if I misremember that...)

> I'm also fairly allergic to the way that this patch has decided to assign
> multi-word names to privilege types (ie SET VALUE, ALTER SYSTEM).  There
> is no existing precedent for that, and I think it's going to break
> client-side code that we don't need to break.  It's not coincidental that
> this forces weird changes in rules about whitespace in the has_privilege
> functions, for example; and if you think that isn't going to cause
> problems I think you are wrong.  Perhaps we could just use "SET" and
> "ALTER", or "SET" and "SYSTEM"?

I chose those particular multi-word names to avoid needing to promote any 
keywords.  That's been a while ago, and I don't recall exactly where the 
shift-reduce or reduce-reduce errors were coming from.  I'll play with it to 
see what I can find.

> I also agree with the upthread criticism that this is abusing the
> ObjectPostAlterHook API unreasonably much.  In particular, it's
> trying to use pg_setting_acl OIDs as identities for GUCs, which
> they are not, first because they're unstable and second because
> most GUCs won't even have an entry there.  I therefore judge the
> hook calls added to ExecSetVariableStmt and AlterSystemSetConfigFile
> to be 100% useless, in fact probably counterproductive because they
> introduce a boatload of worries about whether the right things happen
> if the hook errors out or does something guc.c isn't expecting.

I think Joshua was planning to use these hooks for security purposes.  The 
hooks are supposed to check whether the Oid is valid, and if not, still be able 
to make choices based on the other information.  Joshua, any comment on this?

> I suggest that what might be saner is to consider that the "objects"
> that the hook calls are concerned with are the pg_setting_acl entries,
> not the underlying GUCs, and thus that the hooks need be invoked only
> when creating, destroying or altering those entries.

That's certainly a thing we could do, but I got the impression that Joshua 
wanted to hook into SET, RESET, and ALTER SYSTEM SET, not merely into GRANT and 
REVOKE.

>  If we do have
> a need for a hook editorializing on GUC value settings, that would
> have to be an independent API --- but I have heard no calls for
> the ability to have such a hook, and I don't think that this patch
> is the place to introduce one.

Well, the call for it was in this thread, but I'm ok with yanking out that part 
of the patch if it seems half-baked.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: JSON path decimal literal syntax

2022-03-16 Thread Peter Eisentraut

On 06.03.22 02:43, Nikita Glukhov wrote:

Obviously, there are compatibility issues with expressions like
'1.type()', which will start to require parentheses around numbers,
but they seem to be useful only for our regression tests.

The corresponding changes in jsonpath_out() related to parentheses
are missing in the v2 patch:

=# select '(1).a'::jsonpath;
  jsonpath
----------
  1."a"
(1 row)

=# select '(1).a'::jsonpath::text::jsonpath;
ERROR:  syntax error, unexpected STRING_P, expecting $end at or near """ of jsonpath input


I have added in v3 enclosing of numbers in parentheses if they have
successive path items. (Changed results of several test cases, one test
case added.)


Thank you for these insights.  I have integrated this into my patch and 
updated the commit message to point out the change.From cbe9c601ed5f0db894df74377bbc5aa624626c4e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 16 Mar 2022 18:24:00 +0100
Subject: [PATCH v4] Make JSON path numeric literals more correct

Per ECMAScript standard (ECMA-262, referenced by SQL standard), the
syntax forms

.1
1.

should be allowed for decimal numeric literals, but the existing
implementation rejected them.

Also, by the same standard, reject trailing junk after numeric
literals.

Note that the ECMAScript standard for numeric literals is in respects
like these slightly different from the JSON standard, which might be
the original cause for this discrepancy.

A change is that this kind of syntax is now rejected:

1.type()

This needs to be written as

(1).type()

This is correct; normal JavaScript also does not accept this syntax.

We also need to fix up the jsonpath output function for this case.  We
put parentheses around numeric items if they are followed by another
path item.

Discussion: 
https://www.postgresql.org/message-id/flat/50a828cc-0a00-7791-7883-2ed06dfb2...@enterprisedb.com
---
 src/backend/utils/adt/jsonpath.c   |   4 +
 src/backend/utils/adt/jsonpath_scan.l  |  24 ++-
 src/test/regress/expected/jsonpath.out | 238 -
 src/test/regress/sql/jsonpath.sql  |   8 +
 4 files changed, 176 insertions(+), 98 deletions(-)

diff --git a/src/backend/utils/adt/jsonpath.c b/src/backend/utils/adt/jsonpath.c
index 9be4e305ff..91af030095 100644
--- a/src/backend/utils/adt/jsonpath.c
+++ b/src/backend/utils/adt/jsonpath.c
@@ -500,9 +500,13 @@ printJsonPathItem(StringInfo buf, JsonPathItem *v, bool inKey,
escape_json(buf, jspGetString(v, NULL));
break;
case jpiNumeric:
+   if (jspHasNext(v))
+   appendStringInfoChar(buf, '(');
 				appendStringInfoString(buf,
 									   DatumGetCString(DirectFunctionCall1(numeric_out,
 																		   NumericGetDatum(jspGetNumeric(v)))));
+   if (jspHasNext(v))
+   appendStringInfoChar(buf, ')');
break;
case jpiBool:
if (jspGetBool(v))
diff --git a/src/backend/utils/adt/jsonpath_scan.l b/src/backend/utils/adt/jsonpath_scan.l
index 827a9e44cb..1f08e7c51f 100644
--- a/src/backend/utils/adt/jsonpath_scan.l
+++ b/src/backend/utils/adt/jsonpath_scan.l
@@ -82,11 +82,13 @@ other			[^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\" \t\n\r\f]
 
 digit  [0-9]
 integer(0|[1-9]{digit}*)
-decimal{integer}\.{digit}+
-decimalfail{integer}\.
+decimal({integer}\.{digit}*|\.{digit}+)
 real   ({integer}|{decimal})[Ee][-+]?{digit}+
-realfail1  ({integer}|{decimal})[Ee]
-realfail2  ({integer}|{decimal})[Ee][-+]
+realfail   ({integer}|{decimal})[Ee][-+]
+
+integer_junk   {integer}{other}
+decimal_junk   {decimal}{other}
+real_junk  {real}{other}
 
 hex_dig[0-9A-Fa-f]
 unicode\\u({hex_dig}{4}|\{{hex_dig}{1,6}\})
@@ -242,16 +244,10 @@ hex_fail  \\x{hex_dig}{0,1}
 					return INT_P;
 				}
 
-{decimalfail}	{
-					/* throw back the ., and treat as integer */
-					yyless(yyleng - 1);
-					addstring(true, yytext, yyleng);
-					addchar(false, '\0');
-					yylval->str = scanstring;
-

Re: Granting SET and ALTER SYSTE privileges for GUCs

2022-03-16 Thread Tom Lane
Andrew Dunstan  writes:
> Generally I think this is now in fairly good shape, I've played with it
> and it seems to do what I expect in every case, and the things I found
> surprising are gone.

Stepping back a bit ... do we really want to institutionalize the
term "setting" for GUC variables?  I realize that the view pg_settings
exists, but the documentation generally prefers the term "configuration
parameters".  Where config.sgml uses "setting" as a noun, it's usually
talking about a specific concrete value for a parameter, and you can
argue that the view's name comports with that meaning.  But you can't
GRANT a parameter's current value.

I don't have a better name to offer offhand --- I surely am not proposing
that we change the syntax to be "GRANT ... ON CONFIGURATION PARAMETER x",
because that's way too wordy.  But now is the time to bikeshed if we're
gonna bikeshed, or else admit that we're not interested in precise
vocabulary.

I'm also fairly allergic to the way that this patch has decided to assign
multi-word names to privilege types (ie SET VALUE, ALTER SYSTEM).  There
is no existing precedent for that, and I think it's going to break
client-side code that we don't need to break.  It's not coincidental that
this forces weird changes in rules about whitespace in the has_privilege
functions, for example; and if you think that isn't going to cause
problems I think you are wrong.  Perhaps we could just use "SET" and
"ALTER", or "SET" and "SYSTEM"?

I also agree with the upthread criticism that this is abusing the
ObjectPostAlterHook API unreasonably much.  In particular, it's
trying to use pg_setting_acl OIDs as identities for GUCs, which
they are not, first because they're unstable and second because
most GUCs won't even have an entry there.  I therefore judge the
hook calls added to ExecSetVariableStmt and AlterSystemSetConfigFile
to be 100% useless, in fact probably counterproductive because they
introduce a boatload of worries about whether the right things happen
if the hook errors out or does something guc.c isn't expecting.

I suggest that what might be saner is to consider that the "objects"
that the hook calls are concerned with are the pg_setting_acl entries,
not the underlying GUCs, and thus that the hooks need be invoked only
when creating, destroying or altering those entries.  If we do have
a need for a hook editorializing on GUC value settings, that would
have to be an independent API --- but I have heard no calls for
the ability to have such a hook, and I don't think that this patch
is the place to introduce one.

regards, tom lane




Re: Optimize external TOAST storage

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 11:36:56AM -0700, Nathan Bossart wrote:
> Thinking further, is simply reducing the number of TOAST chunks the right
> thing to look at?  If I want to add a TOAST attribute that requires 100,000
> chunks, and you told me that I could save 10% in the read path for an extra
> 250 chunks of disk space, I would probably choose read performance.  If I
> wanted to add 100,000 attributes that were each 3 chunks, and you told me
> that I could save 10% in the read path for an extra 75,000 chunks of disk
> space, I might choose the extra disk space.  These are admittedly extreme
> (and maybe even impossible) examples, but my point is that the amount of
> disk space you are willing to give up may be related to the size of the
> attribute.  And maybe one way to extract additional read performance with
> this optimization is to use a variable threshold so that we are more likely
> to use it for large attributes.

I might be overthinking this.  Maybe it is enough to skip compressing the
attribute whenever compression saves no more than some percentage of the
uncompressed attribute size.  A conservative default setting might be
something like 5% or 10%.
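
For illustration, here is a minimal sketch of such a rule; the function name
and the percentage knob are assumptions for the example, not anything from
the patch:

/*
 * Sketch only: keep the compressed copy of a to-be-externalized value only
 * when compression saves at least min_save_pct percent of the raw size.
 */
static bool
compression_saves_enough(Size rawsize, Size compressed_size, int min_save_pct)
{
	if (compressed_size >= rawsize)
		return false;
	return (rawsize - compressed_size) * 100 >= (Size) min_save_pct * rawsize;
}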

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 03:02:41PM +0530, Bharath Rupireddy wrote:
> On Mon, Mar 14, 2022 at 10:34 PM Nathan Bossart
>  wrote:
>> I'm -1 on splitting these new statistics to separate LOGs.  In addition to
>> making it more difficult to discover statistics for a given checkpoint, I
>> think it actually adds even more bloat to the server log.  If we are
>> concerned about the amount of information in these LOGs, perhaps we should
>> adjust the format to make it more human-readable.
> 
> Below are the ways that I can think of. Thoughts?

I think I prefer the first option.  If these tasks don't do anything, we
leave the stats out of the message.  If they do some work, we add them.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimize external TOAST storage

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 10:08:45AM -0400, Robert Haas wrote:
> I would like to take a slightly contrary position. I think that a
> system here that involves multiple knobs is going to be too
> complicated to be of any real-world use, because almost nobody will
> understand it or tune it properly for their installation. And who is
> to say that a single setting would even be appropriate for a whole
> installation, as opposed to something that needs to be tuned for each
> individual table? A single tunable might be OK, but what I would
> really like here is a heuristic that, while perhaps not optimal in
> every environment, is good enough that a very high percentage of users
> will never need to worry about it. In a case like this, a small gain
> that happens automatically feels better than a large gain that
> requires manual tweaking in every install.

Agreed.  If there is a tunable, it needs to be as simple as possible.  And
ideally most users would never need to use it because the default would be
"good enough."

> Now that doesn't answer the question of what that heuristic should be.
> I initially thought that if compression didn't end up reducing the
> number of TOAST chunks then there was no point, but that's not true,
> because having the last chunk be smaller than the rest can allow us to
> fit more than 4 into a page rather than exactly 4. However, this
> effect isn't likely to be important if most chunks are full-size
> chunks. If we insert 100 full-size chunks and 1 partial chunk at the
> end, we can't save much even if that chunk ends up being 1 byte
> instead of a full-size chunk. 25 TOAST table pages out of 26 are going
> to be full of full-size chunks either way, so we can't save more than
> ~4% and we might easily save nothing at all. As you say, the potential
> savings are larger as the values get smaller, because a higher
> proportion of the TOAST table pages are not quite full. So I think the
> only cases where it's even interesting to think about suppressing this
> optimization are those where the value is quite small -- and maybe the
> thing to do is just pick a conservatively large threshold and call it
> good.
> 
> For example, say that instead of applying this technique when there
> are at least 2 TOAST chunks, we apply it when there are at least 500
> (i.e. normally 1MB). It's hard for me to imagine that we can ever
> lose, and in fact I think we could come down quite a bit from there
> and still end up winning pretty much all the time. Setting the
> threshold to, say, 50 or 100 or 150 TOAST chunks instead of 2 may
> leave some money on the table, but if it means that we don't have
> meaningful downsides and we don't have tunables for users to fiddle
> with, it might be worth it.

As long as we can demonstrate some decent improvements, I think using a
conservative threshold is a good idea.  I do wonder whether this could
weaken the read performance gains quite a bit, though.

Thinking further, is simply reducing the number of TOAST chunks the right
thing to look at?  If I want to add a TOAST attribute that requires 100,000
chunks, and you told me that I could save 10% in the read path for an extra
250 chunks of disk space, I would probably choose read performance.  If I
wanted to add 100,000 attributes that were each 3 chunks, and you told me
that I could save 10% in the read path for an extra 75,000 chunks of disk
space, I might choose the extra disk space.  These are admittedly extreme
(and maybe even impossible) examples, but my point is that the amount of
disk space you are willing to give up may be related to the size of the
attribute.  And maybe one way to extract additional read performance with
this optimization is to use a variable threshold so that we are more likely
to use it for large attributes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-03-16 Thread Jimmy Yih
Tom Lane  wrote:
> 1. RangeVarCallbackForDropRelation can get called more than once
> during a lookup (in case of concurrent rename and suchlike).
> If so it needs to be prepared to drop the lock(s) it got last time.
> You have not implemented any such logic.  This doesn't seem hard
> to fix, just store the OID list into struct DropRelationCallbackState.

Fixed in attached patch. We added the OID list to the
DropRelationCallbackState as you suggested.

> 2. I'm not sure you have fully thought through the interaction
> with the subsequent "is_partition" stanza.   If the target is
> an intermediate partitioned index, don't we need to acquire lock
> on its parent index before starting to lock children?  (It may
> be that that stanza is already in the wrong place relative to
> the table-locking stanza it currently follows, but not sure.)

It's not required to acquire lock on a possible parent index because
of the restriction where we can only run DROP INDEX on the top-most
partitioned index.

> 3. Calling find_all_inheritors with lockmode NoLock, and then
> locking those relation OIDs later, is both unsafe and code-wasteful.
> Just tell find_all_inheritors to get the locks you want.

Fixed in attached patch.
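
For reference, a minimal sketch of what that looks like; the variable name
and the use of AccessExclusiveLock here are assumptions based on the
surrounding discussion:

List	   *inhOids;

/* Take the locks during the walk instead of collecting OIDs with NoLock
 * and locking them in a second pass. */
inhOids = find_all_inheritors(indexRelationOid, AccessExclusiveLock, NULL);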

> 4. This code will invoke find_all_inheritors on InvalidOid if
> IndexGetRelation fails.  It needs to be within the if-test
> just above.

Fixed in attached patch.

> 5. Reading classform again at this point in the function is
> not merely inefficient, but outright wrong, because we
> already released the syscache entry.  Use the local variable.

Fixed in attached patch. Added another local variable
is_partitioned_index to store the classform value. The main reason we
need the classform is because the existing relkind and
expected_relkind local variables would only show RELKIND_INDEX whereas
we needed exactly RELKIND_PARTITIONED_INDEX.

> 6. You've not paid enough attention to updating existing comments,
> particularly the header comment for RangeVarCallbackForDropRelation.

Fixed in attached patch. Updated the header comment and aggregated our
introduced comment to another relative comment slightly above the
proposed locking section.

> Actually though, maybe you *don't* want to do this in
> RangeVarCallbackForDropRelation.  Because of point 2, it might be
> better to run find_all_inheritors after we've successfully
> identified and locked the direct target relation, ie do it back
> in RemoveRelations.  I've not thought hard about that, but it's
> attractive if only because it'd mean you don't have to fix point 1.

We think that RangeVarCallbackForDropRelation is probably the most
correct function to place the fix. It would look a bit out of place in
RemoveRelations, seeing how there's already related DROP INDEX code in
RangeVarCallbackForDropRelation. With point 2 explained
and point 1 addressed, the fix seems to look okay in
RangeVarCallbackForDropRelation.

Thanks for the feedback.  Attached new patch with feedback addressed.

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)

v2-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch
Description:  v2-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch


Re: USE_BARRIER_SMGRRELEASE on Linux?

2022-03-16 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 06:40:23PM +1300, Thomas Munro wrote:
> Pushed and back-patched (it's slightly different before 12).  Thanks!

Thank you!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Assert in pageinspect with NULL pages

2022-03-16 Thread Alexander Lakhin

Hello Michael,
16.03.2022 10:39, Michael Paquier wrote:

> On Mon, Feb 21, 2022 at 10:00:00AM +0300, Alexander Lakhin wrote:
>> Could you please confirm before committing the patchset that it fixes
>> the bug #16527 [1]? Or maybe I could check it?
>> (Original patch proposed by Daria doesn't cover that case, but if the
>> patch is going to be improved, probably it's worth fixing that bug too.)
>>
>> [1]
>> https://www.postgresql.org/message-id/flat/16527-ef7606186f0610a1%40postgresql.org
>
> FWIW, I think that this one has been fixed by 076f4d9, where we make
> sure that the page is correctly aligned.  Are you still seeing this
> problem?

Yes, I've reproduced that issue on master (46d9bfb0). 
pageinspect-zeros-v2.patch doesn't help too.


Best regards.
Alexander




Re: Column Filtering in Logical Replication

2022-03-16 Thread Tomas Vondra



On 3/15/22 09:30, Tomas Vondra wrote:
> 
> 
> On 3/15/22 05:43, Amit Kapila wrote:
>> On Mon, Mar 14, 2022 at 4:42 PM houzj.f...@fujitsu.com
>>  wrote:
>>>
>>> On Monday, March 14, 2022 5:08 AM Tomas Vondra 
>>>  wrote:

 On 3/12/22 05:30, Amit Kapila wrote:
>> ...
>
> Okay, please find attached. I have done basic testing of this, if we
> agree with this approach then this will require some more testing.
>

 Thanks, the proposed changes seem like a clear improvement, so I've
 added them, with some minor tweaks (mostly to comments).
>>>
>>> Hi,
>>>
>>> Thanks for updating the patches !
>>> And sorry for the row filter bug caused by my mistake.
>>>
>>> I looked at the two fixup patches. I am thinking would it be better if we
>>> add one testcase for these two bugs? Maybe like the attachment.
>>>
>>
>> Your tests look good to me. We might want to add some comments for
>> each test but I guess that can be done before committing. Tomas, it
>> seems you are planning to push these bug fixes, do let me know if you
>> want me to take care of these while you focus on the main patch? I
>> think the first patch needs to be backpatched till 13 and the second
>> one is for just HEAD.
>>
> 
> Yeah, I plan to push the fixes later today. I'll polish them a bit
> first, and merge the tests (shared by Hou zj) into the patches etc.
> 

I've pushed (and backpatched to 13+) the fix for the publish_as_relid
issue, including the test. I tweaked the test a bit, to check both
orderings of the publication list.

While doing that, I discovered yet ANOTHER bug in the publish_as_relid
loop, affecting 12+13. There was a break once all actions were
replicated, but skipping additional publications ignores the fact that
the publications may replicate a different (higher-up) ancestor.

I removed the break, if anyone thinks this optimization is worth it we
could still do that once we replicate the top-most ancestor.


I'll push the second fix soon.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-16 Thread Ashutosh Sharma
I can see that the pg_get_wal_records_info function shows the details
of the WAL record whose existence is beyond the user specified
stop/end lsn pointer. See below:

ashu@postgres=# select * from pg_get_wal_records_info('0/0128',
'0/0129');
-[ RECORD 1 
]+--
start_lsn| 0/128
end_lsn  | 0/19F
prev_lsn | 0/0
xid  | 0
resource_manager | XLOG
record_length| 114
fpi_length   | 0
description  | CHECKPOINT_SHUTDOWN redo 0/128; tli 1; prev tli
1; fpw true; xid 0:3; oid 1; multi 1; offset 0; oldest xid 3 in DB
1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0;
oldest running xid 0; shutdown
block_ref|
data_length  | 88
data |
\x280101000100010003001027010003000100010001007255a5c43162ff7f

In this case, the end lsn pointer specified by the user is
'0/0129'. There is only one WAL record which starts before this
specified end lsn pointer whose start pointer is at 0128, but that
WAL record ends at 0/19F which is way beyond the specified end
lsn. So, how come we are able to display the complete WAL record info?
AFAIU, end lsn is the lsn pointer where you need to stop reading the
WAL data. If that is true, then there exists no valid WAL record
between the start and end lsn in this particular case.
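
A minimal sketch of the boundary rule being argued for here; the variable
names are hypothetical and this is not the extension's actual code:

/* Count a record only if it ends at or before the user-supplied end LSN. */
if (xlogreader->EndRecPtr > end_lsn)
	break;				/* record runs past the requested range; stop */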

--
With Regards,
Ashutosh Sharma.

On Wed, Mar 16, 2022 at 7:56 PM Stephen Frost  wrote:
>
> Greetings,
>
> * Bharath Rupireddy (bharath.rupireddyforpostg...@gmail.com) wrote:
> > On Tue, Mar 15, 2022 at 7:21 AM Bharath Rupireddy
> >  wrote:
> > >
> > > On Mon, Mar 14, 2022 at 8:25 PM Stephen Frost  wrote:
> > > >
> > > > > As this patch is currently written, pg_monitor has access these
> > > > > functions, though I don't think that's the right privilege level at
> > > > > least for pg_get_raw_wal_record().
> > > >
> > > > Yeah, I agree that pg_monitor isn't the right thing for such a function
> > > > to be checking.
> > >
> > > On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis  wrote:
> > > >
> > > > * pg_get_raw_wal_record() seems too powerful for pg_monitor. Maybe that
> > > > function should require pg_read_server_files? Or at least
> > > > pg_read_all_data?
> > >
> > > The v9 patch set posted at [1] grants execution on
> > > pg_get_raw_wal_record() to the pg_monitor role.
> > >
> > > pg_read_all_data may not be the right choice, but pg_read_server_files
> > > is as these functions do read the WAL files on the server. If okay,
> > > I'm happy to grant execution on pg_get_raw_wal_record() to the
> > > pg_read_server_files role.
> > >
> > > Thoughts?
> > >
> > > [1] 
> > > https://www.postgresql.org/message-id/CALj2ACVRH-z8mZLyFkpLvY4qRhxQCqU_BLkFTtwt%2BTPZNhfEVg%40mail.gmail.com
> >
> > Attaching v10 patch set which allows pg_get_raw_wal_record to be
> > executed by either superuser or users with pg_read_server_files role,
> > no other change from v9 patch set.
>
> In a quick look, that seems reasonable to me.  If folks want to give out
> access to this function individually they're also able to do so, which
> is good.  Doesn't seem worthwhile to introduce a new predefined role for
> this one function.
>
> Thanks,
>
> Stephen




Re: refactoring basebackup.c (zstd negative compression)

2022-03-16 Thread Justin Pryzby
Should zstd's negative compression levels be supported here ?

Here's a POC patch which is enough to play with it.

$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --compress=zstd |wc -c
12305659
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --compress=zstd:1 |wc -c
13827521
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --compress=zstd:0 |wc -c
12304018
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --compress=zstd:-1 |wc -c
16443893
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --compress=zstd:-2 |wc -c
17349563
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --compress=zstd:-4 |wc -c
19452631
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --compress=zstd:-7 |wc -c
21871505

Also, with a partial regression DB, this crashes when writing to stdout.

$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --compress=lz4 |wc -c
pg_basebackup: bbstreamer_lz4.c:172: bbstreamer_lz4_compressor_content: 
Assertion `mystreamer->base.bbs_buffer.maxlen >= out_bound' failed.
24117248

#4  0xe8b4 in bbstreamer_lz4_compressor_content 
(streamer=0x555a5260, member=0x7fffc760, 
data=0x73068010 "{ \"PostgreSQL-Backup-Manifest-Version\": 
1,\n\"Files\": [\n{ \"Path\": \"backup_label\", \"Size\": 227, 
\"Last-Modified\": \"2022-03-16 02:29:11 GMT\", \"Checksum-Algorithm\": 
\"CRC32C\", \"Checksum\": \"46f69d99\" },\n{ \"Pa"..., len=401072, 
context=BBSTREAMER_MEMBER_CONTENTS) at bbstreamer_lz4.c:172
mystreamer = 0x555a5260
next_in = 0x73068010 "{ \"PostgreSQL-Backup-Manifest-Version\": 
1,\n\"Files\": [\n{ \"Path\": \"backup_label\", \"Size\": 227, 
\"Last-Modified\": \"2022-03-16 02:29:11 GMT\", \"Checksum-Algorithm\": 
\"CRC32C\", \"Checksum\": \"46f69d99\" },\n{ \"Pa"...
...

(gdb) p mystreamer->base.bbs_buffer.maxlen
$1 = 524288
(gdb) p (int) LZ4F_compressBound(len, &mystreamer->prefs)
$4 = 524300

This is with: liblz4-1:amd64 1.9.2-2ubuntu0.20.04.1
>From 9a330a3a1801352cef3b5912e31aba61760dac32 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 10 Mar 2022 20:16:19 -0600
Subject: [PATCH] pg_basebackup: support Zstd negative compression levels

"higher than maximum" is bogus

TODO: each compression methods should enforce its own levels
---
 src/backend/replication/basebackup_zstd.c |  8 ++--
 src/backend/replication/repl_scanner.l|  4 +++-
 src/bin/pg_basebackup/bbstreamer_zstd.c   |  7 ++-
 src/bin/pg_basebackup/pg_basebackup.c | 23 ---
 4 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/basebackup_zstd.c b/src/backend/replication/basebackup_zstd.c
index c0e2be6e27b..4464fcb30e1 100644
--- a/src/backend/replication/basebackup_zstd.c
+++ b/src/backend/replication/basebackup_zstd.c
@@ -71,7 +71,7 @@ bbsink_zstd_new(bbsink *next, int compresslevel)
 
 	Assert(next != NULL);
 
-	if (compresslevel < 0 || compresslevel > 22)
+	if (compresslevel < -7 || compresslevel > 22)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("zstd compression level %d is out of range",
@@ -96,13 +96,17 @@ bbsink_zstd_begin_backup(bbsink *sink)
 {
 	bbsink_zstd *mysink = (bbsink_zstd *) sink;
 	size_t		output_buffer_bound;
+	size_t		ret;
 
 	mysink->cctx = ZSTD_createCCtx();
 	if (!mysink->cctx)
 		elog(ERROR, "could not create zstd compression context");
 
-	ZSTD_CCtx_setParameter(mysink->cctx, ZSTD_c_compressionLevel,
+	ret = ZSTD_CCtx_setParameter(mysink->cctx, ZSTD_c_compressionLevel,
 		   mysink->compresslevel);
+	if (ZSTD_isError(ret))
+		elog(ERROR, "could not create zstd compression context: %s",
+ZSTD_getErrorName(ret));
 
 	/*
 	 * We need our own buffer, because we're going to pass different data to
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 4b64c0d768b..05c4ef463a1 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -86,6 +86,7 @@ xdinside		[^"]+
 
 digit			[0-9]
 hexdigit		[0-9A-Fa-f]
+sign			("-"|"+")
 
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
@@ -127,9 +128,10 @@ NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
 WAIT{ return K_WAIT; }
 
+
 {space}+		{ /* do nothing */ }
 
-{digit}+		{
+{sign}?{digit}+		{
 	yylval.uintval = strtoul(yytext, NULL, 10);
 	return UCONST;
 }
diff --git a/src/bin/pg_basebackup/bbstreamer_zstd.c b/src/bin/pg_basebackup/bbstreamer_zstd.c
index e86749a8fb7..337e789b6a1 100644
--- a/src/bin/pg_basebackup/bbstreamer_zstd.c
+++ b/src/bin/pg_basebackup/bbstreamer_zstd.c
@@ -67,6 +67,7 @@ bbstreamer_zstd_compressor_new(bbstreamer *next, int compresslevel)
 {
 #ifdef 

Re: Proposal: Support custom authentication methods using hooks

2022-03-16 Thread Stephen Frost
Greetings,

* samay sharma (smilingsa...@gmail.com) wrote:
> On Fri, Mar 4, 2022 at 11:15 AM Jacob Champion  wrote:
> > On Thu, 2022-03-03 at 11:12 +0100, Peter Eisentraut wrote:
> > > At the moment, it is not possible to judge whether the hook interface
> > > you have chosen is appropriate.
> > >
> > > I suggest you actually implement the Azure provider, then make the hook
> > > interface, and then show us both and we can see what to do with it.
> >
> > To add a data point here, I've rebased my OAUTHBEARER experiment [1] on
> > top of this patchset. (That should work with Azure's OIDC provider, and
> > if it doesn't, I'd like to know why.)
> 
> Firstly, thanks for doing this. It helps to have another data point and the
> feedback you provided is very valuable. I've looked to address it with the
> patchset attached to this email.
> 
> This patch-set adds the following:
> 
> * Allow multiple custom auth providers to be registered (Addressing
> feedback from Aleksander and Andrew)
> * Modify the test extension to use SCRAM to exchange secrets (Based on
> Andres's suggestion)
> * Add support for custom auth options to configure provider's behavior (by
> exposing a new hook) (Required by OAUTHBEARER)
> * Allow custom auth methods to use usermaps. (Required by OAUTHBEARER)

That we're having to extend this quite a bit to work for the proposed
OAUTH patches and that it still doesn't do anything for the client side
(note that the OAUTHBEARER patches are still patching libpq to add
support directly and surely will still be even with this latest patch
set ...) makes me seriously doubt that this is the right direction to be
going in.

> > After the port, here are the changes I still needed to carry in the
> > backend to get the tests passing:
> >
> > - I needed to add custom HBA options to configure the provider.
> 
> Could you try to rebase your patch to use the options hook and let me know
> if it satisfies your requirements?
> 
> Please let me know if there's any other feedback.

How about- if we just added OAUTH support directly into libpq and the
backend, would that work with Azure's OIDC provider?  If not, why not?
If it does, then what's the justification for trying to allow custom
backend-only authentication methods?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Issue with pg_stat_subscription_stats

2022-03-16 Thread Masahiko Sawada
On Wed, Mar 16, 2022 at 8:51 PM Amit Kapila  wrote:
>
> On Tue, Mar 15, 2022 at 10:09 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Mar 15, 2022 at 3:34 AM Melanie Plageman
> >  wrote:
> > >
> > > On Mon, Mar 14, 2022 at 4:02 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Mar 14, 2022 at 2:05 AM Melanie Plageman
> > > >  wrote:
> > > > >
> > > > > On Sat, Mar 12, 2022 at 3:15 PM Andres Freund  
> > > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On 2022-03-12 08:28:35 +0530, Amit Kapila wrote:
> > > > > > > On Sat, Mar 12, 2022 at 2:14 AM Melanie Plageman
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > So, I noticed that pg_stat_reset_subscription_stats() wasn't 
> > > > > > > > working
> > > > > > > > properly, and, upon further investigation, I'm not sure the view
> > > > > > > > pg_stat_subscription_stats is being properly populated.
> > > > > > > >
> > > > > > >
> > > > > > > I have tried the below scenario based on this:
> > > > > > > Step:1 Create some data that generates conflicts and lead to apply
> > > > > > > failures and then check in the view:
> > > > > >
> > > > > > I think the problem is present when there was *no* conflict
> > > > > > previously. Because nothing populates the stats entry without an 
> > > > > > error, the
> > > > > > reset doesn't have anything to set the stats_reset field in, which 
> > > > > > then means
> > > > > > that the stats_reset field is NULL even though stats have been 
> > > > > > reset.
> > > > >
> > > > > Yes, this is what I meant. stats_reset is not initialized and without
> > > > > any conflict happening to populate the stats, after resetting the 
> > > > > stats,
> > > > > the field still does not get populated. I think this is a bit
> > > > > unexpected.
> > > > >
> > > > > psql (15devel)
> > > > > Type "help" for help.
> > > > >
> > > > > mplageman=# select * from pg_stat_subscription_stats ;
> > > > >  subid | subname | apply_error_count | sync_error_count | stats_reset
> > > > > ---+-+---+--+-
> > > > >  16398 | mysub   | 0 |0 |
> > > > > (1 row)
> > > > >
> > > > > mplageman=# select pg_stat_reset_subscription_stats(16398);
> > > > >  pg_stat_reset_subscription_stats
> > > > > --
> > > > >
> > > > > (1 row)
> > > > >
> > > > > mplageman=# select * from pg_stat_subscription_stats ;
> > > > >  subid | subname | apply_error_count | sync_error_count | stats_reset
> > > > > ---+-+---+--+-
> > > > >  16398 | mysub   | 0 |0 |
> > > > > (1 row)
> > > > >
> > > >
> > > > Looking at other statistics such as replication slots, shared stats,
> > > > and SLRU stats, it makes sense that resetting it populates the stats.
> > > > So we need to fix this issue.
> > > >
> > > > However, I think the proposed fix has two problems; it can create an
> > > > entry for non-existing subscriptions if the user directly calls
> > > > function pg_stat_get_subscription_stats(), and stats_reset value is
> > > > not updated in the stats file as it is not done by the stats
> > > > collector.
> > >
> > > You are right. My initial patch was incorrect.
> > >
> > > Thinking about it more, the initial behavior is technically the same for
> > > pg_stat_database. It is just that I didn't notice because you end up
> > > creating stats for pg_stat_database so quickly that you usually never
> > > see them before.
> > >
> > > In pg_stat_get_db_stat_reset_time():
> > >
> > > if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
> > > result = 0;
> > > else
> > > result = dbentry->stat_reset_timestamp;
> > >
> > > if (result == 0)
> > > PG_RETURN_NULL();
> > > else
> > > PG_RETURN_TIMESTAMPTZ(result);
> > >
> > > and in pgstat_recv_resetcounter():
> > >
> > > dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
> > >
> > > if (!dbentry)
> > > return;
> > >
> > > Thinking about it now, though, maybe an alternative solution would be to
> > > have all columns or all columns except the subid/subname or dbname/dboid
> > > be NULL until the statistics have been created, at which point the
> > > reset_timestamp is populated with the current timestamp.
> >
> > It's true that stats_reset is NULL if the statistics of database are
> > not created yet.
> >
>
> So, if the behavior is the same as pg_stat_database, do we really want
> to change anything in this regard?

Both pg_stat_database and pg_stat_subscription_stats work similarly in
principle but they work differently in practice since there are more
chances to create the database stats entry such as connections,
disconnections, and autovacuum than the subscription stats entry. I
think that the issue reported by Melanie is valid and perhaps most
users would expect the same behavior as other statistics.

Regards,

-- 
Masahiko Sawada
EDB:  

Re: Unhyphenation of crash-recovery

2022-03-16 Thread Robert Haas
On Tue, Mar 15, 2022 at 9:39 PM Andres Freund  wrote:
> On March 15, 2022 6:25:09 PM PDT, Kyotaro Horiguchi  
> wrote:
> >Hello, this is a derived topic from [1], summarized as $SUBJECT.
> >
> >This just removes useless hyphens from the words
> >"(crash|emergency)-recovery". We don't have such wordings for "archive
> >recovery" This patch fixes non-user-facing texts as well as
> >user-facing ones.
>
> I don't see the point of this kind of change.

It seems like better grammar to me, although whether it is worth the
effort to go fix everything of this kind is certainly debatable.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-16 Thread Stephen Frost
Greetings,

* Bharath Rupireddy (bharath.rupireddyforpostg...@gmail.com) wrote:
> On Tue, Mar 15, 2022 at 7:21 AM Bharath Rupireddy
>  wrote:
> >
> > On Mon, Mar 14, 2022 at 8:25 PM Stephen Frost  wrote:
> > >
> > > > As this patch is currently written, pg_monitor has access these
> > > > functions, though I don't think that's the right privilege level at
> > > > least for pg_get_raw_wal_record().
> > >
> > > Yeah, I agree that pg_monitor isn't the right thing for such a function
> > > to be checking.
> >
> > On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis  wrote:
> > >
> > > * pg_get_raw_wal_record() seems too powerful for pg_monitor. Maybe that
> > > function should require pg_read_server_files? Or at least
> > > pg_read_all_data?
> >
> > The v9 patch set posted at [1] grants execution on
> > pg_get_raw_wal_record() to the pg_monitor role.
> >
> > pg_read_all_data may not be the right choice, but pg_read_server_files
> > is as these functions do read the WAL files on the server. If okay,
> > I'm happy to grant execution on pg_get_raw_wal_record() to the
> > pg_read_server_files role.
> >
> > Thoughts?
> >
> > [1] 
> > https://www.postgresql.org/message-id/CALj2ACVRH-z8mZLyFkpLvY4qRhxQCqU_BLkFTtwt%2BTPZNhfEVg%40mail.gmail.com
> 
> Attaching v10 patch set which allows pg_get_raw_wal_record to be
> executed by either superuser or users with pg_read_server_files role,
> no other change from v9 patch set.

In a quick look, that seems reasonable to me.  If folks want to give out
access to this function individually they're also able to do so, which
is good.  Doesn't seem worthwhile to introduce a new predefined role for
this one function.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: ICU for global collation

2022-03-16 Thread Peter Eisentraut

On 15.03.22 18:28, Robert Haas wrote:

> On Tue, Mar 15, 2022 at 12:58 PM Peter Eisentraut wrote:
>
>> On 14.03.22 19:57, Robert Haas wrote:
>>
>>> 1. What will happen if I set the ICU collation to something that
>>> doesn't match the libc collation? How bad are the consequences?
>>
>> These are unrelated, so there are no consequences.
>
> Can you please elaborate on this?


The code that is aware of ICU generally works like this:

if (locale_provider == ICU)
  result = call ICU code
else
  result = call libc code
return result
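
Concretely, that dispatch tends to look like the following sketch; the field
and function names are assumptions based on pg_locale.h rather than a quote
of any particular caller:

if (mylocale && mylocale->provider == COLLPROVIDER_ICU)
{
	/* ICU path, e.g. ucol_strcollUTF8() on the collator carried in mylocale */
}
else
{
	/* libc path, e.g. strcoll_l() with the locale_t carried in mylocale */
}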

However, there is code out there, both within PostgreSQL itself and in 
extensions, that does not do that yet.  Ideally, we would eventually 
change all that over, but it's not happening now.  So we ought to 
preserve the ability to set the libc to keep that legacy code working 
for now.


This legacy code by definition doesn't know about ICU, so it doesn't 
care whether the ICU setting "matches" the libc setting or anything like 
that.  It will just do its thing depending on its own setting.


The only consequence of settings that don't match is that the different 
pieces of code behave semantically inconsistently (e.g., some routine 
thinks the data is Greek and other code thinks the data is French).  But 
that's up to the user to set correctly.  And the actual scenarios where 
you can actually do anything semantically relevant this way are very 
limited.


A second point is that the LC_CTYPE setting tells other parts of libc 
what the current encoding is.  This affects gettext for example.  So you 
need to set this to something sensible even if you don't use libc locale 
routines otherwise.



>>> 2. If I want to avoid a mismatch between the two, then I will need a
>>> way to figure out which libc collation corresponds to a given ICU
>>> collation. How do I do that?
>>
>> You can specify the same name for both.
>
> Hmm. If every name were valid in both systems, I don't think you'd be
> proposing two fields.


Earlier versions of this patch and predecessor patches indeed had common 
fields.  But in fact the two systems accept different values if you want 
to delve into the advanced features.  But for basic usage something like 
"en_US" will work for both.





Re: Corruption during WAL replay

2022-03-16 Thread Robert Haas
On Wed, Mar 16, 2022 at 1:14 AM Kyotaro Horiguchi
 wrote:
> storage.c:
> +* Make sure that a concurrent checkpoint can't complete while 
> truncation
> +* is in progress.
> +*
> +* The truncation operation might drop buffers that the checkpoint
> +* otherwise would have flushed. If it does, then it's essential that
> +* the files actually get truncated on disk before the checkpoint 
> record
> +* is written. Otherwise, if replay begins from that checkpoint, the
> +* to-be-truncated buffers might still exist on disk but have older
> +* contents than expected, which can cause replay to fail. It's OK for
> +* the buffers to not exist on disk at all, but not for them to have 
> the
> +* wrong contents.
>
> FWIW, this seems like slightly confusing between buffer and its
> content.  I can read it correctly so I don't mind if it is natural
> enough.

Hmm. I think the last two instances of "buffers" in this comment
should actually say "blocks".

> I'll try that, if you are already working on it, please inform me. (It
> may more than likely be too late..)

If you want to take a crack at that, I'd be delighted.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Optimize external TOAST storage

2022-03-16 Thread Robert Haas
On Tue, Mar 15, 2022 at 5:48 PM Nathan Bossart  wrote:
> I apologize for thinking out loud a bit here, but I hope this gives you
> some insight into my perspective on this.  In general, I am skeptical that
> we can choose one threshold that will work for all PostgreSQL installations
> in the known universe.

I would like to take a slightly contrary position. I think that a
system here that involves multiple knobs is going to be too
complicated to be of any real-world use, because almost nobody will
understand it or tune it properly for their installation. And who is
to say that a single setting would even be appropriate for a whole
installation, as opposed to something that needs to be tuned for each
individual table? A single tunable might be OK, but what I would
really like here is a heuristic that, while perhaps not optimal in
every environment, is good enough that a very high percentage of users
will never need to worry about it. In a case like this, a small gain
that happens automatically feels better than a large gain that
requires manual tweaking in every install.

Now that doesn't answer the question of what that heuristic should be.
I initially thought that if compression didn't end up reducing the
number of TOAST chunks then there was no point, but that's not true,
because having the last chunk be smaller than the rest can allow us to
fit more than 4 into a page rather than exactly 4. However, this
effect isn't likely to be important if most chunks are full-size
chunks. If we insert 100 full-size chunks and 1 partial chunk at the
end, we can't save much even if that chunk ends up being 1 byte
instead of a full-size chunk. 25 TOAST table pages out of 26 are going
to be full of full-size chunks either way, so we can't save more than
~4% and we might easily save nothing at all. As you say, the potential
savings are larger as the values get smaller, because a higher
proportion of the TOAST table pages are not quite full. So I think the
only cases where it's even interesting to think about suppressing this
optimization are those where the value is quite small -- and maybe the
thing to do is just pick a conservatively large threshold and call it
good.

For example, say that instead of applying this technique when there
are at least 2 TOAST chunks, we apply it when there are at least 500
(i.e. normally 1MB). It's hard for me to imagine that we can ever
lose, and in fact I think we could come down quite a bit from there
and still end up winning pretty much all the time. Setting the
threshold to, say, 50 or 100 or 150 TOAST chunks instead of 2 may
leave some money on the table, but if it means that we don't have
meaningful downsides and we don't have tunables for users to fiddle
with, it might be worth it.
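
As a rough illustration of that kind of guard, a sketch follows; the macro
name and the 500-chunk figure are placeholders taken from the discussion,
not from any patch:

/*
 * Sketch only: consider skipping compression only when the value is large
 * enough that the potential waste from storing it uncompressed is negligible.
 */
#define EXTERNAL_UNCOMPRESSED_MIN_CHUNKS	500		/* illustrative */

static inline bool
worth_skipping_compression(Size rawsize)
{
	return rawsize >= (Size) EXTERNAL_UNCOMPRESSED_MIN_CHUNKS * TOAST_MAX_CHUNK_SIZE;
}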

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Logical replication timeout problem

2022-03-16 Thread Masahiko Sawada
On Wed, Mar 16, 2022 at 11:57 AM wangw.f...@fujitsu.com
 wrote:
>
> On Wed, Mar 9, 2022 at 2:45 PM Masahiko Sawada  wrote:
> >
> Thanks for your comments.
>
> > On Wed, Mar 9, 2022 at 10:26 AM I wrote:
> > > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada 
> > wrote:
> > > > I've looked at the patch and have a question:
> > > Thanks for your review and comments.
> > >
> > > > +void
> > > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> > > > +static int skipped_changes_count = 0;
> > > > +
> > > > +/*
> > > > + * skipped_changes_count is reset when processing changes that 
> > > > do
> > not
> > > > + * need to be skipped.
> > > > + */
> > > > +if (!skipped)
> > > > +{
> > > > +skipped_changes_count = 0;
> > > > +return;
> > > > +}
> > > > +
> > > > +/*
> > > > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> > > > changes, try to send a
> > > > + * keepalive message.
> > > > + */
> > > > +#define SKIPPED_CHANGES_THRESHOLD 1
> > > > +
> > > > +if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> > > > +{
> > > > +/* Try to send a keepalive message. */
> > > > +OutputPluginUpdateProgress(ctx, true);
> > > > +
> > > > +/* After trying to send a keepalive message, reset the 
> > > > flag. */
> > > > +skipped_changes_count = 0;
> > > > +}
> > > > +}
> > > >
> > > > Since we send a keepalive after continuously skipping 1 changes, the
> > > > originally reported issue can still occur if skipping 1 changes 
> > > > took more
> > than
> > > > the timeout and the walsender didn't send any change while that, is that
> > right?
> > > Yes, theoretically so.
> > > But after testing, I think this value should be conservative enough not to
> > reproduce
> > > this bug.
> >
> > But it really depends on the workload, the server condition, and the
> > timeout value, right? The logical decoding might involve disk I/O much
> > to spill/load intermediate data and the system might be under the
> > high-load condition. Why don't we check both the count and the time?
> > That is, I think we can send a keep-alive either if we skipped 1
> changes or if we didn't send anything for wal_sender_timeout / 2.
> Yes, you are right.
> Do you mean that when skipping every change, check if it has been more than
> (wal_sender_timeout / 2) without sending anything?
> IIUC, I tried to send keep-alive messages based on time before[1], but after
> testing, I found that it brings slight overhead. So I am not sure, in a
> function(pgoutput_change) that is invoked frequently, should this kind of
> overhead be introduced?
>
> > Also, the patch changes the current behavior of wal senders; with the
> > patch, we send keep-alive messages even when wal_sender_timeout = 0.
> > But I'm not sure it's a good idea. The subscriber's
> > wal_receiver_timeout might be lower than wal_sender_timeout. Instead,
> > I think it's better to periodically check replies and send a reply to
> > the keep-alive message sent from the subscriber if necessary, for
> > example, every 1 skipped changes.
> Sorry, I could not follow what you said. I am not sure, do you mean the
> following?
> 1. When we didn't sent anything for (wal_sender_timeout / 2) or we skipped
> 1 changes continuously, we will invoke the function WalSndKeepalive in the
> function WalSndUpdateProgress, and send a keepalive message to the subscriber
> with requesting an immediate reply.
> 2. If after sending a keepalive message, and then 1 changes are skipped
> continuously again. In this case, we need to handle the reply from the
> subscriber-side when processing the 1th change. The handling approach is 
> to
> reply to the confirmation message from the subscriber.

After more thought, can we check only wal_sender_timeout without
skip-count? That is, in WalSndUpdateProgress(), if we have received
any reply from the subscriber in the last (wal_sender_timeout / 2), we
don't need to do anything in terms of keep-alive. If not, we do
ProcessRepliesIfAny() (and probably WalSndCheckTimeOut()?) then
WalSndKeepalivesIfNecessary(). That way, we can send keep-alive
messages every (wal_sender_timeout / 2). And since we don't call them
for every change, we would not need to worry about the overhead much.
Actually, WalSndWriteData() does similar things; even in the case
where we don't skip consecutive changes (i.e., sending consecutive
changes to the subscriber), we do ProcessRepliesIfAny() at least every
(wal_sender_timeout / 2). I think this would work in most common cases
where the user sets both wal_sender_timeout and wal_receiver_timeout
to the same value.
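
A rough sketch of that flow; the helper below is hypothetical, though the
functions and the last_reply_timestamp variable it uses are my reading of
the existing walsender.c code:

/*
 * Sketch only: called from WalSndUpdateProgress(); do nothing unless we have
 * heard nothing from the subscriber for more than wal_sender_timeout / 2.
 */
static void
maybe_process_replies_and_keepalive(void)
{
	TimestampTz now = GetCurrentTimestamp();

	if (wal_sender_timeout <= 0)
		return;

	if (!TimestampDifferenceExceeds(last_reply_timestamp, now,
									wal_sender_timeout / 2))
		return;				/* heard from the subscriber recently enough */

	ProcessRepliesIfAny();
	WalSndCheckTimeOut();
	WalSndKeepaliveIfNecessary();
}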

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Tab completion for SET TimeZone

2022-03-16 Thread Tom Lane
=?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?=  writes:
> I just noticed I left out the = in the match check, here's an updated
> patch that fixes that.

Hmm ... is that actually going to be useful in that form?
Most time zone names contain slashes and will therefore require
single-quoting.  I think you might need pushups comparable to
COMPLETE_WITH_ENUM_VALUE.

Also, personally, I'd rather not smash the names to lower case.
I think that's a significant decrement of readability.

regards, tom lane




Re: Tab completion for SET TimeZone

2022-03-16 Thread Dagfinn Ilmari Mannsåker
Dagfinn Ilmari Mannsåker  writes:

> Hi hackers,
>
> I noticed there was no tab completion for time zones in psql, so here's
> a patch that implements that.

I just noticed I left out the = in the match check, here's an updated
patch that fixes that.

- ilmari

>From bcfa40ff3a6702e1bd7112eeaecfde87efaa43c2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Wed, 16 Mar 2022 12:52:21 +
Subject: [PATCH v2] =?UTF-8?q?Add=20tab=20completion=20for=20SET=20TimeZon?=
 =?UTF-8?q?e=20TO=20=E2=80=A6?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Using verbatim and lower-case for maximum convenience.
---
 src/bin/psql/tab-complete.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 17172827a9..d68d001085 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1105,6 +1105,13 @@ static const SchemaQuery Query_for_trigger_of_table = {
 "   FROM pg_catalog.pg_cursors "\
 "  WHERE name LIKE '%s'"
 
+#define Query_for_list_of_timezones \
+" SELECT name FROM ("\
+"   SELECT pg_catalog.lower(name) AS name "\
+" FROM pg_catalog.pg_timezone_names() "\
+"  ) ss "\
+"  WHERE name LIKE '%s'"
+
 /*
  * These object types were introduced later than our support cutoff of
  * server version 9.2.  We use the VersionedQuery infrastructure so that
@@ -4171,6 +4178,8 @@ psql_completion(const char *text, int start, int end)
 	 " AND nspname NOT LIKE E'pg_temp%%'",
 	 "DEFAULT");
 		}
+		else if (TailMatches("TimeZone", "TO|="))
+			COMPLETE_WITH_QUERY_VERBATIM_PLUS(Query_for_list_of_timezones, "DEFAULT");
 		else
 		{
 			/* generic, type based, GUC support */
-- 
2.30.2



Tab completion for SET TimeZone

2022-03-16 Thread Dagfinn Ilmari Mannsåker
Hi hackers,

I noticed there was no tab completion for time zones in psql, so here's
a patch that implements that.  I chose lower-casing the names since they
are case insensitive, and verbatim since ones without slashes can be
entered without quotes, and (at least my version of) readline completes
them correctly if the entered value starts with a single quote.

- ilmari

>From 96ce5843641d990b278c0038f5b97941df6e73a9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Wed, 16 Mar 2022 12:52:21 +
Subject: [PATCH] =?UTF-8?q?Add=20tab=20completion=20for=20SET=20TimeZone?=
 =?UTF-8?q?=20TO=20=E2=80=A6?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Using verbatim and lower-case for maximum convenience.
---
 src/bin/psql/tab-complete.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 17172827a9..58e62a5e6e 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1105,6 +1105,13 @@ static const SchemaQuery Query_for_trigger_of_table = {
 "   FROM pg_catalog.pg_cursors "\
 "  WHERE name LIKE '%s'"
 
+#define Query_for_list_of_timezones \
+" SELECT name FROM ("\
+"   SELECT pg_catalog.lower(name) AS name "\
+" FROM pg_catalog.pg_timezone_names() "\
+"  ) ss "\
+"  WHERE name LIKE '%s'"
+
 /*
  * These object types were introduced later than our support cutoff of
  * server version 9.2.  We use the VersionedQuery infrastructure so that
@@ -4171,6 +4178,8 @@ psql_completion(const char *text, int start, int end)
 	 " AND nspname NOT LIKE E'pg_temp%%'",
 	 "DEFAULT");
 		}
+		else if (TailMatches("TimeZone", "TO|"))
+			COMPLETE_WITH_QUERY_VERBATIM_PLUS(Query_for_list_of_timezones, "DEFAULT");
 		else
 		{
 			/* generic, type based, GUC support */
-- 
2.30.2



Re: ExecTypeSetColNames is fundamentally broken

2022-03-16 Thread Aleksander Alekseev
Hi hackers,

> Anyone else have thoughts?

I came across this thread while looking for the patches that need review.

My understanding of the code is limited, but I can say that I don't
see anything particularly wrong with it. I can also confirm that it
fixes the problem reported by the user while passing the rest of the
tests.

I understand the concern expressed by Robert with respect to backward
compatibility. From the user's perspective, personally I would prefer
a fixed bug over backward compatibility, especially if we consider the
fact that the current behaviour of cases like `select row_to_json(i)
from int8_tbl i(x,y)` is not necessarily the correct one; at least it
doesn't look right to me.
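
For concreteness, this is the case in question (int8_tbl is the
regression-test table whose real columns are q1 and q2; x and y are the
FROM-clause aliases):

    SELECT row_to_json(i) FROM int8_tbl i(x, y);

Depending on which behaviour wins, the JSON keys come out either as the real
column names ({"q1": ..., "q2": ...}) or as the aliases ({"x": ..., "y": ...});
the point of the thread is to make that choice consistent.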

So unless anyone has strong objections against the proposed fix or can
propose a better patch, I would suggest merging it.

-- 
Best regards,
Aleksander Alekseev




Declare PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY for aarch64

2022-03-16 Thread Yura Sokolov
Good day, hackers.

Architecture Reference Manual for ARMv8 B2.2.1 [1] states:

  For explicit memory effects generated from an Exception level the
  following rules apply:
  - A read that is generated by a load instruction that loads a single
  general-purpose register and is aligned to the size of the read in the
  instruction is single-copy atomic.
  - A write that is generated by a store instruction that stores a single
  general-purpose register and is aligned to the size of the write in the
  instruction is single-copy atomic.

So I believe it is safe to define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
for aarch64.

[1] https://documentation-service.arm.com/static/61fbe8f4fa8173727a1b734e
https://developer.arm.com/documentation/ddi0487/latest

---

regards

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From b61b065acba570f1f935ff6ca17dc687c45db2f2 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Wed, 16 Mar 2022 15:12:07 +0300
Subject: [PATCH v0] Declare aarch64 has single copy atomicity for 8 byte
 values.

Architecture Reference Manual for ARMv8 B2.2.1 [1]

For explicit memory effects generated from an Exception level the
following rules apply:
- A read that is generated by a load instruction that loads a single
general-purpose register and is aligned to the size of the read in the
instruction is single-copy atomic.
- A write that is generated by a store instruction that stores a single
general-purpose register and is aligned to the size of the write in the
instruction is single-copy atomic.

[1] https://documentation-service.arm.com/static/61fbe8f4fa8173727a1b734e
https://developer.arm.com/documentation/ddi0487/latest
---
 src/include/port/atomics/arch-arm.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/include/port/atomics/arch-arm.h b/src/include/port/atomics/arch-arm.h
index 2083e3230db..9fe8f1b95f7 100644
--- a/src/include/port/atomics/arch-arm.h
+++ b/src/include/port/atomics/arch-arm.h
@@ -23,4 +23,10 @@
  */
 #if !defined(__aarch64__) && !defined(__aarch64)
 #define PG_DISABLE_64_BIT_ATOMICS
+#else
+/*
+ * Architecture Reference Manual for ARMv8 states aligned read/write to/from
+ * general purpose register is atomic.
+ */
+#define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
 #endif  /* __aarch64__ || __aarch64 */
-- 
2.35.1



Re: Issue with pg_stat_subscription_stats

2022-03-16 Thread Amit Kapila
On Tue, Mar 15, 2022 at 10:09 AM Masahiko Sawada  wrote:
>
> On Tue, Mar 15, 2022 at 3:34 AM Melanie Plageman
>  wrote:
> >
> > On Mon, Mar 14, 2022 at 4:02 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Mar 14, 2022 at 2:05 AM Melanie Plageman
> > >  wrote:
> > > >
> > > > On Sat, Mar 12, 2022 at 3:15 PM Andres Freund  
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On 2022-03-12 08:28:35 +0530, Amit Kapila wrote:
> > > > > > On Sat, Mar 12, 2022 at 2:14 AM Melanie Plageman
> > > > > >  wrote:
> > > > > > >
> > > > > > > So, I noticed that pg_stat_reset_subscription_stats() wasn't 
> > > > > > > working
> > > > > > > properly, and, upon further investigation, I'm not sure the view
> > > > > > > pg_stat_subscription_stats is being properly populated.
> > > > > > >
> > > > > >
> > > > > > I have tried the below scenario based on this:
> > > > > > Step:1 Create some data that generates conflicts and lead to apply
> > > > > > failures and then check in the view:
> > > > >
> > > > > I think the problem is present when there was *no* conflict
> > > > > previously. Because nothing populates the stats entry without an 
> > > > > error, the
> > > > > reset doesn't have anything to set the stats_reset field in, which 
> > > > > then means
> > > > > that the stats_reset field is NULL even though stats have been reset.
> > > >
> > > > Yes, this is what I meant. stats_reset is not initialized and without
> > > > any conflict happening to populate the stats, after resetting the stats,
> > > > the field still does not get populated. I think this is a bit
> > > > unexpected.
> > > >
> > > > psql (15devel)
> > > > Type "help" for help.
> > > >
> > > > mplageman=# select * from pg_stat_subscription_stats ;
> > > >  subid | subname | apply_error_count | sync_error_count | stats_reset
> > > > -------+---------+-------------------+------------------+-------------
> > > >  16398 | mysub   |                 0 |                0 |
> > > > (1 row)
> > > >
> > > > mplageman=# select pg_stat_reset_subscription_stats(16398);
> > > >  pg_stat_reset_subscription_stats
> > > > --
> > > >
> > > > (1 row)
> > > >
> > > > mplageman=# select * from pg_stat_subscription_stats ;
> > > >  subid | subname | apply_error_count | sync_error_count | stats_reset
> > > > -------+---------+-------------------+------------------+-------------
> > > >  16398 | mysub   |                 0 |                0 |
> > > > (1 row)
> > > >
> > >
> > > Looking at other statistics such as replication slots, shared stats,
> > > and SLRU stats, it makes sense that resetting it populates the stats.
> > > So we need to fix this issue.
> > >
> > > However, I think the proposed fix has two problems; it can create an
> > > entry for non-existing subscriptions if the user directly calls
> > > function pg_stat_get_subscription_stats(), and stats_reset value is
> > > not updated in the stats file as it is not done by the stats
> > > collector.
> >
> > You are right. My initial patch was incorrect.
> >
> > Thinking about it more, the initial behavior is technically the same for
> > pg_stat_database. It is just that I didn't notice because you end up
> > creating stats for pg_stat_database so quickly that you usually never
> > see them before.
> >
> > In pg_stat_get_db_stat_reset_time():
> >
> > if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
> > result = 0;
> > else
> > result = dbentry->stat_reset_timestamp;
> >
> > if (result == 0)
> > PG_RETURN_NULL();
> > else
> > PG_RETURN_TIMESTAMPTZ(result);
> >
> > and in pgstat_recv_resetcounter():
> >
> > dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
> >
> > if (!dbentry)
> > return;
> >
> > Thinking about it now, though, maybe an alternative solution would be to
> > have all columns or all columns except the subid/subname or dbname/dboid
> > be NULL until the statistics have been created, at which point the
> > reset_timestamp is populated with the current timestamp.
>
> It's true that stats_reset is NULL if the statistics of database are
> not created yet.
>

So, if the behavior is the same as pg_stat_database, do we really want
to change anything in this regard?

-- 
With Regards,
Amit Kapila.




Re: Granting SET and ALTER SYSTEM privileges for GUCs

2022-03-16 Thread Andrew Dunstan


On 3/15/22 16:59, Mark Dilger wrote:
>> On Mar 6, 2022, at 3:27 PM, Tom Lane  wrote:
>>
>> Mark Dilger  writes:
>>> The existing patch allows grants on unknown gucs, because it can't know 
>>> what guc an upgrade script will introduce, and the grant statement may need 
>>> to execute before the guc exists.
>> Yeah, that's the problematic case.  It might mostly work to assume that
>> an unknown GUC has an empty default ACL.  This could fail to retain the
>> default PUBLIC SET permission if it later turns out the GUC is USERSET
> On further reflection, I concluded this isn't needed.  No current extension, 
> whether in-core or third party, expects to be able to create a new GUC and 
> then grant or revoke permissions on it.  They can already specify the guc 
> context (PGC_USERS, etc).  Introducing a feature that depends on the dubious 
> assumption that unrecognized GUCs will turn out to be USERSET doesn't seem 
> warranted.

Agreed.


>
> The patch attributes all grants of setting privileges to the bootstrap 
> superuser.  Only superusers can grant or revoke privileges on settings, and 
> all settings are implicitly owned by the bootstrap superuser because there is 
> no explicit owner associated with settings.  Consequently, 
> select_best_grantor(some_superuser, ..., BOOTSTRAP_SUPERUSERID, ...) always 
> chooses the bootstrap superuser.  I don't see a problem with this, but 
> wouldn't mind a second opinion.  Some people might find it surprising when 
> viewing the pg_setting_acl.setacl field.


I think it's OK as long as we document it. An alternative might be to
invent a pseudo-superuser called, say, 'postgres_system', but that seems
like overkill to solve what is in effect a cosmetic problem.

Generally I think this is now in fairly good shape. I've played with it
and it seems to do what I expect in every case, and the things I found
surprising are gone.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





unnecessary (same) restart_lsn handing in LogicalIncreaseRestartDecodingForSlot

2022-03-16 Thread Ashutosh Bapat
Hi All,
At the beginning of LogicalIncreaseRestartDecodingForSlot(), we have this code:
```
1623 /* don't overwrite if have a newer restart lsn */
1624 if (restart_lsn <= slot->data.restart_lsn)
1625 {
1626 }
1627
1628 /*
1629  * We might have already flushed far enough to directly
accept this lsn,
1630  * in this case there is no need to check for existing candidate LSNs
1631  */
1632 else if (current_lsn <= slot->data.confirmed_flush)
1633 {
1634 slot->candidate_restart_valid = current_lsn;
1635 slot->candidate_restart_lsn = restart_lsn;
1636
1637 /* our candidate can directly be used */
1638 updated_lsn = true;
1639 }
```
This code avoids directly writing slot's persistent data multiple
times if the restart_lsn does not change between successive running
transactions WAL records and the confirmed flush LSN is higher than
all of those.

But if the downstream is fast enough to send at least one newer
confirmed flush between every two running transactions WAL records, we
end up overwriting slot's restart LSN even if it does not change
because of following code
```
1641 /*
1642  * Only increase if the previous values have been applied, otherwise we
1643  * might never end up updating if the receiver acks too
slowly. A missed
1644  * value here will just cause some extra effort after reconnecting.
1645  */
1646 if (slot->candidate_restart_valid == InvalidXLogRecPtr)
1647 {
1648 slot->candidate_restart_valid = current_lsn;
1649 slot->candidate_restart_lsn = restart_lsn;
1650 SpinLockRelease(&slot->mutex);
1651
1652 elog(LOG, "got new restart lsn %X/%X at %X/%X",
1653  LSN_FORMAT_ARGS(restart_lsn),
1654  LSN_FORMAT_ARGS(current_lsn));
1655 }
```

Let's say RLSN is the restart LSN computed when processing successive
running transaction WAL records at RT_LSN1, RT_LSN2, RT_LSN3 
Let's say downstream sends confirmed flush  LSNs CF_LSN1, CF_LSN2,
CF_LSN3, ... such that RT_LSNx <= CF_LSNx <= RT_LSN(x+1). CF_LSNx
arrives between processing records at RT_LSNx and RT_LSN(x+1). So the
candidate_restart_valid is always InvalidXLogRecPtr at that point, and we
install the same RLSN as candidate_restart_lsn again and again with a
different candidate_restart_valid.

Every installed candidate is updated in the slot by
LogicalConfirmReceivedLocation() when the next confirmed flush
arrives. Such an update also causes a disk write which looks
unnecessary.

I think the function should ignore a restart_lsn older than
data.restart_lsn right away at the beginning of this function.

-- 
Best Wishes,
Ashutosh Bapat




Re: Column Filtering in Logical Replication

2022-03-16 Thread Amit Kapila
On Tue, Mar 15, 2022 at 7:38 AM shiy.f...@fujitsu.com
 wrote:
>
> On Mon, Mar 14, 2022 5:08 AM Tomas Vondra  
> wrote:
>
> 3. src/backend/commands/publicationcmds.c
> +/*
> + * Check if all columns referenced in the column list are part of the
> + * REPLICA IDENTITY index or not.
> + *
> + * Returns true if any invalid column is found.
> + */
>
> The comment for pub_collist_contains_invalid_column() seems wrong.  Should it 
> be
> "Check if all REPLICA IDENTITY columns are covered by the column list or not"?
>

On similar lines, I think errdetail for below messages need to be changed.
ereport(ERROR,
+ (errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
+ errmsg("cannot update table \"%s\"",
+ RelationGetRelationName(rel)),
+ errdetail("Column list used by the publication does not cover the
replica identity.")));
  else if (cmd == CMD_DELETE && !pubdesc.rf_valid_for_delete)
  ereport(ERROR,
  (errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
  errmsg("cannot delete from table \"%s\"",
  RelationGetRelationName(rel)),
  errdetail("Column used in the publication WHERE expression is not
part of the replica identity.")));
+ else if (cmd == CMD_DELETE && !pubdesc.cols_valid_for_delete)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
+ errmsg("cannot delete from table \"%s\"",
+ RelationGetRelationName(rel)),
+ errdetail("Column list used by the publication does not cover the
replica identity.")));
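
To make the first error scenario concrete, a minimal sequence that should hit
the errdetail quoted above might look like this (table and publication names
are made up; the wording follows the quoted errdetail):

    CREATE TABLE t1 (id int PRIMARY KEY, payload text);
    CREATE PUBLICATION pub_cols FOR TABLE t1 (payload);  -- list omits the replica identity (id)
    UPDATE t1 SET payload = 'x';
    -- ERROR:  cannot update table "t1"
    -- DETAIL:  Column list used by the publication does not cover the replica identity.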

Some assorted comments:

1. As mentioned previously as well[1], the change in ATExecDropColumn
is not required. Similarly, the change you seem to agree upon in
logicalrep_write_update[2] doesn't seem to be present.

2. I think the dependency handling in publication_set_table_columns()
has problems. While removing existing dependencies, it uses
PublicationRelationId as classId whereas while adding new dependencies
it uses PublicationRelRelationId as classId. This will create problems
while removing columns from table. For example,
postgres=# create table t1(c1 int, c2 int, c3 int);
CREATE TABLE
postgres=# create publication pub1 for table t1(c1, c2);
CREATE PUBLICATION
postgres=# select * from pg_depend where classid = 6106 or refclassid
= 6106 or classid = 6104;
 classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
    6106 | 16409 |        0 |       1259 |    16405 |           1 | a
    6106 | 16409 |        0 |       1259 |    16405 |           2 | a
    6106 | 16409 |        0 |       6104 |    16408 |           0 | a
    6106 | 16409 |        0 |       1259 |    16405 |           0 | a
(4 rows)

Till here everything is fine.

postgres=# Alter publication pub1 alter table t1 set columns(c2);
ALTER PUBLICATION
postgres=# select * from pg_depend where classid = 6106 or refclassid
= 6106 or classid = 6104;
 classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
    6106 | 16409 |        0 |       1259 |    16405 |           1 | a
    6106 | 16409 |        0 |       1259 |    16405 |           2 | a
    6106 | 16409 |        0 |       6104 |    16408 |           0 | a
    6106 | 16409 |        0 |       1259 |    16405 |           0 | a
    6106 | 16409 |        0 |       1259 |    16405 |           2 | a
(5 rows)

Now without removing dependencies for columns 1 and 2, it added a new
dependency for column 2.

3.
@@ -930,8 +1054,24 @@ copy_table(Relation rel)
...
+ for (int i = 0; i < lrel.natts; i++)
+ {
+ if (i > 0)
+ appendStringInfoString(, ", ");
+
+ appendStringInfoString(, quote_identifier(lrel.attnames[i]));
+ }
...
...
for (int i = 0; i < lrel.natts; i++)
{
appendStringInfoString(, quote_identifier(lrel.attnames[i]));
if (i < lrel.natts - 1)
appendStringInfoString(, ", ");
}

In the same function, we use two different styles to achieve the same
thing. I think it is better to use the same style (probably existing)
at both places for the sake of consistency.

4.
+  
+   The ALTER TABLE ... SET COLUMNS variant allows changing
+   the set of columns that are included in the publication.  If a column list
+   is specified, it must include the replica identity columns.
+  

I think the second part holds true only for update/delete publications.

5.
+ * XXX Should this detect duplicate columns?
+ */
+static void
+publication_translate_columns(Relation targetrel, List *columns,
+   int *natts, AttrNumber **attrs)
{
...
+ if (bms_is_member(attnum, set))
+ ereport(ERROR,
+ errcode(ERRCODE_DUPLICATE_OBJECT),
+ errmsg("duplicate column \"%s\" in publication column list",
+colname));
...
}

It seems we already detect duplicate columns in this function. So XXX
part of the comment doesn't seem to be required.

6.
+ * XXX The name is a bit misleading, because we don't really transform
+ * anything here - we merely check the column list is compatible with the
+ * definition of the publication (with 

Re: BufferAlloc: don't take two simultaneous locks

2022-03-16 Thread Yura Sokolov
On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:
> At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov  
> wrote in 
> > On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:
> > > Hmm. v8 returns stashed element with original patition index when the
> > > element is *not* reused.  But what I saw in the previous test runs is
> > > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > > behaving the same way (or somehow even worse) with the previous
> > > version.
> > 
> > v8 doesn't differ in REMOVE case neither from master nor from
> > previous version. It differs in RETURNED case only.
> > Or I didn't understand what you mean :(
> 
> In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> the freelist_idx of the new key.  v8 uses that of the old key (at the
> time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> returns the stashed)" case the stashed element is returned to its
> original partition.  But it is not what I mentioned.
> 
> On the other hand, once the stahsed element is reused by HASH_ENTER,
> it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> from old partition) case.  I suspect that ththat the frequent freelist
> starvation comes from the latter case.

Doubtful. By probability theory, a single partition is unlikely to be
too overflowed. Hence the freelist.

But! With 128kB of shared buffers there are just 32 buffers. With 32 entries
for 32 freelist partitions, some freelist partitions will certainly
have 0 entries even if all entries are in freelists.

> > > get_hash_entry continuously suffer lack of freelist
> > > entry. (FWIW, attached are the test-output diff for both master and
> > > patched)
> > > 
> > > master finally allocated 31 fresh elements for a 100s run.
> > > 
> > > > ALLOCED: 31;; freshly allocated
> > > 
> > > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > > allocated (ah, this version changes that..)
> > > Finally v8 results in:
> > > 
> > > > RETURNED: 50806;; returned stashed elements
> > > > BORROWED: 33620;; borrowed from another freelist
> > > > REUSED: 1812664;; stashed
> > > > ASSIGNED: 1762377  ;; reused
> > > >(ALLOCED: 0);; freshly allocated
> 
> (I misunderstand that v8 modified get_hash_entry's preference between
> allocation and borrowing.)
> 
> I re-ran the same check for v7 and it showed different result.
> 
> RETURNED: 1
> ALLOCED: 15
> BORROWED: 0
> REUSED: 505435
> ASSIGNED: 505462 (-27)  ## the counters are not locked.
> 
> > Is there any measurable performance hit cause of borrowing?
> > Looks like "borrowed" happened in 1.5% of time. And it is on 128kb
> > shared buffers that is extremely small. (Or it was 128MB?)
> 
> It is intentionally set small to get extremely frequent buffer
> replacements.  The point here was that the patch actually can induce
> frequent freelist starvation.  And as you do, I also doubt the
> significance of the performance hit by that.  I just was not sure.
>
> I re-ran the same for v8 and got a result largely different from the
> previous trial on the same v8.
> 
> RETURNED: 2
> ALLOCED: 0
> BORROWED: 435
> REUSED: 495444
> ASSIGNED: 495467 (-23)
> 
> Now "BORROWED" happens 0.8% of REUSED

0.08% actually :)

> 
> > Well, I think some spare entries could reduce borrowing if there is
> > a need. I'll test on 128MB with spare entries. If there is profit,
> > I'll return some, but will keep SharedBufHash fixed.
> 
> I don't doubt the benefit of this patch.  And now convinced by myself
> that the downside is negligible than the benefit.
> 
> > Master branch does fewer freelist manipulations since it tries to
> > insert first, and if there is a collision it doesn't delete the victim
> > buffer.
> > 
> > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > ...
> > > > Strange thing: both master and patched version has higher
> > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > than in first october version [1]. But lower tps at higher
> > > > connections number (>= 191 clients).
> > > > I'll try to bisect on master this unfortunate change.
> > > 
> > > The reversing of the preference order between freshly-allocation and
> > > borrow-from-another-freelist might affect.
> > 
> > `master` changed its behaviour as well.
> > It is not problem of the patch at all.
> 
> Agreed.  So I think we should go on this direction.

I've checked. Looks like something has changed on the server, since the
old master commit now behaves the same as the new one (and differently
from how it behaved in October).
I remember maintenance downtime of the server in November/December.
Probably the kernel was upgraded or some system settings were changed.

> There are some last comments on v8.
> 
> + 
> HASH_FIXED_SIZE);
> 
> Ah, now I understand that this prevented allocation of new elements.
> I think this is good to do for SharedBufHash.
> 
> 
> 
> +   

Re: Out-of-tree certificate interferes ssltest

2022-03-16 Thread Daniel Gustafsson
> On 16 Mar 2022, at 08:36, Kyotaro Horiguchi  wrote:

> I think we don't want this behavior.

Agreed.

> The attached fixes that and make-world successfully finished even if I
> have a cert file in my home directory.

Seems correct to me, thanks!

--
Daniel Gustafsson   https://vmware.com/





Re: pgcrypto: Remove internal padding implementation

2022-03-16 Thread Peter Eisentraut



Interesting.  I have added test cases about this.  Could you describe
how you arrived at the second test case?


Sure -- that second ciphertext is the result of running

 SELECT encrypt_iv('abcdefghijklmnopqrstuvwxyz', '0123456', 'abcd', 'aes');

and then incrementing the last byte of the first block (i.e. the 16th
byte) to corrupt the padding in the last block.


I have added this information to the test file.


Is there any reasonable meaning of the previous behaviors?


I definitely don't think the previous behavior was reasonable. It's
possible that someone built a strange system on top of that
unreasonable behavior, but I hope not.


Does bad padding even give correct answers on decryption?


No -- or rather, I'm not really sure how to define "correct answer" for
garbage input. I especially don't like that two different ciphertexts
can silently decrypt to the same plaintext.


What does encryption without padding even do on incorrect input sizes?


Looks like it's being silently zero-padded by the previous code, which
doesn't seem very helpful to me. The documentation says "data must be
multiple of cipher block size" for the pad:none case, though I suppose
it doesn't say what happens if you ignore the "must".


Right, the previous behaviors were clearly faulty.  I have updated the 
commit message to call out the behavior change more clearly.


This patch is now complete from my perspective.
From 633d4aa0f762bf976e3ddc077cb9c34b58eeb4d5 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 16 Mar 2022 10:58:17 +0100
Subject: [PATCH v4] pgcrypto: Remove internal padding implementation

Use the padding provided by OpenSSL instead of doing it ourselves.
The internal implementation was once applicable to the non-OpenSSL
code paths, but those have since been removed.  The padding algorithm
is still the same.

The OpenSSL padding implementation is stricter than the previous
internal one: Bad padding during decryption is now an error, and
encryption without padding now requires the input size to be a
multiple of the block size, otherwise it is also an error.
Previously, these cases silently proceeded, in spite of the
documentation saying otherwise.

Add some test cases about this, too.  (The test cases are in
rijndael.sql, but they apply to all encryption algorithms.)

Discussion: 
https://www.postgresql.org/message-id/flat/ba94c26b-0c58-c97e-7a44-f44e08b4cca2%40enterprisedb.com
---
 contrib/pgcrypto/expected/rijndael.out |  14 +++
 contrib/pgcrypto/openssl.c |  22 +++--
 contrib/pgcrypto/pgp-cfb.c |   4 +-
 contrib/pgcrypto/px.c  | 120 +
 contrib/pgcrypto/px.h  |  14 +--
 contrib/pgcrypto/sql/rijndael.sql  |  12 +++
 6 files changed, 52 insertions(+), 134 deletions(-)

diff --git a/contrib/pgcrypto/expected/rijndael.out 
b/contrib/pgcrypto/expected/rijndael.out
index ad69cdba49..015ba4430d 100644
--- a/contrib/pgcrypto/expected/rijndael.out
+++ b/contrib/pgcrypto/expected/rijndael.out
@@ -39,6 +39,12 @@ SELECT encrypt(
  \x8ea2b7ca516745bfeafc49904b496089
 (1 row)
 
+-- without padding, input not multiple of block size
+SELECT encrypt(
+'\x00112233445566778899aabbccddeeff00',
+'\x000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f',
+'aes-cbc/pad:none');
+ERROR:  encrypt error: Encryption failed
 -- key padding
 SELECT encrypt(
 '\x0011223344',
@@ -95,6 +101,14 @@ select encode(decrypt(encrypt('foo', '0123456', 'aes'), 
'0123456', 'aes'), 'esca
  foo
 (1 row)
 
+-- data not multiple of block size
+select encode(decrypt(encrypt('foo', '0123456', 'aes') || '\x00'::bytea, 
'0123456', 'aes'), 'escape');
+ERROR:  decrypt error: Decryption failed
+-- bad padding
+-- (The input value is the result of encrypt_iv('abcdefghijklmnopqrstuvwxyz', 
'0123456', 'abcd', 'aes')
+-- with the 16th byte changed (s/db/eb/) to corrupt the padding of the last 
block.)
+select 
encode(decrypt_iv('\xa21a9c15231465964e3396d32095e67eb52bab05f556a581621dee1b85385789',
 '0123456', 'abcd', 'aes'), 'escape');
+ERROR:  decrypt_iv error: Decryption failed
 -- iv
 select encrypt_iv('foo', '0123456', 'abcd', 'aes');
  encrypt_iv 
diff --git a/contrib/pgcrypto/openssl.c b/contrib/pgcrypto/openssl.c
index e236b0d79c..68fd61b716 100644
--- a/contrib/pgcrypto/openssl.c
+++ b/contrib/pgcrypto/openssl.c
@@ -369,17 +369,17 @@ gen_ossl_free(PX_Cipher *c)
 }
 
 static int
-gen_ossl_decrypt(PX_Cipher *c, const uint8 *data, unsigned dlen,
-uint8 *res)
+gen_ossl_decrypt(PX_Cipher *c, int padding, const uint8 *data, unsigned dlen,
+uint8 *res, unsigned *rlen)
 {
OSSLCipher *od = c->ptr;
-   int outlen;
+   int outlen, outlen2;
 
if (!od->init)
{
if (!EVP_DecryptInit_ex(od->evp_ctx, od->evp_ciph, NULL, NULL, 
NULL))
return PXE_CIPHER_INIT;
- 

Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-16 Thread Bharath Rupireddy
On Mon, Mar 14, 2022 at 10:34 PM Nathan Bossart
 wrote:
>
> > Attaching v3 patch which emits logs only when necessary and doesn't
> > clutter the existing "checkpoint/restartpoint completed" message, see
> > some sample logs at [1]. Please review it further.
>
> I'm okay with not adding these statistics when the number of files removed
> and sync'd is zero.

Thanks.

> IMO we shouldn't include the size of the files since it will prevent us
> from removing lstat() calls.  As noted upthread, I suspect the time spent
> in these tasks is more closely related to the number of files anyway.

Yeah. I will remove them.

> I'm -1 on splitting these new statistics to separate LOGs.  In addition to
> making it more difficult to discover statistics for a given checkpoint, I
> think it actually adds even more bloat to the server log.  If we are
> concerned about the amount of information in these LOGs, perhaps we should
> adjust the format to make it more human-readable.

Below are the ways that I can think of. Thoughts?

1)
if (restartpoint)
{
    if (snap_files_rmvd > 0 && map_files_rmvd > 0)
        existing "restartpoint complete" message + "snapshot files removed
        cnt, time, cutoff lsn" + "mapping files removed cnt, time,
        cutoff lsn"
    else if (snap_files_rmvd > 0)
        existing "restartpoint complete" message + "snapshot files removed
        cnt, time, cutoff lsn"
    else if (map_files_rmvd > 0)
        existing "restartpoint complete" message + "mapping files removed
        cnt, time, cutoff lsn"
    else
        existing "restartpoint complete" message
}
else
{
    same as above but with "checkpoint complete" instead of "restartpoint complete"
}

2) Use StringInfoData to prepare the message dynamically, but this will
create message translation problems.

3) Emit the "snapshot files removed cnt, time, cutoff lsn" and
"mapping files removed cnt, time, cutoff lsn" at LOG level if
(log_checkpoints) in CheckPointSnapBuild and
CheckPointLogicalRewriteHeap respectively.

4) Have separate log messages in LogCheckpointEnd:
if (snap_files_rmvd > 0)
    "snapshot files removed cnt, time, cutoff lsn"
if (map_files_rmvd > 0)
    "mapping files removed cnt, time, cutoff lsn"

if (restartpoint)
    existing "restartpoint complete" message
else
    existing "checkpoint complete" message

Regards,
Bharath Rupireddy.




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-16 Thread Bharath Rupireddy
On Wed, Mar 16, 2022 at 1:47 AM Nathan Bossart  wrote:
>
> On Tue, Mar 15, 2022 at 11:04:26AM +0900, Michael Paquier wrote:
> > On Mon, Mar 14, 2022 at 03:54:19PM +0530, Bharath Rupireddy wrote:
> >> At times, the snapshot or mapping files can be large in number and on
> >> some platforms it takes time for checkpoint to process all of them.
> >> Having the stats about them in server logs can help us better analyze
> >> why checkpoint took a long time and provide a better RCA.
> >
> > Do you have any numbers to share regarding that?  Seeing information
> > about 1k WAL segments being recycled and/or removed by a checkpoint
> > where the operation takes dozens of seconds to complete because we can
> > talk about hundred of gigs worth of files moved around.  If we are
> > talking about 100~200 files up to 10~20kB each for snapshot and
> > mapping files, the information has less value, worth only a portion of
> > one WAL segment.
>
> I don't have specific numbers to share, but as noted elsewhere [0], I
> routinely see lengthy checkpoints that spend a lot of time in these cleanup
> tasks.
>
> [0] https://postgr.es/m/18ED8B1F-7F5B-4ABF-848D-45916C938BC7%40amazon.com

I too don't have numbers in hand to share. But I have come across
cases where checkpoints spend a good amount of time cleaning
up/processing snapshots or mapping files.

Regards,
Bharath Rupireddy.




Re: [PATCH] pgbench: add multiconnect option

2022-03-16 Thread Fabien COELHO



Hello Greg,


It looks like David sent a patch and Fabien sent a followup patch. But
there hasn't been a whole lot of discussion or further patches.

It sounds like there are some basic questions about what the right
interface should be. Are there specific questions that would be
helpful for moving forward?


Review the designs and patches and tell us what you think?

Personally, I think that allowing multiple connections is a good thing, 
especially if the code impact is reduced, which is the case with the 
version I sent.


Then for me the next step would be to have reconnection on errors so as 
to implement a client-side failover policy that could help test the 
performance impact of a server failover. I have done that internally, but it 
requires the "Pgbench Serialization and deadlock errors" patch to land, as it 
would just be another error that can be handled.


--
Fabien.




Re: Assert in pageinspect with NULL pages

2022-03-16 Thread Michael Paquier
On Wed, Feb 23, 2022 at 12:09:02PM +0500, Daria Lepikhova wrote:
> And one more addition. In the previous version of the patch, I forgot to add
> tests for the gist index, but the described problem is also relevant for it.

So, I have looked at this second part of the thread, and concluded
that we should not fail for empty pages.  First, we fetch pages from
the buffer pool in normal mode, where empty pages are valid.  There is
also a second point in favor of doing so: code paths dedicated to hash
indexes already do that, marking such pages as simply "unused".  The
proposal from Julien upthread sounds cleaner to me though in the long
run, as NULL gives the user the possibility to do a full-table scan
with simple clauses to filter out anything found as NULL.
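
As an illustration of that last point, once empty pages come back as NULL one
can scan a whole relation and simply filter them out.  A sketch, assuming the
pageinspect extension, the BRIN index test1_a_idx from the regression test,
and the default 8kB block size:

    SELECT blkno,
           brin_page_type(get_raw_page('test1_a_idx', blkno::int)) AS page_type
    FROM generate_series(0, pg_relation_size('test1_a_idx') / 8192 - 1) AS blkno
    WHERE brin_page_type(get_raw_page('test1_a_idx', blkno::int)) IS NOT NULL;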

Painting more PageIsNew() across the place requires a bit more work
than a simple ereport(ERROR) in get_page_from_raw(), of course, but
the gain is the portability of the functions.

(One can have a lot of fun playing with random inputs and breaking
most code paths, but that's not new.)
--
Michael
diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index bf12901ac3..87e75f663c 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -58,6 +58,9 @@ brin_page_type(PG_FUNCTION_ARGS)
 
 	page = get_page_from_raw(raw_page);
 
+	if (PageIsNew(page))
+		PG_RETURN_NULL();
+
 	switch (BrinPageType(page))
 	{
 		case BRIN_PAGETYPE_META:
@@ -86,6 +89,9 @@ verify_brin_page(bytea *raw_page, uint16 type, const char *strtype)
 {
 	Page		page = get_page_from_raw(raw_page);
 
+	if (PageIsNew(page))
+		return page;
+
 	/* verify the special space says this page is what we want */
 	if (BrinPageType(page) != type)
 		ereport(ERROR,
@@ -138,6 +144,13 @@ brin_page_items(PG_FUNCTION_ARGS)
 	/* minimally verify the page we got */
 	page = verify_brin_page(raw_page, BRIN_PAGETYPE_REGULAR, "regular");
 
+	if (PageIsNew(page))
+	{
+		brin_free_desc(bdesc);
+		index_close(indexRel, AccessShareLock);
+		PG_RETURN_NULL();
+	}
+
 	/*
 	 * Initialize output functions for all indexed datatypes; simplifies
 	 * calling them later.
@@ -313,6 +326,9 @@ brin_metapage_info(PG_FUNCTION_ARGS)
 
 	page = verify_brin_page(raw_page, BRIN_PAGETYPE_META, "metapage");
 
+	if (PageIsNew(page))
+		PG_RETURN_NULL();
+
 	/* Build a tuple descriptor for our result type */
 	if (get_call_result_type(fcinfo, NULL, ) != TYPEFUNC_COMPOSITE)
 		elog(ERROR, "return type must be a row type");
@@ -364,6 +380,12 @@ brin_revmap_data(PG_FUNCTION_ARGS)
 		/* minimally verify the page we got */
 		page = verify_brin_page(raw_page, BRIN_PAGETYPE_REVMAP, "revmap");
 
+		if (PageIsNew(page))
+		{
+			MemoryContextSwitchTo(mctx);
+			PG_RETURN_NULL();
+		}
+
 		state = palloc(sizeof(*state));
 		state->tids = ((RevmapContents *) PageGetContents(page))->rm_tids;
 		state->idx = 0;
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index d9628dd664..dde640fd33 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -611,6 +611,12 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs->page = get_page_from_raw(raw_page);
 
+		if (PageIsNew(uargs->page))
+		{
+			MemoryContextSwitchTo(mctx);
+			PG_RETURN_NULL();
+		}
+
 		uargs->offset = FirstOffsetNumber;
 
 		opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
diff --git a/contrib/pageinspect/expected/brin.out b/contrib/pageinspect/expected/brin.out
index 10cd36c177..15786a0c0d 100644
--- a/contrib/pageinspect/expected/brin.out
+++ b/contrib/pageinspect/expected/brin.out
@@ -52,4 +52,29 @@ SELECT * FROM brin_page_items(get_raw_page('test1_a_idx', 2), 'test1_a_idx')
 CREATE INDEX test1_a_btree ON test1 (a);
 SELECT brin_page_items(get_raw_page('test1_a_btree', 0), 'test1_a_btree');
 ERROR:  "test1_a_btree" is not a BRIN index
+-- Test for empty pages with raw input.
+SHOW block_size \gset
+SELECT brin_page_type(decode(repeat('00', :block_size), 'hex'));
+ brin_page_type 
+
+ 
+(1 row)
+
+SELECT brin_page_items(decode(repeat('00', :block_size), 'hex'), 'test1_a_idx');
+ brin_page_items 
+-
+(0 rows)
+
+SELECT brin_metapage_info(decode(repeat('00', :block_size), 'hex'));
+ brin_metapage_info 
+
+ 
+(1 row)
+
+SELECT brin_revmap_data(decode(repeat('00', :block_size), 'hex'));
+ brin_revmap_data 
+--
+ 
+(1 row)
+
 DROP TABLE test1;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 80b3dfe861..8815b56b15 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -85,4 +85,10 @@ ERROR:  "test1_a_hash" is not a btree index
 SELECT bt_page_items('aaa'::bytea);
 ERROR:  invalid page size
 \set VERBOSITY default
+-- Test for empty pages with raw input.
+SHOW block_size \gset
+SELECT bt_page_items(decode(repeat('00', :block_size), 'hex'));
+-[ RECORD 1 ]-+-
+bt_page_items | 
+
 DROP TABLE test1;
diff 

Re: Possible corruption by CreateRestartPoint at promotion

2022-03-16 Thread Kyotaro Horiguchi
Just for the record.

An instance of the corruption showed up in this mailing list [1].

[1] 
https://www.postgresql.org/message-id/flat/9EB4CF63-1107-470E-B5A4-061FB9EF8CC8%40outlook.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Standby got invalid primary checkpoint after crashed right after promoted.

2022-03-16 Thread Kyotaro Horiguchi
(My previous mail has crossed with this one)

At Wed, 16 Mar 2022 08:21:46 +, hao harry  wrote in 
> Found this issue is duplicated to [1], after applied that patch, I cannot 
> reproduce it anymore.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com

Glad to hear that.  Thanks for checking it!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Standby got invalid primary checkpoint after crashed right after promoted.

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 07:16:16 +, hao harry  wrote in 
> Hi, pgsql-hackers,
> 
> I think I found a case that database is not recoverable, would you please 
> give a look?
> 
> Here is how it happens:
> 
> - setup primary/standby
> - do a lots INSERT at primary
> - create a checkpoint at primary
> - wait until standby start doing restart point, it take about 3mins syncing 
> buffers to complete
> - before the restart point update ControlFile, promote the standby, that 
> changed ControlFile
>   ->state to DB_IN_PRODUCTION, this will skip update to ControlFile, leaving 
> the ControlFile
>   ->checkPoint pointing to a removed file

Yeah, it seems like exactly the same issue pointed in [1].  A fix is
proposed in [1].  Maybe I can remove "possible" from the mail subject:p

[1] 
https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com
[2] 
https://www.postgresql.org/message-id/20220316.102444.2193181487576617583.horikyota@gmail.com

> - before the promoted standby request the post-recovery checkpoint (fast 
> promoted), 
>   one backend crashed, it will kill other server process, so the 
> post-recovery checkpoint skipped
> - the database restart startup process, which report: "could not locate a 
> valid checkpoint record"
> 
> I attached a test to reproduce it, it does not fail every time, it fails 
> every 10 times to me.
> To increase the chance CreateRestartPoint skip update ControlFile and to 
> simulate a crash,
> the patch 0001 is needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Skipping logical replication transactions on subscriber side

2022-03-16 Thread Masahiko Sawada
On Wed, Mar 16, 2022 at 1:20 PM Amit Kapila  wrote:
>
> On Wed, Mar 16, 2022 at 7:58 AM Amit Kapila  wrote:
> >
> > On Wed, Mar 16, 2022 at 6:03 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Mar 15, 2022 at 7:18 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Mar 15, 2022 at 11:43 AM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > >
> > > > 6.
> > > > @@ -1583,7 +1649,8 @@ apply_handle_insert(StringInfo s)
> > > >   TupleTableSlot *remoteslot;
> > > >   MemoryContext oldctx;
> > > >
> > > > - if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
> > > > + if (is_skipping_changes() ||
> > > >
> > > > Is there a reason to keep the skip_changes check here and in other DML
> > > > operations instead of at one central place in apply_dispatch?
> > >
> > > Since we already have the check of applying the change on the spot at
> > > the beginning of the handlers I feel it's better to add
> > > is_skipping_changes() to the check than add a new if statement to
> > > apply_dispatch, but do you prefer to check it in one central place in
> > > apply_dispatch?
> > >
> >
> > I think either way is fine. I just wanted to know the reason, your
> > current change looks okay to me.
> >
> > Some questions/comments
> > ==
> >
>
> Some cosmetic suggestions:
> ==
> 1.
> +# Create subscriptions. Both subscription sets disable_on_error to on
> +# so that they get disabled when a conflict occurs.
> +$node_subscriber->safe_psql(
> + 'postgres',
> + qq[
> +CREATE SUBSCRIPTION $subname CONNECTION '$publisher_connstr'
> PUBLICATION tap_pub WITH (streaming = on, two_phase = on,
> disable_on_error = on);
> +]);
>
> I don't understand what you mean by 'Both subscription ...' in the
> above comments.

Fixed.

>
> 2.
> + # Check the log indicating that successfully skipped the transaction,
>
> How about slightly rephrasing this to: "Check the log to ensure that
> the transaction is skipped"?

Fixed.

I've attached an updated version patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v15-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transa.patch
Description: Binary data


Re: Standby got invalid primary checkpoint after crashed right after promoted.

2022-03-16 Thread hao harry
Found this issue is duplicated to [1], after applied that patch, I cannot 
reproduce it anymore.

[1] 
https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com

On 16 Mar 2022, at 15:16, hao harry <harry-...@outlook.com> wrote:

Hi, pgsql-hackers,

I think I found a case that database is not recoverable, would you please give 
a look?

Here is how it happens:

- setup primary/standby
- do a lots INSERT at primary
- create a checkpoint at primary
- wait until standby start doing restart point, it take about 3mins syncing 
buffers to complete
- before the restart point update ControlFile, promote the standby, that 
changed ControlFile
 ->state to DB_IN_PRODUCTION, this will skip update to ControlFile, leaving the 
ControlFile
 ->checkPoint pointing to a removed file
- before the promoted standby request the post-recovery checkpoint (fast 
promoted),
 one backend crashed, it will kill other server process, so the post-recovery 
checkpoint skipped
- the database restart startup process, which report: "could not locate a valid 
checkpoint record"

I attached a test to reproduce it, it does not fail every time, it fails every 
10 times to me.
To increase the chance CreateRestartPoint skip update ControlFile and to 
simulate a crash,
the patch 0001 is needed.

Best Regard.

Harry Hao

<0001-Patched-CreateRestartPoint-to-reproduce-invalid-chec.patch>



Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 15:42:52 +0900, Michael Paquier  wrote 
in 
> On Wed, Mar 16, 2022 at 10:34:15AM +0900, Kyotaro Horiguchi wrote:
> > +1. Doesn't the doc need to mention that?
> 
> Yes, I agree that it makes sense to add a note, even if
> allow_in_place_tablespaces is a developer option.  I have added the
> following paragraph in the docs:
> +A full path of the symbolic link in pg_tblspc/
> +is returned. A relative path to the data directory is returned
> +for tablespaces created with
> + enabled.

I'm not sure that the "of the symbolic link in pg_tblspc/" is
needed. And allow_in_place_tablespaces alone doesn't create an in-place
tablespace. So this might need a rethink, at least for the second point.
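
To illustrate the second point, an in-place tablespace needs both the GUC and
an empty LOCATION; with the patch, pg_tablespace_location() would then return
a path relative to the data directory.  A sketch (object names are
illustrative):

    SET allow_in_place_tablespaces = on;
    CREATE TABLESPACE regress_inplace LOCATION '';
    SELECT pg_tablespace_location(oid)
      FROM pg_tablespace WHERE spcname = 'regress_inplace';
    -- expected to return something like 'pg_tblspc/<oid>'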

> Another thing that was annoying in the first version of the patch is
> the useless call to lstat() on Windows, not needed because it is
> possible to rely just on pgwin32_is_junction() to check if readlink()
> should be called or not.

Agreed. And v2 looks cleaner.

The test detects the lack of the feature.
It successfully builds and runs on Rocky8 and Windows11.

> This leads me to the revised version attached.  What do you think?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Out-of-tree certificate interferes ssltest

2022-03-16 Thread Michael Paquier
On Wed, Mar 16, 2022 at 04:36:58PM +0900, Kyotaro Horiguchi wrote:
> ok 6 - ssl_client_cert_present() for connection with cert
> connection error: 'psql: error: connection to server at "127.0.0.1", port 
> 61688 failed: could not read certificate file 
> "/home/horiguti/.postgresql/postgresql.crt": no start line'
> while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt 
> sslmode=require dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser 
> host=localhost -f - -v ON_ERR
> 
> I think we don't want this behavior.
> 
> The attached fixes that and make-world successfully finished even if I
> have a cert file in my home directory.

That's the same issue as the one fixed in dd87799, using the same
method.  I'll double-check on top of looking at what you are
suggesting here.
--
Michael


signature.asc
Description: PGP signature


Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-16 Thread Bharath Rupireddy
On Tue, Mar 15, 2022 at 7:21 AM Bharath Rupireddy
 wrote:
>
> On Mon, Mar 14, 2022 at 8:25 PM Stephen Frost  wrote:
> >
> > > As this patch is currently written, pg_monitor has access these
> > > functions, though I don't think that's the right privilege level at
> > > least for pg_get_raw_wal_record().
> >
> > Yeah, I agree that pg_monitor isn't the right thing for such a function
> > to be checking.
>
> On Thu, Mar 10, 2022 at 1:52 PM Jeff Davis  wrote:
> >
> > * pg_get_raw_wal_record() seems too powerful for pg_monitor. Maybe that
> > function should require pg_read_server_files? Or at least
> > pg_read_all_data?
>
> The v9 patch set posted at [1] grants execution on
> pg_get_raw_wal_record() to the pg_monitor role.
>
> pg_read_all_data may not be the right choice, but pg_read_server_files
> is, as these functions do read the WAL files on the server. If okay,
> I'm happy to grant execution on pg_get_raw_wal_record() to the
> pg_read_server_files role.
>
> Thoughts?
>
> [1] 
> https://www.postgresql.org/message-id/CALj2ACVRH-z8mZLyFkpLvY4qRhxQCqU_BLkFTtwt%2BTPZNhfEVg%40mail.gmail.com

Attaching v10 patch set which allows pg_get_raw_wal_record to be
executed by either superuser or users with pg_read_server_files role,
no other change from v9 patch set.
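
As a usage sketch under that scheme, a non-superuser monitoring role would get
access through the predefined role rather than through pg_monitor (the role
name below is illustrative):

    CREATE ROLE wal_inspector LOGIN;
    GRANT pg_read_server_files TO wal_inspector;
    -- wal_inspector should now be able to execute pg_get_raw_wal_record()
    -- under the v10 patch.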

Please review it further.

Regards,
Bharath Rupireddy.


v10-0001-pg_walinspect.patch
Description: Binary data


v10-0001-pg_walinspect-tests.patch
Description: Binary data


v10-0001-pg_walinspect-docs.patch
Description: Binary data


Re: Assert in pageinspect with NULL pages

2022-03-16 Thread Michael Paquier
On Mon, Feb 21, 2022 at 10:00:00AM +0300, Alexander Lakhin wrote:
> Could you please confirm before committing the patchset that it fixes
> the bug #16527 [1]? Or maybe I could check it?
> (Original patch proposed by Daria doesn't cover that case, but if the
> patch going to be improved, probably it's worth fixing that bug too.)
> 
> [1]
> https://www.postgresql.org/message-id/flat/16527-ef7606186f0610a1%40postgresql.org

FWIW, I think that this one has been fixed by 076f4d9, where we make
sure that the page is correctly aligned.  Are you still seeing this
problem?
--
Michael


signature.asc
Description: PGP signature


Out-of-tree certificate interferes ssltest

2022-03-16 Thread Kyotaro Horiguchi
Hello.

003_sslinfo.pl fails for me.

ok 6 - ssl_client_cert_present() for connection with cert
connection error: 'psql: error: connection to server at "127.0.0.1", port 61688 
failed: could not read certificate file 
"/home/horiguti/.postgresql/postgresql.crt": no start line'
while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt sslmode=require 
dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser host=localhost -f - -v ON_ERR

I think we don't want this behavior.

The attached fixes that and make-world successfully finished even if I
have a cert file in my home directory.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 308b55f06907ccaf4ac5669daacf04fea8a18fe1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 16 Mar 2022 16:20:46 +0900
Subject: [PATCH v1] Prevent out-of-tree certificates from interfering ssl
 tests

003_sslinfo.pl fails when there is a certificate file in ~/.postgresql
directory.  Prevent that failure by explicitly setting sslcert option
in connection string.
---
 src/test/ssl/t/003_sslinfo.pl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/test/ssl/t/003_sslinfo.pl b/src/test/ssl/t/003_sslinfo.pl
index 95742081f3..81da94f18d 100644
--- a/src/test/ssl/t/003_sslinfo.pl
+++ b/src/test/ssl/t/003_sslinfo.pl
@@ -93,7 +93,7 @@ $result = $node->safe_psql("certdb", "SELECT ssl_client_cert_present();",
 is($result, 't', "ssl_client_cert_present() for connection with cert");
 
 $result = $node->safe_psql("trustdb", "SELECT ssl_client_cert_present();",
-  connstr => "sslrootcert=ssl/root+server_ca.crt sslmode=require " .
+  connstr => "sslcert=invalid sslrootcert=ssl/root+server_ca.crt sslmode=require " .
   "dbname=trustdb hostaddr=$SERVERHOSTADDR user=ssltestuser host=localhost");
 is($result, 'f', "ssl_client_cert_present() for connection without cert");
 
@@ -108,7 +108,7 @@ $result = $node->psql("certdb", "SELECT ssl_client_dn_field('invalid');",
 is($result, '3', "ssl_client_dn_field() for an invalid field");
 
 $result = $node->safe_psql("trustdb", "SELECT ssl_client_dn_field('commonName');",
-  connstr => "sslrootcert=ssl/root+server_ca.crt sslmode=require " .
+  connstr => "sslcert=invalid sslrootcert=ssl/root+server_ca.crt sslmode=require " .
   "dbname=trustdb hostaddr=$SERVERHOSTADDR user=ssltestuser host=localhost");
 is($result, '', "ssl_client_dn_field() for connection without cert");
 
-- 
2.27.0



Re: Tab completion for ALTER MATERIALIZED VIEW ... SET ACCESS METHOD

2022-03-16 Thread Michael Paquier
Hi Nagata-san,

On Wed, Mar 16, 2022 at 01:33:37PM +0900, Yugo NAGATA wrote:
> SET ACCESS METHOD has been supported in ALTER TABLE since commit
> b0483263dd. Since then, SET ACCESS METHOD has also been allowed in
> ALTER MATERIALIZED VIEW. Although it is not documented,
> this works.

Yes, that's an oversight.  I see no reason to not authorize that, and
the rewrite path in tablecmds.c is the same as for plain tables.

> I cannot find any reason to prohibit SET ACCESS METHOD in ALTER
> MATERIALIZED VIEW, so I think it is better to support this in psql
> tab-completion and to document it.

I think that we should have some regression tests about those command
flavors.  How about adding a couple of queries to create_am.sql for
SET ACCESS METHOD and to tablespace.sql for SET TABLESPACE?
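
A sketch of the kind of queries that could be added, reusing objects the
existing tests already create (heap2 in create_am.sql, regress_tblspace in
tablespace.sql); the exact names are assumptions here:

    CREATE MATERIALIZED VIEW heapmv USING heap AS SELECT 1 AS f1;
    ALTER MATERIALIZED VIEW heapmv SET ACCESS METHOD heap2;
    SELECT a.amname FROM pg_class c JOIN pg_am a ON a.oid = c.relam
      WHERE c.relname = 'heapmv';
    ALTER MATERIALIZED VIEW heapmv SET TABLESPACE regress_tblspace;
    DROP MATERIALIZED VIEW heapmv;
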
--
Michael


signature.asc
Description: PGP signature


Standby got invalid primary checkpoint after crashed right after promoted.

2022-03-16 Thread hao harry
Hi, pgsql-hackers,

I think I found a case that database is not recoverable, would you please give 
a look?

Here is how it happens:

- setup primary/standby
- do a lots INSERT at primary
- create a checkpoint at primary
- wait until standby start doing restart point, it take about 3mins syncing 
buffers to complete
- before the restart point update ControlFile, promote the standby, that 
changed ControlFile
  ->state to DB_IN_PRODUCTION, this will skip update to ControlFile, leaving 
the ControlFile
  ->checkPoint pointing to a removed file
- before the promoted standby request the post-recovery checkpoint (fast 
promoted), 
  one backend crashed, it will kill other server process, so the post-recovery 
checkpoint skipped
- the database restart startup process, which report: "could not locate a valid 
checkpoint record"

I attached a test to reproduce it, it does not fail every time, it fails every 
10 times to me.
To increase the chance CreateRestartPoint skip update ControlFile and to 
simulate a crash,
the patch 0001 is needed.

Best Regard.

Harry Hao



0001-Patched-CreateRestartPoint-to-reproduce-invalid-chec.patch
Description:  0001-Patched-CreateRestartPoint-to-reproduce-invalid-chec.patch

# Copyright (c) 2021-2022, PostgreSQL Global Development Group

# This test reproduces a crash after promotion caused error,
# log says: "could not locate a valid checkpoint record".
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Initialize a primary
my $alpha = PostgreSQL::Test::Cluster->new('alpha');
$alpha->init(allows_streaming => 1);
$alpha->start;

# To simulate a backend crash
my $regress_shlib = $ENV{REGRESS_SHLIB};
$alpha->safe_psql('postgres', new('bravo');
$bravo->init_from_backup($alpha, 'bkp', has_streaming => 1);
$bravo->append_conf('postgresql.conf', safe_psql('postgres', 'create table test1 (a int)');
$alpha->safe_psql('postgres',
	'insert into test1 select generate_series(1, 100)');

# Take a checkpoint
$alpha->safe_psql('postgres', 'checkpoint');

my $in  = '';
my $out = '';
my $timer = IPC::Run::timeout(180);
my $h = $bravo->background_psql('postgres', \$in, \$out, $timer,
on_error_stop => 0);
$in .= q{
checkpoint;
};

my $in2  = '';
my $out2 = '';
my $h2 = $bravo->background_psql('postgres', \$in2, \$out2, $timer,
 on_error_stop => 0);
$in2 .= q {
select pg_sleep(0.03);
select pg_promote();
};

# Force a restartpoint, patched to sleep a while before checking
# ControlFile->state.
$h->pump_nb;

# Promote the standby, to set ControlFile->state to DB_IN_PRODUCTION
$h2->pump_nb;

# Simulate a crash, to skip the post-recovery checkpoint.
$bravo->psql('postgres', 'select pg_sleep(0.001)');
$bravo->psql('postgres', 'select chaos_sigsegv()');


# Check the log to see if database recovered from last crash.
my $logfile = slurp_file($bravo->logfile());
ok( $logfile !~ 'invalid primary checkpoint', 
'should recover from last checkpoint');

done_testing();


Re: Skipping logical replication transactions on subscriber side

2022-03-16 Thread Masahiko Sawada
On Wed, Mar 16, 2022 at 11:28 AM Amit Kapila  wrote:
>
> On Wed, Mar 16, 2022 at 6:03 AM Masahiko Sawada  wrote:
> >
> > On Tue, Mar 15, 2022 at 7:18 PM Amit Kapila  wrote:
> > >
> > > On Tue, Mar 15, 2022 at 11:43 AM Masahiko Sawada  
> > > wrote:
> > > >
> > >
> > > 6.
> > > @@ -1583,7 +1649,8 @@ apply_handle_insert(StringInfo s)
> > >   TupleTableSlot *remoteslot;
> > >   MemoryContext oldctx;
> > >
> > > - if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
> > > + if (is_skipping_changes() ||
> > >
> > > Is there a reason to keep the skip_changes check here and in other DML
> > > operations instead of at one central place in apply_dispatch?
> >
> > Since we already have the check of applying the change on the spot at
> > the beginning of the handlers I feel it's better to add
> > is_skipping_changes() to the check than add a new if statement to
> > apply_dispatch, but do you prefer to check it in one central place in
> > apply_dispatch?
> >
>
> I think either way is fine. I just wanted to know the reason, your
> current change looks okay to me.
>
> Some questions/comments
> ==
> 1. IIRC, earlier, we thought of allowing to use of this option (SKIP)
> only for superusers (as this can lead to inconsistent data if not used
> carefully) but I don't see that check in the latest patch. What is the
> reason for the same?

I thought the non-superuser subscription owner could resolve the
conflict by manually manipulating the relations, which gives the same
result as skipping all data modification changes with the ALTER SUBSCRIPTION
SKIP feature. But after more thought, it would not be exactly the same,
since the skipped transaction might include changes to a relation
on which the owner doesn't have permission.
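
For reference, the command shape being discussed looks like this (subscription
name and LSN are illustrative); the question above is about who should be
allowed to run it:

    ALTER SUBSCRIPTION mysub SKIP (lsn = '0/1566D10');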

>
> 2.
> + /*
> + * Update the subskiplsn of the tuple to InvalidXLogRecPtr.
>
> I think we can change the above part of the comment to "Clear subskiplsn."
>

Fixed.

> 3.
> + * Since we already have
>
> Isn't it better to say here: Since we have already ...?

Fixed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-16 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 15:56:02 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Mmm, I'm not sure how come I didn't notice that; master also fails
> for me for the same reason.  In the past that may fail when valid
> client certs exist in the user's ~/.postgresql, but I believe that has
> been fixed.

And finally my fear turned out to be true.  The test doesn't fail after
removing the files under ~/.postgresql, which should not happen.

I'll fix it apart from this.
Sorry for the confusion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: logical replication empty transactions

2022-03-16 Thread Ajin Cherian
On Mon, Mar 7, 2022 at 11:44 PM Ajin Cherian  wrote:
>
> Fixed.
>
> regards,
> Ajin Cherian
> Fujitsu Australia

Rebased the patch and fixed some whitespace errors.
regards,
Ajin Cherian
Fujitsu Australia


v25-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


v25-0002-Skip-empty-streamed-transactions-for-logical-rep.patch
Description: Binary data


RE: Skipping logical replication transactions on subscriber side

2022-03-16 Thread osumi.takami...@fujitsu.com
On Wednesday, March 16, 2022 3:37 PM I wrote:
> On Wednesday, March 16, 2022 11:33 AM Amit Kapila
>  wrote:
> > On Tue, Mar 15, 2022 at 7:30 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Tuesday, March 15, 2022 3:13 PM Masahiko Sawada
> >  wrote:
> > > > I've attached an updated version patch.
> > >
> > > A couple of minor comments on v14.
> > >
> > > (1) apply_handle_commit_internal
> > >
> > >
> > > +   if (is_skipping_changes())
> > > +   {
> > > +   stop_skipping_changes();
> > > +
> > > +   /*
> > > +* Start a new transaction to clear the subskipxid,
> > > + if not
> > started
> > > +* yet. The transaction is committed below.
> > > +*/
> > > +   if (!IsTransactionState())
> > > +   StartTransactionCommand();
> > > +   }
> > > +
> > >
> > > I suppose we can move this condition check and
> > > stop_skipping_changes() call to the inside of the block we enter
> > > when IsTransactionState() returns true.
> > >
> > > As the comment of apply_handle_commit_internal() mentions, it's the
> > > helper function for apply_handle_commit() and
> > > apply_handle_stream_commit().
> > >
> > > Then, I couldn't think of a case where either caller doesn't open a
> > > transaction before the call of apply_handle_commit_internal().
> > > For applying spooled messages, we call begin_replication_step as well.
> > >
> > > I may be missing something, but the only timing when we receive a COMMIT
> > > message without having opened a transaction would be the case of empty
> > > transactions in which the subscription (and its subscription worker) is
> > > not interested.
> > >
> >
> > I think when we skip non-streamed transactions we don't start a transaction.
> > So, if we do what you are suggesting, we will fail to clear the
> > skip_lsn after skipping the transaction.
> OK, this is what I missed.
> 
> On the other hand, what I was worried about is that an empty transaction
> can start skipping changes if the subskiplsn is equal to the finish LSN of
> that empty transaction. The reason is that we call
> maybe_start_skipping_changes even for empty transactions and set
> skip_xact_finish_lsn to the finish LSN in that case.
> 
> I checked that I could make this happen with a debugger and some extra
> logging of LSNs. What I did is set up two pairs of pub/sub and make a change
> for one of them, after setting a breakpoint in logicalrep_write_begin on the
> walsender that will issue an empty transaction.
> Then I checked the finish LSN of it and
> ran an ALTER SUBSCRIPTION ... SKIP command with this LSN value.
> As a result, the empty transaction calls stop_skipping_changes in
> apply_handle_commit_internal and then enters the block for
> IsTransactionState() == true, which would not happen before applying the patch.
> 
> Also, this behavior looks contradictory to a comment in worker.c: "The
> subskiplsn is cleared after successfully skipping the transaction or applying
> non-empty transaction."  So I was just confused and wrote the above comment.
Sorry, my understanding was not correct.

Even when we clear the subskiplsn because of an empty transaction, we
can say that it counts as successfully skipping the transaction. So
this behavior, allowing an empty transaction to match the LSN
indicated by ALTER SUBSCRIPTION, is fine.

I'm sorry for the noise.


Best Regards,
Takamichi Osumi



Re: [PATCH] Accept IP addresses in server certificate SANs

2022-03-16 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 21:41:49 +, Jacob Champion  wrote 
in 
> On Mon, 2022-03-14 at 15:30 +0900, Kyotaro Horiguchi wrote:
> > t/003_sslinfo.pl ... 1/? # Tests were run but no plan was declared and 
> > done_testing() was not seen.
> > # Looks like your test exited with 29 just after 6.
> > t/003_sslinfo.pl ... Dubious, test returned 29 (wstat 7424, 0x1d00)
> > All 6 subtests passed 
> > ...
> > Result: FAIL
> > 
> > The script complains like this:
> > 
> > ok 6 - ssl_client_cert_present() for connection with cert
> > connection error: 'psql: error: connection to server at "127.0.0.1", port 
> > 62656 failed: SSL error: tlsv1 alert unknown ca'
> > while running 'psql -XAtq -d sslrootcert=ssl/root+server_ca.crt 
> > sslmode=require dbname=trustdb hostaddr=127.0.0.1 user=ssltestuser 
> > host=localhost -f - -v ON_ERROR_STOP=1' at 
> > /home/horiguti/work/worktrees/ipsan/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
> >  line 1873.
> > 
> > So, psql seems to dislike the CA certificate.  I will also dig
> > into that.
> 
> Hmm, the sslinfo tests are failing? I wouldn't have expected that based
> on the patch changes. Just to confirm -- they pass for you without the
> patch?

Mmm, I'm not sure how I didn't notice that; master also fails for me
for the same reason.  In the past that could fail when valid client
certs existed in the user's ~/.postgresql, but I believe that has
been fixed.


> > > > Mmm. after the end of the loop, rc is non-zero only when the loop has
> > > > exited by the break and otherwise rc is zero. Why isn't it equivalent
> > > > to setting check_cn to false at the break?
> > > 
> > > check_cn can be false if rc is zero, too; it means that we found a SAN
> > > of the correct type but it didn't match.

I'm not discussing the meaning, purely the logical equivalence of the
two ways to decide whether to visit the CN.

Concretely, the equivalence between this:

=
 check_cn = true;
 rc = 0;
 for (i < san_len)
 {
   if (type matches)  check_cn = false;
   if (some error happens)  rc = nonzero;
 
   if (rc != 0)
 break;
 }
!if ((rc == 0) && check_cn) {}
=

and this.

=
 check_cn = true;
 rc = 0;
 for (i < san_len)
 {
   if (type matches)  check_cn = false;
   if (some error happens)  rc = nonzero;

   if (rc != 0)
   {
!check_cn = false;
 break;
   }
 }
!if (check_cn) {}
=

The two are equivalent to me, and if so, the latter form seems
simpler and cleaner to me.

> > > check_cn can be false if rc is zero, too; it means that we found a SAN
> > > of the correct type but it didn't match.
> 
> > Don't we count unmatched certs as "existed"?  In that case I think we
> > don't go to CN.
> 
> Unmatched names, you mean? I'm not sure I understand.

Sorry, I was confused here. Please ignore that. I should have said
something like the following instead.

> check_cn can be false if rc is zero, too; it means that we found a SAN
> of the correct type but it didn't match.

Yes, in that case we don't visit the CN because check_cn is false even
if we don't exit via (rc != 0), and that behavior is not changed by my
proposal.

> > > > I supposed that we only
> > > > be deviant in the case "IP address connhost and no SANs of any type
> > > > exists". What do you think about it?
> > > 
> > > We fall back in the case of "IP address connhost and dNSName SANs
> > > exist", which is prohibited by that part of RFC 6125. I don't think we
> > > deviate in the case you described; can you explain further?
> > 
> > In that case, i.e., connhost is IP address and no SANs exist at all,
> > we go to CN.  On the other hand in RFC6125:
> > 
> > rfc6125> In some cases, the URI is specified as an IP address rather
> > rfc6125> than a hostname. In this case, the iPAddress subjectAltName
> > rfc6125> must be present in the certificate and must exactly match the
> > rfc6125> IP in the URI.
> > 
> > It (seems to me) denies that behavior.  Regardless of the existence of
> > other types of SANs, iPAddress is required if connname is an IP
> > address.  (That is, it doesn't seem to me that there's any context
> > like "if any SANs exists", but I'm not so sure I read it perfectly.)
> 
> I see what you mean now. Yes, we deviate there as well (and have done
> so for a while now). I think breaking compatibility there would
> probably not go over well.

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_tablespace_location() failure with allow_in_place_tablespaces

2022-03-16 Thread Michael Paquier
On Wed, Mar 16, 2022 at 10:34:15AM +0900, Kyotaro Horiguchi wrote:
> +1. Doesn't the doc need to mention that?

Yes, I agree that it makes sense to add a note, even if
allow_in_place_tablespaces is a developer option.  I have added the
following paragraph in the docs:
+A full path of the symbolic link in pg_tblspc/
+is returned. A relative path to the data directory is returned
+for tablespaces created with
+ enabled.

Another thing that was annoying in the first version of the patch is
the useless call to lstat() on Windows; it is not needed because we
can rely just on pgwin32_is_junction() to check whether readlink()
should be called or not.

This leads me to the revised version attached.  What do you think?
--
Michael
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 4568749d23..89690be2ed 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include 
+#include <sys/stat.h>
 #include 
 #include 
 #include 
@@ -282,6 +283,9 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 	char		sourcepath[MAXPGPATH];
 	char		targetpath[MAXPGPATH];
 	int			rllen;
+#ifndef WIN32
+	struct stat st;
+#endif
 
 	/*
 	 * It's useful to apply this function to pg_class.reltablespace, wherein
@@ -306,6 +310,31 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 	 */
 	snprintf(sourcepath, sizeof(sourcepath), "pg_tblspc/%u", tablespaceOid);
 
+	/*
+	 * Before reading the link, check if the source path is a link or a
+	 * junction point.  Note that a directory is possible for a tablespace
+	 * created with allow_in_place_tablespaces enabled.  If a directory is
+	 * found, a relative path to the data directory is returned.
+	 */
+#ifdef WIN32
+	if (!pgwin32_is_junction(sourcepath))
+		PG_RETURN_TEXT_P(cstring_to_text(sourcepath));
+#else
+	if (lstat(sourcepath, &st) < 0)
+	{
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m",
+		sourcepath)));
+	}
+
+	if (!S_ISLNK(st.st_mode))
+		PG_RETURN_TEXT_P(cstring_to_text(sourcepath));
+#endif
+
+	/*
+	 * In the presence of a link or a junction point, return the path it points to.
+	 */
 	rllen = readlink(sourcepath, targetpath, sizeof(targetpath));
 	if (rllen < 0)
 		ereport(ERROR,
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index 2dfbcfdebe..473fe8c28e 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -24,6 +24,15 @@ SELECT spcoptions FROM pg_tablespace WHERE spcname = 'regress_tblspacewith';
 DROP TABLESPACE regress_tblspacewith;
 -- create a tablespace we can use
 CREATE TABLESPACE regress_tblspace LOCATION '';
+-- This returns a relative path as an effect of allow_in_place_tablespaces,
+-- masking the tablespace OID used in the path name.
+SELECT regexp_replace(pg_tablespace_location(oid), '(pg_tblspc)/(\d+)', '\1/NNN')
+  FROM pg_tablespace  WHERE spcname = 'regress_tblspace';
+ regexp_replace 
+----------------
+ pg_tblspc/NNN
+(1 row)
+
 -- try setting and resetting some properties for the new tablespace
 ALTER TABLESPACE regress_tblspace SET (random_page_cost = 1.0, seq_page_cost = 1.1);
 ALTER TABLESPACE regress_tblspace SET (some_nonexistent_parameter = true);  -- fail
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 896f05cea3..0949a28488 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -22,6 +22,10 @@ DROP TABLESPACE regress_tblspacewith;
 
 -- create a tablespace we can use
 CREATE TABLESPACE regress_tblspace LOCATION '';
+-- This returns a relative path as an effect of allow_in_place_tablespaces,
+-- masking the tablespace OID used in the path name.
+SELECT regexp_replace(pg_tablespace_location(oid), '(pg_tblspc)/(\d+)', '\1/NNN')
+  FROM pg_tablespace  WHERE spcname = 'regress_tblspace';
 
 -- try setting and resetting some properties for the new tablespace
 ALTER TABLESPACE regress_tblspace SET (random_page_cost = 1.0, seq_page_cost = 1.1);
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8a802fb225..df54c5815d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23924,7 +23924,14 @@ SELECT currval(pg_get_serial_sequence('sometable', 'id'));


 Returns the file system path that this tablespace is located in.
-   
+   
+   
+A full path of the symbolic link in pg_tblspc/
+is returned. A relative path to the data directory is returned
+for tablespaces created with
+ enabled.
+   
+   
   
 
   


signature.asc
Description: PGP signature


RE: Skipping logical replication transactions on subscriber side

2022-03-16 Thread osumi.takami...@fujitsu.com
On Wednesday, March 16, 2022 11:33 AM Amit Kapila  
wrote:
> On Tue, Mar 15, 2022 at 7:30 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Tuesday, March 15, 2022 3:13 PM Masahiko Sawada
>  wrote:
> > > I've attached an updated version patch.
> >
> > A couple of minor comments on v14.
> >
> > (1) apply_handle_commit_internal
> >
> >
> > +   if (is_skipping_changes())
> > +   {
> > +   stop_skipping_changes();
> > +
> > +   /*
> > +* Start a new transaction to clear the subskipxid, if not
> started
> > +* yet. The transaction is committed below.
> > +*/
> > +   if (!IsTransactionState())
> > +   StartTransactionCommand();
> > +   }
> > +
> >
> > I suppose we can move this condition check and stop_skipping_changes()
> > call to the inside of the block we enter when IsTransactionState() returns
> > true.
> >
> > As the comment of apply_handle_commit_internal() mentions, it's the
> > helper function for apply_handle_commit() and
> > apply_handle_stream_commit().
> >
> > Then, I couldn't think of a case where either caller doesn't open a
> > transaction before the call of apply_handle_commit_internal().
> > For applying spooled messages, we call begin_replication_step as well.
> >
> > I may be missing something, but the only timing when we receive a COMMIT
> > message without having opened a transaction would be the case of empty
> > transactions in which the subscription (and its subscription worker) is not
> > interested.
> >
> 
> I think when we skip non-streamed transactions we don't start a transaction.
> So, if we do what you are suggesting, we will fail to clear the skip_lsn after
> skipping the transaction.
OK, this is what I missed.

On the other hand, what I was worried about is that an empty transaction
can start skipping changes if the subskiplsn is equal to the finish LSN of
that empty transaction. The reason is that we call
maybe_start_skipping_changes even for empty transactions and set
skip_xact_finish_lsn to the finish LSN in that case.

I checked that I could make this happen with a debugger and some extra
logging of LSNs. What I did is set up two pairs of pub/sub
and make a change for one of them,
after setting a breakpoint in logicalrep_write_begin
on the walsender that will issue an empty transaction.
Then I checked the finish LSN of it and
ran an ALTER SUBSCRIPTION ... SKIP command with this LSN value.
As a result, the empty transaction calls stop_skipping_changes
in apply_handle_commit_internal and then
enters the block for IsTransactionState() == true,
which would not happen before applying the patch.

Also, this behavior looks contradictory to a comment in worker.c:
"The subskiplsn is cleared after successfully skipping the transaction
or applying non-empty transaction."  So I was just confused and
wrote the above comment.

I think this would not happen in practice, so
it might be OK without a special measure for it,
but I wasn't sure.


Best Regards,
Takamichi Osumi



Re: Skipping logical replication transactions on subscriber side

2022-03-16 Thread Masahiko Sawada
On Tue, Mar 15, 2022 at 7:18 PM Amit Kapila  wrote:
>
> On Tue, Mar 15, 2022 at 11:43 AM Masahiko Sawada  
> wrote:
> >
> > I've attached an updated version patch.
> >
>
> Review:
> ===

Thank you for the comments.

> 1.
> +++ b/doc/src/sgml/logical-replication.sgml
> @@ -366,15 +366,19 @@ CONTEXT:  processing remote data for replication
> origin "pg_16395" during "INSER
> transaction, the subscription needs to be disabled temporarily by
> ALTER SUBSCRIPTION ... DISABLE first or
> alternatively, the
> subscription can be used with the
> disable_on_error option.
> -   Then, the transaction can be skipped by calling the
> +   Then, the transaction can be skipped by using
> +   ALTER SUBSCRIPTION ... SKIP with the finish LSN
> +   (i.e., LSN 0/14C0378). After that the replication
> +   can be resumed by ALTER SUBSCRIPTION ... ENABLE.
> +   Alternatively, the transaction can also be skipped by calling the
>
> Do we really need to disable the subscription for the skip feature? I
> think that is required for origin_advance. Also, probably, we can say
> Finish LSN could be Prepare LSN, Commit LSN, etc.

It's not necessary to disable the subscription for the skip feature. Fixed.

>
> 2.
> + /*
> + * Quick return if it's not requested to skip this transaction. This
> + * function is called every start of applying changes and we assume that
> + * skipping the transaction is not used in many cases.
> + */
> + if (likely(XLogRecPtrIsInvalid(MySubscription->skiplsn) ||
>
> The second part of this comment (especially ".. every start of
> applying changes ..") sounds slightly odd to me. How about changing it
> to: "This function is called for every remote transaction and we
> assume that skipping the transaction is not used in many cases."
>

Fixed.
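
For reference, a minimal sketch of the early-return pattern being reworded
here; everything beyond the quoted condition is an assumption, not the exact
patch:

/* Sketch: maybe_start_skipping_changes() with the quick return discussed
 * above.  This is called for every remote transaction, and skipping is
 * assumed to be rare, hence the likely(). */
static void
maybe_start_skipping_changes(XLogRecPtr finish_lsn)
{
	/* Quick return if skipping this transaction was not requested. */
	if (likely(XLogRecPtrIsInvalid(MySubscription->skiplsn) ||
			   MySubscription->skiplsn != finish_lsn))
		return;

	/* Remember the finish LSN; changes up to it will be skipped. */
	skip_xact_finish_lsn = finish_lsn;
}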

> 3.
> +
> + ereport(LOG,
> + errmsg("start skipping logical replication transaction which
> finished at %X/%X",
> ...
> + ereport(LOG,
> + (errmsg("done skipping logical replication transaction which
> finished at %X/%X",
>
> No need for 'which' in the above LOG messages. I think the messages will
> be clear without it.

Removed.

>
> 4.
> + ereport(LOG,
> + (errmsg("done skipping logical replication transaction which
> finished at %X/%X",
> + LSN_FORMAT_ARGS(skip_xact_finish_lsn;
> +
> + /* Stop skipping changes */
> + skip_xact_finish_lsn = InvalidXLogRecPtr;
>
> Let's reverse the order of these statements to make them consistent
> with the corresponding maybe_start_* function.

But we cannot simply reverse the order, since skip_xact_finish_lsn is
used in the log message. Do we want to use a local variable for it?
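
For illustration only, a minimal sketch of that local-variable idea (the
variable name below is made up, not from the patch), which lets the reset
come first while keeping the LSN available for the message:

/* Sketch: reset skip_xact_finish_lsn before logging by saving the value
 * into a local first, so the statement order mirrors maybe_start_*. */
{
	XLogRecPtr	finish_lsn = skip_xact_finish_lsn;

	/* Stop skipping changes */
	skip_xact_finish_lsn = InvalidXLogRecPtr;

	ereport(LOG,
			(errmsg("done skipping logical replication transaction finished at %X/%X",
					LSN_FORMAT_ARGS(finish_lsn))));
}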

>
> 5.
> +
> + if (myskiplsn != finish_lsn)
> + ereport(WARNING,
> + errmsg("skip-LSN of logical replication subscription \"%s\"
> cleared", MySubscription->name),
>
> Shouldn't this be a LOG instead of a WARNING as this will be displayed
> only in server logs and by background apply worker?

WARNING is also used by other auxiliary processes such as the archiver,
autovacuum workers, and the launcher, so I think we can use it here.

>
> 6.
> @@ -1583,7 +1649,8 @@ apply_handle_insert(StringInfo s)
>   TupleTableSlot *remoteslot;
>   MemoryContext oldctx;
>
> - if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
> + if (is_skipping_changes() ||
>
> Is there a reason to keep the skip_changes check here and in other DML
> operations instead of at one central place in apply_dispatch?

I'd leave it as is, as I mentioned in another email, but I've added
some comments as you suggested.
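
For illustration only (a sketch, not the actual patch hunk), the shape being
discussed, reusing the names quoted above, is roughly:

/* Sketch: the skip check piggybacks on the existing early-return test at
 * the top of a DML handler such as apply_handle_insert(), instead of
 * adding a new branch in apply_dispatch(). */
static void
apply_handle_insert(StringInfo s)
{
	/* Bail out early if this transaction is being skipped, or if the
	 * change belongs to a streamed transaction that is only being
	 * spooled to a file at this point. */
	if (is_skipping_changes() ||
		handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
		return;

	/* ... normal INSERT application continues here ... */
}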

>
> 7.
> + /*
> + * Start a new transaction to clear the subskipxid, if not started
> + * yet. The transaction is committed below.
> + */
> + if (!IsTransactionState())
>
> I think the second part of the comment: "The transaction is committed
> below." is not required.

Removed.

>
> 8.
> + XLogRecPtr subskiplsn; /* All changes which finished at this LSN are
> + * skipped */
> +
>  #ifdef CATALOG_VARLEN /* variable-length fields start here */
>   /* Connection string to the publisher */
>   text subconninfo BKI_FORCE_NOT_NULL;
> @@ -109,6 +112,8 @@ typedef struct Subscription
>   bool disableonerr; /* Indicates if the subscription should be
>   * automatically disabled if a worker error
>   * occurs */
> + XLogRecPtr skiplsn; /* All changes which finished at this LSN are
> + * skipped */
>
> No need for 'which' in the above comments.

Removed.

>
> 9.
> Can we merge 029_disable_on_error into 030_skip_xact and name it
> 029_on_error (or 029_on_error_skip_disable or some variant of it)?
> Both seem to be related features. I am slightly worried about the pace
> at which the number of test files is growing in the subscription tests.

Yes, we can merge them.

I'll submit an updated version of the patch after incorporating all the comments I got.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/