Re: Single transaction in the tablesync worker?

2021-02-02 Thread Ajin Cherian
On Tue, Feb 2, 2021 at 9:03 PM Ajin Cherian  wrote:

> I am sorry, my above steps were not correct. I think the reason for
> the failure I was seeing was some other steps I did prior to this. I
> will recreate this and update you with the appropriate steps.

The correct steps are as follows:

Publisher:

postgres=# CREATE TABLE tab_rep (a int primary key);
CREATE TABLE
postgres=# INSERT INTO tab_rep SELECT generate_series(1,100);
INSERT 0 100
postgres=# CREATE PUBLICATION tap_pub FOR ALL TABLES;
CREATE PUBLICATION

Subscriber:
postgres=# CREATE TABLE tab_rep (a int primary key);
CREATE TABLE
postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub enable;
ALTER SUBSCRIPTION

Allow the tablesync to complete and then drop the subscription. The
table remains fully populated, so restarting the subscription should
fail with a constraint violation during tablesync, but it does not.


Subscriber:
postgres=# drop subscription tap_sub ;
NOTICE:  dropped replication slot "tap_sub" on publisher
DROP SUBSCRIPTION
postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub enable;
ALTER SUBSCRIPTION

This takes the subscriber into an error loop, but there is no mention
of what the underlying error was:

2021-02-02 05:01:34.698 EST [1549] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started
2021-02-02 05:01:34.739 EST [1549] ERROR:  table copy could not
rollback transaction on publisher
2021-02-02 05:01:34.739 EST [1549] DETAIL:  The error was: another
command is already in progress
2021-02-02 05:01:34.740 EST [8028] LOG:  background worker "logical
replication worker" (PID 1549) exited with exit code 1
2021-02-02 05:01:40.107 EST [1711] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started
2021-02-02 05:01:40.121 EST [1711] ERROR:  could not create
replication slot "pg_16479_sync_16435": ERROR:  replication slot
"pg_16479_sync_16435" already exists
2021-02-02 05:01:40.121 EST [8028] LOG:  background worker "logical
replication worker" (PID 1711) exited with exit code 1
2021-02-02 05:01:45.140 EST [1891] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started

regards,
Ajin Cherian
Fujitsu Australia




Re: Typo in tablesync comment

2021-02-02 Thread Peter Smith
On Wed, Feb 3, 2021 at 6:13 PM Michael Paquier  wrote:
>
> On Tue, Feb 02, 2021 at 07:23:37PM +0530, Amit Kapila wrote:
> > I don't mind changing to your proposed text but I think the current
> > wording is also okay and seems clear to me.
>
> Reading that again, I still find the word "transient" to be misleading
> in this context.  Any extra opinions?

OTOH I thought "additionally stored" made it seem like those states
were in the catalog and "additionally" in shared memory.

Maybe better to rewrite it more drastically?

e.g.
-
 *The catalog pg_subscription_rel is used to keep information about
 *subscribed tables and their state. The catalog holds all states
 *except SYNCWAIT and CATCHUP which are only in shared memory.
-

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Printing backtrace of postgres processes

2021-02-02 Thread Tom Lane
vignesh C  writes:
> On Mon, Feb 1, 2021 at 11:04 AM Bharath Rupireddy
>  wrote:
>> Are these superuser and permission checks enough from a security
>> standpoint that we don't expose some sensitive information to the
>> user?

> This will just print the backtrace of the current backend. Users
> cannot get password information from this.

Really?

A backtrace normally exposes the text of the current query, for
instance, which could contain very sensitive data (passwords in ALTER
USER, customer credit card numbers in ordinary data, etc etc).  We
don't allow the postmaster log to be seen by any but very privileged
users; it's not sane to think that this data is any less
security-critical than the postmaster log.

This point is entirely separate from the question of whether
triggering stack traces at inopportune moments could cause system
malfunctions, but that question is also not to be ignored.

TBH, I'm leaning to the position that this should be superuser
only.  I do NOT agree with the idea that ordinary users should
be able to trigger it, even against backends theoretically
belonging to their own userid.  (Do I need to point out that
some levels of the call stack might be from security-definer
functions with more privilege than the session's nominal user?)

regards, tom lane




Re: Typo in tablesync comment

2021-02-02 Thread Michael Paquier
On Tue, Feb 02, 2021 at 07:23:37PM +0530, Amit Kapila wrote:
> I don't mind changing to your proposed text but I think the current
> wording is also okay and seems clear to me.

Reading that again, I still find the word "transient" to be misleading
in this context.  Any extra opinions?
--
Michael




Re: Printing backtrace of postgres processes

2021-02-02 Thread vignesh C
On Mon, Feb 1, 2021 at 11:04 AM Bharath Rupireddy
 wrote:
>
> On Mon, Feb 1, 2021 at 6:14 AM Bharath Rupireddy
>  wrote:
> > On Fri, Jan 29, 2021 at 7:10 PM vignesh C  wrote:
> > > > 4) How about following
> > > > + errmsg("must be a superuser to print backtrace
> > > > of backend process")));
> > > > instead of
> > > > + errmsg("must be a superuser to print backtrace
> > > > of superuser query process")));
> > > >
> > >
> > > Here the message should include superuser; we cannot remove it. A
> > > non-superuser can log a non-superuser's backend, provided the user
> > > has permissions for it.
> > >
> > > > 5) How about following
> > > >  errmsg("must be a member of the role whose backend
> > > > process's backtrace is being printed or member of
> > > > pg_signal_backend")));
> > > > instead of
> > > > + errmsg("must be a member of the role whose
> > > > backtrace is being logged or member of pg_signal_backend")));
> > > >
> > >
> > > Modified it.
> >
> > Maybe I'm confused here to understand the difference between
> > SIGNAL_BACKEND_NOSUPERUSER and SIGNAL_BACKEND_NOPERMISSION macros and
> > corresponding error messages. Some clarification/use case to know in
> > which scenarios we hit those error messages might help me. Did we try
> > to add test cases that show up these error messages for
> > pg_print_backtrace? If not, can we add?
>
> Are these superuser and permission checks enough from a security
> standpoint that we don't expose some sensitive information to the
> user? Although I'm not sure, say from the backtrace printed and
> attached to GDB, can users see the passwords or other sensitive
> information from the system that they aren't supposed to see?
>
> I'm sure this point would have been discussed upthread.

This will just print the backtrace of the current backend. Users
cannot get password information from this. The backtrace will be sent
from the customer side to customer support, and developers will use
gdb to resolve the file and line numbers from the addresses. Users
can already attach gdb to the process even without this feature, so
that behavior is the same as before and is not impacted by this
feature. It was discussed in [1] to keep the permission checks similar
to pg_terminate_backend.
[1] 
https://www.postgresql.org/message-id/CA%2BTgmoZ8XeQVCCqfq826kAr804a1ZnYy46FnQB9r2n_iOofh8Q%40mail.gmail.com

Regards,
Vignesh




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-02 Thread Justin Pryzby
On Wed, Feb 03, 2021 at 03:37:39PM +0900, Michael Paquier wrote:
> index 627b36300c..4ee3951ca0 100644
> --- a/doc/src/sgml/ref/reindex.sgml
> +++ b/doc/src/sgml/ref/reindex.sgml
> @@ -293,8 +311,30 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { IN
> respectively. Each partition of the specified partitioned relation is
> reindexed in a separate transaction. Those commands cannot be used inside
> a transaction block when working on a partitioned table or index.
> +   If a REINDEX command fails when run on a partitioned
> +   relation, and TABLESPACE was specified, then it may not
> +   have moved all indexes to the new tablespace. Re-running the command
> +   will rebuild again all the partitions and move previously-unprocessed

remove "again"

> +   indexes to the new tablespace.
> +  
> +  
> +  
> +   When using the TABLESPACE clause with
> +   REINDEX on a partitioned index or table, only the
> +   tablespace references of the partitions are updated. As partitioned 
> indexes

I think you should say "of the LEAF partitions ..".  The intermediate,
partitioned tables are also "partitions" (partitioned partitions if you like).

> +   are not updated, it is recommended to separately use
> +   ALTER TABLE ONLY on them to achieve that.

Maybe say: "..to set the default tablespace of any new partitions created in
the future".

-- 
Justin




Re: Printing backtrace of postgres processes

2021-02-02 Thread vignesh C
Thanks Bharath for your comments.

On Mon, Feb 1, 2021 at 6:14 AM Bharath Rupireddy
 wrote:
>
> On Fri, Jan 29, 2021 at 7:10 PM vignesh C  wrote:
> > > 4) How about following
> > > + errmsg("must be a superuser to print backtrace
> > > of backend process")));
> > > instead of
> > > + errmsg("must be a superuser to print backtrace
> > > of superuser query process")));
> > >
> >
> > Here the message should include superuser; we cannot remove it. A
> > non-superuser can log a non-superuser's backend, provided the user
> > has permissions for it.
> >
> > > 5) How about following
> > >  errmsg("must be a member of the role whose backend
> > > process's backtrace is being printed or member of
> > > pg_signal_backend")));
> > > instead of
> > > + errmsg("must be a member of the role whose
> > > backtrace is being logged or member of pg_signal_backend")));
> > >
> >
> > Modified it.
>
> Maybe I'm confused here to understand the difference between
> SIGNAL_BACKEND_NOSUPERUSER and SIGNAL_BACKEND_NOPERMISSION macros and
> corresponding error messages. Some clarification/use case to know in
> which scenarios we hit those error messages might help me. Did we try
> to add test cases that show up these error messages for
> pg_print_backtrace? If not, can we add?

I have tested this manually. Here is the test I did:
Create 2 users:
create user test password 'test@123';
create user test1 password 'test@123';

Test1: Test printing the backtrace of a different user's session:
./psql -d postgres -U test
psql (14devel)
Type "help" for help.
postgres=> select pg_backend_pid();
 pg_backend_pid
----------------
          30142
(1 row)
--
./psql -d postgres -U test1
psql (14devel)
Type "help" for help.
postgres=> select pg_print_backtrace(30142);
ERROR:  must be a member of the role whose backtrace is being logged
or member of pg_signal_backend

The above will be successful after:
grant pg_signal_backend to test1;

Test2: Test for a non-superuser trying to print the backtrace of a
superuser's session:
./psql -d postgres
psql (14devel)
Type "help" for help.
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          30211
(1 row)

./psql -d postgres -U test1
psql (14devel)
Type "help" for help.
postgres=> select pg_print_backtrace(30211);
ERROR:  must be a superuser to print backtrace of superuser process
I have not added any regression tests for this, as it requires two
active sessions and I did not see any existing framework for that.
The manual test above should help illustrate the behavior.
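
For reference, the two macros roughly correspond to checks of the
following shape (a rough sketch modeled on pg_signal_backend(); this is
not the exact code in the patch, and the pg_signal_backend role macro
is whatever name the tree currently uses):

    PGPROC     *proc = BackendPidGetProc(bt_pid);

    /* Only a superuser may print the backtrace of a superuser's backend. */
    if (superuser_arg(proc->roleId) && !superuser())
        return SIGNAL_BACKEND_NOSUPERUSER;

    /* Otherwise require membership in the target role or pg_signal_backend. */
    if (!has_privs_of_role(GetUserId(), proc->roleId) &&
        !has_privs_of_role(GetUserId(), DEFAULT_ROLE_SIGNAL_BACKENDID))
        return SIGNAL_BACKEND_NOPERMISSION;

The first check produces the "must be a superuser ..." error from Test2,
and the second produces the "must be a member of the role ..." error
from Test1.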

> > Attached v5 patch has the fixes for the same.
> > Thoughts?
>
> Thanks. Here are some comments on v5 patch:
>
> 1) typo - it's "process" not "porcess" +a backend porcess. For example:
>

Modified.

> 2) select * from pg_print_backtrace(NULL);
> postgres=# select proname, proisstrict from pg_proc where proname =
> 'pg_print_backtrace';
>       proname       | proisstrict
> --------------------+-------------
>  pg_print_backtrace | t
>
>  See the documentation:
>  "proisstrict bool
>
> Function returns null if any call argument is null. In that case the
> function won't actually be called at all. Functions that are not
> “strict” must be prepared to handle null inputs."
> So below PG_ARGISNUL check is not needed, you can remove that, because
> pg_print_backtrace will not get called with null input.
> int        bt_pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
>

Modified.

> 3) Can we just set results = true instead of PG_RETURN_BOOL(true); so
> that it will be returned from PG_RETURN_BOOL(result); just for
> consistency?
> if (!SendProcSignal(bt_pid, PROCSIG_BACKTRACE_PRINT,
> InvalidBackendId))
> PG_RETURN_BOOL(true);
> else
> ereport(WARNING,
> (errmsg("failed to send signal to postmaster: %m")));
> }
>
> PG_RETURN_BOOL(result);
>

I felt the existing code is better, as it avoids the extra step of
setting the variable first and then returning. There are only two
places from which we return, so we can directly return true or false.

> 4) Below is what happens if I request for a backtrace of the
> postmaster process? 1388210 is pid of postmaster.
> postgres=# select * from pg_print_backtrace(1388210);
> WARNING:  PID 1388210 is not a PostgreSQL server process
>  pg_print_backtrace
> --------------------
>  f
>
> Does it make sense to have a postmaster's backtrace? If yes, can we
> have something like below in sigusr1_handler()?
> if (CheckPostmasterSignal(PMSIGNAL_BACKTRACE_EMIT))
> {
> LogBackTrace();
> }
>

We had a discussion about this earlier in [1] and felt that including
the postmaster is not very useful.

> 5) Can we have PROCSIG_PRINT_BACKTRACE instead of
> PROCSIG_BACKTRACE_PRINT, just for readability and consistency with
> function name?
>

PROCSIG_PRINT_BACKTRACE is better; I have modified it.

> 6) I think it's 

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-02 Thread Michael Paquier
On Tue, Feb 02, 2021 at 10:32:19AM +0900, Michael Paquier wrote:
> On Mon, Feb 01, 2021 at 06:28:57PM +0300, Alexey Kondratov wrote:
> > Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses
> > index_create() with a proper tablespaceOid instead of
> > SetRelationTableSpace(). And its checks structure is more restrictive even
> > without tablespace change, so it doesn't use CheckRelationTableSpaceMove().
> 
> Sure.  I have not checked the patch in details, but even with that it
> would be much safer to me if we apply the same sanity checks
> everywhere.  That's less potential holes to worry about.

Thanks Alexey for the new patch.  I have been looking at the main
patch in detail.

/*
-* Don't allow reindex on temp tables of other backends ... their local
-* buffer manager is not going to cope.
+* We don't support moving system relations into different tablespaces
+* unless allow_system_table_mods=1.
 */
If you remove the check on RELATION_IS_OTHER_TEMP() in
reindex_index(), you would allow the reindex of a temp relation owned
by a different session if its tablespace is not changed, so this
cannot be removed.

+!allowSystemTableMods && IsSystemRelation(iRel))
 ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot reindex temporary tables of other sessions")));
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied: \"%s\" is a system catalog",
+RelationGetRelationName(iRel))));
Indeed, a system relation with a relfilenode should be allowed to move
under allow_system_table_mods.  I think that we had better move this
check into CheckRelationTableSpaceMove() instead of reindex_index() to 
centralize the logic.  ALTER TABLE does this business in
RangeVarCallbackForAlterRelation(), but our code path opening the
relation is different for the non-concurrent case.

+   if (OidIsValid(params->tablespaceOid) &&
+   IsSystemClass(relid, classtuple))
+   {
+   if (!allowSystemTableMods)
+   {
+   /* Skip all system relations, if not allowSystemTableMods *
I don't see the need for having two warnings here to say the same
thing if a relation is mapped or not mapped, so let's keep it simple.

+REINDEX (TABLESPACE regress_tblspace) SYSTEM CONCURRENTLY postgres; -- fail
+ERROR:  cannot reindex system catalogs concurrently
[...]
+REINDEX (TABLESPACE regress_tblspace) DATABASE regression; -- ok with warning
+WARNING:  cannot change tablespace of indexes on system relations, skipping all
+REINDEX (TABLESPACE pg_default) DATABASE regression; -- ok with warning
+WARNING:  cannot change tablespace of indexes on system relations, skipping all
Those tests are costly by design, so let's drop them.  They have been
useful to check the patch, but if tests are changed with objects
remaining around this would cost a lot of resources.

+   /* It's not a shared catalog, so refuse to move it to shared tablespace */
+   if (params->tablespaceOid == GLOBALTABLESPACE_OID)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot move non-shared relation to tablespace \"%s\"",
+get_tablespace_name(params->tablespaceOid))));
There is no test coverage for this case with REINDEX CONCURRENTLY, and
that's easy enough to stress.  So I have added one.

I have found that the test suite was rather messy in its
organization.  Table creations were done first with a set of tests not
really ordered, so that was really hard to follow.  This has also led
to a set of tests that were duplicated, while other tests have been
missed, mainly some cross checks for the concurrent and non-concurrent
behaviors.  I have reordered the whole thing so that tests on catalogs, normal
tables and partitions are done separately with relations created and
dropped for each set.  Partitions use a global check for tablespaces
and relfilenodes after one concurrent reindex (didn't see the point in
doubling with the non-concurrent case as the same code path to select
the relations from the partition tree is taken).  An ACL test has been
added at the end.

The case of partitioned indexes was kind of interesting and I thought
about it for a couple of days, and I took the decision to ignore
relations that have no storage, as you did, documenting that ALTER
TABLE can be used to update the references of the partitioned
relations.  The command is still useful with this behavior, and the
tests I have added track that.

Finally, I have reworked the docs, separating the limitations related
to system catalogs and partitioned relations, to be more consistent
with the notes at the end of the page.
--
Michael
From c021aeca905a738d38acb257cbc4724804273499 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 3 Feb 2021 15:26:21 +0900
Subject: [PATCH v11] Allow REINDEX to change tablespace

REINDEX already 

Can we have a new SQL callable function to get Postmaster PID?

2021-02-02 Thread Bharath Rupireddy
Hi,

Can we have a new function, say pg_postgres_pid(), to return the
postmaster PID, similar to pg_backend_pid()? At times it is difficult
to use OS-level commands to get the PID of the postmaster that a
backend is connected to, and it's even worse if we have multiple
postgres server instances running on the same system. I'm not sure
whether it's safe to expose the postmaster PID this way, but it will
be useful at least for debugging purposes on, say, Windows or other
non-Linux platforms where it's a bit difficult to get the process id.
Users can also look at the postmaster.pid file to figure out the
current postmaster PID, if not using OS-level commands, but having a
SQL-callable function makes life easier.

The function can look like this:
Datum
pg_postgres_pid(PG_FUNCTION_ARGS)
{
PG_RETURN_INT32(PostmasterPid);
}

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-02-02 Thread Bharath Rupireddy
On Thu, Jan 28, 2021 at 10:07 AM japin  wrote:
> Attaching v3 patches, please consider these for further review.

I think we can add a commitfest entry for this feature, so that the
patches will be tested on cfbot. Ignore if done already.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Identify missing publications from publisher while create/alter subscription.

2021-02-02 Thread Bharath Rupireddy
On Mon, Jan 25, 2021 at 10:32 PM vignesh C  wrote:
> > I mean it doesn’t seem right to disallow to create the subscription if
> > the publisher doesn't exist, and my reasoning was even though the
> > publisher exists while creating the subscription you might drop it
> > later right?.  So basically, now also we can create the same scenario
> > that a subscription may exist for the publication which does not
> > exist.
> >
>
> I would like to defer on documentation for this.
> I feel we should have the behavior similar to publication tables as given 
> below, then it will be consistent and easier for the users:
>
> This is the behavior in case of table:
> Step 1:
> PUBLISHER SIDE:
> create table t1(c1 int);
> create table t2(c1 int);
> CREATE PUBLICATION mypub1 for table t1,t2;
> -- All above commands succeeds
> Step 2:
> SUBSCRIBER SIDE:
> -- Create subscription without creating tables will result in error:
> CREATE SUBSCRIPTION mysub1 CONNECTION 'dbname=source_rep host=localhost 
> user=vignesh port=5432' PUBLICATION mypub1;
> ERROR:  relation "public.t2" does not exist
> create table t1(c1 int);
> create table t2(c1 int);
>
> CREATE SUBSCRIPTION mysub1 CONNECTION 'dbname=source_rep host=localhost 
> user=vignesh port=5432' PUBLICATION mypub1;
>
> postgres=# select * from pg_subscription;
>   oid  | subdbid | subname | subowner | subenabled | subbinary | substream |  
>  subconninfo   | subslotname | 
> subsynccommit | subpublications
> ---+-+-+--++---+---+-+-+---+-
>  16392 |   13756 | mysub1  |   10 | t  | f | f | 
> dbname=source_rep host=localhost user=vignesh port=5432 | mysub1  | off   
> | {mypub1}
> (1 row)
>
> postgres=# select *,srrelid::oid::regclass from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn  | srrelid
> -+-++---+-
>16392 |   16389 | r  | 0/1608BD0 | t2
>16392 |   16384 | r  | 0/1608BD0 | t1
>
> (2 rows)
> Step 3:
> PUBLISHER:
> drop table t2;
> create table t3;
> CREATE PUBLICATION mypub2 for table t1,t3;
>
> Step 4:
> SUBSCRIBER:
> postgres=# select *,srrelid::oid::regclass from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn  | srrelid
> -+-++---+-
>16392 |   16389 | r  | 0/1608BD0 | t2
>16392 |   16384 | r  | 0/1608BD0 | t1
>
> (2 rows)
>
> postgres=#  alter subscription mysub1 refresh publication ;
> ALTER SUBSCRIPTION
>
> -- Subscription relation will be updated.
> postgres=# select *,srrelid::oid::regclass from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn  | srrelid
> -+-++---+-
>16392 |   16384 | r  | 0/1608BD0 | t1
> (1 row)
>
>
> -- Alter subscription fails while setting publication having a table that 
> does not exist
> postgres=#  alter subscription mysub1 set publication mysub2;
> ERROR:  relation "public.t3" does not exist
>
> To maintain consistency, we should have similar behavior in case of 
> publication too.
> If a publication which does not exist is specified during create 
> subscription, then we should throw an error similar to step 2 behavior. 
> Similarly if a publication which does not exist is specified during alter 
> subscription, then we should throw an error similar to step 4 behavior. If 
> publication is dropped after subscription is created, this should be removed 
> when an alter subscription subname refresh publication is performed similar 
> to step 4.
> Thoughts?

IIUC, your idea is to check if the publications (that are associated
with a subscription) are present in the publisher or not during ALTER
SUBSCRIPTION ... REFRESH PUBLICATION;. If that's the case, then I have a
scenario:
1) subscription is created with pub1, pub2 and assume both the
publications are present in the publisher
2) pub1 and pub2 tables data is replicated properly
3) pub2 is dropped on the publisher
4) run alter subscription .. refresh publication on the subscriber, so
that the pub2 tables will be removed from the subscriber
5) for some reason, the user creates pub2 again on the publisher and wants
to replicate some tables
6) run alter subscription .. refresh publication on the subscriber, so
that the pub2 tables will be added to the subscriber table list

Now, if we remove the dropped publication pub2 in step 4 from the
subscription list (as per your above analysis and suggestion), then
after step 5, users will need to add the publication pub2 to the
subscription again. I feel this is a change in the current behaviour.
The existing behaviour on master doesn't mandate this as the dropped
publications are not removed from the subscription list at all.

To not mandate any new behaviour, I would suggest to have 

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-02-02 Thread Bharath Rupireddy
On Tue, Feb 2, 2021 at 9:45 AM Fujii Masao  wrote:
> One merit of keep_connections that I found is that we can use it even
> when connecting to the older PostgreSQL that doesn't support
> idle_session_timeout. Also it seems simpler to use keep_connections
> rather than setting idle_session_timeout in multiple remote servers.
> So I'm inclined to add this feature, but I'd like to hear more opinions.

Thanks.

> > And, just using idle_session_timeout on a remote server may not help
> > us completely. Because the remote session may go away, while we are
> > still using that cached connection in an explicit txn on the local
> > session. Our connection retry will also not work because we are in the
> > middle of an xact, so the local explicit txn gets aborted.
>
> Regarding idle_in_transaction_session_timeout, this seems true. But
> I was thinking that idle_session_timeout doesn't cause this issue because
> it doesn't close the connection in the middle of transaction. No?

You are right. idle_session_timeout doesn't take effect when in the
middle of an explicit txn. I missed this point.

> Here are some review comments.
>
> -   (used_in_current_xact && !keep_connections))
> +   (used_in_current_xact &&
> +   (!keep_connections || !entry->keep_connection)))
>
> The names of GUC and server-level option should be the same,
> to make the thing less confusing?

We can keep the GUC name keep_connections, since there can be multiple
connections within a local session, and I can change the server-level
option keep_connection to keep_connections as well, because a single
foreign server can have multiple connections, as we have seen in the
use case you identified. I will change that in the next patch set.

> IMO the server-level option should override GUC. IOW, GUC setting
> should be used only when the server-level option is not specified.
> But the above code doesn't seem to do that. Thought?

Note that the default values of both the GUC and the server-level
option are on, i.e. connections are cached.

The main intention of the GUC is to avoid having to set the
server-level option to false for every foreign server when users don't
want to keep any foreign server connections. If the server-level
option overrode the GUC, then even after setting the GUC to off, users
would still have to set the server-level option to false for all the
foreign servers.

So, the below code in the patch first checks the GUC. If the GUC is
off, the connection is discarded. If the GUC is on, it further checks
the server-level option: if that is off the connection is discarded,
otherwise it is kept.

I would like to keep this behaviour as is. Thoughts?

 if (PQstatus(entry->conn) != CONNECTION_OK ||
 PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
 entry->changing_xact_state ||
 entry->invalidated ||
+(used_in_current_xact &&
+(!keep_connections || !entry->keep_connection)))
 {
 elog(DEBUG3, "discarding connection %p", entry->conn);
 disconnect_pg_server(entry);

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-02-02 Thread Dilip Kumar
On Wed, Feb 3, 2021 at 2:07 AM Robert Haas  wrote:
>
> Even more review comments, still looking mostly at 0001:
>
> If there's a reason why parallel_schedule is arranging to run the
> compression test in parallel with nothing else, the comment in that
> file should explain the reason. If there isn't, it should be added to
> a parallel group that doesn't have the maximum number of tests yet,
> probably the last such group in the file.
>
> serial_schedule should add the test in a position that roughly
> corresponds to where it appears in parallel_schedule.
>
> I believe it's relatively standard practice to put variable
> declarations at the top of the file. compress_lz4.c and
> compress_pglz.c instead put those declarations nearer to the point of
> use.
>
> compressamapi.c has an awful lot of #include directives for the code
> it actually contains. I believe that we should cut that down to what
> is required by 0001, and other patches can add more later as required.
> In fact, it's tempting to just get rid of this .c file altogether and
> make the two functions it contains static inline functions in the
> header, but I'm not 100% sure that's a good idea.
>
> The copyright dates in a number of the file headers are out of date.
>
> binary_upgrade_next_pg_am_oid and the related changes to
> CreateAccessMethod don't belong in 0001, because it doesn't support
> non-built-in compression methods. These changes and the related
> pg_dump change should be moved to the patch that adds support for
> that.
>
> The comments added to dumpTableSchema() say that "compression is
> assigned by ALTER" but don't give a reason. I think they should. I
> don't know how much they need to explain about what the code does, but
> they definitely need to explain why it does it. Also, isn't this bad?
> If we create the column with the wrong compression setting initially
> and then ALTER it, we have to rewrite the table. If it's empty, that's
> cheap, but it'd still be better not to do it at all.
>
> I'm not sure it's a good idea for dumpTableSchema() to leave out
> specifying the compression method if it happens to be pglz. I think we
> definitely shouldn't do it in binary-upgrade mode. What if we changed
> the default in a future release? For that matter, even 0002 could make
> the current approach unsafe I think, anyway.
>
> The changes to pg_dump.h look like they haven't had a visit from
> pgindent. You should probably try to do that for the whole patch,
> though it's a bit annoying since you'll have to manually remove
> unrelated changes to the same files that are being modified by the
> patch. Also, why the extra blank line here?
>
> GetAttributeCompression() is hard to understand. I suggest changing
> the comment to "resolve column compression specification to an OID"
> and somehow rejigger the code so that you aren't using one not-NULL
> test and one NULL test on the same variable. Like maybe change the
> first part to if (!IsStorageCompressible(typstorage)) { if
> (compression == NULL) return InvalidOid; ereport(ERROR, ...); }
>
> It puzzles me that CompareCompressionMethodAndDecompress() calls
> slot_getallattrs() just before clearing the slot. It seems like this
> ought to happen before we loop over the attributes, so that we don't
> need to call slot_getattr() every time. See the comment for that
> function. But even if we didn't do that for some reason, why would we
> do it here? If it's already been done, it shouldn't do anything, and
> if it hasn't been done, it might overwrite some of the values we just
> poked into tts_values. It also seems suspicious that we can get away
> with clearing the slot and then again marking it valid. I'm not sure
> it really works like that. Like, can't clearing the slot invalidate
> pointers stored in tts_values[]? For instance, if they are pointers
> into an in-memory heap tuple, tts_heap_clear() is going to free the
> tuple; if they are pointers into a buffer, tts_buffer_heap_clear() is
> going to unpin it. I think the supported procedure for this sort of
> thing is to have a second slot, set tts_values, tts_isnull etc. and
> then materialize the slot. After materializing the new slot, it's
> independent of the old slot, which can then be cleared. See for
> example tts_virtual_materialize(). The whole approach you've taken
> here might need to be rethought a bit. I think you are right to want
> to avoid copying everything over into a new slot if nothing needs to
> be done, and I think we should definitely keep that optimization, but
> I think if you need to copy stuff, you have to do the above procedure
> and then continue using the other slot instead of the original one.
> Some places I think we have functions that return either the original
> slot or a different one depending on how it goes; that might be a
> useful idea here. But, you also can't just spam-create slots; it's
> important that whatever ones we end up with get reused for every
> tuple.
>
> Doesn't the change to 
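
For what it's worth, here is a minimal sketch of the two-slot approach
described above (the helper name and signature are made up for
illustration only; this is not the patch's code):

#include "postgres.h"
#include "executor/tuptable.h"

/*
 * Build the modified values into a second, virtual slot, materialize it
 * so that it owns its own copies of the datums, and only then clear the
 * original slot.  The caller continues with the returned slot.
 */
static TupleTableSlot *
switch_to_new_slot(TupleTableSlot *origslot, TupleTableSlot *newslot,
                   Datum *values, bool *isnull, int natts)
{
    ExecClearTuple(newslot);
    memcpy(newslot->tts_values, values, natts * sizeof(Datum));
    memcpy(newslot->tts_isnull, isnull, natts * sizeof(bool));
    ExecStoreVirtualTuple(newslot);

    /* copy out-of-line datums into the new slot's own memory context */
    ExecMaterializeSlot(newslot);

    /* now it is safe to clear the original slot */
    ExecClearTuple(origslot);

    return newslot;
}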

Re: memory leak in auto_explain

2021-02-02 Thread japin


On Wed, 03 Feb 2021 at 02:13, Tom Lane  wrote:
> japin  writes:
>> Here's my analysis:
>> 1) In explain_ExecutorEnd(), it will create an ExplainState in the SQL
>> function's memory context, which is long-lived, causing the memory to
>> grow.
>
> Yeah, agreed.  I think people looking at this have assumed that the
> ExecutorEnd hook would automatically be running in the executor's
> per-query context, but that's not so; we haven't yet entered
> standard_ExecutorEnd where the context switch is.  The caller's
> context is likely to be much longer-lived than the executor's.
>
> I think we should put the switch one level further out than you have
> it here, just to be sure that InstrEndLoop is covered too (that doesn't
> allocate memory, but auto_explain shouldn't assume that).  Otherwise
> seems like a good fix.
>

Thanks for your review and clarification.
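
For the archive, a minimal sketch of that shape for contrib/auto_explain's
explain_ExecutorEnd() hook, assuming the fix is simply to run the hook's
work (including InstrEndLoop) in the executor's per-query context; this is
only an illustration of the idea, not the actual patch:

static void
explain_ExecutorEnd(QueryDesc *queryDesc)
{
    if (queryDesc->totaltime && auto_explain_enabled())
    {
        MemoryContext oldcxt;

        /* do the work in the executor's short-lived per-query context */
        oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);

        InstrEndLoop(queryDesc->totaltime);

        /* ... build the ExplainState and log the plan as before ... */

        MemoryContextSwitchTo(oldcxt);
    }

    if (prev_ExecutorEnd)
        prev_ExecutorEnd(queryDesc);
    else
        standard_ExecutorEnd(queryDesc);
}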

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Single transaction in the tablesync worker?

2021-02-02 Thread Peter Smith
On Wed, Feb 3, 2021 at 1:34 PM Amit Kapila  wrote:
>
> On Wed, Feb 3, 2021 at 6:38 AM Peter Smith  wrote:
> >
> > On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila  wrote:
> > >
> > > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith  wrote:
> > > >
> > > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> > > > tried a simple test where I do a DROP TABLE with very bad timing for
> > > > the tablesync worker. It seems that doing this can cause the sync
> > > > worker's MyLogicalRepWorker->relid to become invalid.
> > > >
> > >
> > > I think this should be fixed by latest patch because I have disallowed
> > > drop of a table when its synchronization is in progress. You can check
> > > once and let me know if the issue still exists?
> > >
> >
> > FYI - I confirmed that the problem scenario that I reported yesterday
> > is no longer possible because now the V25 patch is disallowing the
> > DROP TABLE while the tablesync is still running.
> >
>
> Thanks for the confirmation. BTW, can you please check if we can
> reproduce that problem without this patch? If so, we might want to
> apply this fix irrespective of this patch. If not, why not?
>

Yes, this was an existing postgres bug. It is independent of the patch.

I can reproduce exactly the same stacktrace using the HEAD src pulled @ 3/Feb.

PSA my test logs showing the details.


Kind Regards,
Peter Smith.
Fujitsu Australia
Test Scenario:

1. INSERT data so tablesync should copy something
2. While paused in LogicalRepSyncTableStart do a DROP TABLE to rip the rug out 
from under the sync worker!
3. Continue the sync worker see what happens

All code is from PG HEAD (3/Feb) except the "!!>>" added to allow me to attach 
to the debugger.

==

##
## Insert data
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 
'foo');"
INSERT 0 1

##
## SUBSCRIBE and continue to breakpoint at top of tablesync function 
LogicalRepSyncTableStart
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION 
tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' 
PUBLICATION tap_pub;"
2021-02-03 13:42:20.031 AEDT [9555] LOG:  logical decoding found consistent 
point at 0/16605E8
2021-02-03 13:42:20.031 AEDT [9555] DETAIL:  There are no running transactions.
2021-02-03 13:42:20.031 AEDT [9555] STATEMENT:  CREATE_REPLICATION_SLOT 
"tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2021-02-03 13:42:20.043 AEDT [9556] LOG:  logical replication apply worker for 
subscription "tap_sub" has started
2021-02-03 13:42:20.043 AEDT [9556] LOG:  !!>> The apply worker process has PID 
= 9556
[postgres@CentOS7-x64 ~]$ 2021-02-03 13:42:20.052 AEDT [9562] LOG:  starting 
logical decoding for slot "tap_sub"
2021-02-03 13:42:20.052 AEDT [9562] DETAIL:  Streaming transactions committing 
after 0/1660620, reading WAL from 0/16605E8.
2021-02-03 13:42:20.052 AEDT [9562] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-03 13:42:20.052 AEDT [9562] LOG:  logical decoding found consistent 
point at 0/16605E8
2021-02-03 13:42:20.052 AEDT [9562] DETAIL:  There are no running transactions.
2021-02-03 13:42:20.052 AEDT [9562] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-03 13:42:20.056 AEDT [9564] LOG:  logical replication table 
synchronization worker for subscription "tap_sub", table "test_tab" has started
2021-02-03 13:42:20.056 AEDT [9564] LOG:  !!>> The tablesync worker process has 
PID = 9564
2021-02-03 13:42:20.056 AEDT [9564] LOG:  !!>>


Sleeping 30 secs. For debugging, attach to process 9564 now!

## PID 9564 is sync worker
## PID 9556 is apply worker

##
## While paused in debugger for the tablesync worker do DROP TABLE on 
subscriber.
## Note we have not done any ALTER SUBSCRIPTION.
##

psql -d test_sub -p 54321 -c "DROP TABLE test_tab;"
DROP TABLE

## Following stacktrace LOG happens

[postgres@CentOS7-x64 ~]$ TRAP: FailedAssertion("strvalue != NULL", File: 
"snprintf.c", Line: 442, PID: 9564)
postgres: logical replication worker for subscription 16407 sync 16385 
(ExceptionalCondition+0xb9)[0xad8f4b]
postgres: logical replication worker for subscription 16407 sync 16385 
[0xb4c6ee]
postgres: logical replication worker for subscription 16407 sync 16385 
(pg_vsnprintf+0x7c)[0xb4c072]
postgres: logical replication worker for subscription 16407 sync 16385 
(pvsnprintf+0x30)[0xb3f26f]
postgres: logical replication worker for subscription 16407 sync 16385 
(appendStringInfoVA+0x80)[0xb40cb6]
postgres: logical replication worker for subscription 16407 sync 16385 
(errmsg+0x183)[0xad9d33]
postgres: logical replication worker for subscription 16407 sync 16385 
[0x8c52f3]
postgres: logical replication worker for subscription 16407 sync 16385 

Re: Single transaction in the tablesync worker?

2021-02-02 Thread Amit Kapila
On Wed, Feb 3, 2021 at 6:38 AM Peter Smith  wrote:
>
> On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila  wrote:
> >
> > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith  wrote:
> > >
> > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> > > tried a simple test where I do a DROP TABLE with very bad timing for
> > > the tablesync worker. It seems that doing this can cause the sync
> > > worker's MyLogicalRepWorker->relid to become invalid.
> > >
> >
> > I think this should be fixed by latest patch because I have disallowed
> > drop of a table when its synchronization is in progress. You can check
> > once and let me know if the issue still exists?
> >
>
> FYI - I confirmed that the problem scenario that I reported yesterday
> is no longer possible because now the V25 patch is disallowing the
> DROP TABLE while the tablesync is still running.
>

Thanks for the confirmation. BTW, can you please check if we can
reproduce that problem without this patch? If so, we might want to
apply this fix irrespective of this patch. If not, why not?

-- 
With Regards,
Amit Kapila.




Re: libpq debug log

2021-02-02 Thread Kyotaro Horiguchi
At Wed, 3 Feb 2021 01:26:41 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> (41)
> +void
> +PQtraceEx(PGconn *conn, FILE *debug_port, int flags)
> +{
> + if (conn == NULL)
> + return;
> ...
> + if (!debug_port)
> + return;
> 
> The if should better be as follows to match the style of existing surrounding 
> code.
> 
> + if (debug_port == NULL)

I don't have a particular preference here, but FWIW the current
PQuntrace is doing this:

> void
> PQuntrace(PGconn *conn)
> {
>   if (conn == NULL)
>   return;
>   if (conn->Pfdebug)
>   {
>   fflush(conn->Pfdebug);

> (44)
> + if ((conn->traceFlags & PQTRACE_OUTPUT_TIMESTAMPS) == 0)
> + timestr_p = pqLogFormatTimestamp(timestr, sizeof(timestr));
> 
> == 0 should be removed.

Looking at the doc mentioned in comment #39:

+  flags contains flag bits describing the operating mode
+  of tracing.  If (flags & PQTRACE_OUTPUT_TIMESTAMPS) is
+  true, then timestamp is not printed with each message.

PQTRACE_OUTPUT_TIMESTAMPS is designed to *inhibit* timestamps from
being prepended. If that is actually intended, the symbol name should
be PQTRACE_NOOUTPUT_TIMESTAMP. Otherwise, the doc needs to be fixed.

By the way, removing "== 0" makes it difficult to tell whether the
condition is correct or not; I recommend using "!= 0" rather than
removing "== 0".
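
That is, if the sense of the test is to be flipped as suggested, I
would still keep an explicit comparison, e.g. (sketch only):

    if ((conn->traceFlags & PQTRACE_OUTPUT_TIMESTAMPS) != 0)
        timestr_p = pqLogFormatTimestamp(timestr, sizeof(timestr));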

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: libpq debug log

2021-02-02 Thread tsunakawa.ta...@fujitsu.com
From: 'Alvaro Herrera' 
> > +   conn->fe_msg->num_fields != DEF_FE_MSGFIELDS)

> The rationale for that second condition is this: if the memory allocated
> is the initial size, we don't free memory, because it would just be
> allocated of the same size next time, and that size is not very big, so
> it's not a big deal if we just let it be, so that it is reused if we
> call PQtrace() again later.  However, if the allocated size is larger
> than default, then it is possible that some previous tracing run has
> enlarged the trace struct to a very large amount of memory, and we don't
> want to leave that in place.

Ah, understood.  In that case, num_fields should be max_fields.

This has reminded me that freePGconn() should free be_msg and fe_msg.


Regards
Takayuki Tsunakawa




Re: adding wait_start column to pg_locks

2021-02-02 Thread Fujii Masao




On 2021/02/03 1:49, Fujii Masao wrote:



On 2021/02/02 22:00, torikoshia wrote:

On 2021-01-25 23:44, Fujii Masao wrote:

Another comment is; Doesn't the change of MyProc->waitStart need the
lock table's partition lock? If yes, we can do that by moving
LWLockRelease(partitionLock) just after the change of
MyProc->waitStart, but which causes the time that lwlock is being held
to be long. So maybe we need another way to do that.


Thanks for your comments!

It would be ideal for the consistency of the view to record "waitstart" while
holding the lock table's partition lock.
However, as you pointed out, that would have a non-negligible performance impact.

I may be missing something, but as far as I can see, the influence of not holding the lock is that 
"waitstart" can be NULL even though "granted" is false.

I think people want to know the start time of the lock when locks are held for 
a long time.
In that case, "waitstart" should have already been recorded.


Sounds reasonable.



If this is true, I think the current implementation may be enough, on the condition that users 
understand that "waitStart" can be NULL while "granted" is false.

Attached a patch describing this in the doc and comments.


Any Thoughts?


64-bit fetches are not atomic on some platforms. So isn't a spinlock necessary when updating 
"waitStart" without holding the partition lock? And doesn't GetLockStatusData() need a spinlock 
when reading "waitStart"?


Also, it might be worth considering 64-bit atomic operations like 
pg_atomic_read_u64() for that.
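
For example, a rough sketch of what the atomic approach could look like
(field placement and variable names here are only illustrative, not the
patch's):

    /* in PGPROC, initialized with pg_atomic_init_u64() at proc init: */
    pg_atomic_uint64 waitStart;    /* time when we started waiting for a lock */

    /* writer side, where the wait begins (no partition lock needed): */
    pg_atomic_write_u64(&MyProc->waitStart,
                        (uint64) GetCurrentTimestamp());

    /* reader side, e.g. while building the pg_locks view: */
    waitstart = (TimestampTz) pg_atomic_read_u64(&proc->waitStart);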

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Single transaction in the tablesync worker?

2021-02-02 Thread Ajin Cherian
On Wed, Feb 3, 2021 at 12:24 AM Amit Kapila  wrote:

> The problem here is that we are allowing to drop the table when table
> synchronization is still in progress and then we don't have any way to
> know the corresponding slot or origin. I think we can try to drop the
> slot and origin as well but that is not a good idea because slots once
> dropped won't be rolled back. So, I have added a fix to disallow the
> drop of the table when table synchronization is still in progress.
> Apart from that, I have fixed comments raised by Peter as discussed
> above and made some additional changes in comments, code (code changes
> are cosmetic), and docs.
>
> Let me know if the issue reported is fixed or not?

Yes, the issue is fixed; now the table drop results in an error.

postgres=# drop table tab_rep ;
ERROR:  could not drop relation mapping for subscription "tap_sub"
DETAIL:  Table synchronization for relation "tab_rep" is in progress
and is in state "f".
HINT:  Use ALTER SUBSCRIPTION ... ENABLE to enable subscription if not
already enabled or use DROP SUBSCRIPTION ... to drop the subscription.

regards,
Ajin Cherian
Fujitsu Australia




RE: libpq debug log

2021-02-02 Thread tsunakawa.ta...@fujitsu.com
(39)
+  of tracing.  If (flags & PQTRACE_OUTPUT_TIMESTAMPS) is
+  true, then timestamp is not printed with each message.

The flag name (OUTPUT) and its description (not printed) don't match.

I think you can use a less programmatic expression like "If
flags contains PQTRACE_OUTPUT_TIMESTAMPS".


(40)
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
+PQtraceEx 180
\ No newline at end of file

What's the second line?  Isn't the file missing an empty line at the end?

(41)
+void
+PQtraceEx(PGconn *conn, FILE *debug_port, int flags)
+{
+   if (conn == NULL)
+   return;
...
+   if (!debug_port)
+   return;

The if should better be as follows to match the style of existing surrounding 
code.

+   if (debug_port == NULL)


(42)
+pqLogFormatTimestamp(char *timestr, Size ts_len)

I think you should use int or size_t instead of Size here, because other libpq 
code uses them.  int should be enough.  If the compiler gives warnings, prepend 
"(int)" before sizeof() at call sites.


(43)
+   /* append microseconds */
+   sprintf(timestr + strlen(timestr), ".%06d", (int) (tval.tv_usec / 
1000));

"/ 1000" should be removed; tval.tv_usec is already in microseconds.


(44)
+   if ((conn->traceFlags & PQTRACE_OUTPUT_TIMESTAMPS) == 0)
+   timestr_p = pqLogFormatTimestamp(timestr, sizeof(timestr));

== 0 should be removed.


Regards
Takayuki Tsunakawa



Re: Single transaction in the tablesync worker?

2021-02-02 Thread Peter Smith
On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila  wrote:
>
> On Tue, Feb 2, 2021 at 3:31 PM Peter Smith  wrote:
> >
> > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> > tried a simple test where I do a DROP TABLE with very bad timing for
> > the tablesync worker. It seems that doing this can cause the sync
> > worker's MyLogicalRepWorker->relid to become invalid.
> >
>
> I think this should be fixed by latest patch because I have disallowed
> drop of a table when its synchronization is in progress. You can check
> once and let me know if the issue still exists?
>

FYI - I confirmed that the problem scenario that I reported yesterday
is no longer possible because now the V25 patch is disallowing the
DROP TABLE while the tablesync is still running.

PSA my test logs showing it is now working as expected.


Kind Regards,
Peter Smith.
Fujitsu Australia
Test Scenario

1. INSERT data so tablesync should copy something
2. While paused in LogicalRepSyncTableStart do a DROP TABLE to rip the rug out 
from under the sync worker!
3. Continue the sync worker see what happens

==

##
## Insert data
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 
'foo');"
INSERT 0 1


##
## SUBSCRIBE and continue to breakpoint at top of tablesync function 
LogicalRepSyncTableStart
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION 
tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' 
PUBLICATION tap_pub;"
2021-02-03 11:46:07.207 AEDT [8298] LOG:  logical decoding found consistent 
point at 0/1654698
2021-02-03 11:46:07.207 AEDT [8298] DETAIL:  There are no running transactions.
2021-02-03 11:46:07.207 AEDT [8298] STATEMENT:  CREATE_REPLICATION_SLOT 
"tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2021-02-03 11:46:07.218 AEDT [8302] LOG:  logical replication apply worker for 
subscription "tap_sub" has started
2021-02-03 11:46:07.218 AEDT [8302] LOG:  !!>> The apply worker process has PID 
= 8302
[postgres@CentOS7-x64 ~]$ 2021-02-03 11:46:07.224 AEDT [8306] LOG:  starting 
logical decoding for slot "tap_sub"
2021-02-03 11:46:07.224 AEDT [8306] DETAIL:  Streaming transactions committing 
after 0/16546D0, reading WAL from 0/1654698.
2021-02-03 11:46:07.224 AEDT [8306] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-03 11:46:07.225 AEDT [8306] LOG:  logical decoding found consistent 
point at 0/1654698
2021-02-03 11:46:07.225 AEDT [8306] DETAIL:  There are no running transactions.
2021-02-03 11:46:07.225 AEDT [8306] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-03 11:46:07.225 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:07.225 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:07.229 AEDT [8307] LOG:  logical replication table 
synchronization worker for subscription "tap_sub", table "test_tab" has started
2021-02-03 11:46:07.229 AEDT [8307] LOG:  !!>> The tablesync worker process has 
PID = 8307
2021-02-03 11:46:07.229 AEDT [8307] LOG:  !!>>


Sleeping 30 secs. For debugging, attach to process 8307 now!

## PID 8302 is apply worker
## PID 8307 is tablesync worker

2021-02-03 11:46:08.241 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:08.241 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:09.253 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:09.253 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:10.255 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:10.255 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:11.258 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:11.258 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:12.263 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:12.263 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:13.265 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:13.265 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:14.275 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:14.275 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:15.280 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:15.280 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 11:46:16.282 AEDT [8302] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-02-03 11:46:16.282 AEDT [8302] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-03 

RE: Determine parallel-safety of partition relations for Inserts

2021-02-02 Thread Hou, Zhijie
> For v3_0003-reloption-parallel_dml-src.patch :
> +   table_close(rel, NoLock);

> Since the rel would always be closed, it seems the return value from 
> RelationGetParallelDML() can be assigned to a variable, followed by call to 
> table_close(), then the return statement.

Thanks for the comment. Fixed in the latest patch.
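
That is, the function now presumably ends with something along these
lines (the helper's exact signature and the variable's type are
whatever the patch defines):

    result = RelationGetParallelDML(rel);
    table_close(rel, NoLock);

    return result;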

Best regards,
houzj




RE: Determine parallel-safety of partition relations for Inserts

2021-02-02 Thread Hou, Zhijie
Hi,

Attaching v5 patches with the changes:
  * rebase the code on Greg's latest parallel insert patch
  * fix some code style.

Please consider it for further review.

Best regards,
Houzj
 
 





v5_0004-reloption-parallel_dml-test-and-doc.patch
Description: v5_0004-reloption-parallel_dml-test-and-doc.patch


v5_0001-guc-option-enable_parallel_dml-src.patch
Description: v5_0001-guc-option-enable_parallel_dml-src.patch


v5_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v5_0002-guc-option-enable_parallel_dml-doc-and-test.patch


v5_0003-reloption-parallel_dml-src.patch.patch
Description: v5_0003-reloption-parallel_dml-src.patch.patch


Re: LogwrtResult contended spinlock

2021-02-02 Thread Andres Freund
Hi,

On 2021-02-02 20:19:19 -0300, Alvaro Herrera wrote:
> > > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata,
> > >   {
> > >   /* advance global request to include new block(s)
> > >   */
> > >   pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, 
> > > EndPos);
> > > + pg_memory_barrier();
> >
> > That's not really useful - the path that actually updates already
> > implies a barrier. It'd probably be better to add a barrier to a "never
> > executed cmpxchg" fastpath.
>
> Got it.  Do you mean in pg_atomic_monotonic_advance_u64() itself?

Yes.


> I'm not sure which is the nicer semantics.  (If it's got to be at the
> caller, then we'll need to return a boolean from there, which sounds
> worse.)

Nearly all other modifying atomic operations have full barrier
semantics, so I think it'd be better to have it inside the
pg_atomic_monotonic_advance_u64().


> +/*
> + * Monotonically advance the given variable using only atomic operations 
> until
> + * it's at least the target value.
> + */
> +static inline void
> +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 
> target_)
> +{
> + uint64  currval;
> +
> +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> + AssertPointerAlignment(ptr, 8);
> +#endif
> +
> + currval = pg_atomic_read_u64(ptr);
> + while (currval < target_)
> + {
> + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
> + break;
> + }
> +}

So I think it'd be

static inline void
pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
{
uint64  currval;

#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
AssertPointerAlignment(ptr, 8);
#endif

currval = pg_atomic_read_u64(ptr);
if (currval >= target_)
{
pg_memory_barrier();
return;
}

while (currval < target_)
{
if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
break;
}
}


> @@ -1172,9 +1170,10 @@ XLogInsertRecord(XLogRecData *rdata,
>   /* advance global request to include new block(s) */
>   if (XLogCtl->LogwrtRqst.Write < EndPos)
>   XLogCtl->LogwrtRqst.Write = EndPos;
> - /* update local result copy while I have the chance */
> - LogwrtResult = XLogCtl->LogwrtResult;
>   SpinLockRelease(&XLogCtl->info_lck);
> + /* update local result copy */
> + LogwrtResult.Write = 
> pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
> + LogwrtResult.Flush = 
> pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
>   }

As mentioned before - it's not clear to me why this is a valid thing to
do without verifying all LogwrtResult.* usages. You can get updates
completely out of order / independently.


> @@ -2675,8 +2673,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
>* code in a couple of places.
>*/
>   {
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Write, 
> LogwrtResult.Write);
> + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Flush, 
> LogwrtResult.Flush);
> + pg_memory_barrier();
>   SpinLockAcquire(&XLogCtl->info_lck);
> - XLogCtl->LogwrtResult = LogwrtResult;

I still don't see why we need "locked" atomic operations here, rather
than just a pg_atomic_write_u64(). They can only be modified
with WALWriteLock held.  There's two reasons for using a spinlock in
this place:
1) it avoids torn reads of 64bit values -
   pg_atomic_write_u64()/pg_atomic_read_u64() avoid that already.
2) it ensures that Write/Flush are updated in unison - but that's not
   useful anymore, given that other places now read the variables
   separately.
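
In other words, a sketch of what should suffice there, assuming
Write/Flush are indeed only modified with WALWriteLock held:

    pg_atomic_write_u64(&XLogCtl->LogwrtResult.Write, LogwrtResult.Write);
    pg_atomic_write_u64(&XLogCtl->LogwrtResult.Flush, LogwrtResult.Flush);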


> @@ -3064,8 +3063,10 @@ XLogBackgroundFlush(void)
>   WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
>
>   /* if we have already flushed that far, consider async commit records */
> + LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
>   if (WriteRqst.Write <= LogwrtResult.Flush)
>   {
> + pg_memory_barrier();
>   SpinLockAcquire(&XLogCtl->info_lck);
>   WriteRqst.Write = XLogCtl->asyncXactLSN;
>   SpinLockRelease(&XLogCtl->info_lck);

A SpinLockAcquire() is a full memory barrier on its own I think. This'd
probably better solved by just making asyncXactLSN atomic...
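
A rough sketch of that direction (the atomic declaration is hypothetical here,
not part of the posted patch):

/* In XLogCtlData: replace the spinlock-protected field (sketch) */
pg_atomic_uint64	asyncXactLSN;	/* LSN of newest async commit/abort */

/*
 * Reader side in XLogBackgroundFlush(): no info_lck needed, and
 * pg_atomic_read_u64() already avoids torn 64-bit reads.
 */
WriteRqst.Write = pg_atomic_read_u64(&XLogCtl->asyncXactLSN);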


Greetings,

Andres Freund




Re: Recording foreign key relationships for the system catalogs

2021-02-02 Thread Tom Lane
I wrote:
> * It would now be possible to remove the PGNSP and PGUID kluges
> entirely in favor of plain BKI_LOOKUP references to pg_namespace
> and pg_authid.  The catalog header usage would get a little
> more verbose: BKI_DEFAULT(PGNSP) becomes BKI_DEFAULT(pg_catalog)
> and BKI_DEFAULT(PGUID) becomes BKI_DEFAULT(POSTGRES).  I'm a bit
> inclined to do it, simply to remove one bit of mechanism that has
> to be documented; but it's material for a separate patch perhaps.

Here's a patch for that part.  I think this is probably a good
idea not only because it removes magic, but because now that we
have various predefined roles it's becoming more and more likely
that some of those will need to be cross-referenced in other
catalogs' initial data.  With this change, nothing special
will be needed for that.  Multiple built-in schemas also become
more feasible than they were.

regards, tom lane

diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index 6d3c5be67f..db1b3d5e9a 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -540,17 +540,6 @@
   expected to be in the pg_catalog schema.
  
 
-
-
- 
-  In addition to the generic lookup mechanisms, there is a special
-  convention that PGNSP is replaced by the OID of
-  the pg_catalog schema,
-  and PGUID is replaced by the OID of the bootstrap
-  superuser role.  These usages are somewhat historical but so far
-  there hasn't been a need to generalize them.
- 
-

 

diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 5bdc7adc44..b159958112 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -184,15 +184,9 @@ my $GenbkiNextOid = $FirstGenbkiObjectId;
 # within a given Postgres release, such as fixed OIDs.  Do not substitute
 # anything that could depend on platform or configuration.  (The right place
 # to handle those sorts of things is in initdb.c's bootstrap_template1().)
-my $BOOTSTRAP_SUPERUSERID =
-  Catalog::FindDefinedSymbolFromData($catalog_data{pg_authid},
-	'BOOTSTRAP_SUPERUSERID');
 my $C_COLLATION_OID =
   Catalog::FindDefinedSymbolFromData($catalog_data{pg_collation},
 	'C_COLLATION_OID');
-my $PG_CATALOG_NAMESPACE =
-  Catalog::FindDefinedSymbolFromData($catalog_data{pg_namespace},
-	'PG_CATALOG_NAMESPACE');
 
 
 # Fill in pg_class.relnatts by looking at the referenced catalog's schema.
@@ -213,11 +207,12 @@ foreach my $row (@{ $catalog_data{pg_am} })
 	$amoids{ $row->{amname} } = $row->{oid};
 }
 
-# There is only one authid at bootstrap time, and we handle it specially:
-# the usually-defaulted symbol PGUID becomes the bootstrap superuser's OID.
-# (We could drop this in favor of writing out BKI_DEFAULT(POSTGRES) ...)
+# role OID lookup
 my %authidoids;
-$authidoids{'PGUID'} = $BOOTSTRAP_SUPERUSERID;
+foreach my $row (@{ $catalog_data{pg_authid} })
+{
+	$authidoids{ $row->{rolname} } = $row->{oid};
+}
 
 # class (relation) OID lookup (note this only covers bootstrap catalogs!)
 my %classoids;
@@ -240,11 +235,12 @@ foreach my $row (@{ $catalog_data{pg_language} })
 	$langoids{ $row->{lanname} } = $row->{oid};
 }
 
-# There is only one namespace at bootstrap time, and we handle it specially:
-# the usually-defaulted symbol PGNSP becomes the pg_catalog namespace's OID.
-# (We could drop this in favor of writing out BKI_DEFAULT(pg_catalog) ...)
+# namespace (schema) OID lookup
 my %namespaceoids;
-$namespaceoids{'PGNSP'} = $PG_CATALOG_NAMESPACE;
+foreach my $row (@{ $catalog_data{pg_namespace} })
+{
+	$namespaceoids{ $row->{nspname} } = $row->{oid};
+}
 
 # opclass OID lookup
 my %opcoids;
diff --git a/src/include/catalog/pg_authid.dat b/src/include/catalog/pg_authid.dat
index a643a09588..87d917ffc3 100644
--- a/src/include/catalog/pg_authid.dat
+++ b/src/include/catalog/pg_authid.dat
@@ -15,6 +15,10 @@
 # The C code typically refers to these roles using the #define symbols,
 # so make sure every entry has an oid_symbol value.
 
+# The bootstrap superuser is named POSTGRES according to this data and
+# according to BKI_DEFAULT entries in other catalogs.  However, initdb
+# will replace that at database initialization time.
+
 { oid => '10', oid_symbol => 'BOOTSTRAP_SUPERUSERID',
   rolname => 'POSTGRES', rolsuper => 't', rolinherit => 't',
   rolcreaterole => 't', rolcreatedb => 't', rolcanlogin => 't',
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index bb6938caa2..3e37729436 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -38,7 +38,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
 	NameData	relname;
 
 	/* OID of namespace containing this class */
-	Oid			relnamespace BKI_DEFAULT(PGNSP) BKI_LOOKUP(pg_namespace);
+	Oid			relnamespace BKI_DEFAULT(pg_catalog) BKI_LOOKUP(pg_namespace);
 
 	/* OID of entry in pg_type for relation's implicit row type, if any */
 	Oid			reltype 

Re: LogwrtResult contended spinlock

2021-02-02 Thread Alvaro Herrera
Hello,

So I addressed about half of your comments in this version merely by
fixing silly bugs.  The problem I had which I described as
"synchronization fails" was one of those silly bugs.

So in further thinking, it seems simpler to make only LogwrtResult
atomic, and leave LogwrtRqst as currently, using the spinlock.  This
should solve the contention problem we saw at the customer (but I've
asked Jaime very nicely to do a test run, if possible, to confirm).

For things like logical replication, which call GetFlushRecPtr() very
frequently (causing the spinlock issue we saw) it is good, because we're
no longer hitting the spinlock at all in that case.

I have another (pretty mechanical) patch that renames LogwrtResult.Write
to LogWriteResult and LogwrtResult.Flush to LogFlushResult.  That more
clearly shows that we're no longer updating them in unison.  Didn't want
to attach here because I didn't rebase on current one.  But it seems
logical: there's no longer any point in doing struct assignment, which
is the only thing that stuff was good for.


On 2021-Jan-29, Andres Freund wrote:

> > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata,
> > {
> > /* advance global request to include new block(s)
> > */
> > pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, EndPos);
> > +   pg_memory_barrier();
> 
> That's not really useful - the path that actually updates already
> implies a barrier. It'd probably be better to add a barrier to a "never
> executed cmpxchg" fastpath.

Got it.  Do you mean in pg_atomic_monotonic_advance_u64() itself?  I'm
not sure which is the nicer semantics.  (If it's got to be at the
caller, then we'll need to return a boolean from there, which sounds
worse.)

> > @@ -2905,6 +2909,7 @@ XLogFlush(XLogRecPtr record)
> > WriteRqstPtr = pg_atomic_read_u64(&XLogCtl->LogwrtRqst.Write);
> > LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
> > LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
> > +   pg_read_barrier();
> 
> I'm not sure you really can get away with just a read barrier in these
> places. We can't e.g. have later updates to other shared memory
> variables be "moved" to before the barrier (which a read barrier
> allows).

Ah, that makes sense.

I have not really studied the barrier locations terribly closely in this
version of the patch.  It probably misses some (eg. in GetFlushRecPtr
and GetXLogWriteRecPtr).  It is passing the tests for me, but alone
that's probably not enough.  I'm gonna try and study the generated
assembly and see if I can make sense of things ...

-- 
Álvaro Herrera         39°49'30"S 73°17'W
>From ca05447f41270d6b3ac3b6fcf816f86aba86d08f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 29 Jan 2021 22:34:23 -0300
Subject: [PATCH 1/2] add pg_atomic_monotonic_advance_u64

---
 src/include/port/atomics.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 856338f161..a1c4764d18 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -519,6 +519,27 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+/*
+ * Monotonically advance the given variable using only atomic operations until
+ * it's at least the target value.
+ */
+static inline void
+pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
+{
+	uint64		currval;
+
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+
+	currval = pg_atomic_read_u64(ptr);
+	while (currval < target_)
+	{
		if (pg_atomic_compare_exchange_u64(ptr, &currval, target_))
+			break;
+	}
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif			/* ATOMICS_H */
-- 
2.20.1

>From d185cf50a47a0f9e346e49ccda038bb016ce246b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 2 Feb 2021 14:03:43 -0300
Subject: [PATCH 2/2] make LogwrtResult atomic

---
 src/backend/access/transam/xlog.c | 66 +++
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f03bd473e2..10802bd56f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -405,12 +405,9 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  * These structs are identical but are declared separately to indicate their
  * slightly different functions.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * 

Re: Proposal: Save user's original authenticated identity for logging

2021-02-02 Thread Jacob Champion
On Thu, 2021-01-28 at 18:22 +, Jacob Champion wrote:
> = Proposal =
> 
> I propose that every auth method should store the string it uses to
> identify a user -- what I'll call an "authenticated identity" -- into
> one central location in Port, after authentication succeeds but before
> any pg_ident authorization occurs.

Thanks everyone for all of the feedback! Here's my summary of the
conversation so far:

- The idea of storing the user's original identity consistently across
all auth methods seemed to be positively received.

- Exposing this identity through log_line_prefix was not as well-
received, landing somewhere between "meh" and "no thanks". The main
concern was log bloat/expense.

- Exposing it through the CSV log got the same reception: if we expose
it through log_line_prefix, we should expose it through CSV, but no one
seemed particularly excited about either.

- The idea of logging this information once per session, as part of
log_connection, got a more positive response. That way the information
can still be obtained, but it doesn't clutter every log line.

- There was also some interest in exposing this through the statistics
collector, either as a superuser-only feature or via the
pg_read_all_stats role.

- There was some discussion around *which* string to choose as the
identifier for more complicated cases, such as TLS client certificates.

- Other improvements around third-party authorization and role
management were discussed, including the ability to auto-create
nonexistent roles, to sync role definitions as a first-party feature,
and to query an external system for role authorization.

(Let me know if there's something else I've missed.)

== My Plans ==

Given the feedback above, I'll continue to flesh out the PoC patch,
focusing on 1) storing the identity in a single place for all auth
methods and 2) exposing it consistently in the logs as part of
log_connections. I'll drop the log_line_prefix format specifier from
the patch and see what that does to the testing side of things. I also
plan to write a follow-up patch to add the authenticated identity to
the statistics collector, with pg_get_authenticated_identity() to
retrieve it.
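
As a usage sketch only (the function does not exist yet and its exact
signature is undecided; the per-backend argument here is just a guess):

-- hypothetical: show each backend's original authenticated identity
SELECT pid, usename, pg_get_authenticated_identity(pid)
FROM pg_stat_activity;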

I'm excited to see where the third-party authz and role management
conversations go, but I won't focus on those for my initial patchset. I
think this patch has use even if those ideas are implemented too.

--Jacob


Re: Recording foreign key relationships for the system catalogs

2021-02-02 Thread Tom Lane
"Joel Jacobson"  writes:
> On Tue, Feb 2, 2021, at 17:00, Tom Lane wrote:
>> Appreciate the review!  Please confirm if you agree with above
>> analysis.

> Yes, I agree with the analysis.

Cool, thanks.  I've pushed it now.

regards, tom lane




Re: Background writer and checkpointer in crash recovery

2021-02-02 Thread Robert Haas
On Sat, Aug 29, 2020 at 8:13 PM Thomas Munro  wrote:
> Currently we don't run the bgwriter process during crash recovery.
> I've CCed Simon and Heikki who established this in commit cdd46c76.
> Based on that commit message, I think the bar to clear to change the
> policy is to show that it's useful, and that it doesn't make crash
> recovery less robust.   See the other thread for some initial evidence
> of usefulness from Jakub Wartak.  I think it also just makes intuitive
> sense that it's got to help bigger-than-buffer-pool crash recovery if
> you can shift buffer eviction out of the recovery loop.   As for
> robustness, I suppose we could provide the option to turn it off just
> in case that turns out to be useful in some scenarios, but I'm
> wondering why we would expect something that we routinely run in
> archive/streaming recovery to reduce robustness in only slightly
> different circumstances.
>
> Here's an experiment-grade patch, comments welcome, though at this
> stage it's primarily thoughts about the concept that I'm hoping to
> solicit.

I think the way it works right now is stupid and the proposed change
is going in the right direction. We have ample evidence already that
handing off fsyncs to a background process is a good idea, and there's
no reason why that shouldn't be beneficial during crash recovery just
as it is at other times. But even if it somehow failed to improve
performance during recovery, there's another good reason to do this,
which is that it would make the code simpler. Having the pendingOps
stuff in the startup process in some recovery situations and in the
checkpointer in other recovery situations makes this harder to reason
about. As Tom said, the system state where bgwriter and checkpointer
are not running is an uncommon one, and is probably more likely to
have (or grow) bugs than the state where they are running.

The rat's-nest of logic introduced by the comment "Perform a
checkpoint to update all our recovery activity to disk." inside
StartupXLOG() could really do with some simplification. Right now we
have three cases: CreateEndOfRecoveryRecord(), RequestCheckpoint(),
and CreateCheckpoint(). Maybe with this change we could get it down to
just two, since RequestCheckpoint() already knows what to do about
!IsUnderPostmaster.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Perform COPY FROM encoding conversions in larger chunks

2021-02-02 Thread John Naylor
On Mon, Feb 1, 2021 at 12:15 PM Heikki Linnakangas  wrote:
> Thanks. I fixed it slightly differently, and also changed LocalToUtf()
> to follow the same pattern, even though LocalToUtf() did not have the
> same bug.

Looks good to me.

> I added a bunch of tests for various built-in conversions.

Nice! I would like to have utf8 tests for every category of invalid byte
(overlong, surrogate, 5 bytes, etc), but it's not necessary for this patch.

> I spent some time refactoring and adding comments all around the patch,
> hopefully making it all more clear. One notable difference is that I
> renamed 'raw_buf' (which exists in master too) to 'input_buf', and
> renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this
> patch again another day with fresh eyes, and also try to add some tests
> for the corner cases at buffer boundaries.

The comments and renaming are really helpful in understanding that file!

Although a new patch is likely forthcoming, I did take a brief look and
found the following:


In copyfromparse.c, this is now out of date:

 * Read the next input line and stash it in line_buf, with conversion to
 * server encoding.


One of your FIXME comments seems to allude to this, but if we really need a
difference here, maybe it should be explained:

+#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */

+#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */


Lastly, it looks like pg_do_encoding_conversion_buf() ended up in 0003
accidentally?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)

2021-02-02 Thread Tom Lane
"Joel Jacobson"  writes:
> Doc: improve documentation of oid columns that can be zero.

Since this is pretty closely tied to the catalog-foreign-key work,
I went ahead and reviewed/pushed it.  The zero notations now match
up with what we'd found in the other thread.

regards, tom lane




Re: [HACKERS] Custom compression methods

2021-02-02 Thread Robert Haas
Even more review comments, still looking mostly at 0001:

If there's a reason why parallel_schedule is arranging to run the
compression test in parallel with nothing else, the comment in that
file should explain the reason. If there isn't, it should be added to
a parallel group that doesn't have the maximum number of tests yet,
probably the last such group in the file.

serial_schedule should add the test in a position that roughly
corresponds to where it appears in parallel_schedule.

I believe it's relatively standard practice to put variable
declarations at the top of the file. compress_lz4.c and
compress_pglz.c instead put those declarations nearer to the point of
use.

compressamapi.c has an awful lot of #include directives for the code
it actually contains. I believe that we should cut that down to what
is required by 0001, and other patches can add more later as required.
In fact, it's tempting to just get rid of this .c file altogether and
make the two functions it contains static inline functions in the
header, but I'm not 100% sure that's a good idea.

The copyright dates in a number of the file headers are out of date.

binary_upgrade_next_pg_am_oid and the related changes to
CreateAccessMethod don't belong in 0001, because it doesn't support
non-built-in compression methods. These changes and the related
pg_dump change should be moved to the patch that adds support for
that.

The comments added to dumpTableSchema() say that "compression is
assigned by ALTER" but don't give a reason. I think they should. I
don't know how much they need to explain about what the code does, but
they definitely need to explain why it does it. Also, isn't this bad?
If we create the column with the wrong compression setting initially
and then ALTER it, we have to rewrite the table. If it's empty, that's
cheap, but it'd still be better not to do it at all.

I'm not sure it's a good idea for dumpTableSchema() to leave out
specifying the compression method if it happens to be pglz. I think we
definitely shouldn't do it in binary-upgrade mode. What if we changed
the default in a future release? For that matter, even 0002 could make
the current approach unsafe I think, anyway.

The changes to pg_dump.h look like they haven't had a visit from
pgindent. You should probably try to do that for the whole patch,
though it's a bit annoying since you'll have to manually remove
unrelated changes to the same files that are being modified by the
patch. Also, why the extra blank line here?

GetAttributeCompression() is hard to understand. I suggest changing
the comment to "resolve column compression specification to an OID"
and somehow rejigger the code so that you aren't using one not-NULL
test and one NULL test on the same variable. Like maybe change the
first part to if (!IsStorageCompressible(typstorage)) { if
(compression == NULL) return InvalidOid; ereport(ERROR, ...); }

It puzzles me that CompareCompressionMethodAndDecompress() calls
slot_getallattrs() just before clearing the slot. It seems like this
ought to happen before we loop over the attributes, so that we don't
need to call slot_getattr() every time. See the comment for that
function. But even if we didn't do that for some reason, why would we
do it here? If it's already been done, it shouldn't do anything, and
if it hasn't been done, it might overwrite some of the values we just
poked into tts_values. It also seems suspicious that we can get away
with clearing the slot and then again marking it valid. I'm not sure
it really works like that. Like, can't clearing the slot invalidate
pointers stored in tts_values[]? For instance, if they are pointers
into an in-memory heap tuple, tts_heap_clear() is going to free the
tuple; if they are pointers into a buffer, tts_buffer_heap_clear() is
going to unpin it. I think the supported procedure for this sort of
thing is to have a second slot, set tts_values, tts_isnull etc. and
then materialize the slot. After materializing the new slot, it's
independent of the old slot, which can then be cleared. See for
example tts_virtual_materialize(). The whole approach you've taken
here might need to be rethought a bit. I think you are right to want
to avoid copying everything over into a new slot if nothing needs to
be done, and I think we should definitely keep that optimization, but
I think if you need to copy stuff, you have to do the above procedure
and then continue using the other slot instead of the original one.
Some places I think we have functions that return either the original
slot or a different one depending on how it goes; that might be a
useful idea here. But, you also can't just spam-create slots; it's
important that whatever ones we end up with get reused for every
tuple.
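
To make that flow concrete, here is a rough sketch (a hypothetical helper, not
code from the patch) of copying into a second slot and then materializing it,
assuming dstslot is a virtual slot created with the same tuple descriptor:

static TupleTableSlot *
recompress_into_new_slot(TupleTableSlot *srcslot, TupleTableSlot *dstslot)
{
	int		natts = srcslot->tts_tupleDescriptor->natts;

	/* Deform everything once, rather than calling slot_getattr() per column. */
	slot_getallattrs(srcslot);

	ExecClearTuple(dstslot);
	memcpy(dstslot->tts_values, srcslot->tts_values, natts * sizeof(Datum));
	memcpy(dstslot->tts_isnull, srcslot->tts_isnull, natts * sizeof(bool));

	/* ... overwrite individual dstslot->tts_values[i] with re-compressed datums ... */

	ExecStoreVirtualTuple(dstslot);

	/*
	 * Copy by-ref datums into the slot's own memory (tts_virtual_materialize),
	 * so dstslot no longer depends on srcslot's buffer or heap tuple.
	 */
	ExecMaterializeSlot(dstslot);

	return dstslot;		/* caller continues with dstslot; srcslot can now be
						 * cleared safely */
}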

Doesn't the change to describeOneTableDetails() require changing the
declaration of char *headers[11] to char *headers[12]?
How does this not fail Assert(cols <= lengthof(headers))?

Why does describeOneTableDetails() arrange to truncate 

Re: New IndexAM API controlling index vacuum strategies

2021-02-02 Thread Peter Geoghegan
On Tue, Feb 2, 2021 at 6:28 AM Victor Yegorov  wrote:
> I really like this idea!

Cool!

> It resembles the approach used in bottom-up index deletion, block-based
> accounting provides a better estimate for the usefulness of the operation.

It does resemble bottom-up index deletion, in one important general
sense: it is somewhat qualitative (though *also* somewhat quantitative).
This is new for vacuumlazy.c. But the idea now is to deemphasize
bottom-up index deletion heavy workloads in the first version of this
patch -- just to cut scope.

The design I described yesterday centers around "~99.9% append-only
table" workloads, where anti-wraparound vacuums that scan indexes are
a big source of unnecessary work (in practice it is always
anti-wraparound vacuums, simply because there will never be enough
garbage to trigger a regular autovacuum run). But it now occurs to me
that there is another very important case that it will also help,
without making the triggering condition for index vacuuming any more
complicated: it will help cases where HOT updates are expected
(because all updates don't modify indexed columns).

It's practically impossible for HOT updates to occur 100% of the time,
even with workloads whose updates never modify indexed columns. You
can clearly see this by looking at the stats from pg_stat_user_tables
with a standard pgbench workload. It does get better with lower heap
fill factor, but I think that HOT is never 100% effective (i.e. 100%
of updates are HOT updates) in the real world -- unless maybe you set
heap fillfactor as low as 50, which is very rare. HOT might well be
95% effective, or 99% effective, but it's never truly 100% effective.
And so this is another important workload where the difference between
"practically zero dead tuples" and "precisely zero dead tuples"
*really* matters when deciding if a VACUUM operation needs to go
ahead.
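
For example, the ratio is easy to see with a query like this against
pg_stat_user_tables (nothing patch-specific, just the standard stats view):

SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;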

Once again, a small difference, but also a big difference. Forgive me
for repeating myself so much, but: paying attention to cost/benefit
asymmetries like this one sometimes allows us to recognize an
optimization that is an "excellent deal". We saw this with bottom-up
index deletion. Seems good to keep an eye out for that.

> I suppose that 1% threshold should be configurable as a cluster-wide GUC
> and also as a table storage parameter?

Possibly. I'm concerned about making any user-visible interface (say a
GUC) compatible with an improved version that is smarter about
bottom-up index deletion (in particular, one that can vacuum only a
subset of the indexes on a table, which now seems too ambitious for
Postgres 14).

-- 
Peter Geoghegan




Re: Thoughts on "killed tuples" index hint bits support on standby

2021-02-02 Thread Michail Nikolaev
Hello, Peter.

> AFAICT that's not true, at least not in any practical sense. See the
> comment in the middle of MarkBufferDirtyHint() that begins with "If we
> must not write WAL, due to a relfilenode-specific...", and see the
> "Checksums" section at the end of src/backend/storage/page/README. The
> last paragraph in the README is particularly relevant:

I have attached a TAP-test to demonstrate how easily checksums on standby
and primary start to differ. The test shows two different scenarios - for
both heap and index (and the bit is placed in both standby and primary).

Yes, MarkBufferDirtyHint does not mark the page as dirty… So, hint bits on
secondary could be easily lost. But it leaves the page dirty if it already
is (or it could be marked dirty by WAL replay later). So, hints bits could
be easily flushed and taken into account during checksum calculation on
both - standby and primary.

> "We can set the hint, just not dirty the page as a result so the hint
> is lost when we evict the page or shutdown"

Yes, it is not allowed to mark a page as dirty because of hints on standby.
Because we could achieve this:

CHECKPOINT
SET HINT BIT
TORN FLUSH + CRASH = BROKEN CHECKSUM, SERVER FAULT

But this scenario is totally fine:

CHECKPOINT
FPI (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED, EVERYTHING IS OK

And, as result, this is fine too:

CHECKPOINT
FPI WITH MASKED LP_DEAD (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED + LP_DEAD MASKED AGAIN IF STANDBY

So, my point here - it is fine to mask LP_DEAD bits during replay because
they are already different on standby and primary. And it is fine to set
and flush hint bits (and LP_DEADs) on standby because they already could be
easily flushed (we just need to consider minRecoveryPoint and, probably,
OldestXmin from the primary in the LP_DEAD case to make promotion easy).

>> And `btree_mask` (and other mask functions) already used for consistency
checks to exclude LP_DEAD.
> I don't see how that is relevant. btree_mask() is only used by
> wal_consistency_checking, which is mostly just for Postgres hackers.

I was thinking about the possibility to reuse these functions in masking
during replay.

Thanks,
Michail.


022_checksum_tests.pl
Description: Perl program


Re: Recording foreign key relationships for the system catalogs

2021-02-02 Thread Joel Jacobson
On Tue, Feb 2, 2021, at 17:00, Tom Lane wrote:
>No, I think it's correct as-is (and this is one reason that I chose to
>have two separate FK entries for cases like this).  confrelid can be
>zero in rows that are not FK constraints.  However, such a row must
>also have empty confkey.  The above entry states that for each element
>of confkey, the pair (confrelid,confkey[i]) must be nonzero and have
>a match in pg_attribute.  It creates no constraint if confkey is empty.

Thanks for explaining, I get it now.

>Appreciate the review!  Please confirm if you agree with above
>analysis.

Yes, I agree with the analysis.

/Joel

Re: bugfix - plpgsql - statement counter is incremented 2x for FOR stmt

2021-02-02 Thread Pavel Stehule
On Tue, Feb 2, 2021 at 20:36, Tom Lane  wrote:

> Pavel Stehule  writes:
> > When I fixed one plpgsql_check issue, I found another plpgsql issue. Now
> we
> > have field nstatements that hold a number of plpgsql statements in
> > function. Unfortunately I made an error when I wrote this functionality
> and
> > for FOR statements, this counter is incremented 2x. Higher number than a
> > real number is better than a lesser number, but it can be a source of
> > problems too (inside plpgsql_check I iterate over 0 .. nstatements
> stmtid,
> > and due this bug I had a problem with missing statements).
>
> > Attached patch is pretty simple:
>
> Right, done.
>

Thank you for commit

Regards

Pavel


> regards, tom lane
>


Re: bugfix - plpgsql - statement counter is incremented 2x for FOR stmt

2021-02-02 Thread Tom Lane
Pavel Stehule  writes:
> When I fixed one plpgsql_check issue, I found another plpgsql issue. Now we
> have field nstatements that hold a number of plpgsql statements in
> function. Unfortunately I made an error when I wrote this functionality and
> for FOR statements, this counter is incremented 2x. Higher number than a
> real number is better than a lesser number, but it can be a source of
> problems too (inside plpgsql_check I iterate over 0 .. nstatements stmtid,
> and due this bug I had a problem with missing statements).

> Attached patch is pretty simple:

Right, done.

regards, tom lane




Re: memory leak in auto_explain

2021-02-02 Thread Tom Lane
japin  writes:
> Here's my analysis:
> 1) In the explain_ExecutorEnd(), it will create a ExplainState on SQL function
> memory context, which is a long-lived, cause the memory grow up.

Yeah, agreed.  I think people looking at this have assumed that the
ExecutorEnd hook would automatically be running in the executor's
per-query context, but that's not so; we haven't yet entered
standard_ExecutorEnd where the context switch is.  The caller's
context is likely to be much longer-lived than the executor's.

I think we should put the switch one level further out than you have
it here, just to be sure that InstrEndLoop is covered too (that doesn't
allocate memory, but auto_explain shouldn't assume that).  Otherwise
seems like a good fix.
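
In other words, roughly this shape (a sketch of the suggested fix, not the
final committed code):

static void
explain_ExecutorEnd(QueryDesc *queryDesc)
{
	if (queryDesc->totaltime && auto_explain_enabled())
	{
		MemoryContext oldcxt;

		/*
		 * Allocate everything below in the executor's per-query context, so
		 * it is released at ExecutorEnd rather than accumulating in the
		 * caller's possibly long-lived context.
		 */
		oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);

		/* InstrEndLoop is covered by the switch as well */
		InstrEndLoop(queryDesc->totaltime);

		/* ... build the ExplainState and emit the log entry here ... */

		MemoryContextSwitchTo(oldcxt);
	}

	if (prev_ExecutorEnd)
		prev_ExecutorEnd(queryDesc);
	else
		standard_ExecutorEnd(queryDesc);
}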

regards, tom lane




bugfix - plpgsql - statement counter is incremented 2x for FOR stmt

2021-02-02 Thread Pavel Stehule
Hi

When I fixed one plpgsql_check issue, I found another plpgsql issue. Now we
have field nstatements that hold a number of plpgsql statements in
function. Unfortunately I made an error when I wrote this functionality and
for FOR statements, this counter is incremented 2x. Higher number than a
real number is better than a lesser number, but it can be a source of
problems too (inside plpgsql_check I iterate over 0 .. nstatements stmtid,
and due this bug I had a problem with missing statements).

Attached patch is pretty simple:

Regards

Pavel
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index abf196d4be..34e0520719 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -1341,7 +1341,6 @@ stmt_for		: opt_loop_label K_FOR for_control loop_body
 
 			new = (PLpgSQL_stmt_fori *) $3;
 			new->lineno   = plpgsql_location_to_lineno(@2);
-			new->stmtid	  = ++plpgsql_curr_compile->nstatements;
 			new->label	  = $1;
 			new->body	  = $4.stmts;
 			$$ = (PLpgSQL_stmt *) new;
@@ -1356,7 +1355,6 @@ stmt_for		: opt_loop_label K_FOR for_control loop_body
 			/* forq is the common supertype of all three */
 			new = (PLpgSQL_stmt_forq *) $3;
 			new->lineno   = plpgsql_location_to_lineno(@2);
-			new->stmtid	  = ++plpgsql_curr_compile->nstatements;
 			new->label	  = $1;
 			new->body	  = $4.stmts;
 			$$ = (PLpgSQL_stmt *) new;


Re: adding wait_start column to pg_locks

2021-02-02 Thread Fujii Masao




On 2021/02/02 22:00, torikoshia wrote:

On 2021-01-25 23:44, Fujii Masao wrote:

Another comment is; Doesn't the change of MyProc->waitStart need the
lock table's partition lock? If yes, we can do that by moving
LWLockRelease(partitionLock) just after the change of
MyProc->waitStart, but which causes the time that lwlock is being held
to be long. So maybe we need another way to do that.


Thanks for your comments!

It would be ideal for the consistency of the view to record "waitstart" while
holding the table partition lock.
However, as you pointed out, it would give non-negligible performance impacts.

I may miss something, but as far as I can see, the influence of not holding the lock is that 
"waitstart" can be NULL even though "granted" is false.

I think people want to know the start time of the lock when locks are held for 
a long time.
In that case, "waitstart" should have already been recorded.


Sounds reasonable.



If this is true, I think the current implementation may be enough on the condition that users 
understand it can happen that "waitStart" is NULL and "granted" is false.

Attached a patch describing this in the doc and comments.


Any Thoughts?


64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating 
"waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock 
when reading "waitStart"?
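
For what it's worth, one way to sidestep both issues would be to declare the
field as pg_atomic_uint64 (just a sketch of that idea; it is not what the
posted patch does):

/* Writer side, in the backend about to wait on the lock (sketch): */
pg_atomic_write_u64(&MyProc->waitStart, (uint64) GetCurrentTimestamp());

/* Reader side, e.g. in GetLockStatusData() (sketch): */
TimestampTz	waitstart = (TimestampTz) pg_atomic_read_u64(&proc->waitStart);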


+   Lock acquisition wait start time.

Isn't it better to describe this more clearly? What about the following?

Time when the server process started waiting for this lock,
or null if the lock is held.

+   Note that updating this field and lock acquisition are not performed
+   synchronously for performance reasons.  Therefore, depending on the
+   timing, it can happen that waitstart is
+   NULL even though
+   granted is false.

I agree that it's helpful to add the note about that NULL can be returned even when 
"granted" is false. But IMO we don't need to document why this behavior can 
happen internally. So what about the following?

Note that this can be null for a very short period of time after
the wait started even though granted
is false.

Since the document for pg_locks uses "null" instead of NULL (I'm not 
sure why, though), I used "null" for the sake of consistency.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Recording foreign key relationships for the system catalogs

2021-02-02 Thread Tom Lane
"Joel Jacobson"  writes:
> I could only find one minor error,
> found by running the regression-tests,
> and then using the query below to compare "is_opt"
> with my own "zero_values" in my tool
> that derives its value from pg_catalog content.
> ...
> I therefore think is_opt should be changed to true for this row:
> fktable|   fkcols|   pktable|  pkcols   | 
> is_array | is_opt
> ---+-+--+---+--+
> pg_constraint | {confrelid,confkey} | pg_attribute | {attrelid,attnum} | t
> | f

No, I think it's correct as-is (and this is one reason that I chose to
have two separate FK entries for cases like this).  confrelid can be
zero in rows that are not FK constraints.  However, such a row must
also have empty confkey.  The above entry states that for each element
of confkey, the pair (confrelid,confkey[i]) must be nonzero and have
a match in pg_attribute.  It creates no constraint if confkey is empty.
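
To illustrate the rule, the invariant that FK entry describes can be checked
directly; a quick sketch of such a query, which should return no rows:

SELECT c.oid, c.conname, k.attnum
FROM pg_constraint c
CROSS JOIN LATERAL unnest(c.confkey) AS k(attnum)
WHERE NOT EXISTS (SELECT 1 FROM pg_attribute a
                  WHERE a.attrelid = c.confrelid
                    AND a.attnum = k.attnum);

Rows with an empty confkey contribute nothing to the unnest, so they impose no
constraint, exactly as described above.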

> If this is fixed, I also agree this is ready to be committed.

Appreciate the review!  Please confirm if you agree with above
analysis.

regards, tom lane




Re: Key management with tests

2021-02-02 Thread Bruce Momjian
On Mon, Feb  1, 2021 at 07:47:57PM -0500, Bruce Momjian wrote:
> On Mon, Feb  1, 2021 at 06:31:32PM -0500, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > >   The purpose of cluster file encryption is to prevent users with read
> > >   access to the directories used to store database files and write-ahead
> > >   log files from being able to access the data stored in those files.
> > >   For example, when using cluster file encryption, users who have read
> > >   access to the cluster directories for backup purposes will not be able
> > >   to decrypt the data stored in these files.  It also protects against
> > >   decrypted data access after media theft.
> > 
> > That's one valid use-case and it particularly makes sense to consider,
> > now that we support group read-access to the data cluster.  The last
> 
> Do enough people use group read-access to be useful?

I am thinking group read-access might be a requirement for cluster file
encryption to be effective.

> > line seems a bit unclear- I would update it to say:
> > Cluster file encryption also provides data-at-rest security, protecting
> > users from data loss should the physical media on which the cluster is
> > stored be stolen, improperly deprovisioned (not wiped or destroyed), or
> > otherwise ends up in the hands of an attacker.
> 
> I have split the section into three paragraphs, trimmed down some of the
> suggested text, and added it.  Full version below.

Here is an updated doc description of memory reading:

This also does not protect against users who have read access to
database process memory --- all in-memory data pages and data
encryption keys are stored unencrypted in memory, so an attacker who
--> is able to read memory can decrypt the entire cluster.  The Postgres
--> operating system user and the operating system administrator, e.g.,
--> the root user, have such access.

> > >   File system write access can allow for unauthorized file system data
> > >   decryption if the writes can be used to weaken the system's security
> > >   and this weakened system is later supplied with externally-stored keys.
> > 
> > This isn't very clear as to exactly what the concern is or how an
> > attacker would be able to thwart the system if they had write access to
> > it.  An attacker with write access could possibly attempt to replace the
> > existing keys, but with the key wrapping that we're using, that should
> > result in just a decryption failure (unless, of course, the attacker has
> > the actual KEK that was used, but that's not terribly interesting to
> > worry about since then they could just go access the files directly).
> 
> Uh, well, they could modify postgresql.conf to change the script to save
> the secret returned by the script before returning it to the PG server. 
> We could require postgresql.conf to be somewhere secure, but then how do
> we know that is secure?  I just don't see a clean solution here, but the
> idea that you write and then wait for the key to show up seems like a
> very valid way of attack, and it took me a while to be able to
> articulate it.

Let's suppose you lock down your cluster --- the non-PGDATA files are
owned by root, postgresql.conf and pg_hba.conf are moved out of PGDATA
and are not writable by the database OS user, or we have the PGDATA
directory on another server, so the adversary can only write to the
remote PGDATA directory.

What can they do?  Well, they can't modify pg_proc to add a shared
library since pg_proc is encrypted, so we have to focus on files needed
before encryption starts or files that can't be easily encrypted.  They
could create postgresql.conf.auto in PGDATA, and modify
cluster_key_command to capture the key, or they could modify preload
libraries or archive command to call a command to read memory as the PG
OS user and write the key out somewhere, or use the key to rewrite the
database files --- those wouldn't even need a database restart, just a
reload.

They could also modify pg_xact files so that, even though the heap/index
files are encrypted, how the contents of those files are interpreted
would change.

In summary, to detect malicious user writes, you would need to protect
the files used before encryption starts (root owned or owned by another
user?), and encrypt all files after encryption starts --- any other
approach would probably leave open attack vectors, and I don't think
there is sufficient community desire to add such boundaries.

How do other database systems guarantee to detect malicious writes?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Add primary keys to system catalogs

2021-02-02 Thread Julien Rouhaud
On Tue, Feb 2, 2021 at 11:17 PM Tom Lane  wrote:
>
> Dunno about refobjversion; I have my doubts that putting that info in
> pg_depend was a sane design choice at all.  But from what I understand
> of it, wouldn't all deps on a given object necessarily have the same
> version?

Correct, assuming of course that the dependencies are from the same object.




Re: Add primary keys to system catalogs

2021-02-02 Thread Tom Lane
Peter Eisentraut  writes:
> I do wonder, however, under what circumstances code would be put into a 
> situation where it would add the exact same dependency again, and also 
> under what circumstances it would add a dependency between the same 
> objects but a different deptype, and how that would work during 
> recursive deletion.  Now that we have the refobjversion column, the 
> presence of duplicate dependencies might be even more dubious.  I think 
> that could be worth analyzing.

The duplicate-deps case occurs if, say, you use the same function more
than once in a view.  While we could probably get rid of dups generated
within the same recordMultipleDependencies call (and maybe already do?),
it's harder to deal with dups made by retail calls.  For example,
makeOperatorDependencies doesn't trouble to avoid making duplicate
dependencies when the operator's left and right input types are the
same, or the same as the result type.
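
The operator case is easy to reproduce, for example (throwaway names; a pinned
type such as int would not show this, since dependencies on pinned objects are
not recorded in pg_depend):

CREATE DOMAIN mydom AS int;
CREATE FUNCTION dummy_cmp(mydom, mydom) RETURNS bool
    LANGUAGE sql AS 'SELECT $1::int < $2::int';
CREATE OPERATOR <<< (leftarg = mydom, rightarg = mydom, function = dummy_cmp);

-- the dependency on mydom is recorded once for the left input and once
-- for the right input, i.e. count = 2
SELECT refobjid::regtype, deptype, count(*)
FROM pg_depend
WHERE classid = 'pg_operator'::regclass
  AND objid = '<<<(mydom,mydom)'::regoperator
  AND refclassid = 'pg_type'::regclass
GROUP BY 1, 2;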

The case with almost-dup-but-not-same-deptype *does* occur, for instance
it's very possible to have both normal and automatic dependencies.
As I recall, the policy is to perform the deletion if any of the deps
would allow it.

Dunno about refobjversion; I have my doubts that putting that info in
pg_depend was a sane design choice at all.  But from what I understand
of it, wouldn't all deps on a given object necessarily have the same
version?

regards, tom lane




Race between KeepFileRestoredFromArchive() and restartpoint

2021-02-02 Thread Noah Misch
A race with KeepFileRestoredFromArchive() can cause a restartpoint to fail, as
seen once on the buildfarm[1].  The attached patch adds a test case; it
applies atop the "stop events" patch[2].  We have two systems for adding
long-term pg_wal directory entries.  KeepFileRestoredFromArchive() adds them
during archive recovery, while InstallXLogFileSegment() does so at all times.
Unfortunately, InstallXLogFileSegment() happens throughout archive recovery,
via the checkpointer recycling segments and calling PreallocXlogFiles().
Multiple processes can run InstallXLogFileSegment(), which uses
ControlFileLock to represent the authority to modify the directory listing of
pg_wal.  KeepFileRestoredFromArchive() just assumes it controls pg_wal.

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries.  In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
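
For concreteness, a bare-bones sketch of such a flag (all names here are
invented for illustration only):

/* hypothetical: who currently owns adding long-term pg_wal entries */
typedef enum WalInstallRegime
{
	WALINSTALL_NORMAL,				/* InstallXLogFileSegment() may install */
	WALINSTALL_ARCHIVE_RECOVERY		/* KeepFileRestoredFromArchive() owns pg_wal */
} WalInstallRegime;

/* in XLogFileInit(), under the proposal (sketch): */
if (XLogCtl->installRegime == WALINSTALL_ARCHIVE_RECOVERY)
	elog(ERROR, "cannot install new WAL segment during archive recovery");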

Notable alternatives:

- Release ControlFileLock at the end of XLogFileInit(), not at the end of
  InstallXLogFileSegment().  Add ControlFileLock acquisition to
  KeepFileRestoredFromArchive().  This provides adequate mutual exclusion, but
  XLogFileInit() could return a file descriptor for an unlinked file.  That's
  fine for PreallocXlogFiles(), but it feels wrong.

- During restartpoints, never preallocate or recycle segments.  (Just delete
  obsolete WAL.)  By denying those benefits, this presumably makes streaming
  recovery less efficient.

- Make KeepFileRestoredFromArchive() call XLogFileInit() to open a segment,
  then copy bytes.  This is simple, but it multiplies I/O.  That might be
  negligible on account of caching, or it might not be.  A variant, incurring
  extra fsyncs, would be to use durable_rename() to replace the segment we get
  from XLogFileInit().

- Make KeepFileRestoredFromArchive() rename without first unlinking.  This
  avoids checkpoint failure, but a race could trigger noise from the LOG
  message in InstallXLogFileSegment -> durable_rename_excl.

Does anyone prefer some alternative?  It's probably not worth back-patching
anything for a restartpoint failure this rare, because most restartpoint
outcomes are not user-visible.

Thanks,
nm

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill=2020-10-05%2023%3A02%3A17
[2] 
https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
Author: Noah Misch 
Commit: Noah Misch 

Not for commit; for demonstration only.

Before applying this, apply

https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 199d911..1c0a988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -71,6 +71,7 @@
 #include "storage/reinit.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
+#include "storage/stopevent.h"
 #include "storage/sync.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
@@ -3583,6 +3584,23 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, 
XLogSegNo srcsegno,
elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
+static Jsonb *
+InstallXLogFileSegmentEndStopEventParams(XLogSegNo segno)
+{
+   MemoryContext oldCxt;
+   JsonbParseState *state = NULL;
+   Jsonb  *res;
+
+   stopevents_make_cxt();
+   oldCxt = MemoryContextSwitchTo(stopevents_cxt);
+   pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
+   jsonb_push_int8_key(&state, "segno", segno);
+   res = JsonbValueToJsonb(pushJsonbValue(&state, WJB_END_OBJECT, NULL));
+   MemoryContextSwitchTo(oldCxt);
+
+   return res;
+}
+
 /*
  * Install a new XLOG segment file as a current or future log segment.
  *
@@ -3664,6 +3682,9 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
if (use_lock)
LWLockRelease(ControlFileLock);
 
+   STOPEVENT(InstallXLogFileSegmentEndStopEvent,
+ InstallXLogFileSegmentEndStopEventParams(*segno));
+
return true;
 }
 
diff --git a/src/backend/access/transam/xlogarchive.c 
b/src/backend/access/transam/xlogarchive.c
index 1c5a4f8..d9c5a3a 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -30,6 +30,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pmsignal.h"
+#include "storage/stopevent.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -372,6 +373,23 @@ ExecuteRecoveryCommand(const char *command, const char 
*commandName, bool failOn
 }
 
 
+static Jsonb *

Re: Add primary keys to system catalogs

2021-02-02 Thread Julien Rouhaud
On Tue, Feb 2, 2021 at 6:49 PM Peter Eisentraut
 wrote:
>
> I do wonder, however, under what circumstances code would be put into a
> situation where it would add the exact same dependency again, and also
> under what circumstances it would add a dependency between the same
> objects but a different deptype, and how that would work during
> recursive deletion.  Now that we have the refobjversion column, the
> presence of duplicate dependencies might be even more dubious.  I think
> that could be worth analyzing.

There's at least dependencies from indexes to pg_class (underlying
table) and pg_collation that can be entirely duplicated (if the same
column and/or collation is used in both key and predicate for
instance), and this was the case before refobjversion was introduced.




Re: New IndexAM API controlling index vacuum strategies

2021-02-02 Thread Victor Yegorov
On Tue, Feb 2, 2021 at 05:27, Peter Geoghegan  wrote:

> And now here is the second thing I thought of, which is much better:
>
> Sometimes 1% of the dead tuples in a heap relation will be spread
> across 90%+ of the pages. With other workloads 1% of dead tuples might
> be highly concentrated, and appear in no more than 1% of all heap
> pages. Obviously the distinction between these two cases/workloads
> matters a lot. And so the triggering criteria must be quantitative
> *and* qualitative. It should not be based on counting dead tuples,
> since that alone won't differentiate these two extreme cases - both of
> which are probably quite common (in the real world extremes are
> actually the normal and common case IME).
>
> I like the idea of basing it on counting *heap blocks*, not dead
> tuples. We can count heap blocks that have *at least* one dead tuple
> (of course it doesn't matter how they're dead, whether it was this
> VACUUM operation or some earlier opportunistic pruning). Note in
> particular that it should not matter if it's a heap block that has
> only one LP_DEAD line pointer or a heap page that is near the
> MaxHeapTuplesPerPage limit for the page -- we count either type of
> page towards the heap-page based limit used to decide if index
> vacuuming goes ahead for all indexes during VACUUM.
>

I really like this idea!

It resembles the approach used in bottom-up index deletion, block-based
accounting provides a better estimate for the usefulness of the operation.

I suppose that 1% threshold should be configurable as a cluster-wide GUC
and also as a table storage parameter?
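
Purely to illustrate the suggestion (the parameter name is invented; nothing
like it exists yet):

-- hypothetical GUC, cluster-wide default
SET vacuum_index_cleanup_dead_page_fraction = 0.01;

-- hypothetical per-table override via a storage parameter
ALTER TABLE pgbench_accounts
    SET (vacuum_index_cleanup_dead_page_fraction = 0.05);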


-- 
Victor Yegorov


Re: libpq debug log

2021-02-02 Thread 'Alvaro Herrera'
On 2021-Jan-29, tsunakawa.ta...@fujitsu.com wrote:

> (30)
> +/*
> + * Deallocate FE/BE message tracking memory.  We only do this because
> + * FE message can grow arbitrarily large, and skip it in the initial state,
> + * because it's likely pointless.
> + */
> +void
> +pqTraceUninit(PGconn *conn)
> +{
> + if (conn->fe_msg &&
> + conn->fe_msg->num_fields != DEF_FE_MSGFIELDS)
> + {
> + free(conn->fe_msg);
> + conn->fe_msg = NULL;
> + }
> 
> What does the second if condition mean?  If it's not necessary, change the 
> comment accordingly.

The rationale for that second condition is this: if the memory allocated
is the initial size, we don't free it, because it would just be
allocated at the same size next time; that size is not very big, so
it's not a big deal to let it be, and it can be reused if we
call PQtrace() again later.  However, if the allocated size is larger
than default, then it is possible that some previous tracing run has
enlarged the trace struct to a very large amount of memory, and we don't
want to leave that in place.
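
Spelling that rationale out in the function's comment could look like this
(behavior identical to the quoted code; only the comment changes):

/*
 * Deallocate FE/BE message tracking memory.  Keep the struct around if it is
 * still at its default size, so a later PQtrace() call can reuse it cheaply;
 * only free it if a previous trace run enlarged it beyond DEF_FE_MSGFIELDS,
 * since that memory could be arbitrarily large.
 */
void
pqTraceUninit(PGconn *conn)
{
	if (conn->fe_msg &&
		conn->fe_msg->num_fields != DEF_FE_MSGFIELDS)
	{
		free(conn->fe_msg);
		conn->fe_msg = NULL;
	}
}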

-- 
Álvaro Herrera   Valdivia, Chile




Re: [PATCH] remove deprecated v8.2 containment operators

2021-02-02 Thread Justin Pryzby
On Mon, Nov 30, 2020 at 02:09:10PM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > I think this is waiting on me to provide a patch for the contrib/ modules 
> > with
> > update script removing the SQL operators, and documenting their 
> > deprecation.
> 
> Right.  We can remove the SQL operators, but not (yet) the C code support.
> 
> I'm not sure that the docs change would do more than remove any existing
> mentions of the operators.

I've finally returned to this.  RFC.

-- 
Justin
>From 98a870449bdf3670e2e190d8bdfccd59dae42452 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 11 Apr 2020 22:57:06 -0500
Subject: [PATCH] remove deprecated v8.2 containment operators

See also:
ba920e1c9182eac55d5f1327ab0d29b721154277 684ad6a92fcc33adebdab65c4e7d72a68ba05408
3165426e54df02a6199c0a216546e5095e875a0a 2f70fdb0644c32c4154236c2b5c241bec92eac5e 591d282e8d3e0448ec1339c6b066e10953f040a2
---
 contrib/cube/Makefile   |  2 +-
 contrib/cube/cube--1.4--1.5.sql | 10 ++
 contrib/cube/cube.control   |  2 +-
 contrib/hstore/Makefile |  1 +
 contrib/hstore/hstore--1.0--1.1.sql |  7 +++
 contrib/hstore/hstore--1.8--1.9.sql |  8 
 contrib/hstore/hstore.control   |  2 +-
 contrib/intarray/Makefile   |  2 +-
 contrib/intarray/intarray--1.4--1.5.sql | 18 ++
 contrib/intarray/intarray.control   |  2 +-
 contrib/seg/Makefile|  2 +-
 contrib/seg/seg--1.3--1.4.sql   | 11 +++
 contrib/seg/seg.control |  2 +-
 doc/src/sgml/cube.sgml  |  8 
 doc/src/sgml/hstore.sgml| 10 --
 doc/src/sgml/intarray.sgml  |  8 
 doc/src/sgml/seg.sgml   |  8 
 17 files changed, 62 insertions(+), 41 deletions(-)
 create mode 100644 contrib/cube/cube--1.4--1.5.sql
 create mode 100644 contrib/hstore/hstore--1.0--1.1.sql
 create mode 100644 contrib/hstore/hstore--1.8--1.9.sql
 create mode 100644 contrib/intarray/intarray--1.4--1.5.sql
 create mode 100644 contrib/seg/seg--1.3--1.4.sql

diff --git a/contrib/cube/Makefile b/contrib/cube/Makefile
index 54f609db17..cf195506c7 100644
--- a/contrib/cube/Makefile
+++ b/contrib/cube/Makefile
@@ -7,7 +7,7 @@ OBJS = \
 	cubeparse.o
 
 EXTENSION = cube
-DATA = cube--1.2.sql cube--1.2--1.3.sql cube--1.3--1.4.sql \
+DATA = cube--1.2.sql cube--1.2--1.3.sql cube--1.3--1.4.sql cube--1.4--1.5.sql \
 	cube--1.1--1.2.sql cube--1.0--1.1.sql
 PGFILEDESC = "cube - multidimensional cube data type"
 
diff --git a/contrib/cube/cube--1.4--1.5.sql b/contrib/cube/cube--1.4--1.5.sql
new file mode 100644
index 00..07a5f0755d
--- /dev/null
+++ b/contrib/cube/cube--1.4--1.5.sql
@@ -0,0 +1,10 @@
+/* contrib/cube/cube--1.4--1.5.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION cube UPDATE TO '1.5'" to load this file. \quit
+
+-- Remove @ and ~
+ALTER OPERATOR FAMILY gist_cube_ops USING gist
+DROP OPERATOR 13 (cube, cube);
+ALTER OPERATOR FAMILY gist_cube_ops USING gist
+DROP OPERATOR 14 (cube, cube);
diff --git a/contrib/cube/cube.control b/contrib/cube/cube.control
index 3e238fc937..50427ec117 100644
--- a/contrib/cube/cube.control
+++ b/contrib/cube/cube.control
@@ -1,6 +1,6 @@
 # cube extension
 comment = 'data type for multidimensional cubes'
-default_version = '1.4'
+default_version = '1.5'
 module_pathname = '$libdir/cube'
 relocatable = true
 trusted = true
diff --git a/contrib/hstore/Makefile b/contrib/hstore/Makefile
index c4e339b57c..97b228b65f 100644
--- a/contrib/hstore/Makefile
+++ b/contrib/hstore/Makefile
@@ -12,6 +12,7 @@ OBJS = \
 
 EXTENSION = hstore
 DATA = hstore--1.4.sql \
+	hstore--1.8--1.9.sql \
 	hstore--1.7--1.8.sql \
 	hstore--1.6--1.7.sql \
 	hstore--1.5--1.6.sql \
diff --git a/contrib/hstore/hstore--1.0--1.1.sql b/contrib/hstore/hstore--1.0--1.1.sql
new file mode 100644
index 00..4e32a575c5
--- /dev/null
+++ b/contrib/hstore/hstore--1.0--1.1.sql
@@ -0,0 +1,7 @@
+/* contrib/hstore/hstore--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION hstore UPDATE TO '1.1'" to load this file. \quit
+
+ALTER EXTENSION hstore DROP OPERATOR => (text, text);
+DROP OPERATOR => (text, text);
diff --git a/contrib/hstore/hstore--1.8--1.9.sql b/contrib/hstore/hstore--1.8--1.9.sql
new file mode 100644
index 00..2220fee444
--- /dev/null
+++ b/contrib/hstore/hstore--1.8--1.9.sql
@@ -0,0 +1,8 @@
+/* contrib/hstore/hstore--1.8--1.9.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION hstore UPDATE TO '1.9'" to load this file. \quit
+
+-- Remove @
+ALTER OPERATOR FAMILY gist_hstore_ops USING gist
+DROP OPERATOR 13 (hstore, hstore);
diff --git a/contrib/hstore/hstore.control b/contrib/hstore/hstore.control
index 89e3c746c4..b73c28aa4d 100644
--- 

Re: Typo in tablesync comment

2021-02-02 Thread Amit Kapila
On Tue, Feb 2, 2021 at 10:49 AM Michael Paquier  wrote:
>
> On Tue, Feb 02, 2021 at 10:38:31AM +1100, Peter Smith wrote:
> > PSA a trivial patch to correct what seems like a typo in the tablesync 
> > comment.
>
> - *   subscribed tables and their state.  Some transient state during data
> - *   synchronization is kept in shared memory.  The states SYNCWAIT and
> + *   subscribed tables and their state.  Some transient states during 
> data
> + *   synchronization are kept in shared memory. The states SYNCWAIT and
>
> This stuff refers to SUBREL_STATE_* in pg_subscription_rel.h, and FWIW
> I find confusing the term "transient" in this context as a state may
> last for a rather long time, depending on the time it takes to
> synchronize the relation, no?
>

These in-memory states are used after the initial copy is done, so they only
exist for the time it takes the tablesync worker to sync up with the apply
worker. In some cases they can last longer: when the apply worker is far ahead
of the tablesync worker we can stay in the CATCHUP state for a long time, but
SYNCWAIT will still be relatively short-lived.

>  I am wondering if we could do better
> here, say:
> "The state tracking the progress of the relation synchronization is
> additionally stored in shared memory, with SYNCWAIT and CATCHUP only
> appearing in memory."
>

I don't mind changing to your proposed text but I think the current
wording is also okay and seems clear to me.

-- 
With Regards,
Amit Kapila.




RE: libpq debug log

2021-02-02 Thread iwata....@fujitsu.com
Hi all,

Thank you Kirk san for creating the v14 patch.
I have updated the patch and fixed all of Tsunakawa san's review comments.
I am still working on three bugs: two were pointed out by Alvaro san
in a previous e-mail, and I found one more myself.

> From: Alvaro Herrera 
> Sent: Friday, January 22, 2021 6:53 AM
...
> > (5)
> > @@ -966,10 +966,6 @@ pqSaveParameterStatus(PGconn *conn, const char
> *name, const char *value)
> > pgParameterStatus *pstatus;
> > pgParameterStatus *prev;
> >
> > -   if (conn->Pfdebug)
> > -   fprintf(conn->Pfdebug, "pqSaveParameterStatus: '%s' =
> '%s'\n",
> > -   name, value);
> > -
> >
> > Where is this information output instead?
> 
> Hmm, seems to have been lost.  I restored it, but didn't verify
> the resulting behavior carefully.

I checked log output using Matsumura san's application app.c;
---
2021-01-27 11:29:49.718 <   ParameterStatus 22 "application_name" ""
pqSaveParameterStatus: 'application_name' = ''
2021-01-27 11:29:49.718 <   ParameterStatus 25 "client_encoding" "UTF8"
pqSaveParameterStatus: 'client_encoding' = 'UTF8'
---

The new trace log prints "ParameterStatus", which reports the run-time parameter
status, and it also prints the parameter name and value in the "StartupMessage"
message. Since the parameter status can be derived from these protocol messages,
I think the extra "pqSaveParameterStatus:" line is not necessary.

However, the StartupMessage itself is currently output as "UnknownMessage",
like this:
 [...] >   UnknownMessage  38 '\x00\x03\x00\x00user\x00iwata\x00database\x00postgres\x00\x00'

To fix this, I want to route protocol messages that lack the leading
message-type byte (byte1) through the frontend message logging function.
I will work on this.

> From: 'Alvaro Herrera' 
> Sent: Friday, January 22, 2021 10:38 PM
> > Ah, that was the reason for separation.  Then, I'm fine with either
> > keeping the separation, or integrating the two functions in fe-misc.c
> > with PQuntrace() being also there.  I kind of think the latter is
> > better, because of code readability and, because tracing facility is
> > not a connection-related reature so categorizing it as "misc" feels
> > natural.
> 
> Maybe we can create a new file specifically for this to avoid mixing
> unrelated stuff in fe-misc.c -- say fe-trace.c where all this new
> tracing stuff goes.

I moved PQtrace() from fe-connect.c to fe-misc.c.
I also agree with creating a new file for the new tracing functions; I will do that.

> From: 'Alvaro Herrera' 
> Sent: Thursday, January 28, 2021 11:14 PM
...
> On 2021-Jan-28, k.jami...@fujitsu.com wrote:
> 
> > I realized that existing libpq regression tests in src/test/examples do not
> > test PQtrace(). At the same time, no other functions Is calling PQtrace(),
> > so it's expected that by default if there are no compilation errors, then it
> > will pass all the tests. And it did.
> >
> > So to check that the tracing feature is enabled and, as requested, outputs
> > a trace file, I temporarily modified a bit of testlibpq.c instead **based 
> > from
> > my current environment settings**, and ran the regression test.
> 
> Yeah.  It seems useful to build a real test harness based on the new
> tracing functionality.  And that is precisely why I added the option to
> suppress timestamps: so that we can write trace files that do not vary
> from run to run.  That allows us to have an "expected" trace to which we
> can compare the trace file produced by the actual run.  I had the idea
> that instead of a timestamp we would print a message counter.  I didn't
> implement that, however.

I think that is a useful suggestion. However, I could not work out how to
verify that the log entries are really in the correct order, and implementing
that change looks quite complicated.
So I would like to brush up the current patch first.
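
For reference, the kind of standalone harness being discussed could be as
simple as the sketch below. This is hypothetical test code, not part of the
patch; it uses only the existing two-argument PQtrace()/PQuntrace() API, and
the connection string and output file name are arbitrary.

/* Hypothetical trace harness (not part of the patch): run one query with
 * tracing enabled, using only the existing two-argument PQtrace() API. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	FILE	   *trace;
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	trace = fopen("trace.out", "w");
	if (trace == NULL)
	{
		fprintf(stderr, "could not open trace.out\n");
		PQfinish(conn);
		return 1;
	}

	PQtrace(conn, trace);
	res = PQexec(conn, "SELECT 1");
	PQclear(res);
	PQuntrace(conn);

	fclose(trace);
	PQfinish(conn);

	/* trace.out could then be compared with an expected file, once timestamps
	 * can be suppressed or replaced by a message counter. */
	return 0;
}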

> (29)
> -void PQtrace(PGconn *conn, FILE *stream);
> +void PQtrace(PGconn *conn, FILE *stream, int flags);
> 
> As I said before, I'm afraid we cannot change the signature of existing API
> functions.  Please leave the signature of this function unchanged, and add a
> new function like PQtraceEx() that adds the flags argument.
> 
> I guess the purpose of adding the flag argument is to not output the timestamp
> by default, because getting timestamps would incur significant overhead
> (however, I'm not sure if that overhead is worth worrying about given the
> already incurred logging overhead.)  So, I think it's better to have a flag 
> like
> PQTRACE_OUTPUT_TIMESTAMPS instead of
> PQTRACE_SUPPRESS_TIMESTAMPS, and the functions would look like:
> 
> void
> PQtrace(PGconn *conn, FILE *stream)
> {
>   PQtraceEx(conn, stream, 0);
> }
> 
> void
> PQtraceEx(PGconn *conn, FILE *stream, int flags)
> {
>   ... the new implementation in the patch
> }
> 
> Remember to add PQtraceEx in exports.txt and make sure it builds on
> Windows with Visual Studio.

For now I fixed this by just adding the new function.
I will look at how similar changes were named in the past and give PQtraceEx() a better name.


> 

Re: Single transaction in the tablesync worker?

2021-02-02 Thread Amit Kapila
On Tue, Feb 2, 2021 at 3:31 PM Peter Smith  wrote:
>
> After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> tried a simple test where I do a DROP TABLE with very bad timing for
> the tablesync worker. It seems that doing this can cause the sync
> worker's MyLogicalRepWorker->relid to become invalid.
>

I think this should be fixed by the latest patch because I have disallowed
dropping a table while its synchronization is in progress. Can you check
once more and let me know if the issue still exists?

-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2021-02-02 Thread Amit Kapila
On Tue, Feb 2, 2021 at 11:35 AM Ajin Cherian  wrote:
>
> Another failure I see in my testing
>

The problem here is that we allow the table to be dropped while its
synchronization is still in progress, and after that we have no way to
know the corresponding slot or origin. We could try to drop the slot and
origin as well, but that is not a good idea because a slot, once dropped,
cannot be rolled back. So, I have added a fix to disallow dropping a table
while its synchronization is still in progress.
Apart from that, I have addressed the comments raised by Peter as discussed
above and made some additional changes to comments, code (the code changes
are cosmetic), and docs.

Please let me know whether the reported issue is fixed.

-- 
With Regards,
Amit Kapila.


v25-0001-Tablesync-Solution1.patch
Description: Binary data


Re: adding wait_start column to pg_locks

2021-02-02 Thread torikoshia

On 2021-01-25 23:44, Fujii Masao wrote:

Another comment is; Doesn't the change of MyProc->waitStart need the
lock table's partition lock? If yes, we can do that by moving
LWLockRelease(partitionLock) just after the change of
MyProc->waitStart, but which causes the time that lwlock is being held
to be long. So maybe we need another way to do that.


Thanks for your comments!

For the consistency of the view, it would be ideal to record "waitstart"
while holding the lock table's partition lock.
However, as you pointed out, that would have a non-negligible performance
impact.


I may be missing something, but as far as I can see, the only consequence of
not holding the lock is that "waitstart" can be NULL even though "granted"
is false.


I think people mainly want to know the wait start time when a lock has been
waited on for a long time, and in that case "waitstart" should already have
been recorded.

If that is true, the current implementation may be enough, on the condition
that users understand "waitstart" can be NULL while "granted" is false.
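
For illustration, here is the kind of check I expect people to run. This is a
hypothetical libpq snippet, not part of the patch; the only assumptions beyond
stock libpq are the connection string and that the patched pg_locks exposes
the new "waitstart" column.

/* Hypothetical monitoring snippet: report how long ungranted locks have been
 * waiting, assuming the patched pg_locks exposes the new "waitstart" column. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	res = PQexec(conn,
				 "SELECT pid, locktype, mode, now() - waitstart AS waited "
				 "FROM pg_locks WHERE NOT granted");

	if (PQresultStatus(res) == PGRES_TUPLES_OK)
	{
		for (int i = 0; i < PQntuples(res); i++)
			printf("pid=%s %s/%s waited=%s\n",
				   PQgetvalue(res, i, 0),
				   PQgetvalue(res, i, 1),
				   PQgetvalue(res, i, 2),
				   /* waitstart may not have been filled in yet */
				   PQgetisnull(res, i, 3) ? "(not recorded yet)"
										  : PQgetvalue(res, i, 3));
	}

	PQclear(res);
	PQfinish(conn);
	return 0;
}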


Attached a patch describing this in the doc and comments.


Any Thoughts?

Regards,


--
Atsushi Torikoshi
From 03c6e1ed6ffa215ee898b5a6a75d77277fb8e672 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 2 Feb 2021 21:32:36 +0900
Subject: [PATCH v6] To examine the duration of locks, we did join on pg_locks
 and pg_stat_activity and used columns such as query_start or state_change.
 However, since they are the moment when queries have started or their state
 has changed, we could not get the lock duration in this way. This patch adds
 a new field "waitstart" preserving lock acquisition wait start time.

Note that updating this field and lock acquisition are not performed
synchronously for performance reasons.  Therefore, depending on the
timing, it can happen that waitstart is NULL even though granted is
false.

Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a...@oss.nttdata.com
---
 contrib/amcheck/expected/check_btree.out |  4 ++--
 doc/src/sgml/catalogs.sgml   | 14 ++
 src/backend/storage/ipc/standby.c| 16 ++--
 src/backend/storage/lmgr/lock.c  |  8 
 src/backend/storage/lmgr/proc.c  | 10 ++
 src/backend/utils/adt/lockfuncs.c|  9 -
 src/include/catalog/pg_proc.dat  |  6 +++---
 src/include/storage/lock.h   |  2 ++
 src/include/storage/proc.h   |  1 +
 src/test/regress/expected/rules.out  |  5 +++--
 10 files changed, 65 insertions(+), 10 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 13848b7449..5a3f1ef737 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -97,8 +97,8 @@ SELECT bt_index_parent_check('bttest_b_idx');
 SELECT * FROM pg_locks
 WHERE relation = ANY(ARRAY['bttest_a', 'bttest_a_idx', 'bttest_b', 'bttest_b_idx']::regclass[])
 AND pid = pg_backend_pid();
- locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath 
---+--+--+--+---++---+-+---+--++-+--+-+--
+ locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart 
+--+--+--+--+---++---+-+---+--++-+--+-+--+---
 (0 rows)
 
 COMMIT;
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 865e826fb0..d81d6e1c52 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -10592,6 +10592,20 @@ SCRAM-SHA-256$iteration count:
lock table
   
  
+
+ 
+  
+   waitstart timestamptz
+  
+  
+   Lock acquisition wait start time.
+   Note that updating this field and lock acquisition are not performed
+   synchronously for performance reasons.  Therefore, depending on the
+   timing, it can happen that waitstart is
+   NULL even though
+   granted is false.
+  
+ 
 

   
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..2282229568 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -539,13 +539,25 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 void
 ResolveRecoveryConflictWithLock(LOCKTAG locktag, bool logging_conflict)
 {
-	TimestampTz ltime;
+	TimestampTz ltime, now;
 
 	Assert(InHotStandby);
 
 	ltime = GetStandbyLimitTime();
+	now = GetCurrentTimestamp();
 
-	if (GetCurrentTimestamp() >= ltime && ltime != 0)
+	

Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)

2021-02-02 Thread Joel Jacobson
On Tue, Feb 2, 2021, at 12:34, Euler Taveira wrote:
>You should probably email: webmaster (at) postgresql (dot) org

Thanks, done.

/Joel

Re: row filtering for logical replication

2021-02-02 Thread Euler Taveira
On Tue, Feb 2, 2021, at 8:38 AM, japin wrote:
> In 0003 patch, function GetPublicationRelationQuals() has been defined, but it
> never used.  So why should we define it?
Thanks for taking a look again. It is an oversight. It was introduced in an
attempt to refactor ALTER PUBLICATION SET TABLE. In AlterPublicationTables, we
could possibly keep some publication-table mappings that do not change,
however, since commit 3696a600e2, it is required to find the qual for all
inheritors (see GetPublicationRelations). I explain this decision in the
following comment:

/*
 * Remove all publication-table mappings.  We could possibly
 * remove (i) tables that are not found in the new table list and
 * (ii) tables that are being re-added with a different qual
 * expression. For (ii), simply updating the existing tuple is not
 * enough, because of qual expression dependencies.
 */

I will post a new patch set later.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Improve new hash partition bound check error messages

2021-02-02 Thread Heikki Linnakangas

On 02/02/2021 12:35, Peter Eisentraut wrote:

I had a bit of trouble parsing the error message "every hash partition
modulus must be a factor of the next larger modulus", so I went into the
code, added some comments and added errdetail messages for each case.  I
think it's a bit clearer now.


Yeah, that error message is hard to understand. This is an improvement, 
but it still took me a while to understand it.


Let's look at the example in the regression test:

-- check partition bound syntax for the hash partition
CREATE TABLE hash_parted (
a int
) PARTITION BY HASH (a);
CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 
10, REMAINDER 0);
CREATE TABLE hpart_2 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 
50, REMAINDER 1);
CREATE TABLE hpart_3 PARTITION OF hash_parted FOR VALUES WITH (MODULUS 
200, REMAINDER 2);


With this patch, you get this:

CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS 
25, REMAINDER 3);
ERROR:  every hash partition modulus must be a factor of the next larger 
modulus

DETAIL:  The existing modulus 10 is not a factor of the new modulus 25.

CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS 
150, REMAINDER 3);
ERROR:  every hash partition modulus must be a factor of the next larger 
modulus

DETAIL:  The new modulus 150 is not a factor of the existing modulus 200.


How about this?

CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS 
25, REMAINDER 3);
ERROR:  every hash partition modulus must be a factor of the next larger 
modulus
DETAIL:  25 is not divisible by 10, the modulus of existing partition 
"hpart_1"


CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS 
150, REMAINDER 3);
ERROR:  every hash partition modulus must be a factor of the next larger 
modulus
DETAIL:  150 is not a factor of 200, the modulus of existing partition 
"hpart_3"


Calling the existing partition by name seems good. And this phrasing 
puts the focus on the new modulus in both variants; presumably the 
existing partition is OK and the problem is in the new definition.


- Heikki




Re: Improve new hash partition bound check error messages

2021-02-02 Thread Amit Langote
On Tue, Feb 2, 2021 at 7:36 PM Peter Eisentraut
 wrote:
> I had a bit of trouble parsing the error message "every hash partition
> modulus must be a factor of the next larger modulus", so I went into the
> code, added some comments and added errdetail messages for each case.  I
> think it's a bit clearer now.

That is definitely an improvement, thanks.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: row filtering for logical replication

2021-02-02 Thread japin


On Tue, 02 Feb 2021 at 19:16, japin  wrote:
> On Tue, 02 Feb 2021 at 13:02, Michael Paquier  wrote:
>> On Mon, Feb 01, 2021 at 04:11:50PM -0300, Euler Taveira wrote:
>>> After the commit 3696a600e2, the last patch does not apply cleanly. I'm
>>> attaching another version to address the documentation issues.
>>
>> I have bumped into this thread, and applied 0001.  My guess is that
>> one of the patches developped originally for logical replication
>> defined atttypmod in LogicalRepRelation, but has finished by not using
>> it.  Nice catch.
>
> Since the 0001 patch already be commited (4ad31bb2ef), we can remove it.

In the 0003 patch, the function GetPublicationRelationQuals() is defined but
never used.  Why do we need to define it?

$ grep 'GetPublicationRelationQuals' -rn src/
src/include/catalog/pg_publication.h:116:extern List *GetPublicationRelationQuals(Oid pubid);
src/backend/catalog/pg_publication.c:347:GetPublicationRelationQuals(Oid pubid)

If we must keep it, here are some comments on it.

(1)
value_datum = heap_getattr(tup, Anum_pg_publication_rel_prqual, 
RelationGetDescr(pubrelsrel), &isnull);

This line looks too long; we could split it into two lines.

(2)
Since qual_value is only used in the "if (!isnull)" branch, we can narrow its
scope.

(3)
Should we free the memory for qual_value?

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)

2021-02-02 Thread Euler Taveira
On Tue, Feb 2, 2021, at 5:13 AM, Joel Jacobson wrote:
> I've tried to login to the CF interface a couple of times now, but seems to 
> have lost my password.
> I've tried to use the "Password reset" form [1], but I don't get any email.
> The email is correct, because when I try to re-register it says it's taken.
> 
> Not sure who I should ask for help. Anyone?
You should probably email: webmaster (at) postgresql (dot) org


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Typo in tablesync comment

2021-02-02 Thread Euler Taveira
On Tue, Feb 2, 2021, at 2:19 AM, Michael Paquier wrote:
> On Tue, Feb 02, 2021 at 10:38:31AM +1100, Peter Smith wrote:
> > PSA a trivial patch to correct what seems like a typo in the tablesync 
> > comment.
> 
> - *   subscribed tables and their state.  Some transient state during data
> - *   synchronization is kept in shared memory.  The states SYNCWAIT and
> + *   subscribed tables and their state.  Some transient states during data
> + *   synchronization are kept in shared memory. The states SYNCWAIT and
> 
> This stuff refers to SUBREL_STATE_* in pg_subscription_rel.h, and FWIW
> I find confusing the term "transient" in this context as a state may
> last for a rather long time, depending on the time it takes to
> synchronize the relation, no?  I am wondering if we could do better
> here, say:
> "The state tracking the progress of the relation synchronization is
> additionally stored in shared memory, with SYNCWAIT and CATCHUP only
> appearing in memory."
WFM.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: row filtering for logical replication

2021-02-02 Thread japin


On Tue, 02 Feb 2021 at 13:02, Michael Paquier  wrote:
> On Mon, Feb 01, 2021 at 04:11:50PM -0300, Euler Taveira wrote:
>> After the commit 3696a600e2, the last patch does not apply cleanly. I'm
>> attaching another version to address the documentation issues.
>
> I have bumped into this thread, and applied 0001.  My guess is that
> one of the patches developped originally for logical replication
> defined atttypmod in LogicalRepRelation, but has finished by not using
> it.  Nice catch.

Since the 0001 patch has already been committed (4ad31bb2ef), we can remove it.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Add primary keys to system catalogs

2021-02-02 Thread Peter Eisentraut

On 2021-01-22 16:42, Tom Lane wrote:

pg_depend
pg_shdepend

Yeah, this is noted in the patch's own regression tests.


Wouldn't it be possible to add primary keys to these two as well?

Neither of the existing indexes is suitable, not being unique.

We could imagine adding a unique index across the whole column set,
but that would be an awfully large price to pay for neatnik-ism.
Also, at least for pg_depend (less sure about pg_shdepend), some code
cleanup would be required, because I don't think that we try very
hard to avoid making duplicate dependency entries.  On the whole
I feel this'd be counterproductive.


I did attempt to make a six- or seven-column unique constraint on 
pg_depend a while ago.  This fails pretty much immediately during 
initdb.  None of the code that adds dependencies, in particular 
recordMultipleDependencies(), checks if the dependency is already there.


I do wonder, however, under what circumstances code would be put into a 
situation where it would add the exact same dependency again, and also 
under what circumstances it would add a dependency between the same 
objects but a different deptype, and how that would work during 
recursive deletion.  Now that we have the refobjversion column, the 
presence of duplicate dependencies might be even more dubious.  I think 
that could be worth analyzing.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: Add primary keys to system catalogs

2021-02-02 Thread Peter Eisentraut

On 2021-02-01 15:24, Tom Lane wrote:

Peter Eisentraut  writes:

On 2021-01-30 22:56, Tom Lane wrote:

Hmm, shouldn't there have been a catversion bump in there?



I suppose yes on the grounds that it introduces something new in a
freshly initdb-ed database.  But I thought it wasn't necessary because
there is no dependency between the binaries and the on-disk state.


I've generally worked on the theory that a catversion bump is indicated
if you need to initdb in order to pass the updated regression tests.
Which one did in this case.


Yeah, that's a good way of looking at it.  I'll keep that in mind.

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Improve new hash partition bound check error messages

2021-02-02 Thread Peter Eisentraut
I had a bit of trouble parsing the error message "every hash partition 
modulus must be a factor of the next larger modulus", so I went into the 
code, added some comments and added errdetail messages for each case.  I 
think it's a bit clearer now.
From e7f392e2f8950236a22f007cc3aed36729da22e1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 2 Feb 2021 11:21:21 +0100
Subject: [PATCH] Improve new hash partition bound check error messages

For the error message "every hash partition modulus must be a factor
of the next larger modulus", add a detail message that shows the
particular numbers involved.  Also comment the code more.
---
 src/backend/partitioning/partbounds.c  | 58 +++---
 src/test/regress/expected/alter_table.out  |  1 +
 src/test/regress/expected/create_table.out |  2 +
 3 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 0c3f212ff2..7425e34040 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -2832,14 +2832,9 @@ check_new_partition_bound(char *relname, Relation parent,
 
 				if (partdesc->nparts > 0)
 				{
-					Datum	  **datums = boundinfo->datums;
-					int			ndatums = boundinfo->ndatums;
 					int			greatest_modulus;
 					int			remainder;
 					int			offset;
-					bool		valid_modulus = true;
-					int			prev_modulus,	/* Previous largest modulus */
-								next_modulus;	/* Next largest modulus */
 
 					/*
 					 * Check rule that every modulus must be a factor of the
@@ -2849,7 +2844,9 @@ check_new_partition_bound(char *relname, Relation parent,
 					 * modulus 15, but you cannot add both a partition with
 					 * modulus 10 and a partition with modulus 15, because 10
 					 * is not a factor of 15.
-					 *
+					 */
+
+					/*
 					 * Get the greatest (modulus, remainder) pair contained in
 					 * boundinfo->datums that is less than or equal to the
 					 * (spec->modulus, spec->remainder) pair.
@@ -2859,26 +2856,51 @@ check_new_partition_bound(char *relname, Relation parent,
 												  spec->remainder);
 					if (offset < 0)
 					{
-						next_modulus = DatumGetInt32(datums[0][0]);
-						valid_modulus = (next_modulus % spec->modulus) == 0;
+						int			next_modulus;
+
+						/*
+						 * All existing moduli are greater or equal, so the
+						 * new one must be a factor of the smallest one, which
+						 * is first in the boundinfo.
+						 */
+						next_modulus = DatumGetInt32(boundinfo->datums[0][0]);
+						if (next_modulus % spec->modulus != 0)
+							ereport(ERROR,
+									(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+									 errmsg("every hash partition modulus must be a factor of the next larger modulus"),
+									 errdetail("The new modulus %d is not a factor of the existing modulus %d.",
+											   spec->modulus, next_modulus)));
 					}
 					else
 					{
-						prev_modulus = DatumGetInt32(datums[offset][0]);
-						valid_modulus = (spec->modulus % prev_modulus) == 0;
+

Re: Single transaction in the tablesync worker?

2021-02-02 Thread Ajin Cherian
On Tue, Feb 2, 2021 at 7:40 PM Amit Kapila  wrote:
>
> On Tue, Feb 2, 2021 at 10:34 AM Ajin Cherian  wrote:
> >
> > On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila  wrote:
> > > I have updated the patch to display WARNING for each of the tablesync
> > > slots during DropSubscription. As discussed, I have moved the drop
> > > slot related code towards the end in AlterSubscription_refresh. Apart
> > > from this, I have fixed one more issue in tablesync code where in
> > > after catching the exception we were not clearing the transaction
> > > state on the publisher, see changes in LogicalRepSyncTableStart. I
> > > have also fixed other comments raised by you. Additionally, I have
> > > removed the test because it was creating the same name slot as the
> > > tablesync worker and tablesync worker removed the same due to new
> > > logic in LogicalRepSyncStart. Earlier, it was not failing because of
> > > the bug in that code which I have fixed in the attached.
> > >
> >
> > I was testing this patch. I had a table on the subscriber which had a
> > row that would cause a PK constraint
> > violation during the table copy. This is resulting in the subscriber
> > trying to rollback the table copy and failing.
> >
>
> I am not getting this error. I have tried by below test:

I am sorry, my above steps were not correct. I think the reason for
the failure I was seeing were some other steps I did prior to this. I
will recreate this and update you with the appropriate steps.

regards,
Ajin Cherian
Fujitsu Australia




Re: Single transaction in the tablesync worker?

2021-02-02 Thread Peter Smith
After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
tried a simple test where I do a DROP TABLE with very bad timing for
the tablesync worker. It seems that doing this can cause the sync
worker's MyLogicalRepWorker->relid to become invalid.

In my test this caused a stack trace within some logging, but I
imagine other bad things can happen if the tablesync worker can be
executed with an invalid relid.

Possibly this is an existing PG bug that has just never been seen
before; the ereport that failed here is not new code.

PSA the log for the test steps and the stack trace details.


[ac0202] 
https://www.postgresql.org/message-id/CAFPTHDYzjaNfzsFHpER9idAPB8v5j%3DSUbWL0AKj5iVy0BKbTpg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Test Scenario

1. INSERT data so tablesync should copy something
2. While paused in LogicalRepSyncTableStart do a DROP TABLE to rip the rug out 
from under the sync worker!
3. Continue the sync worker see what happens

==

##
## Insert data
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 
'foo');"
INSERT 0 1

##
## SUBSCRIBE and continue to breakpoint at top of tablesync function 
LogicalRepSyncTableStart
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION 
tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' 
PUBLICATION tap_pub;"
2021-02-02 19:29:16.578 AEDT [26402] LOG:  logical decoding found consistent 
point at 0/165F800
2021-02-02 19:29:16.578 AEDT [26402] DETAIL:  There are no running transactions.
2021-02-02 19:29:16.578 AEDT [26402] STATEMENT:  CREATE_REPLICATION_SLOT 
"tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2021-02-02 19:29:16.587 AEDT [26405] LOG:  logical replication apply worker for 
subscription "tap_sub" has started
2021-02-02 19:29:16.587 AEDT [26405] LOG:  !!>> The apply worker process has 
PID = 26405
2021-02-02 19:29:16.597 AEDT [26411] LOG:  starting logical decoding for slot 
"tap_sub"
2021-02-02 19:29:16.597 AEDT [26411] DETAIL:  Streaming transactions committing 
after 0/165F838, reading WAL from 0/165F800.
2021-02-02 19:29:16.597 AEDT [26411] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-02 19:29:16.598 AEDT [26411] LOG:  logical decoding found consistent 
point at 0/165F800
2021-02-02 19:29:16.598 AEDT [26411] DETAIL:  There are no running transactions.
2021-02-02 19:29:16.598 AEDT [26411] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-02 19:29:16.598 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:16.598 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
[postgres@CentOS7-x64 ~]$ 2021-02-02 19:29:16.602 AEDT [26415] LOG:  logical 
replication table synchronization worker for subscription "tap_sub", table 
"test_tab" has started
2021-02-02 19:29:16.602 AEDT [26415] LOG:  !!>> The tablesync worker process 
has PID = 26415
2021-02-02 19:29:16.602 AEDT [26415] LOG:  !!>>


Sleeping 30 secs. For debugging, attach to process 26415 now!



2021-02-02 19:29:17.610 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:17.610 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:18.611 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:18.611 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:19.612 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:19.612 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:20.613 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:20.613 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:21.614 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:21.614 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:22.615 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:22.615 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:23.615 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:23.615 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:24.658 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:24.658 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:25.661 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:25.661 AEDT [26405] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-02-02 19:29:26.662 AEDT [26405] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2021-02-02 19:29:26.662 AEDT 

Re: Two patches to speed up pg_rewind.

2021-02-02 Thread Paul Guo


On Jan 28, 2021, at 3:31 PM, Michael Paquier  wrote:

On Wed, Jan 27, 2021 at 09:18:48AM +, Paul Guo wrote:
Second one is use copy_file_range() for the local rewind case to replace 
read()+write().
This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other
code could use copy_file_range() if needed. copy_file_range() was introduced
In high-version Linux Kernel, in low-version Linux or other Unix-like OS mmap()
might be better than read()+write() but copy_file_range() is more interesting
given that it could skip the data copying in some file systems - this could 
benefit more
on Linux fs on network-based block storage.

Have you done some measurements?

I did not test pg_rewind itself, but for patch 2 I tested copy_file_range() vs
read()+write() on XFS on Ubuntu 20.04.1 while working on the patches.

Here are the timings for copying a 1GB file (fully populated with random data).
The test is a simple C program.

copy_file_range() loop (actually it finished after one call) + fsync()
0m0.048s

For read()+write() loop with read/write buffer size 32K + fsync()
0m5.004s
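
The copy loop in such a benchmark is essentially the following. This is a
simplified standalone sketch, not the pg_rewind code or my exact test program;
it assumes a Linux kernel and glibc that provide copy_file_range().

/* Simplified standalone benchmark sketch: copy one file into another with
 * copy_file_range() in a loop, then fsync the destination. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
	int			src;
	int			dst;
	struct stat st;
	off_t		left;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s srcfile dstfile\n", argv[0]);
		return 1;
	}

	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0 || fstat(src, &st) < 0)
	{
		perror("open/fstat");
		return 1;
	}

	left = st.st_size;
	while (left > 0)
	{
		ssize_t		n = copy_file_range(src, NULL, dst, NULL, left, 0);

		if (n < 0)
		{
			perror("copy_file_range");
			return 1;
		}
		if (n == 0)				/* unexpected EOF */
			break;
		left -= n;
	}

	if (fsync(dst) < 0)
		perror("fsync");
	close(src);
	close(dst);
	return 0;
}

The read()+write() variant just replaces the copy_file_range() call with a
read() into a 32kB buffer followed by a write().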

As for patch 1, it makes pg_rewind sync fewer files, so it clearly benefits performance.


Re: memory leak in auto_explain

2021-02-02 Thread japin

On Tue, 02 Feb 2021 at 07:12, Jeff Janes  wrote:
> On Mon, Feb 1, 2021 at 6:09 PM Jeff Janes  wrote:
>
>>
>>
>> create or replace function gibberish(int) returns text language SQL as $_$
>> select left(string_agg(md5(random()::text),''),$1) from
>> generate_series(0,$1/32) $_$;
>>
>> create table j1 as select x, md5(random()::text) as t11, gibberish(1500)
>> as t12 from generate_series(1,20e6) f(x);
>>
>
> I should have added, found it on HEAD, verified it also in 12.5.
>

Here's my analysis:
1) In explain_ExecutorEnd(), an ExplainState is created in the SQL function's
memory context, which is long-lived, so the memory keeps growing.

/*
 * Switch to context in which the fcache lives.  This ensures that our
 * tuplestore etc will have sufficient lifetime.  The sub-executor is
 * responsible for deleting per-tuple information.  (XXX in the case of a
 * long-lived FmgrInfo, this policy represents more memory leakage, but
 * it's not entirely clear where to keep stuff instead.)
 */
oldcontext = MemoryContextSwitchTo(fcache->fcontext);

2) I tried calling pfree() to release the ExplainState memory, but it does not
help, and I do not know why. So instead I created it in the
queryDesc->estate->es_query_cxt memory context, like queryDesc->totaltime, and
that works.

Attached is a fix for the memory leak in auto_explain. Any thoughts?

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

From 2589cf9c78bbd308b6f300277500c920152ee6f2 Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Tue, 2 Feb 2021 17:27:21 +0800
Subject: [PATCH v1] Fix memory leak in auto_explain

---
 contrib/auto_explain/auto_explain.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index faa6231d87..033540a3f4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -383,6 +383,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		msec = queryDesc->totaltime->total * 1000.0;
 		if (msec >= auto_explain_log_min_duration)
 		{
+			MemoryContext oldctx = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
 			ExplainState *es = NewExplainState();
 
 			es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
@@ -426,6 +427,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 	 errhidestmt(true)));
 
 			pfree(es->str->data);
+			MemoryContextSwitchTo(oldctx);
 		}
 	}
 
-- 
2.30.0



RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-02 Thread tsunakawa.ta...@fujitsu.com
From: Andrey Lepikhov 
> Of course, you can rebase it.

Thank you.  I might modify the basic part to incorporate my past proposal about 
improving the layering or modularity related to ri_useMultiInsert.  (But I may 
end up giving up due to lack of energy.)

Also, I might defer working on the extended part (v9 0003 and 0004) and further 
separate them in a different thread, if it seems to take longer.


Regards
Takayuki Tsunakawa




join plan with unexpected var clauses

2021-02-02 Thread Luc Vlaming

Hi,

At a customer we came across a curious plan (see attached testcase).

Given the testcase we see that the outer semi join tries to join the 
outer with the inner table id columns, even though the middle table id 
column is also there. Is this expected behavior?


The reason I'm asking is two-fold:
- the inner hash table is now bigger than I'd expect and has columns that you
would normally not select on.
- the middle join now projects the inner table's column as its result, which is
quite surprising and seems invalid from a SQL standpoint.


Plan:
 Finalize Aggregate
   Output: count(*)
   ->  Gather
 Output: (PARTIAL count(*))
 Workers Planned: 4
 ->  Partial Aggregate
   Output: PARTIAL count(*)
   ->  Parallel Hash Semi Join
 Hash Cond: (_outer.id3 = _inner.id2)
 ->  Parallel Seq Scan on public._outer
   Output: _outer.id3, _outer.extra1
 ->  Parallel Hash
   Output: middle.id1, _inner.id2
   ->  Parallel Hash Semi Join
 Output: middle.id1, _inner.id2
 Hash Cond: (middle.id1 = _inner.id2)
 ->  Parallel Seq Scan on public.middle
   Output: middle.id1
 ->  Parallel Hash
   Output: _inner.id2
   ->  Parallel Seq Scan on 
public._inner

 Output: _inner.id2

Kind regards,
Luc
Swarm64


testcase.sql
Description: application/sql


Re: proposal: schema variables

2021-02-02 Thread Pavel Stehule
Hi

Rebased, and set a primary key for the pg_variable table.

Regards

Pavel


schema-variables-20210202.patch.gz
Description: application/gzip


Re: Single transaction in the tablesync worker?

2021-02-02 Thread Amit Kapila
On Tue, Feb 2, 2021 at 10:34 AM Ajin Cherian  wrote:
>
> On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila  wrote:
> > I have updated the patch to display WARNING for each of the tablesync
> > slots during DropSubscription. As discussed, I have moved the drop
> > slot related code towards the end in AlterSubscription_refresh. Apart
> > from this, I have fixed one more issue in tablesync code where in
> > after catching the exception we were not clearing the transaction
> > state on the publisher, see changes in LogicalRepSyncTableStart. I
> > have also fixed other comments raised by you. Additionally, I have
> > removed the test because it was creating the same name slot as the
> > tablesync worker and tablesync worker removed the same due to new
> > logic in LogicalRepSyncStart. Earlier, it was not failing because of
> > the bug in that code which I have fixed in the attached.
> >
>
> I was testing this patch. I had a table on the subscriber which had a
> row that would cause a PK constraint
> violation during the table copy. This is resulting in the subscriber
> trying to rollback the table copy and failing.
>

I am not getting this error. I have tried by below test:
Publisher
===
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
COMMIT;

CREATE PUBLICATION mypublication FOR TABLE mytbl1;

Subscriber
=
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
COMMIT;

CREATE SUBSCRIPTION mysub
 CONNECTION 'host=localhost port=5432 dbname=postgres'
PUBLICATION mypublication;

It generated the PK violation the first time; then I removed the
conflicting rows on the subscriber and it passed. See the logs below.

2021-02-02 13:51:34.316 IST [20796] LOG:  logical replication table
synchronization worker for subscription "mysub", table "mytbl1" has
started
2021-02-02 13:52:43.625 IST [20796] ERROR:  duplicate key value
violates unique constraint "mytbl1_pkey"
2021-02-02 13:52:43.625 IST [20796] DETAIL:  Key (id)=(1) already exists.
2021-02-02 13:52:43.625 IST [20796] CONTEXT:  COPY mytbl1, line 1
2021-02-02 13:52:43.695 IST [27840] LOG:  background worker "logical
replication worker" (PID 20796) exited with exit code 1
2021-02-02 13:52:43.884 IST [6260] LOG:  logical replication table
synchronization worker for subscription "mysub", table "mytbl1" has
started
2021-02-02 13:53:54.680 IST [6260] LOG:  logical replication table
synchronization worker for subscription "mysub", table "mytbl1" has
finished

Also, a similar test exists in 0004_sync.pl; is that also failing for
you? Can you please provide the detailed steps that led to this failure?

>
> And one more thing I see is that now we error out in PG_CATCH() in
> LogicalRepSyncTableStart() with the above error and as a result, the
> tablesync slot is not dropped. Hence causing the slot create to fail
> in the next restart.
> I think this can be avoided. We could either attempt a rollback only
> on specific failures and drop slot prior to erroring out.
>

Hmm, we have to first rollback before attempting any other operation
because the transaction on the publisher is in an errored state.


-- 
With Regards,
Amit Kapila.




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-02 Thread Andrey Lepikhov

On 2/2/21 11:57, tsunakawa.ta...@fujitsu.com wrote:

Hello, Andrey-san,


From: Tang, Haiying 

Sometimes before i suggested additional optimization [1] which can
additionally speed up COPY by 2-4 times. Maybe you can perform the
benchmark for this solution too?

...

But the patch no longer applies, I tried to apply on e42b3c3bd6(2021/1/26) but
failed.

Can you send a rebased version?

When do you think you can submit the rebased version of them?  (IIUC at the 
off-list HighGo meeting, you're planning to post them late this week after the 
global snapshot patch.)  Just in case you are not going to do them for the 
moment, can we rebase and/or further modify them so that they can be committed 
in PG 14?

Of course, you can rebase it.

--
regards,
Andrey Lepikhov
Postgres Professional




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-02 Thread Hou, Zhijie
> 
> I've had a further look at this, and this walker function is doing a lot
> of work recursing the parse tree, and I'm not sure that it reliably retrieves
> the information that we're looking for, for all cases of different SQL
> queries. Unless it can be made much more efficient and specific to our needs,
> I think we should not try to do this optimization, because there's too much
> overhead. Also, keep in mind that for the current parallel SELECT
> functionality in Postgres, I don't see any similar optimization being
> attempted (and such optimization should be attempted at the SELECT level).
> So I don't think we should be attempting such optimization in this patch
> (but could be attempted in a separate patch, just related to current parallel
> SELECT functionality).

Yes, agreed.
I was worried about the overhead it might bring too,
so we can remove this from the current patch.

Best regards,
houzj




Re: Transactions involving multiple postgres foreign servers, take 2

2021-02-02 Thread Fujii Masao




On 2021/01/27 14:08, Masahiko Sawada wrote:

On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
 wrote:



You fixed some issues. But maybe you forgot to attach the latest patches?


Yes, I've attached the updated patches.


Thanks for updating the patch! I tried to review 0001 and 0002 as a 
self-contained change.

+ * An FDW that implements both commit and rollback APIs can request to register
+ * the foreign transaction by FdwXactRegisterXact() to participate it to a
+ * group of distributed tranasction.  The registered foreign transactions are
+ * identified by OIDs of server and user.

I'm afraid that the combination of server OID and user OID is not unique. IOW, 
more than one foreign transaction can have the same combination of server and 
user OIDs. For example, the following two SELECT queries start different 
foreign transactions, but their user OID is the same. Should the OID of the 
user mapping be used instead of the user OID?

CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw;
CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 'postgres');
CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres');
CREATE TABLE t(i int);
CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't');
BEGIN;
SELECT * FROM ft;
DROP USER MAPPING FOR postgres SERVER loopback ;
SELECT * FROM ft;
COMMIT;

+   /* Commit foreign transactions if any */
+   AtEOXact_FdwXact(true);

Don't we need to pass the XACT_EVENT_PARALLEL_PRE_COMMIT or XACT_EVENT_PRE_COMMIT 
flag? Probably we don't need to do this if postgres_fdw is the only user of this 
new API. But if we make this a generic API, such flags seem necessary, because 
some foreign data wrappers might behave differently depending on them.

Because of the same reason as above, AtEOXact_FdwXact() should also be called 
after CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_COMMIT : 
XACT_EVENT_COMMIT)?

+   /*
+* Abort foreign transactions if any.  This needs to be done before 
marking
+* this transaction as not running since FDW's transaction callbacks 
might
+* assume this transaction is still in progress.
+*/
+   AtEOXact_FdwXact(false);

Same as above.

+/*
+ * This function is called at PREPARE TRANSACTION.  Since we don't support
+ * preparing foreign transactions yet, raise an error if the local transaction
+ * has any foreign transaction.
+ */
+void
+AtPrepare_FdwXact(void)
+{
+   if (FdwXactParticipants != NIL)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot PREPARE a transaction that has 
operated on foreign tables")));
+}

This means that some foreign data wrappers supporting prepared transactions 
(though I'm not sure whether such wrappers actually exist) cannot use the new 
API? If we want to allow those wrappers to use the new API, AtPrepare_FdwXact() 
should call the prepare callback, and each wrapper should emit an error within 
the callback if necessary.

+   foreach(lc, FdwXactParticipants)
+   {
+   FdwXactParticipant *fdw_part = (FdwXactParticipant *) 
lfirst(lc);
+
+   if (fdw_part->server->serverid == serverid &&
+   fdw_part->usermapping->userid == userid)

Isn't this inefficient when starting lots of foreign transactions, because we 
need to scan all the entries in the list every time?

+static ConnCacheEntry *
+GetConnectionCacheEntry(Oid umid)
+{
+   boolfound;
+   ConnCacheEntry *entry;
+   ConnCacheKey key;
+
+   /* First time through, initialize connection cache hashtable */
+   if (ConnectionHash == NULL)
+   {
+   HASHCTL ctl;
+
+   ctl.keysize = sizeof(ConnCacheKey);
+   ctl.entrysize = sizeof(ConnCacheEntry);
+   ConnectionHash = hash_create("postgres_fdw connections", 8,
+,
+
HASH_ELEM | HASH_BLOBS);

Currently ConnectionHash is created under TopMemoryContext. With the patch, 
since GetConnectionCacheEntry() can be called in other places, ConnectionHash 
may be created under a memory context other than TopMemoryContext? If so, is 
that safe?

-   if (PQstatus(entry->conn) != CONNECTION_OK ||
-   PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
-   entry->changing_xact_state ||
-   entry->invalidated)
...
+   if (PQstatus(entry->conn) != CONNECTION_OK ||
+   PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
+   entry->changing_xact_state)

Why did you get rid of the condition "entry->invalidated"?






I'm reading the 0001 and 0002 patches to pick up the changes for postgres_fdw that 
are worth applying independent 

Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)

2021-02-02 Thread Joel Jacobson
Hi Euler,

I've tried to log in to the CF interface a couple of times now, but I seem to 
have lost my password.
I've tried to use the "Password reset" form [1], but I don't get any email.
The email address is correct, because when I try to re-register it says it's taken.

Not sure who I should ask for help. Anyone?

/Joel

[1] https://www.postgresql.org/account/reset/

On Tue, Feb 2, 2021, at 03:10, Euler Taveira wrote:
> On Sun, Jan 31, 2021, at 10:27 AM, Joel Jacobson wrote:
>> Here comes the patch again, now including data.
> Joel, register this patch into the next CF [1] so we don't lose track of it.
> 
> 
> [1] https://commitfest.postgresql.org/32/
> 
> 
> --
> Euler Taveira
> EDB   https://www.enterprisedb.com/
> 

Kind regards,

Joel


Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-02 Thread Greg Nancarrow
On Mon, Feb 1, 2021 at 8:19 PM Hou, Zhijie  wrote:
>
> Hi,
>
> When developing the reloption patch, I noticed some issues in the patch.
>
> 1).
> > - Reduce Insert parallel-safety checks required for some SQL, by noting
> > that the subquery must operate on a relation (check for RTE_RELATION in
> > subquery range-table)
>
> +   foreach(lcSub, rte->subquery->rtable)
> +   {
> +   rteSub = lfirst_node(RangeTblEntry, lcSub);
> +   if (rteSub->rtekind == RTE_RELATION)
> +   {
> +   hasSubQueryOnRelation = true;
> +   break;
> +   }
> +   }
> It seems we can not only search RTE_RELATION in rtable,
> because RTE_RELATION may exist in other place like:
>
> ---
> --** explain insert into target select (select * from test);
> Subplan's subplan
>
> --** with cte as (select * from test) insert into target select * from cte;
> In query's ctelist.
> ---
>
> May be we should use a walker function [1] to
> search the subquery and ctelist.
>
>
>
> [1]
> static bool
> relation_walker(Node *node)
> {
> if (node == NULL)
> return false;
>
> else if (IsA(node, RangeTblEntry))
> {
> RangeTblEntry *rte = (RangeTblEntry *) node;
> if (rte->rtekind == RTE_RELATION)
> return true;
>
> return false;
> }
>
> else if (IsA(node, Query))
> {
> Query  *query = (Query *) node;
>
> /* Recurse into subselects */
> return query_tree_walker(query, relation_walker,
>  NULL, 
> QTW_EXAMINE_RTES_BEFORE);
> }
>
> /* Recurse to check arguments */
> return expression_tree_walker(node,
>   
> relation_walker,
>   NULL);
> }
>

I've had a further look at this, and this walker function is doing a
lot of work recursing the parse tree, and I'm not sure that it
reliably retrieves the information that we're looking for, for all
cases of different SQL queries. Unless it can be made much more
efficient and specific to our needs, I think we should not try to do
this optimization, because there's too much overhead. Also, keep in
mind that for the current parallel SELECT functionality in Postgres, I
don't see any similar optimization being attempted (and such
optimization should be attempted at the SELECT level). So I don't
think we should be attempting such optimization in this patch (but
could be attempted in a separate patch, just related to current
parallel SELECT functionality).

Regards,
Greg Nancarrow
Fujitsu Australia