Re: Add Information during standby recovery conflicts

2020-12-14 Thread Fujii Masao




On 2020/12/15 15:40, Fujii Masao wrote:



On 2020/12/15 12:04, Kyotaro Horiguchi wrote:

At Tue, 15 Dec 2020 02:00:21 +0900, Fujii Masao  
wrote in

Thanks for the review! I'm thinking to wait half a day before
committing the patch, just in case someone objects to it.


Sorry for coming in late.  I have looked only at the latest thread, so I
may be missing many things; please ignore this if it was already settled
in the past discussion.


It emits messages like the following:

[40509:startup] LOG:  recovery still waiting after 1021.431 ms: recovery 
conflict on lock
[40509:startup] DETAIL:  Conflicting processes: 41171, 41194.
[40509:startup] CONTEXT:  WAL redo at 0/3013158 for Standby/LOCK: xid 510 db 
13609 rel 16384

AFAIK DETAIL usually shows ordinary sentences, so the first word is
capitalized and the sentence ends with a period. But this is not a
sentence, so the trailing period looks odd.  Searching the tree for
errdetail calls showed some anomalies.

src/backend/parser/parse_param.c  errdetail("%s versus %s",
src/backend/jit/llvm/llvmjit_error.cpp  errdetail("while in 
LLVM")));
src/backend/replication/logical/tablesync.c  errdetail("The error was: 
%s", res->err)));
src/backend/tcop/postgres.c errdetail("prepare: %s", 
pstmt->plansource->query_string);
src/backend/tcop/postgres.c errdetail("abort reason: recovery 
conflict");

and one similar occurrence:

src/backend/utils/adt/dbsize.c  errdetail("Invalid size unit: 
\"%s\".", strptr),

I'm not sure, but it seems to me at least the period is unnecessary
here.


Since the Error Message Style Guide in the docs says "Detail and hint messages:
Use complete sentences, and end each with a period.", I think that a period
is necessary here. No?





+    bool    maybe_log_conflict =
+    (standbyWaitStart != 0 && !logged_recovery_conflict);

odd indentation.


This is the result of a pgindent run. I'm not sure why pgindent indents
it that way, but for now I'd like to follow pgindent.





+    /* Also, set the timer if necessary */
+    if (logging_timer)
+    {
+    timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
+    timeouts[cnt].type = TMPARAM_AFTER;
+    timeouts[cnt].delay_ms = DeadlockTimeout;
+    cnt++;
+    }

This doesn't consider spurious wakeups. I'm not sure they actually
happen, but we usually account for them.  That has been the case since
before this patch, but ProcWaitForSignal()'s comment says that:


  * As this uses the generic process latch the caller has to be robust against
  * unrelated wakeups: Always check that the desired state has occurred, and
  * wait again if not.


If we don't care about spurious wakeups, we should add a comment saying so.


If ProcWaitForSignal() wakes up for a reason other than deadlock_timeout
(e.g., SIGHUP), ProcSleep() will call ResolveRecoveryConflictWithLock()
again and we will sleep on ProcWaitForSignal() again, since the recovery
conflict has not been resolved yet. So can we say that spurious wakeups
are handled?
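In other words, the effective behavior is the usual retry loop around
ProcWaitForSignal(). Roughly (a simplified sketch, not the actual code in
ProcSleep(); lock_is_granted is just a placeholder for the real check):

    /*
     * Re-check the wait condition after every wakeup, so an unrelated
     * wakeup only leads to another ResolveRecoveryConflictWithLock()
     * call and another sleep in ProcWaitForSignal().
     */
    while (!lock_is_granted)
        ResolveRecoveryConflictWithLock(locktag);   /* sleeps inside */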

However, when I read the related code again, I found another issue in
the patch. In ResolveRecoveryConflictWithLock(), if SIGHUP causes us to
wake up from ProcWaitForSignal() before deadlock_timeout is reached,
we will use the full deadlock_timeout again when we go back to sleep on
ProcWaitForSignal(). Instead, we should probably use
"deadlock_timeout - elapsed time", so that we can emit the log message as
soon as deadlock_timeout has passed since we started waiting on the
recovery conflict. Otherwise it may take up to "deadlock_timeout * 2" to
log the "still waiting ..." message.

To fix this issue, "deadlock_timeout - elapsed time" needs to be used as
the timeout when enabling the timer, at least in
ResolveRecoveryConflictWithLock() and ResolveRecoveryConflictWithBufferPin().
A similar change also needs to be applied to
ResolveRecoveryConflictWithVirtualXIDs().
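Something like the following could be used to compute the remaining time
(an untested sketch only; it assumes standbyWaitStart holds the timestamp
at which the startup process began waiting):

    /*
     * Sketch: re-arm STANDBY_LOCK_TIMEOUT with the time remaining until
     * deadlock_timeout expires, instead of the full deadlock_timeout.
     */
    static void
    enable_remaining_standby_lock_timeout(TimestampTz standbyWaitStart)
    {
        TimestampTz now = GetCurrentTimestamp();
        long        secs;
        int         usecs;
        int         remaining_ms;

        TimestampDifference(standbyWaitStart, now, &secs, &usecs);
        remaining_ms = DeadlockTimeout - (int) (secs * 1000 + usecs / 1000);

        /* If deadlock_timeout has already passed, fire almost at once. */
        enable_timeout_after(STANDBY_LOCK_TIMEOUT, Max(remaining_ms, 1));
    }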

BTW, even without applying the patch, ResolveRecoveryConflictWithBufferPin()
seems to have this issue *originally*. It enables the deadlock_timeout timer
so as to request hot-standby backends to check themselves for deadlocks.
But if we wake up from ProcWaitForSignal() before deadlock_timeout is
reached, the subsequent call to ResolveRecoveryConflictWithBufferPin() also
uses deadlock_timeout again instead of "deadlock_timeout - elapsed time".
So the request for a deadlock check can be delayed. Furthermore, if
ResolveRecoveryConflictWithBufferPin() always wakes up from
ProcWaitForSignal() before deadlock_timeout is reached, the request for a
deadlock check may never happen at all.

Maybe we should fix that pre-existing issue first, separately from this patch.


Hmm... commit ac22929a26 seems to make things worse. Before that commit,
other wakeup requests such as SIGHUP didn't cause ProcWaitForSignal() to
actually wake up in ResolveRecoveryConflictWithBufferPin(), because such
requests used a different latch from the one that ProcWaitForSignal()
waits on.

But com

Re: Add Information during standby recovery conflicts

2020-12-14 Thread Fujii Masao




On 2020/12/15 12:04, Kyotaro Horiguchi wrote:

At Tue, 15 Dec 2020 02:00:21 +0900, Fujii Masao  
wrote in

Thanks for the review! I'm thinking to wait half a day before
committing the patch, just in case someone objects to it.


Sorry for coming in late.  I have looked only at the latest thread, so I
may be missing many things; please ignore this if it was already settled
in the past discussion.


It emits messages like the following:

[40509:startup] LOG:  recovery still waiting after 1021.431 ms: recovery 
conflict on lock
[40509:startup] DETAIL:  Conflicting processes: 41171, 41194.
[40509:startup] CONTEXT:  WAL redo at 0/3013158 for Standby/LOCK: xid 510 db 
13609 rel 16384

AFAIK DETAIL usually shows ordinary sentences, so the first word is
capitalized and the sentence ends with a period. But this is not a
sentence, so the trailing period looks odd.  Searching the tree for
errdetail calls showed some anomalies.

src/backend/parser/parse_param.c 
errdetail("%s versus %s",
src/backend/jit/llvm/llvmjit_error.cpp   errdetail("while in 
LLVM")));
src/backend/replication/logical/tablesync.c  errdetail("The 
error was: %s", res->err)));
src/backend/tcop/postgres.c errdetail("prepare: %s", 
pstmt->plansource->query_string);
src/backend/tcop/postgres.c errdetail("abort reason: recovery 
conflict");

and one similar occurrence:

src/backend/utils/adt/dbsize.c   errdetail("Invalid size unit: 
\"%s\".", strptr),

I'm not sure, but it seems to me at least the period is unnecessary
here.


Since the Error Message Style Guide in the docs says "Detail and hint messages:
Use complete sentences, and end each with a period.", I think that a period
is necessary here. No?





+   bool        maybe_log_conflict =
+   (standbyWaitStart != 0 && !logged_recovery_conflict);

odd indentation.


This is the result of a pgindent run. I'm not sure why pgindent indents
it that way, but for now I'd like to follow pgindent.





+   /* Also, set the timer if necessary */
+   if (logging_timer)
+   {
+   timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;
+   cnt++;
+   }

This doesn't consider spurious wakeups. I'm not sure they actually
happen, but we usually account for them.  That has been the case since
before this patch, but ProcWaitForSignal()'s comment says that:


  * As this uses the generic process latch the caller has to be robust against
  * unrelated wakeups: Always check that the desired state has occurred, and
  * wait again if not.


If we don't care about spurious wakeups, we should add a comment saying so.


If ProcWaitForSignal() wakes up for a reason other than deadlock_timeout
(e.g., SIGHUP), ProcSleep() will call ResolveRecoveryConflictWithLock()
again and we will sleep on ProcWaitForSignal() again, since the recovery
conflict has not been resolved yet. So can we say that spurious wakeups
are handled?

However, when I read the related code again, I found another issue in
the patch. In ResolveRecoveryConflictWithLock(), if SIGHUP causes us to
wake up from ProcWaitForSignal() before deadlock_timeout is reached,
we will use the full deadlock_timeout again when we go back to sleep on
ProcWaitForSignal(). Instead, we should probably use
"deadlock_timeout - elapsed time", so that we can emit the log message as
soon as deadlock_timeout has passed since we started waiting on the
recovery conflict. Otherwise it may take up to "deadlock_timeout * 2" to
log the "still waiting ..." message.

To fix this issue, "deadlock_timeout - elapsed time" needs to be used as
the timeout when enabling the timer, at least in
ResolveRecoveryConflictWithLock() and ResolveRecoveryConflictWithBufferPin().
A similar change also needs to be applied to
ResolveRecoveryConflictWithVirtualXIDs().

BTW, even without applying the patch, ResolveRecoveryConflictWithBufferPin()
seems to have this issue *originally*. It enables the deadlock_timeout timer
so as to request hot-standby backends to check themselves for deadlocks.
But if we wake up from ProcWaitForSignal() before deadlock_timeout is
reached, the subsequent call to ResolveRecoveryConflictWithBufferPin() also
uses deadlock_timeout again instead of "deadlock_timeout - elapsed time".
So the request for a deadlock check can be delayed. Furthermore, if
ResolveRecoveryConflictWithBufferPin() always wakes up from
ProcWaitForSignal() before deadlock_timeout is reached, the request for a
deadlock check may never happen at all.

Maybe we should fix that pre-existing issue first, separately from this patch.



+   bool        maybe_log_conflict;
+   bool        maybe_update_title;

Although it should be a matter of taste and I understand t

Re: Misleading comment in prologue of ReorderBufferQueueMessage

2020-12-14 Thread Ashutosh Bapat
On Mon, Dec 14, 2020 at 3:14 PM Amit Kapila  wrote:

> On Mon, Dec 14, 2020 at 2:45 PM Ashutosh Bapat
>  wrote:
> >
> > The name of the function suggests that the given message will be queued
> in ReorderBuffer. The prologue of the function says so too
> >  776 /*
> >  777  * Queue message into a transaction so it can be processed upon
> commit.
> >  778  */
> > It led me to think that a non-transactional message is processed along
> with the surrounding transaction, esp. when it has an associated xid.
> >
> > But in reality, the function queues only a transactional message and
> decodes a non-transactional message immediately without waiting for a
> commit.
> >
> > We should modify the prologue to say
> > "Queue a transactional message into a transaction so that it can be
> processed upon commit. A non-transactional message is processed
> immediately." and also change the name of the function to
> ReorderBufferProcessMessage(), but the latter may break API compatibility.
> >
>
> +1 for the comment change but I am not sure if it is a good idea to
> change the API name.
>
Can you please review the wording? I will create a patch with the updated wording.
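For concreteness, the prologue I have in mind would read roughly like
this (wording only, not a final patch):

    /*
     * Queue a transactional message into a transaction so that it can be
     * processed upon commit.  A non-transactional message is processed
     * immediately and is not queued.
     */

so the header comment no longer implies that non-transactional messages
wait for the surrounding transaction.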
-- 
--
Best Wishes,
Ashutosh


Re: Movement of restart_lsn position movement of logical replication slots is very slow

2020-12-14 Thread Jammie
Thanks Amit for the response

We are using the pgJDBC sample program from here:
https://jdbc.postgresql.org/documentation/head/replication.html

The setFlushLSN call is coming from pgJDBC only.

The GitHub repository for the pgJDBC APIs/methods is available here:

https://github.com/pgjdbc/pgjdbc

The second slot refers to the "private" slot.

So ""we are not doing reading from the stream' ==> It means that we are
having readPending call only from the shared slot then we get the
lastReceivedLSN() from stream and
send it back to stream as confirmed_flush_lsn for both private and shared
slot. We dont do readPending call to private slot. we will use private slot
only when we dont have choice. It is kind of reserver slot for us.

We are also doing forceUpdateStatus for both the slots().

Questions:
1) "The confirmed_flush_lsn is updated when we get confirmation
from the downstream node about the flush_lsn but restart_lsn is only
incremented based on the LSN required by the oldest in-progress
transaction." ==> As explained above, we are updating both slots with the
same LSN (setFlushLSN is the API used to update confirmed_flush_lsn). So I
don't understand why one of them lags behind.

2) What other factors might delay the updating of a slot's restart_lsn?

3) Does this behaviour change in PG 13?

Regards
Shailesh


On Mon, Dec 14, 2020 at 4:51 PM Amit Kapila  wrote:

> On Mon, Dec 14, 2020 at 9:30 AM Jammie  wrote:
> >
> > Hello,
> >
> > We have two logical replication slots in our postgresql database
> (version-11) instance and we are using pgJDBC to stream data from these two
> slots.
> >
>
> IIUC, you are using some out-of-core output plugin to stream the data?
> Are you using the walsender mechanism to decode the changes from the slots,
> or SQL APIs?
>
> > We are ensuring that we regularly send feedback and update the
> confirmed_flush_lsn (every 10 minutes) for both slots to the same
> position. However, from our data we have seen that the restart_lsn movement
> of the two is not in sync, and most of the time one of them lags too far
> behind, holding WAL files unnecessarily. Here are some data points to
> indicate the problem:
> >
> > Thu Dec 10 05:37:13 CET 2020
> >           slot_name           |  restart_lsn  | confirmed_flush_lsn
> > ------------------------------+---------------+---------------------
> >  db_dsn_metadata_src_private  | 48FB/F3000208 | 48FB/F3000208
> >  db_dsn_metadata_src_shared   | 48FB/F3000208 | 48FB/F3000208
> > (2 rows)
> >
> >
> >
> > Thu Dec 10 13:53:46 CET 2020
> >           slot_name          |  restart_lsn  | confirmed_flush_lsn
> > -----------------------------+---------------+---------------------
> >  db_dsn_metadata_src_private | 48FC/2309B150 | 48FC/233AA1D0
> >  db_dsn_metadata_src_shared  | 48FC/233AA1D0 | 48FC/233AA1D0
> > (2 rows)
> >
> >
> > Thu Dec 10 17:13:51 CET 2020
> >           slot_name          |  restart_lsn  | confirmed_flush_lsn
> > -----------------------------+---------------+---------------------
> >  db_dsn_metadata_src_private | 4900/B4C3AE8  | 4900/94FDF908
> >  db_dsn_metadata_src_shared  | 48FD/D2F66F10 | 4900/94FDF908
> > (2 rows)
> >
> > Though we are regularly using setFlushLsn() and forceStatusUpdate for both
> slots' streams, the "private" slot's restart_lsn is still far behind its
> confirmed_flush_lsn, and the "shared" slot is also behind its
> confirmed_flush_lsn, though not by as much. Since the restart_lsn is not
> moving fast enough, this causes a lot of issues with WAL file management,
> as the files cannot be deleted to free up disk space.
> >
>
> What is this setFlushLsn? I am not able to find it in the PG code. If it
> is a reference to some outside code then please provide a link to that code.
> In general, the restart_lsn and confirmed_flush_lsn are advanced in
> different ways so you might see some difference but it should not be
> this much. The confirmed_flush_lsn is updated when we get confirmation
> from the downstream node about the flush_lsn but restart_lsn is only
> incremented based on the LSN required by the oldest in-progress
> transaction.
>
> >
> > Please note that for the second slot we are not reading from the
> stream, rather just sending the feedback.
> >
>
> Here, does the second slot refer to 'shared' or 'private'? It is not
> very clear what you mean by "we are not doing reading from the
> stream"; do you mean to say that decoding happens on the slot but the
> output plugin just throws away the streamed data and in the end just
> sends the feedback?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Proposed patch for key managment

2020-12-14 Thread Michael Paquier
On Mon, Dec 14, 2020 at 10:19:02PM -0500, Bruce Momjian wrote:
> On Tue, Dec 15, 2020 at 10:05:49AM +0900, Michael Paquier wrote:
>> - The HMAC split is terrible, as mentioned upthread.  I think that you
>> would need to extract this stuff as a separate patch, and not add more
>> mess to the existing mess (SCRAM uses its own HMAC with SHA256, which
>> is already wrong).  We can and should have a fallback implementation,
>> because that's easy to do.  And we need to have the fallback
>> implementation depend on the contents of cryptohash.c (in a
>> src/common/hmac.c), while the OpenSSL portion requires a
>> hmac_openssl.c where you can choose the hash type based on
>> pg_cryptohash_type.  So ossl_HMAC_SHA512() does not do the right
>> thing.  APIs flexible enough to allow a new SSL library to plug into
>> this stuff are required.
>> - Not much a fan of the changes done in cryptohash.c for the resource
>> owners as well.  It also feels like this could be thought of as something
>> directly for resowner.c.
>> - The cipher split also should be done in its own patch, and reviewed
>> on its own.  There is a mix of dependencies between non-OpenSSL and
>> OpenSSL code paths, making the plugging of a new SSL library harder to
>> do.  In short, this requires proper API design, which is not something
>> we have here.  There are also no changes in pgcrypto for that stuff.
> 
> I am going to need someone to help me make these changes.  I don't feel
> I know enough about the crypto API to do it, and it will take me 1+ week
> to learn it.

I think that designing a correct set of APIs that can be plugged with
any SSL library is the correct move in the long term.  I have on my
agenda to clean up HMAC as SCRAM uses that with SHA256 and you would
use that with SHA512.  Daniel has mentioned that he has been touching
this area, and I also got a patch halfly done though pgcrypto needs
some extra thoughts.  So this is still WIP but you could reuse that
here.

> Uh, I got this code from pg_resetwal.c, which does something similar to
> pg_altercpass.

Yeah, that's actually the part where it is a bad idea to just copy
this pattern.  pg_resetwal should not do that in the long term in my
opinion, but nobody has come to clean up this stuff.  (- -;)

>> +#ifdef USE_OPENSSL
>> +   ctx = (PgCipherCtx *) palloc0(sizeof(PgCipherCtx));
>> +
>> +   ctx->encctx = ossl_cipher_ctx_create(cipher, key, klen, true);
>> +   ctx->decctx = ossl_cipher_ctx_create(cipher, key, klen, false);
>> +#endif
>> There is a cipher_openssl.c and a cipher.c that includes USE_OPENSSL.
>> This requires a cleaner split IMO.  We should avoid as much as
>> possible OpenSSL-specific code paths in files part of src/common/ when
>> not building with OpenSSL.  So this is now a mixed bag of
>> dependencies.
> 
> Again, I need help here.

My take would be to try to sort out the HMAC part first, then look at
the ciphers.  One step at a time.
--
Michael


signature.asc
Description: PGP signature


Re: Proposed patch for key managment

2020-12-14 Thread Bruce Momjian
On Tue, Dec 15, 2020 at 10:36:56AM +0800, Neil Chen wrote:
> Hi, Bruce  
> 
> I read your question and here are some of my thoughts.
> 
> 
>         Why is KmgrShmemData a struct, when it only has a single member? 
> Are
>         all shared memory areas structs?
> 
>  
> Yes, all areas created by ShmemInitStruct() are structs. I think the
> significance of using struct is that it delimits an "area"  to store the KMS
> related things, which makes our memory management more clear and unified.

OK, thanks, that helps.

>         What testing infrastructure should this have?
> 
>         There are a few shell scripts I should include to show how to
>         create commands.  Where should they be stored?  /contrib module?
> 
>  
> 
>         Are people okay with having the feature enabled, but invisible
>         since the docs and --help output are missing? When we enable
>         ssl_passphrase_command to prompt from the terminal, some of the
>         command-line options will be useful.
> 
>  
> Since our implementation is not in contrib, I don't think we should put the
> scripts there. Maybe we can refer to postgresql.conf.sample?

Uh, the scripts are 20-60 lines long --- I am attaching them to this
email.  Plus, when we allow user prompting for the SSL passphrase, we
will have another script, or maybe three more if people want to use a
Yubikey to unlock the SSL passphrase.

> Actually, I wonder whether we can add the UDK(user data encryption key) to the
> first version of KMS, which can be provided to plug-ins such as pgsodium. In
> this way, users can at least use it to a certain extent.

I don't think I want to get into that at this point.  It can be done
later.

> Also, I have tested some KMS functionalities by adding very simple TDE logic.
> In the meantime, I found some confusion in the following places:

OK, I know Cybertec has a TDE patch too.

> 1. Previously, we added a variable bootstrap_keys_wrap that is used for
> encryption during initdb. However, since we save the "wrapped" key, we need to
> use a global KEK that can be accessed in boot mode to unwrap it before use... 
> I
> don't know if that's good. To make it simple, I modified the
> bootstrap_keys_wrap to store the "unwrapped" key so that the encryption
> function can get it correctly. (The variable name should be changed
> accordingly).

I see what you are saying.  We store the wrapped key in bootstrap mode, but
the unwrapped key in normal mode.  There is also the case where we copy
the keys from an old cluster.  I will work on a patch tomorrow and
report back here.

> 2. I tried to add support for AES_CTR mode, and the code for encrypting buffer
> blocks. In the process I found that in pg_cipher_ctx_create, the key length is
> declared as "byte". However, in the CryptoKey structure, the length is stored
> as "bit", which leads me to use a form similar to Key->klen / 8 when I call
> this function. Maybe we should unify the two to avoid unnecessary confusion.

Agreed.  I will look at that too tomorrow.

> 3. This one is not a problem/bug. I have some doubts about the length of data
> encryption blocks. For the page, we do not encrypt the PageHeaderData, which
> is 192 bits long. By default, the page size is 8K (65536 bits), so the length
> of the data we want to encrypt is 65344 bits. This number is not divisible by
> 128, nor by the 192- or 256-bit key lengths. Will this cause a problem?

Since we are using CTR mode for everything, it should be fine.  We
encrypt WAL as it is written.

> Here is a simple patch that I wrote with references to Sawada's previous TDE
> works, hope it can help you.

Thanks.  I will work on the items you mentioned.  Can you look at the
Cybertec patch and then our wiki to see what needs to be done next?

https://wiki.postgresql.org/wiki/Transparent_Data_Encryption

Thanks for getting a proof-of-concept patch out there.  :-)

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee



pass_fd.sh
Description: Bourne shell script


pass_yubi_nopin.sh
Description: Bourne shell script


pass_yubi_pin.sh
Description: Bourne shell script


Re: a misbehavior of partition row movement (?)

2020-12-14 Thread Amit Langote
On Tue, Dec 15, 2020 at 12:43 PM Amit Langote  wrote:
> Quoting your original example:
>
> drop table a, b;
> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> alter table b add constraint a_fk foreign key (id) references a (id)
> on delete cascade;
> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> -- correctly errors out instead of silently performing the ON DELETE CASCADE
> update a set id=2;
> ERROR:  update or delete on table "a" violates foreign key constraint
> "a_fk" on table "b"
> DETAIL:  Key (id)=(1) is still referenced from table "b".
>
> select * from b;
>  id
> 
>   1
> (1 row)
>
> Changing the example to specify ON UPDATE CASCADE:
>
> drop table a, b;
> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> alter table b add constraint a_fk foreign key (id) references a (id)
> on delete cascade;

Oops, I copy-pasted the wrong block of text from my terminal.  I meant:

alter table b add constraint a_fk foreign key (id) references a (id)
on delete cascade on update cascade;

> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> -- correctly applies ON UPDATE CASCADE action
> update a set id=2;
> UPDATE 1
>
> select * from b;
>  id
> 
>   2
> (1 row)

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: a misbehavior of partition row movement (?)

2020-12-14 Thread Amit Langote
Hi,

On Tue, Dec 15, 2020 at 12:01 AM Arne Roland  wrote:
> thanks for looking into this. I haven't yet looked at your patch in detail, 
> yet I think we should address the general issue here. To me this doesn't seem 
> to be a RI-trigger issue, but a more general issue like I mentioned in the 
> pg-bugs thread 
> https://www.postgresql.org/message-id/b1bfc99296e34725900bcd9689be8...@index.de

Hmm, maybe you're reading too much into the implementation details of
the fix, because the patch does fix the issue that you mention in the
linked report:

Quoting your original example:

drop table a, b;
create table a (id serial, primary key (id)) partition by range (id);
create table b (id serial,  primary key (id)) partition by range (id);
alter table b add constraint a_fk foreign key (id) references a (id)
on delete cascade;
create table a1 partition of a for values from (1) to (2);
create table a2 partition of a for values from (2) to (3);
create table b1 partition of b for values from (1) to (2);
create table b2 partition of b for values from (2) to (3);
insert into a (id) values (1);
insert into b (id) values (1);

-- correctly errors out instead of silently performing the ON DELETE CASCADE
update a set id=2;
ERROR:  update or delete on table "a" violates foreign key constraint
"a_fk" on table "b"
DETAIL:  Key (id)=(1) is still referenced from table "b".

select * from b;
 id

  1
(1 row)

Changing the example to specify ON UPDATE CASCADE:

drop table a, b;
create table a (id serial, primary key (id)) partition by range (id);
create table b (id serial,  primary key (id)) partition by range (id);
alter table b add constraint a_fk foreign key (id) references a (id)
on delete cascade;
create table a1 partition of a for values from (1) to (2);
create table a2 partition of a for values from (2) to (3);
create table b1 partition of b for values from (1) to (2);
create table b2 partition of b for values from (2) to (3);
insert into a (id) values (1);
insert into b (id) values (1);

-- correctly applies ON UPDATE CASCADE action
update a set id=2;
UPDATE 1

select * from b;
 id

  2
(1 row)

What am I missing about what you think is the more general problem to be solved?

> While I like the idea of making fks work, I'd prefer a solution that fires 
> the appropriate row trigger for partitioned tables for non RI-cases as well.

Maybe we could do that, but I don't know of a known issue where the
root cause is our firing of a row trigger on the leaf partition
instead of on the root partitioned table.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Proposed patch for key managment

2020-12-14 Thread Bruce Momjian
On Tue, Dec 15, 2020 at 10:05:49AM +0900, Michael Paquier wrote:
> On Mon, Dec 14, 2020 at 06:06:15PM -0500, Bruce Momjian wrote:
> > I am getting close to applying these patches, probably this week.  The
> > patches are cumulative:
> > 
> > https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
> > 
> > https://github.com/bmomjian/postgres/compare/key...bmomjian:key-alter.diff
> 
> I strongly object to a commit based on the current state of the patch.
> Based on my lookup of the patches you are referring to, I see a couple
> of things that should be split up and refactored properly before
> thinking about the core part of the patch (FWIW, I don't have many
> thoughts to offer about the core part because I haven't really thought
> about it, but that does not prevent doing a correct refactoring).  Here
> are some notes:
> - The code lacks a lot of comments IMO.  Why is retrieve_passphrase()
> doing what it does?  It seems to me that pg_altercpass needs a large
> brushup.

Uh, pg_altercpass is a new file I wrote and almost every block has a
comment.  I added a few more, but can you be more specific?

> - There are no changes to src/tools/msvc/.  Seeing the diffs in
> src/common/, this stuff breaks Windows builds.

OK, done in patch at URL.

> - The HMAC split is terrible, as mentioned upthread.  I think that you
> would need to extract this stuff as a separate patch, and not add more
> mess to the existing mess (SCRAM uses its own HMAC with SHA256, which
> is already wrong).  We can and should have a fallback implementation,
> because that's easy to do.  And we need to have the fallback
> implementation depend on the contents of cryptohash.c (in a
> src/common/hmac.c), while the OpenSSL portion requires a
> hmac_openssl.c where you can choose the hash type based on
> pg_cryptohash_type.  So ossl_HMAC_SHA512() does not do the right
> thing.  APIs flexible enough to allow a new SSL library to plug into
> this stuff are required.
> - Not much a fan of the changes done in cryptohash.c for the resource
> owners as well.  It also feels like this could be thought as something
> directly for resowner.c.
> - The cipher split also should be done in its own patch, and reviewed
> on its own.  There is a mix of dependencies between non-OpenSSL and
> OpenSSL code paths, making the plugging of a new SSL library harder to
> do.  In short, this requires proper API design, which is not something
> we have here.  There are also no changes in pgcrypto for that stuff.

I am going to need someone to help me make these changes.  I don't feel
I know enough about the crypto API to do it, and it will take me 1+ week
to learn it.

> > I do have a few questions:
> 
> That looks like a lot of things to sort out as well.
> 
> > Can anyone test this on Windows, particularly -R handling?
> > 
> > What testing infrastructure should this have?
> 
> Using TAP tests would be adapted for those two points.

OK, I will try that.

> > There are a few shell scripts I should include to show how to create
> > commands.  Where should they be stored?  /contrib module?
> 
> Why are these needed?  Is it a matter of documentation?

I have posted some of the scripts.  They involve complex shell
scripting that I doubt the average user can do.  The scripts are for
prompting for a passphrase from the user's terminal, or using a
Yubikey, with or without typing a PIN.  I can put them in the docs and
then users can copy them into a file.  Is that the preferred method?

> > Are people okay with having the feature enabled, but invisible
> > since the docs and --help output are missing?  When we enable
> > ssl_passphrase_command to prompt from the terminal, some of the
> > command-line options will be useful.
> 
> You are suggesting to enable the feature by default, make it invisible
> to the users, and leave it undocumented?  Is there something I am missing
> here?

The point is that the command-line options will be active, but will not
be documented.  They will not do anything on their own unless you specify
the command-line option.  I can comment out the command-line options
so they are not active, but the code is going to look very messy.

> > Do people like the command-letter choices?
> > 
> > I called the alter passphrase utility pg_altercpass.  I could
> > have called it pg_clusterpass, but I wanted to highlight it is
> > only for changing the passphrase, not for creating them.
> 
> I think that you should attach such patches directly to the emails
> sent to pgsql-hackers; if those github links disappear for some
> reason, it would become impossible to see what has been
> discussed here.

Well, the patches are changing frequently.  I am attaching a version to
this email.

> +/*
> + * We have to use postgres.h not postgres_fe.h here, because there's so much
> + * backend-only stuff in the kmgr include files we need.  But we need a
> + * frontend-ish environment otherwise.  Hence this 

Re: postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-14 Thread Bharath Rupireddy
On Tue, Dec 15, 2020 at 8:25 AM Zhihong Yu  wrote:
> Is the following sequence possible ?
> In pgfdw_inval_callback():
>
> entry->invalidated = true;
> +   have_invalid_connections = true;
>
> At which time the loop in pgfdw_xact_callback() is already running (but past 
> the above entry).
> Then:
>
> +   /* We are done closing all the invalidated connections so reset. */
> +   have_invalid_connections = false;
>
> At which time, there is still at least one invalid connection but the global 
> flag is off.

It's not possible, as this backend-specific code doesn't run in
multiple threads. We cannot have pgfdw_inval_callback() and
pgfdw_xact_callback() running at the same time, so we are safe there.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Add Information during standby recovery conflicts

2020-12-14 Thread Kyotaro Horiguchi
At Tue, 15 Dec 2020 02:00:21 +0900, Fujii Masao  
wrote in 
> Thanks for the review! I'm thinking to wait half a day before
> committing the patch, just in case someone objects to it.

Sorry for coming in late.  I have looked only at the latest thread, so I
may be missing many things; please ignore this if it was already settled
in the past discussion.


It emits messages like the following:

[40509:startup] LOG:  recovery still waiting after 1021.431 ms: recovery 
conflict on lock
[40509:startup] DETAIL:  Conflicting processes: 41171, 41194.
[40509:startup] CONTEXT:  WAL redo at 0/3013158 for Standby/LOCK: xid 510 db 
13609 rel 16384 

AFAIK DETAIL usually shows ordinary sentences, so the first word is
capitalized and the sentence ends with a period. But this is not a
sentence, so the trailing period looks odd.  Searching the tree for
errdetail calls showed some anomalies.

src/backend/parser/parse_param.c 
errdetail("%s versus %s",
src/backend/jit/llvm/llvmjit_error.cpp   errdetail("while in 
LLVM")));
src/backend/replication/logical/tablesync.c  
errdetail("The error was: %s", res->err)));
src/backend/tcop/postgres.c errdetail("prepare: 
%s", pstmt->plansource->query_string);
src/backend/tcop/postgres.c errdetail("abort reason: recovery 
conflict");

and one similar occurrence:

src/backend/utils/adt/dbsize.c   
errdetail("Invalid size unit: \"%s\".", strptr),

I'm not sure, but it seems to me at least the period is unnecessary
here.


+   bool        maybe_log_conflict =
+   (standbyWaitStart != 0 && !logged_recovery_conflict);

odd indentation.


+   /* Also, set the timer if necessary */
+   if (logging_timer)
+   {
+   timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;
+   cnt++;
+   }

This doesn't consider spurious wakeups. I'm not sure they actually
happen, but we usually account for them.  That has been the case since
before this patch, but ProcWaitForSignal()'s comment says that:

>  * As this uses the generic process latch the caller has to be robust against
>  * unrelated wakeups: Always check that the desired state has occurred, and
>  * wait again if not.

If we don't care about spurious wakeups, we should add a comment saying so.


+   bool        maybe_log_conflict;
+   bool        maybe_update_title;

Although this may be a matter of taste, and I understand that the
"maybe" means "the logging and the change of the ps display may not
happen in this iteration", those variables seem to express,
respectively, "we should write a log message if the timeout for the
recovery conflict expires" and "we should update the title if 500ms have
elapsed". So it seems *to me* they would be better named just
"log_conflict" and "update_title".

I feel the same about "maybe_log_conflict" in ProcSleep().


+for recovery conflicts.  This is useful in determining if recovery
+conflicts prevent the recovery from applying WAL.

(I'm not confident about this.) Wouldn't the sentence be better in the
past or present continuous tense?


> BTW, attached is the POC patch that implements the idea discussed
> upthread;
> if log_recovery_conflict_waits is enabled, the startup process reports
> the log also after the recovery conflict was resolved and the startup
> process
> finished waiting for it. This patch needs to be applied after
> v11-0002-Log-the-standby-recovery-conflict-waits.patch is applied.

Ah. I was just writing a comment about that. I haven't looked at it
closely yet, but it looks good to me.  By the way, doesn't it contain a
simple fix of a comment for the base patch?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-14 Thread Zhihong Yu
Is the following sequence possible ?
In pgfdw_inval_callback():

entry->invalidated = true;
+   have_invalid_connections = true;

At which time the loop in pgfdw_xact_callback() is already running (but
past the above entry).
Then:

+   /* We are done closing all the invalidated connections so reset. */
+   have_invalid_connections = false;

At which time, there is still at least one invalid connection but the
global flag is off.

On Mon, Dec 14, 2020 at 6:39 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Hi,
>
> As discussed in [1], in postgres_fdw the cached connections to remote
> servers can stay until the lifetime of the local session without
> getting a chance to disconnect (connection leak), if the underlying
> user mapping or foreign server is dropped in another session. Here are a
> few scenarios of how this can happen:
>
> Use case 1:
> 1) Run a foreign query in session 1 with server 1 and user mapping 1
> 2) Drop user mapping 1 in another session 2, an invalidation message
> gets generated which will have to be processed by all the sessions
> 3) Run the foreign query again in session 1, at the start of txn the
> cached entry gets invalidated via pgfdw_inval_callback() (as part of
> invalidation message processing). Whatever the type of foreign
> query may be (select, update, explain, delete, insert, analyze, etc.), upon
> the next call to GetUserMapping() from postgres_fdw.c, the cache lookup
> fails (with ERROR:  user mapping not found for "") since the user
> mapping 1 has been dropped in session 2, and the query will also fail
> before reaching GetConnection(), where the connections associated with
> the invalidated entries would have been disconnected.
>
> So, the connection associated with invalidated entry would remain
> until the local session exits.
>
> Use case 2:
> 1) Run a foreign query in session 1 with server 1 and user mapping 1
> 2) Try to drop foreign server 1, then we would not be allowed because
> of dependency. Use CASCADE so that dependent objects i.e. user mapping
> 1 and foreign tables get dropped along with foreign server 1.
> 3) Run the foreign query again in session 1, at the start of txn, the
> cached entry gets invalidated via pgfdw_inval_callback() and the query
> fails because there is no foreign table.
>
> Note that the remote connection remains open in session 1 until the
> local session exits.
>
> To solve the above connection leak problem, it looks like the right
> place to close all the invalid connections is pgfdw_xact_callback(),
> once registered, which gets called at the end of every txn in the
> current session (by then all the sub-txns would also have been
> finished). Note that if there are too many invalidated entries, then
> the following txn has to close all of them, but that's better than
> having leaked connections, and it's a one-time job for that following
> txn.
>
> Attaching a patch for the same.
>
> Thoughts?
>
> [1] -
> https://www.postgresql.org/message-id/flat/CALj2ACUOQYs%2BssxkxRvZ6Ja5%2Bsfc6a-s_0e-Nv2A591hEyOgiw%40mail.gmail.com
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>


adding wait_start column to pg_locks

2020-12-14 Thread torikoshia

Hi,

When examining the duration of locks, we often join pg_locks
and pg_stat_activity and use columns such as query_start or
state_change.

However, since these columns hold the moment at which queries
started or their state changed, we cannot get the exact lock
duration this way.

So I'm now thinking about adding a new column in pg_locks which
keeps the time at which locks started waiting.

One problem with this idea would be the performance impact of
calling gettimeofday repeatedly.
To avoid it, I reused the result of the gettimeofday call made
when the deadlock_timeout timer is started, as suggested in the
previous discussion [1].

Attached a patch.

BTW, in this patch, for fast-path locks wait_start is set to
zero because it seems such a lock will not be waited for.
If my understanding is wrong, I would appreciate it if
someone could point it out.


Any thoughts?


[1] 
https://www.postgresql.org/message-id/28804.1407907184%40sss.pgh.pa.us


Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

From 1a6a7377877cc52e4b87a05bbb8ffae92cdb91ab Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 15 Dec 2020 10:55:32 +0900
Subject: [PATCH v1] Add wait_start field into pg_locks.

To examine the duration of locks, we did join on pg_locks and
pg_stat_activity and used columns such as query_start or state_change.
However, since they are the moment when queries have started or their
state has changed, we could not get the exact lock duration in this way.

This patch adds a new field preserving the time at which locks
started waiting.
---
 doc/src/sgml/catalogs.sgml  |  9 +
 src/backend/storage/lmgr/lock.c | 10 ++
 src/backend/storage/lmgr/proc.c |  2 ++
 src/backend/utils/adt/lockfuncs.c   |  9 -
 src/include/catalog/pg_proc.dat |  6 +++---
 src/include/storage/lock.h  |  3 +++
 src/test/regress/expected/rules.out |  5 +++--
 7 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 62711ee83f..19af0e9af4 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -10567,6 +10567,15 @@ SCRAM-SHA-256$:&l
lock table
   
  
+
+ 
+  
+   wait_start timestamptz
+  
+  
+   The time at which lock started waiting
+  
+ 
 

   
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f455..7b30508f95 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1196,6 +1196,7 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 		lock->waitMask = 0;
 		SHMQueueInit(&(lock->procLocks));
 		ProcQueueInit(&(lock->waitProcs));
+		lock->wait_start = 0;
 		lock->nRequested = 0;
 		lock->nGranted = 0;
 		MemSet(lock->requested, 0, sizeof(int) * MAX_LOCKMODES);
@@ -3628,6 +3629,12 @@ GetLockStatusData(void)
 			instance->leaderPid = proc->pid;
 			instance->fastpath = true;
 
+			/*
+			 * Successfully taking fast path lock means there was no
+			 * conflicting locks.
+			 */
+			instance->wait_start = 0;
+
 			el++;
 		}
 
@@ -3655,6 +3662,7 @@ GetLockStatusData(void)
 			instance->pid = proc->pid;
 			instance->leaderPid = proc->pid;
 			instance->fastpath = true;
+			instance->wait_start = 0;
 
 			el++;
 		}
@@ -3707,6 +3715,7 @@ GetLockStatusData(void)
 		instance->pid = proc->pid;
 		instance->leaderPid = proclock->groupLeader->pid;
 		instance->fastpath = false;
+		instance->wait_start = lock->wait_start;
 
 		el++;
 	}
@@ -4184,6 +4193,7 @@ lock_twophase_recover(TransactionId xid, uint16 info,
 		lock->waitMask = 0;
 		SHMQueueInit(&(lock->procLocks));
 		ProcQueueInit(&(lock->waitProcs));
+		lock->wait_start = 0;
 		lock->nRequested = 0;
 		lock->nGranted = 0;
 		MemSet(lock->requested, 0, sizeof(int) * MAX_LOCKMODES);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7dc3911590..f3702cc681 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1259,6 +1259,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		}
 		else
 			enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout);
+
+		lock->wait_start = get_timeout_start_time(DEADLOCK_TIMEOUT);
 	}
 
 	/*
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index f592292d06..5ee0953305 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -63,7 +63,7 @@ typedef struct
 } PG_Lock_Status;
 
 /* Number of columns in pg_locks output */
-#define NUM_LOCK_STATUS_COLUMNS		15
+#define NUM_LOCK_STATUS_COLUMNS		16
 
 /*
  * VXIDGetDatum - Construct a text representation of a VXID
@@ -142,6 +142,8 @@ pg_lock_status(PG_FUNCTION_ARGS)
 		   BOOLOID, -1, 0);
 		TupleDescInitEntry(tupdesc, (AttrNumber) 15, "fastpath",
 		   BOOLOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 16, "wait_start",
+		   TIMESTAMPTZOID, -1, 0);
 
 		funcctx->t

postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-14 Thread Bharath Rupireddy
Hi,

As discussed in [1], in postgres_fdw the cached connections to remote
servers can stay until the lifetime of the local session without
getting a chance to disconnect (connection leak), if the underlying
user mapping or foreign server is dropped in another session. Here are a
few scenarios of how this can happen:

Use case 1:
1) Run a foreign query in session 1 with server 1 and user mapping 1
2) Drop user mapping 1 in another session 2, an invalidation message
gets generated which will have to be processed by all the sessions
3) Run the foreign query again in session 1, at the start of txn the
cached entry gets invalidated via pgfdw_inval_callback() (as part of
invalidation message processing). Whatever the type of foreign
query may be (select, update, explain, delete, insert, analyze, etc.), upon
the next call to GetUserMapping() from postgres_fdw.c, the cache lookup
fails (with ERROR:  user mapping not found for "") since the user
mapping 1 has been dropped in session 2, and the query will also fail
before reaching GetConnection(), where the connections associated with
the invalidated entries would have been disconnected.

So, the connection associated with invalidated entry would remain
until the local session exits.

Use case 2:
1) Run a foreign query in session 1 with server 1 and user mapping 1
2) Try to drop foreign server 1, then we would not be allowed because
of dependency. Use CASCADE so that dependent objects i.e. user mapping
1 and foreign tables get dropped along with foreign server 1.
3) Run the foreign query again in session 1, at the start of txn, the
cached entry gets invalidated via pgfdw_inval_callback() and the query
fails because there is no foreign table.

Note that the remote connection remains open in session 1 until the
local session exits.

To solve the above connection leak problem, it looks like the right
place to close all the invalid connections is pgfdw_xact_callback(),
once registered, which gets called at the end of every txn in the
current session (by then all the sub-txns would also have been
finished). Note that if there are too many invalidated entries, then
the following txn has to close all of them, but that's better than
having leaked connections, and it's a one-time job for that following
txn.
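In rough, untested form, the cleanup at transaction end would look
something like this (the attached patch has the real version):

    /*
     * Close any connection whose cache entry was invalidated while the
     * cache is scanned at transaction end, then reset the flag.
     */
    hash_seq_init(&scan, ConnectionHash);
    while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)) != NULL)
    {
        if (entry->conn != NULL && entry->invalidated)
        {
            PQfinish(entry->conn);
            entry->conn = NULL;
        }
    }

    /* All invalidated connections have been closed at this point. */
    have_invalid_connections = false;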

Attaching a patch for the same.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/flat/CALj2ACUOQYs%2BssxkxRvZ6Ja5%2Bsfc6a-s_0e-Nv2A591hEyOgiw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 76ae4aad5c78ace5ab348c024e724669057e755f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 14 Dec 2020 22:39:46 +0530
Subject: [PATCH v1] postgres_fdw - cached connection leaks if the associated
 user mapping is dropped

In postgres_fdw the cached connections to remote servers can stay
until the lifetime of the local session, if the underlying user
mapping is dropped in another session.

To solve the above connection leak problem, it looks like the right
place to close all the invalid connections is pgfdw_xact_callback(),
once registered, which gets called at the end of every txn in the
current session(by then all the sub txns also would have been finished).
Note that if there are too many invalidated entries, then the
following txn has to bear running this extra disconnect code, but
that's okay than having leaked connections.
---
 contrib/postgres_fdw/connection.c | 31 ++-
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..e7ec914e48 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -73,6 +73,12 @@ static unsigned int prep_stmt_number = 0;
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
+/*
+ * Tracks whether there exists at least one invalid connection in the
+ * connection cache.
+ */
+static bool have_invalid_connections = false;
+
 /* prototypes of private functions */
 static void make_new_connection(ConnCacheEntry *entry, UserMapping *user);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
@@ -788,8 +794,15 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 	HASH_SEQ_STATUS scan;
 	ConnCacheEntry *entry;
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
+	/*
+	 * Quick exit if no connections were touched in this transaction and there
+	 * are no invalid connections in the cache. If there are any invalid
+	 * connections exists in the cache, this is the safe place to close all of
+	 * them so that they will not be leaked in case underlying user mappings or
+	 * foreign servers are dropped in other sessions because of which they
+	 * may not have a chance to get disconnected in GetConnection.
+	 */
+	if (!xact_got_connection && !have_invalid_connections)
 		return;
 
 	/*
@@ -943,12 +956,14 @@ pgfdw_xact_callback(XactEv

Re: Proposed patch for key managment

2020-12-14 Thread Neil Chen
Hi, Bruce

I read your question and here are some of my thoughts.

Why is KmgrShmemData a struct, when it only has a single member?
> Are
> all shared memory areas structs?
>

Yes, all areas created by ShmemInitStruct() are structs. I think the
significance of using struct is that it delimits an "area"  to store the
KMS related things, which makes our memory management more clear and
unified.

What testing infrastructure should this have?
>
> There are a few shell scripts I should include to show how to create
> commands.  Where should they be stored?  /contrib module?



Are people okay with having the feature enabled, but invisible
> since the docs and --help output are missing? When we enable
> ssl_passphrase_command to prompt from the terminal, some of the
> command-line options will be useful.


Since our implementation is not in contrib, I don't think we should put the
scripts there. Maybe we can refer to postgresql.conf.sample?
Actually, I wonder whether we can add the UDK(user data encryption key) to
the first version of KMS, which can be provided to plug-ins such as
pgsodium. In this way, users can at least use it to a certain extent.

Also, I have tested some KMS functionalities by adding very simple TDE
logic. In the meantime, I found some confusion in the following places:

1. Previously, we added a variable bootstrap_keys_wrap that is used for
encryption during initdb. However, since we save the "wrapped" key, we need
to use a global KEK that can be accessed in boot mode to unwrap it before
use... I don't know if that's good. To make it simple, I modified the
bootstrap_keys_wrap to store the "unwrapped" key so that the encryption
function can get it correctly. (The variable name should be changed
accordingly).

2. I tried to add support for AES_CTR mode, and the code for encrypting
buffer blocks. In the process I found that in pg_cipher_ctx_create, the key
length is declared as "byte". However, in the CryptoKey structure, the
length is stored as "bit", which leads me to use a form similar to
Key->klen / 8 when I call this function. Maybe we should unify the two to
avoid unnecessary confusion.

3. This one is not a problem/bug. I have some doubts about the length of
data encryption blocks. For the page, we do not encrypt the PageHeaderData,
which is 192 bits long. By default, the page size is 8K (65536 bits), so the
length of the data we want to encrypt is 65344 bits. This number is not
divisible by 128, nor by the 192- or 256-bit key lengths. Will this cause a
problem?
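To spell out the numbers I am worried about (assuming an 8 kB page and a
24-byte PageHeaderData, i.e. 192 bits):

    8192 bytes * 8     = 65536 bits per page
    65536 - 192        = 65344 bits to encrypt
    65344 / 128        = 510.5 AES blocks

so the encrypted region is not a whole number of 128-bit AES blocks, which
is why I wonder whether padding or a stream mode is needed here.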

Here is a simple patch that I wrote with references to Sawada's previous
TDE works, hope it can help you.

Thanks.
-- 
There is no royal road to learning.
HighGo Software Co.


encrypt_buffer.diff
Description: Binary data


Re: Autovacuum on partitioned table (autoanalyze)

2020-12-14 Thread yuzuko
Hello Alvaro,

On Thu, Dec 3, 2020 at 10:28 PM Alvaro Herrera  wrote:
>
> Hello Yuzuko,
>
> On 2020-Dec-02, yuzuko wrote:
>
> > The problem Horiguchi-san mentioned is as follows:
> > [explanation]
>
> Hmm, I see.  So the problem is that if some ancestor is analyzed first,
> then analyze of one of its partition will cause a redundant analyze of
> the ancestor, because the number of tuples that is propagated from the
> partition represents a set that had already been included in the
> ancestor's analysis.
>
> If the problem was just that, then I think it would be very simple to
> solve: just make sure to sort the tables to vacuum so that all leaves
> are vacuumed first, and then all ancestors, sorted from the bottom up.
> Problem solved.
>

Indeed.  When I discussed this with Horiguchi-san before, he mentioned
the same approach:
> So, to propagate the count properly, we need to analyze relations in
> leaf-to-root order, or propagate the counter only to ancestors that
> haven't been processed in the current iteration.  It seems a bit too
> complex to sort the relations to analyze in that order.

but we didn't choose it because of its complexity, as you also said.

> But I'm not sure that that's the whole story, for two reasons: one, two
> workers can run simultaneously, where one analyzes the partition and the
> other analyzes the ancestor.  Then the order is not guaranteed (and
> each process will get no effect from remembering whether it did that one
> or not).  Second, manual analyzes can occur in any order.
>
> Maybe it's more useful to think about this in terms of rememebering that
> partition P had changed_tuples set to N when we analyzed ancestor A.
> Then, when we analyze partition P, we send the message listing A as
> ancestor; on receipt of that message, we see M+N changed tuples in P,
> but we know that we had already seen N, so we only record M.
>
> I'm not sure how to implement this idea however, since on analyze of
> ancestor A we don't have the list of partitions, so we can't know the N
> for each partition.
>
I thought about it for a while, but I can't come up with a way to implement it.
Also, I think the other approach Horiguchi-san suggested in [1] would be
simpler for solving the problem we are facing.

Attached is a new patch based on his patch.  What do you think?

[1] 
https://www.postgresql.org/message-id/20201110.203557.1420746510378864931.horikyota.ntt%40gmail.com

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center


v12_autovacuum_on_partitioned_table.patch
Description: Binary data


Re: pg_shmem_allocations & documentation

2020-12-14 Thread Michael Paquier
On Mon, Dec 14, 2020 at 10:33:06AM +0100, Benoit Lobréau wrote:
>
>
> The offset at which the allocation starts. NULL for anonymous
> -   allocations and unused memory.
> +   allocations, since details related to them are not known.
>

Both of you seem to agree about having more details about that, which
is fine by me in the end.  Horiguchi-san, do you have more thoughts to
offer?  Benoit's version is similar to yours, just simpler.
--
Michael


signature.asc
Description: PGP signature


Re: Proposed patch for key managment

2020-12-14 Thread Michael Paquier
On Mon, Dec 14, 2020 at 06:06:15PM -0500, Bruce Momjian wrote:
> I am getting close to applying these patches, probably this week.  The
> patches are cumulative:
> 
>   https://github.com/postgres/postgres/compare/master...bmomjian:key.diff
>   
> https://github.com/bmomjian/postgres/compare/key...bmomjian:key-alter.diff

I strongly object to a commit based on the current state of the patch.
Based on my lookup of the patches you are referring to, I see a couple
of things that should be split up and refactored properly before
thinking about the core part of the patch (FWIW, I don't have many
thoughts to offer about the core part because I haven't really thought
about it, but that does not prevent doing a correct refactoring).  Here
are some notes:
- The code lacks a lot of comments IMO.  Why is retrieve_passphrase()
doing what it does?  It seems to me that pg_altercpass needs a large
brushup.
- There are no changes to src/tools/msvc/.  Seeing the diffs in
src/common/, this stuff breaks Windows builds.
- The HMAC split is terrible, as mentioned upthread.  I think that you
would need to extract this stuff as a separate patch, and not add more
mess to the existing mess (SCRAM uses its own HMAC with SHA256, which
is already wrong).  We can and should have a fallback implementation,
because that's easy to do.  And we need to have the fallback
implementation depend on the contents of cryptohash.c (in a
src/common/hmac.c), while the OpenSSL portion requires a
hmac_openssl.c where you can choose the hash type based on
pg_cryptohash_type.  So ossl_HMAC_SHA512() does not do the right
thing.  APIs flexible enough to allow a new SSL library to plug into
this stuff are required.
- Not much of a fan of the changes done in cryptohash.c for the resource
owners either.  It also feels like this could be thought of as something
that belongs directly in resowner.c.
- The cipher split also should be done in its own patch, and reviewed
on its own.  There is a mix of dependencies between non-OpenSSL and
OpenSSL code paths, making it harder to plug in a new SSL library.
In short, this requires proper API design, which is not something
we have here.  There are also no changes in pgcrypto for that stuff.

> I do have a few questions:

That looks like a lot of things to sort out as well.

>   Can anyone test this on Windows, particularly -R handling?
>   
>   What testing infrastructure should this have?

Using TAP tests would be adapted for those two points.

>   There are a few shell scripts I should include to show how to create
>   commands.  Where should they be stored?  /contrib module?

Why are these needed?  Is it a matter of documentation?

>   Are people okay with having the feature enabled, but invisible
>   since the docs and --help output are missing?  When we enable
>   ssl_passphrase_command to prompt from the terminal, some of the
>   command-line options will be useful.

Are you suggesting enabling the feature by default, making it invisible
to users, and leaving it undocumented?  Is there something I am missing
here?

>   Do people like the command-letter choices?
> 
>   I called the alter passphrase utility pg_altercpass.  I could
>   have called it pg_clusterpass, but I wanted to highlight it is
>   only for changing the passphrase, not for creating them.

I think that you should attach such patches directly to the emails
sent to pgsql-hackers.  If those GitHub links disappear for some
reason, it would become impossible to see what has been
discussed here.

+/*
+ * We have to use postgres.h not postgres_fe.h here, because there's so much
+ * backend-only stuff in the kmgr include files we need.  But we need a
+ * frontend-ish environment otherwise.  Hence this ugly hack.
+ */
+#define FRONTEND 1
+
+#include "postgres.h"
I would advise really understanding why this happens and splitting up
the backend and frontend parts cleanly.  This style ought to be avoided
as much as possible.

@@ -95,9 +101,9 @@ pg_cryptohash_create(pg_cryptohash_type type)

if (state->evpctx == NULL)
{
+#ifndef FRONTEND
explicit_bzero(state, sizeof(pg_cryptohash_state));
explicit_bzero(ctx, sizeof(pg_cryptohash_ctx));
-#ifndef FRONTEND
I think that this change is incorrect.  Any cleanup of memory should
be done for the frontend *and* the backend.

+#ifdef USE_OPENSSL
+   ctx = (PgCipherCtx *) palloc0(sizeof(PgCipherCtx));
+
+   ctx->encctx = ossl_cipher_ctx_create(cipher, key, klen, true);
+   ctx->decctx = ossl_cipher_ctx_create(cipher, key, klen, false);
+#endif
There is a cipher_openssl.c and a cipher.c that includes USE_OPENSSL.
This requires a cleaner split IMO.  We should avoid OpenSSL-specific
code paths as much as possible in files that are part of src/common/
when not building with OpenSSL.  So this is now a mixed bag of
dependencies.
--
Michael




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-14 Thread Justin Pryzby
On Sat, Dec 12, 2020 at 01:45:26PM -0600, Justin Pryzby wrote:
> On Sat, Dec 12, 2020 at 09:20:35AM +0100, Peter Eisentraut wrote:
> > On 2020-12-11 21:27, Alvaro Herrera wrote:
> > > By the way--  What did you think of the idea of explictly marking the
> > > types used for bitmasks using types bits32 and friends, instead of plain
> > > int, which is harder to spot?
> > 
> > If we want to make it clearer, why not turn the thing into a struct, as in
> > the attached patch, and avoid the bit fiddling altogether.
> 
> I like this.
> It's a lot like what I wrote as [PATCH v31 1/5] ExecReindex and ReindexParams
> In my v31 patch, I moved ReindexOptions to a private structure in indexcmds.c,
> with an "int options" bitmask which is passed to reindex_index() et al.  Your
> patch keeps/puts ReindexOptions in index.h, so it also applies to reindex_index,
> which I think is good.
> 
> So I've rebased this branch on your patch.
> 
> Some thoughts:
> 
>  - what about removing the REINDEXOPT_* prefix ?
>  - You created local vars with initialization like "={}". But I thought it's
>    needed to include at least one struct member like "={false}", or else
>    they're not guaranteed to be zeroed?
>  - You passed the structure across function calls.  The usual convention is to
>    pass a pointer.

I think maybe Michael missed this message (?)
I had applied some changes on top of Peter's patch.

I squashed those commits now, and also handled ClusterOption and VacuumOption
in the same style.

Some more thoughts:
 - should the structures be named in the plural?  "ReindexOptions" etc., since they
   define *all* the options, not just a single bit.
 - For vacuum, do we even need a separate structure, or should the members be
   directly within VacuumParams?  It's a bit odd to write
   params.options.verbose, especially since there are also ternary options which
   are directly within params.
 - Then, for cluster, I think it should be called ClusterParams, and eventually
   include the tablespaceOid, like what we're doing for Reindex.

I am awaiting feedback on these before going further, since I've done too much
rebasing with these ideas going back and forth.
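
As an aside on the "={}" question quoted above, here is a tiny standalone
sketch (the struct and its members are invented, nothing to do with the
attached patches) showing why "={0}" is the safe spelling: in standard C it
zero-initializes every member, while the empty "{}" initializer is a GNU
extension before C23.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical options struct, loosely modeled on the discussion above */
typedef struct DemoOptions
{
    bool    verbose;
    bool    missing_ok;
    int     tablespace;
} DemoOptions;

int
main(void)
{
    DemoOptions a = {0};                /* zero-initializes every member */
    DemoOptions b = {.verbose = true};  /* omitted members are zeroed too */

    printf("a: %d %d %d\n", a.verbose, a.missing_ok, a.tablespace);
    printf("b: %d %d %d\n", b.verbose, b.missing_ok, b.tablespace);
    return 0;
}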

-- 
Justin
>From 34bf8abaca50d39cb9fd00b21752f931b05a62f3 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 12 Dec 2020 09:17:55 +0100
Subject: [PATCH 1/4] Convert reindex options to struct

---
 src/backend/catalog/index.c  | 24 
 src/backend/commands/cluster.c   |  3 +-
 src/backend/commands/indexcmds.c | 96 
 src/backend/commands/tablecmds.c |  4 +-
 src/backend/tcop/utility.c   | 10 ++--
 src/include/catalog/index.h  | 16 +++---
 src/include/commands/defrem.h|  9 +--
 7 files changed, 83 insertions(+), 79 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 731610c701..da2f45b796 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3594,7 +3594,7 @@ IndexGetRelation(Oid indexId, bool missing_ok)
  */
 void
 reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
-			  int options)
+			  ReindexOptions *options)
 {
 	Relation	iRel,
 heapRelation;
@@ -3602,7 +3602,6 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	IndexInfo  *indexInfo;
 	volatile bool skipped_constraint = false;
 	PGRUsage	ru0;
-	bool		progress = (options & REINDEXOPT_REPORT_PROGRESS) != 0;
 
 	pg_rusage_init(&ru0);
 
@@ -3611,12 +3610,12 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	 * we only need to be sure no schema or data changes are going on.
 	 */
 	heapId = IndexGetRelation(indexId,
-			  (options & REINDEXOPT_MISSING_OK) != 0);
+			  options->REINDEXOPT_MISSING_OK);
 	/* if relation is missing, leave */
 	if (!OidIsValid(heapId))
 		return;
 
-	if ((options & REINDEXOPT_MISSING_OK) != 0)
+	if (options->REINDEXOPT_MISSING_OK)
 		heapRelation = try_table_open(heapId, ShareLock);
 	else
 		heapRelation = table_open(heapId, ShareLock);
@@ -3625,7 +3624,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	if (!heapRelation)
 		return;
 
-	if (progress)
+	if (options->REINDEXOPT_REPORT_PROGRESS)
 	{
 		pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
 	  heapId);
@@ -3641,7 +3640,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	 */
 	iRel = index_open(indexId, AccessExclusiveLock);
 
-	if (progress)
+	if (options->REINDEXOPT_REPORT_PROGRESS)
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_ACCESS_METHOD_OID,
 	 iRel->rd_rel->relam);
 
@@ -3792,14 +3791,14 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	}
 
 	/* Log what we did */
-	if (options & REINDEXOPT_VERBOSE)
+	if (options->REINDEXOPT_VERBOSE)
 		ereport(INFO,
 (errmsg("index \"%s\" was reindexed",
 		get_rel_name(indexId)),
  errdetail_internal("%s",
 	pg_rusage_show(&ru0;
 
-	

Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

2020-12-14 Thread Tom Lane
Here's a rolled-up patch that does some further documentation work
and gets rid of the unnecessary memset's as well.

regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..651227f510 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2607,7 +2607,8 @@ createConnHash(void)
 	ctl.keysize = NAMEDATALEN;
 	ctl.entrysize = sizeof(remoteConnHashEnt);
 
-	return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+	return hash_create("Remote Con hash", NUMCONN, &ctl,
+	   HASH_ELEM | HASH_STRINGS);
 }
 
 static void
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 70cfdb2c9d..2f00344b7f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -567,7 +567,6 @@ pgss_shmem_startup(void)
 		pgss->stats.dealloc = 0;
 	}
 
-	memset(&info, 0, sizeof(info));
 	info.keysize = sizeof(pgssHashKey);
 	info.entrysize = sizeof(pgssEntry);
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..66581e5414 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -119,14 +119,11 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	{
 		HASHCTL		ctl;
 
-		MemSet(&ctl, 0, sizeof(ctl));
 		ctl.keysize = sizeof(ConnCacheKey);
 		ctl.entrysize = sizeof(ConnCacheEntry);
-		/* allocate ConnectionHash in the cache context */
-		ctl.hcxt = CacheMemoryContext;
 		ConnectionHash = hash_create("postgres_fdw connections", 8,
 	 &ctl,
-	 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	 HASH_ELEM | HASH_BLOBS);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
diff --git a/contrib/postgres_fdw/shippable.c b/contrib/postgres_fdw/shippable.c
index 3433c19712..b4766dc5ff 100644
--- a/contrib/postgres_fdw/shippable.c
+++ b/contrib/postgres_fdw/shippable.c
@@ -93,7 +93,6 @@ InitializeShippableCache(void)
 	HASHCTL		ctl;
 
 	/* Create the hash table. */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(ShippableCacheKey);
 	ctl.entrysize = sizeof(ShippableCacheEntry);
 	ShippableCacheHash =
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..e9a9741154 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -714,7 +714,6 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	MemoryContext SPIcontext;
 
 	/* initialize the category hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
 	ctl.keysize = MAX_CATNAME_LEN;
 	ctl.entrysize = sizeof(crosstab_HashEnt);
 	ctl.hcxt = per_query_ctx;
@@ -726,7 +725,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	crosstab_hash = hash_create("crosstab hash",
 INIT_CATS,
 &ctl,
-HASH_ELEM | HASH_CONTEXT);
+HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 	/* Connect to SPI manager */
 	if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4ad67c88b4..217c199a14 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -76,7 +76,6 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
 	 * nodeBuffersTab hash is association between index blocks and it's
 	 * buffers.
 	 */
-	memset(&hashCtl, 0, sizeof(hashCtl));
 	hashCtl.keysize = sizeof(BlockNumber);
 	hashCtl.entrysize = sizeof(GISTNodeBuffer);
 	hashCtl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494..c77a189907 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1363,7 +1363,6 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	bool		found;
 
 	/* Initialize hash tables used to track TIDs */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(ItemPointerData);
 	hash_ctl.entrysize = sizeof(ItemPointerData);
 	hash_ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 39e33763df..65942cc428 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -266,7 +266,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cxt = rw_cxt;
 
 	/* Initialize hash tables used to track update chains */
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
 	hash_ctl.hcxt = state->rs_cxt;
@@ -824,7 +823,6 @@ logical_begin_heap_rewrite(RewriteState state)
 	state->rs_begin_lsn = GetXLogInsertRecPtr();
 	state->rs_num_rewrite_mappings = 0;
 
-	memset(&hash_ctl, 0, sizeof(hash_ctl));
 	hash_ctl

Re: Proposed patch for key managment

2020-12-14 Thread Bruce Momjian
On Wed, Dec  2, 2020 at 04:38:14PM -0500, Bruce Momjian wrote:
> Attached is a patch for key management, which will eventually be part of
> cluster file encryption (CFE), called TDE (Transparent Data Encryption)
> by Oracle.  It is an update of Masahiko Sawada's patch from July 31:
> 
>   
> https://www.postgresql.org/message-id/ca+fd4k6rjwnvztro3q2f5hsdd8hgyuc4cuy9u3e6ran4c6t...@mail.gmail.com
> 
> Sawada-san did all the hard work, and I just redirected the patch.  The
> general outline of this CFE feature can be seen here:
> 
>   https://wiki.postgresql.org/wiki/Transparent_Data_Encryption
> 
> The currently planned progression for this feature is to allow secure
> retrieval of key encryption keys (KEK) outside of the database, then use
> those to encrypt data keys that encrypt heap/index/tmpfile files.
...
> If most people approve of this general approach, and the design
> decisions made, I would like to apply this in the next few weeks, but
> this brings complications.  The syntax added by this commit might not
> provide a useful feature until PG 15, so how do we hide it from users. 
> I was thinking of not applying the doc changes (or commenting them out)
> and commenting out the --help output.

I am getting close to applying these patches, probably this week.  The
patches are cumulative:

https://github.com/postgres/postgres/compare/master...bmomjian:key.diff

https://github.com/bmomjian/postgres/compare/key...bmomjian:key-alter.diff

I do have a few questions:

Why is KmgrShmemData a struct, when it only has a single member?  Are
all shared memory areas structs?

Should pg_altercpass be using fsync's for directory renames?

Can anyone test this on Windows, particularly -R handling?

What testing infrastructure should this have?

There are a few shell scripts I should include to show how to create
commands.  Where should they be stored?  /contrib module?

Are people okay with having the feature enabled, but invisible
since the docs and --help output are missing?  When we enable
ssl_passphrase_command to prompt from the terminal, some of the
command-line options will be useful.

Do people like the command-letter choices?

I called the alter passphrase utility pg_altercpass.  I could
have called it pg_clusterpass, but I wanted to highlight it is
only for changing the passphrase, not for creating them.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

 




Re: Optimizing the documentation

2020-12-14 Thread Joshua Drake
>
>
>
> > Queries can also access multiple tables at once, or access the same table
> > in a way that multiple rows are processed. A query that accesses multiple
> > rows of the same or different tables at one time is a join. For example,
> if
> > you wish to list all of the weather records with the location of the
> > associated city, we would compare the city column of each row of the
> weather
> > table with the name column of all rows in the cities table, and select
> the
> > rows *WHERE* the values match.
>
> TBH, I'm not sure that that is an improvement at all.  I'm constantly
> reminded that for many of our users, English is not their first language.
> A little bit of redundancy in wording is often helpful for them.
>

Interesting point; it is certainly true that many of our users are ESL
folks. I would expect a succinct version to be easier to understand, but I
have no idea.


>
> The places where I think the docs need help tend to be places where
> assorted people have added information over time, such that there's
> not a consistent style throughout a section; or maybe the information
> could be presented in a better order.  We don't need to be taking a
> hacksaw to text that's perfectly clear as it stands.
>

The term "perfectly clear" is part of the problem I am trying to address. I
can pick and pull at the documentation all day long and show things that
are not perfectly clear. They are clear to you, to me, and I imagine to most
of the readers on this list. Generally speaking, we are not the target
audience of the documentation, and we may easily settle for "good enough" when
in reality it could be so much better. I have gotten so used to our
documentation that I literally skip over unneeded words to get to the
answer I am looking for. I don't think that is the target we want to hit.

Wouldn't we want the reader to spend as little mental energy as possible to
understand the concept? Every word that isn't needed, every extra adjective,
repeated term, or "very unique" is extra energy spent understanding what the
writer is trying to say. That mental energy can be exhausted quickly,
especially when considering dense technical topics.



> (If I were thinking of rewriting this text, I'd probably think of
> removing the references to self-joins and covering that topic
> in a separate para.  But that's because self-joins aren't basic
> usage, not because I think the text is unreadable.)
>

That makes sense. I was just taking the direct approach of making existing
content better as an example. I would agree with your assessment if it were
to be submitted as a patch.


> > The reason I bolded and capitalized WHERE was to provide a visual signal
> to
> > the example that is on the page.
>
> IMO, typographical tricks are not something to lean on heavily.
>

Fair enough.

JD


Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

2020-12-14 Thread Tom Lane
I wrote:
> There are a couple of other API oddities that maybe we should think
> about while we're here:

> * Should we just have a blanket insistence that all callers supply
> HASH_ELEM?  The default sizes that dynahash.c uses without that are
> undocumented and basically useless.  We're already asserting that
> in the HASH_BLOBS path, which is the majority use-case, and this
> patch now asserts it for HASH_STRINGS too.

Here's a follow-up patch for that part, which also tries to respond
a bit to Heikki's complaint about skimpy documentation.  While at it,
I const-ified the HASHCTL argument, since there's no need for
hash_create to modify that.
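
For readers following along, a minimal caller under these stricter rules
might look roughly like the sketch below; the entry struct and table name
are invented for illustration, and a binary-keyed table simply keeps using
HASH_BLOBS.

/*
 * Hypothetical HASH_BLOBS caller: HASH_ELEM is now required, keysize and
 * entrysize must be set, and no other HASHCTL fields (and no memset of
 * the struct) are needed.  OidEntry is made up for the example.
 */
typedef struct OidEntry
{
    Oid         relid;          /* hash key */
    int         refcount;
} OidEntry;

static HTAB *oid_hash = NULL;

static void
init_oid_hash(void)
{
    HASHCTL     ctl;

    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(OidEntry);

    oid_hash = hash_create("oid demo hash", 256, &ctl,
                           HASH_ELEM | HASH_BLOBS);
}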

regards, tom lane

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 07cae638df..49f21b77bb 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -317,11 +317,20 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *	*info: additional table parameters, as indicated by flags
  *	flags: bitmask indicating which parameters to take from *info
  *
- * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * The flags value *must* include HASH_ELEM.  (Formerly, this was nominally
+ * optional, but the default keysize and entrysize values were useless.)
+ * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS,
  * or HASH_FUNCTION, to define the key hashing semantics (C strings,
  * binary blobs, or custom, respectively).  Callers specifying a custom
  * hash function will likely also want to use HASH_COMPARE, and perhaps
  * also HASH_KEYCOPY, to control key comparison and copying.
+ * Another often-used flag is HASH_CONTEXT, to allocate the hash table
+ * under info->hcxt rather than under TopMemoryContext; the default
+ * behavior is only suitable for session-lifespan hash tables.
+ * Other flags bits are special-purpose and seldom used.
+ *
+ * Fields in *info are read only when the associated flags bit is set.
+ * It is not necessary to initialize other fields of *info.
  *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
@@ -330,11 +339,19 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * large nelem will penalize hash_seq_search speed without buying much.
  */
 HTAB *
-hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
+hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR*hctl;
 
+	/*
+	 * Hash tables now allocate space for key and data, but you have to say
+	 * how much space to allocate.
+	 */
+	Assert(flags & HASH_ELEM);
+	Assert(info->keysize > 0);
+	Assert(info->entrysize >= info->keysize);
+
 	/*
 	 * For shared hash tables, we have a local hash header (HTAB struct) that
 	 * we allocate in TopMemoryContext; all else is in shared memory.
@@ -385,7 +402,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 	{
 		Assert(!(flags & HASH_STRINGS));
 		/* We can optimize hashing for common key sizes */
-		Assert(flags & HASH_ELEM);
 		if (info->keysize == sizeof(uint32))
 			hashp->hash = uint32_hash;
 		else
@@ -402,7 +418,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		 * it's almost certainly an integer or pointer not a string.
 		 */
 		Assert(flags & HASH_STRINGS);
-		Assert(flags & HASH_ELEM);
 		Assert(info->keysize > 8);
 
 		hashp->hash = string_hash;
@@ -529,16 +544,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		hctl->dsize = info->dsize;
 	}
 
-	/*
-	 * hash table now allocates space for key and data but you have to say how
-	 * much space to allocate
-	 */
-	if (flags & HASH_ELEM)
-	{
-		Assert(info->entrysize >= info->keysize);
-		hctl->keysize = info->keysize;
-		hctl->entrysize = info->entrysize;
-	}
+	/* remember the entry sizes, too */
+	hctl->keysize = info->keysize;
+	hctl->entrysize = info->entrysize;
 
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
@@ -617,10 +625,6 @@ hdefault(HTAB *hashp)
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
 
-	/* rather pointless defaults for key & entry size */
-	hctl->keysize = sizeof(char *);
-	hctl->entrysize = 2 * sizeof(char *);
-
 	hctl->num_partitions = 0;	/* not partitioned */
 
 	/* table has no fixed maximum size */
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 666ad33567..c3daaae92b 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -124,7 +124,7 @@ typedef struct
  * one of these.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
-		 HASHCTL *info, int flags);
+		 const HASHCTL *info, int flags);
 extern void hash_destroy(HTAB *hashp);
 extern void hash_stats(const char *where, HTAB *hashp);
 extern void *hash_search(HTAB *hashp, const void *key

Re: A few new options for CHECKPOINT

2020-12-14 Thread Bossart, Nathan
On 12/5/20, 9:11 AM, "Stephen Frost"  wrote:
> * Bossart, Nathan (bossa...@amazon.com) wrote:
>> On 12/5/20, 6:41 AM, "Stephen Frost"  wrote:
>> > Assuming we actually want to do this, which I still generally don't
>> > agree with since it isn't really clear if it'll actually end up doing
>> > something, or not, wouldn't it make more sense to have a command that
>> > just sits and waits for the currently running (or next) checkpoint to
>> > complete..?  For the use-case that was explained, at least, we don't
>> > actually need to cause another checkpoint to happen, we just want to
>> > know when a checkpoint has completed, right?
>> 
>> If it's enough to just wait for the current checkpoint to complete or
>> to wait for the next one to complete, I suppose you could just poll
>> pg_control_checkpoint().  I think the only downside is that you could
>> end up sitting idle for a while, especially if checkpoint_timeout is
>> high and checkpoint_completion_target is low.  But, as you point out,
>> that may not be a typically recommended way to configure the system.
>
> Maybe I missed something, but aren't you going to be waiting a while
> with this patch given that it's asking for a spread checkpoint too..?
>
> I agree that you could just monitor for the next checkpoint using
> pg_control_checkpoint(), which is why I'm wondering again what the
> point is behind this patch...  I'm trying to understand why we'd be
> encouraging people to increase the number of checkpoints that are
> happening when they're still going to be waiting around for that spread
> (in other words, non-immediate) checkpoint to happen (just as if they'd
> just waited until the next regular checkpoint), and they're still going
> to have a fair bit of WAL to replay because it'll be however much WAL
> has been written since we started the spread/non-immediate checkpoint.

I plan to mark this patch as withdrawn after the next commitfest
unless anyone objects.

Nathan



Re: Optimizing the documentation

2020-12-14 Thread Tom Lane
Joshua Drake  writes:
> Certainly and I didn't want to just start dumping patches. Part of this is
> just style, for example:

> Thus far, our queries have only accessed one table at a time. Queries can
> access multiple tables at once, or access the same table in such a way that
> multiple rows of the table are being processed at the same time. A query
> that accesses multiple rows of the same or different tables at one time is
> called a join query. As an example, say you wish to list all the weather
> records together with the location of the associated city. To do that, we
> need to compare the city column of each row of the weather table with the
> name column of all rows in the cities table, and select the pairs of rows
> where these values match.

> It isn't "terrible" but can definitely be optimized. In a quick review, I
> would put it something like this:

> Queries can also access multiple tables at once, or access the same table
> in a way that multiple rows are processed. A query that accesses multiple
> rows of the same or different tables at one time is a join. For example, if
> you wish to list all of the weather records with the location of the
> associated city, we would compare the city column of each row of the weather
> table with the name column of all rows in the cities table, and select the
> rows *WHERE* the values match.

TBH, I'm not sure that that is an improvement at all.  I'm constantly
reminded that for many of our users, English is not their first language.
A little bit of redundancy in wording is often helpful for them.

The places where I think the docs need help tend to be places where
assorted people have added information over time, such that there's
not a consistent style throughout a section; or maybe the information
could be presented in a better order.  We don't need to be taking a
hacksaw to text that's perfectly clear as it stands.

(If I were thinking of rewriting this text, I'd probably think of
removing the references to self-joins and covering that topic
in a separate para.  But that's because self-joins aren't basic
usage, not because I think the text is unreadable.)

> The reason I bolded and capitalized WHERE was to provide a visual signal to
> the example that is on the page.

IMO, typographical tricks are not something to lean on heavily.

regards, tom lane




Re: Optimizing the documentation

2020-12-14 Thread Peter Geoghegan
On Mon, Dec 14, 2020 at 12:50 PM Tom Lane  wrote:
>  Most of the docs contain pretty dense technical
> material that's not going to be improved by making it even denser.

It's always hard to write dense technical prose, for a variety of
reasons. I often struggle with framing. For example I seem to write
sentences that sound indecisive. But is that necessarily a bad thing?
It seems wise to hedge a little bit when talking about (say) some kind
of complex system with many moving parts. Ernest Hemingway never had
to describe how VACUUM works.

I agree with Heikki to some degree; there is value in trying to follow
a style guide. But let's not forget about the other problem with the
docs, which is that there aren't enough low-level technical details of
the kind that advanced users value. There is a clear unmet demand for
that IME. If we're going to push in the direction of simplification,
it should not make this other important task harder.

-- 
Peter Geoghegan




Re: Optimizing the documentation

2020-12-14 Thread Joshua Drake
>
>
>
> In short, the devil's in the details.  Maybe there are lots of
> places where this type of approach would help, but I think it's
> going to be a case-by-case discussion not something where there's
> a clear win overall.
>

Certainly and I didn't want to just start dumping patches. Part of this is
just style, for example:

Thus far, our queries have only accessed one table at a time. Queries can
access multiple tables at once, or access the same table in such a way that
multiple rows of the table are being processed at the same time. A query
that accesses multiple rows of the same or different tables at one time is
called a join query. As an example, say you wish to list all the weather
records together with the location of the associated city. To do that, we
need to compare the city column of each row of the weather table with the
name column of all rows in the cities table, and select the pairs of rows
where these values match.

It isn't "terrible" but can definitely be optimized. In a quick review, I
would put it something like this:

Queries can also access multiple tables at once, or access the same table
in a way that multiple rows are processed. A query that accesses multiple
rows of the same or different tables at one time is a join. For example, if
you wish to list all of the weather records with the location of the
associated city, we would compare the city column of each row of the weather
table with the name column of all rows in the cities table, and select the
rows *WHERE* the values match.

The reason I bolded and capitalized WHERE was to provide a visual signal to
the example that is on the page. I could also argue that we could remove
"For example," though I understand its purpose here.

Again, this was just a quick review.

JD


Re: Optimizing the documentation

2020-12-14 Thread David G. Johnston
On Mon, Dec 14, 2020 at 1:40 PM Joshua Drake  wrote:

> For example, what would be exceedingly helpful would be a documentation style
> guide that is canonical and we can review documentation against.
>
I do agree with that premise, with the goal of getting more people to
contribute to writing and reviewing documentation and having more than
vague ideas about what is or isn't considered minor items to just leave
alone or points of interest to debate.  But as much as I would love
perfectly written English documentation, I consciously make an effort
to accept things that aren't perfect but are good enough, in the
interest of having a larger set of contributors with more varied abilities
in this area.  "It is clear enough" is a valid trade-off to take.


> Yes. https://www.postgresql.org/docs/13/preface.html
>

Thanks, though it was meant to be a bit rhetorical.


>
>

>> While I don't think making readability changes is a bad thing, and maybe
>> my perspective is a bit biased and negative right now, but the attention
>> given to the existing documentation patches in the commitfest isn't that
>> great - so adding another mass of patches fixing up items that haven't
>> provoked complaints seems likely to just make the list longer.
>>
>
> One of the issues is that editing documentation with patches is a pain. It
> is simpler and a lower barrier of effort to pull up an existing section of
> Docbook and edit that (just like code) than it is to break out specific
> text within a patch. Though I would be happy to take a swipe at reviewing a
> specific documentation patch (as you linked).
>

I'm not following this line of reasoning.


>
>>
>> In short, I don't think optimization should be a goal in its own right;
>> but rather changes should mostly be driven by questions asked by our
>> users.  I don't think reading random chapters of the documentation to find
>> non-optimal exposition is going to be a good use of time.
>>
>
> I wasn't planning on reading random chapters. I was planning on walking
> through the documentation as it is written and hopefully others would join.
> This is a monumental effort to perform completely. Also consider the
> overall benefit, not just one specific piece. Would you not consider it a
> net win if certain questions were being answered in a succinct way as to
> allow users to use the documentation instead of asking the most novice of
> questions on various channels?
>

I suspect over half of the questions asked are due to not reading the
documentation at all - I tend to get good results when I point someone to
the correct terminology and section, and if there are follow-up questions
then I know where to look for improvements and have a concrete question or
two in hand to ensure that the revised documentation answers them.

I'm fairly well plugged into user questions and have recently made an
attempt to respond to those with specific patches to improve the
documentation involved in those questions.  I have also been working to
help other documentation patches get pushed through.  Based upon those
experiences, I think this monumental community effort is going to stall out
pretty quickly - regardless of its merits - though if the effort results in
a new guidelines document then I would say it was worth the effort
regardless of how many paragraphs are optimized away.

My $0.02

David J.


Re: Optimizing the documentation

2020-12-14 Thread Tom Lane
Heikki Linnakangas  writes:
> On 14/12/2020 21:50, Joshua Drake wrote:
>> Issues that are resolved with the optimized text:
>> 
>> * Succinct text is more likely to be read than skimmed
>> 
>> * Removal of extraneous mentions of PostgreSQL
>> 
>> * Removal of unneeded justifications
>> 
>> * Joining of two paragraphs into one that provides only the needed
>> information to the user
>> 
>> * Word count decreased by over 50%. As changes such as these are
>> adopted it would make the documentation more consumable.

> I agree with these goals in general. I like to refer to 
> http://www.plainenglish.co.uk/how-to-write-in-plain-english.html when 
> writing documentation. Or anything else, really.

I think this particular chunk of text is an outlier.  (Not unreasonably
so; as Heikki notes, it's customary for the very beginning of a book to
be a bit more formal.)  Most of the docs contain pretty dense technical
material that's not going to be improved by making it even denser.
Also, to the extent that there's duplication, it's often deliberate.
For example, if a given bit of info appears in the tutorial and the
main docs and the reference pages, that doesn't mean we should rip
out two of the three appearances.

There certainly are sections that are crying out for reorganization,
but that's going to be very topic-specific and not something that
just going into it with a copy-editing mindset will help.

In short, the devil's in the details.  Maybe there are lots of
places where this type of approach would help, but I think it's
going to be a case-by-case discussion not something where there's
a clear win overall.

regards, tom lane




Re: Optimizing the documentation

2020-12-14 Thread Joshua Drake
>
>
>
> > This is the official PostgreSQL documentation. It is written by the
> > PostgreSQL community in parallel with the development of the software.
> > We have organized it by the type of user and their stages of experience:
>
> Some thoughts on this example:
>
> - Changing "has been" to "is" changes the tone here. "Is" implies that
> it is being written continuously, whereas "has been" implies that it's
> finished. We do update the docs continuously, but point of the sentence
> is that the docs were developed together with the features, so "has
> been" seems more accurate.
>

No argument.


>
> - I like "PostgreSQL developers and other volunteers" better than the
> "PostgreSQL community". This is the very first introduction to
> PostgreSQL, so we can't expect the reader to know what the "PostgreSQL
> community" is. I like the "volunteers" word here a lot.
>
>
There is a huge community around PostgreSQL; the developers are only a
small (albeit critical) part of it. By using the term "PostgreSQL
community" we acknowledge all those who participate in the
success of the project. I could definitely see saying "PostgreSQL
volunteers".



> - I think a little bit of ceremony is actually OK in this particular
> paragraph, since it's the very first one in the docs.
>
> - I agree with dropping the "to make the large amount of information
> manageable".
>
> So I would largely keep this example unchanged, changing it into:
>
> ---
> This book is the official documentation of PostgreSQL. It has been
> written by the PostgreSQL developers and other volunteers in parallel to
> the development of the PostgreSQL software. It describes all the
> functionality that the current version of PostgreSQL officially supports.
>
> This book has been organized in several parts. Each part is targeted at
> a different class of users, or at users in different stages of their
> PostgreSQL experience:
> ---
>
>
I appreciate the feedback, and before we get too far down the rabbit hole, I
would like to note that I am not tied to an exact wording; my post was
more about the general goal and the results that would follow from it.


> I agree with these goals in general. I like to refer to
> http://www.plainenglish.co.uk/how-to-write-in-plain-english.html when
> writing documentation. Or anything else, really.
>

Great resource!

JD


>
> - Heikki
>


Re: Optimizing the documentation

2020-12-14 Thread Joshua Drake
>
>
>>
>> Technical documentation should only be as verbose as needed to illustrate
>> the concept or task that we are explaining. It should not be redundant, nor
>> should it use .50 cent words when a .10 cent word would suffice. I would
>> like to put effort into optimizing the documentation and am requesting
>> general consensus that this would be a worthwhile effort before I begin to
>> dust off my Docbook skills.
>>
>>
> As a quick observation, it would be more immediately helpful to add to the
> existing proposal to add more details about architecture and get that
> committed before embarking on a new documentation project.
>
> https://commitfest.postgresql.org/31/2541/
>

I considered just starting to review patches, but even so, doesn't it make
sense that if I am going to put a particular thought process into my
efforts, there should be a general consensus first? For example, what would be
exceedingly helpful would be a canonical documentation style guide that we
can review documentation against. Currently our documentation is all over
the place; it isn't that it is not technically accurate or comprehensive.


> Optimized text (35 words):
>>
>> This is the official PostgreSQL documentation. It is written by the
>> PostgreSQL community in parallel with the development of the software. We
>> have organized it by the type of user and their stages of experience:
>>
>> Issues that are resolved with the optimized text:
>>
>>    - Succinct text is more likely to be read than skimmed
>>    - Removal of extraneous mentions of PostgreSQL
>>    - Removal of unneeded justifications
>>    - Joining of two paragraphs into one that provides only the needed
>>      information to the user
>>    - Word count decreased by over 50%. As changes such as these are
>>      adopted it would make the documentation more consumable.
>>
> That actually exists in our documentation?
>

Yes. https://www.postgresql.org/docs/13/preface.html


> I suspect changing it isn't all that worthwhile as the typical user isn't
> reading the documentation like a book and with the entry point being the
> table of contents most of that material is simply gleaned from observing
> the presented structure without words needed to describe it.
>

It is a matter of consistency.


>
> While I don't think making readability changes is a bad thing, and maybe
> my perspective is a bit biased and negative right now, but the attention
> given to the existing documentation patches in the commitfest isn't that
> great - so adding another mass of patches fixing up items that haven't
> provoked complaints seems likely to just make the list longer.
>

One of the issues is that editing documentation with patches is a pain. It
is simpler and less effort to pull up an existing section of
DocBook and edit it (just like code) than it is to break out specific
text within a patch. Though I would be happy to take a swipe at reviewing a
specific documentation patch (as you linked).


>
> In short, I don't think optimization should be a goal in its own right;
> but rather changes should mostly be driven by questions asked by our
> users.  I don't think reading random chapters of the documentation to find
> non-optimal exposition is going to be a good use of time.
>

I wasn't planning on reading random chapters. I was planning on walking
through the documentation as it is written and hopefully others would join.
This is a monumental effort to perform completely. Also consider the
overall benefit, not just one specific piece. Would you not consider it a
net win if certain questions were being answered in a succinct way so as to
allow users to use the documentation instead of asking the most novice of
questions on various channels?

JD

>
>


Re: Optimizing the documentation

2020-12-14 Thread Heikki Linnakangas

On 14/12/2020 21:50, Joshua Drake wrote:
> The community has spent a lot of time optimizing features over the
> years. Excellent examples include parallel query and partitioning which
> have been multi-year efforts to increase the quality, performance, and
> extend features of the original commit. We should consider the
> documentation in a similar manner. Just like code, documentation can
> sometimes use a bug fix, optimization, and/or new features added to the
> original implementation.
>
> Technical documentation should only be as verbose as needed to
> illustrate the concept or task that we are explaining. It should not be
> redundant, nor should it use .50 cent words when a .10 cent word would
> suffice. I would like to put effort into optimizing the documentation
> and am requesting general consensus that this would be a worthwhile
> effort before I begin to dust off my Docbook skills.


Hard to argue with "let's make the doc better" :-).

I expect that there will be a lot of bikeshedding over the exact 
phrases. That's OK. Every improvement that actually gets committed 
helps, even if we don't make progress on other parts.



> I have provided an example below:
>
> Original text (79 words):
>
> This book is the official documentation of PostgreSQL. It has been
> written by the PostgreSQL developers and other volunteers in parallel to
> the development of the PostgreSQL software. It describes all the
> functionality that the current version of PostgreSQL officially supports.
>
> To make the large amount of information about PostgreSQL manageable,
> this book has been organized in several parts. Each part is targeted at
> a different class of users, or at users in different stages of their
> PostgreSQL experience:
>
> Optimized text (35 words):
>
> This is the official PostgreSQL documentation. It is written by the
> PostgreSQL community in parallel with the development of the software.
> We have organized it by the type of user and their stages of experience:


Some thoughts on this example:

- Changing "has been" to "is" changes the tone here. "Is" implies that 
it is being written continuously, whereas "has been" implies that it's 
finished. We do update the docs continuously, but the point of the sentence 
is that the docs were developed together with the features, so "has 
been" seems more accurate.


- I like "PostgreSQL developers and other volunteers" better than the 
"PostgreSQL community". This is the very first introduction to 
PostgreSQL, so we can't expect the reader to know what the "PostgreSQL 
community" is. I like the "volunteers" word here a lot.


- I think a little bit of ceremony is actually OK in this particular 
paragraph, since it's the very first one in the docs.


- I agree with dropping the "to make the large amount of information 
manageable".


So I would largely keep this example unchanged, changing it into:

---
This book is the official documentation of PostgreSQL. It has been 
written by the PostgreSQL developers and other volunteers in parallel to 
the development of the PostgreSQL software. It describes all the 
functionality that the current version of PostgreSQL officially supports.


This book has been organized in several parts. Each part is targeted at 
a different class of users, or at users in different stages of their 
PostgreSQL experience:

---


> Issues that are resolved with the optimized text:
>
>   * Succinct text is more likely to be read than skimmed
>
>   * Removal of extraneous mentions of PostgreSQL
>
>   * Removal of unneeded justifications
>
>   * Joining of two paragraphs into one that provides only the needed
>     information to the user
>
>   * Word count decreased by over 50%. As changes such as these are
>     adopted it would make the documentation more consumable.

I agree with these goals in general. I like to refer to 
http://www.plainenglish.co.uk/how-to-write-in-plain-english.html when 
writing documentation. Or anything else, really.


- Heikki




Re: Optimizing the documentation

2020-12-14 Thread David G. Johnston
On Mon, Dec 14, 2020 at 12:50 PM Joshua Drake  wrote:

> -hackers,
>
> The community has spent a lot of time optimizing features over the years.
> Excellent examples include parallel query and partitioning which have been
> multi-year efforts to increase the quality,  performance, and extend
> features of the original commit. We should consider the documentation in a
> similar manner. Just like code, documentation can sometimes use a bug fix,
> optimization, and/or new features added to the original implementation.
>
> Technical documentation should only be as verbose as needed to illustrate
> the concept or task that we are explaining. It should not be redundant, nor
> should it use .50 cent words when a .10 cent word would suffice. I would
> like to put effort into optimizing the documentation and am requesting
> general consensus that this would be a worthwhile effort before I begin to
> dust off my Docbook skills.
>
>
As a quick observation, it would be more immediately helpful to add to the
existing proposal to add more details about architecture and get that
committed before embarking on a new documentation project.

https://commitfest.postgresql.org/31/2541/



> I have provided an example below:
>
> Original text (79 words):
>
> This book is the official documentation of PostgreSQL. It has been written
> by the PostgreSQL developers and other volunteers in parallel to the
> development of the PostgreSQL software. It describes all the functionality
> that the current version of PostgreSQL officially supports.
>
> To make the large amount of information about PostgreSQL manageable, this
> book has been organized in several parts. Each part is targeted at a
> different class of users, or at users in different stages of their
> PostgreSQL experience:
>
> Optimized text (35 words):
>
> This is the official PostgreSQL documentation. It is written by the
> PostgreSQL community in parallel with the development of the software. We
> have organized it by the type of user and their stages of experience:
>
> Issues that are resolved with the optimized text:
>
>    - Succinct text is more likely to be read than skimmed
>    - Removal of extraneous mentions of PostgreSQL
>    - Removal of unneeded justifications
>    - Joining of two paragraphs into one that provides only the needed
>      information to the user
>    - Word count decreased by over 50%. As changes such as these are adopted
>      it would make the documentation more consumable.
>

That actually exists in our documentation?  I suspect changing it isn't
all that worthwhile, as the typical user isn't reading the documentation
like a book, and with the entry point being the table of contents, most of
that material is simply gleaned from observing the presented structure
without words needed to describe it.

I don't think making readability changes is a bad thing, and maybe my
perspective is a bit biased and negative right now, but the attention given
to the existing documentation patches in the commitfest isn't that great -
so adding another mass of patches fixing up items that haven't provoked
complaints seems likely to just make the list longer.

In short, I don't think optimization should be a goal in its own right; but
rather changes should mostly be driven by questions asked by our users.  I
don't think reading random chapters of the documentation to find
non-optimal exposition is going to be a good use of time.

David J.


Optimizing the documentation

2020-12-14 Thread Joshua Drake
-hackers,

The community has spent a lot of time optimizing features over the years.
Excellent examples include parallel query and partitioning, which have been
multi-year efforts to improve the quality and performance of the original
commits and to extend their features. We should consider the documentation in a
similar manner. Just like code, documentation can sometimes use a bug fix, an
optimization, and/or new features added to the original implementation.

Technical documentation should only be as verbose as needed to illustrate
the concept or task that we are explaining. It should not be redundant, nor
should it use .50 cent words when a .10 cent word would suffice. I would
like to put effort into optimizing the documentation and am requesting
general consensus that this would be a worthwhile effort before I begin to
dust off my Docbook skills.

I have provided an example below:

Original text (79 words):

This book is the official documentation of PostgreSQL. It has been written
by the PostgreSQL developers and other volunteers in parallel to the
development of the PostgreSQL software. It describes all the functionality
that the current version of PostgreSQL officially supports.

To make the large amount of information about PostgreSQL manageable, this
book has been organized in several parts. Each part is targeted at a
different class of users, or at users in different stages of their
PostgreSQL experience:

Optimized text (35 words):

This is the official PostgreSQL documentation. It is written by the
PostgreSQL community in parallel with the development of the software. We
have organized it by the type of user and their stages of experience:

Issues that are resolved with the optimized text:

   - Succinct text is more likely to be read than skimmed

   - Removal of extraneous mentions of PostgreSQL

   - Removal of unneeded justifications

   - Joining of two paragraphs into one that provides only the needed
     information to the user

   - Word count decreased by over 50%. As changes such as these are adopted
     it would make the documentation more consumable.

Thanks,
JD

-- 
Founder - https://commandprompt.com/ - 24x7x365 Postgres since 1997
Co-Chair - https://postgresconf.org/ - Postgres Education at its finest
People, Postgres, Data


Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

2020-12-14 Thread Tom Lane
Noah Misch  writes:
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
>> A quick count of grep hits suggest that the large majority of
>> existing hash_create() calls use HASH_BLOBS, and there might be
>> only order-of-ten calls that would need to be touched if we
>> required an explicit HASH_STRING flag.  So option #2 is seeming
>> kind of attractive.  Maybe that together with an assertion that
>> string keys have to exceed 8 or 16 bytes would be enough protection.

> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.

Attached is a proposed patch that requires HASH_STRINGS to be stated
explicitly (in the event, there are 13 callers needing that) and insists
on keysize > 8 for string keys.  In examining the now-easily-visible uses
of string keys, almost all of them are using NAMEDATALEN-sized keys, or
in a few places larger values.  Only two are smaller:

1. ShmemIndex uses SHMEM_INDEX_KEYSIZE, which is only set to 48.

2. ResetUnloggedRelationsInDbspaceDir is using OIDCHARS + 1, because
it stores relfilenode OIDs as strings.  That seems pretty damfool
to me, so I'm inclined to change it to store binary OIDs instead;
those'd be a third the size (or probably a quarter the size after
alignment padding) and likely faster to hash or compare.  But I
didn't do that here, since it's still more than 8.  (I did whack
it upside the head to the extent of not storing its temporary
hash table in CacheMemoryContext.)

So it seems to me that insisting on keysize > 8 is fine.

There are a couple of other API oddities that maybe we should think
about while we're here:

* Should we just have a blanket insistence that all callers supply
HASH_ELEM?  The default sizes that dynahash.c uses without that are
undocumented and basically useless.  We're already asserting that
in the HASH_BLOBS path, which is the majority use-case, and this
patch now asserts it for HASH_STRINGS too.

* The coding convention that the HASHCTL argument struct should be
pre-zeroed seems to have been ignored at a lot of call sites.
I added a memset call to a couple of callers that I was touching
in this patch, but I'm having second thoughts about that.  Maybe
we should just rip out all those memsets as pointless, since there's
basically no case where you'd use the memset to fill a field that
you meant to pass as zero.  The fact that hash_create() doesn't
read fields it's not told to by a flag means we should not need
the memsets to avoid uninitialized-memory reads.
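
For illustration, a string-keyed caller following those conventions would
end up looking roughly like this (the entry type and table name are made
up; the flags and fields are the ones the patch touches):

/*
 * Hypothetical HASH_STRINGS caller: no memset of the HASHCTL, only the
 * fields named by the flags are set, and the key semantics are stated
 * explicitly.  MyEntry and the table name are invented.
 */
typedef struct MyEntry
{
    char        name[NAMEDATALEN];  /* hash key; must be first */
    int         count;
} MyEntry;

static HTAB *name_hash = NULL;

static void
init_name_hash(void)
{
    HASHCTL     ctl;

    ctl.keysize = NAMEDATALEN;
    ctl.entrysize = sizeof(MyEntry);
    ctl.hcxt = CacheMemoryContext;

    name_hash = hash_create("name demo hash", 64, &ctl,
                            HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
}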

regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..8b17fb06eb 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2604,10 +2604,12 @@ createConnHash(void)
 {
 	HASHCTL		ctl;
 
+	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = NAMEDATALEN;
 	ctl.entrysize = sizeof(remoteConnHashEnt);
 
-	return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+	return hash_create("Remote Con hash", NUMCONN, &ctl,
+	   HASH_ELEM | HASH_STRINGS);
 }
 
 static void
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..ec7819ca77 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -726,7 +726,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
 	crosstab_hash = hash_create("crosstab hash",
 INIT_CATS,
 &ctl,
-HASH_ELEM | HASH_CONTEXT);
+HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 
 	/* Connect to SPI manager */
 	if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 4b18be5b27..5ba7c2eb3c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -414,7 +414,7 @@ InitQueryHashTable(void)
 	prepared_queries = hash_create("Prepared Queries",
    32,
    &hash_ctl,
-   HASH_ELEM);
+   HASH_ELEM | HASH_STRINGS);
 }
 
 /*
diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c
index ab04459c55..2fe89fd361 100644
--- a/src/backend/nodes/extensible.c
+++ b/src/backend/nodes/extensible.c
@@ -51,7 +51,8 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
 		ctl.keysize = EXTNODENAME_MAX_LEN;
 		ctl.entrysize = sizeof(ExtensibleNodeEntry);
 
-		*p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
+		*p_htable = hash_create(htable_label, 100, &ctl,
+HASH_ELEM | HASH_STRINGS);
 	}
 
 	if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..f21ab67ae4 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -175,7 +175,9 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 		memset(&ctl, 0, sizeof(ctl));
 		ctl.keysiz

Re: cutting down the TODO list thread

2020-12-14 Thread John Naylor
On Thu, Dec 10, 2020 at 3:29 PM John Naylor 
wrote:
>
> *Views and Rules

> *SQL Commands

Hearing no objections, the items mentioned have been moved over.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Rethinking plpgsql's assignment implementation

2020-12-14 Thread Pavel Stehule
On Mon, Dec 14, 2020 at 17:25, Tom Lane  wrote:

> Pavel Stehule  writes:
> > I checked a performance and it looks so access to record's field is
> faster,
> > but an access to arrays field is significantly slower
>
> Hmm, I'd drawn the opposite conclusion in my own testing ...
>
> > for i in 1..5000
> > loop
> >   if a[i] > a[i+1] then
> > aux := a[i];
> > a[i] := a[i+1]; a[i+1] := aux;
> > rep := true;
> >   end if;
> > end loop;
>
> ... but I now see that I'd not checked cases like "a[i] := a[j]".
> exec_check_rw_parameter() is being too conservative about whether
> it can optimize a case like that.  The attached incremental patch
> fixes it.
>
> > I tested pi calculation
> > ...
> > And the performance is 10% slower than on master
>
> Can't reproduce that here.  For the record, I get the following
> timings (medians of three runs) for your test cases:
>
> HEAD:
>
> sort:   Time: 13974.709 ms (00:13.975)
> pi_est_1(1000): Time: 3537.482 ms (00:03.537)
> pi_est_2(1000): Time: 3546.557 ms (00:03.547)
>
> Patch v1:
>
> sort:   Time: 47053.892 ms (00:47.054)
> pi_est_1(1000): Time: 3456.078 ms (00:03.456)
> pi_est_2(1000): Time: 3451.347 ms (00:03.451)
>
> + exec_check_rw_parameter fix:
>
> sort:   Time: 12199.724 ms (00:12.200)
> pi_est_1(1000): Time: 3357.955 ms (00:03.358)
> pi_est_2(1000): Time: 3367.526 ms (00:03.368)
>
> I'm inclined to think that the differences in the pi calculation
> timings are mostly chance effects; there's certainly no reason
> why exec_check_rw_parameter should affect that test case at all.
>

The performance patch helps a lot for the sort case - with the patch it is 5-10%
faster than master (10864 ms vs 12122 ms).

I probably found the reason why the patched version was slower.

I used

CFLAGS="-fno-omit-frame-pointer -Wall -Wmissing-prototypes
-Wdeclaration-after-statement  -Wendif-labels -Wmissing-format-attribute
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-g -O2 -Werror=switch"

With these options the pi test was slower. When I used the defaults, there
is no difference.

So it can be a very good feature - the new code has the same speed or is faster.

Regards

Pavel










> regards, tom lane
>
>


Re: libpq compression

2020-12-14 Thread Daniil Zakhlystov
> On Dec 10, 2020, at 1:39 AM, Robert Haas  wrote:
> 
> I still think this is excessively baroque and basically useless.
> Nobody wants to allow compression levels 1, 3, and 5 but disallow 2
> and 4. At the very most, somebody might want to start a maximum or
> minimum level. But even that I think is pretty pointless. Check out
> the "Decompression Time" and "Decompression Speed" sections from this
> link:
> 
> https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/
> 
> This shows that decompression time and speed is basically independent
> of compression method for all three of these compressors; to the
> extent that there is a difference, higher compression levels are
> generally slightly faster to decompress. I don't really see the
> argument for letting either side be proscriptive here. Deciding with
> algorithms you're willing to accept is totally reasonable since
> different things may be supported, security concerns, etc. but
> deciding you're only willing to accept certain levels seems unuseful.
> It's also unenforceable, I think, since the receiving side has no way
> of knowing what the sender actually did.

I agree that decompression time and speed are basically the same for different 
compression ratios for most algorithms.
But it seems that this may not be true for memory usage.

Check out these links: http://mattmahoney.net/dc/text.html and 
https://community.centminmod.com/threads/round-4-compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.18669/

According to these sources, zstd uses significantly more memory while 
decompressing the data which has been compressed with high compression ratios.

So I’ll test the different ZSTD compression ratios with the current version of 
the patch and post the results later this week.
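
As a small illustration of how the receiving side could defend against that
(this is not part of the patch, just a sketch using zstd's advanced API): the
decompressor can cap the window size it is willing to allocate, so a peer that
compressed at a very high level cannot force a huge allocation.

	#include <zstd.h>

	static ZSTD_DCtx *
	create_bounded_dctx(void)
	{
		ZSTD_DCtx  *dctx = ZSTD_createDCtx();

		/*
		 * Reject frames whose window exceeds 2^23 bytes (8 MB);
		 * ZSTD_decompressStream() then fails instead of allocating it.
		 */
		if (dctx != NULL)
			ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 23);

		return dctx;
	}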


> On Dec 10, 2020, at 1:39 AM, Robert Haas  wrote:
> 
> 
> Good points. I guess you need to arrange to "flush" at the compression
> layer as well as the libpq layer so that you don't end up with data
> stuck in the compression buffers.

I think that “flushing” the libpq and compression buffers before setting the 
new compression method will help to solve issues only at the compressing 
(sender) side
but won't help much on the decompressing (receiver) side.

In the current version of the patch, the decompressor acts as a proxy between 
secure_read and PqRecvBuffer /  conn->inBuffer. It is unaware of the Postgres 
protocol and 
will fail to do anything other than decompressing the bytes received from the 
secure_read function and appending them to the PqRecvBuffer.
So the problem is that we can’t decouple the compressed bytes from the 
uncompressed ones (actually ZSTD detects the compressed block end, but some 
other algorithms don’t).

We may introduce some hinges to control the decompressor behavior from the 
underlying levels after reading the SetCompressionMethod message
from PqRecvBuffer, but I don’t think that it is the correct approach.

> On Dec 10, 2020, at 1:39 AM, Robert Haas  wrote:
> 
> Another idea is that you could have a new message type that says "hey,
> the payload of this is 1 or more compressed messages." It uses the
> most-recently set compression method. This would make switching
> compression methods easier since the SetCompressionMethod message
> itself could always be sent uncompressed and/or not take effect until
> the next compressed message. It also allows for a prudential decision
> not to bother compressing messages that are short anyway, which might
> be useful. On the downside it adds a little bit of overhead. Andres
> was telling me on a call that he liked this approach; I'm not sure if
> it's actually best, but have you considered this sort of approach?

This may help to solve the above issue. For example, we may introduce the 
CompressedData message:

CompressedData (F & B) 

Byte1(‘m’) // I am not so sure about the ‘m’ identifier :)
Identifies the message as compressed data. 

Int32 
Length of message contents in bytes, including self. 

Byten
Data that forms part of a compressed data stream.

Basically, it wraps some chunk of compressed data (like the CopyData message).

On the sender side, the compressor will wrap all outgoing message chunks into 
CompressedData messages.

On the receiver side, some intermediate component between the secure_read and 
the decompressor will do the following:
1. Read the next 5 bytes (type and length) from the buffer 
2.1 If the message type is other than CompressedData, forward it straight to 
the PqRecvBuffer /  conn->inBuffer.
2.2 If the message type is CompressedData, forward its contents to the current 
decompressor.

What do you think of this approach?
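
For concreteness, the receiver-side dispatch described above could look roughly
like this. It is only a sketch: read_raw(), append_to_recv_buffer(),
copy_raw_to_recv_buffer() and feed_decompressor() are hypothetical stand-ins,
not existing libpq or backend functions, and 'm' is the tentative CompressedData
type byte from above.

	char		hdr[5];
	uint32		len;

	read_raw(hdr, 5);				/* message type + length via secure_read */
	memcpy(&len, hdr + 1, 4);
	len = ntohl(len);				/* length covers itself, not the type byte */

	if (hdr[0] != 'm')
	{
		/* ordinary message: pass it through untouched */
		append_to_recv_buffer(hdr, 5);
		copy_raw_to_recv_buffer(len - 4);
	}
	else
	{
		/* CompressedData: payload goes to the current decompressor,
		 * whose output is appended to PqRecvBuffer / conn->inBuffer */
		feed_decompressor(len - 4);
	}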

—
Daniil Zakhlystov



Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-14 Thread Bharath Rupireddy
On Mon, Dec 14, 2020 at 11:00 PM Fujii Masao
 wrote:
> > One more point for the above scenario: if the user mapping is dropped
> > in another session, then cache lookup for that entry in the
> > postgres_fdw_get_connections() returns a null tuple which I plan to
> > not throw an error, but just to skip in that case and continue. But if
> > the user mapping is not dropped in another session but altered, then
> > postgres_fdw_get_connections() still can show that in the output.
>
> Yes, so *if* we really want to return even connection invalidated by drop of
> user mapping, the cached connection entry may need to store not only
> user mapping id but also server id so that we can get the server name without
> user mapping entry.

We can do that, but what happens if the foreign server itself gets
dropped with the CASCADE option in another session? The use case is as
follows:

1) Run a foreign query in session 1 with server 1, user mapping 1
2) Try to drop foreign server 1, then we would not be allowed to do so
because of dependency, if we use CASCADE, then the dependent user
mapping 1 and foreign tables get dropped too.
3) Run the postgres_fdw_get_connections(), at the start of txn, the
cached entry gets invalidated via pgfdw_inval_callback() and we try to
use the stored server id of the invalid entry (for which the foreign
server would have been dropped) and lookup in sys catalogues, so again
a null tuple is returned.
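
A sketch of that "skip rather than error" handling (illustrative only;
entry->serverid is the extra field discussed above, not something the cache
entry stores today, while the syscache calls are existing APIs):

	HeapTuple	tup = SearchSysCache1(FOREIGNSERVEROID,
									  ObjectIdGetDatum(entry->serverid));
	char	   *servername;

	if (!HeapTupleIsValid(tup))
		continue;			/* server dropped concurrently; skip this entry */

	servername = pstrdup(NameStr(((Form_pg_foreign_server) GETSTRUCT(tup))->srvname));
	ReleaseSysCache(tup);
	/* ... build the output row for this connection using servername ... */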

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-14 Thread Fujii Masao




On 2020/12/15 1:40, Bharath Rupireddy wrote:

On Mon, Dec 14, 2020, 9:47 PM Bharath Rupireddy
 wrote:


On Mon, Dec 14, 2020 at 8:03 PM Fujii Masao  wrote:

We will not output if the invalidated entry has no active connection[2], so if we 
fix the connection leak issue with the above discussed fix i.e closing all the 
invalidated connections at the end of next xact, there are less chances that we 
will output invalidated entries in the postgres_fdw_get_connections() output. Only 
case we may show up invalidated connections(which have active connections 
entry->conn) in the postgres_fdw_get_connections() output is as follows:

1) say we have few cached active connections exists in session 1
2) drop the user mapping (in another session) associated with any of the cached 
connections to make that entry invalid
3) run select * from postgres_fdw_get_connections(); in session 1.  At the start 
of the xact, the invalidation message gets processed and the corresponding entry 
gets marked as invalid. If we allow invalid connections (that have entry->conn) 
to show up in the output, then we show them in the result of the query. At the end 
of xact, we close these invalid connections, in this case, user might think that 
he still have invalid connections active.


What about the case where the transaction started at the above 1) at session 1, 
and postgres_fdw_get_connections() in the above 3) is called within that 
transaction at session 1? In this case, postgres_fdw_get_connections() can 
return even invalidated connections?


In that case, since the user mapping would have been dropped in
another session and we are in the middle of a txn in session 1, the
entries would not get marked as invalid until the invalidation message
gets processed by the session 1 which may happen if the session 1


Yes, and this can also be triggered by other commands, for example CREATE TABLE.



opens a sub txn, if not then for postgres_fdw_get_connections() the
entries will still be active as they would not have been marked as
invalid yet and postgres_fdw_get_connections() would return them in
the output.


One more point for the above scenario: if the user mapping is dropped
in another session, then cache lookup for that entry in the
postgres_fdw_get_connections() returns a null tuple which I plan to
not throw an error, but just to skip in that case and continue. But if
the user mapping is not dropped in another session but altered, then
postgres_fdw_get_connections() still can show that in the output.


Yes, so *if* we really want to return even connection invalidated by drop of
user mapping, the cached connection entry may need to store not only
user mapping id but also server id so that we can get the server name without
user mapping entry.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg_waldump error message fix

2020-12-14 Thread Bossart, Nathan
On 12/13/20, 7:01 PM, "Michael Paquier"  wrote:
> On Mon, Dec 14, 2020 at 10:26:01AM +0900, Kyotaro Horiguchi wrote:
>> Yeah, I had the same feeling. At least, the two LSNs in the message
>> under discussion are simply redundant. So +1 to just remove the LSN at
>> the caller site.
>
> That would mean that we are ready to accept that we will never forget
> to a LSN in any of the messages produced by xlogreader.c or any of the
> callbacks used by pg_waldump.  FWIW, I'd rather let a position in this
> report than none.  At least it allows users to know the area where the
> problem happened.

Yeah.  Unfortunately, I suspect we will have the same problem if we
add a new variable that we only use to track the LSN to report for
errors.

Nathan



Re: Add Information during standby recovery conflicts

2020-12-14 Thread Fujii Masao



On 2020/12/15 0:49, Drouvot, Bertrand wrote:

Hi,

On 12/14/20 4:20 PM, Fujii Masao wrote:




On 2020/12/14 21:31, Fujii Masao wrote:



On 2020/12/05 12:38, Masahiko Sawada wrote:

On Fri, Dec 4, 2020 at 7:22 PM Drouvot, Bertrand  wrote:


Hi,

On 12/4/20 2:21 AM, Fujii Masao wrote:


On 2020/12/04 9:28, Masahiko Sawada wrote:

On Fri, Dec 4, 2020 at 2:54 AM Fujii Masao
 wrote:




On 2020/12/01 17:29, Drouvot, Bertrand wrote:

Hi,

On 12/1/20 12:35 AM, Masahiko Sawada wrote:




On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera
 wrote:

On 2020-Dec-01, Fujii Masao wrote:


+ if (proc)
+ {
+ if (nprocs == 0)
+ appendStringInfo(&buf, "%d", proc->pid);
+ else
+ appendStringInfo(&buf, ", %d", proc->pid);
+
+ nprocs++;

What happens if all the backends in wait_list have gone? In
other words,
how should we handle the case where nprocs == 0 (i.e., nprocs
has not been
incremented at all)? This would very rarely happen, but can happen.
In this case, since buf.data is empty, at least there seems no
need to log
the list of conflicting processes in detail message.

Yes, I noticed this too; this can be simplified by changing the
condition in the ereport() call to be "nprocs > 0" (rather than
wait_list being null), otherwise not print the errdetail.  (You
could
test buf.data or buf.len instead, but that seems uglier to me.)

+1

Maybe we can also improve the comment of this function from:

+ * This function also reports the details about the conflicting
+ * process ids if *wait_list is not NULL.

to " This function also reports the details about the conflicting
process ids if exist" or something.


Thank you all for the review/remarks.

They have been addressed in the new attached patch version.


Thanks for updating the patch! I read through the patch again
and applied the following changes to it. Attached is the updated
version of the patch. Could you review this version? If there is
no issue in it, I'm thinking to commit this version.


Thank you for updating the patch! I have one question.



+   timeouts[cnt].id = STANDBY_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;

Maybe STANDBY_TIMEOUT should be STANDBY_DEADLOCK_TIMEOUT here?
I changed the code that way.


As the comment of ResolveRecoveryConflictWithLock() says the
following, a deadlock is detected by the ordinary backend process:

   * Deadlocks involving the Startup process and an ordinary backend
process
   * will be detected by the deadlock detector within the ordinary
backend.

If we use STANDBY_DEADLOCK_TIMEOUT,
SendRecoveryConflictWithBufferPin() will be called after
DeadlockTimeout passed, but I think it's not necessary for the startup
process in this case.


Thanks for pointing this! You are right.



If we want to just wake up the startup process
maybe we can use STANDBY_TIMEOUT here?



Thanks for the patch updates! Except what we are still discussing below,
it looks good to me.


When STANDBY_TIMEOUT happens, a request to release conflicting buffer
pins is sent. Right? If so, we should not also use STANDBY_TIMEOUT there?


Agree



Or, first of all, we don't need to enable the deadlock timer at all?
Since what we'd like to do is to wake up after deadlock_timeout
passes, we can do that by changing ProcWaitForSignal() so that it can
accept the timeout and giving the deadlock_timeout to it. If we do
this, maybe we can get rid of STANDBY_LOCK_TIMEOUT from
ResolveRecoveryConflictWithLock(). Thought?


Where do we enable deadlock timeout in hot standby case? You meant to
enable it in ProcWaitForSignal() or where we set a timer for not hot
standby case, in ProcSleep()?


No, what I tried to say is to change ResolveRecoveryConflictWithLock() so that 
it does

1. calculate the "minimum" timeout from deadlock_timeout and 
max_standby_xxx_delay
2. give the calculated timeout value to ProcWaitForSignal()
3. wait for signal and timeout on ProcWaitForSignal()

Since ProcWaitForSignal() calls WaitLatch(), seems it's not so difficult to 
make ProcWaitForSignal() handle the timeout. If we do this, I was thinking that 
we can get rid of enable_timeouts() from ResolveRecoveryConflictWithLock().
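
A very rough sketch of that flow (illustrative only; ProcWaitForSignal() does
not take a timeout today, so the plain WaitLatch() call below stands in for the
proposed variant, and standby_delay_ms is a placeholder for whichever
max_standby_*_delay value applies):

	long	timeout_ms = Min(DeadlockTimeout, standby_delay_ms);
	int		rc;

	rc = WaitLatch(MyLatch,
				   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				   timeout_ms,
				   0 /* a real patch would add a proper wait event */ );
	ResetLatch(MyLatch);

	if (rc & WL_TIMEOUT)
	{
		/* deadlock_timeout (or the max delay) has expired: log the
		 * "still waiting" message and/or cancel conflicting backends */
	}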






Why not simply use (again) the STANDBY_LOCK_TIMEOUT one? (as it triggers
a call to StandbyLockTimeoutHandler() which does nothing, except waking
up. That's what we want, right?)


Right, what I wanted to mean is STANDBY_LOCK_TIMEOUT. The startup
process can wake up and do nothing. Thank you for pointing out.


O

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-14 Thread Bharath Rupireddy
On Mon, Dec 14, 2020, 9:47 PM Bharath Rupireddy
 wrote:
>
> On Mon, Dec 14, 2020 at 8:03 PM Fujii Masao  
> wrote:
> > > We will not output if the invalidated entry has no active connection[2], 
> > > so if we fix the connection leak issue with the above discussed fix i.e 
> > > closing all the invalidated connections at the end of next xact, there 
> > > are less chances that we will output invalidated entries in the 
> > > postgres_fdw_get_connections() output. Only case we may show up 
> > > invalidated connections(which have active connections entry->conn) in the 
> > > postgres_fdw_get_connections() output is as follows:
> > >
> > > 1) say we have few cached active connections exists in session 1
> > > 2) drop the user mapping (in another session) associated with any of the 
> > > cached connections to make that entry invalid
> > > 3) run select * from postgres_fdw_get_connections(); in session 1.  At 
> > > the start of the xact, the invalidation message gets processed and the 
> > > corresponding entry gets marked as invalid. If we allow invalid 
> > > connections (that have entry->conn) to show up in the output, then we 
> > > show them in the result of the query. At the end of xact, we close these 
> > > invalid connections, in this case, user might think that he still have 
> > > invalid connections active.
> >
> > What about the case where the transaction started at the above 1) at 
> > session 1, and postgres_fdw_get_connections() in the above 3) is called 
> > within that transaction at session 1? In this case, 
> > postgres_fdw_get_connections() can return even invalidated connections?
>
> In that case, since the user mapping would have been dropped in
> another session and we are in the middle of a txn in session 1, the
> entries would not get marked as invalid until the invalidation message
> gets processed by the session 1 which may happen if the session 1
> opens a sub txn, if not then for postgres_fdw_get_connections() the
> entries will still be active as they would not have been marked as
> invalid yet and postgres_fdw_get_connections() would return them in
> the output.

One more point for the above scenario: if the user mapping is dropped
in another session, then cache lookup for that entry in the
postgres_fdw_get_connections() returns a null tuple which I plan to
not throw an error, but just to skip in that case and continue. But if
the user mapping is not dropped in another session but altered, then
postgres_fdw_get_connections() still can show that in the output.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Rethinking plpgsql's assignment implementation

2020-12-14 Thread Tom Lane
Pavel Stehule  writes:
> I checked a performance and it looks so access to record's field is faster,
> but an access to arrays field is significantly slower

Hmm, I'd drawn the opposite conclusion in my own testing ...

> for i in 1..5000
> loop
>   if a[i] > a[i+1] then
> aux := a[i];
> a[i] := a[i+1]; a[i+1] := aux;
> rep := true;
>   end if;
> end loop;

... but I now see that I'd not checked cases like "a[i] := a[j]".
exec_check_rw_parameter() is being too conservative about whether
it can optimize a case like that.  The attached incremental patch
fixes it.

> I tested pi calculation
> ...
> And the performance is 10% slower than on master

Can't reproduce that here.  For the record, I get the following
timings (medians of three runs) for your test cases:

HEAD:

sort:   Time: 13974.709 ms (00:13.975)
pi_est_1(1000): Time: 3537.482 ms (00:03.537)
pi_est_2(1000): Time: 3546.557 ms (00:03.547)

Patch v1:

sort:   Time: 47053.892 ms (00:47.054)
pi_est_1(1000): Time: 3456.078 ms (00:03.456)
pi_est_2(1000): Time: 3451.347 ms (00:03.451)

+ exec_check_rw_parameter fix:

sort:   Time: 12199.724 ms (00:12.200)
pi_est_1(1000): Time: 3357.955 ms (00:03.358)
pi_est_2(1000): Time: 3367.526 ms (00:03.368)

I'm inclined to think that the differences in the pi calculation
timings are mostly chance effects; there's certainly no reason
why exec_check_rw_parameter should affect that test case at all.

regards, tom lane

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index 4c8a739bc4..15cb3b312f 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -2583,7 +2583,9 @@ array_set_element_expanded(Datum arraydatum,
 	/*
 	 * Copy new element into array's context, if needed (we assume it's
 	 * already detoasted, so no junk should be created).  If we fail further
-	 * down, this memory is leaked, but that's reasonably harmless.
+	 * down, this memory is leaked, but that's reasonably harmless.  Note in
+	 * particular that doing this early ensures sanity in case the source
+	 * Datum is a pointer to a pass-by-ref element of this same array.
 	 */
 	if (!eah->typbyval && !isNull)
 	{
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 5c1db1dcfb..1378f40d4d 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -8181,14 +8181,7 @@ exec_check_rw_parameter(PLpgSQL_expr *expr, int target_dno)
 			contains_target_param((Node *) sbsref->refexpr, &target_dno))
 			return;
 
-		/* the other subexpressions must not contain target */
-		if (contains_target_param((Node *) sbsref->refupperindexpr,
-  &target_dno) ||
-			contains_target_param((Node *) sbsref->reflowerindexpr,
-  &target_dno) ||
-			contains_target_param((Node *) sbsref->refassgnexpr,
-  &target_dno))
-			return;
+		/* we do *not* need to restrict the subscripts or source expression */
 
 		/* OK, we can pass target as a read-write parameter */
 		expr->rwparam = target_dno;


Re: Parallel Inserts in CREATE TABLE AS

2020-12-14 Thread Zhihong Yu
For set_append_rel_size(), it seems this is the difference
between query_level != 1 and query_level == 1:

+   (root->parent_root->parse->CTASParallelInsInfo &
CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND) &&

Maybe extract the common conditions into their own expression / variable so
that the code is easier to read.
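
For illustration, the suggested extraction might look roughly like this (a
sketch only; CTASParallelInsInfo and the flag name are taken from the patch
under review, and the query_level == 1 arm is an assumption):

	Query	   *ctas_parse = (root->query_level == 1) ?
		root->parse : root->parent_root->parse;
	bool		ignore_append_tuple_cost =
		(ctas_parse->CTASParallelInsInfo &
		 CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND) != 0;

	/* both query_level branches can then just test ignore_append_tuple_cost */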

Cheers

On Mon, Dec 14, 2020 at 4:50 AM Hou, Zhijie 
wrote:

> Hi
>
> > Attaching v11 patch set. Please review it further.
>
> Currently with the patch, we can allow parallel CTAS when topnode is
> Gather.
> When top-node is Append and Gather is the sub-node of Append, I think we
> can still enable
> Parallel CTAS by pushing Parallel CTAS down to the sub-node Gather, such
> as:
>
> Append
> -->Gather
> ->Create table
> ->Seqscan
> -->Gather
> ->create table
> ->Seqscan
>
> And the use case seems common to me, such as:
> select * from A where xxx union all select * from B where xxx;
>
> I attatch a WIP patch which just show the possibility of this feature.
> The patch is based on the latest v11-patch.
>
> What do you think?
>
> Best regards,
> houzj
>
>
>
>
>
>


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-14 Thread Bharath Rupireddy
On Mon, Dec 14, 2020 at 8:03 PM Fujii Masao  wrote:
> > We will not output if the invalidated entry has no active connection[2], so 
> > if we fix the connection leak issue with the above discussed fix i.e 
> > closing all the invalidated connections at the end of next xact, there are 
> > less chances that we will output invalidated entries in the 
> > postgres_fdw_get_connections() output. Only case we may show up invalidated 
> > connections(which have active connections entry->conn) in the 
> > postgres_fdw_get_connections() output is as follows:
> >
> > 1) say we have few cached active connections exists in session 1
> > 2) drop the user mapping (in another session) associated with any of the 
> > cached connections to make that entry invalid
> > 3) run select * from postgres_fdw_get_connections(); in session 1.  At the 
> > start of the xact, the invalidation message gets processed and the 
> > corresponding entry gets marked as invalid. If we allow invalid connections 
> > (that have entry->conn) to show up in the output, then we show them in the 
> > result of the query. At the end of xact, we close these invalid 
> > connections, in this case, user might think that he still have invalid 
> > connections active.
>
> What about the case where the transaction started at the above 1) at session 
> 1, and postgres_fdw_get_connections() in the above 3) is called within that 
> transaction at session 1? In this case, postgres_fdw_get_connections() can 
> return even invalidated connections?

In that case, since the user mapping would have been dropped in
another session and we are in the middle of a txn in session 1, the
entries would not get marked as invalid until the invalidation message
gets processed by the session 1 which may happen if the session 1
opens a sub txn, if not then for postgres_fdw_get_connections() the
entries will still be active as they would not have been marked as
invalid yet and postgres_fdw_get_connections() would return them in
the output.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Add Information during standby recovery conflicts

2020-12-14 Thread Fujii Masao



On 2020/12/14 21:31, Fujii Masao wrote:



On 2020/12/05 12:38, Masahiko Sawada wrote:

On Fri, Dec 4, 2020 at 7:22 PM Drouvot, Bertrand  wrote:


Hi,

On 12/4/20 2:21 AM, Fujii Masao wrote:


On 2020/12/04 9:28, Masahiko Sawada wrote:

On Fri, Dec 4, 2020 at 2:54 AM Fujii Masao
 wrote:




On 2020/12/01 17:29, Drouvot, Bertrand wrote:

Hi,

On 12/1/20 12:35 AM, Masahiko Sawada wrote:




On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera
 wrote:

On 2020-Dec-01, Fujii Masao wrote:


+ if (proc)
+ {
+ if (nprocs == 0)
+ appendStringInfo(&buf, "%d", proc->pid);
+ else
+ appendStringInfo(&buf, ", %d", proc->pid);
+
+ nprocs++;

What happens if all the backends in wait_list have gone? In
other words,
how should we handle the case where nprocs == 0 (i.e., nprocs
has not been
incremented at all)? This would very rarely happen, but can happen.
In this case, since buf.data is empty, at least there seems no
need to log
the list of conflicting processes in detail message.

Yes, I noticed this too; this can be simplified by changing the
condition in the ereport() call to be "nprocs > 0" (rather than
wait_list being null), otherwise not print the errdetail.  (You
could
test buf.data or buf.len instead, but that seems uglier to me.)

+1

Maybe we can also improve the comment of this function from:

+ * This function also reports the details about the conflicting
+ * process ids if *wait_list is not NULL.

to " This function also reports the details about the conflicting
process ids if exist" or something.


Thank you all for the review/remarks.

They have been addressed in the new attached patch version.


Thanks for updating the patch! I read through the patch again
and applied the following changes to it. Attached is the updated
version of the patch. Could you review this version? If there is
no issue in it, I'm thinking to commit this version.


Thank you for updating the patch! I have one question.



+   timeouts[cnt].id = STANDBY_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;

Maybe STANDBY_TIMEOUT should be STANDBY_DEADLOCK_TIMEOUT here?
I changed the code that way.


As the comment of ResolveRecoveryConflictWithLock() says the
following, a deadlock is detected by the ordinary backend process:

   * Deadlocks involving the Startup process and an ordinary backend
process
   * will be detected by the deadlock detector within the ordinary
backend.

If we use STANDBY_DEADLOCK_TIMEOUT,
SendRecoveryConflictWithBufferPin() will be called after
DeadlockTimeout passed, but I think it's not necessary for the startup
process in this case.


Thanks for pointing this! You are right.



If we want to just wake up the startup process
maybe we can use STANDBY_TIMEOUT here?



Thanks for the patch updates! Except what we are still discussing below,
it looks good to me.


When STANDBY_TIMEOUT happens, a request to release conflicting buffer
pins is sent. Right? If so, we should not also use STANDBY_TIMEOUT there?


Agree



Or, first of all, we don't need to enable the deadlock timer at all?
Since what we'd like to do is to wake up after deadlock_timeout
passes, we can do that by changing ProcWaitForSignal() so that it can
accept the timeout and giving the deadlock_timeout to it. If we do
this, maybe we can get rid of STANDBY_LOCK_TIMEOUT from
ResolveRecoveryConflictWithLock(). Thought?


Where do we enable deadlock timeout in hot standby case? You meant to
enable it in ProcWaitForSignal() or where we set a timer for not hot
standby case, in ProcSleep()?


No, what I tried to say is to change ResolveRecoveryConflictWithLock() so that 
it does

1. calculate the "minimum" timeout from deadlock_timeout and 
max_standby_xxx_delay
2. give the calculated timeout value to ProcWaitForSignal()
3. wait for signal and timeout on ProcWaitForSignal()

Since ProcWaitForSignal() calls WaitLatch(), seems it's not so difficult to 
make ProcWaitForSignal() handle the timeout. If we do this, I was thinking that 
we can get rid of enable_timeouts() from ResolveRecoveryConflictWithLock().






Why not simply use (again) the STANDBY_LOCK_TIMEOUT one? (as it triggers
a call to StandbyLockTimeoutHandler() which does nothing, except waking
up. That's what we want, right?)


Right, what I wanted to mean is STANDBY_LOCK_TIMEOUT. The startup
process can wake up and do nothing. Thank you for pointing out.


Okay, understood! Firstly I was thinking that enabling the same type (i.e., 
STANDBY_LOCK_TIMEOUT) of lock twice doesn't work properly, but as far as I read 
the code, it works. In that case, only the shorter timeout would be activated 
in enable_timeouts(). So I ag

Re: a misbehavior of partition row movement (?)

2020-12-14 Thread Arne Roland
Hi,


thanks for looking into this. I haven't yet looked at your patch in detail, yet 
I think we should address the general issue here. To me this doesn't seem to be 
a RI-trigger issue, but a more general issue like I mentioned in the pg-bugs 
thread 
https://www.postgresql.org/message-id/b1bfc99296e34725900bcd9689be8...@index.de


While I like the idea of making fks work, I'd prefer a solution that fires the 
appropriate row trigger for partitioned tables for non RI-cases as well.


Regards

Arne





Re: Feature Proposal: Add ssltermination parameter for SNI-based LoadBalancing

2020-12-14 Thread Lukas Meisegeier

I liked the idea with separate ports for ssl and non ssl requests and
tried it with haproxy.
The psql client connects to haproxy and receives the fixed 'S' byte
response. After that it tries to continue on the same connection and
doesn't open a new one. This breaks the connection because haproxy
expects a new TCP connection.


psqlClient: opens connection (TCP: SYN)
haproxy: accepts connection (TCP: SYN ACK)
psqlClient: confirms the connection (TCP: ACK)
psqlClient: sends SSLRequest
haproxy: sends confirmation (TCP: ACK)
haproxy: sends fixed byte response ('S')
haproxy: closes connection (TCP: FIN, ACK)
psqlClient: acknowledges fixed byte response (TCP: ACK)
psqlClient: sends SSL client hello --> error, connection already
closed ("psql: error: SSL SYSCALL error: No error (0x/0))

In my eyes the problem lies in upgrading the connection rather than
opening a new one.

Am 14-Dec-20 um 14:50 schrieb Heikki Linnakangas:

On 12/12/2020 13:52, Lukas Meisegeier wrote:

Thanks for the provided ideas :)
I use HaProxy for my load-balancing and unfortunately I can't define
that I want to listen on a port for both ssl and non ssl requests.


Could you configure HaProxy to listen on separate ports for SSL and
non-SSL connections, then? And forward both to the same Postgres server.


That means if I try to return a fixed response 'S' on the SSLRequest it
fails with an SSL-Handshake failure cause the server expects a ssl
message.


That doesn't sound right to me, or perhaps I have misunderstood what you
mean. If you don't send the SSLRequest to the Postgres server, but "eat"
it in the proxy, the Postgres server will not try to do an SSL handshake.


I have to say the psql ssl handshake procedure is really unique and
challenging :D


Yeah. IMAP and SMTP can use "STARTTLS" to switch an unencrypted
connection to encrypted, though. That's pretty similar to the
'SSLRequest' message used in the postgres protocol.

- Heikki





Re: Parallel Inserts in CREATE TABLE AS

2020-12-14 Thread Dilip Kumar
On Mon, Dec 14, 2020 at 4:06 PM Bharath Rupireddy
 wrote:
>
> On Thu, Dec 10, 2020 at 7:20 PM Bharath Rupireddy
>  wrote:
> > On Thu, Dec 10, 2020 at 5:19 PM Dilip Kumar  wrote:
> > > > > > +   allow = ps && IsA(ps, GatherState) && 
> > > > > > !ps->ps_ProjInfo &&
> > > > > > +   plannedstmt->parallelModeNeeded &&
> > > > > > +   plannedstmt->planTree &&
> > > > > > +   IsA(plannedstmt->planTree, Gather) 
> > > > > > &&
> > > > > > +   plannedstmt->planTree->lefttree &&
> > > > > > +   
> > > > > > plannedstmt->planTree->lefttree->parallel_aware &&
> > > > > > +   
> > > > > > plannedstmt->planTree->lefttree->parallel_safe;
> > > > > >
> > > > > > I noticed it check both IsA(ps, GatherState) and 
> > > > > > IsA(plannedstmt->planTree, Gather).
> > > > > > Does it mean it is possible that IsA(ps, GatherState) is true but 
> > > > > > IsA(plannedstmt->planTree, Gather) is false ?
> > > > > >
> > > > > > I did some test but did not find a case like that.
> > > > > >
> > > > > This seems like an extra check.  Apart from that if we combine 0001
> > > > > and 0002 there should be an additional protection so that it should
> > > > > not happen that in cost_gather we have ignored the parallel tuple cost
> > > > > and now we are rejecting the parallel insert. Probably we should add
> > > > > an assert.
> > > >
> > > > Yeah it's an extra check. I don't think we need that extra check 
> > > > IsA(plannedstmt->planTree, Gather). GatherState check is enough. I 
> > > > verified it as follows: the gatherstate will be allocated and 
> > > > initialized with the plan tree in ExecInitGather which are the ones we 
> > > > are checking here. So, there is no chance that the plan state is 
> > > > GatherState and the plan tree will not be Gather.  I will remove 
> > > > IsA(plannedstmt->planTree, Gather) check in the next version of the 
> > > > patch set.
> > > >
> > > > Breakpoint 4, ExecInitGather (node=0x5647f98ae994 
> > > > , estate=0x1ca8, eflags=730035099) at 
> > > > nodeGather.c:61
> > > > (gdb) p gatherstate
> > > > $10 = (GatherState *) 0x5647fac83850
> > > > (gdb) p gatherstate->ps.plan
> > > > $11 = (Plan *) 0x5647fac918a0
> > > >
> > > > Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, 
> > > > queryDesc=0x5647fac835e0) at createas.c:663
> > > > 663 {
> > > > (gdb) p ps
> > > > $13 = (PlanState *) 0x5647fac83850
> > > > (gdb) p ps->plan
> > > > $14 = (Plan *) 0x5647fac918a0
> > > >
> > > Hope you did not miss the second part of my comment
> > > "
> > > > Apart from that if we combine 0001
> > > > and 0002 there should be additional protection so that it should
> > > > not happen that in cost_gather we have ignored the parallel tuple cost
> > > > and now we are rejecting the parallel insert. Probably we should add
> > > > an assert.
> > > "
> >
> > IIUC, we need to set a flag in cost_gather(in 0002 patch) whenever we
> > ignore the parallel tuple cost and while checking to allow or disallow
> > parallel inserts in IsParallelInsertInCTASAllowed(), we need to add an
> > assert something like Assert(cost_ignored_in_cost_gather && allow)
> > before return allow;
> >
> > This assertion fails 1) either if we have not ignored the cost but
> > allowing parallel inserts 2) or we ignored the cost but not allowing
> > parallel inserts.
> >
> > 1) seems to be fine, we can go ahead and perform parallel inserts. 2)
> > is the concern that the planner would have wrongly chosen the parallel
> > plan, but in this case also isn't it better to go ahead with the
> > parallel plan instead of failing the query?
> >
> > +/*
> > + * We allow parallel inserts by the workers only if the Gather 
> > node has
> > + * no projections to perform and if the upper node is Gather. In 
> > case,
> > + * the Gather node has projections, which is possible if there are 
> > any
> > + * subplans in the query, the workers can not do those 
> > projections. And
> > + * when the upper node is GatherMerge, then the leader has to 
> > perform
> > + * the final phase i.e. merge the results by workers.
> > + */
> > +allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
> > +plannedstmt->parallelModeNeeded &&
> > +plannedstmt->planTree &&
> > +plannedstmt->planTree->lefttree &&
> > +plannedstmt->planTree->lefttree->parallel_aware &&
> > +plannedstmt->planTree->lefttree->parallel_safe;
> > +
> > +return allow;
> > +}
>
> I added the assertion into the 0002 patch so that it fails when the
> planner ignores parallel tuple cost and may choose parallel plan but
> later we don't allow parallel inserts. make check and make check-world
> passeses without any assertion failures.
>
> Attaching v11 patch set

Re: [Patch] ALTER SYSTEM READ ONLY

2020-12-14 Thread Amul Sul
On Mon, Dec 14, 2020 at 11:28 AM Amul Sul  wrote:
>
> On Thu, Dec 10, 2020 at 6:04 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> > > That's not good. On a typical busy system, a system is going to be in
> > > the middle of a checkpoint most of the time, and the checkpoint will
> > > take a long time to finish - maybe minutes.
> >
> > Or hours, even. Due to the cost of FPWs it can make a lot of sense to
> > reduce the frequency of that cost...
> >
> >
> > > We want this feature to respond within milliseconds or a few seconds,
> > > not minutes. So we need something better here.
> >
> > Indeed.
> >
> >
> > > I'm inclined to think
> > > that we should try to CompleteWALProhibitChange() at the same places
> > > we AbsorbSyncRequests(). We know from experience that bad things
> > > happen if we fail to absorb sync requests in a timely fashion, so we
> > > probably have enough calls to AbsorbSyncRequests() to make sure that
> > > we always do that work in a timely fashion. So, if we do this work in
> > > the same place, then it will also be done in a timely fashion.
> >
> > Sounds sane, without having looked in detail.
> >
>
> Understood & agreed that we need to change the system state as soon as 
> possible.
>
> I can see AbsorbSyncRequests() is called from 4 routines:
> CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
> CheckpointerMain().  Out of the 4, the first three execute with interrupts
> on hold, which will cause a problem when we emit a barrier and wait for that
> barrier to be absorbed by all processes, including ourselves, leading to an
> infinite wait. I think that can be fixed by teaching
> WaitForProcSignalBarrier() not to wait on self to absorb the barrier.  Let it
> get absorbed at a later point in time when interrupts are resumed.  I assumed
> that we cannot do barrier processing right away since there could be other
> barriers (maybe in the future), including ours, that should not be processed
> while interrupts are on hold.
>

CreateCheckPoint() holds the CheckpointLock LWLock at the start and releases it
at the end, which puts interrupts on hold.  It is kind of surprising that we
hold this lock and keep interrupts on hold for such a long time.  We need
CheckpointLock only to ensure that one checkpoint happens at a time. Can't we
do something simpler to ensure that instead of the lock? Holding off interrupts
for so long doesn't seem to be a good idea. Thoughts/Suggestions?
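
For what it's worth, one very rough alternative sketch (purely hypothetical,
not something proposed in the patch): keep only an "in progress" flag in
checkpointer shared memory instead of holding CheckpointLock, so interrupts do
not have to stay held across the whole checkpoint. ckpt_in_progress would be a
new pg_atomic_flag, initialized at shared-memory creation time and cleared on
error as well.

	if (!pg_atomic_test_set_flag(&CheckpointerShmem->ckpt_in_progress))
		ereport(ERROR,
				(errmsg("a checkpoint is already in progress")));

	/* ... perform the checkpoint with interrupts serviceable ... */

	pg_atomic_clear_flag(&CheckpointerShmem->ckpt_in_progress);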

Regards,
Amul




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-14 Thread Fujii Masao




On 2020/12/14 14:36, Bharath Rupireddy wrote:

On Mon, Dec 14, 2020 at 9:38 AM Fujii Masao  wrote:
 > On 2020/12/12 15:05, Bharath Rupireddy wrote:
> >  > On Sat, Dec 12, 2020 at 12:19 AM Fujii Masao  wrote:
 > >  > I was thinking that in the case of drop of user mapping or server, 
hash_search(ConnectionHash) in GetConnection() cannot find the cached connection entry invalidated 
by that drop. Because "user->umid" used as hash key is changed. So I was thinking that 
that invalidated connection will not be closed nor reconnected.
 > >  >
 > >
 > > You are right in saying that the connection leaks.
 > >
 > > Use case 1:
 > > 1) Run foreign query in session1 with server1, user mapping1
 > > 2) Drop user mapping1 in another session2, invalidation message gets 
logged which will have to be processed by other sessions
 > > 3) Run foreign query again in session1, at the start of txn, the cached entry gets 
invalidated via pgfdw_inval_callback(). Whatever may be the type of foreign query (select, 
update, explain, delete, insert, analyze etc.), upon next call to GetUserMapping() from 
postgres_fdw.c, the cache lookup fails(with ERROR:  user mapping not found for 
"") since the user mapping1 has been dropped in session2 and the query will also 
fail before reaching GetConnection() where the connections associated with invalidated entries 
would have got disconnected.
 > >
 > > So, the connection associated with invalidated entry would remain until 
the local session exits which is a problem to solve.
 > >
 > > Use case 2:
 > > 1) Run foreign query in session1 with server1, user mapping1
 > > 2) Try to drop foreign server1, then we would not be allowed to do so 
because of dependency. If we use CASCADE, then the dependent user mapping1 and 
foreign tables get dropped too [1].
 > > 3) Run foreign query again in session1, at the start of txn, the cached 
entry gets invalidated via pgfdw_inval_callback(), it fails because there is no 
foreign table and user mapping1.
 > >
 > > But, note that the connection remains open in session1, which is again a 
problem to solve.
 > >
 > > To solve the above connection leak problem, it looks like the right place 
to close all the invalid connections is pgfdw_xact_callback(), once registered, which 
gets called at the end of every txn in the current session (by then all the sub txns 
also would have been finished). Note that if there are too many invalidated entries, 
then one of the following txns has to bear running this extra code, but that's better 
than having leaked connections. Thoughts? If okay, I can code a separate patch.
 >
 > Thanks for further analysis! Sounds good. Also +1 for making it as separate 
patch. Maybe only this patch needs to be back-patched.

Thanks. Yeah, once we agree on the fix, +1 to back-patch. Shall I start a separate 
thread for the connection leak issue and patch, in case others have different 
thoughts?


Yes, of course!




 > > static void
 > > pgfdw_xact_callback(XactEvent event, void *arg)
 > > {
 > >      HASH_SEQ_STATUS scan;
 > >      ConnCacheEntry *entry;
 > > *     /* HERE WE CAN LOOK FOR ALL INVALIDATED ENTRIES AND DISCONNECT 
THEM*/*
 >
 > This may cause the connection to be closed before sending COMMIT TRANSACTION 
command to the foreign server, i.e., the connection is closed in the middle of the 
transaction. So as you explained before, we should avoid that? If this my 
understanding is right, probably the connection should be closed after COMMIT 
TRANSACTION command is sent to the foreign server. What about changing the 
following code in pgfdw_xact_callback() so that it closes the connection even when 
it's marked as invalidated?

You are right! I'm posting what I have in my mind for fixing this connection 
leak problem.

/* tracks whether any work is needed in callback functions */
static bool xact_got_connection = false;
/* tracks whether there exists at least one invalid connection in the 
connection cache */
*static bool invalid_connections_exist = false;*

static void
pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue)
{

         /* hashvalue == 0 means a cache reset, must clear all state */
         if (hashvalue == 0 ||
             (cacheid == FOREIGNSERVEROID &&
              entry->server_hashvalue == hashvalue) ||
             (cacheid == USERMAPPINGOID &&
              entry->mapping_hashvalue == hashvalue))
         {
             entry->invalidated = true;
*            invalid_connections_exist = true;*
     }

static void
pgfdw_xact_callback(XactEvent event, void *arg)
{
     HASH_SEQ_STATUS scan;
     ConnCacheEntry *entry;

     /* Quick exit if no connections were touched in this transaction or there 
are no invalid connections in the cache. */
     if (!xact_got_connection *&& !invalid_connections_exist)*
         return;

         /*
          * If the 
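
For reference, the cleanup being discussed (closing a connection after
COMMIT/ABORT has been sent, even when its cache entry was invalidated) would
roughly amount to extending the existing check at the bottom of
pgfdw_xact_callback(); a sketch, not the actual patch:

	if (PQstatus(entry->conn) != CONNECTION_OK ||
		PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
		entry->changing_xact_state ||
		entry->invalidated)		/* new: also discard invalidated connections */
	{
		elog(DEBUG3, "discarding connection %p", entry->conn);
		disconnect_pg_server(entry);
	}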

Re: Insert Documentation - Returning Clause and Order

2020-12-14 Thread Ashutosh Bapat
On Sat, Dec 12, 2020 at 8:41 PM David G. Johnston
 wrote:
>
> On Sat, Dec 12, 2020 at 7:02 AM James Coleman  wrote:
>>
>>
>> Certainly almost every ORM, and maybe even other forms of application
>> code, need to be able to associate the serial column value returned
>> with what it inserted.
>
>
> Yet most ORMs would perform single inserts at a time, not in bulk, making such 
> a feature irrelevant to them.
>
> I don't think having such a feature is all that important personally, but the 
> question comes every so often and it would be nice to be able to point at the 
> documentation for a definitive answer - not just one inferred from a lack of 
> documentation - especially since the observed behavior is that order is 
> preserved today.
>

That's a valid use case, but adding such a guarantee to the documentation
would restrict the implementation. So at best we can say "no order is
guaranteed". But we document only what is guaranteed. Anything not written in
the documentation is not guaranteed.

There are ways to get it working, but let's not go into those details
in this thread.

-- 
Best Wishes,
Ashutosh Bapat




Re: Feature improvement for pg_stat_statements

2020-12-14 Thread Fujii Masao



On 2020/12/14 18:17, Seino Yuki wrote:

On 2020-12-09 21:14, Fujii Masao wrote:

On 2020/12/09 20:42, Li Japin wrote:

Hi,


On Dec 9, 2020, at 6:37 PM, Seino Yuki  wrote:

On 2020-12-01 01:04, Fujii Masao wrote:

On 2020/11/30 23:05, Seino Yuki wrote:

On 2020-11-30 15:02, Fujii Masao wrote:

On 2020/11/30 12:08, Seino Yuki wrote:

On 2020-11-27 22:39, Fujii Masao wrote:

On 2020/11/27 21:39, Seino Yuki wrote:

On 2020-11-27 21:37, Seino Yuki wrote:

On 2020-11-16 12:28, Seino Yuki wrote:

Due to a similar fix, we have also merged the patches discussed in the
following thread.
https://commitfest.postgresql.org/30/2736/ 

The patch is posted on the 2736 side of the thread.
Regards.

I forgot the attachment and will resend it.
The following patches have been committed and we have created a
compliant patch for them.
https://commitfest.postgresql.org/30/2736/ 

Please confirm.

Thanks for updating the patch! Here are review comments from me.
+    OUT reset_exec_time TIMESTAMP WITH TIME ZONE
I prefer "stats_reset" as the name of this column for the sake of
consistency with the similar column in other stats views like
pg_stat_database.
Isn't it better to use lower cases for "TIMESTAMP WITH TIME ZONE"
because other DDLs in pg_stat_statements do that for the declaraion
of data type?
@@ -565,6 +568,8 @@ pgss_shmem_startup(void)
 pgss->n_writers = 0;
 pgss->gc_count = 0;
 pgss->stats.dealloc = 0;
+    pgss->stats.reset_exec_time = 0;
+    pgss->stats.reset_exec_time_isnull = true;
The reset time should be initialized with GetCurrentTimestamp() instead
of 0? If so, I think that we can completely get rid of reset_exec_time_isnull.
+    memset(nulls, 0, sizeof(nulls));
If some entries in "values[]" may not be set, "values[]" also needs to
be initialized with 0.
MemSet() should be used, instead?
+    /* Read dealloc */
+    values[0] = stats.dealloc;
Int64GetDatum() should be used here?
+    reset_ts = GetCurrentTimestamp();
GetCurrentTimestamp() needs to be called only when all the entries
are removed. But it's called even in other cases.
Regards,

Thanks for the review!
Fixed the patch.

Thanks for updating the patch! Here are another review comments.
+   reset_exec_time timestamp
with time zone
You forgot to update the column name in the doc?
+    Shows the time at which
pg_stat_statements_reset(0,0,0) was last called.
What about updating this to something like "Time at which all statistics
in the pg_stat_statements view were last reset." for the sake of
consistency with the description of the stats_reset column in other
stats views?
+    /* Read stats_reset */
+    values[1] = stats.stats_reset;
TimestampTzGetDatum() seems necessary.
+    reset_ts = GetCurrentTimestamp();
 /* Reset global statistics for pg_stat_statements */
Isn't it better to call GetCurrentTimestamp() before taking
an exclusive lwlock, in entry_reset()?
Regards,

Thanks for the new comment.
I got the following pointers earlier.

+    reset_ts = GetCurrentTimestamp();
GetCurrentTimestamp() needs to be called only when all the entries
are removed. But it's called even in other cases.

Which do you think is better? I think the new suggestion is better,
because entry_reset() is not likely to be called often.

I was thinking that GetCurrentTimestamp() should be called before
pgss->lock lwlock is taken, only when all three arguments userid, dbid
and queryid are zero. But on second thought, we should call
GetCurrentTimestamp() and reset the stats, after the following codes?
/* All entries are removed? */
if (num_entries != num_remove)
goto release_lock;
That is, IMO that even when pg_stat_statements_reset() with non-zero
arguments is executed, if it removes all the entries, we should reset
the stats. Thought?
Regards,


+1. Fixed the patch.
We tested various reset patterns and they seemed to be fine.
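
For clarity, a sketch of the entry_reset() flow being agreed on above (names
follow pg_stat_statements.c and the patch under discussion; illustrative only):

	/* All entries are removed? */
	if (num_entries != num_remove)
		goto release_lock;

	/* Reset global statistics, only now that we know everything was removed */
	{
		volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;
		TimestampTz stats_reset = GetCurrentTimestamp();

		SpinLockAcquire(&s->mutex);
		s->stats.dealloc = 0;
		s->stats.stats_reset = stats_reset;
		SpinLockRelease(&s->mutex);
	}

release_lock:
	LWLockRelease(pgss->lock);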



Since BlessTupleDesc() is for SRFs according to the comments, I think, we can
remove it here.  Correct?

+       if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+               elog(ERROR, "return type must be a row type");
+
+       tupdesc = BlessTupleDesc(tupdesc);


Here are other comments from me:

The patch seems to contain whitespace errors.
You can see them by "git diff --check".

+  
+    Time at which all statistics in the pg_stat_statements view
were last reset.
+  

"pg_stat_statements" in the above should be enclosed with
 and .

Regards,


Thank you for your comments, Mr. Fujii and Mr. Li.
I've posted a patch reflecting your comments.
Please check it out.


Thanks for updating the patch!

+   reset_ts = GetCurrentTimestamp();
+   /* Reset global statistics for pg_stat_statements */
+   {
+   volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;
+
+   SpinLockAcquire(&s->mutex);
+  

Re: Feature Proposal: Add ssltermination parameter for SNI-based LoadBalancing

2020-12-14 Thread Heikki Linnakangas

On 12/12/2020 13:52, Lukas Meisegeier wrote:

Thanks for the provided ideas :)
I use HaProxy for my load-balancing and unfortunately I can't define
that I want to listen on a port for both ssl and non ssl requests.


Could you configure HaProxy to listen on separate ports for SSL and 
non-SSL connections, then? And forward both to the same Postgres server.



That means if I try to return a fixed response 'S' on the SSLRequest it
fails with an SSL-Handshake failure cause the server expects a ssl message.


That doesn't sound right to me, or perhaps I have misunderstood what you 
mean. If you don't send the SSLRequest to the Postgres server, but "eat" 
it in the proxy, the Postgres server will not try to do an SSL handshake.



I have to say the psql ssl handshake procedure is really unique and
challenging :D


Yeah. IMAP and SMTP can use "STARTTLS" to switch an unencrypted 
connection to encrypted, though. That's pretty similar to the 
'SSLRequest' message used in the postgres protocol.


- Heikki




Re: anonymous block returning like a function

2020-12-14 Thread Pavel Stehule
On Mon, Dec 14, 2020 at 14:31, Heikki Linnakangas  wrote:

> On 11/12/2020 21:06, PegoraroF10 wrote:
> > I would like to have an anonymous block, like DO, but having results,
> like an
> > usual function does.
> >
> > I know any user can do ...
> >
> > create function pg_temp.run_time_bigger(numeric,numeric) returns numeric
> > language plpgsql as $$
> > begin if $1 > $2 then return $1; else return $2; end if; end;$$;
> > select * from pg_temp.run_time_bigger(5,3);
> > drop function pg_temp.run_time_bigger(numeric,numeric);
> >
> > but would be better if he could ...
> > execute block(numeric,numeric) returns numeric language plpgsql as $$
> > begin if $1 > $2 then return $1; else return $2; end if; end;$$
> > USING(5,3);
> >
> > That USING would be params, but if it complicates it could be easily be
> > replaced by real values because that block is entirely created in run
> time,
> > so its optional.
> >
> > What do you think about ?
>
> Yeah, I think that would be useful. This was actually proposed and
> discussed back in 2014 ([1], but it didn't lead to a patch. Not sure if
> it's been discussed again after that.
>
> > What part of postgres code do I have to carefully understand to write
> > something to do that ?
>
> Hmm, let's see. You'll need to modify the grammar in src/backend/gram.y,
> to accept the USING clause. DoStmt struct needs a new 'params' field to
> carry the params from the parser to the PL execution, I think you can
> look at how that's done for ExecuteStmt or CallStmt for inspiration.
> ExecuteDoStmt() needs some changes to pass the params to the 'laninline'
> handler of the PL language. And finally, the 'laninline' implementations
> of all the built-in languages needs to be modified to accept the
> parameters, like plpgsql_compile_inline() function for PL/pgSQL. For
> languages provided as extensions, there should be some mechanism to fail
> gracefully, if the PL implementation hasn't been taught about the
> parameters yet.
>
> [1]
> https://www.postgresql.org/message-id/1410849538.4296.19.camel%40localhost


Parametrization of the DO statement can be a first step, and just this
functionality can be pretty useful. Today, the code can be a modification of
the CALL statement.

There should be a discussion about whether the DO statement will be more like a
procedure or more like a function. Now, the DO statement is more a procedure than
a function, and I think that is correct. Probably one day procedures will be able
to return multiple record sets, and then it will be easy to push the same
functionality to the DO statement.

Oracle has a nice CTE enhancement:

WITH
  FUNCTION with_function(p_id IN NUMBER) RETURN NUMBER IS
  BEGIN
RETURN p_id;
  END;
SELECT with_function(id)
FROM   t1
WHERE  rownum = 1

It would be nice to have this feature in Postgres. We don't need to invent
new syntax.

Regards

Pavel


>
> - Heikki
>
>
>


Re: anonymous block returning like a function

2020-12-14 Thread Heikki Linnakangas

On 11/12/2020 21:06, PegoraroF10 wrote:

I would like to have an anonymous block, like DO, but having results, like a
usual function does.

I know any user can do ...

create function pg_temp.run_time_bigger(numeric,numeric) returns numeric
language plpgsql as $$
begin if $1 > $2 then return $1; else return $2; end if; end;$$;
select * from pg_temp.run_time_bigger(5,3);
drop function pg_temp.run_time_bigger(numeric,numeric);

but it would be better if he could ...
execute block(numeric,numeric) returns numeric language plpgsql as $$
begin if $1 > $2 then return $1; else return $2; end if; end;$$
USING(5,3);

That USING would be params, but if it complicates things it could easily be
replaced by real values because that block is entirely created at run time,
so it's optional.

What do you think about ?


Yeah, I think that would be useful. This was actually proposed and 
discussed back in 2014 ([1]), but it didn't lead to a patch. Not sure if 
it's been discussed again after that.



What part of postgres code do I have to carefully understand to write
something to do that ?


Hmm, let's see. You'll need to modify the grammar in src/backend/gram.y, 
to accept the USING clause. DoStmt struct needs a new 'params' field to 
carry the params from the parser to the PL execution, I think you can 
look at how that's done for ExecuteStmt or CallStmt for inspiration. 
ExecuteDoStmt() needs some changes to pass the params to the 'laninline' 
handler of the PL language. And finally, the 'laninline' implementations 
of all the built-in languages needs to be modified to accept the 
parameters, like plpgsql_compile_inline() function for PL/pgSQL. For 
languages provided as extensions, there should be some mechanism to fail 
gracefully, if the PL implementation hasn't been taught about the 
parameters yet.


[1] 
https://www.postgresql.org/message-id/1410849538.4296.19.camel%40localhost
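
To illustrate the parser-side part, the struct change could be as small as this; it is only a sketch, where the 'params' field is hypothetical and everything else matches the current DoStmt:

/* Sketch only: DoStmt as it exists today in parsenodes.h, plus a
 * hypothetical 'params' field filled in from the new USING clause. */
typedef struct DoStmt
{
	NodeTag		type;
	List	   *args;		/* list of DefElem nodes (language, code body) */
	List	   *params;		/* hypothetical: USING expression list, or NIL */
} DoStmt;

ExecuteDoStmt() would then evaluate those expressions and pass the resulting values down to the language's 'laninline' handler, which is the part each PL has to learn about, as described above.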


- Heikki




Re: a misbehavior of partition row movement (?)

2020-12-14 Thread Amit Langote
On Fri, Nov 20, 2020 at 8:55 PM Amit Langote  wrote:
> On Sat, Oct 3, 2020 at 8:26 PM Amit Langote  wrote:
> > On Sat, Oct 3, 2020 at 8:15 PM Tomas Vondra  
> > wrote
> > > I think we need to differentiate between master and backbranches. IMO we
> > > should try to make it "just work" in master, and the amount of code
> > > should not be an issue there I think (no opinion on whether insert and
> > > update trigger is the way to go). For backbranches we may need to do
> > > something less intrusive, of course.
> >
> > Sure, that makes sense.  I will try making a patch for HEAD to make it
> > just work unless someone beats me to it.
>
> After working on this for a while, here is my proposal for HEAD.
>
> To reiterate, the problem occurs when an UPDATE of a partitioned PK
> table is turned into a DELETE + INSERT.  In that case, the UPDATE RI
> triggers are not fired at all, but DELETE ones are, so the foreign key
> may fail to get enforced correctly.  For example, even though the new
> row after the update is still logically present in the PK table, it
> would wrongly get deleted because of the DELETE RI trigger firing if
> there's a ON DELETE CASCADE clause on the foreign key.
>
> To fix that, I propose that we teach trigger.c to skip queuing the
> events that would be dangerous to fire, such as that for the DELETE on
> the source leaf partition mentioned above.  Instead, queue an UPDATE
> event on the root target table, matching the actual operation being
> performed.  Note though that this new arrangement doesn't affect the
> firing of any other triggers except those that are relevant to the
> reported problem, viz. the PK-side RI triggers.  All other triggers
> fire exactly as they did before.
>
> To make that happen, I had to:
>
> 1. Make RI triggers available on partitioned tables at all, which they
> are not today.  Also, mark the RI triggers in partitions correctly as
> *clones* of the RI triggers in their respective parents.
>
> 2. Make it possible to allow AfterTriggerSaveEvent() to access the
> query's actual target relation, that is, in addition to the target
> relation on which an event was fired.  Also, added a bunch of code to
> AFTER TRIGGER infrastructure to handle events fired on partitioned
> tables.  Because those events cannot contain physical references to
> affected tuples, I generalized the code currently used to handle after
> triggers on foreign tables by storing the tuples in and retrieving
> them from a tuple store.  I read a bunch of caveats of that
> implementation (such as its uselessness for deferred triggers), but
> for the limited cases for which it will be used for partitioned
> tables, it seems safe, because it won't be used for deferred triggers
> on partitioned tables.
>
> Attached patches 0001 and 0002 implement 1 and 2, respectively.
> Later, I will post an updated version of the patch for the
> back-branches, which, as mentioned upthread, is to prevent the
> cross-partition updates on foreign key PK tables.

I have created a CF entry for this:

https://commitfest.postgresql.org/31/2877/

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Improving spin-lock implementation on ARM.

2020-12-14 Thread Krunal Bauskar
Wondering if we can take this to completion (any idea what more we could
do?).

On Thu, 10 Dec 2020 at 14:48, Krunal Bauskar 
wrote:

>
> On Tue, 8 Dec 2020 at 14:33, Krunal Bauskar 
> wrote:
>
>>
>>
>> On Thu, 3 Dec 2020 at 21:32, Tom Lane  wrote:
>>
>>> Krunal Bauskar  writes:
>>> > Any updates or further inputs on this.
>>>
>>> As far as LSE goes: my take is that tampering with the
>>> compiler/platform's default optimization options requires *very*
>>> strong evidence, which we have not got and likely won't get.  Users
>>> who are building for specific hardware can choose to supply custom
>>> CFLAGS, of course.  But we shouldn't presume to do that for them,
>>> because we don't know what they are building for, or with what.
>>>
>>> I'm very willing to consider the CAS spinlock patch, but it still
>>> feels like there's not enough evidence to show that it's a universal
>>> win.  The way to move forward on that is to collect more measurements
>>> on additional ARM-based platforms.  And I continue to think that
>>> pgbench is only a very crude tool for testing spinlock performance;
>>> we should look at other tests.
>>>
>>
>> Thanks Tom.
>>
>> Given pgbench's limited options, I decided to try things with sysbench to
>> expose the real contention using the zipfian type (the zipfian pattern causes
>> part of the database to get updated, thereby exposing the main contention point).
>>
>>
>> 
>> *Baseline for 256 threads update-index use-case:*
>> -   44.24%174935  postgres postgres [.]
>> s_lock
>> transactions:
>> transactions:5587105 (92988.40 per sec.)
>>
>> *Patched for 256 threads update-index use-case:*
>>  0.02%80  postgres  postgres  [.] s_lock
>> transactions:
>> transactions:10288781 (171305.24 per sec.)
>>
>> *perf diff*
>>
>> * 0.02%+44.22%  postgres [.] s_lock*
>> 
>>
>> As we can see from the above results, s_lock is exposing major contention that
>> could be relaxed using the said CAS patch. A performance improvement in the
>> range of 80% is observed.
>>
>> Taking this as a guideline, we decided to run it across all scalability levels
>> for update and non-update use cases.
>> Check the attached graph. Consistent improvement is observed.
>>
>> I presume this should help re-establish that for major contention cases
>> the existing TAS approach will always give up.
>>
>>
>> ---
>>
>> Unfortunately, I don't have access to ARM architectures other than
>> Kunpeng and Graviton2, where we have already proved the value of the patch.
>> [ref: On Apple M1, as per your evaluation, the patch doesn't show a regression
>> for select. Maybe, if possible, you could try update scenarios too.]
>>
>> Do you know anyone from the community who has access to other ARM
>> architectures whom we could ask to evaluate it?
>> But since it has been proven on 2 independent ARM architectures, I am pretty
>> confident it will scale on other ARM architectures too.
>>
>>
>
> Any direction on how we can proceed on this?
>
> * We have tested it with both cloud vendors that provide ARM instances.
> * We have tested it with Apple M1 (partially, at least).
> * Ampere used to provide instances on packet.com, but that is now an
> evaluation-only program.
>
> No other cloud provider actively offers ARM instances.
>
> Given that our evaluation so far has been positive, can we consider it on the
> basis of the available data, which is quite encouraging, with an 80% improvement
> seen for heavy-contention use cases?
>
>
>
>>
>>> From a system structural standpoint, I seriously dislike that lwlock.c
>>> patch: putting machine-specific variant implementations into that file
>>> seems like a disaster for maintainability.  So it would need to show a
>>> very significant gain across a range of hardware before I'd want to
>>> consider adopting it ... and it has not shown that.
>>>
>>> regards, tom lane
>>>
>>
>>
>> --
>> Regards,
>> Krunal Bauskar
>>
>
>
> --
> Regards,
> Krunal Bauskar
>


-- 
Regards,
Krunal Bauskar


Re: Add Information during standby recovery conflicts

2020-12-14 Thread Fujii Masao




On 2020/12/05 12:38, Masahiko Sawada wrote:

On Fri, Dec 4, 2020 at 7:22 PM Drouvot, Bertrand  wrote:


Hi,

On 12/4/20 2:21 AM, Fujii Masao wrote:


On 2020/12/04 9:28, Masahiko Sawada wrote:

On Fri, Dec 4, 2020 at 2:54 AM Fujii Masao
 wrote:




On 2020/12/01 17:29, Drouvot, Bertrand wrote:

Hi,

On 12/1/20 12:35 AM, Masahiko Sawada wrote:




On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera
 wrote:

On 2020-Dec-01, Fujii Masao wrote:


+ if (proc)
+ {
+ if (nprocs == 0)
+ appendStringInfo(&buf, "%d", proc->pid);
+ else
+ appendStringInfo(&buf, ", %d", proc->pid);
+
+ nprocs++;

What happens if all the backends in wait_list have gone? In
other words,
how should we handle the case where nprocs == 0 (i.e., nprocs
has not been
incremented at all)? This would very rarely happen, but it can happen.
In this case, since buf.data is empty, at least there seems no
need to log
the list of conflicting processes in detail message.

Yes, I noticed this too; this can be simplified by changing the
condition in the ereport() call to be "nprocs > 0" (rather than
wait_list being null), otherwise not print the errdetail.  (You
could
test buf.data or buf.len instead, but that seems uglier to me.)

+1

Maybe we can also improve the comment of this function from:

+ * This function also reports the details about the conflicting
+ * process ids if *wait_list is not NULL.

to " This function also reports the details about the conflicting
process ids if exist" or something.


Thank you all for the review/remarks.

They have been addressed in the new attached patch version.


Thanks for updating the patch! I read through the patch again
and applied the following changes to it. Attached is the updated
version of the patch. Could you review this version? If there is
no issue in it, I'm thinking to commit this version.


Thank you for updating the patch! I have one question.



+   timeouts[cnt].id = STANDBY_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;

Maybe STANDBY_TIMEOUT should be STANDBY_DEADLOCK_TIMEOUT here?
I changed the code that way.


As the comment of ResolveRecoveryConflictWithLock() says the
following, a deadlock is detected by the ordinary backend process:

   * Deadlocks involving the Startup process and an ordinary backend process
   * will be detected by the deadlock detector within the ordinary backend.

If we use STANDBY_DEADLOCK_TIMEOUT,
SendRecoveryConflictWithBufferPin() will be called after
DeadlockTimeout passed, but I think it's not necessary for the startup
process in this case.


Thanks for pointing this! You are right.



If we want to just wake up the startup process
maybe we can use STANDBY_TIMEOUT here?



Thanks for the patch updates! Except for what we are still discussing below,
it looks good to me.


When STANDBY_TIMEOUT happens, a request to release conflicting buffer
pins is sent. Right? If so, shouldn't we avoid using STANDBY_TIMEOUT there as well?


Agree



Or, first of all, we don't need to enable the deadlock timer at all?
Since what we'd like to do is to wake up after deadlock_timeout
passes, we can do that by changing ProcWaitForSignal() so that it can
accept the timeout and giving the deadlock_timeout to it. If we do
this, maybe we can get rid of STANDBY_LOCK_TIMEOUT from
ResolveRecoveryConflictWithLock(). Thought?


Where do we enable deadlock timeout in hot standby case? You meant to
enable it in ProcWaitForSignal() or where we set a timer for not hot
standby case, in ProcSleep()?


No, what I tried to say is to change ResolveRecoveryConflictWithLock() so that 
it does

1. calculate the "minimum" timeout from deadlock_timeout and 
max_standby_xxx_delay
2. give the calculated timeout value to ProcWaitForSignal()
3. wait for signal and timeout on ProcWaitForSignal()

Since ProcWaitForSignal() calls WaitLatch(), it seems it's not so difficult to 
make ProcWaitForSignal() handle the timeout. If we do this, I was thinking that 
we could get rid of enable_timeouts() from ResolveRecoveryConflictWithLock().
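
To sketch that idea concretely (only an illustration; ProcWaitForSignal() has no timeout argument today, so the call below is hypothetical, and how the remaining standby delay is obtained is hand-waved):

/* Sketch of the idea only; ProcWaitForSignalWithTimeout() is hypothetical. */
static void
wait_for_lock_conflict_resolution(long standby_delay_ms)
{
	long		timeout_ms;

	/* 1. take the shorter of deadlock_timeout and the remaining delay */
	timeout_ms = Min(DeadlockTimeout, standby_delay_ms);

	/* 2. + 3. sleep until signalled or until that timeout fires; a
	 * timeout-aware ProcWaitForSignal() would boil down to one WaitLatch()
	 * call with WL_LATCH_SET | WL_TIMEOUT, so no timer needs to be enabled. */
	ProcWaitForSignalWithTimeout(PG_WAIT_LOCK, timeout_ms);
}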






Why not simply use (again) the STANDBY_LOCK_TIMEOUT one? (as it triggers
a call to StandbyLockTimeoutHandler() which does nothing, except waking
up. That's what we want, right?)


Right, what I wanted to mean is STANDBY_LOCK_TIMEOUT. The startup
process can wake up and do nothing. Thank you for pointing out.


Okay, understood! At first I was thinking that enabling the same type of timeout (i.e., 
STANDBY_LOCK_TIMEOUT) twice doesn't work properly, but as far as I read 
the code, it works. In that case, only the shorter timeout would be activated 
in enable_timeouts(). So I agree to use STANDBY_LOCK_TIMEOUT.

Regards,

RE: Parallel Inserts in CREATE TABLE AS

2020-12-14 Thread Hou, Zhijie
Hi

> Attaching v11 patch set. Please review it further.

Currently with the patch, we can allow parallel CTAS when the top node is Gather.
When the top node is Append and Gather is a sub-node of Append, I think we can
still enable Parallel CTAS by pushing Parallel CTAS down to the sub-node Gather,
such as:

Append
-->Gather
->Create table
->Seqscan
-->Gather
->create table
->Seqscan

And the use case seems common to me, such as:
select * from A where xxx union all select * from B where xxx;

I attach a WIP patch which just shows the possibility of this feature.
The patch is based on the latest v11 patch.

What do you think?

Best regards,
houzj







0001-patch-for-pctas-in-append.patch
Description: 0001-patch-for-pctas-in-append.patch


Re: Movement of restart_lsn position movement of logical replication slots is very slow

2020-12-14 Thread Amit Kapila
On Mon, Dec 14, 2020 at 9:30 AM Jammie  wrote:
>
> Hello,
>
> We have two logical replication slots in our postgresql database (version-11) 
> instance and we are using pgJDBC to stream data from these two slots.
>

IIUC, you are using some out-of-core output plugin to stream the data?
Are you using the walsender mechanism to decode the changes from slots,
or the SQL APIs?

> We are ensuring that when we regularly send feedback and update the 
> confirmed_flush_lsn (every 10 minutes) for both the slots to the same 
> position. However From our data we have seen that the restart_lsn movement of 
> the two are not in sync and most of the time one of them lags too far behind 
> to hold the WAL files unnecessarily. Here are some data points to indicate 
> the problem
>
> Thu Dec 10 05:37:13 CET 2020
>   slot_name   |  restart_lsn  | confirmed_flush_lsn
> --+---+-
>  db_dsn_metadata_src_private  | 48FB/F3000208 | 48FB/F3000208
>  db_dsn_metadata_src_shared   | 48FB/F3000208 | 48FB/F3000208
> (2 rows)
>
>
>
> Thu Dec 10 13:53:46 CET 2020
>   slot_name  |  restart_lsn  | confirmed_flush_lsn
> -+---+-
>  db_dsn_metadata_src_private | 48FC/2309B150 | 48FC/233AA1D0
>  db_dsn_metadata_src_shared  | 48FC/233AA1D0 | 48FC/233AA1D0
> (2 rows)
>
>
> Thu Dec 10 17:13:51 CET 2020
>   slot_name  |  restart_lsn  | confirmed_flush_lsn
> -+---+-
>  db_dsn_metadata_src_private | 4900/B4C3AE8  | 4900/94FDF908
>  db_dsn_metadata_src_shared  | 48FD/D2F66F10 | 4900/94FDF908
> (2 rows)
>
> Though we are using setFlushLsn() and forceStatusUpdate for both the slot's 
> stream regularly still the slot with name private is far behind the 
> confirmed_flush_lsn and slot with name shared is also behind with 
> confirmed_flush_lsn but not too far. Since the restart_lsn is not moving fast 
> enough, causing lot of issues with WAL log file management and not allowing 
> to delete them to free up disk space
>

What is this setFlushLsn? I am not able to find it in the PG code. If it
is a reference to some outside code then please provide a link to that code.
In general, the restart_lsn and confirmed_flush_lsn are advanced in
different ways so you might see some difference but it should not be
this much. The confirmed_flush_lsn is updated when we get confirmation
from the downstream node about the flush_lsn but restart_lsn is only
incremented based on the LSN required by the oldest in-progress
transaction.

>
> Please note that for the second slot we are not doing reading from the stream 
> rather just sending the feedback.
>

Here, does the second slot refer to 'shared' or 'private'? It is not
very clear what you mean by "we are not doing reading from the
stream"; do you mean to say that decoding happens in the slot but the
output plugin just throws away the streamed data and in the end just
sends the feedback?

-- 
With Regards,
Amit Kapila.




Re: Parallel Inserts in CREATE TABLE AS

2020-12-14 Thread Bharath Rupireddy
On Thu, Dec 10, 2020 at 7:20 PM Bharath Rupireddy
 wrote:
> On Thu, Dec 10, 2020 at 5:19 PM Dilip Kumar  wrote:
> > > > > +   allow = ps && IsA(ps, GatherState) && 
> > > > > !ps->ps_ProjInfo &&
> > > > > +   plannedstmt->parallelModeNeeded &&
> > > > > +   plannedstmt->planTree &&
> > > > > +   IsA(plannedstmt->planTree, Gather) &&
> > > > > +   plannedstmt->planTree->lefttree &&
> > > > > +   
> > > > > plannedstmt->planTree->lefttree->parallel_aware &&
> > > > > +   
> > > > > plannedstmt->planTree->lefttree->parallel_safe;
> > > > >
> > > > > I noticed it check both IsA(ps, GatherState) and 
> > > > > IsA(plannedstmt->planTree, Gather).
> > > > > Does it mean it is possible that IsA(ps, GatherState) is true but 
> > > > > IsA(plannedstmt->planTree, Gather) is false ?
> > > > >
> > > > > I did some test but did not find a case like that.
> > > > >
> > > > This seems like an extra check.  Apart from that if we combine 0001
> > > > and 0002 there should be an additional protection so that it should
> > > > not happen that in cost_gather we have ignored the parallel tuple cost
> > > > and now we are rejecting the parallel insert. Probably we should add
> > > > an assert.
> > >
> > > Yeah it's an extra check. I don't think we need that extra check 
> > > IsA(plannedstmt->planTree, Gather). GatherState check is enough. I 
> > > verified it as follows: the gatherstate will be allocated and initialized 
> > > with the plan tree in ExecInitGather which are the ones we are checking 
> > > here. So, there is no chance that the plan state is GatherState and the 
> > > plan tree will not be Gather.  I will remove IsA(plannedstmt->planTree, 
> > > Gather) check in the next version of the patch set.
> > >
> > > Breakpoint 4, ExecInitGather (node=0x5647f98ae994 
> > > , estate=0x1ca8, eflags=730035099) at 
> > > nodeGather.c:61
> > > (gdb) p gatherstate
> > > $10 = (GatherState *) 0x5647fac83850
> > > (gdb) p gatherstate->ps.plan
> > > $11 = (Plan *) 0x5647fac918a0
> > >
> > > Breakpoint 1, IsParallelInsertInCTASAllowed (into=0x5647fac97580, 
> > > queryDesc=0x5647fac835e0) at createas.c:663
> > > 663 {
> > > (gdb) p ps
> > > $13 = (PlanState *) 0x5647fac83850
> > > (gdb) p ps->plan
> > > $14 = (Plan *) 0x5647fac918a0
> > >
> > Hope you did not miss the second part of my comment
> > "
> > > Apart from that if we combine 0001
> > > and 0002 there should be additional protection so that it should
> > > not happen that in cost_gather we have ignored the parallel tuple cost
> > > and now we are rejecting the parallel insert. Probably we should add
> > > an assert.
> > "
>
> IIUC, we need to set a flag in cost_gather(in 0002 patch) whenever we
> ignore the parallel tuple cost and while checking to allow or disallow
> parallel inserts in IsParallelInsertInCTASAllowed(), we need to add an
> assert something like Assert(cost_ignored_in_cost_gather && allow)
> before return allow;
>
> This assertion fails 1) either if we have not ignored the cost but
> allowing parallel inserts 2) or we ignored the cost but not allowing
> parallel inserts.
>
> 1) seems to be fine, we can go ahead and perform parallel inserts. 2)
> is the concern that the planner would have wrongly chosen the parallel
> plan, but in this case also isn't it better to go ahead with the
> parallel plan instead of failing the query?
>
> +/*
> + * We allow parallel inserts by the workers only if the Gather node 
> has
> + * no projections to perform and if the upper node is Gather. In 
> case,
> + * the Gather node has projections, which is possible if there are 
> any
> + * subplans in the query, the workers can not do those projections. 
> And
> + * when the upper node is GatherMerge, then the leader has to perform
> + * the final phase i.e. merge the results by workers.
> + */
> +allow = ps && IsA(ps, GatherState) && !ps->ps_ProjInfo &&
> +plannedstmt->parallelModeNeeded &&
> +plannedstmt->planTree &&
> +plannedstmt->planTree->lefttree &&
> +plannedstmt->planTree->lefttree->parallel_aware &&
> +plannedstmt->planTree->lefttree->parallel_safe;
> +
> +return allow;
> +}

I added the assertion into the 0002 patch so that it fails when the
planner ignores parallel tuple cost and may choose parallel plan but
later we don't allow parallel inserts. make check and make check-world
pass without any assertion failures.

Attaching v11 patch set. Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v11-0002-Tuple-Cost-Adjustment-for-Parallel-Inserts-in-CTAS.patch
Description: Binary data


v11-0001-Parallel-Inserts-in-CREATE-TABLE-AS.patch
Description: Binary data


Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

2020-12-14 Thread Amit Kapila
On Mon, Dec 14, 2020 at 1:36 AM Noah Misch  wrote:
>
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> > But what jumps out at me here is that this sort of error seems way
> > too easy to make, and evidently way too hard to detect.  What can we
> > do to make it more obvious if one has incorrectly used or omitted
> > HASH_BLOBS?  Both directions of error might easily escape notice on
> > little-endian hardware.
> >
> > I thought of a few ideas, all of which have drawbacks:
> >
> > 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> > This seems to just move the problem somewhere else, besides which
> > it'd require touching an awful lot of callers, and would silently
> > break third-party callers.
> >
> > 2. Don't allow a default: invent a new HASH_STRING flag, and
> > require that hash_create() calls specify exactly one of HASH_BLOBS,
> > HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> > hazard of mindless-copy-and-paste, but I think it might make it
> > a little more obvious.  Still requires touching a lot of calls.
>
> I like (2), for making the bug harder and for greppability.  Probably
> pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.
>
> > 3. Add some sort of heuristic restriction on keysize.  A keysize
> > that's only 4 or 8 bytes almost certainly is not a string.
> > This doesn't give us much traction for larger keysizes, though.
> >
> > 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> > in string_hash().  I think we could get away with that given that
> > SQL disallows empty identifiers.  However, it would only help to
> > catch one direction of error (omitting HASH_BLOBS), and it would
> > only help on big-endian hardware, which is getting harder to find.
> > Still, we could hope that the buildfarm would detect errors.
>
> It's nontrivial to confirm that the empty-string key can't happen for a given
> hash table.  (In contrast, what (3) asserts on is usually a compile-time
> constant.)  I would stop short of adding (4), though it could be okay.
>
> > A quick count of grep hits suggest that the large majority of
> > existing hash_create() calls use HASH_BLOBS, and there might be
> > only order-of-ten calls that would need to be touched if we
> > required an explicit HASH_STRING flag.  So option #2 is seeming
> > kind of attractive.  Maybe that together with an assertion that
> > string keys have to exceed 8 or 16 bytes would be enough protection.
>
> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.
>

+1. I also think that in most cases (2) would be sufficient to avoid such
bugs. Adding a restriction on string size might annoy some out-of-core
users who are already using small strings. However, an 8-byte
restriction on string size would still be okay.
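
For what it's worth, a call site under option (2) might look like the sketch below; HASH_STRINGS is the proposed flag and MyEntry is a placeholder type, so this is only an illustration of the intended usage, not existing code:

/* Sketch of option (2): the key type must be stated explicitly. */
typedef struct MyEntry
{
	char		name[NAMEDATALEN];	/* NUL-terminated string key */
	int			count;
} MyEntry;

static HTAB *
create_name_table(void)
{
	HASHCTL		ctl;

	MemSet(&ctl, 0, sizeof(ctl));
	ctl.keysize = NAMEDATALEN;
	ctl.entrysize = sizeof(MyEntry);

	return hash_create("name table", 128, &ctl,
					   HASH_ELEM | HASH_STRINGS);	/* explicit, no default */
}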

-- 
With Regards,
Amit Kapila.




Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

2020-12-14 Thread Peter Eisentraut

On 2020-12-13 17:49, Tom Lane wrote:

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
hazard of mindless-copy-and-paste, but I think it might make it
a little more obvious.  Still requires touching a lot of calls.


I think this sounds best; we should also expand the documentation of these 
flags a bit.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: Misleading comment in prologue of ReorderBufferQueueMessage

2020-12-14 Thread Amit Kapila
On Mon, Dec 14, 2020 at 2:45 PM Ashutosh Bapat
 wrote:
>
> The name of the function suggests that the given message will be queued in 
> ReorderBuffer. The prologue of the function says so too
>  776 /*
>  777  * Queue message into a transaction so it can be processed upon commit.
>  778  */
> It led me to think that a non-transactional message is processed along with 
> the surrounding transaction, esp. when it has an associated xid.
>
> But in reality, the function queues only a transactional message and decodes 
> a non-transactional message immediately without waiting for a commit.
>
> We should modify the prologue to say
> "Queue a transactional message into a transaction so that it can be processed 
> upon commit. A non-transactional message is processed immediately." and also 
> change the name of the function to ReorderBufferProcessMessage(), but the 
> latter may break API compatibility.
>

+1 for the comment change but I am not sure if it is a good idea to
change the API name.
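
For reference, the reworded prologue could read roughly like this (wording only, no behavior change implied):

/*
 * Queue a transactional message into a transaction so that it can be
 * processed upon commit.  A non-transactional message is processed
 * (decoded and handed to the output plugin) immediately and is not
 * queued.
 */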

-- 
With Regards,
Amit Kapila.




Re: pg_shmem_allocations & documentation

2020-12-14 Thread Benoit Lobréau
Here's a proposal patch.

On Fri, Dec 11, 2020 at 09:58, Benoit Lobréau  wrote:

> Would "NULL for anonymous allocations, since details related to them are
> not known." be ok ?
>
>
> On Fri, Dec 11, 2020 at 09:29, Kyotaro Horiguchi  wrote:
>
>> At Fri, 11 Dec 2020 14:42:45 +0900, Michael Paquier 
>> wrote in
>> > On Fri, Dec 11, 2020 at 11:00:58AM +0900, Kyotaro Horiguchi wrote:
>> > > Although we could just rip some words off, I'd like to propose instead
>> > > to add an explanation why it is not exposed for anonymous allocations,
>> > > like the column allocated_size.
>> >
>> > Indeed, there is a hiccup between what the code does and what the docs
>> > tell: the offset is not NULL for unused memory.
>> >
>> > > -   The offset at which the allocation starts. NULL for anonymous
>> > > -   allocations and unused memory.
>> > > +   The offset at which the allocation starts. For anonymous
>> allocations,
>> > > +   no information about individual allocations is available, so
>> the column
>> > > +   will be NULL in that case.
>> >
>> > I'd say: let's be simple and just remove "and unused memory" because
>> > anonymous allocations are...  Anonymous so you cannot know details
>> > related to them.  That's something easy to reason about, and the docs
>> > were written originally to remain simple.
>>
>> Hmm. I don't object to that.  However, isn't the description for
>> allocated_size too verbose in that sense?
>>
>> regards.
>>
>> --
>> Kyotaro Horiguchi
>> NTT Open Source Software Center
>>
>
From 711f6b60098e67c23a97458ce56893f0ac1afcb6 Mon Sep 17 00:00:00 2001
From: benoit 
Date: Fri, 11 Dec 2020 09:44:51 +0100
Subject: [PATCH] Fix pg_shmem_allocation

---
 doc/src/sgml/catalogs.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 62711ee83f..89ca59b92b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -12493,7 +12493,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
   
The offset at which the allocation starts. NULL for anonymous
-   allocations and unused memory.
+   allocations, since details related to them are not known.
   
  
 
-- 
2.25.4



Re: Feature improvement for pg_stat_statements

2020-12-14 Thread Seino Yuki

On 2020-12-09 21:14, Fujii Masao wrote:

On 2020/12/09 20:42, Li Japin wrote:

Hi,

On Dec 9, 2020, at 6:37 PM, Seino Yuki > wrote:


On 2020-12-01 01:04, Fujii Masao wrote:

On 2020/11/30 23:05, Seino Yuki wrote:

On 2020-11-30 15:02, Fujii Masao wrote:

On 2020/11/30 12:08, Seino Yuki wrote:

On 2020-11-27 22:39, Fujii Masao wrote:

On 2020/11/27 21:39, Seino Yuki wrote:

On 2020-11-27 21:37, Seino Yuki wrote:

On 2020-11-16 12:28, Seino Yuki wrote:
Due to similar fixes, we have also merged the patches discussed in the
following thread.
https://commitfest.postgresql.org/30/2736/ 


The patch is posted on the 2736 side of the thread.
Regards.

I forgot the attachment and will resend it.
The following patches have been committed and we have created a
compliant patch for them.
https://commitfest.postgresql.org/30/2736/ 


Please confirm.

Thanks for updating the patch! Here are review comments from me.
+    OUT reset_exec_time TIMESTAMP WITH TIME ZONE
I prefer "stats_reset" as the name of this column for the sake 
of

consistency with the similar column in other stats views like
pg_stat_database.
Isn't it better to use lower cases for "TIMESTAMP WITH TIME 
ZONE"
because other DDLs in pg_stat_statements do that for the 
declaraion

of data type?
@@ -565,6 +568,8 @@ pgss_shmem_startup(void)
 pgss->n_writers = 0;
 pgss->gc_count = 0;
 pgss->stats.dealloc = 0;
+    pgss->stats.reset_exec_time = 0;
+    pgss->stats.reset_exec_time_isnull = true;
The reset time should be initialized with GetCurrentTimestamp() 
instead
of 0? If so, I think that we can completely get rid of 
reset_exec_time_isnull.

+    memset(nulls, 0, sizeof(nulls));
If some entries in "values[]" may not be set, "values[]" also needs to
be initialized with 0.
MemSet() should be used, instead?
+    /* Read dealloc */
+    values[0] = stats.dealloc;
Int64GetDatum() should be used here?
+    reset_ts = GetCurrentTimestamp();
GetCurrentTimestamp() needs to be called only when all the entries
are removed. But it's called even in other cases.
Regards,

Thanks for the review!
Fixed the patch.

Thanks for updating the patch! Here are more review comments.
+   reset_exec_time timestamp
with time zone
You forgot to update the column name in the doc?
+    Shows the time at which
pg_stat_statements_reset(0,0,0) was last 
called.
What about updating this to something like "Time at which all statistics
in the pg_stat_statements view were last reset." for the sake of
consistency with the description of the stats_reset column in other
stats views?
+    /* Read stats_reset */
+    values[1] = stats.stats_reset;
TimestampTzGetDatum() seems necessary.
+    reset_ts = GetCurrentTimestamp();
 /* Reset global statistics for pg_stat_statements */
Isn't it better to call GetCurrentTimestamp() before taking
an exclusive lwlock, in entry_reset()?
Regards,

Thanks for the new comment.
I got the following pointers earlier.

+    reset_ts = GetCurrentTimestamp();
GetCurrentTimestamp() needs to be called only when all the entries
are removed. But it's called even in other cases.
Which do you think is better? I think the new suggestion is better,
because entry_reset is not likely to be called often.

I was thinking that GetCurrentTimestamp() should be called before
the pgss->lock lwlock is taken, only when all three arguments userid, dbid
and queryid are zero. But on second thought, we should call
GetCurrentTimestamp() and reset the stats after the following code?

/* All entries are removed? */
if (num_entries != num_remove)
goto release_lock;

That is, IMO even when pg_stat_statements_reset() with non-zero
arguments is executed, if it removes all the entries, we should reset
the stats. Thought?
Regards,


+1. Fixed the patch.
We tested various reset patterns and they seemed to be fine.



Since BlessTupleDesc() is for SRFs according to the comments, I think we can
remove it here.  Correct?

+       if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)

+               elog(ERROR, "return type must be a row type");
+
+       tupdesc = BlessTupleDesc(tupdesc);


Here are other comments from me:

The patch seems to contain whitespace errors.
You can see them by "git diff --check".

+  
+Time at which all statistics in the pg_stat_statements view
were last reset.
+  

"pg_stat_statements" in the above should be enclosed with
 and .

Regards,


Thank you for your comments, Mr. Fujii and Mr. Li.
I've posted a patch reflecting your comments.
Please check it out.
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
index 2019a4ffe0..3504ca7eb1 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.8

Misleading comment in prologue of ReorderBufferQueueMessage

2020-12-14 Thread Ashutosh Bapat
The name of the function suggests that the given message will be queued in
ReorderBuffer. The prologue of the function says so too
 776 /*
 777  * Queue message into a transaction so it can be processed upon commit.
 778  */
It led me to think that a non-transactional message is processed along with
the surrounding transaction, esp. when it has an associated xid.

But in reality, the function queues only a transactional message and
decodes a non-transactional message immediately without waiting for a
commit.

We should modify the prologue to say
"Queue a transactional message into a transaction so that it can be
processed upon commit. A non-transactional message is processed
immediately." and also change the name of the function
to ReorderBufferProcessMessage(), but the latter may break API compatibility.

--
Best Wishes,
Ashutosh


Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-12-14 Thread Andrey V. Lepikhov

On 12/1/20 2:02 PM, Amit Langote wrote:

On Tue, Dec 1, 2020 at 2:40 PM tsunakawa.ta...@fujitsu.com
 wrote:

From: Amit Langote 

>> The code appears to require both BeginForeignCopy and EndForeignCopy,
>> while the following documentation says they are optional.  Which is
>> correct?  (I suppose the latter is correct just like other existing
>> Begin/End functions are optional.)

Fixed.

> Anyway, one thing we could do is rename
> ExecRelationAllowsMultiInsert() to ExecSetRelationUsesMultiInsert(

Renamed.

>> I agree with your idea of adding multi_insert argument to 
ExecFindPartition() to request a multi-insert-capable partition.  At 
first, I thought ExecFindPartition() is used for all operations, 
insert/delete/update/select, so I found it odd to add multi_insert 
argument.  But ExecFindPartition() is used only for insert, so 
multi_insert argument seems okay.

>
> Good.  Andrey, any thoughts on this?

I have no serious technical arguments against this, other than code 
readability and reducing the number of routine parameters. Maybe we will 
rethink it later?


The new version rebased on commit 525e60b742 is attached.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 98a6f077cd3b694683ec0e4a3250c040cc33cb39 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Mon, 14 Dec 2020 11:29:03 +0500
Subject: [PATCH 1/2] Move multi-insert decision logic into executor

When 0d5f05cde introduced support for using multi-insert mode when
copying into partitioned tables, it introduced single variable of
enum type CopyInsertMethod shared across all potential target
relations (partitions) that, along with some target relation
properties, dictated whether to engage multi-insert mode for a given
target relation.

Move that decision logic into InitResultRelInfo which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly more readability.
---
 src/backend/commands/copyfrom.c  | 142 ++-
 src/backend/executor/execMain.c  |  52 +
 src/backend/executor/execPartition.c |   7 ++
 src/include/commands/copyfrom_internal.h |  10 --
 src/include/executor/execPartition.h |   2 +
 src/include/executor/executor.h  |   2 +
 src/include/nodes/execnodes.h|   8 +-
 7 files changed, 108 insertions(+), 115 deletions(-)

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 1b14e9a6eb..6d4f6cb80d 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -535,12 +535,10 @@ CopyFrom(CopyFromState cstate)
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			ti_options = 0; /* start with default options for insert */
 	BulkInsertState bistate = NULL;
-	CopyInsertMethod insertMethod;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
-	bool		leafpart_use_multi_insert = false;
 
 	Assert(cstate->rel);
 	Assert(list_length(cstate->range_table) == 1);
@@ -650,6 +648,30 @@ CopyFrom(CopyFromState cstate)
 	resultRelInfo = target_resultRelInfo = makeNode(ResultRelInfo);
 	ExecInitResultRelation(estate, resultRelInfo, 1);
 
+	Assert(target_resultRelInfo->ri_usesMultiInsert == false);
+
+	/*
+	 * It's generally more efficient to prepare a bunch of tuples for
+	 * insertion, and insert them in bulk, for example, with one
+	 * table_multi_insert() call than call table_tuple_insert() separately for
+	 * every tuple. However, there are a number of reasons why we might not be
+	 * able to do this.  For example, if there any volatile expressions in the
+	 * table's default values or in the statement's WHERE clause, which may
+	 * query the table we are inserting into, buffering tuples might produce
+	 * wrong results.  Also, the relation we are trying to insert into itself
+	 * may not be amenable to buffered inserts.
+	 *
+	 * Note: For partitions, this flag is set considering the target table's
+	 * flag that is being set here and partition's own properties which are
+	 * checked by calling ExecSetRelationUsesMultiInsert().  It does not matter
+	 * whether partitions have any volatile default expressions as we use the
+	 * defaults from the target of the COPY command.
+	 */
+	if (!cstate->volatile_defexprs &&
+		!contain_volatile_functions(cstate->whereClause))
+		target_resultRelInfo->ri_usesMultiInsert =
+	ExecSetRelationUsesMultiInsert(target_resultRelInfo, NULL);
+
 	/* Verify the named relation is a valid target for INSERT */
 	CheckValidResultRel(resultRelInfo, CMD_INSERT);
 
@@ -665,6 +687,10 @@ CopyFrom(CopyFromState cstate)
 	mtstate->operation = CMD_INSERT;
 	mtstate->resultRelInfo = resultRelInfo;
 
+	/*
+	 * Init copying process into foreign table. Initialization of copying into
+	 * foreign partit

Some error messages are omitted while recovery.

2020-12-14 Thread Kyotaro Horiguchi
At Mon, 14 Dec 2020 16:48:05 +0900, Michael Paquier  wrote 
in 
> On Mon, Dec 14, 2020 at 11:34:51AM +0900, Kyotaro Horiguchi wrote:
> > Apart from this issue, while checking that, I noticed that if server
> > starts having WALs from a server of a different systemid, the server
> > stops with obscure messages.
> 
> Wouldn't it be better to discuss that on a separate thread?  I have
> mostly missed your message here.

Right. Here is the duplicate of the message.  Thanks for the
suggestion!

=
While in another discussion related to xlogreader [2], I noticed that
if a server starts with WAL files from a server with a different system
identifier, the server stops with obscure messages.


> LOG:  database system was shut down at 2020-12-14 10:36:02 JST
> LOG:  invalid primary checkpoint record
> PANIC:  could not locate a valid checkpoint record

The cause is that XLogPageRead erases the error message set by
XLogReaderValidatePageHeader(). As the comment just above says, this
is required to continue replication in a certain situation. The
code aims to allow replication to continue when the first half of a
continued record has been removed on the primary, so we don't need
that workaround unless we're in standby mode. If we run the rescue
code only while in StandbyMode, we get the correct error message.

> JST LOG:  database system was shut down at 2020-12-14 10:36:02 JST
> LOG:  WAL file is from different database system: WAL file database system 
> identifier is 6905923817995618754, pg_control database system identifier is 
> 6905924227171453468
> JST LOG:  invalid primary checkpoint record
> JST PANIC:  could not locate a valid checkpoint record

I confirmed 0668719801 still works under the intended context using
the steps shown in [1].


[1]: 
https://www.postgresql.org/message-id/flat/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com

[2]: 
https://www.postgresql.org/message-id/flat/2B4510B2-3D70-4990-BFE3-0FE64041C08A%40amazon.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




>From d54531aa2774bad7e426cc16691553fbc8f0b3b3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 14 Dec 2020 11:18:08 +0900
Subject: [PATCH] Don't cancel invalid-page-header error in unwanted situation

Commit 0668719801 is intended to work during streaming replication,
but it cancels the error message regardless of the context. As a
result, ReadRecord fails to show the correct error messages even when
they are required, that is, when not replicating.  Allowing the
cancellation to happen only in standby mode fixes that.
---
 src/backend/access/transam/xlog.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e81ce4f17..770902518d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12055,7 +12055,8 @@ retry:
 	 * Validating the page header is cheap enough that doing it twice
 	 * shouldn't be a big deal from a performance point of view.
 	 */
-	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+	if (StandbyMode &&
+		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/* reset any error XLogReaderValidatePageHeader() might have set */
 		xlogreader->errormsg_buf[0] = '\0';
-- 
2.27.0



Re: Asynchronous Append on postgres_fdw nodes.

2020-12-14 Thread Kyotaro Horiguchi
At Sat, 12 Dec 2020 19:06:51 +0900, Etsuro Fujita  
wrote in 
> On Fri, Nov 20, 2020 at 8:16 PM Kyotaro Horiguchi
>  wrote:
> > I looked through the nodeAppend.c and postgres_fdw.c part and those
> > are I think the core of this patch.
> 
> Thanks again for the review!
> 
> > -* figure out which subplan we are currently processing
> > +* try to get a tuple from async subplans
> > +*/
> > +   if (!bms_is_empty(node->as_needrequest) ||
> > +   (node->as_syncdone && 
> > !bms_is_empty(node->as_asyncpending)))
> > +   {
> > +   if (ExecAppendAsyncGetNext(node, &result))
> > +   return result;
> >
> > The function ExecAppendAsyncGetNext() is a function called only here,
> > and contains only 31 lines.  It doesn't seem to me that the separation
> > makes the code more readable.
> 
> Considering the original ExecAppend() is about 50 lines long, 31 lines
> of code would not be small.  So I'd vote for separating it into
> another function as proposed.

Ok, I no longer oppose separating some part from ExecAppend().

> > -   /* choose new subplan; if none, we're done */
> > -   if (!node->choose_next_subplan(node))
> > +   /* wait or poll async events */
> > +   if (!bms_is_empty(node->as_asyncpending))
> > +   {
> > +   Assert(!node->as_syncdone);
> > +   Assert(bms_is_empty(node->as_needrequest));
> > +   ExecAppendAsyncEventWait(node);
> >
> > You moved the function to wait for events from execAsync to
> > nodeAppend. The former is a generic module that can be used from any
> > kind of executor nodes, but the latter is specialized for nodeAppend.
> > In other words, the abstraction level is lowered here.  What is the
> > reason for the change?
> 
> The reason is just because that function is only called from
> ExecAppend().  I put some functions only called from nodeAppend.c in
> execAsync.c, though.

(I think) You told me that you preferred the genericity of the
original interface, but you're doing the opposite.  If you think we
can move such a generic feature into the Append node, all other
features can be moved the same way.  I guess there's a reason you want
only this specific feature, out of all of them, to be Append-specific,
and I want to know it.

> > +   /* Perform the actual callback. */
> > +   ExecAsyncRequest(areq);
> > +   if (ExecAppendAsyncResponse(areq))
> > +   {
> > +   Assert(!TupIsNull(areq->result));
> > +   *result = areq->result;
> >
> > Putting aside the name of the functions, the first two function are
> > used only this way at only two places. ExecAsyncRequest(areq) tells
> > fdw to store the first tuple among the already received ones to areq,
> > and ExecAppendAsyncResponse(areq) is checking the result is actually
> > set. Finally the result is retrieved directory from areq->result.
> > What is the reason that the two functions are separately exists?
> 
> I think that when an async-aware node gets a tuple from an
> async-capable node, they should use ExecAsyncRequest() /
> ExecAyncHogeResponse() rather than ExecProcNode() [1].  I modified the
> patch so that ExecAppendAsyncResponse() is called from Append, but to
> support bubbling up the plan tree discussed in [2], I think it should
> be called from ForeignScans (the sides of async-capable nodes).  Am I
> right?  Anyway, I’ll rename ExecAppendAyncResponse() to the one
> proposed in Robert’s original patch.

Even though I understand the concept, to make it work we need to
remember the parent *async* node somewhere. In my faint memory the
very early patch did something like that.

So I think just providing ExecAsyncResponse() doesn't make it
true. But if we made it true, it would be something like
partially reversed steps from what the current Exec*()s do for some of
the existing nodes, and further code would be required for some other
nodes like WindowFunction. Bubbling up works only in very simple cases
where a returned tuple is passed up to a further parent as-is, or at
least when the node converts a tuple into another shape. If an
async-receiver node wants to process multiple tuples from a child or
from multiple children, it is no longer just bubbling up.

That being said, we could avoid passing (a kind of) side-channel
information when ExecProcNode is called by providing
ExecAsyncResponse(). But I don't think the "side channel" is a
problem, since it is just another state of the node.


And... I think the reason I feel uneasy about the patch may be that the
patch uses the interface names in a somewhat different context.
Originally the framework resides in between executor nodes, not on a
node of either side. ExecAsyncNotify() notifies the requestee about an
event and ExecAsyncResponse() not

Re: pg_basebackup caused FailedAssertion

2020-12-14 Thread Heikki Linnakangas

On 12/12/2020 00:47, Jeff Davis wrote:

On Wed, 2013-02-27 at 19:29 +0200, Heikki Linnakangas wrote:

Right. I fixed that by adding WL_SOCKET_READABLE, and handling any
messages that might arrive after the frontend already sent CopyEnd.
The
frontend shouldn't send any messages after CopyEnd, until it receives
a
CopyEnd from the backend.


It looks like 4bad60e3 may have fixed the problem, is it possible to
just revert 3a9e64aa and allow the case?


Yes, I think you're right.


Also, the comment added by 3a9e64aa is misleading, because waiting for
a CopyDone from the server is not enough. It's possible that the client
receives the CopyDone from the server and the client sends a new query
before the server breaks from the loop. The client needs to wait until
at least the first CommandComplete.


Good point. I think that's a bug in the implementation rather than the 
comment, though. ProcessRepliesIfAny() should exit the loop immediately 
if (streamingDoneReceiving && streamingDoneSending). But that's moot if 
we revert 3a9e64aa altogether. I think we could backpatch the revert, 
because it's not quite right as it is, and we have 3a9e64aa in all the 
supported versions.



In theory, the frontend could already send the next query before
receiving the CopyEnd, but libpq doesn't currently allow that. Until
someone writes a client that actually tries to do that, I'm not going
to
try to support that in the backend. It would be a lot more work, and
likely be broken anyway, without any way to test it.


I tried to add streaming replication support (still a work in progress)
to the rust client[1], and I ran into this problem.

The core of the rust client is fully pipelined and async, so it's a bit
annoying to work around this problem.


Since you have the means to test this, would you like to do the honors 
and revert 3a9e64aa?


- Heikki




Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-14 Thread Bharath Rupireddy
On Mon, Dec 14, 2020 at 11:52 AM Michael Paquier  wrote:
> On Mon, Dec 14, 2020 at 03:15:12PM +0900, Michael Paquier wrote:
> > Please note that this case fails with your patch, but the presence of
> > IF NOT EXISTS should ensure that we don't fail and issue a NOTICE
> > instead, no?

Thanks for the use case. The provided use case (or for that matter any
use case with explain analyze ctas if-not-exists) fails if the
relation already exists. It happens on the master branch; please have
a look at the tests in [1]. You are right in saying that, whether it is
explain or explain analyze ctas, if there is if-not-exists we should issue
a notice instead of an error, as with plain ctas.

Do we want to fix this behaviour for explain/explain analyze ctas with
if-not-exists cases? Thoughts?

If yes, we could change the code in ExplainOneUtility() such that we
check relation existence before rewrite/planning, issue a notice, and
return. Then the user sees a notice and an empty plan, as we are
returning from ExplainOneUtility(). Is it okay to show a notice and
an empty plan to the user? Thoughts?
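
Roughly, such a check could look like the sketch below; this is only an illustration of the idea, not a committed fix, and the helper name and its placement are assumptions:

/* Sketch only, not a committed fix: a helper that EXPLAIN could call
 * before rewriting/planning a CREATE TABLE AS ... IF NOT EXISTS. */
static bool
ctas_target_exists_skip(CreateTableAsStmt *ctas)
{
	if (ctas->if_not_exists &&
		OidIsValid(RangeVarGetRelid(ctas->into->rel, NoLock, true)))
	{
		ereport(NOTICE,
				(errcode(ERRCODE_DUPLICATE_TABLE),
				 errmsg("relation \"%s\" already exists, skipping",
						ctas->into->rel->relname)));
		return true;			/* caller returns early; EXPLAIN shows no plan */
	}

	return false;
}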

>> Taking this case specifically (OK, I am playing with
> > the rules a bit to insert data into the relation itself, still), this
> > query may finish by adding tuples to the table whose creation should
> > have been bypassed but the query got executed and inserted tuples.

IIUC, with the use case provided, the tuples will not be inserted as
the query later fails (and the txn gets aborted) if the relation
exists.

> > That's one example of behavior that may be confusing.  There may be
> > others, but it seems to me that it may be simpler to execute or even
> > plan the query at all if the relation already exists.
>
> Er..  Sorry. I meant here to *not* execute or even *not* plan the
> query at all if the relation already exists.

+1 to not plan and execute the query at all if the relation which
ctas/cmv is trying to create already exists.

[1] -
postgres=#   explain analyze
postgres-#  create table if not exists aa as
postgres-#with insert_query as
postgres-#  (insert into aa values (1) returning a1)
postgres-#select * from insert_query;
ERROR:  relation "aa" already exists

postgres=#  explain analyze
postgres-#  create table aa as
postgres-#with insert_query as
postgres-#  (insert into aa values (1) returning a1)
postgres-#select * from insert_query;
ERROR:  relation "aa" already exists

postgres=#  explain
postgres-#  create table aa as
postgres-#with insert_query as
postgres-#  (insert into aa values (1) returning a1)
postgres-#select * from insert_query;
 QUERY PLAN

 CTE Scan on insert_query  (cost=0.01..0.03 rows=1 width=4)
   CTE insert_query
 ->  Insert on aa  (cost=0.00..0.01 rows=1 width=4)
   ->  Result  (cost=0.00..0.01 rows=1 width=4)

postgres=#   explain
postgres-#   create table if not exists aa as
postgres-#with insert_query as
postgres-#  (insert into aa values (1) returning a1)
postgres-#select * from insert_query;
 QUERY PLAN

 CTE Scan on insert_query  (cost=0.01..0.03 rows=1 width=4)
   CTE insert_query
 ->  Insert on aa  (cost=0.00..0.01 rows=1 width=4)
   ->  Result  (cost=0.00..0.01 rows=1 width=4)

postgres=#   create table aa as
postgres-#with insert_query as
postgres-#  (insert into aa values (1) returning a1)
postgres-#select * from insert_query;
ERROR:  relation "aa" already exists

postgres=#   create table if not exists aa as
postgres-#with insert_query as
postgres-#  (insert into aa values (1) returning a1)
postgres-#select * from insert_query;
NOTICE:  relation "aa" already exists, skipping
CREATE TABLE AS

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Rethinking plpgsql's assignment implementation

2020-12-14 Thread Pavel Stehule
Hi

I checked the performance, and it looks like access to a record's field is faster,
but access to an array's field is significantly slower

do $$
declare
  a int[];
  aux int;
  rep boolean default true;
begin
  for i in 1..5000
  loop
a[i]:= 5000 - i;
  end loop;

  raise notice '%', a[1:10];

  while rep
  loop
rep := false;
for i in 1..5000
loop
  if a[i] > a[i+1] then
aux := a[i];
a[i] := a[i+1]; a[i+1] := aux;
rep := true;
  end if;
end loop;
  end loop;

  raise notice '%', a[1:10];

end;
$$;

This code is about 3x slower than master (40 sec vs 12 sec). I believe
this is a worst-case scenario

I tested pi calculation

CREATE OR REPLACE FUNCTION pi_est_1(n int)
RETURNS numeric AS $$
DECLARE
  accum double precision DEFAULT 1.0;
  c1 double precision DEFAULT 2.0;
  c2 double precision DEFAULT 1.0;
BEGIN
  FOR i IN 1..n
  LOOP
accum := accum * ((c1 * c1) / (c2 * (c2 + 2.0)));
c1 := c1 + 2.0;
c2 := c2 + 2.0;
  END LOOP;
  RETURN accum * 2.0;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION pi_est_2(n int)
RETURNS numeric AS $$
DECLARE
  accum double precision DEFAULT 1.0;
  c1 double precision DEFAULT 2.0;
  c2 double precision DEFAULT 1.0;
BEGIN
  FOR i IN 1..n
  LOOP
accum := accum * ((c1 * c1) / (c2 * (c2 + double precision '2.0')));
c1 := c1 + double precision '2.0';
c2 := c2 + double precision '2.0';
  END LOOP;
  RETURN accum * double precision '2.0';
END;
$$ LANGUAGE plpgsql;

And the performance is 10% slower than on master

An interesting point: master is about 5% faster than PG 13