Re: PostmasterIsAlive() in recovery (non-USE_POSTMASTER_DEATH_SIGNAL builds)

2020-11-15 Thread Michael Paquier
On Thu, Sep 24, 2020 at 05:55:17PM +1200, Thomas Munro wrote:
> Right, RestoreArchivedFile() uses system(), so I guess it can hang
> around for a long time after unexpected postmaster exit on every OS if
> the command waits.  To respond to various kinds of important
> interrupts, I suppose that'd ideally use something like
> OpenPipeStream() and a typical WaitLatch() loop with CFI().  I'm not
> sure what our policy is or should be for exiting while we have running
> subprocesses.  I guess that is a separate issue.

-   if (IsUnderPostmaster && !PostmasterIsAlive())
+   if (IsUnderPostmaster &&
+#ifndef USE_POSTMASTER_DEATH_SIGNAL
+   count++ % 1024 == 0 &&
+#endif
+   !PostmasterIsAlive())
That's pretty hack-ish, even if efficient.  Could we consider a
different approach, like something relying on
PostmasterIsAliveInternal() with repetitive call handling?  This may
not be the only place where we care about that, particularly for
non-core code.

No objections with the two changes from pg_usleep() to WaitLatch() so
they could be applied separately first.
--
Michael




Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-15 Thread Masahiro Ikeda

On 2020-11-12 14:58, Fujii Masao wrote:

On 2020/11/06 10:25, Masahiro Ikeda wrote:

On 2020-10-30 11:50, Fujii Masao wrote:

On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible, even if both
numbers are important.


Also, about performance, I thought there would be little impact because it
increments stats in memory. If I can implement it to reuse pgWalUsage's
values, which already collect these stats, there is no impact on
XLogInsertRecord.
For example, how about pg_stat_wal() calculating the accumulated
values of wal_records, wal_fpi, and wal_bytes from pgWalUsage's
values?


I don't think that works, but it would work for pgstat_send_wal() to
take the difference of those values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
    ...
    /* fill in some values using pgWalUsage */
    WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
    WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
    WalStats.m_wal_fpi     = pgWalUsage.wal_fpi     - prevWalUsage.wal_fpi;

    ...
    pgstat_send(&WalStats, sizeof(WalStats));

    /* remember the current numbers */
    prevWalUsage = pgWalUsage;
}


Thanks to Horiguchi-san's advice, I changed it to reuse pgWalUsage,
which is already defined, eliminating the extra overhead.


+    /* fill in some values using pgWalUsage */
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+    WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

It's better to use WalUsageAccumDiff() here?


Yes, thanks. I fixed it.


prevWalUsage needs to be initialized with pgWalUsage?

+    if (AmWalWriterProcess()){
+    WalStats.m_wal_write_walwriter++;
+    }
+    else
+    {
+    WalStats.m_wal_write_backend++;
+    }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() also send the process type to
the stats collector. Then the stats collector can accumulate the
counters per process type if necessary. If we adopt this approach, we
can easily extend pg_stat_wal so that any fields can be reported per
process type.


I'll remove the above source code because these counters are not useful.



On 2020-10-30 12:00, Fujii Masao wrote:

On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to pg_stat_wal view.

Although there are some parameters related to WAL,
there are few statistics for tuning them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]-------+------------------------------
wal_records         | 2000224
wal_fpi             | 47
wal_bytes           | 248216337
wal_buffers_full    | 20954
wal_init_file       | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time      | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time       | 0
stats_reset         | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand the DB's performance, first we will check the performance
trends for the entire database instance.
For example, if the number of wal_fpi becomes higher, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can check the above statistics via EXPLAIN, auto_explain,
autovacuum and pg_stat_statements now, if users want to see the
performance trends for the entire database, they must recalculate the
statistics themselves.

I think it is useful to add the sum of the basic statistics.


2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

Creating a new WAL file may have an impact on the performance of
a write-heavy workload generating lots of WAL. If this number is
reported as high, we can tune WAL-related parameters so that more
"recycled" WAL files can be held, reducing the number of these
initializations.



3.

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-15 Thread Masahiro Ikeda

On 2020-11-12 16:27, Fujii Masao wrote:


Re: Tracking cluster upgrade and configuration history

2020-11-15 Thread Ian Lawrence Barwick
On Mon, 16 Nov 2020 at 15:48, Bharath Rupireddy wrote:
>
> On Thu, Nov 12, 2020 at 4:01 AM Mark Dilger
>  wrote:
> >
> > While supporting customers, it would frequently be useful to have more 
> > information about the history of a cluster.  For example, which prior 
> > versions were ever installed and running on the cluster?  Has the cluster 
> > ever been run with fsync=off?  When did the server last enter recovery, if 
> > ever?  Was a backup_label file present at that time?
> >
>
> +1 for the idea. The information will be useful at times for debugging 
> purposes.

It's certainly something which would be nice to have.

> > Would it make sense to alternately, or additionally, store some of this 
> > information in a flat text file in pg_data, say a new file named 
> > "cluster_history" or such?
> >
>
> IMHO, this is also a good idea. We need to think of the APIs to
> open/read/write/close that history file. How often, which processes,
> and what type of data would they write? Is it that the postmaster alone
> will write into that file? If multiple processes are allowed to write,
> how do we deal with concurrent writers? Will users have to manually
> open and read that file, or will we have some program similar to
> pg_controldata? Will we have some maximum limit on the size of this
> file?

pg_stat_statements might be worth looking at as one way of handling that
kind of file.

However, the problem with keeping a separate file which is not WAL-logged
is that it doesn't get propagated to standbys, and there's also the
question of how it could be maintained across upgrades via pg_upgrade.

FWIW, I did once create a background worker extension [1] which logs
configuration changes to a table, though it's not particularly maintained
or recommended for production use.

[1] https://github.com/ibarwick/config_log


Regards

Ian Barwick
-- 
EnterpriseDB: https://www.enterprisedb.com




RE: Disable WAL logging to speed up data loading

2020-11-15 Thread osumi.takami...@fujitsu.com
Hi


On Thursday, October 29, 2020 11:42 AM Fujii Masao 
 wrote:
> BTW, with the patch, I observed that PREPARE TRANSACTION and COMMIT
> PREPARED caused assertion failure in my env, as I pointed upthread.
> 
> How does the patch handle other feature depending on the existence of WAL,
> e.g., pg_logical_emit_message()?

I've updated the first patch, based mainly on comments
from both Tsunakawa-san and Fujii-san.
I'll address the other comments in the next patch.
(Note that this patch passes make check-world but
doesn't pass installcheck-world yet.)

The assertion failure Fujii-san reported earlier is now prevented in v02
by adding a check that rejects PREPARE TRANSACTION when wal_level=none.
Meanwhile, I didn't change the code for COMMIT PREPARED,
because prohibiting the usage of PREPARE TRANSACTION
means the user cannot use COMMIT PREPARED either.
I learnt this way of modification from the case where max_prepared_transactions
is set to zero: PREPARE TRANSACTION cannot be used because of that check,
while issuing COMMIT PREPARED itself doesn't perform the check.
Just prohibiting PREPARE TRANSACTION seemed OK to me.

As for pg_resetwal (and other commands like pg_rewind),
I didn't make any change. pg_resetwal is used as a *last* resort to make
a corrupted server start up again by guessing the contents of the control
file, while wal_level=none is designed never to let the server start
again when any unexpected crash is detected.
The documentation in the patch about wal_level=none asks the user to
recreate the whole cluster when such a corruption happens. Thus, it's not
natural to think that a user of wal_level=none will continue to use the
corrupted cluster forcibly by applying pg_resetwal. This is the reason I
made no fix to pg_resetwal.
As for the other commands, I don't think there was a need to fix them.

By the way, I'm not sure why the functions related to replication origins
(e.g. pg_replication_origin_create) don't have a check of wal_level. They
can be used when wal_level < logical even though their purpose is to
safely keep track of replication progress. Does it mean something special
to execute such functions when wal_level < logical?
In the patch, I added an error check on this point.

Another similar case is that there's no check of wal_level for
pg_logical_emit_message. In my humble opinion, pg_logical_emit_message
should not be used when wal_level <= minimal. Given the restriction that
the user cannot create a slot with pg_create_physical_replication_slot
when wal_level < replica, I don't think plugins can consume messages from
pg_logical_emit_message without a slot set up for logical replication.
Or, if a slot was created before restarting the server to turn on
wal_level=none, I cannot see the worth of executing the function because
no other kinds of WAL are issued.
Thus, I prohibited the usage of pg_logical_emit_message this time.
Does that make sense?


Best,
Takamichi Osumi


disable_WAL_logging_v02.patch
Description: disable_WAL_logging_v02.patch


Re: Cache relation sizes?

2020-11-15 Thread Thomas Munro
On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro  wrote:
> On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
>  wrote:
> > This shared relation cache can easily store relation size as well.
> > In addition it will solve a lot of other problems:
> > - noticeable overhead of local relcache warming
> > - large memory consumption in case of larger number of relations
> > O(max_connections*n_relations)
> > - sophisticated invalidation protocol and related performance issues
> > Certainly access to shared cache requires extra synchronization. But DDL
> > operations are relatively rare.
> > So in most cases we will have only shared locks. May be overhead of
> > locking will not be too large?
>
> Yeah, I would be very happy if we get a high performance shared
> sys/rel/plan/... caches in the future, and separately, having the
> relation size available in shmem is something that has come up in
> discussions about other topics too (tree-based buffer mapping,
> multi-relation data files, ...).  ...

After recent discussions about the limitations of relying on SEEK_END
in a nearby thread[1], I decided to try to prototype a system for
tracking relation sizes properly in shared memory.  Earlier in this
thread I was talking about invalidation schemes for backend-local
caches, because I only cared about performance.  In contrast, this new
system has SMgrRelation objects that point to SMgrSharedRelation
objects (better names welcome) that live in a pool in shared memory,
so that all backends agree on the size.  The scheme is described in
the commit message and comments.  The short version is that smgr.c
tracks the "authoritative" size of any relation that has recently been
extended or truncated, until it has been fsync'd.  By authoritative, I
mean that there may be dirty buffers in that range in our buffer pool,
even if the filesystem has vaporised the allocation of disk blocks and
shrunk the file.

That is, it's not really a "cache".  It's also not like a shared
catalog, which Konstantin was talking about... it's more like the pool
of inodes in a kernel's memory.  It holds all currently dirty SRs
(SMgrSharedRelations), plus as many clean ones as it can fit, with
some kind of reclamation scheme, much like buffers.  Here, "dirty"
means the size changed.

Attached is an early sketch, not debugged much yet (check under
contrib/postgres_fdw fails right now for a reason I didn't look into),
and there are clearly many architectural choices one could make
differently, and more things to be done... but it seemed like enough
of a prototype to demonstrate the concept and fuel some discussion
about this and whatever better ideas people might have...

Thoughts?

[1] 
https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com
From f34ac22ace4d6cee15e150357d361f73fead2755 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 13 Nov 2020 14:38:41 +1300
Subject: [PATCH] WIP: Track relation sizes in shared memory.

Introduce a fixed size pool of SMgrSharedRelation objects.  A new GUC
smgr_shared_relation controls how many can exist at once, and they are
evicted as required.  "Dirty" SMgrSharedRelations can only be evicted
after being synced to disk.  Goals:

1.  Provide fast lookups of relation sizes, cutting down on lseek()
calls.  This supersedes the recovery-only caching added recently, and
replaces preexisting FSM and VM caching schemes.

2.  Stop trusting the operating system to keep track of the size of
files that we have recently extended, until fsync() has been called.

XXX smgrimmedsync() is maybe too blunt an instrument?

XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
some new interface, but it doesn't actually know if it's cleaning a fork
that extended a relation...

XXX perhaps bgwriter should try to clean them?

XXX currently reusing the array of locks also used for buffer mapping,
need to define some more in lwlocks.c...

Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
---
 contrib/pg_visibility/pg_visibility.c |   1 -
 src/backend/access/heap/visibilitymap.c   |  28 +-
 src/backend/catalog/storage.c |   2 -
 src/backend/storage/freespace/freespace.c |  35 +-
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/storage/smgr/md.c |  10 +
 src/backend/storage/smgr/smgr.c   | 554 --
 src/backend/utils/misc/guc.c  |  11 +
 src/include/storage/smgr.h|  21 +-
 9 files changed, 576 insertions(+), 89 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 54e47b810f..702951e487 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -392,7 +392,6 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
 	check_relation_relkind(rel);
 
 	RelationOpenSmgr(rel);
-	rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FO

Re: Corner-case bug in pg_rewind

2020-11-15 Thread Pavel Borisov
>
> > 1. Comment regarding
> > --
> > 347  * Check for the possibility that the target is in fact a direct
> > 348  * ancestor of the source. In that case, there is no divergent history
> > 349  * in the target that needs rewinding.
> > --
> > would be better reformulated, as the overall block contents mostly cover
> > the vice versa case of a target NOT being a direct ancestor of the source.
> > Maybe: "We need rewind in cases when  and don't need only if the target is
> > a direct ancestor of the source." I think it will be more understandable if
> > it were a commentary with descriptions of all cases in the block, or no
> > commentary before the block at all, with commentaries on these cases on
> > each if/else inside the block instead.
>
> The comment you cite is not part of this patch. I suspect there might be
> some
>
>
Sure, but that comment describes the whole block the patch introduces
changes into. If it could be made more relevant, and since you are
planning to revise it again anyway, it would be great to change it as
well. But I don't insist.


> Note that the patch  may require reworking for HEAD due to changes in
> commit 9c4f5192f6. I'll try to take another look this week.
>
Thank you!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Tracking cluster upgrade and configuration history

2020-11-15 Thread Bharath Rupireddy
On Thu, Nov 12, 2020 at 4:01 AM Mark Dilger
 wrote:
>
> While supporting customers, it would frequently be useful to have more 
> information about the history of a cluster.  For example, which prior 
> versions were ever installed and running on the cluster?  Has the cluster 
> ever been run with fsync=off?  When did the server last enter recovery, if 
> ever?  Was a backup_label file present at that time?
>

+1 for the idea. The information will be useful at times for debugging purposes.

>
> Some of this type of information could strictly be fixed size, such as a 
> fixed set of timestamps for the time at which a fixed set of things last 
> occurred, or a fixed set of bits indicating whether a fixed set of things 
> ever happened.
>
> Some other types would be variable size, but hopefully short in practice, 
> like a list of all postgres versions that have ever been run on the cluster.
>
> Logging the information via the usual log mechanism seems insufficient, as 
> log files may get rotated and this information lost.
>

True. Just a thought: can we use the existing logging mechanism and APIs
to write to a new file that never gets rotated by the syslogger (of
course, we need to think of the maximum file size that's allowed)? The
idea is like this: we use elog/ereport and so on with a new debug level
which, when specified, logs to the new file instead of the standard log
files.

>
> Would it be acceptable to store some fixed set of flag bits and timestamps in 
> pg_control?  Space there is at a premium.
>

Since we allocate ControlFileData in shared memory, and we may also
have some data with timestamps, variable-length texts and so on, having
this included in the pg_control data structure would not seem a good
idea to me.

>
> Would it make sense to alternately, or additionally, store some of this 
> information in a flat text file in pg_data, say a new file named 
> "cluster_history" or such?
>

IMHO, this is also a good idea. We need to think of the APIs to
open/read/write/close that history file. How often, which processes,
and what type of data would they write? Is it that the postmaster alone
will write into that file? If multiple processes are allowed to write,
how do we deal with concurrent writers? Will users have to manually
open and read that file, or will we have some program similar to
pg_controldata? Will we have some maximum limit on the size of this
file?

>
> I'm happy to put together a more concrete proposal, but solicit your opinions 
> first on the merits of the idea generally, and if you think the idea good, on 
> the specifics you'd like to see included.
>

It would be good to know more about this idea.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Supporting = operator in gin/gist_trgm_ops

2020-11-15 Thread Alexander Korotkov
On Mon, Nov 16, 2020 at 2:13 AM Jeff Janes  wrote:
> On Sat, Nov 14, 2020 at 12:31 AM Alexander Korotkov  
> wrote:
>> I went through and revised this patch.  I made the documentation
>> statement less categorical.  pg_trgm gist/gin indexes might have lower
>> performance of equality operator search than B-tree.  So, we can't
>> claim the B-tree index is always not needed.  Also, simple comparison
>> operators are <, <=, >, >=, and they are not supported.
>
> Is "simple comparison" here a well-known term of art?  If I read the doc as 
> committed (which doesn't include the sentence above), and if I didn't already 
> know what it was saying, I would be left wondering which comparisons those 
> are.  Could we just say "inequality operators"?

You're right.  "Simple comparison" is vague, let's replace it with
"inequality".  Pushed, thanks!

--
Regards,
Alexander Korotkov




Re: Add Information during standby recovery conflicts

2020-11-15 Thread Masahiko Sawada
On Mon, Nov 9, 2020 at 2:49 PM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 11/6/20 3:21 AM, Masahiko Sawada wrote:
> >
> > On Fri, Oct 30, 2020 at 6:02 PM Drouvot, Bertrand  
> > wrote:
> >> Hi,
> >>
> >> On 10/30/20 4:25 AM, Fujii Masao wrote:
> >>>
> >>> On 2020/10/30 10:29, Masahiko Sawada wrote:
> 
>  On Thu, 29 Oct 2020 at 00:16, Fujii Masao
>   wrote:
> > I read v8 patch. Here are review comments.
>  Thank you for your review.
> 
> > Thank you for updating the patch!
> >
> > Looking at the latest version patch
> > (v8-0002-Log-the-standby-recovery-conflict-waits.patch), I think it
> > doesn't address some comments from Fujii-san.
> >
> > When recovery conflict with buffer pin happens, log message is output
> > every deadlock_timeout. Is this intentional behavior? If yes, IMO
> > that's
> > not good because lots of messages can be output.
>  Agreed.
> > I think the latest patch doesn't fix the above comment. Log message
> > for recovery conflict on buffer pin is logged every deadlock_timeout.
> >
>  if we were to log the recovery conflict only once in bufferpin
>  conflict case, we would log it multiple times only in lock conflict
>  case. So I guess it's better to do that in all conflict cases.
> >>> Yes, I agree that this behavior basically should be consistent between
> >>> all cases.
> > The latest patch seems not to address this comment as well.
>
> Oh, I missed those ones, thanks for the feedback.
>
> New version attached, so that recovery conflict will be logged only once
> also for buffer pin and lock cases.

Thank you for updating the patch.

Here are review comments.

+   if (report_waiting && (!logged_recovery_conflict || new_status == NULL))
+       ts = GetCurrentTimestamp();

The condition will always be true if log_recovery_conflict_wait is
false and report_waiting is true, leading to unnecessary calls to
GetCurrentTimestamp().

---
+   
+You can control whether a log message is produced when the startup process
+is waiting longer than deadlock_timeout for recovery
+conflicts. This is controled by the 
+parameter.
+   

s/controled/controlled/

---
if (report_waiting)
waitStart = GetCurrentTimestamp();

Similarly, we have the above code but we don't need to call
GetCurrentTimestamp() if update_process_title is false, even if
report_waiting is true.

I've attached the patch that fixes the above comments. It can be
applied on top of your v8 patch.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/


fix_masahiko2.patch
Description: Binary data


RE: Terminate the idle sessions

2020-11-15 Thread kuroda.hay...@fujitsu.com
Dear Li,

> Thanks for your advice! Attached v4.

I confirmed it. OK.

> @@ -30,6 +30,7 @@ typedef enum TimeoutId
>   STANDBY_DEADLOCK_TIMEOUT,
>   STANDBY_TIMEOUT,
>   STANDBY_LOCK_TIMEOUT,
> + IDLE_SESSION_TIMEOUT,
>   IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
>   /* First user-definable timeout reason */
>   USER_TIMEOUT,

I'm not familiar with timeouts, but I can see that the priority of the
idle-session timeout is set lower than the idle-in-transaction timeout.
Could you explain the reason? In my image this timeout sits at the
lowest layer, so it might have the lowest priority.

The other parts of the code are still being checked :-(.

Hayato Kuroda
FUJITSU LIMITED


Re: Feature improvement for pg_stat_statements

2020-11-15 Thread Seino Yuki
Due to similar fixes, we have also merged the patches discussed in the
following thread:

https://commitfest.postgresql.org/30/2736/
The patch is posted on the 2736 side of the thread.

Regards.





Re: [PATCH] Add features to pg_stat_statements

2020-11-15 Thread Seino Yuki

Thanks for updating the patch!

+   pgss_info->dealloc = 0;
+   SpinLockInit(&pgss_info->mutex);
+   Assert(pgss_info->dealloc == 0);

Why is this assertion check necessary? It seems not necessary.

+   {
+   Assert(found == found_info);

Having pgssSharedState and pgssInfoCounters separately might make
the code a bit more complicated, like the above? If this is true, what
about including pgssInfoCounters in pgssSharedState?

PGSS_FILE_HEADER needs to be changed since the patch changes
the format of pgss file?

+   /* Read pgss_info */
+   if (feof(file) == 0)
+   if (fread(pgss_info, sizeof(pgssInfoCounters), 1, file) != 1)
+   goto read_error;

Why does feof(file) need to be called here?

+pgss_info_update(void)
+{
+   {

Why is the second "{" necessary? It seems redundant.

+pgss_info_reset(void)
+{
+   {

Same as above.

+pg_stat_statements_info(PG_FUNCTION_ARGS)
+{
+   int64   d_count = 0;
+   {

Same as above.

+   SpinLockAcquire(&c->mutex);
+   d_count = Int64GetDatum(c->dealloc);
+   SpinLockRelease(&c->mutex);

Why does Int64GetDatum() need to be called here? It seems not necessary.


+   
+
+ pg_stat_statements_info() returns bigint
+ 
+  pg_stat_statements_info
+ 
+

Isn't it better not to expose pg_stat_statements_info() function in the
document because pg_stat_statements_info view is enough and there
seems no use case for the function?

Regards,


Thanks for the comment.
I'll post a fixed patch.
Due to similar fixes, we have also merged the patches discussed in the
following thread.

https://commitfest.postgresql.org/30/2738/


Why is this assertion check necessary? It seems not necessary.

As indicated, it is unnecessary and will be removed.


Having pgssSharedState and pgssInfoCounters separately might make
the code a bit more complicated like the above? If this is true, what about
including pgssInfoCounters in pgssSharedState?
I fixed pgssSharedState to include pgssInfoCounters. The related parts were
also corrected accordingly.



PGSS_FILE_HEADER needs to be changed since the patch changes
the format of pgss file?

The value of PGSS_FILE_HEADER has been updated.


Why does feof(file) need to be called here?

As indicated, it is unnecessary and will be removed.


Why is the second "{" necessary? It seems redundant.

As indicated, it is unnecessary and will be removed.
But I left the {} in pg_stat_statements_info() to make the shared memory 
edit part explicit.


Why does Int64GetDatum() need to be called here? It seems not 
necessary.

As indicated, it is unnecessary and will be removed.


Isn't it better not to expose pg_stat_statements_info() function in the
document because pg_stat_statements_info view is enough and there
seems no use case for the function?

As indicated, it is unnecessary and will be removed.

Regards.
diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 081f997d70..3ec627b956 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,7 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql \
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index e0edb134f3..a1eaedca8e 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -859,4 +859,19 @@ SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query COLLATE
  SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" | 1 | 0 |0
 (6 rows)
 
+--
+-- Checking the execution of the pg_stat_statements_info view
+--
+SELECT pg_stat_statements_reset(0,0,0);
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+SELECT dealloc FROM pg_stat_statements_info;
+ dealloc 
+-
+   0
+(1 row)
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
new file mode 100644
index 00..4edbc03dc1
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.9'" to load this file. \quit
+
+--- Define pg_stat_statements_info
+CREATE FUNCTION pg_stat_statements_inf

Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-15 Thread Michael Paquier
On Fri, Nov 13, 2020 at 12:58:52PM +0530, Bharath Rupireddy wrote:
> It's not required to set bistate to null as we have allocated myState
> with palloc0 in CreateIntoRelDestReceiver, it will anyways be null.
>   if (!into->skipData)
> myState->bistate = GetBulkInsertState();
> 
> Attaching v4 patch. Please review it.

I have reviewed this one this morning, and applied it after some
tweaks.  I have reworded some of the comments, fixed some typos, and
largely refactored the test cases to stress all the combinations
possible.  Please note that your patch would have caused failures
in the buildfarm, as any role created needs to be prefixed with
"regress_".
--
Michael


signature.asc
Description: PGP signature


Re: Corner-case bug in pg_rewind

2020-11-15 Thread Ian Lawrence Barwick
On Tue, 10 Nov 2020 at 18:07, Pavel Borisov  wrote:
>
> I did some effort to review your patch which seems legit to me.

Thanks for the review and feedback.

> I think some minor things are better to be improved i.e.
>
> 1. Comment regarding
> --
> 347  * Check for the possibility that the target is in fact a direct
> 348  * ancestor of the source. In that case, there is no divergent 
> history
> 349  * in the target that needs rewinding.
> --
> would better be reformulated, as the block's contents mostly cover the vice
> versa case of a target NOT being a direct ancestor of the source. Maybe: "We
> need rewind in cases when  and don't need it only if the target is a direct
> ancestor of the source."
> a commentary with descriptions of all cases in the block or no commentary
> before the block at all with commentaries of these cases on each if/else
>  inside the block instead.

The comment you cite is not part of this patch. I suspect there might be some
scope for improving the comments in general, but that would need to be done
separately.

> 2. When I do the test with no patching of pg_rewind.c I get the output:
> -
> #   Failed test 'pg_rewind detects rewind needed stderr /(?^:rewinding from 
> last common checkpoint at)/'
> #   at t/007_min_recovery_point.pl line 107.
> #   'pg_rewind: servers diverged at WAL location 0/3000180 on 
> timeline 2
> # pg_rewind: no rewind required
> # '
> # doesn't match '(?^:rewinding from last common checkpoint at)'
> # Looks like you failed 1 test of 2.
> t/007_min_recovery_point.pl .. Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/2 subtests
>
> Test Summary Report
> ---
> t/007_min_recovery_point.pl (Wstat: 256 Tests: 2 Failed: 1)
>   Failed test:  2
>   Non-zero exit status: 1
> ---
> Maybe it can just give "failed" without so many details?

That's just the way the tests work - an individual test has no influence
over that output.

> Also, I think Heikki's notion could be fulfilled.

I spent a bit of time looking at that suggestion but couldn't actually
verify it was an issue which needed fixing.

Note that the patch may require reworking for HEAD due to changes in
commit 9c4f5192f6. I'll try to take another look this week.


Regards

Ian Barwick


--
EnterpriseDB: https://www.enterprisedb.com




Re: Split copy.c

2020-11-15 Thread Justin Pryzby
On Tue, Nov 03, 2020 at 03:15:27PM +1300, David Rowley wrote:
> On Tue, 3 Nov 2020 at 07:35, Andres Freund  wrote:
> >
> > On 2020-11-02 19:43:38 +0200, Heikki Linnakangas wrote:
> > > On 02/11/2020 19:23, Andres Freund wrote:
> > > > On 2020-11-02 11:03:29 +0200, Heikki Linnakangas wrote:
> > > > > There isn't much common code between COPY FROM and COPY TO, so I 
> > > > > propose
> > > > > that we split copy.c into two: copyfrom.c and copyto.c. See attached. 
> > > > > I thin
> > > > > that's much nicer.
> > > >
> > > > Not quite convinced that's the right split - or perhaps there's just
> > > > more potential. My feeling is that splitting out all the DML related
> > > > code would make the code considerably easier to read.
> > >
> > > What do you mean by DML related code?
> >
> > Basically all the insertion related code (e.g CopyMultiInsert*, lots of
> > code in CopyFrom()) and perhaps also the type input invocations.
> 
> I quite like the fact that those are static and inline-able.  I very
> much imagine there'd be a performance hit if we moved them out to
> another .c file and made them extern.  Some of those functions can be
> quite hot when copying into a partitioned table.

For another patch [0], I moved into copy.h:
+typedef struct CopyMultiInsertBuffer
+typedef struct CopyMultiInsertInfo
+CopyMultiInsertBufferInit(ResultRelInfo *rri)
+CopyMultiInsertInfoSetupBuffer(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
+CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
+CopyMultiInsertInfoStore(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,

That's an experimental part 0002 of my patch in response to Simon's suggestion.
Maybe your response will be that variants of those interfaces should be added
to nodeModifyTable.[ch] instead of moving them.  Currently I'm passing
(void*)mtstate as cstate - if there were a generic interface, that would be a
void *state or so.

[0] https://commitfest.postgresql.org/30/2553/
should INSERT SELECT use a BulkInsertState? (and multi_insert)

-- 
Justin




Re: CLUSTER on partitioned index

2020-11-15 Thread Justin Pryzby
On Wed, Nov 04, 2020 at 08:23:56PM -0600, Justin Pryzby wrote:
> On Tue, Oct 27, 2020 at 07:33:12PM -0500, Justin Pryzby wrote:
> > I'm attaching a counter-proposal to your catalog change, which preserves
> > indisclustered on children of clustered, partitioned indexes, and 
> > invalidates
> > indisclustered when attaching unclustered indexes.
> 
> ..and now propagates CLUSTER ON to child indexes.
> 
> I left this as separate patches to show what I mean and what's new while we
> discuss it.

This fixes some omissions in the previous patch and error in its test cases.

CLUSTER ON recurses to children, since I think a clustered parent index means
that all its child indexes are clustered.  "SET WITHOUT CLUSTER" doesn't have
to recurse to children, but I did it like that for consistency and it avoids
the need to special case InvalidOid.

-- 
Justin
From 75690aacef0d294e2667bd1091cf647dc9f5d187 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 7 Jun 2020 16:58:42 -0500
Subject: [PATCH v4 1/5] Implement CLUSTER of partitioned table..

This requires either specification of a partitioned index on which to cluster,
or that a partitioned index was previously set as clustered.
---
 doc/src/sgml/ref/cluster.sgml |   6 +
 src/backend/commands/cluster.c| 167 +++---
 src/bin/psql/tab-complete.c   |   1 +
 src/include/nodes/parsenodes.h|   5 +-
 src/test/regress/expected/cluster.out |  58 -
 src/test/regress/sql/cluster.sql  |  24 +++-
 6 files changed, 208 insertions(+), 53 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index b9450e7366..0476cfff72 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -172,6 +172,12 @@ CLUSTER [VERBOSE]
 are periodically reclustered.

 
+   
+Clustering a partitioned table clusters each of its partitions using the
+index partition of the given partitioned index or (if not specified) the
+partitioned index marked as clustered.
+   
+
  
 
  
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 04d12a7ece..391e018bbd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -32,7 +32,9 @@
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
+#include "catalog/partition.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_inherits.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
 #include "commands/progress.h"
@@ -72,6 +74,9 @@ static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
 			bool verbose, bool *pSwapToastByContent,
 			TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
 static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
+		Oid indexOid);
+static void cluster_multiple_rels(List *rvs, int options);
 
 
 /*---
@@ -113,7 +118,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
 			AccessExclusiveLock,
 			0,
 			RangeVarCallbackOwnsTable, NULL);
-		rel = table_open(tableOid, NoLock);
+		rel = table_open(tableOid, ShareUpdateExclusiveLock);
 
 		/*
 		 * Reject clustering a remote temp table ... their local buffer
@@ -124,14 +129,6 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 	 errmsg("cannot cluster temporary tables of other sessions")));
 
-		/*
-		 * Reject clustering a partitioned table.
-		 */
-		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("cannot cluster a partitioned table")));
-
 		if (stmt->indexname == NULL)
 		{
 			ListCell   *index;
@@ -169,8 +166,32 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
 		/* close relation, keep lock till commit */
 		table_close(rel, NoLock);
 
-		/* Do the job. */
-		cluster_rel(tableOid, indexOid, stmt->options);
+		if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		{
+			/* Do the job. */
+			cluster_rel(tableOid, indexOid, stmt->options);
+		}
+		else
+		{
+			List	   *rvs;
+			MemoryContext cluster_context;
+
+			/* Refuse to hold strong locks in a user transaction */
+			PreventInTransactionBlock(isTopLevel, "CLUSTER");
+
+			cluster_context = AllocSetContextCreate(PortalContext,
+"Cluster",
+ALLOCSET_DEFAULT_SIZES);
+
+			rvs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
+			cluster_multiple_rels(rvs, stmt->options);
+
+			/* Start a new transaction for the cleanup work. */
+			StartTransactionCommand();
+
+			/* Clean up working storage */
+			MemoryContextDelete(cluster_context);
+		}
 	}
 	else
 	{
@@ -180,7 +201,6 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
 		 */
 		MemoryContext cluster_context;
 		List	   *rvs;
-		ListCell   *rv;
 
 		/*
 		 * We ca

Re: Add docs stub for recovery.conf

2020-11-15 Thread Craig Ringer
On Sun, Nov 15, 2020 at 1:49 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Fri, Nov 13, 2020 at 10:42 AM Bruce Momjian  wrote:
>
>> I think the big problem, and I have seen this repeatedly, is showing up
>> with a patch without discussing whether people actually want the
>> feature.  I know it is a doc issue, but our TODO list has the order as:
>>
>> Desirability -> Design -> Implement -> Test -> Review -> Commit
>
>
>> and there is a reason for that.  When you appear with a patch, you are
>> already far down the steps, and you have to back up to explain why it is
>> useful.
>>
>
> That process is designed to prevent people from being exposed to wasted
> effort and hard feelings.  The choice to follow it individually, instead of
> collectively, doesn't diminish the value of the end result.
>

Frankly, it's also kind of a catch-22. Because often proposals for changes
or discussion get ignored until there's a patch, or the response is along
the lines of "show us a patch so we can try it and get a solid idea of what
this will do."

For major engineering changes, yes, discuss first. For small stuff, if you
don't want to get ignored, open with a patch.

On the point of renaming, my suggestion would be to have the documentation
> directory provide a file of all renaming for which redirects should be
> performed.  pgweb would source that file and actually establish the
> redirects on the main website.  Comments in the file can describe to a
> curious user why the name change was needed.  Though that honestly seems a
> bit overkill; for rename, the content as a whole still exists and a comment
> therein can talk about the renaming.  Users of the public website would
> still get the benefit of redirects, and there isn't any practical reason
> for people building documentation from source to want to establish such
> redirects even if they were provided the data in the form of the
> aforementioned file.
>

Agreed, there's no need to keep heading redirects in the source-built docs.
So if we're happy to maintain that on the website, in a way that makes
/current/ links work *and* following links from a /11/ docs page that has
vanished in pg12 to the right new place via the version navigation links,
that's sufficient for topic-heading renames.

We should, however, carry information about removals and renames in the
source-built docs to the extent that we have appropriate "see also" index
entries and useful information somewhere in the docs for people who are
upgrading. It just doesn't have to retain the same subject heading.


Re: Remove unused variable from SharedSort

2020-11-15 Thread Michael Paquier
On Sun, Nov 15, 2020 at 03:49:58PM +0530, Dilip Kumar wrote:
> On Sun, Nov 15, 2020 at 12:50 PM Bharath Rupireddy 
>  wrote:
>> We could have used that variable for an assert like
>> Assert(state->worker <= shared->nTapes) in worker_freeze_result_tape()
>> before accessing shared->tapes[state->worker] = output; as sometimes
>> state->worker is being set to -1. But, it seems like we reach
>> worker_freeze_result_tape(), only when  WORKER(state) is true. So, we
>> don't need that extra Assert and removing nTapes variable makes sense
>> to me.
> 
> Right, but anyway IMHO adding extra shared memory variables just for
> assert purposes doesn't make sense.

FWIW, I disagree with the removal of this variable because it is
useful to track down the number of members in a flexible array at
shmem level.  Even if you don't use that in some sanity checks for
code paths, which I think we actually should really do for at least
inittapes() and leader_takeover_tapes() when it comes to the number of
participants assumed to exist, that's useful for debugging purposes.

Robert, this code has been introduced by 9da0cc3, could you comment on
that?
--
Michael




Re: Online verification of checksums

2020-11-15 Thread Michael Paquier
On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier  wrote:
>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>> I was referring to the patch I sent on this thread that fixes the
>>> detection of a corruption for the zero-only case and where pd_lsn
>>> and/or pg_upper are trashed by a corruption of the page header.  Both
>>> cases allow a base backup to complete on HEAD, while sending pages
>>> that could be corrupted, which is wrong.  Once you make the page
>>> verification rely only on pd_checksum, as the patch does because the
>>> checksum is the only source of truth in the page header, corrupted
>>> pages are correctly detected, causing pg_basebackup to complain as it
>>> should.  However, it has also the risk to cause pg_basebackup to fail
>>> *and* to report as broken pages that are in the process of being
>>> written, depending on how slow a disk is able to finish a 8kB write.
>>> That's a different kind of wrongness, and users have two more reasons
>>> to be pissed.  Note that if a page is found as torn we have a
>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>> PageGetLSN() would pass, but the checksum verification would fail as
>>> the contents at the end of the page does not match the checksum.
>>
>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>> to share?
>>
> 
> I admit that I at some point lost track of the overlapping threads around
> this, and just figured there was enough different checksum-involved-people
> on those threads to handle it :) Meaning the short answer is "no, I don't
> really have one at this point".
> 
> Slightly longer comment is that it does seem reasonable, but I have not
> read in on all the different issues discussed over the whole thread, so
> take that as a weak-certainty comment.

Which part are you considering as reasonable?  The feature-removal
part on a stable branch, or perhaps something else?
--
Michael




Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-15 Thread Michael Paquier
On Sun, Nov 15, 2020 at 12:16:56PM -0500, Tom Lane wrote:
> The obvious problem with this is that if !USE_OPENSSL, we will not have
> pulled in openssl's headers.

FWIW, I argued upthread against including this part because it is
useless: if not building with OpenSSL, we'll never have the base to be
able to use RAND_poll().

> However ... all these machines are pointing at line 96, which is not
> that one but the one under "#if defined(USE_OPENSSL)".  So I'm not sure
> what to make of that, except that a bit more finesse seems required.

The build scripts of src/tools/msvc/ choose to not use OpenSSL as
strong random source even if building with OpenSSL.  The top of the
file only includes openssl/rand.h if using USE_OPENSSL_RANDOM.

Thinking about that afresh, I think that we got that wrong here on
three points:
- If attempting to use OpenSSL on Windows, let's just bite the bullet
and use OpenSSL as random source, using Windows as source only when
not building with OpenSSL.
- Instead of using a call to RAND_poll() that we know will never work,
let's just issue a compilation failure if attempting to use
USE_OPENSSL_RANDOM without USE_OPENSSL.
- rand.h needs to be included under USE_OPENSSL.
--
Michael
diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
index 6d85f50b7c..c5dfe4b072 100644
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -24,7 +24,7 @@
 #include 
 #include 
 
-#ifdef USE_OPENSSL_RANDOM
+#ifdef USE_OPENSSL
 #include 
 #endif
 #ifdef USE_WIN32_RANDOM
@@ -98,14 +98,11 @@ pg_strong_random_init(void)
 
 #if defined(USE_OPENSSL_RANDOM)
 	/*
-	 * In case the backend is using the PRNG from OpenSSL without being built
-	 * with support for OpenSSL, make sure to perform post-fork initialization.
-	 * If the backend is using OpenSSL then we have already performed this
-	 * step. The same version caveat as discussed in the comment above applies
-	 * here as well.
+	 * If attempting to use OpenSSL as random source without support for it,
+	 * consider this combination as invalid.
 	 */
 #ifndef USE_OPENSSL
-	RAND_poll();
+#error cannot use OpenSSL as random source without building with it.
 #endif
 
 #elif defined(USE_WIN32_RANDOM)
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 17e480546c..cb01902ae9 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -525,6 +525,7 @@ sub GenerateFiles
 	if ($self->{options}->{openssl})
 	{
 		$define{USE_OPENSSL} = 1;
+		$define{USE_OPENSSL_RANDOM} = 1;
 
 		my ($digit1, $digit2, $digit3) = $self->GetOpenSSLVersion();
 




Re: Supporting = operator in gin/gist_trgm_ops

2020-11-15 Thread Jeff Janes
On Sat, Nov 14, 2020 at 12:31 AM Alexander Korotkov 
wrote:

>
> I went through and revised this patch.  I made the documentation
> statement less categorical.  pg_trgm gist/gin indexes might have lower
> performance of equality operator search than B-tree.  So, we can't
> claim the B-tree index is always not needed.  Also, simple comparison
> operators are <, <=, >, >=, and they are not supported.
>

Is "simple comparison" here a well-known term of art?  If I read the doc as
committed (which doesn't include the sentence above), and if I didn't
already know what it was saying, I would be left wondering which
comparisons those are.  Could we just say "inequality operators"?

Cheers,

Jeff


Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-15 Thread Tom Lane
Magnus Hagander  writes:
> I think the defensive-in-code instead of defensive-in-docs is a really
> strong argument, so I have pushed it as such.

I notice warnings that I think are caused by this patch on some buildfarm
members, eg

 drongo| 2020-11-15 06:59:05 | 
C:\prog\bf\root\HEAD\pgsql.build\src\port\pg_strong_random.c(96,11): warning 
C4013: 'RAND_poll' undefined; assuming extern returning int 
[C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
 drongo| 2020-11-15 06:59:05 | 
C:\prog\bf\root\HEAD\pgsql.build\src\port\pg_strong_random.c(96,11): warning 
C4013: 'RAND_poll' undefined; assuming extern returning int 
[C:\prog\bf\root\HEAD\pgsql.build\libpgport.vcxproj]

(bowerbird and hamerkop are showing the same).

My first thought about it was that this bit is busted:

+#ifndef USE_OPENSSL
+   RAND_poll();
+#endif

The obvious problem with this is that if !USE_OPENSSL, we will not have
pulled in openssl's headers.

However ... all these machines are pointing at line 96, which is not
that one but the one under "#if defined(USE_OPENSSL)".  So I'm not sure
what to make of that, except that a bit more finesse seems required.

regards, tom lane




Re: list of extended statistics on psql

2020-11-15 Thread Tomas Vondra
Thanks,

It's better to always post the whole patch series, so that cfbot can
test it properly. Sending just 0003 separately kind of breaks that.

Also, 0003 seems to only tweak the .sql file, not the expected output,
and there actually seem to be two places that mistakenly use \dx (which
lists extensions) instead of \dX. I've fixed both issues in the
attached patches.

However, I think the 0002 tests are better/sufficient - I prefer to keep
it compact, not interleaving with the tests testing various other stuff.
So I don't intend to commit 0003, unless there's something that I don't
see for some reason.

The one remaining thing I'm not sure about is the naming of the columns with
the sizes of statistics - N_size, D_size and M_size do not seem very clear.
Any clearer naming would, however, make the table wider :-/


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From e552a17a9c74952a002e6dabe8d57c9ab787addb Mon Sep 17 00:00:00 2001
From: Tatsuro Yamada 
Date: Mon, 9 Nov 2020 11:25:14 +0900
Subject: [PATCH 1/3] Add \dX command on psql

---
 doc/src/sgml/ref/psql-ref.sgml |  14 +
 src/bin/psql/command.c |   3 +
 src/bin/psql/describe.c| 100 +
 src/bin/psql/describe.h|   3 +
 src/bin/psql/help.c|   1 +
 src/bin/psql/tab-complete.c|   4 +-
 6 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 221a967bfe..fd860776af 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1918,6 +1918,20 @@ testdb=>
 
 
   
+  
+  
+\dX [ pattern ]
+
+
+Lists extended statistics.
+If pattern
+is specified, only those extended statistics whose names match the pattern
+are listed.
+If + is appended to the command name, each extended statistics
+is listed with its size.
+
+
+  
 
   
 \dy[+] [ pattern ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index c7a83d5dfc..c6f1653cb7 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -930,6 +930,9 @@ exec_command_d(PsqlScanState scan_state, bool active_branch, const char *cmd)
 else
 	success = listExtensions(pattern);
 break;
+			case 'X':			/* Extended Statistics */
+success = listExtendedStats(pattern, show_verbose);
+break;
 			case 'y':			/* Event Triggers */
 success = listEventTriggers(pattern, show_verbose);
 break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 07d640021c..1807324542 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4397,6 +4397,106 @@ listEventTriggers(const char *pattern, bool verbose)
 	return true;
 }
 
+/*
+ * \dX
+ *
+ * Briefly describes extended statistics.
+ */
+bool
+listExtendedStats(const char *pattern, bool verbose)
+{
+	PQExpBufferData buf;
+	PGresult   *res;
+	printQueryOpt myopt = pset.popt;
+
+	if (pset.sversion < 10)
+	{
+		char		sverbuf[32];
+
+		pg_log_error("The server (version %s) does not support extended statistics.",
+	 formatPGVersionNumber(pset.sversion, false,
+		   sverbuf, sizeof(sverbuf)));
+		return true;
+	}
+
+	initPQExpBuffer(&buf);
+	printfPQExpBuffer(&buf,
+	  "SELECT \n"
+	  "es.stxnamespace::pg_catalog.regnamespace::text AS \"%s\", \n"
+	  "es.stxname AS \"%s\", \n"
+	  "pg_catalog.format('%%s FROM %%s', \n"
+	  "  (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ') \n"
+	  "   FROM pg_catalog.unnest(es.stxkeys) s(attnum) \n"
+	  "   JOIN pg_catalog.pg_attribute a \n"
+	  "   ON (es.stxrelid = a.attrelid \n"
+	  "   AND a.attnum = s.attnum \n"
+	  "   AND NOT a.attisdropped)), \n"
+	  "es.stxrelid::regclass) AS \"%s\", \n"
+	  "CASE WHEN esd.stxdndistinct IS NOT NULL THEN 'built' \n"
+	  " WHEN 'd' = any(es.stxkind) THEN 'defined' \n"
+	  "END AS \"%s\", \n"
+	  "CASE WHEN esd.stxddependencies IS NOT NULL THEN 'built' \n"
+	  " WHEN 'f' = any(es.stxkind) THEN 'defined' \n"
+	  "END AS \"%s\", \n"
+	  "CASE WHEN esd.stxdmcv IS NOT NULL THEN 'built' \n"
+	  " WHEN 'm' = any(es.stxkind) THEN 'defined' \n"
+	  "END AS \"%s\"",
+	  gettext_noop("Schema"),
+	  gettext_noop("Name"),
+	  gettext_noop("Definition"),
+	  gettext_noop("N_distinct"),
+	  gettext_noop("Dependencies"),
+	  gettext_noop("Mcv"));
+
+	if (verbose)
+	{
+		appendPQExpBuffer(&buf,
+		  ",\nCASE WHEN esd.stxdndistinct IS NOT NULL THEN \n"
+		  "  pg_catalog.pg_size_pretty(pg_catalog.length(stxdndistinct)::bigint) \n"
+		  " WHEN 'd' = any(stxkind) THEN '0 bytes' \n"
+		  "END AS \"%s\", \n"
+		  "CASE WHEN esd.stxddependencies IS NOT NULL THEN \n"
+		  "  pg_catalog.pg_size_pretty(pg_catalog.length(stxddepen

Re: Online checksums patch - once again

2020-11-15 Thread Magnus Hagander
On Fri, Nov 13, 2020 at 12:22 PM Heikki Linnakangas  wrote:

> On 12/11/2020 15:17, Daniel Gustafsson wrote:
> >> On 5 Oct 2020, at 14:14, Heikki Linnakangas  wrote:
> >> I would expect the checksums worker to be automatically started at
> postmaster startup. Can we make that happen?
> >
> > A dynamic background worker has to be registered from a regular backend,
> so
> > it's not entirely clear to me where in startup processing that would take
> > place.  Do you have any good suggestions?
>
> Could you launch it from the startup process, in StartupXLOG()?
> Does it have to be dynamic?
>


If it's not dynamic, you can't start it from a regular backend, can you?  So
then you'd need a restart for it to happen?

As for launching it from the startup process I don't know, that might be a
viable path. The code specifically explains why it's not possible to launch
it from the postmaster, but I don't see anything that would make it
impossible from the startup process.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Online verification of checksums

2020-11-15 Thread Magnus Hagander
On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier  wrote:

> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> > I was referring to the patch I sent on this thread that fixes the
> > detection of a corruption for the zero-only case and where pd_lsn
> > and/or pg_upper are trashed by a corruption of the page header.  Both
> > cases allow a base backup to complete on HEAD, while sending pages
> > that could be corrupted, which is wrong.  Once you make the page
> > verification rely only on pd_checksum, as the patch does because the
> > checksum is the only source of truth in the page header, corrupted
> > pages are correctly detected, causing pg_basebackup to complain as it
> > should.  However, it has also the risk to cause pg_basebackup to fail
> > *and* to report as broken pages that are in the process of being
> > written, depending on how slow a disk is able to finish a 8kB write.
> > That's a different kind of wrongness, and users have two more reasons
> > to be pissed.  Note that if a page is found as torn we have a
> > consistent page header, meaning that on HEAD the PageIsNew() and
> > PageGetLSN() would pass, but the checksum verification would fail as
> > the contents at the end of the page does not match the checksum.
>
> Magnus, as the original committer of 4eb77d5, do you have an opinion
> to share?
>

I admit that at some point I lost track of the overlapping threads around
this, and just figured there were enough different checksum-involved people
on those threads to handle it :) Meaning the short answer is "no, I don't
really have one at this point".

The slightly longer answer is that it does seem reasonable, but I have not
read up on all the different issues discussed over the whole thread, so
take that as a weak-certainty comment.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Refactor pg_rewind code and make it work against a standby

2020-11-15 Thread Heikki Linnakangas

On 15/11/2020 09:07, Tom Lane wrote:

I wrote:

Not sure if you noticed, but piculet has twice failed the
007_standby_source.pl test that was added by 9c4f5192f:
...
Now, I'm not sure what to make of that, but I can't help noticing that
piculet uses --disable-atomics while francolin uses --disable-spinlocks.
That leads the mind towards some kind of low-level synchronization
bug ...


Or, maybe it's less mysterious than that.  The failure looks like we
have not waited long enough for the just-inserted row to get replicated
to node C.  That wait is implemented as

$lsn = $node_a->lsn('insert');
$node_b->wait_for_catchup('node_c', 'write', $lsn);

which looks fishy ... shouldn't wait_for_catchup be told to
wait for replay of that LSN, not just write-the-WAL?


Yep, quite right. Fixed that way, thanks for the debugging!

- Heikki
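As a side note, the write-vs-replay distinction behind this fix is easy to model outside the TAP suite. The snippet below is a toy Python sketch, not the actual Perl wait_for_catchup helper: StandbyState and the polling loop are invented for illustration. It shows why waiting on the write position can return before the standby has replayed the record, which is exactly the race the test hit.

```python
import dataclasses


@dataclasses.dataclass
class StandbyState:
    """Toy stand-in for one pg_stat_replication row: WAL written on the
    standby vs. WAL actually replayed (replay lags write)."""
    write_lsn: int
    replay_lsn: int


def wait_for_catchup(snapshots, target_lsn: int, mode: str = "replay") -> int:
    """Poll successive StandbyState snapshots until the chosen LSN reaches
    target_lsn; return how many polls it took."""
    for n, st in enumerate(snapshots, start=1):
        lsn = st.replay_lsn if mode == "replay" else st.write_lsn
        if lsn >= target_lsn:
            return n
    raise TimeoutError("standby never caught up")


# WAL arrives (write advances) one step before it is applied (replay):
snapshots = [StandbyState(100, 90), StandbyState(110, 100), StandbyState(110, 110)]
assert wait_for_catchup(iter(snapshots), 110, mode="write") == 2   # returns too early
assert wait_for_catchup(iter(snapshots), 110, mode="replay") == 3  # safe to query node C
```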




Re: More time spending with "delete pending"

2020-11-15 Thread Alexander Lakhin
15.11.2020 08:00, Alexander Lakhin wrote:
> And it raises another question: what if pg_ls_dir_files() is called for a
> directory where hundreds or thousands of files are really inaccessible
> due to restrictions?
> I mean that using CreateFile() in the modified stat() implementation can
> be rather expensive for an arbitrary file (and worse yet, for many files).
> On the positive side, for now pg_ls_dir_files() is called only from
> pg_ls_logdir, pg_ls_waldir, pg_ls_tmpdir, pg_ls_archive_statusdir, where
> having a bunch of files that are inaccessible for the postgres user is
> not expected anyway.
I had missed the fact that pg_ls_dir_files() just fails on the first error,
so the existing coding would not cause performance issues. But such
behavior is somewhat dubious, because a single inaccessible file in a
directory of interest would make any of those calls fail.
> But probably getting directory contents with correct file sizes (>4GB)
> in pg_ls_dir_files() can be implemented without calling
> CreateFile()/stat() at all (as ProcMon shows, the "dir" command doesn't
> call CreateFile() (or any other system function) for each file in the
> target directory).
As I've found out, the readdir() replacement for Windows in fact gets all
the needed information (correct file size, attributes, ...) in
WIN32_FIND_DATA, but it keeps it to itself and returns only the fields of
the dirent structure. So pg_ls_dir_files() (which calls
ReadDir()/readdir()) has to fetch this information again, which leads to
opening each file on Windows.
I think this can be done more efficiently by adding a new function,
ReadDirExtendedInfo(), which would additionally accept a pointer to a
"struct dirent_extra" with the fields {valid (indicating that the structure
is filled), attributes, file size, mtime}. Perhaps the extended function
could also invoke stat() internally (on non-Windows platforms).
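The proposed interface has a close stdlib analogue in Python that may help picture the intended behavior: os.scandir() carries the WIN32_FIND_DATA produced by the directory scan into each DirEntry, so entry.stat() on Windows usually needs no extra CreateFile()/stat() per file, while on POSIX it falls back to a real stat(). A minimal sketch (the tuple layout is illustrative only):

```python
import os
from datetime import datetime, timezone


def ls_dir_extended(path):
    """List regular files with size and mtime taken from the directory scan.
    On Windows, DirEntry.stat() is served from the cached WIN32_FIND_DATA,
    avoiding a per-file open -- the same economy ReadDirExtendedInfo()
    would give pg_ls_dir_files()."""
    results = []
    with os.scandir(path) as it:
        for entry in it:
            if not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)  # cached from the scan on Windows
            results.append((entry.name, st.st_size,
                            datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)))
    return sorted(results)
```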

As to the patch I proposed before, I think it's still needed to fully
support the following usage pattern:
stat();
if (no_error) {
    do_something();
} else if (file_not_found) {
    do_something_else();
} else {
    error();
}
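A hedged sketch of that pattern in Python terms (the names are illustrative; the point is only that file-not-found is separated from every other stat() failure):

```python
import os


def probe(path):
    """Three-way outcome the pattern above calls for: success,
    file-not-found (an expected, recoverable case), and any other
    error (which stays fatal and propagates to the caller)."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return ("absent", None)         # do_something_else()
    except OSError:
        raise                           # error(): permission problems etc.
    return ("present", st.st_size)      # do_something()
```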

Best regards,
Alexander




Re: Supporting = operator in gin/gist_trgm_ops

2020-11-15 Thread Erik Rijkers

On 2020-11-15 06:55, Alexander Korotkov wrote:


> Sorry to be nitpicking - it's the one thing I'm really good at :P


Hi Alexander,

The last touch... (there's still a 'the' missing)

thanks!

Erik Rijkers
--- doc/src/sgml/pgtrgm.sgml.orig	2020-11-15 08:00:54.607816533 +0100
+++ doc/src/sgml/pgtrgm.sgml	2020-11-15 08:17:23.243316332 +0100
@@ -421,7 +421,7 @@
trigram-based index searches for LIKE, ILIKE,
~ and ~* queries.  Beginning in
PostgreSQL 14, these indexes also support
-   equality operator (simple comparison operators are not supported).
+   the equality operator (simple comparison operators are not supported).
Note that those indexes may not be as efficient as regular B-tree indexes
for equality operator.
   


Re: Remove unused variable from SharedSort

2020-11-15 Thread Dilip Kumar
On Sun, Nov 15, 2020 at 12:50 PM Bharath Rupireddy
 wrote:
>
> On Thu, Nov 12, 2020 at 5:29 PM Dilip Kumar  wrote:
> >
> > While going through the code I noticed that the nTapes member in
> > SharedSort is unused.  This is just initialized with nworkers but
> > never used.  The attached patch removes this variable.
> >
>
> We could have used that variable for an assert like
> Assert(state->worker <= shared->nTapes) in worker_freeze_result_tape()
> before accessing shared->tapes[state->worker] = output, as
> state->worker is sometimes set to -1. But it seems we reach
> worker_freeze_result_tape() only when WORKER(state) is true, so we
> don't need that extra Assert, and removing the nTapes variable makes
> sense to me.

Right, but anyway IMHO adding extra shared-memory variables just for
assert purposes doesn't make sense.

> Patch looks good to me. Regression tests make check and make
> check-world ran successfully.

Thanks for looking into this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Terminate the idle sessions

2020-11-15 Thread Li Japin

On Nov 13, 2020, at 6:27 PM, 
kuroda.hay...@fujitsu.com wrote:


I read your patch, and I think the documentation is too terse to head off
all the problems.
(I think the same problem will occur if any connection pooling is used.)
Could you add some explanation to the doc file? I made an example:

```
Note that this value should be set to zero if you use postgres_fdw or some
connection-pooling software, because connections might be closed unexpectedly.
```

Thanks for your advice! Attached v4.

--
Best regards
Japin Li



v4-0001-Allow-terminating-the-idle-sessions.patch


v4-0002-Optimize-setitimer-usage.patch