Proposal of new PostgreSQL Extension - PGSpiderExt

2020-09-09 Thread Taiga KATAYAMA

I would like to propose a new PostgreSQL extension, PGSpiderExt.

* What is PGSpiderExt
This extension makes it possible to treat multiple tables having the same
schema as a single virtual table.
We call this a multi-tenant table.
If a foreign table has a key column identifying each source table, you can
achieve this with the conventional declarative partitioning feature.
But even if the foreign table does not have such a key column, this extension
can create the virtual table without changing the table structure on the
foreign side.

* Example of execution
Data of foreign table1:
  i | t
----+---
 10 | a
 11 | b

Data of foreign table2:
  i | t
----+---
 20 | c
 21 | d

(Example 1) Query and result for the multi-tenant table:
't1' is a multi-tenant table comprising 'table1' and 'table2'.
'node' is the key column, whose values 'node1' and 'node2' represent
'table1' and 'table2' respectively.

SELECT * FROM t1;
  i | t | node
----+---+-------
 10 | a | node1
 11 | b | node1
 20 | c | node2
 21 | d | node2

(Example 2) Query and result for the multi-tenant table:
SELECT * FROM t1 WHERE node = 'node1';
  i | t | node
----+---+-------
 10 | a | node1
 11 | b | node1


* How to create a multi-tenant table
pgspider_ext is a foreign data wrapper for creating intermediate tables
between a partitioned parent table and foreign tables.
First, you create foreign tables using a data source FDW such as postgres_fdw
as usual.
Then you create a partitioned table using the declarative partitioning feature.
This table has a partition key column of text type in addition to the same
columns as the foreign tables, and is partitioned BY LIST.
After that, you create the partition's child tables using pgspider_ext.
You can define the partition key values arbitrarily.

* Internal mechanism of accessing a multi-tenant table
The PostgreSQL core separates a query into queries on the intermediate tables
via the partitioning feature.
pgspider_ext receives and analyzes each such query, then passes query
information to the data source FDW.
More specifically, pgspider_ext builds the information the data source FDW
needs to create a plan, and then creates the intermediate table's plan based
on the plan created by the data source FDW.
At that point, pgspider_ext does not pass query information about the
partition key (such as the target list and WHERE conditions) to the data
source FDW.
When executing the query, the data source FDW accesses the foreign server and
fetches a result set.
pgspider_ext receives it and returns it to the PostgreSQL core, adding the
partition key information.
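
In FDW-callback terms, the delegation could look roughly like this (a
schematic sketch only, not actual pgspider_ext code; lookup_child_table() is
an invented helper, while GetFdwRoutineByRelId() and the callback signature
come from PostgreSQL's fdwapi.h):

/* Schematic sketch: an intermediate FDW delegating planning to the
 * data source FDW that manages the underlying foreign table. */
static void
pgspiderGetForeignRelSize(PlannerInfo *root, RelOptInfo *baserel,
                          Oid foreigntableid)
{
    Oid         child_relid = lookup_child_table(foreigntableid);
    FdwRoutine *child_fdw = GetFdwRoutineByRelId(child_relid);

    /* Quals and target-list entries that mention the partition key
     * column are stripped before delegating, as described above. */
    child_fdw->GetForeignRelSize(root, baserel, child_relid);
}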

* Example of usage
CREATE EXTENSION pgspider_ext;
CREATE EXTENSION postgres_fdw;

-- Create a server for pgspider_ext.
CREATE SERVER mtsrv FOREIGN DATA WRAPPER pgspider_ext;

-- Define data sources. pgsrv1 has a table 't1_pg1' and pgsrv2 has a table
-- 't1_pg2'.
-- These tables have 2 columns: 'i' integer and 't' text.
CREATE SERVER pgsrv1 FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host '127.0.0.1', port '5433', dbname 'postgres');
CREATE SERVER pgsrv2 FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host '127.0.0.1', port '5434', dbname 'postgres');

CREATE USER MAPPING FOR CURRENT_USER SERVER mtsrv;
CREATE USER MAPPING FOR CURRENT_USER SERVER pgsrv1
  OPTIONS (user 'user', password 'pass');
CREATE USER MAPPING FOR CURRENT_USER SERVER pgsrv2
  OPTIONS (user 'user', password 'pass');

-- Create foreign tables as usual using the data source FDW.
CREATE FOREIGN TABLE t1_pg1_ft (i int, t text) SERVER pgsrv1
  OPTIONS (table_name 't1_pg1');
CREATE FOREIGN TABLE t1_pg2_ft (i int, t text) SERVER pgsrv2
  OPTIONS (table_name 't1_pg2');

-- Define a partitioned table and child tables using pgspider_ext.
-- The partition key column is 'node', which does not exist on the foreign tables.
CREATE TABLE t1 (i int, t text, node text) PARTITION BY LIST (node);
CREATE FOREIGN TABLE t1_pg1_tenant PARTITION OF t1 FOR VALUES IN ('node1')
  SERVER mtsrv OPTIONS (child_name 't1_pg1_ft');
CREATE FOREIGN TABLE t1_pg2_tenant PARTITION OF t1 FOR VALUES IN ('node2')
  SERVER mtsrv OPTIONS (child_name 't1_pg2_ft');

Then you can access t1 with SELECT queries.

We hope this extension can be incorporated into PostgreSQL as a contrib
module, and we would like to propose it at a Commitfest.
Could you kindly advise us and share your opinions?

Regards,
Taiga Katayama




Re: PATCH: Attempt to make dbsize a bit more consistent

2020-09-09 Thread David Zhang



On 2020-09-09 12:41 a.m., gkokola...@pm.me wrote:

‐‐‐ Original Message ‐‐‐
On Tuesday, 8 September 2020 22:26, David Zhang  wrote:



I found that the function "table_relation_size" is only used by the buffer
manager for "RELKIND_RELATION", "RELKIND_TOASTVALUE" and
"RELKIND_MATVIEW", i.e.

         case RELKIND_RELATION:
         case RELKIND_TOASTVALUE:
         case RELKIND_MATVIEW:
             {
                 /*
                  * Not every table AM uses BLCKSZ wide fixed size blocks.
                   * Therefore tableam returns the size in bytes - but for the
                   * purpose of this routine, we want the number of blocks.
                  * Therefore divide, rounding up.
                  */
                 uint64        szbytes;

                 szbytes = table_relation_size(relation, forkNum);

                 return (szbytes + (BLCKSZ - 1)) / BLCKSZ;
             }

So using "calculate_relation_size" and "calculate_toast_table_size" in
"calculate_table_size" is easy to understand and the original logic is
simple.


You are correct. This is the logic that the patch attempts to apply
in dbsize.c.

So what do you think of the patch?


I would suggest keeping the original logic unless there is a bug.

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca





Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Amit Kapila
On Wed, Sep 9, 2020 at 9:37 PM Fujii Masao  wrote:
>
> On 2020/09/09 22:57, Magnus Hagander wrote:
> > On Wed, Sep 9, 2020 at 3:56 PM Tomas Vondra wrote:
> >
> > On Wed, Sep 09, 2020 at 03:53:40PM +0530, Amit Kapila wrote:
> >  >On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander wrote:
> >  >>
> >  >> On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila wrote:
> >  >>>
> >  >>
> >  >> Though in fact the one inconsistent place in the code now is that 
> > if it is corrupt in the db entry part of the file it returns true and the 
> > global timestamp, which I would argue is perhaps incorrect and it should 
> > return false.
> >  >>
> >  >
> >  >Yeah, this is exactly the case I was pointing out where we return true
> >  >before the patch, basically the code below:
> >  >case 'D':
> >  >if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
> >  >  fpin) != offsetof(PgStat_StatDBEntry, tables))
> >  >{
> >  >ereport(pgStatRunningInCollector ? LOG : WARNING,
> >  >(errmsg("corrupted statistics file \"%s\"",
> >  >statfile)));
> >  >goto done;
> >  >}
> >  >
> >  >done:
> >  >FreeFile(fpin);
> >  >return true;
> >  >
> >  >Now, if we decide to return 'false' here, then surely there is no
> >  >argument and we should return false in other cases as well. Basically,
> >  >I think we should be consistent in handling the corrupt file case.
> >  >
> >
> > FWIW I do agree with this - we should return false here, to make it
> > return false like in the other data corruption cases. AFAICS that's the
> > only inconsistency here.
> >
> >
> > +1, I think that's the place to fix, rather than reversing all the other 
> > places.
>
> +1 as I suggested upthread!
>

Please find the patch attached based on the above discussion. I have
slightly adjusted the comments.
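
For anyone following along, the shape of the change is roughly this, in
pgstat_read_db_statsfile_timestamp()'s corrupt-entry path (a sketch; the
attached patch is authoritative):

			case 'D':
				if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
						  fpin) != offsetof(PgStat_StatDBEntry, tables))
				{
					ereport(pgStatRunningInCollector ? LOG : WARNING,
							(errmsg("corrupted statistics file \"%s\"",
									statfile)));
					FreeFile(fpin);
					return false;	/* was: goto done, which returned true */
				}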

-- 
With Regards,
Amit Kapila.


v2-0001-Fix-inconsistency-in-determining-the-timestamp-of.patch
Description: Binary data


Re: extension patch of CREATE OR REPLACE TRIGGER

2020-09-09 Thread Peter Smith
On Thu, Sep 10, 2020 at 12:34 PM osumi.takami...@fujitsu.com
 wrote:
> I attached the v11 patch.

The v11 patch looked OK to me.

Since I have no more review comments I am marking this as "ready for committer".

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: please update ps display for recovery checkpoint

2020-09-09 Thread Michael Paquier
On Wed, Sep 09, 2020 at 09:00:50PM -0500, Justin Pryzby wrote:
> What would you want the checkpointer's ps to say ?
> 
> Normally it just says:
> postgres  3468  3151  0 Aug27 ?        00:20:57 postgres: checkpointer
> 

Note that CreateCheckPoint() can also be called from the startup
process if the bgwriter has not been launched once recovery finishes.

> Or do you mean do the same thing as now, but one layer lower, like:
>
> @@ -8728,6 +8725,9 @@ CreateCheckPoint(int flags)
> +   if (flags & CHECKPOINT_END_OF_RECOVERY)
> +   set_ps_display("recovery checkpoint");

For the use-case discussed here, that would be fine.  Now the
difficult point is how much information we can actually display
without bloating ps while still having something meaningful.  Showing
all the information from LogCheckpointStart() would bloat the output a
lot, for example.  So, thinking about that, my take would be to have ps
display the following at the beginning of CreateCheckPoint() and
CreateRestartPoint():
- restartpoint or checkpoint
- shutdown
- end-of-recovery

The output also needs to be cleared once the routines finish or if
there is a skip, of course.
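
Roughly like this at the top of those routines (a sketch; the exact
strings are up for discussion, and set_ps_display() is the existing
routine):

	if (flags & CHECKPOINT_END_OF_RECOVERY)
		set_ps_display("creating end-of-recovery checkpoint");
	else if (flags & CHECKPOINT_IS_SHUTDOWN)
		set_ps_display("creating shutdown checkpoint");
	else
		set_ps_display("creating checkpoint");

	/* ... checkpoint or restartpoint work ... */

	set_ps_display("");		/* clear again once done or skipped */
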
--
Michael




Re: Fix for parallel BTree initialization bug

2020-09-09 Thread Justin Pryzby
On Tue, Sep 08, 2020 at 06:25:03PM +, Jameson, Hunter 'James' wrote:
> Hi, I ran across a small (but annoying) bug in initializing parallel BTree 
> scans, which causes the parallel-scan state machine to get confused. The fix 
> is one line; the description is a bit longer—

What postgres version was this ?

> Before, function _bt_first() would exit immediately if the specified scan 
> keys could never be satisfied--without notifying other parallel workers, if 
> any, that the scan key was done. This moved that particular worker to a scan 
> key beyond what was in the shared parallel-query state, so that it would 
> later try to read in "InvalidBlockNumber", without recognizing it as a 
> special sentinel value.
> 
> The basic bug is that the BTree parallel query state machine assumes that a 
> worker process is working on a key <= the global key--a worker process can be 
> behind (i.e., hasn't finished its work on a previous key), but never ahead. 
> By allowing the first worker to move on to the next scan key, in this one 
> case, without notifying other workers, the global key ends up < the first 
> worker's local key.
> 
> Symptoms of the bug are: on a read-only (R/O) instance, we get an error
> saying we can't extend the index relation, while on a read/write (R/W)
> instance we just extend the index relation by 1 block.

What's the exact error ?  Are you able to provide a backtrace ?

> To reproduce, you need a query that:
> 
> 1. Executes parallel BTree index scan;
> 2. Has an IN-list of size > 1;

Do you mean you have an index on col1 and a query condition like: col1 IN 
(a,b,c...) ?

> 3. Has an additional index filter that makes it impossible to satisfy the
> first IN-list condition.

.. AND col1::text||'foo' = '';
I think you mean that the "impossible" condition makes it so that a btree
worker exits early.
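
(For what it's worth, a fix of the shape you describe would presumably look
something like the following in _bt_first() - just a sketch, since the patch
itself isn't quoted here:)

	if (!so->qual_ok)
	{
		/* Notify other parallel workers before bailing out, so the
		 * shared scan-key state can't end up behind this worker's. */
		_bt_parallel_done(scan);
		return false;
	}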

> (We encountered such a query, and therefore the bug, on a production 
> instance.)

Could you send the "shape" of the query or its plan, obfuscated and redacted as
need be ?

-- 
Justin




Re: Proposals for making it easier to write correct bgworkers

2020-09-09 Thread Pavel Stehule
On Thu, Sep 10, 2020 at 5:02 AM, Craig Ringer wrote:

> [Craig Ringer's proposal, quoted in full; see his original post
> "Proposals for making it easier to write correct bgworkers" below.]

Re: Resetting spilled txn statistics in pg_stat_replication

2020-09-09 Thread Amit Kapila
On Wed, Sep 9, 2020 at 3:20 PM Amit Kapila  wrote:
>
> On Tue, Sep 8, 2020 at 7:02 PM Amit Kapila  wrote:
> >
> > Comments on the latest patch:
> > =
> >
>
> Apart from the comments I gave yesterday, another thing I was
> wondering is how to write some tests for this patch. The two ideas I
> could think of are as follows:
>
> 1. One idea was to keep these stats for each WALSender as it was in
> the commit that we reverted as b074813d48. If we had that then we can
> query the stats for tests added in commit 58b5ae9d62. I am not sure
> whether we want to display it in view pg_stat_replication but it would
> be a really good way to test the streamed and serialized transactions
> in a predictable manner.
>
> 2. Then the second way is to try doing something similar to what we do
> in src/test/regress/sql/stats.sql
>
> I think we should do both if possible.
>

I have made a few comment changes on top of your last version. If you
are fine with these then include them with the next version of your
patch.

-- 
With Regards,
Amit Kapila.


cosmetic_fixes_1.patch
Description: Binary data


Proposals for making it easier to write correct bgworkers

2020-09-09 Thread Craig Ringer
Hi all

As I've gained experience working on background workers, it's become
increasingly clear that they're a bit too different to normal backends for
many nontrivial uses.

I thought I'd take a moment to note some of it here, along with some
proposals for things we could potentially do to make it much easier to use
bgworkers correctly especially when using them to run queries.

This is NOT A PATCH SET. It's a set of discussion proposals and it's also
intended as a bit of a helper for people just getting started on bgworkers.
There are a lot of subtle differences in the runtime environment a basic
bgworker provides vs the runtime environment extension authors will be used
to when writing fmgr-callable C functions.

(It looks like pg12 and pg13 have some improvements, so some of the issues
I was going to mention with error cleanup paths and locking aren't relevant
anymore.)

DIFFERENCES WHEN CODING FOR BGWORKERS


Some of the most important differences vs normal backends that I've noticed
are:

* There's no supported way to recover from an error and continue execution.
Each bgworker has to roll its own logic based on PostgresMain()'s
sigsetjmp() handler and maintain it.
* The bgworker infrastructure doesn't set up the application_name in
PGPROC
* Example bgworkers roll their own signal handlers rather than using die,
procsignal_sigusr1_handler, and PostgresSigHupHandler. This makes them seem
to work OK, but fail to respect CHECK_FOR_INTERRUPTS() and other expected
behaviour when they call into the executor etc.
* Each bgworker has to roll its own mainloop with ConfigReloadPending /
ProcessConfigFile() etc, and this isn't illustrated at all in the examples
* Example workers don't interact with the pgstat subsystem

Managing bgworker lifecycle is also quite difficult. The postmaster doesn't
offer any help and we don't expose much help from core. Any extension that
spawns workers needs to do its own handling of worker startup registration
in shmem, worker exit reaping, handling workers that exit before
registering, etc. This can be seen in the logical replication code, which
rolls its own management per src/backend/replication/logical/launcher.c .

PROPOSED CHANGES IN EXAMPLE BGWORKERS/CONTRIBS
-

So I'd like to make these changes to example bgworkers and to contrib
extensions:

* Change example bgworkers in contrib to set up normal default signal
handlers, not define their own
* Add ConfigReloadPending and ProcessConfigFile() to all example bgworkers
* Show pgstat use in example bgworkers

PROPOSED BGW SETUP AND MAINLOOP HELPERS


I'm wondering if anyone thinks it's a good idea to add some bgworker
mainloop helper API, where one call is made at the start of the bgw, before
BackgroundWorkerInitializeConnection(), to:

* set application_name
* set process title
* set up default signal handlers and unblock signals
* creates and assigns a BGWMemoryContext named after the bgw
* set state to idle in pgstat
* sets up MyProcPort (or teach InitPostgres to do so for a bgw)
* takes a function pointer for a before_shmem_exit cleanup handler
  to encourage ext authors to use this approach

then a mainloop helper that:

* Checks that CurrentMemoryContext == BGWMemoryContext
* Runs CHECK_FOR_INTERRUPTS()
* Checks for postmaster death and exits
* Checks for and performs config file reload
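
To make that concrete, I imagine usage looking something like this (every
StandardSetup/MainLoopCheck name here is invented for illustration; only
BackgroundWorkerInitializeConnection(), WaitLatch() and ResetLatch() are
existing APIs):

/* Hypothetical sketch only: BackgroundWorkerStandardSetup() and
 * BackgroundWorkerMainLoopCheck() do not exist today. */
void
my_worker_main(Datum main_arg)
{
	/* would set appname, ps title, signal handlers, BGWMemoryContext,
	 * pgstat state, and register a before_shmem_exit callback */
	BackgroundWorkerStandardSetup("my_worker", my_cleanup_callback);

	BackgroundWorkerInitializeConnection("postgres", NULL, 0);

	for (;;)
	{
		/* would run CHECK_FOR_INTERRUPTS(), reload config if
		 * signalled, and exit on postmaster death */
		BackgroundWorkerMainLoopCheck();

		/* ... do one unit of work ... */

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
	}
}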



PROPOSED ERROR HANDLING HELPER


It'd also be beneficial to have a helper for bgw mainloops with error
recovery, generalizing and extracting the pattern I see in all these
routines:

src/backend/postmaster/autovacuum.c=AutoVacLauncherMain
src/backend/postmaster/autovacuum.c=AutoVacWorkerMain
src/backend/postmaster/bgworker.c=StartBackgroundWorker
src/backend/postmaster/bgwriter.c=BackgroundWriterMain
src/backend/postmaster/checkpointer.c=CheckpointerMain
src/backend/postmaster/walwriter.c=WalWriterMain
src/backend/tcop/postgres.c=PostgresMain

which all do their own error reset loops. There's way too much copy/paste
going on there IMO, even before we consider bgworkers, and it's nearly
impossible for anyone who isn't quite an experienced Pg developer to write
a bgw that can do anything with errors except promote ERROR to FATAL and
die.

This would be modeled on PostgresMain(). The worker would only need to add
its own logic to the sigsetjmp() error jump path, not duplicate all the
core cleanup work.
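
(For anyone unfamiliar, the pattern being duplicated is roughly this, heavily
simplified from PostgresMain():)

	sigjmp_buf	local_sigjmp_buf;

	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
	{
		/* error recovery: the part every mainloop re-implements */
		error_context_stack = NULL;
		HOLD_INTERRUPTS();
		EmitErrorReport();
		AbortCurrentTransaction();
		MemoryContextSwitchTo(TopMemoryContext);
		FlushErrorState();
		RESUME_INTERRUPTS();
	}

	/* arm the jump target used by elog(ERROR) */
	PG_exception_stack = &local_sigjmp_buf;

	for (;;)
	{
		/* normal processing */
	}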

PROPOSED GENERALISED WORKER MANAGEMENT


Finally I'm wondering if there's any interest in generalizing the logical
rep worker management for other bgworkers. I've done a ton of work with
worker management and it's something I'm sure I could take on but I don't
want to write it without knowing there's some chance of acceptance.

The general idea is to provide a way for bgworkers to start up managers for
pools / sets of workers. They launch them and have a function they can call
in their mainloop that watches their child worker states, invoking
callbacks when they fail to launch, launch successfully, exit cleanly 

RE: SIGQUIT handling, redux

2020-09-09 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> This is straying a bit from the stated topic of this thread, but ...
> I did some further looking around to see whether there were any
> unsafe signal handlers besides SIGQUIT ones.  The situation is not
> too awful, but I did find several issues not already mentioned
> in this thread:

Wow, you caught that many issues.  (I had only been wondering about one or
two of them.)

I'm not sure if this is related, but I have been wondering why the following
behave the way they do now.


(1)
When logical replication is used, pg_ctl stop with the default fast mode emits
a message about the termination of the logical replication launcher.  Although
it's not FATAL or ERROR, I was a bit startled when I saw this message for the
first time.  Why should this message be emitted while nothing about the other
postmaster children is reported?  The logical rep launcher ignores SIGINT
(SIG_IGN).

LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  background worker "logical replication launcher" (PID 10056) exited with 
exit code 1
LOG:  shutting down
LOG:  database system is shut down


(2)
When the physical standby stops, a FATAL message is output.  It may be
annoying to a DBA who monitors messages of high severity.

LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating walreceiver process due to administrator command
LOG:  shutting down
LOG:  database system is shut down


(3)
When WaitLatch(EXIT_ON_POSTMASTER_DEATH) detects postmaster death, it calls 
proc_exit(1).  Can we call _exit(1) here likewise?


Regards
Takayuki Tsunakawa






Re: Division in dynahash.c due to HASH_FFACTOR

2020-09-09 Thread Thomas Munro
On Tue, Sep 8, 2020 at 11:17 PM Jakub Wartak  wrote:
> I agree with both; I just thought it might be an interesting finding, as
> this idiv might be (?) present in other common paths like ReadBuffer*() /
> PinBuffer() (some recent discussions, maybe on NUMA boxes), not just WAL
> recovery, and it seems relatively easy to improve.

I wrote a draft commit message for Jakub's proposed change (attached),
and did a little bit of testing, but I haven't seen a case where it
wins yet; I need to work on something else now but thought I'd share
this much anyway.  One observation is that the rule in init_htab had
better agree with the expansion rule in hash_search_with_hash_value,
otherwise you can size your hash table perfectly for n elements and
then it still has to expand when you insert the nth element, which is
why I changed >= to >.  Does this make sense?  Oh man, I don't like
the mash-up of int, long, uint32, and Size that dynahash uses in various
places, which is brought into relief by that comparison...
From e5d890bc1178861dc1a5b1f430289a5ee2b4a2fa Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 10 Sep 2020 12:27:25 +1200
Subject: [PATCH] Remove custom fill factor support from dynahash.c.

Since ancient times we have had support for a fill factor (maximum load
factor) to be set for a dynahash hash table, but:

1.  It had to be an integer value >= 1, whereas for in memory hash
tables interesting load factor targets are probably somewhere near the
0.75-1.0 range.

2.  It was implemented in a way that performed an expensive division
operation that regularly showed up in profiles.

3.  We are not aware of anyone ever having used a non-default value.

Therefore, remove support, making a fill factor of 1 the implicit value.

Author: Jakub Wartak 
Reviewed-by: Alvaro Herrera 
Reviewed-by: Tomas Vondra 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/VI1PR0701MB696044FC35013A96FECC7AC8F62D0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 src/backend/utils/hash/dynahash.c | 15 ---
 src/include/utils/hsearch.h   |  2 --
 2 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index f4fbccdd7e..4aed0114fb 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -191,7 +191,6 @@ struct HASHHDR
 	Size		keysize;		/* hash key length in bytes */
 	Size		entrysize;		/* total user element size in bytes */
 	long		num_partitions; /* # partitions (must be power of 2), or 0 */
-	long		ffactor;		/* target fill factor */
 	long		max_dsize;		/* 'dsize' limit if directory is fixed size */
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
@@ -497,8 +496,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		/* ssize had better be a power of 2 */
 		Assert(hctl->ssize == (1L << hctl->sshift));
 	}
-	if (flags & HASH_FFACTOR)
-		hctl->ffactor = info->ffactor;
 
 	/*
 	 * SHM hash tables have fixed directory size passed by the caller.
@@ -603,8 +600,6 @@ hdefault(HTAB *hashp)
 
 	hctl->num_partitions = 0;	/* not partitioned */
 
-	hctl->ffactor = DEF_FFACTOR;
-
 	/* table has no fixed maximum size */
 	hctl->max_dsize = NO_MAX_DSIZE;
 
@@ -670,11 +665,10 @@ init_htab(HTAB *hashp, long nelem)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
 	/*
-	 * Divide number of elements by the fill factor to determine a desired
-	 * number of buckets.  Allocate space for the next greater power of two
-	 * number of buckets
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
 	 */
-	nbuckets = next_pow2_int((nelem - 1) / hctl->ffactor + 1);
+	nbuckets = next_pow2_int(nelem);
 
 	/*
 	 * In a partitioned table, nbuckets must be at least equal to
@@ -733,7 +727,6 @@ init_htab(HTAB *hashp, long nelem)
 			"DIRECTORY SIZE  ", hctl->dsize,
 			"SEGMENT SIZE", hctl->ssize,
 			"SEGMENT SHIFT   ", hctl->sshift,
-			"FILL FACTOR ", hctl->ffactor,
 			"MAX BUCKET  ", hctl->max_bucket,
 			"HIGH MASK   ", hctl->high_mask,
 			"LOW  MASK   ", hctl->low_mask,
@@ -975,7 +968,7 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * order of these tests is to try to check cheaper conditions first.
 		 */
 		if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
-			hctl->freeList[0].nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
+			hctl->freeList[0].nentries > (long) (hctl->max_bucket + 1) &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
 	}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index f1deb9beab..bebf89b3c4 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -68,7 +68,6 @@ typedef struct HASHCTL
 	long		ssize;			/* segment size */
 	long		dsize;			/* (initial) directory size */
 	long		max_dsize;		/* limit to dsize if dir size is limited */
-	long		ffactor;		/* fill factor */
 	Size		

RE: extension patch of CREATE OR REPLACE TRIGGER

2020-09-09 Thread osumi.takami...@fujitsu.com
Hello, Peter-San


> > That's a great idea. I've applied this idea to the latest patch v10.
> 
> 
> 
> COMMENT create_trigger.sgml (typo/wording)
> 
> "vise versa" -> "vice versa"
Sorry, and thank you for pointing all of these out.

> BEFORE
> You cannot replace triggers with a different type of trigger, that means it is
> impossible to replace regular trigger with constraint trigger and vise versa.
> 
> AFTER (suggestion)
> You cannot replace triggers with a different type of trigger. That means it is
> impossible to replace a regular trigger with a constraint trigger, and vice 
> versa.
Thank you. Your suggestion is indeed better.

I attached the v11 patch.

Regards,
Takamichi Osumi


CREATE_OR_REPLACE_TRIGGER_v11.patch
Description: CREATE_OR_REPLACE_TRIGGER_v11.patch


Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
I wrote:
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.

This is straying a bit from the stated topic of this thread, but ...
I did some further looking around to see whether there were any
unsafe signal handlers besides SIGQUIT ones.  The situation is not
too awful, but I did find several issues not already mentioned
in this thread:

StartupProcShutdownHandler (SIGTERM)

This conditionally calls proc_exit(1).  The conditions boil down
to are-we-interrupting-a-system(3)-call, so who knows how safe
that is?  I wouldn't care to bet that system() doesn't use malloc,
for instance.  Still, the odds are very good that if a signal did
arrive, it'd be interrupting system()'s waitpid() or equivalent
kernel call, which is likely safe enough.

bgworker_die (SIGTERM)

Calls ereport(FATAL).  This is surely not any safer than, say,
quickdie().  No, it's worse, because at least that won't try
to go out via proc_exit().

FloatExceptionHandler (SIGFPE)

Calls ereport(ERROR).  This might be okay, though, since the
trap should be synchronous with the offending calculation.
Besides, if you're risking divide by zero or the like in
critical code, You're Doing It Wrong.

RecoveryConflictInterrupt (called from SIGUSR1)

Calls a whole boatload of state tests that were never designed
to be interrupt-safe, such as transaction-state-related inquiries
in xact.c.  The lack of any explicit awareness in this code that
it's in a signal handler doesn't discourage people from inserting
even more dubious stuff.  I think this needs to be burned to the
ground and rewritten.

StandbyDeadLockHandler (from SIGALRM)
StandbyTimeoutHandler (ditto)

Calls CancelDBBackends, which just for starters tries to acquire
an LWLock.  I think the only reason we've gotten away with this
for this long is the high probability that by the time either
timeout fires, we're going to be blocked on a semaphore.

I don't have any ideas about how to fix any of these things,
but I thought getting the problems on the record would be good.

regards, tom lane




Re: please update ps display for recovery checkpoint

2020-09-09 Thread Justin Pryzby
On Mon, Aug 31, 2020 at 03:52:44PM +0900, Michael Paquier wrote:
> On Thu, Aug 20, 2020 at 05:09:05PM +0900, Michael Paquier wrote:
> > That could be helpful.  Wouldn't it be better to use "end-of-recovery
> > checkpoint" instead?  That's the common wording in the code comments.
> > 
> > I don't see the point of patch 0002.  In the same paragraph, we
> > already know that this applies to any checkpoints.
> 
> Thinking more about this..  Could it be better to just add some calls
> to set_ps_display() directly in CreateCheckPoint()?  This way, both
> the checkpointer as well as the startup process at the end of recovery
> would benefit from the change.

What would you want the checkpointer's ps to say ?

Normally it just says:
postgres  3468  3151  0 Aug27 ?        00:20:57 postgres: checkpointer

Or do you mean do the same thing as now, but one layer lower, like:

@@ -8728,6 +8725,9 @@ CreateCheckPoint(int flags)
+   if (flags & CHECKPOINT_END_OF_RECOVERY)
+   set_ps_display("recovery checkpoint");

-- 
Justin




RE: Global snapshots

2020-09-09 Thread tsunakawa.ta...@fujitsu.com
Hi Andrey san,

From: Andrey V. Lepikhov
> From: tsunakawa.ta...@fujitsu.com
> > While Clock-SI seems to be considered the best promising for global ...
> > Could you take a look at this patent?  I'm afraid this is the Clock-SI
> > for MVCC.  Microsoft holds this until 2031.  I couldn't find this with
> > the keyword "Clock-SI."
> >
> > US8356007B2 - Distributed transaction management for database systems
> > with multiversioning - Google Patents
> > https://patents.google.com/patent/US8356007
> >
> > If it is, can we circumvent this patent?
>
> Thank you for the research (and previous links too).
> I haven't seen this patent before. This should be carefully studied.
> I haven't seen this patent before. This should be carefully studied.

I wanted to ask about this after I had published the revised scale-out design
wiki, but I'm taking too long, so could you share your study results?  I think
we need to clear up the patent question before discussing the code.  After we
hear your opinion, we also have to check whether Clock-SI itself is patented,
or else avoid the patent by modifying part of the algorithm.  In case we
cannot use it at all, we have to proceed with thinking about alternatives.


Regards
Takayuki Tsunakawa




Re: extension patch of CREATE OR REPLACE TRIGGER

2020-09-09 Thread Peter Smith
On Wed, Sep 9, 2020 at 11:28 PM osumi.takami...@fujitsu.com
 wrote:
> That's a great idea. I've applied this idea to the latest patch v10.



COMMENT create_trigger.sgml (typo/wording)

"vise versa" -> "vice versa"

BEFORE
You cannot replace triggers with a different type of trigger, that
means it is impossible to replace regular trigger with constraint
trigger and vise versa.

AFTER (suggestion)
You cannot replace triggers with a different type of trigger. That
means it is impossible to replace a regular trigger with a constraint
trigger, and vice versa.



Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Transactions involving multiple postgres foreign servers, take 2

2020-09-09 Thread tsunakawa.ta...@fujitsu.com
Alexey-san, Sawada-san,
cc: Fujii-san,


From: Fujii Masao 
> But if we
> implement 2PC as the improvement on FDW independently from PostgreSQL
> sharding, I think that it's necessary to support other FDW. And this is our
> direction, isn't it?

I understand it the same way as Fujii-san.  2PC for FDW is itself useful, so I
think we should pursue a tidy FDW interface and good performance within the
FDW framework.  "Tidy" means that many other FDWs should be able to implement
it.  I guess XA/JTA is the only material we can use to consider whether the
FDW interface is good.


> Sawada-san's patch supports that case by implementing some components
> for that also in PostgreSQL core. For example, with the patch, all the remote
> transactions that participate in the transaction are managed by PostgreSQL
> core instead of the postgres_fdw layer.
> 
> Therefore, at least regarding the difference 2), I think that Sawada-san's
> approach is better. Thought?

I think so.  Sawada-san's patch needs to address the design issues I posed 
before digging into the code for thorough review, though.

BTW, is there something Sawada-san can take from Alexey-san's patch?  I'm 
concerned about the performance for practical use.  Do you two have differences 
in these points, for instance?  The first two items are often cited to evaluate 
the algorithm's performance, as you know.

* The number of round trips to remote nodes.
* The number of disk I/Os on each node and all nodes in total (WAL, two-phase 
file, pg_subtrans file, CLOG?).
* Are prepare and commit executed in parallel on remote nodes? (serious DBMSs 
do so)
* Is there any serialization point in the processing? (Sawada-san's has one)

I'm sorry to repeat myself, but I don't think we can compromise the 2PC 
performance.  Of course, we recommend users to design a schema that co-locates 
data that each transaction accesses to avoid 2PC, but it's not always possible 
(e.g., when secondary indexes are used.)

Plus, as the following quote from the TPC-C specification shows, TPC-C
requires 15% of (Payment?) transactions to do 2PC.  (I learned this from
Microsoft's, CockroachDB's, or Citus Data's sites.)


--
Independent of the mode of selection, the customer resident warehouse is the
home warehouse 85% of the time and is a randomly selected remote warehouse 15%
of the time.  This can be implemented by generating two random numbers x and y
within [1 .. 100];

. If x <= 85 a customer is selected from the selected district number (C_D_ID =
D_ID) and the home warehouse number (C_W_ID = W_ID).  The customer is paying
through his/her own warehouse.

. If x > 85 a customer is selected from a random district number (C_D_ID is
randomly selected within [1 .. 10]), and a random remote warehouse number
(C_W_ID is randomly selected within the range of active warehouses (see Clause
4.2.2), and C_W_ID ≠ W_ID).  The customer is paying through a warehouse and a
district other than his/her own.
--


Regards
Takayuki Tsunakawa




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
Here's a draft patch that I think would be reasonable to back-patch.
(Before v13, we'd need a bespoke SIGQUIT handler to substitute for
SignalHandlerForCrashExit, but that's easy enough.)

Aside from comment updates, this

* uses SignalHandlerForCrashExit for SIGQUIT

* renames startup_die per your request

* moves BackendInitialize's interrupt-re-disabling code up a bit to
reduce the scope where these interrupts are active.  This doesn't
make things a whole lot safer, but it can't hurt.

I'll take a closer look at the idea of using _exit(1) tomorrow,
but I'd be pretty hesitant to back-patch that.

regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 42223c0f61..e65086f7b4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -112,6 +112,7 @@
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
+#include "postmaster/interrupt.h"
 #include "postmaster/pgarch.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/syslogger.h"
@@ -405,7 +406,7 @@ static void SIGHUP_handler(SIGNAL_ARGS);
 static void pmdie(SIGNAL_ARGS);
 static void reaper(SIGNAL_ARGS);
 static void sigusr1_handler(SIGNAL_ARGS);
-static void startup_die(SIGNAL_ARGS);
+static void startup_packet_die(SIGNAL_ARGS);
 static void dummy_handler(SIGNAL_ARGS);
 static void StartupPacketTimeoutHandler(void);
 static void CleanupBackend(int pid, int exitstatus);
@@ -4340,22 +4341,30 @@ BackendInitialize(Port *port)
 	whereToSendOutput = DestRemote; /* now safe to ereport to client */
 
 	/*
-	 * We arrange for a simple exit(1) if we receive SIGTERM or SIGQUIT or
-	 * timeout while trying to collect the startup packet.  Otherwise the
-	 * postmaster cannot shutdown the database FAST or IMMED cleanly if a
-	 * buggy client fails to send the packet promptly.  XXX it follows that
-	 * the remainder of this function must tolerate losing control at any
-	 * instant.  Likewise, any pg_on_exit_callback registered before or during
-	 * this function must be prepared to execute at any instant between here
-	 * and the end of this function.  Furthermore, affected callbacks execute
-	 * partially or not at all when a second exit-inducing signal arrives
-	 * after proc_exit_prepare() decrements on_proc_exit_index.  (Thanks to
-	 * that mechanic, callbacks need not anticipate more than one call.)  This
-	 * is fragile; it ought to instead follow the norm of handling interrupts
-	 * at selected, safe opportunities.
-	 */
-	pqsignal(SIGTERM, startup_die);
-	pqsignal(SIGQUIT, startup_die);
+	 * We arrange to do proc_exit(1) if we receive SIGTERM or timeout while
+	 * trying to collect the startup packet; while SIGQUIT results in
+	 * _exit(2).  Otherwise the postmaster cannot shutdown the database FAST
+	 * or IMMED cleanly if a buggy client fails to send the packet promptly.
+	 *
+	 * XXX this is pretty dangerous; signal handlers should not call anything
+	 * as complex as proc_exit() directly.  We minimize the hazard by not
+	 * keeping these handlers active for longer than we must.  However, it
+	 * seems necessary to be able to escape out of DNS lookups as well as the
+	 * startup packet reception proper, so we can't narrow the scope further
+	 * than is done here.
+	 *
+	 * XXX it follows that the remainder of this function must tolerate losing
+	 * control at any instant.  Likewise, any pg_on_exit_callback registered
+	 * before or during this function must be prepared to execute at any
+	 * instant between here and the end of this function.  Furthermore,
+	 * affected callbacks execute partially or not at all when a second
+	 * exit-inducing signal arrives after proc_exit_prepare() decrements
+	 * on_proc_exit_index.  (Thanks to that mechanic, callbacks need not
+	 * anticipate more than one call.)  This is fragile; it ought to instead
+	 * follow the norm of handling interrupts at selected, safe opportunities.
+	 */
+	pqsignal(SIGTERM, startup_packet_die);
+	pqsignal(SIGQUIT, SignalHandlerForCrashExit);
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	PG_SETMASK();
 
@@ -4411,8 +4420,8 @@ BackendInitialize(Port *port)
 		port->remote_hostname = strdup(remote_host);
 
 	/*
-	 * Ready to begin client interaction.  We will give up and exit(1) after a
-	 * time delay, so that a broken client can't hog a connection
+	 * Ready to begin client interaction.  We will give up and proc_exit(1)
+	 * after a time delay, so that a broken client can't hog a connection
 	 * indefinitely.  PreAuthDelay and any DNS interactions above don't count
 	 * against the time limit.
 	 *
@@ -4434,6 +4443,12 @@ BackendInitialize(Port *port)
 	 */
 	status = ProcessStartupPacket(port, false, false);
 
+	/*
+	 * Disable the timeout, and prevent SIGTERM again.
+	 */
+	disable_timeout(STARTUP_PACKET_TIMEOUT, false);
+	PG_SETMASK();
+
 	/*
 	 * Stop here if it was bad or a cancel packet.  

Re: v13: show extended stats target in \d

2020-09-09 Thread Justin Pryzby
On Wed, Sep 09, 2020 at 07:22:30PM -0300, Alvaro Herrera wrote:
> On 2020-Sep-09, Justin Pryzby wrote:
> 
> > As for the discussion about a separator, I think maybe a comma is enough.  I
> > doubt anyone is going to think that you can get a valid command by prefixing
> > this by "CREATE STATISTICS".  Actually, it didn't even occur to me it was a
> > valid command without the stats target - after all, that's not true of
> > indexes.
> 
> By the way, I got into this because of your comment in
> https://postgr.es/m/20200901011429.gz5...@telsasoft.com on whether we
> needed CREATE STATISTICS to have a clause for this.

And I agree that it does not - the question was raised during development of
the feature, and nobody sees a need.  I asked since I didn't know and I didn't
want it to be missed for lack of asking.

-- 
Justin




Re: Optimising compactify_tuples()

2020-09-09 Thread Thomas Munro
On Thu, Sep 10, 2020 at 2:34 AM David Rowley  wrote:
> I think you were adequately caffeinated.  You're right that this is
> fairly simple to do, but it looks even simpler than looping twice over
> the array.  I think it's just a matter of looping over the itemidbase
> backwards and putting the higher itemid tuples at the end of the page.
> I've done it this way in the attached patch.

Yeah, I was wondering how to make as much of the algorithm work in a
memory-forwards direction as possible (even the item pointer access),
but it was just a hunch.  Once you have the adjacent-tuple merging
thing so you're down to just a couple of big memcpy calls, it's
probably moot anyway.

> I also added a presorted path which falls back on doing memmoves
> without the temp buffer when the itemidbase array indicates that
> higher lineitems all have higher offsets.  I'm doing the presort check
> in the calling function since that loops over the lineitems already.
> We can just memmove the tuples in reverse order without overwriting
> any yet to be moved tuples when these are in order.

Great.
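
Something like this is how I picture that presorted path, for anyone
following along (a sketch only; field names are assumptions loosely based on
bufpage.c's itemIdSortData):

	/* itemidbase is in line-pointer order; offsets ascend too, so moving
	 * highest-offset tuples first can never clobber an unmoved tuple */
	for (int i = nitems - 1; i >= 0; i--)
	{
		itemIdSort	itemidptr = &itemidbase[i];
		ItemId		lp = PageGetItemId(page, itemidptr->offsetindex + 1);

		upper -= itemidptr->alignedlen;
		if (upper != itemidptr->itemoff)
			memmove((char *) page + upper,
					(char *) page + itemidptr->itemoff,
					itemidptr->alignedlen);
		lp->lp_off = upper;		/* repoint the line pointer */
	}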

I wonder if we could also identify a range at the high end that is
already correctly sorted and maximally compacted so it doesn't even
need to be copied out.

+* Do the tuple compactification.  Collapse memmove calls for adjacent
+* tuples.

s/memmove/memcpy/

> With the attached v3, performance is better. The test now runs
> recovery 65.6 seconds, vs master's 148.5 seconds. So about 2.2x
> faster.

On my machine I'm seeing 57s, down from 86s on unpatched master, for
the simple pgbench workload from
https://github.com/macdice/redo-bench/.  That's not quite what you're
reporting but it still blows the doors off the faster sorting patch,
which does it in 74s.

> We should probably consider what else can be done to try to write
> pages with tuples for earlier lineitems earlier in the page.  VACUUM
> FULLs and friends will switch back to the opposite order when
> rewriting the heap.

Yeah, and also bulk inserts/COPY.  Ultimately if we flipped our page
format on its head that'd come for free, but that'd be a bigger
project with more ramifications.

So one question is whether we want to do the order-reversing as part
of this patch, or wait for a more joined-up project to make lots of
code paths collude on making scan order match memory order
(corellation = 1).  Most or all of the gain from your patch would
presumably still apply if did the exact opposite and forced offset
order to match reverse-item ID order (correlation = -1), which also
happens to be the initial state when you insert tuples today; you'd
still tend towards a state that allows nice big memmov/memcpy calls
during page compaction.  IIUC currently we start with correlation -1
and then tend towards correlation = 0 after many random updates
because we can't change the order, so it gets scrambled over time.
I'm not sure what I think about that.

PS You might as well post future patches with .patch endings so that
the cfbot tests them; it seems pretty clear now that a patch to
optimise sorting (as useful as it may be for future work) can't beat a
patch to skip it completely.  I took the liberty of switching the
author and review names in the commitfest entry to reflect this.




Re: v13: show extended stats target in \d

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Justin Pryzby wrote:

> As for the discussion about a separator, I think maybe a comma is enough.  I
> doubt anyone is going to think that you can get a valid command by prefixing
> this by "CREATE STATISTICS".  Actually, it didn't even occur to me it was a
> valid command without the stats target - after all, that's not true of
> indexes.

By the way, I got into this because of your comment in
https://postgr.es/m/20200901011429.gz5...@telsasoft.com on whether we
needed CREATE STATISTICS to have a clause for this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: v13: show extended stats target in \d

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Justin Pryzby wrote:

> As for the discussion about a separator, I think maybe a comma is enough.  I
> doubt anyone is going to think that you can get a valid command by prefixing
> this by "CREATE STATISTICS".  Actually, it didn't even occur to me it was a
> valid command without the stats target - after all, that's not true of
> indexes.
> 
> +"public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM 
> ab1, STATISTICS 0

I vote to use a semicolon instead of comma, and otherwise +1.

> This revision only shows the stats target in verbose mode (slash dee plus).

I don't think this interferes enough with other stuff to relegate it to
the "plus" version, since it's not shown if default.

Tomas -- this needs to go in pg13, right?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
I wrote:
>> Not only DNS, but all the various auth libraries would have to be
>> contended with.  Lots of work there compared to the likely rewards.

> Wait a minute.  The entire authentication cycle happens inside
> InitPostgres, using the backend's normal signal handlers.  So
> maybe we are overthinking the problem.  What if we simply postpone
> ProcessStartupPacket into that same place, and run it under the same
> rules as we do for authentication?

Or, continuing to look for other answers ...

During BackendInitialize we have not yet touched shared memory.
This is not incidental but must be so, per its header comment.
Therefore it seems like we could have these signal handlers (for
SIGTERM or timeout) do _exit(1), thereby resolving the signal
unsafety while not provoking a database restart.  We don't
care whether inside-the-process state is sane, and we shouldn't
have changed anything yet in shared memory.

This is obviously not 100.00% safe, since maybe something could
violate these assumptions, but it seems a lot safer than using
proc_exit() from a signal handler.

One way to help catch any such assumption-violations is to add
an assertion that no on_shmem_exit handlers have been set up yet.
If there are none, there should be no expectation of having to
clean up shared state.
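
Concretely, the startup-packet handlers would become trivial, along these
lines (a sketch; startup_packet_die is the name adopted in the other
subthread):

static void
startup_packet_die(SIGNAL_ARGS)
{
	/*
	 * We haven't touched shared memory yet, so there is nothing to
	 * clean up; skip proc_exit() and its callbacks entirely.
	 * _exit() is async-signal-safe where proc_exit() is not.
	 */
	_exit(1);
}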

regards, tom lane




Re: v13: show extended stats target in \d

2020-09-09 Thread Justin Pryzby
On Tue, Sep 01, 2020 at 12:35:19PM +, Georgios Kokolatos wrote:
> I will humbly disagree with the current review. I shall refrain from changing 
> the status though because I do not have very strong feeling about it.

Sorry, but this was in my spam folder, and I didn't see it until now.

> 
> However the patch contains:
> 
> - "  'm' = any(stxkind) AS mcv_enabled\n"
> + "  'm' = any(stxkind) AS mcv_enabled,\n"
> + "  %s"
>   "FROM pg_catalog.pg_statistic_ext stat "
>   "WHERE stxrelid = '%s'\n"
>   "ORDER BY 1;",
> + (pset.sversion >= 13 ? "stxstattarget\n" : "-1\n"),
>   oid);
> 
> This seems to break the pattern in describeOneTableDetails() a bit.
> First, it is customary to write the whole query for each version in its own
> block. I do find this pattern rather helpful and clean. So in my humble
> opinion, the ternary expression should be replaced with a distinct if block
> that would implement stxstattarget. Second, setting the value to -1 as an
> indication of absence breaks the pattern a bit further. More on that below.

Hm, I did this using the "hastriggers" code as a template.  But you're
right that everywhere else does it differently.

> +   if (strcmp(PQgetvalue(result, i, 8), "-1") != 0)
> +       appendPQExpBuffer(&buf, " STATISTICS %s",
> +                         PQgetvalue(result, i, 8));
>
> 
> In the same function, one will find a bit of a different pattern for
> printing the statistics, e.g.
>     if (strcmp(PQgetvalue(result, i, 7), "t") == 0)
> I will again hold the opinion that if the query gets crafted a bit
> differently, one can inspect whether the table has `stxstattarget` and
> then go ahead and print the value.
>
> Something along the lines of:
>     if (strcmp(PQgetvalue(result, i, 8), "t") == 0)
>         appendPQExpBuffer(&buf, " STATISTICS %s", PQgetvalue(result, i, 9));

I think what you've written wouldn't give the desired behavior, since it would
show the stats target even when it's set to the default.  I don't see the point
of having additional, separate, version-specific boolean columns for 1) column
exists; 2) column is not default, in addition to 3) column value.  But I added
comment about what Peter and I agree is desirable, anyway.

> Finally, the patch might be more complete if a note were added in the
> documentation.
> Have you considered adding something in doc/src/sgml/ref/psql-ref.sgml? If
> not, will you consider it? If yes, why did you discard it?

I don't think the details of psql output are currently documented.  This shows
nothing about column statistics target, nor about MV stats at all.
https://www.postgresql.org/docs/13/app-psql.html

As for the discussion about a separator, I think maybe a comma is enough.  I
doubt anyone is going to think that you can get a valid command by prefixing
this by "CREATE STATISTICS".  Actually, it didn't even occur to me it was a
valid command without the stats target - after all, that's not true of indexes.

+"public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM ab1, 
STATISTICS 0

This revision only shows the stats target in verbose mode (slash dee plus).

-- 
Justin
>From e5b351cc9b0518c9162a4b988b17ed920f09d952 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 30 Aug 2020 23:38:17 -0500
Subject: [PATCH v2] Show stats target of extended statistics

The stats target can be set since commit d06215d03, but wasn't shown by psql.
 ALTER STATISTICS .. SET STATISTICS n.
Traditional column stats targets are shown in \d+ table, so do the same for
extended/MV stats.
---
 src/bin/psql/describe.c | 14 --
 src/test/regress/expected/stats_ext.out | 18 ++
 src/test/regress/sql/stats_ext.sql  |  2 ++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 0861d74a6f..3c391fe686 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2683,8 +2683,13 @@ describeOneTableDetails(const char *schemaname,
 			  "a.attnum = s.attnum AND NOT attisdropped)) AS columns,\n"
 			  "  'd' = any(stxkind) AS ndist_enabled,\n"
 			  "  'f' = any(stxkind) AS deps_enabled,\n"
-			  "  'm' = any(stxkind) AS mcv_enabled\n"
-			  "FROM pg_catalog.pg_statistic_ext stat "
+			  "  'm' = any(stxkind) AS mcv_enabled");
+
+			if (pset.sversion >= 13)
+			

Re: SIGQUIT handling, redux

2020-09-09 Thread Andres Freund
Hi,

On 2020-09-09 16:30:37 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
> >> We could call it startup_packet_die or something?
> 
> > Yea, I think that'd be good.
> 
> I'll make it so.

Thanks!


> >> We see backends going through this code on a very regular basis in the
> >> buildfarm, but complete hangs are rare as can be.  I think you
> >> overestimate the severity of the problem.
> 
> > I don't think the BF exercises the problematic paths to a significant
> > degree. It's mostly local socket connections, and where not it's
> > localhost. There's no slow DNS, no more complicated authentication
> > methods, no packet loss. How often do we ever actually end up even
> > getting close to any of the paths but immediate shutdowns?
> 
> Since we're talking about quickdie(), immediate shutdown/crash restart
> is exactly the case of concern, and the buildfarm exercises it all the
> time.

Yea, but only in simple cases. Largely no SSL / Kerberos. Largely
untranslated. And mostly the immediate shutdowns don't happen while inside
plpython or such.


> > And in the
> > SIGQUIT path, how often do we end up in the SIGKILL path, masking
> > potential deadlocks?
> 
> True, we can't really tell that.  I wonder if we should make the
> postmaster emit a log message when it times out and goes to SIGKILL.
> After a few months we could scrape the buildfarm logs and get a
> pretty good handle on it.

I think that'd be a good idea.

Greetings,

Andres Freund




Re: [HACKERS] [PATCH] Generic type subscripting

2020-09-09 Thread Justin Pryzby
On Wed, Aug 05, 2020 at 04:04:22PM +0200, Dmitry Dolgov wrote:
> > On Sun, Aug 02, 2020 at 12:50:12PM +0200, Pavel Stehule wrote:
> > >
> > > > Maybe this could be salvaged by flushing 0005 in its current form and
> > > > having the jsonb subscript executor do something like "if the current
> > > > value-to-be-subscripted is a JSON array, then try to convert the textual
> > > > subscript value to an integer".  Not sure about what the error handling
> > > > rules ought to be like, though.
> > >
> > > I'm fine with the idea of separating 0005 patch and potentially prusuing
> > > it as an independent item. Just need to rebase 0006, since Pavel
> > > mentioned that it's a reasonable change he would like to see in the
> > > final result.
> >
> > +1
> 
> Here is what I had in mind. Worth noting that, as well as the original

This seems to already hit a merge conflict (8febfd185).
Would you re-rebase ?

-- 
Justin




Re: [Patch] ALTER SYSTEM READ ONLY

2020-09-09 Thread Andres Freund
Hi,

Thomas, there's one point below that could be relevant for you. You can
search for your name and/or checkpoint...


On 2020-09-01 16:43:10 +0530, Amul Sul wrote:
> diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
> index 42050ab7195..0ac826d3c2f 100644
> --- a/src/backend/nodes/readfuncs.c
> +++ b/src/backend/nodes/readfuncs.c
> @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
>   READ_DONE();
>  }
>  
> +/*
> + * _readAlterSystemWALProhibitState
> + */
> +static AlterSystemWALProhibitState *
> +_readAlterSystemWALProhibitState(void)
> +{
> + READ_LOCALS(AlterSystemWALProhibitState);
> +
> + READ_BOOL_FIELD(WALProhibited);
> +
> + READ_DONE();
> +}
> +

Why do we need readfuncs support for this?

> +
> +/*
> + * AlterSystemSetWALProhibitState
> + *
> + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> + */
> +static void
> +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> +{
> + /* some code */
> + elog(INFO, "AlterSystemSetWALProhibitState() called");
> +}

As long as it's not implemented it seems better to return an ERROR.

> @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
>   VariableSetStmt *setstmt;   /* SET subcommand */
>  } AlterSystemStmt;
>  
> +/* --
> + *   Alter System Read Statement
> + * --
> + */
> +typedef struct AlterSystemWALProhibitState
> +{
> + NodeTag type;
> + boolWALProhibited;
> +} AlterSystemWALProhibitState;
> +

All the nearby fields use under_score_style names.



> From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
> From: Amul Sul 
> Date: Fri, 19 Jun 2020 06:29:36 -0400
> Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.
> 
> Implementation:
> 
>  1. When a user tried to change server state to WAL-Prohibited using
> ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
> raises request to checkpointer by marking current state to inprogress in
> shared memory.  Checkpointer, noticing that the current state is has

"is has"

> WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
> then acknowledges back to the backend who requested the state change once
> the transition has been completed.  Final state will be updated in control
> file to make it persistent across the system restarts.

What makes checkpointer the right backend to do this work?


>  2. When a backend receives the WAL-Prohibited barrier, at that moment if
> it is already in a transaction and the transaction already assigned XID,
> then the backend will be killed by throwing FATAL(XXX: need more 
> discussion
> on this)


>  3. Otherwise, if that backend running transaction which yet to get XID
> assigned we don't need to do anything special

Somewhat garbled sentence...


>  4. A new transaction (from existing or new backend) starts as a read-only
> transaction.

Maybe "(in an existing or in a new backend)"?


>  5. Autovacuum launcher as well as checkpointer will don't do anything in
> WAL-Prohibited server state until someone wakes us up.  E.g. a backend
> might later on request us to put the system back to read-write.

"will don't do anything", "might later on request us"


>  6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> and xlog rotation. Starting up again will perform crash recovery(XXX:
> need some discussion on this as well) but the end of recovery checkpoint
> will be skipped and it will be performed when the system changed to
> WAL-Permitted mode.

Hm, this has some interesting interactions with some of Thomas' recent
hacking.


>  8. Only super user can toggle WAL-Prohibit state.

Hm. I don't quite agree with this. We try to avoid if (superuser())
style checks these days, because they can't be granted to other
users. Look at how e.g. pg_promote() - an operation of similar severity
- is handled. We just revoke the permission from public in
system_views.sql:
REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;


>  9. Add system_is_read_only GUC show the system state -- will true when system
> is wal prohibited or in recovery.

*shows the system state. There's also some oddity in the second part of
the sentence.

Is it really correct to show system_is_read_only as true during
recovery? For one, recovery could end soon after, putting the system
into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
during recovery the database state actually changes if there are changes
to replay.  ISTM it would not be a good idea to mix ASRO and
pg_is_in_recovery() into one GUC.


> --- /dev/null
> +++ b/src/backend/access/transam/walprohibit.c
> @@ -0,0 +1,321 @@
> +/*-
> + *
> + * walprohibit.c
> + *   PostgreSQL write-ahead log prohibit states
> + *

Re: PG 13 release notes, first draft

2020-09-09 Thread Jonathan S. Katz
Hi,

On 5/4/20 11:16 PM, Bruce Momjian wrote:
> I have committed the first draft of the PG 13 release notes.  You can
> see them here:
> 
>   https://momjian.us/pgsql_docs/release-13.html
> 
> It still needs markup, word wrap, and indenting.  The community doc
> build should happen in a few hours.

Thank you again for compiling and maintaining the release notes through
another major release cycle, I know it's no small undertaking!

Attached is a proposal for the "major enhancements" section. I borrowed
from the press release[1] but tried to stay true to the release notes
format and listed out the enhancements that way.

Open to suggestion, formatting changes, etc.

Thanks!

Jonathan

[1]
https://www.postgresql.org/message-id/3bd579f8-438a-ed1a-ee20-738292099aae%40postgresql.org

diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml
index 68f9a0a9f1..95106921d7 100644
--- a/doc/src/sgml/release-13.sgml
+++ b/doc/src/sgml/release-13.sgml
@@ -6,29 +6,49 @@
 
   
Release date:
-   2020-XX-XX, CURRENT AS OF 2020-08-09
+   2020-09-24, CURRENT AS OF 2020-09-09
   
 
   
Overview
 
 
- Major enhancements in PostgreSQL 13 include:
+ PostgreSQL 13 contains many new features and
+ enhancements, including:
 
 
-   
-

-
 
-TBD
+ 
+  Space savings and performance gains from B-tree index deduplication
+ 
+
+
+ 
+  Improved response times for queries that use aggregates or partitions
+ 
+
+
+ 
+  Better query planning when using enhanced statistics
+ 
+
+
+ 
+  Parallelized vacuuming of B-tree indexes
+ 
+
+
+ 
+  Incremental sorting
+ 
 
-

 
-
- The above items are explained in more detail in the sections below.
-
+   
+and more. The above items and other new features of PostgreSQL 13 are
+explained in more detail in the sections below.
+   
 
   
 


signature.asc
Description: OpenPGP digital signature


Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
I wrote:
> Not only DNS, but all the various auth libraries would have to be
> contended with.  Lots of work there compared to the likely rewards.

Wait a minute.  The entire authentication cycle happens inside
InitPostgres, using the backend's normal signal handlers.  So
maybe we are overthinking the problem.  What if we simply postpone
ProcessStartupPacket into that same place, and run it under the same
rules as we do for authentication?  We would waste more cycles than
we do now for the case where the client closes the connection without
sending a startup packet, but not enormously so, I think --- and
optimizing that case doesn't seem like a high-priority goal anyway.
And cases like DNS lookup taking forever don't seem like any more of
an issue than auth lookup taking forever.

regards, tom lane




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
Andres Freund  writes:
> On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
>> We could call it startup_packet_die or something?

> Yea, I think that'd be good.

I'll make it so.

>> We see backends going through this code on a very regular basis in the
>> buildfarm, but complete hangs are rare as can be.  I think you
>> overestimate the severity of the problem.

> I don't think the BF exercises the problmetic paths to a significant
> degree. It's mostly local socket connections, and where not it's
> localhost. There's no slow DNS, no more complicated authentication
> methods, no packet loss. How often do we ever actually end up even
> getting close to any of the paths but immediate shutdowns?

Since we're talking about quickdie(), immediate shutdown/crash restart
is exactly the case of concern, and the buildfarm exercises it all the
time.

> And in the
> SIGQUIT path, how often do we end up in the SIGKILL path, masking
> potential deadlocks?

True, we can't really tell that.  I wonder if we should make the
postmaster emit a log message when it times out and goes to SIGKILL.
After a few months we could scrape the buildfarm logs and get a
pretty good handle on it.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 04:53:30PM -0300, Alvaro Herrera wrote:

On 2020-Sep-09, Tomas Vondra wrote:


There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.


Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.



Sure. I was just suggesting it's not something that'd make this very
close to plain minmax opclass.


The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue, it just swaps
the Datum value and that's it.


Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:


Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.




Well, the patch already has Ranges (memory) and SerializedRanges (disk)
but it's not very clear to me where to stash the in-memory data and
where to make the conversion.



Also, I think you could get a few initial patches pushed soon, since
they look like general improvements rather than specific to multi-range.



Yeah, I agree. I plan to review those once again in a couple days and
then push them.



On a differen train of thought, I wonder if we shouldn't drop the idea
of there being two minmax opclasses; just have one (still called
"minmax") and have the multi-range version be the v2 of it.  We would
still need to keep code to operate on the old one, but if you ever
REINDEX then your index is upgraded to the new one.  I see no reason to
keep the dumb minmax version around, assuming the performance is roughly
similar.



I'm not a huge fan of that. I think it's unlikely we'll ever make this
new set of oplasses just as fast as the plain minmax, and moreover it
does have some additional requirements (e.g. the distance proc, which
may not make sense for some data types).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGQUIT handling, redux

2020-09-09 Thread Andres Freund
Hi,

On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I wish startup_die() weren't named startup_ - every single time I see
> > the name I think it's about the startup process...
> 
> We could call it startup_packet_die or something?

Yea, I think that'd be good.


> > I think StartupPacketTimeoutHandler is another case?
> 
> Yeah.  Although it's a lot less risky, since if the timeout is reached
> we're almost certainly waiting for client input.

An adversary could control that to a significant degree - but ordinarily
I agree...


> >> In passing, it's worth noting that startup_die() isn't really much safer
> >> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> >> those is that code that applies BlockSig will at least manage to block the
> >> former.
> 
> > Which is pretty unconvincing...
> 
> Agreed, it'd be nice if this were less shaky.  On the other hand,
> we've seen darn few complaints traceable to this AFAIR.  I'm not
> really sure it's worth putting a lot of effort into.

Not sure how many would plausibly reach us though.  A common reaction
will likely just to be to force-restart the server, not to fully
investigate the issue. Particularly because it'll often be once
something already has gone wrong...



> >> I don't want to give up trying to send a message to the client.
> 
> > That still doesn't make much sense to me. The potential for hanging
> > (e.g. inside malloc) is so much worse than not sending a message...
> 
> We see backends going through this code on a very regular basis in the
> buildfarm, but complete hangs are rare as can be.  I think you
> overestimate the severity of the problem.

I don't think the BF exercises the problmetic paths to a significant
degree. It's mostly local socket connections, and where not it's
localhost. There's no slow DNS, no more complicated authentication
methods, no packet loss. How often do we ever actually end up even
getting close to any of the paths but immediate shutdowns? And in the
SIGQUIT path, how often do we end up in the SIGKILL path, masking
potential deadlocks?


> > I only had one coffee so far (and it looks like the sun has died
> > outside), so maybe I'm just slow: But, uh, we don't currently send a
> > message startup_die(), right?
> > So that part is about quickdie()?
> 
> Right.  Note that startup_die() is pre-authentication, so I'm doubtful
> that we should tell the would-be client anything about the state of
> the server at that point, even ignoring these risk factors.  (I'm a
> bit inclined to remove the comment suggesting that'd be desirable.)

Yea, I think just putting in an editorialized version of your paragraph
would make sense.

Greetings,

Andres Freund




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
Andres Freund  writes:
> I wish startup_die() weren't named startup_ - every single time I see
> the name I think it's about the startup process...

We could call it startup_packet_die or something?

> I think StartupPacketTimeoutHandler is another case?

Yeah.  Although it's a lot less risky, since if the timeout is reached
we're almost certainly waiting for client input.

>> In passing, it's worth noting that startup_die() isn't really much safer
>> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
>> those is that code that applies BlockSig will at least manage to block the
>> former.

> Which is pretty unconvincing...

Agreed, it'd be nice if this were less shaky.  On the other hand,
we've seen darn few complaints traceable to this AFAIR.  I'm not
really sure it's worth putting a lot of effort into.

> The long term correct way to handle this would obviously be to
> restructure everything that happens covered by startup_die() in a
> non-blocking manner and just rely on CFR(). But that's a tall order to
> get done anytime soon, particularly things like DNS are IIRC pretty hard
> without relying on custom libraries.

Not only DNS, but all the various auth libraries would have to be
contended with.  Lots of work there compared to the likely rewards.

>> I don't want to give up trying to send a message to the client.

> That still doesn't make much sense to me. The potential for hanging
> (e.g. inside malloc) is so much worse than not sending a message...

We see backends going through this code on a very regular basis in the
buildfarm, but complete hangs are rare as can be.  I think you
overestimate the severity of the problem.

> I only had one coffee so far (and it looks like the sun has died
> outside), so maybe I'm just slow: But, uh, we don't currently send a
> message startup_die(), right?
> So that part is about quickdie()?

Right.  Note that startup_die() is pre-authentication, so I'm doubtful
that we should tell the would-be client anything about the state of
the server at that point, even ignoring these risk factors.  (I'm a
bit inclined to remove the comment suggesting that'd be desirable.)

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Tomas Vondra wrote:

> There are some minor optimizations possible - for example I noticed we
> call minmax_multi_get_strategy_procinfo often because it happens in a
> loop, and we could easily do it just once. But that saves only about 10%
> or so, it's not a ground-breaking optimization.

Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.

> The main reason for the slowness is that we pass the values one by one
> to brin_minmax_multi_add_value - and on each call we need to deserialize
> (and then sometimes also serialize) the summary, which may be quite
> expensive. The regular minmax does not have this issue, it just swaps
> the Datum value and that's it.

Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:

> Another option would be to teach add_value to keep the deserialized
> summary somewhere, and then force serialization at the end of the BRIN
> page range. The end result would be roughly the same, I think.


Also, I think you could get a few initial patches pushed soon, since
they look like general improvements rather than specific to multi-range.


On a differen train of thought, I wonder if we shouldn't drop the idea
of there being two minmax opclasses; just have one (still called
"minmax") and have the multi-range version be the v2 of it.  We would
still need to keep code to operate on the old one, but if you ever
REINDEX then your index is upgraded to the new one.  I see no reason to
keep the dumb minmax version around, assuming the performance is roughly
similar.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 03:40:41PM -0300, Alvaro Herrera wrote:

On 2020-Sep-09, John Naylor wrote:


create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)



create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)


It seems strange that the multi-minmax index takes so much longer to
build.  I wonder if there's some obvious part of the algorithm that can
be improved?



There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.

The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue, it just swaps
the Datum value and that's it.

I see two possible optimizations - firstly, adding some sort of batch
variant of the add_value function, which would get a bunch of values
instead of just a single one, amortizing the serialization costs.

Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGQUIT handling, redux

2020-09-09 Thread Andres Freund
Hi,

On 2020-09-08 17:39:24 -0400, Tom Lane wrote:
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.  I immediately found pgarch_exit() which surely is not.  It
> turns out that Horiguchi-san already noticed that and proposed to fix it,
> within the seemingly entirely unrelated patch series for a shared-memory
> based stats collector (see patch 0001 in [2]).  I think we should go ahead
> and commit that, and maybe even back-patch it.  There seems no good reason
> for the archiver to treat SIGQUIT as nonfatal when other postmaster
> children do not.

+1


> Another thing I found is startup_die() in postmaster.c, which catches
> SIGQUIT during the authentication phase of a new backend.  This calls
> proc_exit(1), which (a) is unsafe on its face, and (b) potentially
> demotes a SIGQUIT into something that will not cause the parent postmaster
> to perform a DB-wide restart.  There seems no good reason for that
> behavior, either.  I propose that we use SignalHandlerForCrashExit
> for SIGQUIT here too.

Yikes, yea, that seems like an important change.


I wish startup_die() weren't named startup_ - every single time I see
the name I think it's about the startup process...


I think StartupPacketTimeoutHandler is another case?



> In passing, it's worth noting that startup_die() isn't really much safer
> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> those is that code that applies BlockSig will at least manage to block the
> former.

Which is pretty unconvincing...

I wonder if we could at least *try* to rely on CFR() for a while. It'd
not be pretty, but probably doable, to set up a timer inside the signal
handler and try to exit using normal mechanisms for a while (some
overlap with authentication timeout).

That'd leave the question of what we do once that timer expires. There's
quite a few possible ways that could reproducibly happen, e.g. if DNS
lookups are required during auth.

The long term correct way to handle this would obviously be to
restructure everything that happens covered by startup_die() in a
non-blocking manner and just rely on CFR(). But that's a tall order to
get done anytime soon, particularly things like DNS are IIRC pretty hard
without relying on custom libraries.

Shorter term I don't really see an alternative to escalating to SIGQUIT
which is obviously terrible.


> So it's slightly tempting to drop startup_die() altogether in favor of
> using SignalHandlerForCrashExit for the SIGTERM-during-auth case too.
> However, that'd have the effect of causing a "fast shutdown" to get
> converted to a panic if it happens while any sessions are in
> authentication phase, so I think this cure is likely worse than the
> disease.

Indeed.


> I don't want to give up trying to send a message to the client.

That still doesn't make much sense to me. The potential for hanging
(e.g. inside malloc) is so much worse than not sending a message... And
we already infer things about the server dying in libpq when the
connection just closes (which I am admittedly also not happy about). But
I also think we can reduce the risk a bit, see below.

I only had one coffee so far (and it looks like the sun has died
outside), so maybe I'm just slow: But, uh, we don't currently send a
message startup_die(), right?

So that part is about quickdie()?


> Worth reading in connection with this is the thread that led up to
> commits 8e19a8264 et al [3].  We had a long discussion about how
> quickdie() and startup_die() might be made safe(r), without any
> real consensus emerging about any alternatives being better than the
> status quo.  I still don't have an improvement idea for quickdie();

I still think that we should at least mitigate the risk to hang inside
quickdie() by at least disabling translations (since the rest happens
inside the pre-allocated error memory context, which should lower the
risk).  And perhaps do similarly for the memory required for encryption,
if enabled, and where possible.


Greetings,

Andres Freund




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2020-09-09 Thread Peter Eisentraut

On 2020-09-09 15:38, Tom Lane wrote:

and a few
more calls of int8_numeric that could be converted.  I think the
attached updated version is committable, and I'd recommend going
ahead with that regardless of the rest of this.  I hadn't realized
how many random calls of int8_numeric and int4_numeric we'd grown,
but there are a lot, so this is nice cleanup.



Yes, please go ahead with it.


It's your patch, I figured you'd want to commit it.


ok done

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Sep-09, Fujii Masao wrote:
>> Using given-name-first order is our consensus?

> That's indeed our historical practice.  See previous thread where we've
> discussed this at length,
> https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2

> The Economist piece Peter G cited is also relevant.

Right.  I think the decree the Economist cites might be sufficient
reason to reopen the discussion, though I surely don't want it to
turn into another long thread.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, John Naylor wrote:

> create index on t using brin (a);
> CREATE INDEX
> Time: 1631.452 ms (00:01.631)

> create index on t using brin (a int8_minmax_multi_ops);
> CREATE INDEX
> Time: 6521.026 ms (00:06.521)

It seems strange that the multi-minmax index takes so much longer to
build.  I wonder if there's some obvious part of the algorithm that can
be improved?

> The second thing is, with parallel seq scan, the query is faster than
> a BRIN bitmap scan, with this pathological data distribution, but the
> planner won't choose it unless forced to:
> 
> set enable_bitmapscan = 'off';
> explain analyze select * from t
>   where a between 1923300::int and 1923600::int;

This is probably explained by the fact that you likely have the whole
table in shared buffers, or at least in OS cache.  I'm not sure if the
costing should necessarily account for this.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Tom Lane
Fujii Masao  writes:
> On 2020/09/10 2:16, Tom Lane wrote:
>> Hm, in a quick search I only see 2eb3bc588 which was not back-patched
>> ... which commits are you thinking of?

> I thought your commit b55b4dad99 included the improvement on comment 
> detection and was back-patched

Oh, right, I did include the check for '#' in what was back-patched.
I debated whether to do that or not, and was misremembering my decision.

So yeah, it seems there's no need to document 2eb3bc588 in the v13
notes.  I'll try to remember to give you part credit for b55b4dad9
when I do the November notes.

regards, tom lane




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Fujii Masao wrote:

> On 2020/09/09 14:15, Etsuro Fujita wrote:
> > Hi,
> > 
> > Attached is a patch to standardize Japanese names as given-name-first
> > in the v13 contributors list as before.
> 
> Using given-name-first order is our consensus? I was thinking we have not
> reached that yet and our "vague" consensus was to use the name that each
> contributor prefers, for example the name that used in the email signature, 
> etc.

That's indeed our historical practice.  See previous thread where we've
discussed this at length,
https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2

The Economist piece Peter G cited is also relevant.

The commit Peter E cited seems more anecdotical than precedence-setting,
since there was no actual discussion, and whatever little there was was
confined to pgsql-committers.


An easy way to avoid any confusion is to uppercase the family name in
the cases where it goes first.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Fujii Masao




On 2020/09/10 2:16, Tom Lane wrote:

Fujii Masao  writes:

On 2020/09/10 1:48, Tom Lane wrote:

We could adjust the release-note entry for your patch to say
"Improve comment detection for .pgpass files", or we could decide
it's not worth documenting that part and just drop the entry.



"Improve comment detection for .pgpass files" is also the material for
minor version release because that change was also back-patched?
If so, I agree to drop the entry.


Hm, in a quick search I only see 2eb3bc588 which was not back-patched
... which commits are you thinking of?


I thought your commit b55b4dad99 included the improvement on comment detection 
and was back-patched

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: shared-memory based stats collector

2020-09-09 Thread Tom Lane
Stephen Frost  writes:
> Just FYI, Tom's started a thread which includes this over here-
> https://postgr.es/m/1850884.1599601...@sss.pgh.pa.us

Per that discussion, I'm about to go and commit the 0001 patch from
this thread, which will cause the cfbot to not be able to apply the
patchset anymore till you repost it without 0001.  However, before
reposting, you might want to fix the compile errors the cfbot is
showing currently.

On the Windows side:

src/backend/postmaster/postmaster.c(6410): error C2065: 'pgStatSock' : 
undeclared identifier [C:\projects\postgresql\postgres.vcxproj]

On the Linux side:

1711pgstat.c: In function ‘pgstat_vacuum_stat’:
1712pgstat.c:1411:7: error: ‘key’ may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
1713   if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
1714   ^
1715pgstat.c:1373:8: note: ‘key’ was declared here
1716   Oid *key;
1717^
1718pgstat.c:1411:7: error: ‘oidtab’ may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
1719   if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
1720   ^
1721pgstat.c:1372:9: note: ‘oidtab’ was declared here
1722   HTAB*oidtab;
1723 ^
1724pgstat.c: In function ‘pgstat_reset_single_counter’:
1725pgstat.c:1625:6: error: ‘stattype’ may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
1726  env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
1727  ^
1728pgstat.c:1601:14: note: ‘stattype’ was declared here
1729  PgStatTypes stattype;
1730  ^
1731cc1: all warnings being treated as errors

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-09 Thread Fujii Masao




On 2020/09/08 12:03, Amit Kapila wrote:

On Tue, Sep 8, 2020 at 8:05 AM Fujii Masao  wrote:


On 2020/09/08 10:34, Amit Kapila wrote:

On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao  wrote:


IMO it's not easy to commit this 2PC patch at once because it's still large
and complicated. So I'm thinking it's better to separate the feature into
several parts and commit them gradually.



Hmm, I don't see that we have a consensus on the design and or
interfaces of this patch and without that proceeding for commit
doesn't seem advisable. Here are a few points which I remember offhand
that require more work.


Thanks!


1. There is a competing design proposed and being discussed in another
thread [1] for this purpose. I think both the approaches have pros and
cons but there doesn't seem to be any conclusion yet on which one is
better.


I was thinking that [1] was discussing global snapshot feature for
"atomic visibility" rather than the solution like 2PC for "atomic commit".
But if another approach for "atomic commit" was also proposed at [1],
that's good. I will check that.



Okay, that makes sense.


I read Alexey's 2PC patch 
(0001-Add-postgres_fdw.use_twophase-GUC-to-use-2PC.patch)
proposed at [1]. As Alexey told at that thread, there are two big differences
between his patch and Sawada-san's; 1) whether there is the resolver process
for foreign transactions, 2) 2PC logic is implemented only inside postgres_fdw
or both FDW and PostgreSQL core.

I think that 2) is the first decision point. Alexey's 2PC patch is very simple
and all the 2PC logic is implemented only inside postgres_fdw. But this
means that 2PC is not usable if multiple types of FDW (e.g., postgres_fdw
and mysql_fdw) participate at the transaction. This may be ok if we implement
2PC feature only for PostgreSQL sharding using postgres_fdw. But if we
implement 2PC as the improvement on FDW independently from PostgreSQL
sharding, I think that it's necessary to support other FDW. And this is our
direction, isn't it?

Sawada-san's patch supports that case by implememnting some conponents
for that also in PostgreSQL core. For example, with the patch, all the remote
transactions that participate at the transaction are managed by PostgreSQL
core instead of postgres_fdw layer.

Therefore, at least regarding the difference 2), I think that Sawada-san's
approach is better. Thought?

[1]
https://postgr.es/m/3ef7877bfed0582019eab3d462a43...@postgrespro.ru

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 12:04:28PM -0400, John Naylor wrote:

On Sat, Sep 5, 2020 at 7:21 PM Tomas Vondra
 wrote:


OK, here is a rebased version. Most of the breakage was due to changes
to the BRIN sgml docs.


Hi Tomas,

I plan on trying some different queries on different data
distributions to get a sense of when the planner chooses a
multi-minmax index, and whether the choice is good.

Just to start, I used the artificial example in [1], but scaled down a
bit to save time. Config is at the default except for:
shared_buffers = 1GB
random_page_cost = 1.1;
effective_cache_size = 4GB;

create table t (a bigint, b int) with (fillfactor=95);

insert into t select i + 1000*random(), i+1000*random()
 from generate_series(1,1000) s(i);

update t set a = 1, b = 1 where random() < 0.001;
update t set a = 1000, b = 1000 where random() < 0.001;

analyze t;

create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)

explain analyze select * from t
 where a between 1923300::int and 1923600::int;

   QUERY PLAN
--
Bitmap Heap Scan on t  (cost=16.10..43180.43 rows=291 width=12)
(actual time=217.770..1131.366 rows=288 loops=1)
  Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
  Rows Removed by Index Recheck: 712
  Heap Blocks: lossy=56819
  ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
width=0) (actual time=3.054..3.055 rows=568320 loops=1)
Index Cond: ((a >= 1923300) AND (a <= 1923600))
Planning Time: 0.328 ms
Execution Time: 1131.411 ms
(8 rows)

Now add the multi-minmax:

create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)

The first interesting thing is, with both BRIN indexes available, the
planner still chooses the conventional BRIN index. Only when I disable
it, does it choose the multi-minmax index:

explain analyze select * from t
 where a between 1923300::int and 1923600::int;

  QUERY PLAN
-
Bitmap Heap Scan on t  (cost=68.10..43160.86 rows=291 width=12)
(actual time=1.835..4.196 rows=288 loops=1)
  Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
  Rows Removed by Index Recheck: 22240
  Heap Blocks: lossy=128
  ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
width=0) (actual time=0.691..0.691 rows=1280 loops=1)
Index Cond: ((a >= 1923300) AND (a <= 1923600))
Planning Time: 0.250 ms
Execution Time: 4.244 ms
(8 rows)

I wonder if this is a clue that something in the costing unfairly
penalizes a multi-minmax index. Maybe not enough to matter in
practice, since I wouldn't expect a user to put different kinds of
index on the same column.



I think this is much more an estimation issue than a costing one. Notice
that in the "regular" BRIN minmax index we have this:

   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
   width=0) (actual time=3.054..3.055 rows=568320 loops=1)

while for the multi-minmax we have this:

   ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
   width=0) (actual time=0.691..0.691 rows=1280 loops=1)

So yes, the multi-minmax index is costed a bit higher, mostly because
the index is a bit larger. (There's also a tweak to the correlation, but
that does not make much difference because it's just 0.99 vs. 1.0.)

The main difference is that for minmax the bitmap index scan actually
matches ~586k rows (a bit confusing, considering the heap scan has to
process almost 10M rows during recheck). But the multi-minmax only
matches ~1300 rows, with a recheck of 22k.

I'm not sure how to consider this during costing, as we only see these
numbers at execution time. One way would be to also consider "size" of
the ranges (i.e. max-min) vs. range of the whole column. But that's not
something we already have.

I'm not sure how troublesome this issue really is - I don't think people
are very likely to have both minmax and multi-minmax indexes on the same
column.


The second thing is, with parallel seq scan, the query is faster than
a BRIN bitmap scan, with this pathological data distribution, but the
planner won't choose it unless forced to:

set enable_bitmapscan = 'off';
explain analyze select * from t
 where a between 1923300::int and 1923600::int;
 QUERY PLAN
---
Gather  (cost=1000.00..120348.10 rows=291 width=12) (actual
time=372.766..380.364 rows=288 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on t  (cost=0.00..119319.00 rows=121
width=12) (actual time=268.326..366.228 rows=96 loops=3)

Re: More aggressive vacuuming of temporary tables

2020-09-09 Thread Andres Freund
Hi,

On 2020-09-09 10:14:04 -0400, Stephen Frost wrote:
> > I've been toying with a patch that introduces more smarts about when a
> > row is removable, by looking more closely whether a specific row
> > versions are visible (e.g. in the common case of one old snapshot and
> > lots of newer rows). But that's orders of magnitude more complicated. So
> > going for something as simple as this seems like a good idea.
> 
> I've wondered about this for a long time- very cool that you've found
> time to actually work on a patch.  A couple of different ideas were
> discussed previously about how to do that kind of a check- mind talking
> about what method you're using, or perhaps just sharing that patch? :)

It's very very early, and it doesn't really work. I basically tried to
just plug a bit more intelligence into the dead tuple detection (which
now can easily store more and incrementally built state with the recent
changes for snapshot scalability).  Detection that tuples newer than the
horizon are dead isn't that complicated - what's hard is not breaking
things due to ctid chains lacking intermediate versions.  To avoid that
I had to restrict it to inserted (not updated) tuples that were
subsequently deleted. And my heuristic only supported only one old
snapshot.

Building a bsearchable list of ranges of valid (xmin-xmax] ranges isn't
that hard. Some care needs to be taken to make the list non-overlapping,
but that's easy enough by just merging entries.

Obviously lookup in such a more complicated structure isn't free. Nor is
building it. So we'd need some heuristics about when to do so. It'd
probably be OK to occasionally look at the age of the oldest in-progress
statement, to infer the time for old snapshots based on that. And then
we could have a GUC that says when it's worth doing more complicated
lookups.

I don't have a handle on how to deal with the ctid chaining for
intermediate row versions.

Greetings,

Andres Freund




Re: Global snapshots

2020-09-09 Thread Fujii Masao




On 2020/09/09 2:00, Alexey Kondratov wrote:

On 2020-09-08 14:48, Fujii Masao wrote:

On 2020/09/08 19:36, Alexey Kondratov wrote:

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:


Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds 
a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues 
above and tries to add proper comments everywhere. I think, that 0003 should be 
rebased on the top of it, or it could be a first patch in the set, since it may 
be used independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


Thank you for the link!

After a quick look on the Sawada-san's patch set I think that there are two 
major differences:


Thanks for sharing your thought! As far as I read your patch quickly,
I basically agree with your this view.




1. There is a built-in foreign xacts resolver in the [1], which should be much 
more convenient from the end-user perspective. It involves huge in-core changes 
and additional complexity that is of course worth of.

However, it's still not clear for me that it is possible to resolve all foreign 
prepared xacts on the Postgres' own side with a 100% guarantee. Imagine a 
situation when the coordinator node is actually a HA cluster group (primary + 
sync + async replica) and it failed just after PREPARE stage of after local 
COMMIT. In that case all foreign xacts will be left in the prepared state. 
After failover process complete synchronous replica will become a new primary. 
Would it have all required info to properly resolve orphan prepared xacts?


IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
the special function for manual resolution, so there may be some cases
where manual resolution is necessary.



I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear for me.



Implementing 2PC feature only inside postgres_fdw seems to cause
another issue; COMMIT PREPARED is issued to the remote servers
after marking the local transaction as committed
(i.e., ProcArrayEndTransaction()).



According to the Sawada-san's v25 0002 the logic is pretty much the same there:

+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / 
FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction() in the 
CommitTransaction(). Thus, I don't see many difference between these approach 
and CallXactCallbacks() usage regarding this point.


IIUC the commit logic in Sawada-san's patch looks like

1. PreCommit_FdwXact()
PREPARE TRANSACTION command is issued

2. RecordTransactionCommit()
2-1. WAL-log the commit record
2-2. Update CLOG
2-3. Wait for sync rep
2-4. FdwXactWaitForResolution()
Wait until COMMIT PREPARED commands are issued to the remote 
servers and completed.

3. ProcArrayEndTransaction()
4. AtEOXact_FdwXact(true)

So ISTM that the timing of when COMMIT PREPARED is issued
to the remote server is different between the patches.
Am I missing something?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Tom Lane
Fujii Masao  writes:
> On 2020/09/10 1:48, Tom Lane wrote:
>> We could adjust the release-note entry for your patch to say
>> "Improve comment detection for .pgpass files", or we could decide
>> it's not worth documenting that part and just drop the entry.

> "Improve comment detection for .pgpass files" is also the material for
> minor version release because that change was also back-patched?
> If so, I agree to drop the entry.

Hm, in a quick search I only see 2eb3bc588 which was not back-patched
... which commits are you thinking of?

regards, tom lane




Re: Minor cleanup of partbounds.c

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Etsuro Fujita wrote:

> Here is a patch for minor cleanup of the partbounds.c changes made by
> commit c8434d64c: 1) removes a useless assignment (in normal builds)

LGTM.

> and 2) improves comments a little.

No objection to changing "bounds" to "range bounds".

I think the point other is to replace the only appearance of "dummy
relation" to better match the extensive use of "dummy partition" in this
file.  The concept of a "dummy relation" is well established in the
planner.  I didn't know if "dummy partition" is itself a concept
(apparently in the newfangled partition-wise join stuff), or just
glorified wording to say "a dummy relation that happens to be a
partition".  Looking at is_dummy_partition, apparently a dummy partition
is either a dummy relation or a partition that doesn't have a
RelOptInfo.  So my conclusion is that this wording is okay to change
too.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Fujii Masao




On 2020/09/10 1:48, Tom Lane wrote:

Fujii Masao  writes:

The patch was back-patched to v13, but v13 release note still has the following 
item.



  Tighten libpq's overlength-line handling and comment detection for 
.pgpass files (Fujii Masao)



This should be changed to the following or something?



  Teach libpq to handle arbitrary-length lines in .pgpass files (Tom Lane)


Hm.  Actually, since that went further back than v13, I don't think
it should appear in the v13 notes at all; it will be material for
the next quarterly update release notes.

We could adjust the release-note entry for your patch to say
"Improve comment detection for .pgpass files", or we could decide
it's not worth documenting that part and just drop the entry.


"Improve comment detection for .pgpass files" is also the material for
minor version release because that change was also back-patched?
If so, I agree to drop the entry.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Yet another fast GiST build

2020-09-09 Thread Andrey M. Borodin



> 9 сент. 2020 г., в 20:39, Heikki Linnakangas  написал(а):
> 
> On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:
>> On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:
>>> Come to think of it, the point z-order comparator could benefit a lot
>>> from key abbreviation, too. You could do the point -> zorder conversion
>>> in the abbreviation routine.
>> That's how it works in PostGIS, only that we moved to more
>> effecient Hilbert curve:
>> https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171
> 
> Thanks, that's interesting.
> 
> I implemented the abbreviated keys for the point opclass, too, and noticed 
> that the patch as it was never used it. I reworked the patch so that 
> tuplesort_begin_index_gist() is responsible for looking up the sortsupport 
> function, like tuplesort_begin_index_btree() does, and uses abbreviation when 
> possible.
Wow, abbreviated sort made gist for points construction even 1.5x faster!
btw there is small typo in arg names in gist_bbox_zorder_cmp_abbrev(); z1,z2 -> 
a,b

> do we have regression test coverage for this?
Yes, sorting build for points is tested in point.sql, but with small dataset. 
index_including_gist.sql seems to be working with boxes, but triggers point 
paths too.

> , also on a SIZEOF_DATUM==4 system since the abbreviation works differently 
> with that, and push if nothing new comes up. And clarify the documentation 
> and/or comments that the sortsupport function sees "compressed" values.
> 
> I wonder if we could use sorting to also speed up building tsvector indexes? 
> The values stored there are bit signatures, what would be a good sort order 
> for those?
We need an order so that nearby values have a lot of bits in common.
What is the length of this signature?
For each 4 bytes we can compute number of 1s in it's binary representation. 
Then z-order these dwords as values 0-32.

This will be very inefficient grouping, but it will tend to keep empty and 
dense 4-byte regions apart.

Thanks for working on this!


Best regards, Andrey Borodin.




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Tom Lane
Fujii Masao  writes:
> The patch was back-patched to v13, but v13 release note still has the 
> following item.

>  Tighten libpq's overlength-line handling and comment detection for 
> .pgpass files (Fujii Masao)

> This should be changed to the following or something?

>  Teach libpq to handle arbitrary-length lines in .pgpass files (Tom Lane)

Hm.  Actually, since that went further back than v13, I don't think
it should appear in the v13 notes at all; it will be material for
the next quarterly update release notes.

We could adjust the release-note entry for your patch to say
"Improve comment detection for .pgpass files", or we could decide
it's not worth documenting that part and just drop the entry.

regards, tom lane




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Fujii Masao




On 2020/09/09 21:15, Peter Eisentraut wrote:

On 2020-09-09 07:40, Fujii Masao wrote:

Attached is a patch to standardize Japanese names as given-name-first
in the v13 contributors list as before.


Using given-name-first order is our consensus? I was thinking we have not
reached that yet and our "vague" consensus was to use the name that each
contributor prefers, for example the name that used in the email signature, etc.

BTW, if possible I'd like to use family-name-first, i.e., "Fujii Masao".


See commit 53c89aed7b38ab412fddc1d6118822ce5d962acd for when this was changed.

At least it's the current practice.  It can be debated whether it's a good 
practice.


Thanks for letting me know that! Ok, using family-name-first rule only
for me would make the maintain of contributors list harder, so I'm ok
to follow the given-name-first rule for the list.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Fujii Masao




On 2020/09/02 0:14, Tom Lane wrote:

Fujii Masao  writes:

The patch looks good to me, except the following minor thing.
+   if (fgets(buf.data + buf.len, buf.maxlen - buf.len - 1, fp) == 
NULL)
IIUC fgets() reads the data with the specified size - 1, so ISTM that -1 of
"buf.maxlen - buf.len - 1" is not necessary.


Good point, I was being unduly conservative.  Thanks for reviewing
the patch!


Thanks for committing the patch!

The patch was back-patched to v13, but v13 release note still has the following 
item.

Tighten libpq's overlength-line handling and comment detection for .pgpass 
files (Fujii Masao)

This should be changed to the following or something?

Teach libpq to handle arbitrary-length lines in .pgpass files (Tom Lane)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Minor fixes for upcoming version 13

2020-09-09 Thread Tom Lane
John Naylor  writes:
> On Wed, Sep 9, 2020 at 12:16 PM Alexander Lakhin  wrote:
>> 09.09.2020 18:29, Tom Lane wrote:
>>> But it's not user-visible is it?  That should surely be a can't-happen
>>> case.

>> I've encountered this while translating NLS messages in postgres.po and
>> ecpg.po. So I think it should at least puzzle translators.

> Maybe instead Assert(0) with a comment or something like that?

Maybe what we need is yyerror_internal() for messages that don't require
translation.  Or just convert it to a plain elog(ERROR) ... that would
lose the possibility of providing an error cursor, but that seems like
something we don't need here either.

regards, tom lane




Re: Minor fixes for upcoming version 13

2020-09-09 Thread John Naylor
On Wed, Sep 9, 2020 at 12:16 PM Alexander Lakhin  wrote:
>
> Hello Tom,
>
> 09.09.2020 18:29, Tom Lane wrote:
> > But it's not user-visible is it?  That should surely be a can't-happen
> > case.
> I've encountered this while translating NLS messages in postgres.po and
> ecpg.po. So I think it should at least puzzle translators.

Maybe instead Assert(0) with a comment or something like that?

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Minor fixes for upcoming version 13

2020-09-09 Thread Alexander Lakhin
Hello Tom,

09.09.2020 18:29, Tom Lane wrote:
> Alexander Lakhin  writes:
>> Please consider fixing minor defects in the source code and documentation.
> I agree with all of these, except the markup fixes in bgworker.sgml
> still seem not right; it should be more like
>
>   RegisterBackgroundWorker(BackgroundWorker 
> *worker)
Yes, but I've tried to minimize the change and align with the adjacent
text. For example:
Once running, the process can connect to a database by calling
   BackgroundWorkerInitializeConnection(char
*dbname...
>> 6. "unhandled previous state in xqs" -> "unhandled previous state at end
>> quote"?
>> "xqs" is rather cryptic for a user-visible message
> But it's not user-visible is it?  That should surely be a can't-happen
> case.
I've encountered this while translating NLS messages in postgres.po and
ecpg.po. So I think it should at least puzzle translators.

Best regards,
Alexander




Re: VACUUM (INTERRUPTIBLE)?

2020-09-09 Thread Magnus Hagander
On Wed, Sep 9, 2020 at 12:58 AM Andres Freund  wrote:

> Hi,
>
> On 2020-09-08 22:30:40 +0200, Magnus Hagander wrote:
> > One thing I've been wanting many times but never properly got around to
> > investigating how much work it would be to make happen, was to be able to
> > trigger an autovacuum manually (yeah, strange choice of terms). That is,
> > instead of the above, you'd have something like "VACUUM BACKGROUND" which
> > would trigger an autovacuum worker to do the work, and then release your
> > session. The points then being both (1) the ability to interrupt it, and
> > (2) that it'd run in the backgorund and thus the foreground session could
> > disconnect.
> >
> > I think both would probably solve your problem, and being able to
> trigger a
> > background one would add some extra value? But we'd have to figure out
> and
> > be clear about what to do if all workers are busy for example - queue or
> > error?
> >
> > Worth considering, or am I missing something?
>
> It seems like it could be useful in general. Not that much for my case
> however. It'd be much easier to test whether vacuum was successfully
> cancelled if we can see the cancellation, which we can't in the
> autovacuum case. And there'd likely be some fairly ugly logic around
>

That does bring up the other thing that I even put together some hacky
patch for at one point (long since lost or badly-rebased-into-nothingness)
which is to have the stats collector track on a per-relation basis the
number of autovacuum interruptions that have happened on a specific table :)

But yes, with that it would still not be *as easy* to use, definitely.


> needing to wait until the "autovacuum request" is processed etc,
> including the case that all workers are currently busy.


> So my INTERRUPTIBLE idea seems to be a better fit for the tests, and
> independently quite useful. E.g. wanting to know whether VACUUM errored
> out is useful for scripts that want their VACUUMs to be interruptible,
> and that doesn't work well with the "backgrounding" idea of yours.
>
> Having said that, your idea does seem like it could be helpful. The
> difficulty seems to depend a bit on the exact desired
> semantics. E.g. would the "queue" command block until vacuum started or
> not? The latter would obviously be much easier...


Yeah, that's where I stalled in my own thoughts about it, I think. OTOH, why
wait for it to start if you're not waiting for it to finish... But also,
if there is a max size of the queue, what do you do if you hit it?
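
To make the two ideas concrete, here is a sketch of what the syntax might
look like; both option names are hypothetical and neither exists today:

VACUUM (INTERRUPTIBLE) pgbench_accounts;  -- runs in the foreground, but can
                                          -- be cancelled the way autovacuum can
VACUUM (BACKGROUND) pgbench_accounts;     -- hands the job to an autovacuum
                                          -- worker and releases the session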

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Fujii Masao




On 2020/09/09 22:57, Magnus Hagander wrote:

On Wed, Sep 9, 2020 at 3:56 PM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

On Wed, Sep 09, 2020 at 03:53:40PM +0530, Amit Kapila wrote:
 >On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander <mag...@hagander.net> wrote:
 >>
 >> On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
 >>>
 >>
 >> Though in fact the one inconsistent place in the code now is that if it 
is corrupt in the db entry part of the file it returns true and the global timestamp, 
which I would argue is perhaps incorrect and it should return false.
 >>
 >
 >Yeah, this is exactly the case I was pointing out where we return true
 >before the patch, basically the code below:
 >case 'D':
 >if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
 >  fpin) != offsetof(PgStat_StatDBEntry, tables))
 >{
 >ereport(pgStatRunningInCollector ? LOG : WARNING,
 >(errmsg("corrupted statistics file \"%s\"",
 >statfile)));
 >goto done;
 >}
 >
 >done:
 >FreeFile(fpin);
 >return true;
 >
 >Now, if we decide to return 'false' here, then surely there is no
 >argument and we should return false in other cases as well. Basically,
 >I think we should be consistent in handling the corrupt file case.
 >

FWIW I do agree with this - we should return false here, to make it
return false like in the other data corruption cases. AFAICS that's the
only inconsistency here.


+1, I think that's the place to fix, rather than reversing all the other places.


+1 as I suggested upthread!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread John Naylor
On Sat, Sep 5, 2020 at 7:21 PM Tomas Vondra
 wrote:
>
> OK, here is a rebased version. Most of the breakage was due to changes
> to the BRIN sgml docs.

Hi Tomas,

I plan on trying some different queries on different data
distributions to get a sense of when the planner chooses a
multi-minmax index, and whether the choice is good.

Just to start, I used the artificial example in [1], but scaled down a
bit to save time. Config is at the default except for:
shared_buffers = 1GB
random_page_cost = 1.1;
effective_cache_size = 4GB;

create table t (a bigint, b int) with (fillfactor=95);

insert into t select i + 1000*random(), i+1000*random()
  from generate_series(1,10000000) s(i);

update t set a = 1, b = 1 where random() < 0.001;
update t set a = 1000, b = 1000 where random() < 0.001;

analyze t;

create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)

explain analyze select * from t
  where a between 1923300::int and 1923600::int;

QUERY PLAN
--
 Bitmap Heap Scan on t  (cost=16.10..43180.43 rows=291 width=12)
(actual time=217.770..1131.366 rows=288 loops=1)
   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
   Rows Removed by Index Recheck: 712
   Heap Blocks: lossy=56819
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
width=0) (actual time=3.054..3.055 rows=568320 loops=1)
 Index Cond: ((a >= 1923300) AND (a <= 1923600))
 Planning Time: 0.328 ms
 Execution Time: 1131.411 ms
(8 rows)

Now add the multi-minmax:

create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)

The first interesting thing is, with both BRIN indexes available, the
planner still chooses the conventional BRIN index. Only when I disable
it, does it choose the multi-minmax index:

explain analyze select * from t
  where a between 1923300::int and 1923600::int;

   QUERY PLAN
-
 Bitmap Heap Scan on t  (cost=68.10..43160.86 rows=291 width=12)
(actual time=1.835..4.196 rows=288 loops=1)
   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
   Rows Removed by Index Recheck: 22240
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
width=0) (actual time=0.691..0.691 rows=1280 loops=1)
 Index Cond: ((a >= 1923300) AND (a <= 1923600))
 Planning Time: 0.250 ms
 Execution Time: 4.244 ms
(8 rows)

I wonder if this is a clue that something in the costing unfairly
penalizes a multi-minmax index. Maybe not enough to matter in
practice, since I wouldn't expect a user to put different kinds of
index on the same column.

The second thing is, with parallel seq scan, the query is faster than
a BRIN bitmap scan, with this pathological data distribution, but the
planner won't choose it unless forced to:

set enable_bitmapscan = 'off';
explain analyze select * from t
  where a between 1923300::int and 1923600::int;
  QUERY PLAN
---
 Gather  (cost=1000.00..120348.10 rows=291 width=12) (actual
time=372.766..380.364 rows=288 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..119319.00 rows=121
width=12) (actual time=268.326..366.228 rows=96 loops=3)
 Filter: ((a >= 1923300) AND (a <= 1923600))
 Rows Removed by Filter: 237
 Planning Time: 0.089 ms
 Execution Time: 380.434 ms
(8 rows)

And just to compare size:

BRIN         32kB
BRIN multi  136kB
Btree       188MB

[1] 
https://www.postgresql.org/message-id/459eef3e-48c7-0f5a-8356-992442a78bb6%402ndquadrant.com

-- 
John Naylor    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Amit Kapila wrote:

> I have included Alvaro as he is a committer for 187492b6, so he might
> remember something and let us know if this is a mistake or there is
> some reason for doing so (return true even when the db entry we are
> trying to read is corrupt).

Thanks -- I have to excuse myself here, as I don't have too many
memories about this.  It seems y'all have derived more insight than I
could possibly offer.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Yet another fast GiST build

2020-09-09 Thread Heikki Linnakangas

On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:

On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:


Come to think of it, the point z-order comparator could benefit a lot
from key abbreviation, too. You could do the point -> zorder conversion
in the abbreviation routine.


That's how it works in PostGIS, only that we moved to more
effecient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Thanks, that's interesting.
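
As a sketch of the point -> zorder conversion under discussion: plain bit
interleaving producing a 64-bit Morton code, assuming 32-bit per-axis inputs
(the PostGIS code linked above uses a Hilbert curve instead, and the function
names here are made up):

static uint64
part_bits32_by2(uint32 x)
{
	uint64		n = x;

	/* spread the 32 input bits over the even bit positions */
	n = (n | (n << 16)) & UINT64CONST(0x0000FFFF0000FFFF);
	n = (n | (n << 8)) & UINT64CONST(0x00FF00FF00FF00FF);
	n = (n | (n << 4)) & UINT64CONST(0x0F0F0F0F0F0F0F0F);
	n = (n | (n << 2)) & UINT64CONST(0x3333333333333333);
	n = (n | (n << 1)) & UINT64CONST(0x5555555555555555);

	return n;
}

static uint64
point_zorder(uint32 x, uint32 y)
{
	/* x occupies the even bits, y the odd bits */
	return part_bits32_by2(x) | (part_bits32_by2(y) << 1);
}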

I implemented the abbreviated keys for the point opclass, too, and 
noticed that the patch as it stood never actually used it. I reworked the patch so 
that tuplesort_begin_index_gist() is responsible for looking up the 
sortsupport function, like tuplesort_begin_index_btree() does, and uses 
abbreviation when possible.


I think this is pretty much ready for commit now. I'll do a bit more 
testing (do we have regression test coverage for this?), also on a 
SIZEOF_DATUM==4 system since the abbreviation works differently with 
that, and push if nothing new comes up. And clarify the documentation 
and/or comments that the sortsupport function sees "compressed" values.


I wonder if we could use sorting to also speed up building tsvector 
indexes? The values stored there are bit signatures, what would be a 
good sort order for those?


- Heikki
>From 3a4d9c14631ae54b983d75433e0286f3dfedf432 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 9 Sep 2020 18:33:18 +0300
Subject: [PATCH v17 1/1] Add sort support for point gist_point_sortsupport

Implement GiST build using sort support

We use special sorting function provided by opclass to approximate
GiST tree with B-tree-like structure. This approach allows to
radically reduce build time in some cases.

Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
---
 doc/src/sgml/gist.sgml  |  66 
 src/backend/access/gist/gistbuild.c | 499 
 src/backend/access/gist/gistproc.c  | 224 +++
 src/backend/access/gist/gistutil.c  |  53 ++-
 src/backend/access/gist/gistvalidate.c  |   6 +-
 src/backend/access/transam/xloginsert.c |  57 +++
 src/backend/utils/sort/sortsupport.c|  34 ++
 src/backend/utils/sort/tuplesort.c  |  57 +++
 src/include/access/gist.h   |   3 +-
 src/include/access/gist_private.h   |   3 +
 src/include/access/xloginsert.h |   2 +
 src/include/catalog/catversion.h|   1 +
 src/include/catalog/pg_amproc.dat   |   2 +
 src/include/catalog/pg_proc.dat |   3 +
 src/include/utils/sortsupport.h |   1 +
 src/include/utils/tuplesort.h   |   4 +
 16 files changed, 912 insertions(+), 103 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..b049094c811 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to speed up
+   building a GiST index.
  
 
  
@@ -1065,6 +1067,70 @@ my_compress(PG_FUNCTION_ARGS)
   
  
 
+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator routine preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+   The SQL declaration of the function must look like
+   this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+   The argument is a pointer to a SortSupport
+   struct. At a minimum, the function must fill in its comparator field,
+   the full API is defined in
+   src/include/utils/sortsupport.h.
+   
+
+   
+The matching code in the C module could then follow this skeleton:
+
+
+PG_FUNCTION_INFO_V1(my_sortsupport);
+
+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+  /* establish order between x and y by computing some sorting value z */
+
+  int z1 = ComputeSpatialCode(x);
+  int z2 = ComputeSpatialCode(y);
+
+  return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}
+
+Datum
+my_sortsupport(PG_FUNCTION_ARGS)
+{
+  SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+  ssup->comparator = my_fastcmp;
+  PG_RETURN_VOID();
+}
+
+  
+ 
+
   
 
   
diff 





Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-09 Thread Justin Pryzby
On Wed, Sep 09, 2020 at 09:22:00PM +0900, Michael Paquier wrote:
> On Tue, Sep 08, 2020 at 07:17:58PM -0500, Justin Pryzby wrote:
> > Initially I added List *params, and Michael suggested to retire
> > ReindexStmt->concurrent.  I provided a patch to do so, initially by leaving 
> > int
> > options and then, after this, removing it to "complete the thought", and get
> > rid of the remnants of the "old way" of doing it.  This is also how vacuum 
> > and
> > explain are done.
> > https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com
> 
> Defining a set of DefElem when parsing and then using the int
> "options" with bitmasks where necessary at the beginning of the
> execution looks like a good balance to me.  This way, you can extend
> the grammar to use things like (verbose = true), etc.

It doesn't need to be extended - defGetBoolean already handles that.  I don't
see what good can come from storing the information in two places in the same
structure.
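
For readers following along, a sketch of the DefElem-list pattern being
discussed, using the existing defGetBoolean() helper; the option and flag
names are abridged examples, and stmt->params is assumed to hold the parsed
list as in the proposed patch:

ListCell   *lc;
int			options = 0;

foreach(lc, stmt->params)
{
	DefElem    *opt = (DefElem *) lfirst(lc);

	/* defGetBoolean() accepts VERBOSE, (VERBOSE on), (verbose = true), ... */
	if (strcmp(opt->defname, "verbose") == 0)
	{
		if (defGetBoolean(opt))
			options |= REINDEXOPT_VERBOSE;
	}
	else if (strcmp(opt->defname, "concurrently") == 0)
	{
		if (defGetBoolean(opt))
			options |= REINDEXOPT_CONCURRENTLY;
	}
	else
		ereport(ERROR,
				(errcode(ERRCODE_SYNTAX_ERROR),
				 errmsg("unrecognized REINDEX option \"%s\"", opt->defname)));
}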

|postgres=# CLUSTER (VERBOSE on) pg_attribute USING 
pg_attribute_relid_attnum_index ;
|INFO:  clustering "pg_catalog.pg_attribute" using index scan on 
"pg_attribute_relid_attnum_index"
|INFO:  "pg_attribute": found 0 removable, 2968 nonremovable row versions in 55 
pages
|DETAIL:  0 dead row versions cannot be removed yet.
|CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s.
|CLUSTER

-- 
Justin




Re: Minor fixes for upcoming version 13

2020-09-09 Thread Tom Lane
Alexander Lakhin  writes:
> Please consider fixing minor defects in the source code and documentation.

I agree with all of these, except the markup fixes in bgworker.sgml
still seem not right; it should be more like

  RegisterBackgroundWorker(BackgroundWorker *worker)

> 6. "unhandled previous state in xqs" -> "unhandled previous state at end
> quote"?
> "xqs" is rather cryptic for a user-visible message

But it's not user-visible is it?  That should surely be a can't-happen
case.

Will push in a little bit.

regards, tom lane




Re: unsupportable composite type partition keys

2020-09-09 Thread Julien Rouhaud
On Wed, Sep 9, 2020 at 4:17 PM Jobin Augustine  wrote:
>
> Is there a way out if someone accidentally executes the same test case 
> against PG12?
>
> testdb=# create table partitioned (a int, b int)
> testdb-#   partition by list ((row(a, b)::partitioned));
> CREATE TABLE
> testdb=# DROP TABLE partitioned;
> ERROR:  cache lookup failed for type 18269

AFAICT this is only a side effect of that particular use case if you
try to drop it without having a relcache entry.  Do any access before
dropping it and it should be fine, for instance:

rjuju=# create table partitioned (a int, b int)
rjuju-# partition by list ((row(a, b)::partitioned));
CREATE TABLE
rjuju=# DROP TABLE partitioned;
ERROR:  cache lookup failed for type 144845
rjuju=# \d partitioned
  Partitioned table "public.partitioned"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |  |
Partition key: LIST ((ROW(a, b)::partitioned))
Number of partitions: 0

rjuju=# DROP TABLE partitioned;
DROP TABLE




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
I wrote:
> Stephen Frost  writes:
>> As I mentioned over there, I agree that we should do this and we should
>> further have the statistics collector also do so, which currently sets
>> up SIGQUIT with SignalHandlerForShutdownRequest() and in its loop decides it's
>> fine to write out the stats file (which we're going to remove during
>> recovery anyway...) and then call exit(0).

> I noticed that that was different from everything else, but it's not
> actually signal-unsafe, so it seems like a different topic from what
> I'm on about at the moment.  I don't mind if you or somebody else
> wants to change it, but I don't see it as a back-patchable bug fix.

Note also that the postmaster actually uses SIGQUIT to command normal
shutdown of the stats collector (cf reaper(), around line 3125 in HEAD).
So this needs a change in signaling conventions, not just internal
tweaks in the collector.  Not a big deal, but it reinforces my feeling
that it should be a HEAD-only change.

regards, tom lane




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
>> anywhere, so I did a quick survey of backend signal handlers to see if
>> that is true.  I immediately found pgarch_exit() which surely is not.  It
>> turns out that Horiguchi-san already noticed that and proposed to fix it,
>> within the seemingly entirely unrelated patch series for a shared-memory
>> based stats collector (see patch 0001 in [2]).  I think we should go ahead
>> and commit that, and maybe even back-patch it.  There seems no good reason
>> for the archiver to treat SIGQUIT as nonfatal when other postmaster
>> children do not.

> As I mentioned over there, I agree that we should do this and we should
> further have the statistics collector also do so, which currently sets
> up SIGQUIT with SignalHandlerForShutdownRequest() and in its loop decides it's
> fine to write out the stats file (which we're going to remove during
> recovery anyway...) and then call exit(0).

I noticed that that was different from everything else, but it's not
actually signal-unsafe, so it seems like a different topic from what
I'm on about at the moment.  I don't mind if you or somebody else
wants to change it, but I don't see it as a back-patchable bug fix.

> I also think we should
> back-patch these changes, as the commit mentioned in Horiguchi-san's
> patch for pgarch_exit() was.

Agreed; I'll go make it happen.

regards, tom lane




Re: shared-memory based stats collector

2020-09-09 Thread Stephen Frost
Greetings,

* Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > Shouldn't this:
> > 
> > a) be back-patched, as the other change was
> > b) also include a change to have the stats collector (which I realize is
> >removed later on in this patch set, but we're talking about fixing
> >existing things..) for the same reason, and because there isn't much
> >point in trying to write out the stats after we get a SIGQUIT, since
> >we're just going to blow them away again since we're going to go
> >through crash recovery..?
> > 
> > Might be good to have a separate thread to address these changes.
> 
> +1

Just FYI, Tom's started a thread which includes this over here-

https://postgr.es/m/1850884.1599601...@sss.pgh.pa.us

Thanks,

Stephen




Re: SIGQUIT handling, redux

2020-09-09 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> This is to pull together a couple of recent threads that are seemingly
> unrelated to $SUBJECT.
> 
> Over in the long thread at [1] we've agreed to try to make the backend
> code actually do what assorted comments claim it does, namely allow
> SIGQUIT to be accepted at any point after a given process has established
> its signal handlers.  (Right now, it mostly is not accepted during error
> recovery, but there's no clear reason to preserve that exception.)
> 
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.  I immediately found pgarch_exit() which surely is not.  It
> turns out that Horiguchi-san already noticed that and proposed to fix it,
> within the seemingly entirely unrelated patch series for a shared-memory
> based stats collector (see patch 0001 in [2]).  I think we should go ahead
> and commit that, and maybe even back-patch it.  There seems no good reason
> for the archiver to treat SIGQUIT as nonfatal when other postmaster
> children do not.

As I mentioned over there, I agree that we should do this and we should
further have the statistics collector also do so, which currently sets
up SIGQUIT with SignalHandlerForShutdownRequest() and in its loop decides it's
fine to write out the stats file (which we're going to remove during
recovery anyway...) and then call exit(0).  I also think we should
back-patch these changes, as the commit mentioned in Horiguchi-san's
patch for pgarch_exit() was.

> Another thing I found is startup_die() in postmaster.c, which catches
> SIGQUIT during the authentication phase of a new backend.  This calls
> proc_exit(1), which (a) is unsafe on its face, and (b) potentially
> demotes a SIGQUIT into something that will not cause the parent postmaster
> to perform a DB-wide restart.  There seems no good reason for that
> behavior, either.  I propose that we use SignalHandlerForCrashExit
> for SIGQUIT here too.

Yikes, agreed.

> In passing, it's worth noting that startup_die() isn't really much safer
> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> those is that code that applies BlockSig will at least manage to block the
> former.  So it's slightly tempting to drop startup_die() altogether in
> favor of using SignalHandlerForCrashExit for the SIGTERM-during-auth case
> too.  However, that'd have the effect of causing a "fast shutdown" to get
> converted to a panic if it happens while any sessions are in
> authentication phase, so I think this cure is likely worse than the
> disease.

Agreed, that's definitely no good.

> Worth reading in connection with this is the thread that led up to
> commits 8e19a8264 et al [3].  We had a long discussion about how
> quickdie() and startup_die() might be made safe(r), without any
> real consensus emerging about any alternatives being better than the
> status quo.  I still don't have an improvement idea for quickdie();
> I don't want to give up trying to send a message to the client.
> Note, however, that quickdie() does end with _exit(2) so that at
> least it's not trying to execute arbitrary process-cleanup code.
> 
> In short then, I want to ensure that postmaster children's SIGQUIT
> handlers always end with _exit(2) rather than some other exit method.
> We have two exceptions now and they need to get fixed.

I agree we should fix these.

Thanks,

Stephen




Re: doc review for v13

2020-09-09 Thread Justin Pryzby
I've added a few more.

-- 
Justin
>From 137321a0d476f66b5e5f21c2f627c407330e50b1 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 30 Mar 2020 19:43:22 -0500
Subject: [PATCH v8 01/14] doc: btree deduplication

commit 0d861bbb702f8aa05c2a4e3f1650e7e8df8c8c27
Author: Peter Geoghegan 
---
 doc/src/sgml/btree.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index d03ee4d6fa..69c1ee0e97 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -642,7 +642,7 @@ options(relopts local_relopts *) returns
   
   
Deduplication works by periodically merging groups of duplicate
-   tuples together, forming a single posting list tuple for each
+   tuples together, forming a single posting list tuple for each
group.  The column key value(s) only appear once in this
representation.  This is followed by a sorted array of
TIDs that point to rows in the table.  This
-- 
2.17.0

>From fc883d317a895140771ce34564847bb9ca98b7e3 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 29 Mar 2020 21:22:43 -0500
Subject: [PATCH v8 02/14] doc: pg_stat_progress_basebackup

commit e65497df8f85ab9b9084c928ff69f384ea729b24
Author: Fujii Masao 
---
 doc/src/sgml/monitoring.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 304c49f07b..ea8780327f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6069,8 +6069,8 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   waiting for checkpoint to finish
   
The WAL sender process is currently performing
-   pg_start_backup to set up for
-   taking a base backup, and waiting for backup start
+   pg_start_backup to prepare to
+   take a base backup, and waiting for the start-of-backup
checkpoint to finish.
   
  
-- 
2.17.0

>From 2d7c3ca43db77c910e8cda4b53fd6c6b09421b40 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 7 Apr 2020 19:56:56 -0500
Subject: [PATCH v8 03/14] doc: Allow users to limit storage reserved by
 replication slots

commit c6550776394e25c1620bc8258427c8f1d448080d
Author: Alvaro Herrera 
---
 doc/src/sgml/config.sgml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a7177c550..190157df0a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3900,9 +3900,9 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
 slots are allowed to retain in the pg_wal
 directory at checkpoint time.
 If max_slot_wal_keep_size is -1 (the default),
-replication slots retain unlimited amount of WAL files.  If
-restart_lsn of a replication slot gets behind more than that megabytes
-from the current LSN, the standby using the slot may no longer be able
+replication slots may retain an unlimited amount of WAL files.  Otherwise, if
+restart_lsn of a replication slot falls behind the current LSN by more
+than the given size, the standby using the slot may no longer be able
 to continue replication due to removal of required WAL files. You
 can see the WAL availability of replication slots
 in pg_replication_slots.
-- 
2.17.0

>From e06f8c6f048daa9829dc3bf0450caf5c099d9e4f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 6 Apr 2020 17:16:07 -0500
Subject: [PATCH v8 04/14] doc: s/evade/avoid/

---
 src/backend/access/gin/README | 2 +-
 src/backend/utils/adt/jsonpath_exec.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index 125a82219b..41d4e1e8a0 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -413,7 +413,7 @@ leftmost leaf of the tree.
 Deletion algorithm keeps exclusive locks on left siblings of pages comprising
 currently investigated path.  Thus, if current page is to be removed, all
 required pages to remove both downlink and rightlink are already locked.  That
-evades potential right to left page locking order, which could deadlock with
+avoids potential right to left page locking order, which could deadlock with
 concurrent stepping right.
 
 A search concurrent to page deletion might already have read a pointer to the
diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index f146767bfc..2d2eb7d7a3 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -35,7 +35,7 @@
  * executeItemOptUnwrapTarget() function have 'unwrap' argument, which indicates
  * whether unwrapping of array is needed.  When unwrap == true, each of array
  * members is passed to executeItemOptUnwrapTarget() again but with unwrap == false
- * in order to evade subsequent array unwrapping.
+ * in order to avoid subsequent array unwrapping.

Re: [UNVERIFIED SENDER] Re: Fix for parallel BTree initialization bug

2020-09-09 Thread Jameson, Hunter 'James'
Also, the behavior (=line of code) added by the bug fix is the same as existing 
code in the same function, _bt_first(), at lines 898, 1096, 1132, 1367. And the 
calls to _bt_parallel_readpage(), line 903, and _bt_steppage(), line 1416, will 
also ultimately call _bt_parallel_done(). So the bug seems to be a pretty 
simple oversight: in 6 out of 7 cases in _bt_first(), we call 
_bt_parallel_done() before returning "false"; but in the 7th case (fixed in 
this bug fix), we do not. The fix is to make case #7 the same as the other 6.
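
A sketch of the shape of the fix, with the surrounding _bt_first() context
paraphrased rather than quoted:

    if (!so->qual_ok)
    {
        /*
         * Case #7: tell the shared parallel-scan state we are done,
         * exactly as the other six early-return paths already do.
         */
        _bt_parallel_done(scan);
        return false;
    }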

James

On 9/9/20, 7:11 AM, "Jameson, Hunter 'James'"  wrote:

Hi, I spent some time trying to create a repro (other than testing it on 
the production instance where we encountered the bug), but was unable to create 
one within a reasonable time.

The tricky part is that the bug symptoms are run-time symptoms -- so not 
only do you need, first, to satisfy conditions (1), (2), and (3), without the 
query optimizer optimizing them away! -- but you also need, second, a query 
that runs long enough for one or more of the parallel workers' state machines 
to get confused. (This wasn't a problem on the production instance where we 
encountered the bug and I tested the fix.)

Also, third-- passing InvalidBlockNumber to ReadBuffer() generally just 
appends a new block to the relation, so the bug doesn't even result in an error 
condition on an RW instance. (The production instance was RO...) So the bug, 
although very small!, is annoying!

James

On 9/9/20, 6:14 AM, "Amit Kapila"  wrote:




On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
 wrote:
>
> Hi, I ran across a small (but annoying) bug in initializing parallel 
BTree scans, which causes the parallel-scan state machine to get confused.
>
>
> To reproduce, you need a query that:
>
>
>
> 1. Executes parallel BTree index scan;
>
> 2. Has an IN-list of size > 1;
>
> 3. Has an additional index filter that makes it impossible to satisfy 
the
>
> first IN-list condition.
>
>
>
> (We encountered such a query, and therefore the bug, on a production 
instance.)
>
>

I think I can understand what you are pointing out here but it would
be great if you can have a reproducible test case because that will
make it apparent and we might want to include that in the regression
tests if possible.

--
With Regards,
Amit Kapila.




Re: Optimising compactify_tuples()

2020-09-09 Thread David Rowley
On Wed, 9 Sep 2020 at 05:38, Thomas Munro  wrote:
>
> On Wed, Sep 9, 2020 at 3:47 AM David Rowley  wrote:
> > On Tue, 8 Sep 2020 at 12:08, Thomas Munro  wrote:
> > > One thought is that if we're going to copy everything out and back in
> > > again, we might want to consider doing it in a
> > > memory-prefetcher-friendly order.  Would it be a good idea to
> > > rearrange the tuples to match line pointer order, so that the copying
> > > work and also later sequential scans are in a forward direction?
> >
> > That's an interesting idea but wouldn't that require both the copy to
> > the separate buffer *and* a qsort? That's the worst of both
> > implementations. We'd need some other data structure too in order to
> > get the index of the sorted array by reverse lineitem point, which
> > might require an additional array and an additional sort.
>
> Well I may not have had enough coffee yet but I thought you'd just
> have to spin though the item IDs twice.  Once to compute sum(lp_len)
> so you can compute the new pd_upper, and the second time to copy the
> tuples from their random locations on the temporary page to new
> sequential locations, so that afterwards item ID order matches offset
> order.

I think you were adequately caffeinated.  You're right that this is
fairly simple to do, and it looks even simpler than looping twice
over the array.  I think it's just a matter of looping over the
itemidbase backwards and putting the higher itemid tuples at the end
of the page. I've done it this way in the attached patch.

I also added a presorted path which falls back on doing memmoves
without the temp buffer when the itemidbase array indicates that
higher lineitems all have higher offsets.  I'm doing the presort check
in the calling function since that loops over the lineitems already.
We can just memmove the tuples in reverse order without overwriting
any yet to be moved tuples when these are in order.
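
A condensed sketch of those two paths, assuming the itemIdSortData fields
used by bufpage.c (offsetindex, itemoff, alignedlen); "src" is the temp copy
of the page in the general case, or the page itself in the presorted case:

	/*
	 * Walk itemidbase backwards so the tuple with the highest line
	 * pointer lands at the end of the page.
	 */
	upper = phdr->pd_special;
	for (i = nitems - 1; i >= 0; i--)
	{
		itemIdSort	itemidptr = &itemidbase[i];
		ItemId		lp = PageGetItemId(page, itemidptr->offsetindex + 1);

		upper -= itemidptr->alignedlen;
		memmove((char *) page + upper,
				src + itemidptr->itemoff,
				itemidptr->alignedlen);
		lp->lp_off = upper;
	}
	phdr->pd_upper = upper;

In the presorted case each destination is at or above its source, and
everything above it has already been moved, so no tuple is overwritten
before it is copied.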

Also, I added code to collapse the memcpy and memmoves for adjacent
tuples so that we perform the minimal number of calls to those
functions. Once we've previously compacted a page it seems that the
code is able to reduce the number of calls significantly.  I added
some logging and reviewed at after a run of the benchmark and saw that
for about 192 tuples we're mostly just doing 3-4 memcpys in the
non-presorted path and just 2 memmoves, for the presorted code path.
I also found that in my test the presorted path was only taken 12.39%
of the time. Trying with 120 million UPDATEs instead of 12 million in
the test ended up reducing this to just 10.89%.  It seems that it'll
just be 1 or 2 tuples spoiling it since new tuples will still be added
earlier in the page after we free up space to add more.

I also experimented seeing what would happen if I also tried to
collapse the memcpys for copying to the temp buffer. The performance
got a little worse from doing that. So I left that code #ifdef'd out

With the attached v3, performance is better. The test now runs
recovery 65.6 seconds, vs master's 148.5 seconds. So about 2.2x
faster.

We should probably consider what else can be done to try to write
pages with tuples for earlier lineitems earlier in the page.  VACUUM
FULLs and friends will switch back to the opposite order when
rewriting the heap.

Also fixed my missing libc debug symbols:

  24.90%  postgres  postgres[.] PageRepairFragmentation
  15.26%  postgres  libc-2.31.so[.] __memmove_avx_unaligned_erms
   9.61%  postgres  postgres[.] hash_search_with_hash_value
   8.03%  postgres  postgres[.] compactify_tuples
   6.25%  postgres  postgres[.] XLogReadBufferExtended
   3.74%  postgres  postgres[.] PinBuffer
   2.25%  postgres  postgres[.] hash_bytes
   1.79%  postgres  postgres[.] heap_xlog_update
   1.47%  postgres  postgres[.] LWLockRelease
   1.44%  postgres  postgres[.] XLogReadRecord
   1.33%  postgres  postgres[.] PageGetHeapFreeSpace
   1.16%  postgres  postgres[.] DecodeXLogRecord
   1.13%  postgres  postgres[.] pg_comp_crc32c_sse42
   1.12%  postgres  postgres[.] LWLockAttemptLock
   1.09%  postgres  postgres[.] StartupXLOG
   0.90%  postgres  postgres[.] ReadBuffer_common
   0.84%  postgres  postgres[.] SlruSelectLRUPage
   0.74%  postgres  libc-2.31.so[.] __memcmp_avx2_movbe
   0.68%  postgres  [kernel.kallsyms]   [k] copy_user_generic_string
   0.66%  postgres  postgres[.] PageAddItemExtended
   0.66%  postgres  postgres[.] PageIndexTupleOverwrite
   0.62%  postgres  postgres[.] smgropen
   0.60%  postgres  postgres[.] ReadPageInternal
   0.57%  postgres  postgres[.] GetPrivateRefCountEntry
   0.52%  postgres  postgres[.] heap_redo
   0.51%  postgres  postgres[.] AdvanceNextFullTransactionIdPastXid

David
diff 

Re: Reduce the time required for a database recovery from archive.

2020-09-09 Thread Stephen Frost
Greetings,

* Dmitry Shulga (d.shu...@postgrespro.ru) wrote:
> Overall archive file processing is done one by one, and this might
> create a performance bottleneck if archived WAL files are delivered slowly,
> because the database server has to wait for arrival of the next
> WAL segment before applying its records.
> 
> To address this issue it is proposed to receive archived WAL files in parallel
> so that when the next WAL segment file is required for processing of redo log
> records it would be already available.

Yes, pgbackrest already does exactly this (if configured)- uses parallel
processes to fetch the WAL and have it be available ahead of time.

> Implementation of this approach assumes running several background processes 
> (bgworkers)
> each of which runs a shell command specified by the parameter restore_command
> to deliver an archived WAL file. Number of running parallel processes is 
> limited
> by the new parameter max_restore_command_workers. If this parameter has value 0
> then WAL files delivery is performed using the original algorithm, that is in
> one-by-one manner. If this parameter has value greater than 0 then the 
> database
> server starts several bgworker processes up to the limit specified by
> the parameter max_restore_command_workers and passes to every process
> WAL file name to deliver. Active processes start prefetching of specified
> WAL files and store received files in the directory pg_wal/pgsql_tmp. After
> bgworker process finishes receiving a file it marks itself as a free process
> and waits for a new request to receive a next WAL file. The main process
> performing database recovery still handles WAL files in one-by-one manner,
> but instead of waiting for a next required WAL file's availability it checks 
> for
> that file in the prefetched directory. If a new file is present there,
> the main process starts its processing.
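
For concreteness, roughly how this would be configured; the parameter name
comes from the description above, and the values are arbitrary examples:

# postgresql.conf sketch for the proposed prefetching
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
max_restore_command_workers = 4   # up to 4 bgworkers prefetch WAL segments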

I'm a bit confused about this description- surely it makes sense for the
parallel workers to continue to loop and fetch up to some specified
max..?  Then to monitor and to fetch more when the amount pre-fetched so
far drops before that level?  The description above makes it sound like
X WAL will be fetched ahead of time, and then the recovery process will
go through those until it runs out and then it'll have to wait for the
next X WAL to be fetched, which means it's still going to end up being
delayed even with these parallel processes, which isn't good.

Does this also properly handle timeline switches..?

Thanks,

Stephen




Re: Online checksums patch - once again

2020-09-09 Thread Daniel Gustafsson
> On 7 Sep 2020, at 09:17, Michael Paquier  wrote:

> Daniel, could you look at that?

I believe this boils down to a timing issue, I've included a fix in the v21
patch attached to a previous mail upthread.

cheers ./daniel




Re: unsupportable composite type partition keys

2020-09-09 Thread Jobin Augustine
Is there a way out if someone accidentally executes the same test case
against PG12?

testdb=# create table partitioned (a int, b int)
testdb-#   partition by list ((row(a, b)::partitioned));
CREATE TABLE
testdb=# DROP TABLE partitioned;
ERROR:  cache lookup failed for type 18269


>
> Ah, I wasn't sure that additional tests on a table would be worthwhile
> enough.
> Thanks for tweaking and pushing!
>
>
>


Re: More aggressive vacuuming of temporary tables

2020-09-09 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2020-08-28 11:46:49 -0400, Tom Lane wrote:
> > It strikes me that when we are vacuuming a temporary table (which
> > necessarily will be one of our own session), we don't really need
> > to care what the global xmin horizon is.  If we're not inside a
> > user transaction block, then there are no tuples in the table that
> > could be in-doubt anymore.  Neither are there any snapshots in our
> > session that could see any dead tuples.  Nor do we give a fig what
> > other sessions might think of those tuples.  So we could just set
> > the xmin cutoff as aggressively as possible, which is to say
> > equal to the nextXid counter.  While vacuuming a temp table is
> > perhaps not something people do very often, I think when they do
> > do it they would like us to clean out all the dead tuples not just
> > some.
> 
> That seems like a good idea.

Agreed.
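
A sketch of where the idea lands in vacuum_set_xid_limits();
RELATION_IS_LOCAL() and ReadNextTransactionId() are existing helpers, but
the exact placement here is illustrative:

	/*
	 * Our own temp table, outside a transaction block: no snapshot
	 * anywhere can still see its dead tuples, so the freshest possible
	 * cutoff is safe.
	 */
	if (RELATION_IS_LOCAL(rel) && !IsInTransactionBlock(true))
		oldestXmin = ReadNextTransactionId();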

> I've been toying with a patch that introduces more smarts about when a
> row is removable, by looking more closely whether a specific row
> versions are visible (e.g. in the common case of one old snapshot and
> lots of newer rows). But that's orders of magnitude more complicated. So
> going for something as simple as this seems like a good idea.

I've wondered about this for a long time- very cool that you've found
time to actually work on a patch.  A couple of different ideas were
discussed previously about how to do that kind of a check- mind talking
about what method you're using, or perhaps just sharing that patch? :)

The potential of such an improvement is huge.

Thanks!

Stephen




Re: Fix for parallel BTree initialization bug

2020-09-09 Thread Jameson, Hunter 'James'
Hi, I spent some time trying to create a repro (other than testing it on the 
production instance where we encountered the bug), but was unable to create one 
within a reasonable time.

The tricky part is that the bug symptoms are run-time symptoms -- so not only 
do you need, first, to satisfy conditions (1), (2), and (3), without the query 
optimizer optimizing them away! -- but you also need, second, a query that runs 
long enough for one or more of the parallel workers' state machines to get 
confused. (This wasn't a problem on the production instance where we 
encountered the bug and I tested the fix.)

Also, third-- passing InvalidBlockNumber to ReadBuffer() generally just appends 
a new block to the relation, so the bug doesn't even result in an error 
condition on an RW instance. (The production instance was RO...) So the bug, 
although very small!, is annoying!

James

On 9/9/20, 6:14 AM, "Amit Kapila"  wrote:




On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
 wrote:
>
> Hi, I ran across a small (but annoying) bug in initializing parallel 
BTree scans, which causes the parallel-scan state machine to get confused.
>
>
> To reproduce, you need a query that:
>
>
>
> 1. Executes parallel BTree index scan;
>
> 2. Has an IN-list of size > 1;
>
> 3. Has an additional index filter that makes it impossible to satisfy the
>
> first IN-list condition.
>
>
>
> (We encountered such a query, and therefore the bug, on a production 
instance.)
>
>

I think I can understand what you are pointing out here but it would
be great if you can have a reproducible test case because that will
make it apparent and we might want to include that in the regression
tests if possible.

--
With Regards,
Amit Kapila.



Minor fixes for upcoming version 13

2020-09-09 Thread Alexander Lakhin
Hello hackers,

Please consider fixing minor defects in the source code and documentation.
1. a missing dot in guc.c
"SELECT name, short_desc FROM pg_settings WHERE short_desc NOT LIKE
'%.'" finds only this one instance.
2. inrange -> in_range
the former spelling is unique
3. sigature -> signature
4. "start timeline %u not found history of timeline %u" -> "start
timeline %u not found in history of timeline %u"
it seems that a preposition is missing
5. BackgroundWorker *worker -> BackgroundWorker
*worker
incorrect Docbook tag
6. "unhandled previous state in xqs" -> "unhandled previous state at end
quote"?
"xqs" is rather cryptic for a user-visible message

I'm not sure about 6, so the attached patch covers only 1-5.

Best regards,
Alexander
diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 6e1cf121de0..1876dfbd5a5 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -34,11 +34,11 @@
   PostgreSQL is started by including the module name in
   shared_preload_libraries.  A module wishing to run a background
   worker can register it by calling
-  RegisterBackgroundWorker(BackgroundWorker *worker)
+  RegisterBackgroundWorker(BackgroundWorker *worker)
   from its _PG_init().  Background workers can also be started
   after the system is up and running by calling the function
-  RegisterDynamicBackgroundWorker(BackgroundWorker
-  *worker, BackgroundWorkerHandle **handle).  Unlike
+  RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
+  BackgroundWorkerHandle **handle).  Unlike
   RegisterBackgroundWorker, which can only be called from within
   the postmaster, RegisterDynamicBackgroundWorker must be
   called from a regular backend or another background worker.
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index d03ee4d6fa0..4a1ac40297c 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -263,7 +263,7 @@

   
   
-   inrange
+   in_range

 
  in_range support functions
diff --git a/doc/src/sgml/intarray.sgml b/doc/src/sgml/intarray.sgml
index 9d2eb52eeb4..956c01d7493 100644
--- a/doc/src/sgml/intarray.sgml
+++ b/doc/src/sgml/intarray.sgml
@@ -452,7 +452,7 @@
 -- a message can be in one or more sections
 CREATE TABLE message (mid INT PRIMARY KEY, sections INT[], ...);
 
--- create specialized index with sigature length of 32 bytes
+-- create specialized index with signature length of 32 bytes
 CREATE INDEX message_rdtree_idx ON message USING GIST (sections gist__int_ops(siglen=32));
 
 -- select messages in section 1 OR 2 - OVERLAP operator
diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index b6260049271..a43c793e289 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -272,7 +272,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
 	 */
 	if (!found_start_timeline)
 		ereport(ERROR,
-errmsg("start timeline %u not found history of timeline %u",
+errmsg("start timeline %u not found in history of timeline %u",
 	   starttli, endtli));
 
 	/* Terminate the list of WAL ranges. */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 20d4784b335..44c3e5051df 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3133,7 +3133,7 @@ static struct config_int ConfigureNamesInt[] =
 	},
 	{
 		{"autovacuum_vacuum_insert_threshold", PGC_SIGHUP, AUTOVACUUM,
-			gettext_noop("Minimum number of tuple inserts prior to vacuum, or -1 to disable insert vacuums"),
+			gettext_noop("Minimum number of tuple inserts prior to vacuum, or -1 to disable insert vacuums."),
 			NULL
 		},
 		_vac_ins_thresh,


Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Magnus Hagander
On Wed, Sep 9, 2020 at 3:56 PM Tomas Vondra 
wrote:

> On Wed, Sep 09, 2020 at 03:53:40PM +0530, Amit Kapila wrote:
> >On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander 
> wrote:
> >>
> >> On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila 
> wrote:
> >>>
> >>
> >> Though in fact the one inconsistent place in the code now is that if it
> is corrupt in the db entry part of the file it returns true and the global
> timestamp, which I would argue is perhaps incorrect and it should return
> false.
> >>
> >
> >Yeah, this is exactly the case I was pointing out where we return true
> >before the patch, basically the code below:
> >case 'D':
> >if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
> >  fpin) != offsetof(PgStat_StatDBEntry, tables))
> >{
> >ereport(pgStatRunningInCollector ? LOG : WARNING,
> >(errmsg("corrupted statistics file \"%s\"",
> >statfile)));
> >goto done;
> >}
> >
> >done:
> >FreeFile(fpin);
> >return true;
> >
> >Now, if we decide to return 'false' here, then surely there is no
> >argument and we should return false in other cases as well. Basically,
> >I think we should be consistent in handling the corrupt file case.
> >
>
> FWIW I do agree with this - we should return false here, to make it
> return false like in the other data corruption cases. AFAICS that's the
> only inconsistency here.
>

+1, I think that's the place to fix, rather than reversing all the other
places.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 03:53:40PM +0530, Amit Kapila wrote:

On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander  wrote:


On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila  wrote:




Though in fact the one inconsistent place in the code now is that if it is 
corrupt in the db entry part of the file it returns true and the global 
timestamp, which I would argue is perhaps incorrect and it should return false.



Yeah, this is exactly the case I was pointing out where we return true
before the patch, basically the code below:
case 'D':
if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
 fpin) != offsetof(PgStat_StatDBEntry, tables))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
statfile)));
goto done;
}

done:
FreeFile(fpin);
return true;

Now, if we decide to return 'false' here, then surely there is no
argument and we should return false in other cases as well. Basically,
I think we should be consistent in handling the corrupt file case.



FWIW I do agree with this - we should return false here, to make it
return false like in the other data corruption cases. AFAICS that's the
only inconsistency here.
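
To make the proposed change concrete, a sketch of how that branch would look
(paraphrased, with the fread() destination named dbentry for the example):

case 'D':
    if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
              fpin) != offsetof(PgStat_StatDBEntry, tables))
    {
        ereport(pgStatRunningInCollector ? LOG : WARNING,
                (errmsg("corrupted statistics file \"%s\"", statfile)));
        FreeFile(fpin);
        return false;   /* instead of "goto done", which returns true */
    }
    break;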


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Online checksums verification in the backend

2020-09-09 Thread Julien Rouhaud
On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> 
> I'll do some becnhmarking and see if I can get some figures, but it'll 
> probably
> take some time.  In the meantime I'm attaching v11 of the patch that should
> address all other comments.

I just realized that I forgot to update one of the Makefiles when moving the TAP
test folder.  v12 attached should fix that.
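
As a usage sketch based on the commit message below (the call signature,
result columns, and SET value are illustrative, not taken verbatim from the
patch):

-- throttle verification via the patch's cost-based delay GUC (milliseconds)
SET checksum_cost_delay = 10;
-- list blocks whose checksum does not match, with expected and found values
SELECT * FROM pg_check_relation('my_table');
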
>From 84d0a4c5ec5d6b6f832fad02d421c204f1bee98b Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v12] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: 
https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  85 +
 doc/src/sgml/func.sgml|  51 +++
 src/backend/postmaster/pgstat.c   |   3 +
 src/backend/storage/page/checksum.c   | 322 +-
 src/backend/utils/adt/Makefile|   1 +
 src/backend/utils/adt/checksumfuncs.c | 218 
 src/backend/utils/init/globals.c  |   7 +
 src/backend/utils/misc/guc.c  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat   |  16 +
 src/include/miscadmin.h   |   7 +
 src/include/pgstat.h  |   3 +-
 src/include/utils/checksumfuncs.h |  31 ++
 src/include/utils/guc_tables.h|   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/check_relation/.gitignore|   2 +
 src/test/modules/check_relation/Makefile  |  14 +
 src/test/modules/check_relation/README|  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 +++
 19 files changed, 1096 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/include/utils/checksumfuncs.h
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4ba49ffaf..b7629fde60 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2137,6 +2137,91 @@ include_dir 'conf.d'
  
 
 
+
+ Cost-based Checksum Verification Delay
+
+ 
+  During the execution of the pg_check_relation()
+  function, the system maintains an internal counter that keeps track of
+  the estimated cost of the various I/O operations that are performed.
+  When the accumulated cost reaches a limit (specified by
+  checksum_cost_limit), the process performing the
+  operation will sleep for a short period of time, as specified by
+  checksum_cost_delay. Then it will reset the counter
+  and continue execution.
+ 
+
+ 
+  This feature is disabled by default. To enable it, set the
+  checksum_cost_delay variable to a nonzero
+  value.
+ 
+
+ 
+  
+   checksum_cost_delay (floating 
point)
+   
+checksum_cost_delay configuration 
parameter
+   
+   
+   
+
+ The amount of time that the process will sleep
+ when the cost limit has been exceeded.
+ If this value is specified without units, it is taken as milliseconds.
+ The default value is zero, which disables the cost-based checksum
+ verification delay feature.  Positive values enable cost-based
+ checksum verification.
+
+   
+  
+
+  
+   checksum_cost_page (integer)
+   
+checksum_cost_page configuration 
parameter
+   
+   
+   
+
+ The estimated cost for verifying a buffer, whether it's found in the
+ shared buffer cache or not. It represents the cost to lock the buffer
+ pool, lookup the shared hash table, read the content of the page from
+ disk and compute its checksum.  The default value is 10.
+
+   
+  
+
+  
+   checksum_cost_limit (integer)
+   
+checksum_cost_limit configuration 
parameter
+   
+   
+   
+
+ The accumulated cost that will cause the verification process to 
sleep.
+ The default value is 200.
+
+   
+  
+ 
+
+ 
+  
+   There are certain operations that hold critical locks and should
+   therefore complete as quickly as possible.  Cost-based checksum
+   verification delays do not occur during such operations.  Therefore 

Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2020-09-09 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-09-07 01:46, Tom Lane wrote:
>> I reviewed the 0002 patch, finding one bug (in int8_sum)

> Ouch, no test coverage.  Should we perhaps remove this function, since 
> it's obsolete and unused?

I don't feel a need to.

>> and a few
>> more calls of int8_numeric that could be converted.  I think the
>> attached updated version is committable, and I'd recommend going
>> ahead with that regardless of the rest of this.  I hadn't realized
>> how many random calls of int8_numeric and int4_numeric we'd grown,
>> but there are a lot, so this is nice cleanup.

> Yes, please go ahead with it.

It's your patch, I figured you'd want to commit it.

regards, tom lane




Minor cleanup of partbounds.c

2020-09-09 Thread Etsuro Fujita
Here is a patch for minor cleanup of the partbounds.c changes made by
commit c8434d64c: 1) removes a useless assignment (in normal builds)
and 2) improves comments a little.

Best regards,
Etsuro Fujita


partbounds-cleanup.patch
Description: Binary data


RE: extension patch of CREATE OR REPLACE TRIGGER

2020-09-09 Thread osumi.takami...@fujitsu.com
Hi


> Those are fixed OK now, but I found 2 new review points.
> 
> 
> 
> COMMENT trigger.c (typo)
> 
> + ereport(ERROR,
> + (errcode(ERRCODE_DUPLICATE_OBJECT),
> + errmsg("trigger \"%s\" for relation \"%s\" is a constraint trigger",
> + stmt->trigname, RelationGetRelationName(rel)),
> + errhint("use CREATE OR REPLACE CONSTRAINT TRIGGER to replace a
> costraint trigger")));
> 
> 
> Typo in the errhint text.
> "costraint" -> "constraint"
Fixed. Thank you.

> 
> 
> COMMENT create_trigger.sgml (add more help?)
> 
> I noticed that the CREATE OR REPLACE FUNCTION help [1] describes the OR
> REPLACE syntax ("Description" section) and also mentions some of the
> restrictions when using REPLACE ("Notes" section).
> [1] - https://www.postgresql.org/docs/current/sql-createfunction.html
> 
> ~~
> OTOH this trigger patch does not add anything much at all in the trigger help.
> 
> Shouldn't the "Description" at least say something like:
> "CREATE OR REPLACE will either create a new trigger, or replace an existing
> definition."
> 
> Shouldn't the "Notes" include information about restrictions when using OR
> REPLACE, e.g. that you cannot replace a trigger with a trigger of a different
> kind, or replace a trigger that has pending events?
> 
> What do you think?
That's a great idea. I've applied it in the latest patch, v10.
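
For reference, a minimal sketch of the syntax being added (table 'accounts'
and function 'audit_fn' are hypothetical):

CREATE OR REPLACE TRIGGER audit_trg
    BEFORE UPDATE ON accounts
    FOR EACH ROW
    EXECUTE FUNCTION audit_fn();

-- Running the statement again replaces the existing definition instead of
-- failing with "trigger ... already exists".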

Regards,
Takamichi Osumi


CREATE_OR_REPLACE_TRIGGER_v10.patch
Description: CREATE_OR_REPLACE_TRIGGER_v10.patch


Re: Fix for parallel BTree initialization bug

2020-09-09 Thread Amit Kapila
On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
 wrote:
>
> Hi, I ran across a small (but annoying) bug in initializing parallel BTree 
> scans, which causes the parallel-scan state machine to get confused.
>
>
> To reproduce, you need a query that:
>
>
>
> 1. Executes parallel BTree index scan;
>
> 2. Has an IN-list of size > 1;
>
> 3. Has an additional index filter that makes it impossible to satisfy the
>    first IN-list condition.
>
> (We encountered such a query, and therefore the bug, on a production
> instance.)
>
>

I think I can understand what you are pointing out here, but it would
be great if you could provide a reproducible test case, because that
would make the problem apparent, and we might want to include it in the
regression tests if possible.
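
For illustration only, a query matching the three conditions above might
have this shape; table 't' with an index on (a, b) is hypothetical, since
the actual production query was not shared:

SET max_parallel_workers_per_gather = 2;
SELECT count(*) FROM t
WHERE a IN (1, 2)   -- IN-list of size > 1
  AND b = 0;        -- suppose no row with a = 1 can satisfy b = 0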

-- 
With Regards,
Amit Kapila.




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 15:22, Michael Paquier wrote:

On Tue, Sep 08, 2020 at 07:17:58PM -0500, Justin Pryzby wrote:

Initially I added List *params, and Michael suggested to retire
ReindexStmt->concurrent.  I provided a patch to do so, initially by
leaving int options and then, after this, removing it to "complete the
thought", and get rid of the remnants of the "old way" of doing it.
This is also how vacuum and explain are done.
https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com


Defining a set of DefElem when parsing and then using the int
"options" with bitmasks where necessary at the beginning of the
execution looks like a good balance to me.  This way, you can extend
the grammar to use things like (verbose = true), etc.

By the way, skimming through the patch set, I was wondering if we
could do the refactoring of patch 0005 as a first step



Yes, I did it with the intention of putting it as the first patch, but
wanted to get some feedback first. It's easier to refactor the last
patch without rebasing the others.




until I
noticed this part:
+common_option_name:
NonReservedWord { $$ = $1; }
| analyze_keyword { $$ = "analyze"; }
This is not a good idea as you make ANALYZE an option available for
all the commands involved in the refactoring.  A portion of that could
be considered though, like the use of common_option_arg.



From the grammar perspective, ANY option is available for any command
that uses a parenthesized option list. All the checks and validations
are performed in the corresponding command code.
This analyze_keyword rule only normalizes the ANALYZE word when it is
used as an option. Why would it be harmful?
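
For context, the parenthesized option lists being refactored look like the
following; the TABLESPACE option in the last statement is what this patch
set proposes and is not in core at the time of this thread ('mytable' and
'new_tblspc' are placeholders):

VACUUM (VERBOSE, ANALYZE) mytable;
REINDEX (VERBOSE) TABLE mytable;
REINDEX (TABLESPACE new_tblspc) TABLE mytable;  -- proposed by this patch set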



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Online checksums verification in the backend

2020-09-09 Thread Julien Rouhaud
On Wed, Sep 9, 2020 at 2:37 PM Michael Paquier  wrote:
>
> Another thing that was itching me is the introduction of 3 GUCs with
> one new category for the sake of two SQL functions.  For VACUUM we
> have many things relying on the GUC delays, with autovacuum and manual
> vacuum.  Perhaps it would make sense to have these if we have some day
> a background worker doing checksum verifications, still that could
> perfectly be in contrib/, and that would be kind of hard to tune as
> well.  The patch enabling checksums on-the-fly could also be a reason
> good enough.  Another thing we could consider is to pass down those
> parameters as function arguments, at the cost of not being able to
> reload them.

I'm not terribly happy with adding that for now, but it's quite clear
that there'll eventually be a lot of new stuff added that will benefit
from either the category or the GUCs.  FTR, once we reach an agreement
on how to do this check (I'm wondering whether it'll stay an SQL function
or become a plain backend command, in which case GUCs would be
mandatory), I'll also be happy to work on a background worker to help
people run the check regularly.  So in my opinion it's better to add
them now, so that we won't have to change the SQL function definition
later when other facilities come to rely on them.




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-09 Thread Amit Langote
On Wed, Sep 9, 2020 at 6:42 PM Alexey Kondratov
 wrote:
> On 2020-09-09 11:45, Andrey V. Lepikhov wrote:
> > This does not seem very convenient and will lead to errors in the
> > future. So, I agree with Amit.
>
> And InitResultRelInfo() may set ri_usesMultiInsert to false by default,
> since it's used only by COPY now. Then you won't need this in several
> places:
>
> +   resultRelInfo->ri_usesMultiInsert = false;
>
> While the logic of turning multi-insert on with all the validations
> required could be factored out of InitResultRelInfo() to a separate
> routine.

Interesting idea.  Maybe better to have a separate routine like Alexey says.
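
For context, a sketch of the scenario this patch optimizes (server 'fsrv1',
the remote table name, and the data file are hypothetical):

-- COPY into a partitioned table with a foreign-table partition; the patch
-- lets rows be buffered and sent in batches (multi-insert) instead of one
-- remote INSERT per row.
CREATE TABLE t (i int, v text) PARTITION BY RANGE (i);
CREATE FOREIGN TABLE t_p1 PARTITION OF t
    FOR VALUES FROM (0) TO (1000)
    SERVER fsrv1 OPTIONS (table_name 't_p1_remote');
COPY t FROM '/tmp/data.csv' WITH (FORMAT csv);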


--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: Online checksums verification in the backend

2020-09-09 Thread Michael Paquier
On Wed, Sep 09, 2020 at 11:25:24AM +0200, Julien Rouhaud wrote:
> I assumed that the odds of having to check the buffer twice were so low, and
> avoiding to keep a bufmapping lock while doing some IO was an uncontroversial
> enough optimisation, but maybe that's only wishful thinking.

Perhaps it is worth it, so it would be good to verify that and see
whether it's actually worth the extra complication.  This also depends
on whether the page is in the OS cache when it is not in shared buffers,
meaning that smgrread() is needed to fetch the page to check.  I would
be more curious to see whether there is an actual difference when the
page is in the OS cache.

> I'll do some benchmarking and see if I can get some figures, but it'll
> probably take some time.  In the meantime I'm attaching v11 of the patch
> that should address all other comments.

Thanks.

Another thing that was itching me is the introduction of 3 GUCs with
one new category for the sake of two SQL functions.  For VACUUM we
have many things relying on the GUC delays, with autovacuum and manual
vacuum.  Perhaps it would make sense to have these if we have some day
a background worker doing checksum verifications, still that could
perfectly be in contrib/, and that would be kind of hard to tune as
well.  The patch enabling checksums on-the-fly could also be a reason
good enough.  Another thing we could consider is to pass down those
parameters as function arguments, at the cost of not being able to
reload them.
--
Michael


signature.asc
Description: PGP signature


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-09 Thread Michael Paquier
On Tue, Sep 08, 2020 at 07:17:58PM -0500, Justin Pryzby wrote:
> Initially I added List *params, and Michael suggested to retire
> ReindexStmt->concurrent.  I provided a patch to do so, initially by leaving
> int options and then, after this, removing it to "complete the thought", and get
> rid of the remnants of the "old way" of doing it.  This is also how vacuum and
> explain are done.
> https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com

Defining a set of DefElem when parsing and then using the int
"options" with bitmasks where necessary at the beginning of the
execution looks like a good balance to me.  This way, you can extend
the grammar to use things like (verbose = true), etc.

By the way, skimming through the patch set, I was wondering if we
could do the refactoring of patch 0005 as a first step, until I
noticed this part:
+common_option_name:
NonReservedWord { $$ = $1; }
| analyze_keyword { $$ = "analyze"; }
This is not a good idea as you make ANALYZE an option available for
all the commands involved in the refactoring.  A portion of that could
be considered though, like the use of common_option_arg.
--
Michael


signature.asc
Description: PGP signature


Re: Yet another fast GiST build

2020-09-09 Thread Komяpa
On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:

> On 09/09/2020 13:28, Andrey M. Borodin wrote:
> > Thanks Darafei!
> >
> >> On 9 Sep 2020, at 12:05, Darafei Komяpa Praliaskouski wrote:
> >>
> >>> How does the 'sortsupport' routine interact with
> >>> 'compress'/'decompress'? Which representation is passed to the
> >>> comparator routine: the original value from the table, the
> >>> compressed representation, or the decompressed representation? Do
> >>> the comparetup_index_btree() and readtup_index() routines agree
> >>> with that?
> >>
> 
>
> Come to think of it, the point z-order comparator could benefit a lot
> from key abbreviation, too. You could do the point -> zorder conversion
> in the abbreviation routine.
>

That's how it works in PostGIS, except that we moved to a more
efficient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Peter Eisentraut

On 2020-09-09 07:15, Etsuro Fujita wrote:

Attached is a patch to standardize Japanese names as given-name-first
in the v13 contributors list as before.


Given existing practice, this patch looks okay.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Peter Eisentraut

On 2020-09-09 07:40, Fujii Masao wrote:

Attached is a patch to standardize Japanese names as given-name-first
in the v13 contributors list as before.


Is using given-name-first order our consensus? I was thinking we had not
reached one yet, and that our "vague" consensus was to use the name each
contributor prefers, for example the name used in their email signature, etc.

BTW, if possible I'd like to use family-name-first, i.e., "Fujii Masao".


See commit 53c89aed7b38ab412fddc1d6118822ce5d962acd for when this was 
changed.


At least it's the current practice.  It can be debated whether it's a 
good practice.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Yet another fast GiST build

2020-09-09 Thread Heikki Linnakangas

On 09/09/2020 13:28, Andrey M. Borodin wrote:

Thanks Darafei!


On 9 Sep 2020, at 12:05, Darafei Komяpa Praliaskouski wrote:


How does the 'sortsupport' routine interact with
'compress'/'decompress'? Which representation is passed to the
comparator routine: the original value from the table, the
compressed representation, or the decompressed representation? Do
the comparetup_index_btree() and readtup_index() routines agree
with that?


Currently we pass compressed values, which seems not very good. But
there was a request from PostGIS maintainers to pass values before
decompression. Darafei, please correct me if I'm wrong. Also, can you
please provide a link to the PostGIS B-tree sorting functions?

We were expecting to reuse btree opclass for this thing. This way
btree_gist extension will become a lot thinner. :)


I think if we aim at reusing B-tree sort support functions we have to
pass uncompressed values. They can be a lot bigger and slower in case
of PostGIS. We will be sorting actual geometries instead of MBRs.

In my view it's better to implement GiST-specific sort support in
btree_gist, rather than trying to reuse existing sort supports.


Yeah, I don't think reusing existing sortsupport functions directly is
important. The comparison function should be short anyway for
performance reasons, so it won't be a lot of code to copy-paste. And if
there are common subroutines, you can put them in separate internal
functions for reuse.


Using the 'compressed' format seems reasonable to me. It's natural to
the gistbuild.c code, and the comparison routine can 'decompress' itself
if it wishes. If the decompression is somewhat expensive, it's
unfortunate if you need to do it repeatedly in the comparator, but
tuplesort.c would need pretty big changes to keep around a separate
in-memory representation for comparison. However, you could use the
sort "abbreviation" functionality to mitigate that.


Come to think of it, the point z-order comparator could benefit a lot 
from key abbreviation, too. You could do the point -> zorder conversion 
in the abbreviation routine.


- Heikki




Re: Global snapshots

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 08:35, Masahiko Sawada wrote:

On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
 wrote:


On 2020-09-08 14:48, Fujii Masao wrote:
>
> IIUC, yes, the information required for automatic resolution is
> WAL-logged and the standby tries to resolve those orphan transactions
> from WAL after the failover. But Sawada-san's patch provides
> the special function for manual resolution, so there may be some cases
> where manual resolution is necessary.
>

I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear to me.


Sorry, the above description in README is out of date. In the v25
patch, it's true that if a backend fails to prepare a transaction on a
foreign server, it’s possible that some foreign transactions are
prepared whereas others are not. But at the end of the transaction
after changing to rollback, the process does rollback (or rollback
prepared) all of them. So the use case of pg_resolve_foreign_xact() is
to resolve orphaned foreign prepared transactions or to resolve a
foreign transaction that is not resolved for some reason, such as bugs.



OK, thank you for the explanation!



Once the transaction is committed locally any ERROR (or higher level
message) will be escalated to PANIC.


I think this is true only inside the critical section and it's not
necessarily true for all errors happening after the local commit,
right?



It's not actually related to critical-section error escalation. Any
error in the backend after the local commit and
ProcArrayEndTransaction() will try to abort the current transaction and
do RecordTransactionAbort(), but it's too late to do so, and a PANIC
will be raised:


    /*
     * Check that we haven't aborted halfway through
     * RecordTransactionCommit.
     */
    if (TransactionIdDidCommit(xid))
        elog(PANIC, "cannot abort transaction %u, it was already committed",
             xid);

At least that's how I understand it.



And I do see possible ERROR level
messages in the postgresCommitForeignTransaction() for example:

+   if (PQresultStatus(res) != PGRES_COMMAND_OK)
+       ereport(ERROR, (errmsg("could not commit transaction on server %s",
+                              frstate->server->servername)));


I don't think it's very convenient to get a PANIC every time we fail to
commit one of the prepared foreign xacts, since that may not be so rare
in a distributed system. That's why I tried to get rid of possible
ERRORs as far as possible in my proposed patch.



In my patch, the second phase of 2PC is executed only by the resolver
process. Therefore, even if an error would happen during committing a
foreign prepared transaction, we just need to relaunch the resolver
process and try again. During that time, the backend process will be
just waiting. If a backend process raises an error after the local
commit, the client will see transaction failure despite the local
transaction having been committed. An error could happen even in
palloc. So the patch uses a background worker to commit prepared
foreign transactions, rather than the backend itself.



Yes, if it's a background process, then it seems to be safe.

BTW, it seems that I've chosen the wrong thread for posting my patch and
starting a discussion :) Activity from this thread moved to [1], and your
solution with a built-in resolver is discussed in [2]. I'll try to take
a closer look at v25 and write to [2] instead.



[1] 
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca


[2] 
https://www.postgresql.org/message-id/CAExHW5uBy9QwjdSO4j82WC4aeW-Q4n2ouoZ1z70o%3D8Vb0skqYQ%40mail.gmail.com


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: file_fdw vs relative paths

2020-09-09 Thread Magnus Hagander
On Wed, Sep 9, 2020 at 3:39 AM Ian Barwick 
wrote:

> Hi
>
> On 2020/09/07 2:31, Magnus Hagander wrote:
> > On Mon, Aug 31, 2020 at 5:03 PM Bruce Momjian wrote:
> >
> > On Mon, Aug 31, 2020 at 01:16:05PM +0200, Magnus Hagander wrote:
> >  > Bruce, I've applied and backpatched your docs patch for this.
> >  >
> >  > Gah, and of course right after doing that, I remembered I wanted to
> >  > get a second change in :) To solve the "who's this Josh" question, I
> >  > suggest we also change the example to point to the data/log directory
> >  > which is likely to exist in a lot more of the cases. I keep getting
> >  > people who ask "who is josh" based on the /home/josh path. Not that
> >  > it's that important, but...
> >
> > Thanks, and agreed.
> >
> >
> > Thanks, applied. I backpatched to 13 but didn't bother with the rest
> > as it's not technically *wrong* before...
>
> It's missing the leading single quote from the filename parameter:
>
>  diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml
>  (...)
>  -OPTIONS ( filename '/home/josh/data/log/pglog.csv', format 'csv' );
>  +OPTIONS ( filename log/pglog.csv', format 'csv' );
>  (...)
>

GAH.

Thanks!
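
For reference, with the quote restored the corrected documentation example
would read as follows; the column list is abbreviated here, and the
file_fdw server 'pglog' from the docs example is assumed to exist:

CREATE FOREIGN TABLE pglog (
  log_time timestamp(3) with time zone,
  user_name text
  -- ... remaining csvlog columns ...
) SERVER pglog
OPTIONS ( filename 'log/pglog.csv', format 'csv' );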


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Yet another fast GiST build

2020-09-09 Thread Andrey M. Borodin
Thanks Darafei!

> On 9 Sep 2020, at 12:05, Darafei Komяpa Praliaskouski wrote:
> 
> > How does the 'sortsupport' routine interact with 'compress'/'decompress'? 
> > Which representation is passed to the comparator routine: the original 
> > value from the table, the compressed representation, or the decompressed 
> > representation? Do the comparetup_index_btree() and readtup_index() 
> > routines agree with that?
> 
> Currently we pass compressed values, which seems not very good.
> But there was a request from PostGIS maintainers to pass values before 
> decompression.
> Darafei, please correct me if I'm wrong. Also, can you please provide a
> link to the PostGIS B-tree sorting functions?
> 
> We were expecting to reuse btree opclass for this thing. This way btree_gist 
> extension will become a lot thinner. :)

I think if we aim at reusing B-tree sort support functions we have to pass 
uncompressed values. They can be a lot bigger and slower in case of PostGIS. We 
will be sorting actual geometries instead of MBRs.

In my view it's better to implement GiST-specific sort support in btree_gist, 
rather than trying to reuse existing sort supports.

Best regards, Andrey Borodin.



Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Amit Kapila
On Wed, Sep 9, 2020 at 3:17 PM Magnus Hagander  wrote:
>
> On Wed, Sep 9, 2020 at 9:11 AM Amit Kapila  wrote:
>>
>> On Wed, Sep 9, 2020 at 10:54 AM Fujii Masao  
>> wrote:
>> >
>> > On 2020/09/09 12:04, Amit Kapila wrote:
>> > >
>> > > No, before patch as well, if we can't read the DB entry say because
>> > > the file is corrupt, we return true and use timestamp of global stats
>> > > file and this is what is established by the original commit 187492b6.
>> > > So, if we consider that as correct then the functionality is something
>> > > like once we have read the timestamp of the global statfile, we use
>> > > that if we can't read the actual db entry for whatever reason. It
>> > > seems if we return false, the caller will call this function again in
>> > > the loop. Now, I see the point that if we can't read some part of the
>> > > file we should return false instead but not sure if that is helpful
>> > > from the perspective of the caller of
>> > > pgstat_read_db_statsfile_timestamp.
>> >
>> > When false is returned, the caller backend_read_statsfile() seems to
>> > request the stats collector to write a fresh stats data into the file,
>> > and then pgstat_read_db_statsfile_timestamp() can try to read the fresh
>> > file that may not be corrupted. So returning false in that case seems ok
>> > to me...
>> >
>>
>> Hmm, the request to get fresh data is sent irrespective of true or
>> false. We send it to get the latest data if it is not present and it
>
>
> No we don't. Just before we request it, there is:
> /* Normal acceptance case: file is not older than cutoff time */
> if (ok && file_ts >= min_ts)
> break;
>
> So it only requests a new file in the case that it returned false.
>

What if the second part of the above 'if' statement is false? Then, even
when pgstat_read_db_statsfile_timestamp() has returned true, we can send
a request. IIUC, the basic idea here is that if the stats in the file
were updated before the cutoff time, we send the request and wait for
updated stats.

-- 
With Regards,
Amit Kapila.




Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Amit Kapila
On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander  wrote:
>
> On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila  wrote:
>>
>
> Though in fact the one inconsistent place in the code now is that if the
> file is corrupt in the db entry part, it returns true and the global
> timestamp, which I would argue is perhaps incorrect, and it should return
> false.
>

Yeah, this is exactly the case I was pointing out where we return true
before the patch, basically the code below:
case 'D':
    if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
              fpin) != offsetof(PgStat_StatDBEntry, tables))
    {
        ereport(pgStatRunningInCollector ? LOG : WARNING,
                (errmsg("corrupted statistics file \"%s\"",
                        statfile)));
        goto done;
    }

done:
    FreeFile(fpin);
    return true;

Now, if we decide to return 'false' here, then surely there is no
argument and we should return false in other cases as well. Basically,
I think we should be consistent in handling the corrupt file case.

-- 
With Regards,
Amit Kapila.



