date:20201103

On Wed, Nov 4, 2020 at 2:57 PM Tomas Vondra
 wrote:
> On Wed, Nov 04, 2020 at 02:49:24PM +1300, Thomas Munro wrote:
> >On Wed, Nov 4, 2020 at 2:32 PM Tomas Vondra
> > wrote:
> >> After a while (~1h on my machine) the pg_multixact gets over 10GB, which
> >> triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
> >> My guess is that this discards some of the files, but checkpointer is
> >> not aware of that, or something like that. Not sure.
> >
> >Urgh.  Thanks.  Looks like perhaps the problem is that I have
> >RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true) in one codepath
> >that unlinks files, but not another.  Looking.
>
> Maybe. I didn't have time to investigate this more deeply, and it takes
> quite a bit of time to reproduce. I can try again with extra logging or
> test some proposed fixes, if you give me a patch.

I think this should be fixed by doing all unlinking through a common
code path.  Does this pass your test?
From 03f51c9e800a9fdff830c639344988f2b71ae746 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 4 Nov 2020 17:14:04 +1300
Subject: [PATCH] Fix unlinking of SLRU segments.

Commit dee663f7 intended to drop any queued up fsync requests before
unlinking segment files, but missed a code path.  Fix, by centralizing
the forget-and-unlink code into a single function.

Reported-by: Tomas Vondra 
Discussion: https://postgr.es/m/20201104013205.icogbi773przyny5%40development
---
 src/backend/access/transam/slru.c | 44 +--
 1 file changed, 18 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 16a7898697..0aa33cc325 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -145,7 +145,7 @@ static int	SlruSelectLRUPage(SlruCtl ctl, int pageno);
 
 static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 	  int segpage, void *data);
-static void SlruInternalDeleteSegment(SlruCtl ctl, char *filename);
+static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
 
 /*
  * Initialization of shared memory
@@ -1298,19 +1298,28 @@ restart:;
 }
 
 /*
- * Delete an individual SLRU segment, identified by the filename.
+ * Delete an individual SLRU segment.
  *
  * NB: This does not touch the SLRU buffers themselves, callers have to ensure
  * they either can't yet contain anything, or have already been cleaned out.
  */
 static void
-SlruInternalDeleteSegment(SlruCtl ctl, char *filename)
+SlruInternalDeleteSegment(SlruCtl ctl, int segno)
 {
 	char		path[MAXPGPATH];
 
-	snprintf(path, MAXPGPATH, "%s/%s", ctl->Dir, filename);
-	ereport(DEBUG2,
-			(errmsg("removing file \"%s\"", path)));
+	/* Forget any fsync requests queued for this segment. */
+	if (ctl->sync_handler != SYNC_HANDLER_NONE)
+	{
+		FileTag		tag;
+
+		INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+		RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+	}
+
+	/* Unlink the file. */
+	snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
+	ereport(DEBUG2, (errmsg("removing file \"%s\"", path)));
 	unlink(path);
 }
 
@@ -1322,7 +1331,6 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
 {
 	SlruShared	shared = ctl->shared;
 	int			slotno;
-	char		path[MAXPGPATH];
 	bool		did_write;
 
 	/* Clean out any possibly existing references to the segment. */
@@ -1364,23 +1372,7 @@ restart:
 	if (did_write)
 		goto restart;
 
-	snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
-	ereport(DEBUG2,
-			(errmsg("removing file \"%s\"", path)));
-
-	/*
-	 * Tell the checkpointer to forget any sync requests, before we unlink the
-	 * file.
-	 */
-	if (ctl->sync_handler != SYNC_HANDLER_NONE)
-	{
-		FileTag		tag;
-
-		INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
-		RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
-	}
-
-	unlink(path);
+	SlruInternalDeleteSegment(ctl, segno);
 
 	LWLockRelease(shared->ControlLock);
 }
@@ -1413,7 +1405,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 	int			cutoffPage = *(int *) data;
 
 	if (ctl->PagePrecedes(segpage, cutoffPage))
-		SlruInternalDeleteSegment(ctl, filename);
+		SlruInternalDeleteSegment(ctl, segpage / SLRU_PAGES_PER_SEGMENT);
 
 	return false;/* keep going */
 }
@@ -1425,7 +1417,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 bool
 SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage, void *data)
 {
-	SlruInternalDeleteSegment(ctl, filename);
+	SlruInternalDeleteSegment(ctl, segpage / SLRU_PAGES_PER_SEGMENT);
 
 	return false;/* keep going */
 }
-- 
2.20.1

Re: "unix_socket_directories" should be GUC_LIST_INPUT?

On Tue, Oct 27, 2020 at 12:23:22PM -0400, Tom Lane wrote:
> Robert Haas  writes:
>> If we change this, is it going to be a compatibility break for the
>> contents of postgresql.conf files?
> 
> I think not, at least not for the sorts of values you'd ordinarily
> find in that variable, say '/tmp, /var/run/postgresql'.  Possibly
> the behavior would change for pathnames containing spaces or the
> like, but it is probably kinda broken for such cases today anyway.
> 
> In any case, Michael had promised to test this aspect before committing.

Paths with spaces or commas would be fine as long as we associate
GUC_LIST_QUOTE with GUC_LIST_INPUT so as commas within quoted entries
are handled consistently.  postmaster.c uses SplitDirectoriesString()
with a comma do decide how to split things.  This discards leading and
trailing whitespaces, requires a double-quote to have a matching
double-quote where trailing/leading whitespaces are allowed, and
nothing to escape quotes.  Those patterns fail:
"/tmp/repo1,"/tmp/repo2
"/tmp/repo1,/tmp/repo2
These are split as a single entry:
"/tmp/repo1,/tmp/repo2"
"/tmp/ repo1,/tmp/ repo2"
"/tmp/ repo1 , /tmp/ repo2 "
These are split as two entries:
"/tmp/repo1,",/tmp/repo2
/tmp /repo1 , /tmp/ repo2
"/tmp/""sock1", "/tmp/, sock2" (here first path has one double quote)

If we use GUC_LIST_INPUT | GUC_LIST_QUOTE, paths are handled the same
way as the original, but we would run into problems if not using
GUC_LIST_QUOTE as mentioned upthread.

Anyway, we have a compatibility problem once we use ALTER SYSTEM.
Just take the following command: 
alter system set unix_socket_directories = '/tmp/sock1, /tmp/sock2';

On HEAD, this would be treated and parsed as two items.  However, with
the patch, this becomes one item as this is considered as one single
element of the list of paths, as that's written to
postgresql.auto.conf as '"/tmp/sock1, /tmp/sock2"'.

This last argument would be IMO a reason enough to not do the switch.
Even if I have never seen cases where ALTER SYSTEM was used with
unix_socket_directories, we cannot say either that nobody relies on
the existing behavior (perhaps some failover solutions care?).  So at
least we should add a comment as proposed in
https://postgr.es/m/122596.1603427...@sss.pgh.pa.us.

Thoughts?
--
Michael

signature.asc
Description: PGP signature

RE: extension patch of CREATE OR REPLACE TRIGGER

2020-11-03 Thread osumi.takami...@fujitsu.com

Hi


Peter-San, thanks for your support.
On Monday, November 2, 2020 2:39 PM Peter Smith wrote:
> ===
> 
> (1) COMMENT
> File: NA
> Maybe next time consider using format-patch to make the patch. Then you
> can include a comment to give the background/motivation for this change.
OK. How about v15 ?

> ===
> 
> (2) COMMENT
> File: doc/src/sgml/ref/create_trigger.sgml
> @@ -446,6 +454,13 @@ UPDATE OF
> column_name1
> [, column_name2 Currently it says:
> When replacing an existing trigger with CREATE OR REPLACE TRIGGER,
> there are restrictions. You cannot replace constraint triggers. That means it 
> is
> impossible to replace a regular trigger with a constraint trigger and to 
> replace
> a constraint trigger with another constraint trigger.
> 
> --
> 
> Is that correct wording? I don't think so. Saying "to replace a regular 
> trigger
> with a constraint trigger" is NOT the same as "replace a constraint trigger".
I corrected my wording in create_trigger.sgml, which should cause less confusion
than v14. The reason why I changed the documents is described below.

> Maybe I am mistaken but I think the help and the code are no longer in sync
> anymore. e.g. In previous versions of this patch you used to verify
> replacement trigger kinds (regular/constraint) match. AFAIK you are not
> doing that in the current code (but you should be). So although you say
> "impossible to replace a regular trigger with a constraint trigger" I don't 
> see
> any code to check/enforce that ( ?? )
> IMO when you simplified this v14 patch you may have removed some extra
> trigger kind conditions that should not have been removed.
> 
> Also, the test code should have detected this problem, but I think the tests
> have also been broken in v14. See later COMMENT (9).
Don't worry and those are not broken.

I changed some codes in gram.y to throw a syntax error when
OR REPLACE clause is used with CREATE CONSTRAINT TRIGGER.

In the previous discussion with Tsunakawa-San in this thread,
I judged that OR REPLACE clause is
not necessary for *CONSTRAINT* TRIGGER to achieve the purpose of this patch.
It is to support the database migration from Oracle to Postgres
by supporting the same syntax for trigger replacement. Here,
because the constraint trigger is unique to the latter,
I prohibited the usage of CREATE CONSTRAINT TRIGGER and OR REPLACE
clauses at the same time at the grammatical level.
Did you agree with this way of modification ?

To prohibit the combination OR REPLACE and CONSTRAINT clauses may seem
a little bit radical but I refer to an example of the combination to use
CREATE CONSTRAINT TRIGGER and AFTER clause.
When the timing of trigger is not AFTER for CONSTRAINT TRIGGER,
an syntax error is caused. I learnt and followed the way for
my modification from it.

> ===
> 
> (3) COMMENT
> File: src/backend/commands/trigger.c
> @@ -185,6 +185,10 @@ CreateTrigger(CreateTrigStmt *stmt, const char
> *queryString,
> + bool internal_trigger = false;
> 
> --
> 
> There is potential for confusion of "isInternal" versus "internal_trigger". 
> The
> meaning is not apparent from the names, but IIUC isInternal seems to be
> when creating an internal trigger, whereas internal_trigger seems to be when
> you found an existing trigger that was previously created as isInternal.
> 
> Maybe something like "existing_isInternal" would be a better name instead of
> "internal_trigger".
Definitely sounds better. I fixed the previous confusing name.

> 
> ===
> 
> (4) COMMENT
> File: src/backend/commands/trigger.c
> @@ -185,6 +185,10 @@ CreateTrigger(CreateTrigStmt *stmt, const char
> *queryString,
> + bool is_update = false;
> 
> Consider if "was_replaced" might be a better name than "is_update".
> 
> ===
Also, this must be good. Done.

> 
> (5) COMMENT
> File: src/backend/commands/trigger.c
> @@ -669,6 +673,81 @@ CreateTrigger(CreateTrigStmt *stmt, const char
> *queryString,
> + if (!stmt->replace)
> + ereport(ERROR,
> + (errcode(ERRCODE_DUPLICATE_OBJECT),
> + errmsg("trigger \"%s\" for relation \"%s\" already exists",
> + stmt->trigname, RelationGetRelationName(rel;
> + else
> + {
> + /*
> + * An internal trigger cannot be replaced by another user defined
> + * trigger. This should exclude the case that internal trigger is
> + * child trigger of partition table and needs to be rewritten when
> + * the parent trigger is replaced by user.
> + */
> + if (internal_trigger && isInternal == false && in_partition == false)
> + ereport(ERROR, (errcode(ERRCODE_DUPLICATE_OBJECT),
> + errmsg("trigger \"%s\" for relation \"%s\" is an internal trigger",
> + stmt->trigname, RelationGetRelationName(rel;
> +
> + /*
> + * CREATE OR REPLACE TRIGGER command can't replace constraint
> + * trigger.
> + */
> + if (OidIsValid(existing_constraint_oid))
> + ereport(ERROR,
> + (errcode(ERRCODE_DUPLICATE_OBJECT),
> + errmsg("trigger \"%s\" for relation \"%s\" is a constraint trigger",
> + stmt->trigname, RelationGetRelationName(rel;
> + }
> 
> It is not really necessary

Re: Parallel INSERT (INTO ... SELECT ...)

2020-11-03 Thread Amit Kapila

On Wed, Nov 4, 2020 at 6:11 AM Greg Nancarrow  wrote:
>
> On Tue, Nov 3, 2020 at 5:25 PM vignesh C  wrote:
> >
> > -> commandType is not used, we can remove it.
> > + * Prepare for entering parallel mode by assigning a FullTransactionId, to 
> > be
> > + * included in the transaction state that is serialized in the parallel 
> > DSM.
> > + */
> > +void PrepareParallelModeForModify(CmdType commandType)
> > +{
> > +   Assert(!IsInParallelMode());
> > +
> > +   (void)GetCurrentTransactionId();
> > +}
>
> Thanks, at least for INSERT, it's not needed, so I'll remove it.
>

Or you might want to consider moving the check related to
IsModifySupportedInParallelMode() inside
PrepareParallelModeForModify(). That way the code might look a bit
cleaner.


-- 
With Regards,
Amit Kapila.

Re: list of extended statistics on psql

2020-11-03 Thread Tatsuro Yamada


Hi,


I addressed it, so I keep the size of extended stats with the unit.

Changes:

   - Use pg_size_pretty to show the size of extended stats by \dX+



I rebased the patch on the head and also added tab-completion.
Any feedback is welcome.


Preparing for tests:
===
create table t1 (a int, b int);
create statistics stts_1 (dependencies) on a, b from t1;
create statistics stts_2 (dependencies, ndistinct) on a, b from t1;
create statistics stts_3 (dependencies, ndistinct, mcv) on a, b from t1;

create table t2 (a int, b int, c int);
create statistics stts_4 on b, c from t2;

create table hoge (col1 int, col2 int, col3 int);
create statistics stts_hoge on col1, col2, col3 from hoge;

create schema foo;
create schema yama;
create statistics foo.stts_foo on col1, col2 from hoge;
create statistics yama.stts_yama (ndistinct, mcv) on col1, col3 from hoge;

insert into t1 select i,i from generate_series(1,100) i;
analyze t1;

Result of \dX:
==
postgres=# \dX
  List of extended statistics
 Schema |   Name| Definition | N_distinct | Dependencies |  
 Mcv
+---+++--+-
 foo| stts_foo  | col1, col2 FROM hoge   | defined| defined  | 
defined
 public | stts_1| a, b FROM t1   || built|
 public | stts_2| a, b FROM t1   | built  | built|
 public | stts_3| a, b FROM t1   | built  | built| 
built
 public | stts_4| b, c FROM t2   | defined| defined  | 
defined
 public | stts_hoge | col1, col2, col3 FROM hoge | defined| defined  | 
defined
 yama   | stts_yama | col1, col3 FROM hoge   | defined|  | 
defined
(7 rows)

Result of \dX+:
===
postgres=# \dX+
   List of extended statistics
 Schema |   Name| Definition | N_distinct | Dependencies |  
 Mcv   |  N_size  |  D_size  |   M_size
+---+++--+-+--+--+
 foo| stts_foo  | col1, col2 FROM hoge   | defined| defined  | 
defined | 0 bytes  | 0 bytes  | 0 bytes
 public | stts_1| a, b FROM t1   || built|  
   |  | 40 bytes |
 public | stts_2| a, b FROM t1   | built  | built|  
   | 13 bytes | 40 bytes |
 public | stts_3| a, b FROM t1   | built  | built| 
built   | 13 bytes | 40 bytes | 6126 bytes
 public | stts_4| b, c FROM t2   | defined| defined  | 
defined | 0 bytes  | 0 bytes  | 0 bytes
 public | stts_hoge | col1, col2, col3 FROM hoge | defined| defined  | 
defined | 0 bytes  | 0 bytes  | 0 bytes
 yama   | stts_yama | col1, col3 FROM hoge   | defined|  | 
defined | 0 bytes  |  | 0 bytes
(7 rows)

Results of Tab-completion:
===
postgres=# \dX 
foo. pg_toast.stts_2   stts_hoge
information_schema.  public.  stts_3   yama.
pg_catalog.  stts_1   stts_4

postgres=# \dX+ 
foo. pg_toast.stts_2   stts_hoge
information_schema.  public.  stts_3   yama.
pg_catalog.  stts_1   stts_4


Regards,
Tatsuro Yamada
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 221a967bfe..fd860776af 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1918,6 +1918,20 @@ testdb=>
 
 
   
+  
+  
+\dX [ pattern ]
+
+
+Lists extended statistics.
+If pattern
+is specified, only those extended statistics whose names match the 
pattern
+are listed.
+If + is appended to the command name, each extended 
statistics
+is listed with its size.
+
+
+  
 
   
 \dy[+] [ pattern ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index c7a83d5dfc..c6f1653cb7 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -930,6 +930,9 @@ exec_command_d(PsqlScanState scan_state, bool 
active_branch, const char *cmd)
else
success = listExtensions(pattern);
break;
+   case 'X':   /* Extended Statistics 
*/
+   success = listExtendedStats(pattern, 
show_verbose);
+   break;
case 'y':   /* Event Triggers */
success = listEventTriggers(pattern, 
show_verbose);
break;
diff --git a/src/bin

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-03 Thread k.jami...@fujitsu.com

Hi,

I've updated the patch 0004 (Truncate optimization) with the previous comments 
of
Tsunakawa-san already addressed in the patch. (Thank you very much for the 
review.) 
The change here compared to the previous version is that in 
DropRelFileNodesAllBuffers()
we don't check for the accurate flag anymore when deciding whether to optimize 
or not.
For relations with blocks that do not exceed the threshold for full scan, we 
call
DropRelFileNodeBuffers where the flag will be checked anyway. Otherwise, we 
proceed
to the traditional buffer scan. Thoughts?

I've done recovery performance for TRUNCATE.
Test case: 1 parent table with 100 child partitions. TRUNCATE each child 
partition (1 transaction per table).
Currently, it takes a while to recover when we have large shared_buffers 
setting, but with the patch applied
the recovery is almost constant (0.206 s below).

| s_b   | master | patched | 
|---||-| 
| 128MB | 0.105  | 0.105   | 
| 1GB   | 0.205  | 0.205   | 
| 20GB  | 2.008  | 0.206   | 
| 100GB | 9.315  | 0.206   |

Method of Testing (assuming streaming replication is configured):
1. Create 1 parent table and 100 child partitions
2. Insert data to each table. 
3. Pause WAL replay on standby. ( SELECT pg_wal_replay_pause(); )
4. TRUNCATE each child partitions on primary (1 transaction per table). Stop 
the primary
5. Resume the WAL replay and promote standby. ( SELECT pg_wal_replay_resume(); 
pg_ctl promote)
I have confirmed that the relations became empty on standby.

Your thoughts, feedback are very much appreciated.

Regards,
Kirk Jamison


v29-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v29-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v29-0004-TRUNCATE-optimization.patch
Description: v29-0004-TRUNCATE-optimization.patch

Re: ModifyTable overheads in generic plans

2020-11-03 Thread Amit Langote

On Tue, Nov 3, 2020 at 9:05 PM Heikki Linnakangas  wrote:
> On 03/11/2020 10:27, Amit Langote wrote:
> > Please check the attached if that looks better.
>
> Great, thanks! Yeah, I like that much better.
>
> This makes me a bit unhappy:
>
> >
> >   /* Also let FDWs init themselves for foreign-table result 
> > rels */
> >   if (resultRelInfo->ri_FdwRoutine != NULL)
> >   {
> >   if (resultRelInfo->ri_usesFdwDirectModify)
> >   {
> >   ForeignScanState *fscan = (ForeignScanState 
> > *) mtstate->mt_plans[i];
> >
> >   /*
> >* For the FDW's convenience, set the 
> > ForeignScanState node's
> >* ResultRelInfo to let the FDW know which 
> > result relation it
> >* is going to work with.
> >*/
> >   Assert(IsA(fscan, ForeignScanState));
> >   fscan->resultRelInfo = resultRelInfo;
> >   
> > resultRelInfo->ri_FdwRoutine->BeginDirectModify(fscan, eflags);
> >   }
> >   else if 
> > (resultRelInfo->ri_FdwRoutine->BeginForeignModify != NULL)
> >   {
> >   List   *fdw_private = (List *) 
> > list_nth(node->fdwPrivLists, i);
> >
> >   
> > resultRelInfo->ri_FdwRoutine->BeginForeignModify(mtstate,
> > 
> >resultRelInfo,
> > 
> >fdw_private,
> > 
> >i,
> > 
> >eflags);
> >   }
> >   }
>
> If you remember, I was unhappy with a similar assertion in the earlier
> patches [1]. I'm not sure what to do instead though. A few options:
>
> A) We could change FDW API so that BeginDirectModify takes the same
> arguments as BeginForeignModify(). That avoids the assumption that it's
> a ForeignScan node, because BeginForeignModify() doesn't take
> ForeignScanState as argument. That would be consistent, which is nice.
> But I think we'd somehow still need to pass the ResultRelInfo to the
> corresponding ForeignScan, and I'm not sure how.

Maybe ForeignScan doesn't need to contain any result relation info
then?  ForeignScan.operation != CMD_SELECT is enough to tell it to
call IterateDirectModify() as today.

> B) Look up the ResultRelInfo, and call BeginDirectModify(), on the first
> call to ForeignNext().
>
> C) Accept the Assertion. And add an elog() check in the planner for that
> with a proper error message.
>
> I'm leaning towards B), but maybe there's some better solution I didn't
> think of?   Perhaps changing the API would make sense in any case, it is a
> bit weird as it is. Backwards-incompatible API changes are not nice, but
> I don't think there are many FDWs out there that implement the
> DirectModify functions. And those functions are pretty tightly coupled
> with the executor and ModifyTable node details anyway, so I don't feel
> like we can, or need to, guarantee that they stay unchanged across major
> versions.

B is not too bad, but I tend to prefer doing A too.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: Collation versioning

On Wed, Nov 4, 2020 at 2:56 PM David Rowley  wrote:
> initdb works fine. I ran vcregress upgradecheck and it passes.
>
> With my default locale of English.New Zealand.1252 I get zero rows from:
>
> select * from pg_depend where coalesce(refobjversion,'') <> '';
>
> if I initdb with --lc-collate=en-NZ, it works and I see:
>
> postgres=# select * from pg_depend where coalesce(refobjversion,'') <> '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> deptype |  refobjversion
> -+---+--++--+-+-+-
> 2606 | 12512 |0 |   3456 |  100 |   0 | n
>  | 1538.14,1538.14
> (1 row)

Thanks for all the help and testing!  Pushed.  If we don't come up
with something better I'll need to figure out how to explain this in
the manual.  (Will no one rid us of these meddlesome old format locale
names?  It seems like pg_locale.c could drop a lot of rather
unpleasant code if initdb, CREATE COLLATION, and CREATE DATABASE
didn't allow them into the catalogue in the first place...)

Re: PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843


On Wed, Nov 04, 2020 at 02:49:24PM +1300, Thomas Munro wrote:

On Wed, Nov 4, 2020 at 2:32 PM Tomas Vondra
 wrote:

After a while (~1h on my machine) the pg_multixact gets over 10GB, which
triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
My guess is that this discards some of the files, but checkpointer is
not aware of that, or something like that. Not sure.


Urgh.  Thanks.  Looks like perhaps the problem is that I have
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true) in one codepath
that unlinks files, but not another.  Looking.


Maybe. I didn't have time to investigate this more deeply, and it takes
quite a bit of time to reproduce. I can try again with extra logging or
test some proposed fixes, if you give me a patch.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Collation versioning

2020-11-03 Thread David Rowley

On Wed, 4 Nov 2020 at 14:21, Thomas Munro  wrote:
>
> On Wed, Nov 4, 2020 at 10:56 AM Thomas Munro  wrote:
> > On Wed, Nov 4, 2020 at 10:52 AM Tom Lane  wrote:
> > > Thomas Munro  writes:
> > > > We want the same algorithm that Windows uses internally to resolve the
> > > > old style name to a collation; in other words we probably want
> > > > something more like the code path that they took away in VS2015 :-(.
> > >
> > > Yeah.  In the short run, though, it'd be nice to un-break the buildfarm.
> > > Maybe we could push David's code or something similar, and then
> > > contemplate better ways at leisure?
> >
> > Ok, yeah, I'll do that in the next few hours.
>
> I can't bring myself to commit that, it's not really in the spirit of
> this data integrity feature, and it's not our business to second guess
> the relationship between different locale naming schemes through fuzzy
> logic.  Instead, I propose to just neuter the feature if Windows
> decides it can't understand a locale names that it gave us.  It should
> still work fine with something like initdb --lc-collate=en-US.  Here's
> an untested patch.  Thoughts?

I gave this a quick test.

initdb works fine. I ran vcregress upgradecheck and it passes.

With my default locale of English.New Zealand.1252 I get zero rows from:

select * from pg_depend where coalesce(refobjversion,'') <> '';

if I initdb with --lc-collate=en-NZ, it works and I see:

postgres=# select * from pg_depend where coalesce(refobjversion,'') <> '';
 classid | objid | objsubid | refclassid | refobjid | refobjsubid |
deptype |  refobjversion
-+---+--++--+-+-+-
2606 | 12512 |0 |   3456 |  100 |   0 | n
 | 1538.14,1538.14
(1 row)

David

Re: PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843

On Wed, Nov 4, 2020 at 2:32 PM Tomas Vondra
 wrote:
> After a while (~1h on my machine) the pg_multixact gets over 10GB, which
> triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
> My guess is that this discards some of the files, but checkpointer is
> not aware of that, or something like that. Not sure.

Urgh.  Thanks.  Looks like perhaps the problem is that I have
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true) in one codepath
that unlinks files, but not another.  Looking.

Re: Explicit NULL dereference (src/backend/utils/adt/ruleutils.c)

2020-11-03 Thread Kyotaro Horiguchi

At Mon, 02 Nov 2020 14:49:50 -0500, Tom Lane  wrote in 
> Ranier Vilela  writes:
> > Em seg., 2 de nov. de 2020 às 01:36, Kyotaro Horiguchi <
> > horikyota@gmail.com> escreveu:
> >> Doesn't seem that ev_qual and ev_action can be NULL.  The same
> >> function in the current converts action list to string using
> >> nodeToSTring so NIL is converted into '<>', which is not NULL.
> 
> > I think that Assert is not the right solution here.
> 
> I think there's some confusion here: whether the ev_actions column can
> contain a SQL NULL is a very different thing from whether the result of
> stringToNode() on it can be a NIL list.  The latter should also not
> happen, but it's not enforced by low-level code in the same way that
> the NOT NULL property is.  So in my judgment, it's okay for us to just
> Assert that we got a not-null datum, but it's probably worth expending
> an actual test-and-elog for the NIL-list case.
> 
> Of course, someone could put a string into that column that doesn't
> read out as a List at all, or it does but the elements aren't Query
> nodes, etc etc.  There's a very finite limit to how much code I'm
> willing to expend on such scenarios.  But a test for NIL list seems
> reasonable, since this function previously had what looked like sane
> handling for that case.
> 
> Anyway, I adjusted Kyotaro-san's patch a bit (including fixing the
> near identical code in make_viewdef) and pushed it.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Collation versioning

Thomas Munro  writes:
> I can't bring myself to commit that, it's not really in the spirit of
> this data integrity feature, and it's not our business to second guess
> the relationship between different locale naming schemes through fuzzy
> logic.  Instead, I propose to just neuter the feature if Windows
> decides it can't understand a locale names that it gave us.  It should
> still work fine with something like initdb --lc-collate=en-US.  Here's
> an untested patch.  Thoughts?

Works for me, at least as a short-term solution.

regards, tom lane

Re: Online checksums verification in the backend

On Tue, Nov 03, 2020 at 06:46:12PM +0900, Michael Paquier wrote:
> Yep, that's clear.  I'll deal with that tomorrow.  That's more than a
> simple rework.

This part is now done as of e152506a.
--
Michael


signature.asc
Description: PGP signature

PANIC: could not fsync file "pg_multixact/..." since commit dee663f7843


Hi,

While running some multixact-oriented stress tests, I noticed that
commit dee663f7843:

Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This can cause a significant improvement to
recovery performance, especially when it's otherwise CPU-bound.

...

seems to trigger this issue:

[17820] LOG:  checkpoint starting: wal
[17820] PANIC:  could not fsync file "pg_multixact/offsets/06E0": No such 
file or directory
[17818] LOG:  checkpointer process (PID 17820) was terminated by signal 6: 
Aborted
[17818] LOG:  terminating any other active server processes

which is then followed by this during recovery:

[18599] LOG:  redo starts at 1F/FF098138
[18599] LOG:  file "pg_multixact/offsets/0635" doesn't exist, reading as 
zeroes
[18599] CONTEXT:  WAL redo at 1F/FF09A218 for MultiXact/CREATE_ID: 
104201060 offset 1687158668 nmembers 3: 2128819 (keysh) 2128823 (keysh) 2128827 
(keysh)
[18599] LOG:  file "pg_multixact/members/7DE3" doesn't exist, reading as 
zeroes
[18599] CONTEXT:  WAL redo at 1F/FF09A218 for MultiXact/CREATE_ID: 
104201060 offset 1687158668 nmembers 3: 2128819 (keysh) 2128823 (keysh) 2128827 
(keysh)
[18599] LOG:  redo done at 2A/D4D8BFB0 system usage: CPU: user: 265.57 s, 
system: 12.43 s, elapsed: 278.06 s
[18599] LOG:  checkpoint starting: end-of-recovery immediate
[18599] PANIC:  could not fsync file "pg_multixact/offsets/06E0": No such 
file or directory
[17818] LOG:  startup process (PID 18599) was terminated by signal 6: 
Aborted
[17818] LOG:  aborting startup due to startup process failure
[17818] LOG:  database system is shut down

at which point the cluster is kaput, of course.

It's clearly the fault of dee663f7843 - 4 failures out of 4 attempts on
that commit, and after switching to ca7f8e2b86 it goes away.

Reproducing it is pretty simple, but it takes a bit of time. Essentially
do this:

create table t (a int primary key);
insert into t select i from generate_series(1,1000) s(i);

and then run

SELECT * FROM t FOR KEY SHARE;

from pgbench with many concurrent clients. I do this:

pgbench -n -c 32 -j 8 -f select.sql -T 86400 test

After a while (~1h on my machine) the pg_multixact gets over 10GB, which
triggers a more aggressive cleanup (per MultiXactMemberFreezeThreshold).
My guess is that this discards some of the files, but checkpointer is
not aware of that, or something like that. Not sure.

Attached are backtraces from the two crashes - regular and during
recovery. Not sure how interesting / helpful that is, it probably does
not say much about how we got there.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49  return ret;
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x7faee64fe53a in __GI_abort () at abort.c:79
#2  0x564dbf3ac117 in errfinish (filename=, 
lineno=, funcname=0x564dbf4ebb00 <__func__.16955> 
"ProcessSyncRequests") at elog.c:592
#3  0x564dbf2a2191 in ProcessSyncRequests () at sync.c:416
#4  0x564dbf05dee4 in CheckPointGuts (checkPointRedo=274827688840, 
flags=flags@entry=192) at xlog.c:9196
#5  0x564dbf0642e6 in CreateCheckPoint (flags=192) at xlog.c:8974
#6  0x564dbf222642 in CheckpointerMain () at checkpointer.c:451
#7  0x564dbf0740e7 in AuxiliaryProcessMain (argc=argc@entry=2, 
argv=argv@entry=0x7ffd07376d40) at bootstrap.c:451
#8  0x564dbf22dd8b in StartChildProcess (type=CheckpointerProcess) at 
postmaster.c:5512
#9  0x564dbf22f042 in reaper (postgres_signal_arg=) at 
postmaster.c:3044
#10 
#11 0x7faee65d1d46 in __GI___select (nfds=nfds@entry=8, 
readfds=readfds@entry=0x7ffd073774e0, writefds=writefds@entry=0x0, 
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd07377440) at 
../sysdeps/unix/sysv/linux/select.c:41
#12 0x564dbf22fa0c in ServerLoop () at postmaster.c:1706
#13 0x564dbf2314f3 in PostmasterMain (argc=3, argv=) at 
postmaster.c:1415
#14 0x564dbefd78ae in main (argc=3, argv=0x564dbfe19530) at main.c:209


Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49  return ret;
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x7faee64fe53a in __GI_abort () at abort.c:79
#2  0x564dbf3ac117 in errfinish (filename=, 
lineno=, funcname=0x564dbf4ebb00 <__func

Re: Collation versioning

On Wed, Nov 4, 2020 at 10:56 AM Thomas Munro  wrote:
> On Wed, Nov 4, 2020 at 10:52 AM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > We want the same algorithm that Windows uses internally to resolve the
> > > old style name to a collation; in other words we probably want
> > > something more like the code path that they took away in VS2015 :-(.
> >
> > Yeah.  In the short run, though, it'd be nice to un-break the buildfarm.
> > Maybe we could push David's code or something similar, and then
> > contemplate better ways at leisure?
>
> Ok, yeah, I'll do that in the next few hours.

I can't bring myself to commit that, it's not really in the spirit of
this data integrity feature, and it's not our business to second guess
the relationship between different locale naming schemes through fuzzy
logic.  Instead, I propose to just neuter the feature if Windows
decides it can't understand a locale names that it gave us.  It should
still work fine with something like initdb --lc-collate=en-US.  Here's
an untested patch.  Thoughts?
From 0e99688adb3c1b18c01fbfd8e2488e5769309fc7 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 4 Nov 2020 13:49:05 +1300
Subject: [PATCH] Tolerate version lookup failure for old-style Windows locale
 names.

Accept that we can't get versions for such locale names for now, leaving
it up to the user to provide the modern language tag format to benefit
from the collation versioning feature.  It's not clear that we can do
the conversion from the old style to the new style reliably enough for
this purpose.

Unfortunately, this includes the values that are typically installed as
database default by initdb, so it'd be nice to find a better solution
than this.

Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
---
 src/backend/utils/adt/pg_locale.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 3b0324ce18..d5a0169420 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1702,10 +1702,22 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		MultiByteToWideChar(CP_ACP, 0, collcollate, -1, wide_collcollate,
 			LOCALE_NAME_MAX_LENGTH);
 		if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
+		{
+			/*
+			 * GetNLSVersionEx() wants a language tag such as "en-US", not a
+			 * locale name like "English_United States.1252".  Until those
+			 * values can be prevented from entering the system, or 100%
+			 * reliably converted to the more useful tag format, tolerate the
+			 * resulting error and report that we have no version data.
+			 */
+			if (GetLastError() == ERROR_INVALID_PARAMETER)
+return NULL;
+
 			ereport(ERROR,
 	(errmsg("could not get collation version for locale \"%s\": error code %lu",
 			collcollate,
 			GetLastError(;
+		}
 		collversion = psprintf("%d.%d,%d.%d",
 			   (version.dwNLSVersion >> 8) & 0x,
 			   version.dwNLSVersion & 0xFF,
-- 
2.20.1

Re: -O switch

Magnus Hagander  writes:
> [ remove_option_o_2.patch ]

This seems committable to me now, although ...

> On Mon, Nov 2, 2020 at 6:58 PM Tom Lane  wrote:
>> Magnus Hagander  writes:
>>> Initially I kept the dynamic argv/argc in even though it's now
>>> hardcoded, in case we wanted to add something back. But given the way
>>> it looks now, perhaps we should just get rid of BackendRun()
>>> completely and directly call PostgresMain()? Or keep BackendRun() with
>>> just setting the TopMemoryContext, but removing the dynamic parts?

>> I'd be inclined to keep it as-is for now.  It's not adding any significant
>> amount of cycles compared to the process fork, so we might as well
>> preserve flexibility.

... looking at this again, BackendRun certainly looks ridiculously
over-engineered for what it still does.  If we keep it like this, we
should at least add a comment along the lines of "We once had the
ability to pass additional arguments to PostgresMain, and may someday
want to do that again".  But I wouldn't object to getting rid of the
dynamic construction of the arg array, and the debug output too.

regards, tom lane

Re: Dereference before NULL check (src/backend/storage/ipc/latch.c)

2020-11-03 Thread Kyotaro Horiguchi

At Tue, 3 Nov 2020 20:44:23 +1300, Thomas Munro  wrote 
in 
> On Tue, Nov 3, 2020 at 12:50 AM Kyotaro Horiguchi
>  wrote:
> > With the fix patch, it changes to:
> >
> > [16632] LOG:  FALSE LATCH: 
> 
> Nice repo.  But is it OK to not reset the Win32 event in this case?
> Does it still work correctly if you wait on the latch after that
> happened, and perhaps after the PG latch is reset?

I'm not sure what is the point of the question, but the previous patch
doesn't omit resetting the Win32-event in that case.  In the same way
with other implements of the same function, it resets the trigger that
woke up the process since the trigger is no longer needed even if we
are not waiting on it.

If we call WaitLatch(OrSocket) that waits on the latch, it immediately
returns because the latch is set. If we called ResetLatch before the
next call to WaitLatch(), it correctly waits on a trigger to be
pulled.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Fix brin_form_tuple to properly detoast data


Hi,

As pointed out in [1], BRIN is not properly handling toasted data, which
may easily lead to index tuples referencing TOAST-ed values. Which is
clearly wrong - it's trivial to trigger failues after a DELETE.

Attached is a patch that aims to fix this - AFAIK the brin_form_tuple
was simply missing the TOAST_INDEX_HACK stuff from index_form_tuple,
which ensures the data is detoasted and (possibly) re-compressed. The
code is mostly the same, with some BRIN-specific tweaks (looking at
oi_typecache instead of the index descriptor, etc.).

I also attach a simple SQL script that I used to trigger the issue. This
needs to be turned into a regression test, I'll work on that tomorrow.


A separate question is what to do about existing indexes - ISTM the only
thing we can do is to tell the users to reindex all BRIN indexes on
varlena values. Something like this:

select * from pg_class
 where relam = (select oid from pg_am where amname = 'brin')
   and oid in (select attrelid from pg_attribute where attlen = -1
and attstorage in ('e', 'x'));


regards


[1] 
https://www.postgresql.org/message-id/20201001184133.oq5uq75sb45pu3aw%40development

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/access/brin/brin_tuple.c 
b/src/backend/access/brin/brin_tuple.c
index 46e6b23c87..96eab9e87c 100644
--- a/src/backend/access/brin/brin_tuple.c
+++ b/src/backend/access/brin/brin_tuple.c
@@ -38,6 +38,17 @@
 #include "utils/datum.h"
 #include "utils/memutils.h"
 
+#include "access/detoast.h"
+#include "access/heaptoast.h"
+#include "access/toast_internals.h"
+
+/*
+ * This enables de-toasting of index entries.  Needed until VACUUM is
+ * smart enough to rebuild indexes from scratch.
+ */
+#define TOAST_INDEX_HACK
+
+
 static inline void brin_deconstruct_tuple(BrinDesc *brdesc,

  char *tp, bits8 *nullbits, bool nulls,

  Datum *values, bool *allnulls, bool *hasnulls);
@@ -99,6 +110,12 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, 
BrinMemTuple *tuple,
Sizelen,
hoff,
data_len;
+   int i;
+
+#ifdef TOAST_INDEX_HACK
+   Datum  *untoasted_values;
+   int nuntoasted = 0;
+#endif
 
Assert(brdesc->bd_totalstored > 0);
 
@@ -107,6 +124,10 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, 
BrinMemTuple *tuple,
phony_nullbitmap = (bits8 *)
palloc(sizeof(bits8) * BITMAPLEN(brdesc->bd_totalstored));
 
+#ifdef TOAST_INDEX_HACK
+   untoasted_values = (Datum *) palloc(sizeof(Datum) * 
brdesc->bd_totalstored);
+#endif
+
/*
 * Set up the values/nulls arrays for heap_fill_tuple
 */
@@ -141,7 +162,75 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, 
BrinMemTuple *tuple,
for (datumno = 0;
 datumno < brdesc->bd_info[keyno]->oi_nstored;
 datumno++)
-   values[idxattno++] = 
tuple->bt_columns[keyno].bv_values[datumno];
+   {
+   Datum value = 
tuple->bt_columns[keyno].bv_values[datumno];
+
+#ifdef TOAST_INDEX_HACK
+
+   /* We must look at the stored type, not at the index 
descriptor. */
+   TypeCacheEntry  *atttype = 
brdesc->bd_info[keyno]->oi_typcache[datumno];
+
+   /* Do we need to free the value at the end? */
+   bool free_value = false;
+
+   /* For non-varlena types we don't need to do anything 
special */
+   if (atttype->typlen != -1)
+   {
+   values[idxattno++] = value;
+   continue;
+   }
+
+   /*
+* Do nothing if value is not of varlena type. We don't 
need to
+* care about NULL values here, thanks to bv_allnulls 
above.
+*
+* If value is stored EXTERNAL, must fetch it so we are 
not
+* depending on outside storage.
+*
+* XXX Is this actually true? Could it be that the 
summary is
+* NULL even for range with non-NULL data? E.g. 
degenerate bloom
+* filter may be thrown away, etc.
+*/
+   if (VARATT_IS_EXTERNAL(DatumGetPointer(value)))
+   {
+   value = 
PointerGetDatum(detoast_external_attr((struct varlena *)
+

Re: Move OpenSSL random under USE_OPENSSL_RANDOM

On Tue, Nov 03, 2020 at 01:46:38PM +0100, Magnus Hagander wrote:
> On Tue, Nov 3, 2020 at 1:00 PM Daniel Gustafsson  wrote:
>> I kind of like the idea of continuing to abstract this functionality, not
>> pulling in OpenSSL headers in fork_process.c is a neat bonus.  I'd say it's
>> worth implementing to see what it would imply, and am happy to do unless
>> someone beats me to it.
> 
> Yeah, if it's likely to be usable in the other implementations, then I
> think we should definitely explore exactly what that kind of an
> abstraction would imply. Anything isolating the dependency on OpenSSL
> would likely have to be done at that time anyway in that case, so
> better have it ready.

With the NSS argument, agreed.  Documenting when this initialization
routine is used is important.  And I think that we should force to
look at this code when adding a new SSL implementation to make sure
that we never see CVE-2013-1900 again, say:
void
pg_strong_random_init(void)
{
#ifdef USE_SSL
#ifdef USE_OPENSSL
RAND_poll();
#elif USE_NSS
/* do the NSS initialization */
#else
Hey, you are missing something here.
#endif
#endif
}
--
Michael


signature.asc
Description: PGP signature

Re: Parallel INSERT (INTO ... SELECT ...)

2020-11-03 Thread Greg Nancarrow

Hi Vignesh,

Thanks for reviewing the patches.

On Tue, Nov 3, 2020 at 5:25 PM vignesh C  wrote:
>
> -> commandType is not used, we can remove it.
> + * Prepare for entering parallel mode by assigning a FullTransactionId, to be
> + * included in the transaction state that is serialized in the parallel DSM.
> + */
> +void PrepareParallelModeForModify(CmdType commandType)
> +{
> +   Assert(!IsInParallelMode());
> +
> +   (void)GetCurrentTransactionId();
> +}

Thanks, at least for INSERT, it's not needed, so I'll remove it.

>
> -> As we support insertion of data from the workers, this comments "but as of 
> now, only the leader backend writes into a completely new table.  In the 
> future, we can extend it to allow workers to write into the table" must be 
> updated accordingly:
> +* modify any data using a CTE, or if this is a cursor operation, or 
> if
> +* GUCs are set to values that don't permit parallelism, or if
> +* parallel-unsafe functions are present in the query tree.
>  *
> -* (Note that we do allow CREATE TABLE AS, SELECT INTO, and CREATE
> +* (Note that we do allow CREATE TABLE AS, INSERT, SELECT INTO, and 
> CREATE
>  * MATERIALIZED VIEW to use parallel plans, but as of now, only the 
> leader
>  * backend writes into a completely new table.  In the future, we can
>  * extend it to allow workers to write into the table.  However, to 
> allow
>
> -> Also should we specify insert as "insert into select"
>

I'll update it, appropriate to each patch.

> -> We could include a small writeup of the design may be in the commit 
> message. It will be useful for review.
>

Will do so for the next patch version.

> -> I felt the below two assignment statements can be in the else condition:
> glob->maxParallelHazard = max_parallel_hazard(parse);
> glob->parallelModeOK = (glob->maxParallelHazard != 
> PROPARALLEL_UNSAFE);
> +
> +   /*
> +* Additional parallel-mode safety checks are required in 
> order to
> +* allow an underlying parallel query to be used for a
> +* table-modification command that is supported in 
> parallel-mode.
> +*/
> +   if (glob->parallelModeOK &&
> +   IsModifySupportedInParallelMode(parse->commandType))
> +   {
> +   glob->maxParallelHazard = 
> MaxParallelHazardForModify(parse, &glob->maxParallelHazard);
> +   glob->parallelModeOK = (glob->maxParallelHazard != 
> PROPARALLEL_UNSAFE);
> +   }
>
> something like:
> /*
> * Additional parallel-mode safety checks are required in order to
> * allow an underlying parallel query to be used for a
> * table-modification command that is supported in parallel-mode.
> */
> if (glob->parallelModeOK &&
> IsModifySupportedInParallelMode(parse->commandType))
> glob->maxParallelHazard = MaxParallelHazardForModify(parse, 
> &glob->maxParallelHazard);
> else
> /* all the cheap tests pass, so scan the query tree */
> glob->maxParallelHazard = max_parallel_hazard(parse);
> glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
>

That won't work. As the comment is trying to point out, additional
parallel-safety checks (i.e. in addition to those done by
max_parallel_hazard()) are required to determine if INSERT can be
safely run in parallel-mode with an underlying parallel query.
Also, the max_parallel_hazard found from first calling
max_parallel_hazard() then needs to be fed into
MaxParallelHazardForModify(), in case it finds a worse parallel
hazard.
For example, max_parallel_hazard() may find something parallel
RESTRICTED, but then the additional parallel-safety checks done by
MaxParallelHazardForModify() find something parallel UNSAFE.

> -> Comments need slight adjustment, maybe you could run pgindent for the 
> modified code.
> +   /*
> +* Supported table-modification commands may require 
> additional steps
> +* prior to entering parallel mode, such as assigning a 
> FullTransactionId.
> +*/
>

OK, will run pgindent.

> -> In the below, max_parallel_hazard_test will return true for 
> PROPARALLEL_RESTRICTED also, Is break intentional in that case? As in case of 
> RI_TRIGGER_FK for PROPARALLEL_RESTRICTED we continue.
> +   if (max_parallel_hazard_test(func_parallel(trigger->tgfoid), 
> context))
> +   break;
> +
> +   /*
> +* If the trigger type is RI_TRIGGER_FK, this indicates a FK 
> exists in
> +* the relation, and this would result in creation of new 
> CommandIds
> +* on insert/update/delete and this isn't supported in a 
> parallel
> +* worker (but is safe in the parallel leader).
> +*/
> +   trigtype = RI_FKey_trigger_type(trigger->tgfoid);
> +

Re: Delay of standby shutdown

2020-11-03 Thread Soumyadeep Chakraborty

Hello Fujii,

On Thu, Sep 17, 2020 at 6:49 AM Fujii Masao  wrote:

As far as I understand after debugging, the problem is as follows:

1. After the primary is stopped, and the standby startup process is
waiting inside:

(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
wait_time,
WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);

it receives SIGTERM on account of stopping the standby and it leads to
the WaitLatch call returning with WL_LATCH_SET.

2. WaitForWALToBecomeAvailable() then will return true after calling
XLogFileReadAnyTLI() and eventually, XLogReadRecord() will return NULL
since there is no new WAL to read, which means ReadRecord() will loop
back and perform another XLogReadRecord().

3. The additional XLogReadRecord() will lead to a 5s wait inside
WaitForWALToBecomeAvailable() - a different WaitLatch() call this time:

/*
 * Wait for more WAL to arrive. Time out after 5 seconds
 * to react to a trigger file promptly and to check if the
 * WAL receiver is still active.
 */
(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
5000L, WAIT_EVENT_RECOVERY_WAL_STREAM);

4. And then eventually, the code will handle interrupts here inside
WaitForWALToBecomeAvailable() after the 5s wait:

/*
 * This possibly-long loop needs to handle interrupts of startup
 * process.
 */
HandleStartupProcInterrupts();

> To avoid this issue, I think that ReadRecord() should call
> HandleStartupProcInterrupts() whenever looping back to retry.

Alternatively, perhaps we can place it inside
WaitForWALToBecomeAvailable() (which already has a similar call),
since it is more semantically aligned to checking for interrupts, rather
than ReadRecord()? Like this:

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 52a67b117015..b05cf6c7c219 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12249,6 +12249,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr
RecPtr, bool randAccess,

WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
 ResetLatch(&XLogCtl->recoveryWakeupLatch);
 now = GetCurrentTimestamp();
+/*
+ * Check for interrupts
+ */
+HandleStartupProcInterrupts();
 }
 last_fail_time = now;
 currentSource = XLOG_FROM_ARCHIVE;

It also has the advantage of being in a slightly less "hot" code path
as compared to it being where you suggested (the location you suggested
is executed infinitely when a standby is not connected to a primary and
there is no more WAL to read)

Regards,

Soumyadeep and Georgios

Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

I wrote:
> * I notice that this will sometimes transform non-SQL-spec syntax
> into SQL-spec, for example ...
> I'm not sure that that satisfies the POLA.  This particular case is
> especially not great, because this is really textregexsubstr() which
> is *not* SQL compatible, so the display is more than a bit misleading.

Actually, the problem there is that I made ruleutils.c willing to
reverse-list textregexsubstr() in SQL syntax, which it really shouldn't
since there is no such function per SQL.  So deleting that "case" value
is enough to fix most of the problem.  Still:

> ... In fact, I'd sort of argue
> that we should not force the function to be sought in pg_catalog in such
> a case either.  The comments in substr_list claim that we're trying to
> allow extension functions named substring(), but using SystemFuncName is
> 100% hostile to that.

... this seems like a reasonable argument.  However, in the attached
I only did that for SUBSTRING and OVERLAY.  I had thought of doing
it for POSITION and TRIM, but both of those are weird enough that
allowing a "normal function call" seems error-prone.  For example,
the fact that TRIM(expr_list) works out as a call to btrim() is a mess,
but I don't think we can change it.  (But of course you can still call a
user-defined trim() function if you double-quote the function name.)

I did get rid of the empty variant for position_list, which AFAICS
has no value except adding confusion: there are no zero-argument
functions named "position" in pg_catalog.

I feel like this is committable at this point --- any objections?

regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 530aac68a7..3031c52991 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2682,11 +2682,12 @@ _copyFuncCall(const FuncCall *from)
 	COPY_NODE_FIELD(args);
 	COPY_NODE_FIELD(agg_order);
 	COPY_NODE_FIELD(agg_filter);
+	COPY_NODE_FIELD(over);
 	COPY_SCALAR_FIELD(agg_within_group);
 	COPY_SCALAR_FIELD(agg_star);
 	COPY_SCALAR_FIELD(agg_distinct);
 	COPY_SCALAR_FIELD(func_variadic);
-	COPY_NODE_FIELD(over);
+	COPY_SCALAR_FIELD(funcformat);
 	COPY_LOCATION_FIELD(location);
 
 	return newnode;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 0cf90ef33c..9aa853748d 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2369,11 +2369,12 @@ _equalFuncCall(const FuncCall *a, const FuncCall *b)
 	COMPARE_NODE_FIELD(args);
 	COMPARE_NODE_FIELD(agg_order);
 	COMPARE_NODE_FIELD(agg_filter);
+	COMPARE_NODE_FIELD(over);
 	COMPARE_SCALAR_FIELD(agg_within_group);
 	COMPARE_SCALAR_FIELD(agg_star);
 	COMPARE_SCALAR_FIELD(agg_distinct);
 	COMPARE_SCALAR_FIELD(func_variadic);
-	COMPARE_NODE_FIELD(over);
+	COMPARE_SCALAR_FIELD(funcformat);
 	COMPARE_LOCATION_FIELD(location);
 
 	return true;
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 49de285f01..ee033ae779 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -582,7 +582,7 @@ makeDefElemExtended(char *nameSpace, char *name, Node *arg,
  * supply.  Any non-default parameters have to be inserted by the caller.
  */
 FuncCall *
-makeFuncCall(List *name, List *args, int location)
+makeFuncCall(List *name, List *args, CoercionForm funcformat, int location)
 {
 	FuncCall   *n = makeNode(FuncCall);
 
@@ -590,11 +590,12 @@ makeFuncCall(List *name, List *args, int location)
 	n->args = args;
 	n->agg_order = NIL;
 	n->agg_filter = NULL;
+	n->over = NULL;
 	n->agg_within_group = false;
 	n->agg_star = false;
 	n->agg_distinct = false;
 	n->func_variadic = false;
-	n->over = NULL;
+	n->funcformat = funcformat;
 	n->location = location;
 	return n;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 7e324c12e2..4504b1503b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2765,11 +2765,12 @@ _outFuncCall(StringInfo str, const FuncCall *node)
 	WRITE_NODE_FIELD(args);
 	WRITE_NODE_FIELD(agg_order);
 	WRITE_NODE_FIELD(agg_filter);
+	WRITE_NODE_FIELD(over);
 	WRITE_BOOL_FIELD(agg_within_group);
 	WRITE_BOOL_FIELD(agg_star);
 	WRITE_BOOL_FIELD(agg_distinct);
 	WRITE_BOOL_FIELD(func_variadic);
-	WRITE_NODE_FIELD(over);
+	WRITE_ENUM_FIELD(funcformat, CoercionForm);
 	WRITE_LOCATION_FIELD(location);
 }
 
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 357ab93fb6..df263e4fdd 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -490,7 +490,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	rowsfrom_item rowsfrom_list opt_col_def_list
 %type  opt_ordinality
 %type 	ExclusionConstraintList ExclusionConstraintElem
-%type 	func_arg_list
+%type 	func_arg_list func_arg_list_opt
 %type 	func_arg_expr
 %type 	row explicit_row implicit_row type_list array_expr_list
 %type 	case_expr case_arg when_clause case_default
@@ -129

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-11-03 Thread Alvaro Herrera

Here's an updated version of this patch.

Apart from rebasing to current master, I made the following changes:

* On the first transaction (the one that marks the partition as
detached), the partition is locked with ShareLock rather than
ShareUpdateExclusiveLock.  This means that DML is not allowed anymore.
This seems a reasonable restriction to me; surely by the time you're
detaching the partition you're not inserting data into it anymore.

* In ExecInitPartitionDispatchInfo, the previous version always
excluded detached partitions.  Now it does include them in isolation
level repeatable read and higher.  Considering the point above, this
sounds a bit contradictory: you shouldn't be inserting new tuples in
partitions being detached.  But if you do, it makes more sense: in RR
two queries that insert tuples in the same partition would not fail
mid-transaction.  (In a read-committed transaction, the second query
does fail, but to me that does not sound surprising.)

* ALTER TABLE .. DETACH PARTITION FINALIZE now waits for concurrent old
snapshots, as previously discussed. This should ensure that the user
doesn't just cancel the wait during the second transaction by Ctrl-C and
run FINALIZE immediately afterwards, which I claimed would bring
inconsistency.

* Avoid creating the CHECK constraint if an identical one already
exists.

(Note I do not remove the constraint on ATTACH.  That seems pointless.)

Still to do: test this using the new hook added by 6f0b632f083b.
>From 8135f82493892ad755176c8c4c8251355dae2927 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 3 Aug 2020 19:21:09 -0400
Subject: [PATCH v4] ALTER TABLE ... DETACH CONCURRENTLY

---
 doc/src/sgml/catalogs.sgml|  10 +
 doc/src/sgml/ref/alter_table.sgml |  25 +-
 src/backend/catalog/heap.c|  13 +-
 src/backend/catalog/index.c   |   4 +-
 src/backend/catalog/partition.c   |  28 +-
 src/backend/catalog/pg_inherits.c |  51 +-
 src/backend/commands/indexcmds.c  |   7 +-
 src/backend/commands/tablecmds.c  | 587 ++
 src/backend/commands/trigger.c|   5 +-
 src/backend/executor/execPartition.c  |  29 +-
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/optimizer/util/plancat.c  |  11 +-
 src/backend/parser/gram.y |  23 +-
 src/backend/partitioning/partbounds.c |   6 +-
 src/backend/partitioning/partdesc.c   |  21 +-
 src/backend/tcop/utility.c|  19 +
 src/backend/utils/cache/partcache.c   |  15 +-
 src/bin/psql/describe.c   |  40 +-
 src/include/catalog/pg_inherits.h |   7 +-
 src/include/nodes/parsenodes.h|   2 +
 src/include/parser/kwlist.h   |   1 +
 src/include/partitioning/partdesc.h   |   5 +-
 src/include/utils/snapmgr.h   |   1 +
 .../detach-partition-concurrently-1.out   | 301 +
 .../detach-partition-concurrently-2.out   |  66 ++
 src/test/isolation/isolation_schedule |   2 +
 .../detach-partition-concurrently-1.spec  |  90 +++
 .../detach-partition-concurrently-2.spec  |  41 ++
 29 files changed, 1234 insertions(+), 178 deletions(-)
 create mode 100644 src/test/isolation/expected/detach-partition-concurrently-1.out
 create mode 100644 src/test/isolation/expected/detach-partition-concurrently-2.out
 create mode 100644 src/test/isolation/specs/detach-partition-concurrently-1.spec
 create mode 100644 src/test/isolation/specs/detach-partition-concurrently-2.spec

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5fb9dca425..b628ffaa4b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -4482,6 +4482,16 @@ SCRAM-SHA-256$:&l
when using declarative partitioning.
   
  
+
+ 
+  
+   inhdetached bool
+  
+  
+   Set to true for a partition that is in the process of being detached;
+   false otherwise.
+  
+ 
 

   
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index c25ef5abd6..1ff6214fa7 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -36,7 +36,9 @@ ALTER TABLE ALL IN TABLESPACE name
 ALTER TABLE [ IF EXISTS ] name
 ATTACH PARTITION partition_name { FOR VALUES partition_bound_spec | DEFAULT }
 ALTER TABLE [ IF EXISTS ] name
-DETACH PARTITION partition_name
+DETACH PARTITION partition_name [ CONCURRENTLY ]
+ALTER TABLE [ IF EXISTS ] name
+DETACH PARTITION partition_name FINALIZE
 
 where action is one of:
 
@@ -938,7 +940,8 @@ WITH ( MODULUS numeric_literal, REM

 

-DETACH PARTITION partition_name
+DETACH PARTITION partition_name [ { CONCURRENTLY | FINALIZE } ]
+
 
  
   This form detaches

Re: Collation versioning

2020-11-03 Thread Christoph Berg

Re: Thomas Munro
> for all I know, "en" variants might change
> independently (I doubt it in practice, but in theory it's wrong).

Long before the glibc 2.28 incident, the same collation change
had already happened twice, namely between RedHat 5 -> 6 -> 7, for
de_DE.UTF-8 only. de_AT.UTF-8 and all other locales were unaffected.

At the time I didn't connect the dots to check if Debian was affected
as well, but of course later testing revealed it was since it was a
change in glibc.

(German blogpost: 
https://www.credativ.de/blog/postgresql/postgresql-und-inkompatible-deutsche-spracheigenschaften-in-centos-rhel/)

Christoph

Re: Collation versioning

On Wed, Nov 4, 2020 at 10:52 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > We want the same algorithm that Windows uses internally to resolve the
> > old style name to a collation; in other words we probably want
> > something more like the code path that they took away in VS2015 :-(.
>
> Yeah.  In the short run, though, it'd be nice to un-break the buildfarm.
> Maybe we could push David's code or something similar, and then
> contemplate better ways at leisure?

Ok, yeah, I'll do that in the next few hours.

Re: Collation versioning

Thomas Munro  writes:
> On Tue, Nov 3, 2020 at 4:38 PM Thomas Munro  wrote:
>> On Tue, Nov 3, 2020 at 1:51 PM David Rowley  wrote:
>>> FWIW, the attached does fix the issue for me.  It basically just calls
>>> the function that converts the windows-type "English_New Zealand.1252"
>>> locale name string into, e.g. "en_NZ".

> We want the same algorithm that Windows uses internally to resolve the
> old style name to a collation; in other words we probably want
> something more like the code path that they took away in VS2015 :-(.

Yeah.  In the short run, though, it'd be nice to un-break the buildfarm.
Maybe we could push David's code or something similar, and then
contemplate better ways at leisure?

regards, tom lane

Re: Collation versioning

On Tue, Nov 3, 2020 at 4:38 PM Thomas Munro  wrote:
> On Tue, Nov 3, 2020 at 1:51 PM David Rowley  wrote:
> > On Tue, 3 Nov 2020 at 12:29, David Rowley  wrote:
> > > Running low on ideas for now, so thought I'd post this in case it
> > > someone thinks of something else.
> >
> > FWIW, the attached does fix the issue for me.  It basically just calls
> > the function that converts the windows-type "English_New Zealand.1252"
> > locale name string into, e.g. "en_NZ". Then, since GetNLSVersionEx()
> > wants yet another variant with a - rather than an _, I've just added a
> > couple of lines to swap the _ for a -.  There's a bit of extra work
> > there since IsoLocaleName() just did the opposite, so perhaps doing it
> > that way was lazy of me.  I'd have invented some other function if I
> > could have thought of a meaningful name for it, then just have the ISO
> > version of it swap - for _.
>
> Thanks!  Hmm, it looks like Windows calls the hyphenated ISO
> language-country form a "tag".  It makes me slightly nervous to ask
> for the version of a transformed name with the encoding stripped, but
> it does seem entirely plausible that it gives the answer we seek.  I
> suppose if we were starting from a clean slate we might want to
> perform this transformation up front so that we have it in datcollate
> and then not have to think about the older form ever again.  If we
> decided to do that going forward, the last trace of that problem would
> live in pg_upgrade.  If we ever extend pg_import_system_collations()
> to cover Windows, we should make sure it captures the tag form.

So we have:

1.  Windows locale names, like "English_United States.1252".  Windows
still returns these from setlocale(), so they finish up in datcollate,
and yet some relevant APIs don't accept them, at least on some
machines.

2.  BCP 47/RFC 5646 language tags, like "en-US".  Windows uses these
in relevant new APIs, including the case in point.

3.  Unix-style (XPG?  ISO/IEC 15897?) locale names, like "en_US"
("language[_territory[.codeset]][@modifier]").  These are used for
message catalogues.

We have a VS2015+ way of converting from form 1 to form 2 (and thence
3 by s/-/_/), and an older way.  Unfortunately, the new way looks a
little too fuzzy: if i'm reading it right, search_locale_enum() might
stop on either "en" or "en-AU", given "English_Australia", depending
on the search order, no?  This may be fine for the purpose of looking
up error messages with gettext() (where there is only one English
language message catalogue, we haven't got around to translating our
errors into 'strayan yet), but it doesn't seem like a good way to look
up the collation version; for all I know, "en" variants might change
independently (I doubt it in practice, but in theory it's wrong).  We
want the same algorithm that Windows uses internally to resolve the
old style name to a collation; in other words we probably want
something more like the code path that they took away in VS2015 :-(.

Re: Use of "long" in incremental sort code


On Tue, Nov 03, 2020 at 03:53:53AM +0100, Tomas Vondra wrote:

Hi,

I took another look at this, and 99% of the patch (the fixes to sort
debug messages) seems fine to me.  Attached is the part I plan to get
committed, including commit message etc.



I've pushed this part. Thanks for the patch, Haiying Tang.



The one change I decided to remove is this change in tuplesort_free:

-   longspaceUsed;
+   int64   spaceUsed;

The reason why I think this variable should be 'long' is that we're
using it for this:

   spaceUsed = LogicalTapeSetBlocks(state->tapeset);

and LogicalTapeSetBlocks is defined like this:

   extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);

FWIW the "long" is not introduced by incremental sort - it used to be in
tuplesort_end, the incremental sort patch just moved it to a different
function. It's a bit confusing that tuplesort_updatemax has this:

   int64   spaceUsed;

But I'd argue this is actually wrong, and should be "long" instead. (And
this actually comes from the incremental sort patch, by me.)


FWIW while looking at what the other places calling LogicalTapeSetBlocks
do, and I noticed this:

   uint64  disk_used = LogicalTapeSetBlocks(...);

in the disk-based hashagg patch. So that's a third data type ...



IMHO this should simply switch the current int64 variable to long, as it
was before. Not sure about about the hashagg uint64 variable.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: enable_incremental_sort changes query behavior


On Tue, Nov 03, 2020 at 03:37:43AM +0100, Tomas Vondra wrote:

On Fri, Oct 30, 2020 at 07:37:33PM -0400, James Coleman wrote:


...

I was looking at this some more, and I'm still convinced that this is
correct, but I don't think a comment about it being an optimization
(though I suspect that that its major benefit), since it came from,
and must parallel, find_ec_member_for_tle, and there is no such
explanation there. I don't want to backfill reasoning onto that
decision, but I do think it's important to parallel it since it's
ultimately what is going to cause errors if we're out of sync with it.

As to why find_em_expr_for_rel is different, I think the answer is
that it's used by the FDW code, and it seems to me to make sense that
in that case we'd only send sort conditions to the foreign server if
the em actually contains rels.

The new series attached includes the primary fix for this issue, the
additional comments that could either go in at the same time or not,
and my patch with semi-related questions (which isn't intended to be
committed, but to keep track of the questions).



Thanks. Attached are two patches that I plan to get committed

0001 is what you sent, with slightly reworded commit message. This is
the actual fix.

I'm still thinking about Robert's comment that perhaps we should be
considering only expressions already in the relation's target, which
would imply checking for volatile expressions is not enough. But I've
been unable to convince myself it's correct or incorrect. In any case
0001 is a clear improvement - in the worst case we'll have to make it
stricter in the future.


0002 partially reverts ba3e76cc571 by moving find_em_expr_for_rel back
to postgres_fdw.c.  We've moved it to equivclass.c so that it could be
called from both the FDW and allpaths.c, but now that the functions
diverged it's only called from the FDW again.  So maybe we should move
it back, but I guess that's not a good thing in a minor release.



I've pushed the 0001 part, i.e. the main fix. Not sure about the other
parts (comments and moving the code back to postgres_fdw) yet.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-11-03 Thread David Rowley

On Tue, 3 Nov 2020 at 20:08, Peter Eisentraut
 wrote:
>
> On 2020-09-29 11:26, David Rowley wrote:
> > I've marked this patch back as waiting for review. It would be good if
> > someone could run some tests on some intel hardware and see if they
> > can see any speedup.
>
> What is the way forward here?  What exactly would you like to have tested?

It would be good to see some small scale bench -S tests with and
without -M prepared.

Also, small scale TPC-H tests would be good.I really only did
testing on new AMD hardware, so some testing on intel hardware would
be good.

David

Re: libpq compression

2020-11-03 Thread Daniil Zakhlystov

Hi,Looks like I found an issue: there can be a situation when libpq frontend hangs after zpq_read(). This may happen when the socket is not read ready and we can't perform another read because we wait on pqWait() but we actually have some buffered rx data.I think we should add a check if there is any buffered rx data before calling pqWait() to avoid such hangs.--Daniil Zakhlystov

Re: Dumping/restoring fails on inherited generated column

2020-11-03 Thread Peter Eisentraut


On 2020-08-27 13:30, Masahiko Sawada wrote:

This issue is not fixed yet. I've attached the updated version patch
and registered it to commit fest so as not to forget. Please review
it.


A few fixes have been committed in this thread, basically to prevent 
situations that shouldn't have been allowed.


What's left is the originally reported issue that some parts of the 
regression test database are dumped incorrectly.  The two proposed 
patches in their most recent versions are


https://www.postgresql.org/message-id/attachment/107447/v3-0001-pg_dump-Fix-dumping-of-inherited-generated-column.patch 
(message 
https://www.postgresql.org/message-id/b1c831dd-d520-5e7f-0304-0eeed39c9996%402ndquadrant.com)


and

https://www.postgresql.org/message-id/attachment/113487/fix_gcolumn_dump_v2.patch 
(message 
https://www.postgresql.org/message-id/CA%2Bfd4k6pLzrZDQsdsxcS06AwGRf1DgwOw84sFq9oXNw%2B83nB1g%40mail.gmail.com)


Both of these result in the same change to the dump output.  Both of 
them have essentially the same idea.  The first one adds the 
conditionals during the information gathering phase of pg_dump, the 
second one adds the conditionals during the output phase.


Any further thoughts?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-03 Thread Dmitry Dolgov

> On Thu, Aug 20, 2020 at 03:11:19PM +0900, Michael Paquier wrote:
> On Wed, Aug 19, 2020 at 02:16:46PM -0400, Alvaro Herrera wrote:
> > I did not set the flag in REINDEX CONCURRENTLY, but as I understand it
> > can be done too, since in essence it's the same thing as a CIC from a
> > snapshot management point of view.
>
> Yes, I see no problems for REINDEX CONCURRENTLY as well as long as
> there are no predicates and expressions involved.  The transactions
> that should be patched are all started in ReindexRelationConcurrently.
> The transaction of index_concurrently_swap() cannot set up that
> though.  Only thing to be careful is to make sure that safe_flag is
> correct depending on the list of indexes worked on.

Hi,

After looking through the thread and reading the patch it seems good,
and there are only few minor questions:

* Doing the same for REINDEX CONCURRENTLY, which does make sense. In
  fact it's already mentioned in the commentaries as done, which a bit
  confusing.

* Naming, to be more precise what suggested Michael:

> Could we consider renaming vacuumFlags?  With more flags associated to
> a PGPROC entry that are not related to vacuum, the current naming
> makes things confusing.  Something like statusFlags could fit better
> in the picture?

  which sounds reasonable, and similar one about flag name
  PROC_IN_SAFE_CIC - if it covers both CREATE INDEX/REINDEX CONCURRENTLY
  maybe just PROC_IN_SAFE_IC?

Any plans about those questions? I can imagine that are the only missing
parts.

Re: cutting down the TODO list thread

2020-11-03 Thread John Naylor

I wrote:

>
> Ok, that's two votes for a separate page, and one for a new section on the
> same page, so it looks like it's a new page. That being the case, I would
> think it logical to move "features we don't want" there. As for the name,
> we should probably encompass both "won't fix" bugs and features not wanted.
> Maybe "past development ideas" or "not worth doing", but I'm open to better
> ideas. Once that's agreed upon, I'll make a new page and migrate the items
> over, minus the two that were mentioned upthread.
>

Hearing no preference, I've created

https://wiki.postgresql.org/wiki/Not_Worth_Doing

...with links between the two. I've moved over the items I suggested
upthread, minus the two where I heard feedback otherwise (prevent replay of
query cancel packets and improve WAL replay of CREATE TABLESPACE)

I have patches for documenting some behavior we won't fix in [1][2].

I was thinking of not having the next updates during commitfest, but it
could also be argued this is a type of review, and the things here will be
returned with feedback or rejected, in a way. Ultimately, it comes down to
"when time permits".

[1]
https://www.postgresql.org/message-id/flat/CAFBsxsGsBZsG%3DcLM0Op5HFb2Ks6SzJrOc_eRO_jcKSNuqFRKnQ%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/CAFBsxsEmg=kqrekxrlygy0ujcfyck4vgxzkalrwh_olfj8o...@mail.gmail.com

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-03 Thread Stephen Frost

Greetings,

* Jakub Wartak (jakub.war...@tomtom.com) wrote:
> >Interesting that you weren't seeing any benefit to disabling readahead.
> 
> I've got some free minutes and I have repeated the exercise in more realistic
> and strict environment that previous one to conclude that the current 
> situation is preferable:

Thanks for spending time on this!

> Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 
> 35GB, 65GB (as reported by \di+)
> Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 might 
> is different layer and might have different storage settings), VG on top of 
> dm-0, LV with stripe-size 8kB, ext4.
> s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical 
> output of PgSQL12: 
> INFO:  "x": scanned 150 of 22395442 pages, containing 112410444 live rows 
> and 0 dead rows; 150 rows in sample, 1678321053 estimated total rows
> 
> Hot VFS cache:
> Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is 
> spent heapam_scan_analyze_next_block() -> .. -> pread() which causes 
> ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat 
> with average request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, 
> hot caches, total elapsed ~3m
> Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
> Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
> Run3: Defaults, miliaria as above (hot VFS cache), total elapsed 2m:40s
> 
> No VFS cache:
> Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s 
> read, ondemand_readahead visible in perf, total elapsed 3m30s
> Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total 
> elapsed 3m29s
> Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total 
> elapsed 3m28s
> 
> No VFS cache, readahead off:
> Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors 
> (*512=5.5kB), total elapsed 5m59s
> Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: as above, double-confirmed no readaheads [ 
> pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total 
> elapsed 5m56s
> Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: as above, total elapsed 5m55s

[ ... ]

> >The VACUUM case is going to be complicated by what's in the visibility
> >map. (..) 
> 
> After observing the ANALYZE readahead behavior benefit I've abandoned
> the case of testing much more advanced VACUUM processing, clearly Linux 
> read-ahead is beneficial in even simple cases.

This seems to be indicating that while the Linux kernel may end up
reading pages we don't end up needing, it's much more often the case
that it's ending up reading *some* pages that we do need, and that's
happening often enough that it more than makes up for the extra reads
being done.

Instead of having these guessing games between the kernel and what PG's
doing, however, we could potentially do better using posix_fadvise() to
tell the kernel, up front, exactly what blocks we are going to ask for,
and perhaps that would end up improving things.

Attached is a very much rough-n-ready patch for doing that, using
effective_io_concurrency to control how many blocks to pre-fetch for
ANALYZE (0 meaning 'none').  If you've got a chance to test with
different settings for effective_io_concurrency with the patch applied
to see what impact posix_fadvise() has on these ANALYZE runs, that would
be very cool to see.

Going between effective_cache_size = 0 and effective_cache_size = 10
with this patch, in some very quick testing on a laptop NVMe, while
making sure to drop caches and restart PG in between to clear out
shared_buffers, definitely shows that prefetching done this way is an
improvement over letting the kernel's normal read ahead handle it.

> >> My only idea would be that a lot of those blocks could be read 
> >> asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on 
> >> block-range before, so maybe the the optimization is possible, but not 
> >> until we'll have AIO ;)
> >
> > (..)AIO is a whole other animal that's been discussed off and on
> >around here but it's a much larger and more invasive change than just
> >calling posix_fadvise().
> 
> Yes, I'm aware and I'm keeping my fingers crossed that maybe some day 

I don't think we should throw out the idea of using PrefetchBuffer()
here.  "Real" AIO would certainly be good to have one of these days, but
until then, posix_fadvise() could net us some of those gains in the
meantime.

> The ANALYZE just seem fit to be natural candidate to use it. The only easy 
> chance 
> of acceleration of stats gathering - at least to me and enduser point of view 
> - 
> is to have more parallel autoanalyze workers running to drive more I/O 
> concurrency 
> (by e.g. partitioning the data),  both in readahe

Re: patch: reduce overhead of execution of CALL statement in no atomic mode from PL/pgSQL

2020-11-03 Thread Pavel Stehule

Hi

I played with the profiler a little bit to get some data - see attached
file.  Almost all overhead is related to impossibility to use simple
expression evaluation (that has very good performance now).

Probably there is no simple way to reduce this overhead.

Regards

Pavel


call-graph-profile.gz
Description: application/gzip

Re: PATCH: Batch/pipelining support for libpq

2020-11-03 Thread David G. Johnston

On Mon, Nov 2, 2020 at 8:58 AM Alvaro Herrera 
wrote:

> On 2020-Nov-02, Alvaro Herrera wrote:
>
> > In v23 I've gone over docs; discovered that PQgetResults docs were
> > missing the new values.  Added those.  No significant other changes yet.
>
>
Just reading the documentation of this patch, haven't been following the
longer thread:

Given the caveats around blocking mode connections why not just require
non-blocking mode, in a similar fashion to how synchronous functions are
disallowed?

"Batched operations will be executed by the server in the order the client
sends them. The server will send the results in the order the statements
executed."

Maybe:

"The server executes statements, and returns results, in the order the
client sends them."

Using two sentences and relying on the user to mentally link the two "in
the order" descriptions together seems to add unnecessary cognitive load.

+ The client interleaves result
+ processing with sending batch queries, or for small batches may
+ process all results after sending the whole batch.

Suggest: "The client may choose to interleave result processing with
sending batch queries, or wait until the complete batch has been sent."

I would expect to process the results of a batch only after sending the
entire batch to the server.  That I don't have to is informative but
knowing when I should avoid doing so, and why, is informative as well.  To
the extreme while you can use batch mode and interleave if you just poll
getResult after every command you will make the whole batch thing
pointless.  Directing the reader from here to the section "Interleaving
Result Processing and Query Dispatch" seems worth considering.  The
dynamics of small sizes and sockets remains a bit unclear as to what will
break (if anything, or is it just process memory on the server) if
interleaving it not performed and sizes are large.

I would suggest placing commentary about "all transactions subsequent to a
failed transaction in a batch are ignored while previous completed
transactions are retained" in the "When to Use Batching".  Something like
"Batching is less useful, and more complex, when a single batch contains
multiple transactions (see Error Handling)."

My imagined use case would be to open a batch, start a transaction, send
all of its components, end the transaction, end the batch, check for batch
failure and if it doesn't fail have the option to easily continue without
processing individual pgResults (or if it does fail, have the option to
extract the first error pgResult and continue, ignoring the rest, knowing
that the transaction as a whole was reverted and the batch unapplied).
I've never interfaced with libpq directly.  Though given how the existing C
API works what is implemented here seems consistent.

The "queueing up queries into a pipeline to be executed as a batch on the
server" can be read as a client-side behavior where nothing is sent to the
server until the batch has been completed.  Reading further it becomes
clear that all it basically is is a sever-side toggle that instructs the
server to continue processing incoming commands even while prior commands
have their results waiting to be ingested by the client.

Batch seems like the user-visible term to describe this feature.  Pipeline
seems like an implementation detail that doesn't need to be mentioned in
the documentation - especially given that pipeline doesn't get a mentioned
beyond the first two paragraphs of the chapter and never without being
linked directly to "batch".  I would probably leave the indexterm and have
a paragraph describing that batching is implemented using a query pipeline
so that people with the implementation detail on their mind can find this
chapter, but the prose for the user should just stick to batching.

Sorry, that all is a bit unfocused, but the documentation for the user of
the API could be cleaned up a bit and some more words spent on
what trade-offs are being made when using batching versus normal
command-response processing.  That said, while I don't see all of this
purely a matter of style I'm also not seeing anything demonstrably wrong
with the documentation at the moment.  Hopefully my perspective helps
though, and depending on what happens next I may try and make my thoughts
more concrete with an actual patch.

David J.

Why does to_json take "anyelement" rather than "any"?

2020-11-03 Thread Nikhil Benesch


to_json is declared as taking "anyelement" as input, which means
you can't pass it something of unknown type:

postgres=# SELECT to_json('foo');
ERROR:  could not determine polymorphic type because input has type unknown

But this works fine with the very similar json_build_array function:

postgres=# SELECT json_build_array('foo');
 json_build_array
--
 ["foo"]
(1 row)

The difference is that json_build_array takes type "any" as input, while
to_json takes "anyelement" as input.

Is there some reason to_json couldn't be switched to take "any" as input?
Hacking this together seems to mostly just work:

postgres=# CREATE FUNCTION my_to_json ("any") RETURNS json LANGUAGE 
'internal' AS 'to_json';
postgres=# SELECT my_to_json('foo');
 my_to_json

 "foo"
(1 row)

Is there something I'm missing?

Nikhil

Re: Online checksums verification in the backend

2020-11-03 Thread Robert Haas

On Mon, Nov 2, 2020 at 2:35 PM Andres Freund  wrote:
> Wouldn't this be better served by having a ReadBufferExtended() flag,
> preventing erroring out and zeroing the buffer? I'm not sure that
> handling both the case where the buffer contents need to be valid and
> the one where it doesn't will make for a good API.

I'm not sure. The goal I had in mind was giving a caller a way to get
a copy of a buffer even if it's one we wouldn't normally admit into
shared_buffers. I think it's probably a bad idea to allow for a back
door where things that fail PageIsVerified() can nevertheless escape
into the buffer, but that doesn't mean a checker or recovery tool
shouldn't be allowed to see them.

> > If the buffer is in shared buffers, we could take a share-lock
> > on the buffer and copy the contents of the page as it exists on disk,
> > and then still check it.
>
> Don't think we need a share lock. That still allows the buffer to be
> written out (and thus a torn read). What we need is to set
> BM_IO_IN_PROGRESS on the buffer in question - only one backend can set
> that. And then unset that again, without unsetting
> BM_DIRTY/BM_JUST_DIRTIED.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] remove pg_archivecleanup and pg_standby

2020-11-03 Thread Robert Haas

On Thu, Oct 29, 2020 at 3:40 PM Michael Banck  wrote:
> I guess not many will complain about pg_standby going away, but I am
> under the impression that pg_archivecleanup is still used a lot in PITR
> backup environments as a handy tool to expire WAL related to expired
> base backups. I certainly saw hand-assembled shell code fail with "too
> many files" and things when it tried to act on large amount of WAL.

Yeah, I see pg_archivecleanup used in customer environments all the
time. Like just this morning, for example.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Batch/pipelining support for libpq

2020-11-03 Thread Matthieu Garrigues

I implemented a C++ async HTTP server using this new batch mode and it
provides everything I needed to transparently batch sql requests.
It gives a performance boost  between x2 and x3 on this benchmark:
https://www.techempower.com/benchmarks/#section=test&runid=3097dbae-5228-454c-ba2e-2055d3982790&hw=ph&test=query&a=2&f=zik0zj-zik0zj-zik0zj-zik0zj-zieepr-zik0zj-zik0zj-zik0zj-zik0zj-zik0zj-zik0zj

I'll ask other users interested in this to review the API.

Matthieu Garrigues

On Tue, Nov 3, 2020 at 4:56 PM Dave Cramer  wrote:
>
>
>
> On Tue, 3 Nov 2020 at 08:42, Alvaro Herrera  wrote:
>>
>> Hi Dave,
>>
>> On 2020-Nov-03, Dave Cramer wrote:
>>
>> > On Mon, 2 Nov 2020 at 10:57, Alvaro Herrera  
>> > wrote:
>> >
>> > > On 2020-Nov-02, Alvaro Herrera wrote:
>> > >
>> > > > In v23 I've gone over docs; discovered that PQgetResults docs were
>> > > > missing the new values.  Added those.  No significant other changes 
>> > > > yet.
>> >
>> > Thanks for looking at this.
>> >
>> > What else does it need to get it in shape to apply?
>>
>> I want to go over the code in depth to grok the design more fully.
>>
>> It would definitely help if you (and others) could think about the API
>> being added: Does it fulfill the promises being made?  Does it offer the
>> guarantees that real-world apps want to have?  I'm not much of an
>> application writer myself -- particularly high-traffic apps that would
>> want to use this.  As a driver author I would welcome your insight in
>> these questions.
>>
>
> I'm sort of in the same boat as you. While I'm closer to the client. I don't 
> personally write that much client code.
>
> I'd really like to hear from the users here.
>
>
> Dave Cramer
> www.postgres.rocks

Re: public schema default ACL

2020-11-03 Thread Robert Haas

On Mon, Nov 2, 2020 at 1:41 PM Stephen Frost  wrote:
> > What potentially could move the needle is separate search paths for
> > relation lookup and function/operator lookup.  We have sort of stuck
> > our toe in that pond already by discriminating against pg_temp for
> > function/operator lookup, but we could make that more formalized and
> > controllable if there were distinct settings.  I'm not sure offhand
> > how much of a compatibility problem that produces.
>
> While I agree with the general idea of giving users more granularity
> when it comes to what objects are allowed to be created by users, and
> where, and how objects are looked up, I really don't think this would
> end up being a sufficiently complete answer to a world-writable public
> schema.  You don't have to be able to create functions or operators in
> the public schema to make things dangerous for some other user poking
> around at the tables or views that you are allowed to create there.

I agree. Everything that can execute code is a risk, which also
includes things like triggers and RLS policies. Noah's certainly right
about the compatibility hazard, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: libpq compression

2020-11-03 Thread Konstantin Knizhnik




On 03.11.2020 18:08, Daniil Zakhlystov wrote:

Hi,

Looks like I found an issue: there can be a situation when libpq 
frontend hangs after zpq_read(). This may happen when the socket is 
not read ready and we can't perform another read because we wait on 
pqWait() but we actually have some buffered rx data.


I think we should add a check if there is any buffered rx data before 
calling pqWait() to avoid such hangs.


--
Daniil Zakhlystov


Hi, thank you very much for detecting the problem and especially 
providing fix for it.

I committed you pull request.
Patch based on this PR is attached to this mail .
The latest version of libpq sources can be also found in git repository: 
g...@github.com:postgrespro/libpq_compression.git


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/configure b/configure
index ace4ed5..deba608 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ LD
 LDFLAGS_SL
 LDFLAGS_EX
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 XML2_LIBS
@@ -867,6 +868,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 '
@@ -8571,6 +8573,85 @@ fi
 
 
 
+#
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test "${with_zstd+set}" = set; then :
+  withval=$with_zstd;
+  case $withval in
+yes)
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_zstd=no
+
+fi
+
+
+
+
+if test "$with_zstd" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5
+$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; }
+if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lzstd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char ZSTD_compress ();
+int
+main ()
+{
+return ZSTD_compress ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_zstd_ZSTD_compress=yes
+else
+  ac_cv_lib_zstd_ZSTD_compress=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5
+$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; }
+if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBZSTD 1
+_ACEOF
+
+  LIBS="-lzstd $LIBS"
+
+else
+  as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5
+fi
+
+fi
+
+
 
 #
 # Zlib
diff --git a/configure.ac b/configure.ac
index 5b91c83..93a5285 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1000,6 +1000,13 @@ PGAC_ARG_BOOL(with, zlib, yes,
 AC_SUBST(with_zlib)
 
 #
+# Zstd
+#
+PGAC_ARG_BOOL(with, zstd, no,
+  [use zstd])
+AC_SUBST(with_zstd)
+
+#
 # Assignments
 #
 
@@ -1186,6 +1193,14 @@ failure.  It is possible the compiler isn't looking in the proper directory.
 Use --without-zlib to disable zlib support.])])
 fi
 
+if test "$with_zstd" = yes; then
+  AC_CHECK_LIB(zstd, ZSTD_decompressStream, [],
+   [AC_MSG_ERROR([zstd library not found
+If you have zstd already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-zstd to disable zstd support.])])
+fi
+
 if test "$enable_spinlocks" = yes; then
   AC_DEFINE(HAVE_SPINLOCKS, 1, [Define to 1 if you have spinlocks.])
 else
@@ -1400,6 +1415,13 @@ failure.  It is possible the compiler isn't looking in the proper directory.
 Use --without-zlib to disable zlib support.])])
 fi
 
+if test "$with_zstd" = yes; then
+  AC_CHECK_HEADER(zstd.h, [], [AC_MSG_ERROR([zstd header not found
+If you have zstd already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-zstd to disable zstd support.])])
+fi
+
 if test "$with_gssapi" = yes ; then
   AC_CHECK_HEADERS(gssapi/gssapi.h, [],
 	[AC_CHECK_HEADERS(gssapi.h, [], [AC_MSG_ERROR([gssapi.h header file is required for GSSAPI])])])
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 9ce32fb..140724d 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1225,6 +1225,22 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
  
 
+ 
+  compression
+  
+  
+Request compression of libpq traffic. Client sends to the server list of compression algorithms, supported by client library.
+If server supports one of this algorithms, then it acknowledges use of this algorithm and then all libpq messages send bo

Re: PATCH: Batch/pipelining support for libpq

2020-11-03 Thread Dave Cramer

On Tue, 3 Nov 2020 at 08:42, Alvaro Herrera  wrote:

> Hi Dave,
>
> On 2020-Nov-03, Dave Cramer wrote:
>
> > On Mon, 2 Nov 2020 at 10:57, Alvaro Herrera 
> wrote:
> >
> > > On 2020-Nov-02, Alvaro Herrera wrote:
> > >
> > > > In v23 I've gone over docs; discovered that PQgetResults docs were
> > > > missing the new values.  Added those.  No significant other changes
> yet.
> >
> > Thanks for looking at this.
> >
> > What else does it need to get it in shape to apply?
>
> I want to go over the code in depth to grok the design more fully.
>
> It would definitely help if you (and others) could think about the API
> being added: Does it fulfill the promises being made?  Does it offer the
> guarantees that real-world apps want to have?  I'm not much of an
> application writer myself -- particularly high-traffic apps that would
> want to use this.  As a driver author I would welcome your insight in
> these questions.
>
>
I'm sort of in the same boat as you. While I'm closer to the client. I
don't personally write that much client code.

I'd really like to hear from the users here.


Dave Cramer
www.postgres.rocks

Re: [PATCH] remove pg_archivecleanup and pg_standby

2020-11-03 Thread Heikki Linnakangas


On 02/11/2020 20:26, Justin Pryzby wrote:

On Thu, Oct 29, 2020 at 08:40:31PM +0100, Michael Banck wrote:

Am Mittwoch, den 28.10.2020, 21:44 -0500 schrieb Justin Pryzby:

Forking this thread:
https://www.postgresql.org/message-id/fd93f1c5-7818-a02c-01e5-1075ac0d4...@iki.fi



I think these are old-fashioned since 9.6 (?), so remove them for v14.


Why 9.6?


My work doesn't currently bring me in contact with replication, so I've had to
dig through release notes.  I think streaming replication was new in 9.0, and
increasingly mature throughout 9.x.  Maybe someone else will say a different
release was when streaming replication became the norm and wal shipping old.


Removing pg_standby has been proposed a couple of times in the past. See 
https://www.postgresql.org/message-id/20170913064824.rqflkadxwpboa...@alap3.anarazel.de 
for the latest attempt.


Masao-san, back in 2014 you mentioned "fast failover" as a feature that 
was missing from the built-in standby mode 
(https://www.postgresql.org/message-id/CAHGQGwEE_8vvpQk0ex6Qa_aXt-OSJ7OdZjX4uM_FtqKfxq5SbQ%40mail.gmail.com). 
I think that's been implemented since, with the recovery_target 
settings. Would you agree?


I'm pretty sure we can remove pg_standby by now. But if there's 
something crucial missing from the built-in facilities, we need to talk 
about implementing them.


- Heikki

Re: language cleanups in code and docs

On Wed, Oct 21, 2020 at 11:23 PM Thomas Munro  wrote:
>
> On Wed, Jun 17, 2020 at 10:32 PM Magnus Hagander  wrote:
> > In looking at this I realize we also have exactly one thing referred to as 
> > "blacklist" in our codebase, which is the "enum blacklist" (and then a 
> > small internal variable in pgindent). AFAICT, it's not actually exposed to 
> > userspace anywhere, so we could probably make the attached change to 
> > blocklist at no "cost" (the only thing changed is the name of the hash 
> > table, and we definitely change things like that in normal releases with no 
> > specific thought on backwards compat).
>
> +1
>
> Hmm, can we find a more descriptive name for this mechanism?  What
> about calling it the "uncommitted enum table"?  See attached.

Thanks for picking this one up again!

Agreed, that's a much better choice.

The term itself is a bit of a mouthful, but it does describe what it
is in a much more clear way than what the old term did anyway.

Maybe consider just calling it "uncomitted enums", which would as a
bonus have it not end up talking about uncommittedenumtablespace which
gets hits on searches for tablespace.. Though I'm not sure that's
important.

I'm +1 to the change with or without that adjustment.

As for the code, I note that:
+   /* Set up the enum table if not already done in this transaction */

forgets to say it's *uncomitted* enum table -- which is the important
part of I believe.

And
+ * Test if the given enum value is in the table of blocked enums.

should probably talk about uncommitted rather than blocked?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: Dumping/restoring fails on inherited generated column

2020-11-03 Thread Peter Eisentraut


On 2020-09-25 15:07, Peter Eisentraut wrote:

We could probably fix this by having ALTER TABLE ONLY / DROP EXPRESSION
update the attlocal column of direct children to true, to make the
catalog state look like something that can be restored.  However, that's
a fair amount of complicated code, so for now I propose to just prohibit
this command, meaning you can't use ONLY in this command if there are
children.  This is new in PG13, so this change would have very limited
impact in practice.


With the minor releases coming up, I have committed this patch and will 
work on getting the remaining pg_dump issues fixed as well.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: psql \df choose functions by their arguments

2020-11-03 Thread Greg Sabino Mullane

On Sun, Nov 1, 2020 at 12:05 PM David Christensen  wrote:

>
> I can’t speak for the specific patch, but tab completion of proc args for
> \df, \ef and friends has long been a desired feature of mine, particularly
> when you are dealing with functions with huge numbers of arguments and the
> same name which I have (sadly) come across many times in the wild.
>

If someone can get this working against this current patch, that would be
great, but I suspect it will require some macro-jiggering in tab-complete.c
and possibly more, so yeah, could be something to add to the todo list.

Cheers,
Greg

Re: psql \df choose functions by their arguments

2020-11-03 Thread Greg Sabino Mullane

Thanks for looking this over!


> some Abbreviations of common types are not added to the
> type_abbreviations[] Such as:
>
> Int8=> bigint
>

I wasn't aiming to provide a canonical list, as I personally have never
seen anyone use int8 instead of bigint when (for example) creating a
function, but I'm not strongly opposed to expanding the list.

Single array seems difficult to handle it, may be we can use double array
> or use a struct.
>

I think the single works out okay, as this is a simple write-once variable
that is not likely to get updated often.


> And I think It's better to update '/?' info about '\df[+]' in function
> slashUsage(unsigned short int pager).
>

Suggestions welcome, but it's already pretty tight in there, so I couldn't
think of anything:

fprintf(output, _("  \\dew[+] [PATTERN]  list foreign-data
wrappers\n"));
fprintf(output, _("  \\df[anptw][S+] [PATRN] list [only
agg/normal/procedures/trigger/window] functions\n"));
fprintf(output, _("  \\dF[+]  [PATTERN]  list text search
configurations\n"));

The \df option is already our longest one, even with the silly attempt to
shorten PATTERN :)

Cheers,
Greg

Re: PATCH: Batch/pipelining support for libpq

2020-11-03 Thread Alvaro Herrera

Hi Dave,

On 2020-Nov-03, Dave Cramer wrote:

> On Mon, 2 Nov 2020 at 10:57, Alvaro Herrera  wrote:
> 
> > On 2020-Nov-02, Alvaro Herrera wrote:
> >
> > > In v23 I've gone over docs; discovered that PQgetResults docs were
> > > missing the new values.  Added those.  No significant other changes yet.
>
> Thanks for looking at this.
> 
> What else does it need to get it in shape to apply?

I want to go over the code in depth to grok the design more fully.

It would definitely help if you (and others) could think about the API
being added: Does it fulfill the promises being made?  Does it offer the
guarantees that real-world apps want to have?  I'm not much of an
application writer myself -- particularly high-traffic apps that would
want to use this.  As a driver author I would welcome your insight in
these questions.

Re: PATCH: Batch/pipelining support for libpq

2020-11-03 Thread Dave Cramer

Alvaro,


On Mon, 2 Nov 2020 at 10:57, Alvaro Herrera  wrote:

> On 2020-Nov-02, Alvaro Herrera wrote:
>
> > In v23 I've gone over docs; discovered that PQgetResults docs were
> > missing the new values.  Added those.  No significant other changes yet.
>
>
>
Thanks for looking at this.

What else does it need to get it in shape to apply?


Dave Cramer
www.postgres.rocks

Re: Any objections to implementing LogicalDecodeMessageCB for pgoutput?

2020-11-03 Thread Dave Cramer

David,

On Thu, 24 Sep 2020 at 00:22, Michael Paquier  wrote:

> On Tue, Sep 08, 2020 at 12:18:23PM -0700, Andres Freund wrote:
> > A test verifying that a non-transactional message in later aborted
> > transaction is handled correctly would be good.
>
> On top of that, the patch needs a rebase as it visibly fails to apply,
> per the CF bot.
> --
> Michael
>

Where are you with this? Are you able to work on it ?
Dave Cramer

Re: Move OpenSSL random under USE_OPENSSL_RANDOM

On Tue, Nov 3, 2020 at 1:00 PM Daniel Gustafsson  wrote:
>
> > On 3 Nov 2020, at 11:35, Michael Paquier  wrote:
> >
> > On Tue, Nov 03, 2020 at 10:15:48AM +0100, Magnus Hagander wrote:
> >> On Wed, Aug 26, 2020 at 2:19 PM Daniel Gustafsson  wrote:
> >>> That's certainly true.  The intention though is to make the code easier to
> >>> follow (more explicit/discoverable) for anyone trying to implement 
> >>> support for
> >>
> >> Is it really a reasonable usecase to use RAND_bytes() outside of both
> >> pg_stroing_random() *and' outside of the openssl-specific files (like
> >> be-secure-openssl.c)? Because it would only be those cases that would
> >> have this case, right?
> >
> > It does not sound that strange to me to assume if some out-of-core
> > code makes use of that to fetch a random set of bytes.  Now I don't
> > know of any code doing that.  Who knows.
>
> Even if there are, I doubt such codepaths will be stumped by using
> USE_OPENSSL_RANDOM for pg_strong_random code as opposed to USE_OPENSSL.
>
> >> If anything, perhaps the call to RAND_poll() in fork_process.c should
> >> actually be a call to a strong_random_initialize() or something which
> >> would have an implementation in pg_strong_random.c, thereby isolating
> >> the openssl specific code in there? (And with a void implementation
> >> without openssl)
> >
> > I don't think that we have any need to go to such extent just for this
> > case, as RAND_poll() after forking a process is irrelevant in 1.1.1.
> > We are still many years away from removing its support though.
>
> Agreed, I doubt we'll be able to retire our <1.1.1 suppport any time soon.
>
> > No idea if other SSL implementations would require such a thing.
> > Daniel, what about NSS?
>
> PK11_GenerateRandom in NSS requires an NSSContext to be set up after fork,
> which could be performed in such an _initialize function.  The PRNG in NSPR 
> has
> a similar requirement (which may be the one the NSS patch end up using, not
> sure yet).
>
> I kind of like the idea of continuing to abstract this functionality, not
> pulling in OpenSSL headers in fork_process.c is a neat bonus.  I'd say it's
> worth implementing to see what it would imply, and am happy to do unless
> someone beats me to it.

Yeah, if it's likely to be usable in the other implementations, then I
think we should definitely explore exactly what that kind of an
abstraction would imply. Anything isolating the dependency on OpenSSL
would likely have to be done at that time anyway in that case, so
better have it ready.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: Re: parallel distinct union and aggregate support patch

2020-11-03 Thread Dilip Kumar

On Thu, Oct 29, 2020 at 12:53 PM bu...@sohu.com  wrote:
>
> > 1) It's better to always include the whole patch series - including the
> > parts that have not changed. Otherwise people have to scavenge the
> > thread and search for all the pieces, which may be a source of issues.
> > Also, it confuses the patch tester [1] which tries to apply patches from
> > a single message, so it will fail for this one.
>  Pathes 3 and 4 do not rely on 1 and 2 in code.
>  But, it will fail when you apply the apatches 3 and 4 directly, because
>  they are written after 1 and 2.
>  I can generate a new single patch if you need.
>
> > 2) I suggest you try to describe the goal of these patches, using some
> > example queries, explain output etc. Right now the reviewers have to
> > reverse engineer the patches and deduce what the intention was, which
> > may be causing unnecessary confusion etc. If this was my patch, I'd try
> > to create a couple examples (CREATE TABLE + SELECT + EXPLAIN) showing
> > how the patch changes the query plan, showing speedup etc.
>  I written some example queries in to regress, include "unique" "union"
>  "group by" and "group by grouping sets".
>  here is my tests, they are not in regress
> ```sql
> begin;
> create table gtest(id integer, txt text);
> insert into gtest select t1.id,'txt'||t1.id from (select 
> generate_series(1,1000*1000) id) t1,(select generate_series(1,10) id) t2;
> analyze gtest;
> commit;
> set jit = off;
> \timing on
> ```
> normal aggregate times
> ```
> set enable_batch_hashagg = off;
> explain (costs off,analyze,verbose)
> select sum(id),txt from gtest group by txt;
>  QUERY PLAN
> -
>  Finalize GroupAggregate (actual time=6469.279..8947.024 rows=100 loops=1)
>Output: sum(id), txt
>Group Key: gtest.txt
>->  Gather Merge (actual time=6469.245..8165.930 rows=158 loops=1)
>  Output: txt, (PARTIAL sum(id))
>  Workers Planned: 2
>  Workers Launched: 2
>  ->  Sort (actual time=6356.471..7133.832 rows=53 loops=3)
>Output: txt, (PARTIAL sum(id))
>Sort Key: gtest.txt
>Sort Method: external merge  Disk: 11608kB
>Worker 0:  actual time=6447.665..7349.431 rows=317512 loops=1
>  Sort Method: external merge  Disk: 10576kB
>Worker 1:  actual time=6302.882..7061.157 rows=01 loops=1
>  Sort Method: external merge  Disk: 2kB
>->  Partial HashAggregate (actual time=2591.487..4430.437 
> rows=53 loops=3)
>  Output: txt, PARTIAL sum(id)
>  Group Key: gtest.txt
>  Batches: 17  Memory Usage: 4241kB  Disk Usage: 113152kB
>  Worker 0:  actual time=2584.345..4486.407 rows=317512 
> loops=1
>Batches: 17  Memory Usage: 4241kB  Disk Usage: 101392kB
>  Worker 1:  actual time=2584.369..4393.244 rows=01 
> loops=1
>Batches: 17  Memory Usage: 4241kB  Disk Usage: 112832kB
>  ->  Parallel Seq Scan on public.gtest (actual 
> time=0.691..603.990 rows=333 loops=3)
>Output: id, txt
>Worker 0:  actual time=0.104..607.146 rows=3174970 
> loops=1
>Worker 1:  actual time=0.100..603.951 rows=3332785 
> loops=1
>  Planning Time: 0.226 ms
>  Execution Time: 9021.058 ms
> (29 rows)
>
> Time: 9022.251 ms (00:09.022)
>
> set enable_batch_hashagg = on;
> explain (costs off,analyze,verbose)
> select sum(id),txt from gtest group by txt;
>QUERY PLAN
> -
>  Gather (actual time=3116.666..5740.826 rows=100 loops=1)
>Output: (sum(id)), txt
>Workers Planned: 2
>Workers Launched: 2
>->  Parallel BatchHashAggregate (actual time=3103.181..5464.948 
> rows=33 loops=3)
>  Output: sum(id), txt
>  Group Key: gtest.txt
>  Worker 0:  actual time=3094.550..5486.992 rows=326082 loops=1
>  Worker 1:  actual time=3099.562..5480.111 rows=324729 loops=1
>  ->  Parallel Seq Scan on public.gtest (actual time=0.791..656.601 
> rows=333 loops=3)
>Output: id, txt
>Worker 0:  actual time=0.080..646.053 rows=3057680 loops=1
>Worker 1:  actual time=0.070..662.754 rows=3034370 loops=1
>  Planning Time: 0.243 ms
>  Execution Time: 5788.981 ms
> (15 rows)
>
> Time: 5790.143 ms (00:05.790)
> ```
>
> grouping sets times
> ```
> set enable_batch_hashagg = off;
> explain (costs off,analyze,verbose)
> select sum(id),txt from gtest group by grouping sets(id,txt,());
>

Re: Parallel copy

2020-11-03 Thread Heikki Linnakangas


On 03/11/2020 10:59, Amit Kapila wrote:

On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:

However, the point of parallel copy is to maximize bandwidth.


Okay, but this first-phase (finding the line boundaries) can anyway
be not done in parallel and we have seen in some of the initial 
benchmarking that this initial phase is a small part of work 
especially when the table has indexes, constraints, etc. So, I think 
it won't matter much if this splitting is done in a single process

or multiple processes.
Right, it won't matter performance-wise. That's not my point. The 
difference is in the complexity. If you don't store the line boundaries 
in shared memory, you get away with much simpler shared memory structures.



I think something close to that is discussed as you have noticed in
your next email but IIRC, because many people (Andres, Ants, myself
and author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much
in the POC stage due to which the code doesn't look good and as we
move forward we need to see what is the better way to improve it,
maybe one of the ways is to split it as you are suggesting so that it
can be easier to review.


Sure. I think the roadmap here is:

1. Split copy.c [1]. Not strictly necessary, but I think it'd make this 
nice to review and work with.


2. Refactor CopyReadLine(), so that finding the line-endings and the 
rest of the line-parsing are separated into separate functions.


3. Implement parallel copy.


I think the other important thing which this
patch has not addressed properly is the parallel-safety checks as
pointed by me earlier. There are two things to solve there (a) the
lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
APIs, etc.) have checks which doesn't allow any writes, we need to see
which of those we can open now (or do some additional work to prevent
from those checks) after some of the work done for parallel-writes in
PG-13[1][2], and (b) in which all cases we can parallel-writes
(parallel copy) is allowed, for example need to identify whether table
or one of its partitions has any constraint/expression which is
parallel-unsafe.


Agreed, that needs to be solved. I haven't given it any thought myself.

- Heikki

[1] 
https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi

Re: ModifyTable overheads in generic plans

2020-11-03 Thread Heikki Linnakangas


On 03/11/2020 10:27, Amit Langote wrote:

Please check the attached if that looks better.


Great, thanks! Yeah, I like that much better.

This makes me a bit unhappy:



/* Also let FDWs init themselves for foreign-table result rels 
*/
if (resultRelInfo->ri_FdwRoutine != NULL)
{
if (resultRelInfo->ri_usesFdwDirectModify)
{
ForeignScanState *fscan = (ForeignScanState *) 
mtstate->mt_plans[i];

/*
 * For the FDW's convenience, set the 
ForeignScanState node's
 * ResultRelInfo to let the FDW know which 
result relation it
 * is going to work with.
 */
Assert(IsA(fscan, ForeignScanState));
fscan->resultRelInfo = resultRelInfo;

resultRelInfo->ri_FdwRoutine->BeginDirectModify(fscan, eflags);
}
else if 
(resultRelInfo->ri_FdwRoutine->BeginForeignModify != NULL)
{
List   *fdw_private = (List *) 
list_nth(node->fdwPrivLists, i);


resultRelInfo->ri_FdwRoutine->BeginForeignModify(mtstate,

 resultRelInfo,

 fdw_private,

 i,

 eflags);
}
}


If you remember, I was unhappy with a similar assertion in the earlier 
patches [1]. I'm not sure what to do instead though. A few options:


A) We could change FDW API so that BeginDirectModify takes the same 
arguments as BeginForeignModify(). That avoids the assumption that it's 
a ForeignScan node, because BeginForeignModify() doesn't take 
ForeignScanState as argument. That would be consistent, which is nice. 
But I think we'd somehow still need to pass the ResultRelInfo to the 
corresponding ForeignScan, and I'm not sure how.


B) Look up the ResultRelInfo, and call BeginDirectModify(), on the first 
call to ForeignNext().


C) Accept the Assertion. And add an elog() check in the planner for that 
with a proper error message.


I'm leaning towards B), but maybe there's some better solution I didn't 
think of? Perhaps changing the API would make sense in any case, it is a 
bit weird as it is. Backwards-incompatible API changes are not nice, but 
I don't think there are many FDWs out there that implement the 
DirectModify functions. And those functions are pretty tightly coupled 
with the executor and ModifyTable node details anyway, so I don't feel 
like we can, or need to, guarantee that they stay unchanged across major 
versions.


[1] 
https://www.postgresql.org/message-id/19c23dd9-89ce-75a3-9105-5fc05a46f94a%40iki.fi


- Heikki

Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-03 Thread Daniel Gustafsson

> On 3 Nov 2020, at 11:35, Michael Paquier  wrote:
> 
> On Tue, Nov 03, 2020 at 10:15:48AM +0100, Magnus Hagander wrote:
>> On Wed, Aug 26, 2020 at 2:19 PM Daniel Gustafsson  wrote:
>>> That's certainly true.  The intention though is to make the code easier to
>>> follow (more explicit/discoverable) for anyone trying to implement support 
>>> for
>> 
>> Is it really a reasonable usecase to use RAND_bytes() outside of both
>> pg_stroing_random() *and' outside of the openssl-specific files (like
>> be-secure-openssl.c)? Because it would only be those cases that would
>> have this case, right?
> 
> It does not sound that strange to me to assume if some out-of-core
> code makes use of that to fetch a random set of bytes.  Now I don't
> know of any code doing that.  Who knows.

Even if there are, I doubt such codepaths will be stumped by using
USE_OPENSSL_RANDOM for pg_strong_random code as opposed to USE_OPENSSL.

>> If anything, perhaps the call to RAND_poll() in fork_process.c should
>> actually be a call to a strong_random_initialize() or something which
>> would have an implementation in pg_strong_random.c, thereby isolating
>> the openssl specific code in there? (And with a void implementation
>> without openssl)
> 
> I don't think that we have any need to go to such extent just for this
> case, as RAND_poll() after forking a process is irrelevant in 1.1.1.
> We are still many years away from removing its support though.

Agreed, I doubt we'll be able to retire our <1.1.1 suppport any time soon.

> No idea if other SSL implementations would require such a thing.
> Daniel, what about NSS?

PK11_GenerateRandom in NSS requires an NSSContext to be set up after fork,
which could be performed in such an _initialize function.  The PRNG in NSPR has
a similar requirement (which may be the one the NSS patch end up using, not
sure yet).

I kind of like the idea of continuing to abstract this functionality, not
pulling in OpenSSL headers in fork_process.c is a neat bonus.  I'd say it's
worth implementing to see what it would imply, and am happy to do unless
someone beats me to it.

cheers ./daniel

Multi Inserts in CREATE TABLE AS - revived patch

2020-11-03 Thread Bharath Rupireddy

Hi,

I would like to propose an updated patch on multi/bulk inserts in CTAS [1]
that tries to address the review comments that came up in [1]. One of the
main review comments was to calculate/estimate the tuple size to decide on
when to flush. I tried to solve this point with a new function
GetTupleSize()(see the patch for implementation).

I did some testing with custom configuration[2].

Use case 1- 100mn tuples, 2 integer columns, exec time in sec:
HEAD: *131.507* when the select part is not parallel, 128.832 when the
select part is parallel
Patch: 98.925 when the select part is not parallel, *52.901* when the
select part is parallel

Use case 2- 10mn tuples, 4 integer and 6 text columns, exec time in sec:
HEAD: *76.801* when the select part is not parallel, 66.074 when the select
part is parallel
Patch: 74.083 when the select part is not parallel, *57.739* when the
select part is parallel

Thoughts?

If the approach followed in the patch looks okay, I can work on a separate
patch for multi inserts in refresh materialized view cases.

I thank Simon Riggs for the offlist discussion.

PS: I chose to start a new thread as the previous thread [1] was closed in
the CF. I hope that's not a problem.

[1] -
https://www.postgresql.org/message-id/CAEET0ZHRWxbRUgwzUK_tOFDWx7VE2-P%3DxMBT6-N%2BgAa9WQ%3DxxA%40mail.gmail.com
[2] - The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v1-0001-Multi-Inserts-in-Create-Table-As.patch
Description: Binary data

Re: contrib/sslinfo cleanup and OpenSSL errorhandling

On Tue, Nov 3, 2020 at 10:22 AM Daniel Gustafsson  wrote:
>
> > On 3 Nov 2020, at 10:05, Magnus Hagander  wrote:
>
> > Applied, with the small adjustment of the comma in the docs.
>
> Thanks!
>
> > I wonder if we should perhaps backpatch 0002? The changes to sslinfo
> > that were ported go all the way back to 9.6, so it should be a safe
> > one I think?
>
> It should be safe given that the code has been in production for a long time.
> That being said, sslinfo doesn't have tests (yet) and probably isn't used that
> much; perhaps it's best to let this mature in HEAD for a few buildfarm cycles
> first?

Yeah, that's a good point. I didn't realize we had no tests for that
one, so then it's a bit less safe in that regard. I agree with leaving
it for at least one complete buildfarm run before backporting
anything.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: Bogus documentation for bogus geometric operators

2020-11-03 Thread Pavel Borisov

 Emre, could you please again rebase your patch on top of
2f70fdb0644c32c4154236c2b5c241bec92eac5e
?
 It is not applied anymore.

>

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: hash_array_extended() needs to pass down collation

On Mon, Nov 02, 2020 at 10:01:53AM -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
>> I noticed that hash_array_extended() does not pass down the collation to 
>> the element's collation function, unlike hash_array().  As a 
>> consequence, hash partitioning using text arrays as partition key fails.
> 
>> The attached patch fixes this.  I propose to backpatch this.
> 
> LGTM

+1.
--
Michael


signature.asc
Description: PGP signature

Re: Move OpenSSL random under USE_OPENSSL_RANDOM

On Tue, Nov 03, 2020 at 10:15:48AM +0100, Magnus Hagander wrote:
> On Wed, Aug 26, 2020 at 2:19 PM Daniel Gustafsson  wrote:
>> That's certainly true.  The intention though is to make the code easier to
>> follow (more explicit/discoverable) for anyone trying to implement support 
>> for
> 
> Is it really a reasonable usecase to use RAND_bytes() outside of both
> pg_stroing_random() *and' outside of the openssl-specific files (like
> be-secure-openssl.c)? Because it would only be those cases that would
> have this case, right?

It does not sound that strange to me to assume if some out-of-core
code makes use of that to fetch a random set of bytes.  Now I don't
know of any code doing that.  Who knows.

> If anything, perhaps the call to RAND_poll() in fork_process.c should
> actually be a call to a strong_random_initialize() or something which
> would have an implementation in pg_strong_random.c, thereby isolating
> the openssl specific code in there? (And with a void implementation
> without openssl)

I don't think that we have any need to go to such extent just for this
case, as RAND_poll() after forking a process is irrelevant in 1.1.1.
We are still many years away from removing its support though.

No idea if other SSL implementations would require such a thing.
Daniel, what about NSS?
--
Michael


signature.asc
Description: PGP signature

Re: Online checksums verification in the backend

On Mon, Nov 02, 2020 at 11:34:57AM -0800, Andres Freund wrote:
> On 2020-11-02 12:35:30 -0500, Robert Haas wrote:
>> On Thu, Oct 29, 2020 at 2:17 PM Andres Freund  wrote:
>>> I think this needs to be quickly reworked or reverted.
> 
> I think it's fairly clear by now that revert is appropriate for now.

Yep, that's clear.  I'll deal with that tomorrow.  That's more than a
simple rework.

>> I don't like this patch, either. In addition to what Andres mentioned,
>> CheckBuffer() is a completely-special case mechanism which can't be
>> reused by anything else. In particular, the amcheck for heap stuff
>> which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1)
>> would really like a way to examine a buffer without risking an error
>> if PageIsVerified() should happen to fail, but this patch is of
>> absolutely no use in getting there, because CheckBuffer() doesn't give
>> the caller any way to access the contents of the buffer. It can only
>> do the checks that it knows how to do, and that's it. That doesn't
>> seem like a good design.
> 
> Wouldn't this be better served by having a ReadBufferExtended() flag,
> preventing erroring out and zeroing the buffer? I'm not sure that
> handling both the case where the buffer contents need to be valid and
> the one where it doesn't will make for a good API.

If you grep for ReadBuffer_common() is some of the emails I sent..  I
was rather interested in something like that.

>> I don't like the fact that CheckBuffer() silently skips dirty buffers,
>> either. The comment should really say that it checks the state of a
>> buffer without loading it into shared buffers, except sometimes it
>> doesn't actually check it.
> 
> Yea, I don't see a good reason for that either. There's reasons for
> dirty buffers that aren't WAL logged - so if the on-disk page is broken,
> a standby taken outside pg_basebackup would possibly still end up with a
> corrupt on-disk page. Similar with a crash restart.

Er, if you don't skip dirty buffers, wouldn't you actually report some
pages as broken if attempting to run those in a standby who may have
some torn pages from a previous base backup?  You could still run into
problems post-promotion, until the first checkpoint post-recovery
happens, no?

>> If the buffer is in shared buffers, we could take a share-lock
>> on the buffer and copy the contents of the page as it exists on disk,
>> and then still check it.
> 
> Don't think we need a share lock. That still allows the buffer to be
> written out (and thus a torn read). What we need is to set
> BM_IO_IN_PROGRESS on the buffer in question - only one backend can set
> that. And then unset that again, without unsetting
> BM_DIRTY/BM_JUST_DIRTIED.

If that can work, we could make use of some of that for base backups
for a single retry of a page that initially failed.
--
Michael


signature.asc
Description: PGP signature

Re: [PATCH] remove deprecated v8.2 containment operators

2020-11-03 Thread Peter Eisentraut


On 2020-10-27 04:25, Justin Pryzby wrote:

Forking this thread:
https://www.postgresql.org/message-id/fd93f1c5-7818-a02c-01e5-1075ac0d4...@iki.fi

They have been deprecated for a Long Time, so finally remove them for v14.
Four fewer exclamation marks makes the documentation less exciting, which is a
good thing.


I have committed the parts that remove the built-in geometry operators 
and the related regression test changes.


The changes to the contrib modules appear to be incomplete in some ways. 
 In cube, hstore, and seg, there are no changes to the extension 
scripts to remove the operators.  All you're doing is changing the C 
code to no longer recognize the strategy, but that doesn't explain what 
will happen if the operator is still used.  In intarray, by contrast, 
you're editing an existing extension script, but that should be done by 
an upgrade script instead.

Re: -O switch

On Mon, Nov 2, 2020 at 6:58 PM Tom Lane  wrote:
>
> Magnus Hagander  writes:
> > PFA a patch to do this.
>
> One thing you missed is that the getopt() calls in both postmaster.c
> and postgres.c have 'o:' entries that should be removed.  Also IIRC
> there is a "case 'o'" in postgres.c to go along with that.

Ha. Of course. Oops.

PFA updated.


> > Initially I kept the dynamic argv/argc in even though it's now
> > hardcoded, in case we wanted to add something back. But given the way
> > it looks now, perhaps we should just get rid of BackendRun()
> > completely and directly call PostgresMain()? Or keep BackendRun() with
> > just setting the TopMemoryContext, but removing the dynamic parts?
>
> I'd be inclined to keep it as-is for now.  It's not adding any significant
> amount of cycles compared to the process fork, so we might as well
> preserve flexibility.
>
> Is it really possible to not include miscadmin.h in postmaster.c?
> I find that a bit surprising.

I did too, but having removed it postmaster.c still compiles fine
without warnings for me. It did also pass the cfbot build step, but it
might be that it'll eventually break down on some more different
buildfarm animal.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/ref/postgres-ref.sgml b/doc/src/sgml/ref/postgres-ref.sgml
index fda678e345..4aaa7abe1a 100644
--- a/doc/src/sgml/ref/postgres-ref.sgml
+++ b/doc/src/sgml/ref/postgres-ref.sgml
@@ -280,32 +280,6 @@ PostgreSQL documentation
   
  
 
- 
-  -o extra-options
-  
-   
-The command-line-style arguments specified in extra-options are passed to
-all server processes started by this
-postgres process.
-   
-
-   
-Spaces within extra-options are
-considered to separate arguments, unless escaped with a backslash
-(\); write \\ to represent a literal
-backslash.  Multiple arguments can also be specified via multiple
-uses of -o.
-   
-
-   
-The use of this option is obsolete; all command-line options
-for server processes can be specified directly on the
-postgres command line.
-   
-  
- 
-
  
   -p port
   
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index a4dd233c7f..b6e5128832 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -323,7 +323,6 @@ help(const char *progname)
 	printf(_("  -l enable SSL connections\n"));
 #endif
 	printf(_("  -N MAX-CONNECT maximum number of allowed connections\n"));
-	printf(_("  -o OPTIONS pass \"OPTIONS\" to each server process (obsolete)\n"));
 	printf(_("  -p PORTport number to listen on\n"));
 	printf(_("  -s show statistics after each query\n"));
 	printf(_("  -S WORK-MEMset amount of memory for sorts (in kB)\n"));
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 959e3b8873..5abccf9e07 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -105,7 +105,6 @@
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
-#include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
 #include "port/pg_bswap.h"
@@ -219,12 +218,6 @@ int			ReservedBackends;
 /* The socket(s) we're listening to. */
 #define MAXLISTEN	64
 static pgsocket ListenSocket[MAXLISTEN];
-
-/*
- * Set by the -o option
- */
-static char ExtraOptions[MAXPGPATH];
-
 /*
  * These globals control the behavior of the postmaster in case some
  * backend dumps core.  Normally, it kills all peers of the dead backend
@@ -537,7 +530,6 @@ typedef struct
 #endif
 	char		my_exec_path[MAXPGPATH];
 	char		pkglib_path[MAXPGPATH];
-	char		ExtraOptions[MAXPGPATH];
 } BackendParameters;
 
 static void read_backend_variables(char *id, Port *port);
@@ -694,7 +686,7 @@ PostmasterMain(int argc, char *argv[])
 	 * tcop/postgres.c (the option sets should not conflict) and with the
 	 * common help() function in main/main.c.
 	 */
-	while ((opt = getopt(argc, argv, "B:bc:C:D:d:EeFf:h:ijk:lN:nOo:Pp:r:S:sTt:W:-:")) != -1)
+	while ((opt = getopt(argc, argv, "B:bc:C:D:d:EeFf:h:ijk:lN:nOPp:r:S:sTt:W:-:")) != -1)
 	{
 		switch (opt)
 		{
@@ -773,13 +765,6 @@ PostmasterMain(int argc, char *argv[])
 SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 break;
 
-			case 'o':
-/* Other options to pass to the backend on the command line */
-snprintf(ExtraOptions + strlen(ExtraOptions),
-		 sizeof(ExtraOptions) - strlen(ExtraOptions),
-		 " %s", optarg);
-break;
-
 			case 'P':
 SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
 break;
@@ -4496,26 +4481,14 @@ BackendRun(Port *port)
 
 	/*
 	 * Now, build the argv vector that will be given to PostgresMain.
-	 *
-	 * The maximum possible number of commandline argument

Re: Online checksums verification in the backend

On Tue, Nov 03, 2020 at 09:31:20AM +0100, Magnus Hagander wrote:
> On Mon, Nov 2, 2020 at 8:35 PM Andres Freund  wrote:
>> On 2020-11-02 12:35:30 -0500, Robert Haas wrote:
>>> It feels really confusing to me that the user-exposed function here is
>>> called pg_relation_check_pages(). How is the user supposed to
>>> understand the difference between what this function does and what the
>>> new verify_heapam() in amcheck does? The answer is that the latter
>>> does far more extensive checks, but this isn't obvious from the SGML
>>> documentation, which says only that the blocks are "verified," as if
>>> an end-user can reasonably be expected to know what that means. It
>>> seems likely to lead users to the belief that if this function passes,
>>> they are in good shape, which is extremely far from being true. Just
>>> look at what PageIsVerified() checks compared to what verify_heapam()
>>> checks.

The cases of verify_heapam() are much wider as they target only one
AM, while this stuff should remain more general.  There seems to be
some overlap in terms of the basic checks done by bufmgr.c, and
the fact that you may not want to be that much intrusive with the
existing buffer pool as well when running the AM checks.  It also
seems to me that the use cases are quite different for both, the
original goal of this thread is to detect physical corruptions for all
AMs, while verify_heapam() looks after logical corruptions in the way
heap is handled.

>> Yea I had similar thoughts, it should just be called
>> pg_checksum_verify_relation() or something.
> 
> +1.

I mentioned that upthread, is there really a dependency with checksums
here?  There are two cases where we can still apply some checks on a
page, without any need of checksums:
- The state of the page header.
- Zeroed page if pd_upper is 0.  Those pages are valid, and don't have
a checksum computed.
So it seems to me that when it comes to relation pages, then the
check of a page should answer to the question: is this page loadable
in shared buffers, or not?

>>> In fact, I would argue that this functionality ought to live in
>>> amcheck rather than core, though there could usefully be enabling
>>> functions in core.
>>
>> I'm not really convinced by this though. It's not really AM
>> specific - works for all types of relations with storage; don't really
>> object either...
> 
> Yeah, I'm not sure about that one either. Also what would happen
> if/when we get checksums on things that aren't even relations? (though
> maybe that goes for other parts of amcheck at some point as well?)

I also thought about amcheck when looking at this thread, but it did
not seem the right place as this applies to any AM able that could
load stuff into the shared buffers.
--
Michael


signature.asc
Description: PGP signature

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-03 Thread Jakub Wartak

Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick&dirty tests with blockdev --setra/setfra 0 after 
>> spending time looking at the smgr/md/fd API changes required to find 
>> shortcut, but I'm getting actually a little bit worse timings at least on 
>> "laptop DB tests". One thing that I've noticed is that needs to be only for 
>> automatic-analyze, but not for automatic-vacuum where apparently there is 
>> some boost due to readahead.

>Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and I have repeated the exercise in more realistic
and strict environment that previous one to conclude that the current situation 
is preferable:

Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 
35GB, 65GB (as reported by \di+)
Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 might is 
different layer and might have different storage settings), VG on top of dm-0, 
LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical 
output of PgSQL12: 
INFO:  "x": scanned 150 of 22395442 pages, containing 112410444 live rows 
and 0 dead rows; 150 rows in sample, 1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent 
heapam_scan_analyze_next_block() -> .. -> pread() which causes ~70..80MB/s as 
reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average 
request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, hot caches, 
total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, miliaria as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s 
read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 
3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 
3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), 
total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
as above, double-confirmed no readaheads [ 
pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total 
elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
as above, total elapsed 5m55s

One thing not clear here is maybe in future worth measuring how striped LVs are 
being 
affected by readaheads.

>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time spent goes into directly block I/O 
layer waiting. 

54.67% 1.33%  postgres  postgres[.] FileRead
---FileRead
--53.33%--__pread_nocancel
   --50.67%--entry_SYSCALL_64_after_hwframe
 do_syscall_64
 sys_pread64
 |--49.33%--vfs_read
 |   --48.00%--__vfs_read
 | 
|--45.33%--generic_file_read_iter
 | |  
|--42.67%--ondemand_readahead
 | |  | 
 __do_page_cache_readahead
 | |  | 
 |--25.33%--ext4_mpage_readpages
 | |  | 
 |  |--10.67%--submit_bio
 | |  | 
 |  |  generic_make_request
 | |  | 
 |  |  |--8.00%--blk_mq_make_request
 | |  | 
 |  |  |  |--4.00%--blk_mq_get_request
 | |  | 
 |  |  |  |  |--1.33%--blk_mq_get_tag
 | |  | 
 |  |  |  |

Re: Bogus documentation for bogus geometric operators

2020-11-03 Thread Pavel Borisov

> The subject is about the documentation, but the post reveals
> inconsistencies of the operators.  Tom Lane fixed the documentation on
> the back branches.  The new patch is to fix the operators on the
> master only.
>

Nice catch, thanks!
I agree that different operators should not have the same name and I'm
planning to review the patch soon. What are your ideas on the possibility
to backpatch it also? It seems a little bit weird that the operator can
change its name between versions of PG.
-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Prefer TG_TABLE_NAME over TG_RELNAME in tests

On Thu, Sep 24, 2020 at 5:17 AM Michael Paquier  wrote:
>
> On Wed, Sep 23, 2020 at 12:07:14PM +0200, Daniel Gustafsson wrote:
> > TG_RELNAME was marked deprecated in commit 3a9ae3d2068 some 14 years ago, 
> > but
> > we still use it in the triggers test suite (test added in 59b4cef1eb74a a 
> > year
> > before deprecation).  Seems about time to move over to TG_TABLE_NAME 
> > ourselves,
> > as TG_RELNAME is still covered by the test added in the deprecation commit.
>
> No objections from here to remove that from the core tests.  It is
> worth noting that Debian Code Search hints that this is used in some
> extensions:
> https://codesearch.debian.net/search?q=TG_RELNAME&literal=1
>
> These are pgformatter, bucardo, sql-ledger, ledgersmb and davical.

That's interesting, but I think irrelevant to this patch in itself of
course. But it might be worth reaching out to some of those projects
and notifying them they're using the deprecated ones..

Thus, pushed. Thanks!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: contrib/sslinfo cleanup and OpenSSL errorhandling

2020-11-03 Thread Daniel Gustafsson

> On 3 Nov 2020, at 10:05, Magnus Hagander  wrote:

> Applied, with the small adjustment of the comma in the docs.

Thanks!

> I wonder if we should perhaps backpatch 0002? The changes to sslinfo
> that were ported go all the way back to 9.6, so it should be a safe
> one I think?

It should be safe given that the code has been in production for a long time.
That being said, sslinfo doesn't have tests (yet) and probably isn't used that
much; perhaps it's best to let this mature in HEAD for a few buildfarm cycles
first?

cheers ./daniel

Re: Move OpenSSL random under USE_OPENSSL_RANDOM

On Wed, Aug 26, 2020 at 2:19 PM Daniel Gustafsson  wrote:
>
> > On 26 Aug 2020, at 09:56, Michael Paquier  wrote:
> > On Tue, Aug 25, 2020 at 03:52:14PM +0200, Daniel Gustafsson wrote:
>
> >> The attached moves all invocations under the correct guards.  RAND_poll() 
> >> in
> >> fork_process.c needs to happen for both OpenSSL and OpenSSL random, thus 
> >> the
> >> check for both.
> >
> > Yeah, it could be possible that somebody still calls RAND_bytes() or
> > similar without going through pg_strong_random(), so we still need to
> > use USE_OPENSSL after forking.  Per this argument, I am not sure I see
> > the point of the change in fork_process.c as it seems to me that
> > USE_OPENSSL_RANDOM should only be tied to pg_strong_random.c, and
> > you'd still get a compilation failure if trying to use
> > USE_OPENSSL_RANDOM without --with-openssl.
>
> That's certainly true.  The intention though is to make the code easier to
> follow (more explicit/discoverable) for anyone trying to implement support for

Is it really a reasonable usecase to use RAND_bytes() outside of both
pg_stroing_random() *and' outside of the openssl-specific files (like
be-secure-openssl.c)? Because it would only be those cases that would
have this case, right?

If anything, perhaps the call to RAND_poll() in fork_process.c should
actually be a call to a strong_random_initialize() or something which
would have an implementation in pg_strong_random.c, thereby isolating
the openssl specific code in there? (And with a void implementation
without openssl)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: contrib/sslinfo cleanup and OpenSSL errorhandling

On Mon, Nov 2, 2020 at 3:19 PM Magnus Hagander  wrote:
>
> On Fri, Oct 30, 2020 at 11:20 PM Daniel Gustafsson  wrote:
> >
> > > On 30 Oct 2020, at 00:40, Andres Freund  wrote:
> >
> > > There's quite a few copies of this code that look exactly the same,
> > > except for the be_tls_get_* call. Do you see a way to have fewer copies
> > > of the same code?
> >
> > There's really only two of the same, and two sets of those.  I tried some
> > variations but didn't really achieve anything that would strike the right
> > balance on the codegolf-to-readability scale.  Maybe others have a richer
> > imagination than me.
>
> Yeah, since it's only 2 of each, moving it to a macro wouldn't really
> save a lot -- and it would make things less readable overall I think.
>
> So I'd say the current version is OK.
>
> One thing I noted was in the docs part of the patch there is a missing
> comma -- but that one is missing previously as well. I'll go apply
> that fix to the back branches while waiting to see if somebody comes
> up with a more creative way to avoid the repeated code :)

Applied, with the small adjustment of the comma in the docs.

I wonder if we should perhaps backpatch 0002? The changes to sslinfo
that were ported go all the way back to 9.6, so it should be a safe
one I think?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: Parallel copy

2020-11-03 Thread Amit Kapila

On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas  wrote:
>
> On 02/11/2020 08:14, Amit Kapila wrote:
> > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas  wrote:
> >>
> >> In this design, you don't need to keep line boundaries in shared memory,
> >> because each worker process is responsible for finding the line
> >> boundaries of its own block.
> >>
> >> There's a point of serialization here, in that the next block cannot be
> >> processed, until the worker working on the previous block has finished
> >> scanning the EOLs, and set the starting position on the next block,
> >> putting it in READY state. That's not very different from your patch,
> >> where you had a similar point of serialization because the leader
> >> scanned the EOLs,
> >
> > But in the design (single producer multiple consumer) used by the
> > patch the worker doesn't need to wait till the complete block is
> > processed, it can start processing the lines already found. This will
> > also allow workers to start much earlier to process the data as it
> > doesn't need to wait for all the offsets corresponding to 64K block
> > ready. However, in the design where each worker is processing the 64K
> > block, it can lead to much longer waits. I think this will impact the
> > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > receive line-by-line from client and find the line-endings by leader.
> > If the leader doesn't find the line-endings the workers need to wait
> > till the leader fill the entire 64K chunk, OTOH, with current approach
> > the worker can start as soon as leader is able to populate some
> > minimum number of line-endings
>
> You can use a smaller block size.
>

Sure, but the same problem can happen if the last line in that block
is too long and we need to peek into the next block. And then there
could be cases where a single line could be greater than 64K.

> However, the point of parallel copy is
> to maximize bandwidth.
>

Okay, but this first-phase (finding the line boundaries) can anyway be
not done in parallel and we have seen in some of the initial
benchmarking that this initial phase is a small part of work
especially when the table has indexes, constraints, etc. So, I think
it won't matter much if this splitting is done in a single process or
multiple processes.

> If the workers ever have to sit idle, it means
> that the bottleneck is in receiving data from the client, i.e. the
> backend is fast enough, and you can't make the overall COPY finish any
> faster no matter how you do it.
>
> > The other point is that the leader backend won't be used completely as
> > it is only doing a very small part (primarily reading the file) of the
> > overall work.
>
> An idle process doesn't cost anything. If you have free CPU resources,
> use more workers.
>
> > We have discussed both these approaches (a) single producer multiple
> > consumer, and (b) all workers doing the processing as you are saying
> > in the beginning and concluded that (a) is better, see some of the
> > relevant emails [1][2][3].
> >
> > [1] - 
> > https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > [2] - 
> > https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > [3] - 
> > https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
>
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in that threads.
>

I think something close to that is discussed as you have noticed in
your next email but IIRC, because many people (Andres, Ants, myself
and author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much
in the POC stage due to which the code doesn't look good and as we
move forward we need to see what is the better way to improve it,
maybe one of the ways is to split it as you are suggesting so that it
can be easier to review. I think the other important thing which this
patch has not addressed properly is the parallel-safety checks as
pointed by me earlier. There are two things to solve there (a) the
lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
APIs, etc.) have checks which doesn't allow any writes, we need to see
which of those we can open now (or do some additional work to prevent
from those checks) after some of the work done for parallel-writes in
PG-13[1][2], and (b) in which all cases we can parallel-writes
(parallel copy) is allowed, for example need to identify whether table
or one of its partitions has any constraint/expression which is
parallel-unsafe.

[1] 85f6b49 Allow relation extension lock to conflict among parallel
group members
[2] 3ba59cc Allow page lock to conflict among parallel group members

>
> I want to throw out one more idea. It's an interim step, not the final
> solution we want, but a useful step in getting there:
>
> Have the leader process scan the input for

Re: Split copy.c

2020-11-03 Thread Erikjan Rijkers


On 2020-11-03 08:38, Heikki Linnakangas wrote:


[v3-0001-Split-copy.c-into-copyto.c-and-copyfrom.c.patch]
[v3-0002-Split-copyfrom.c-further-into-copyfrom.c-and-copy.patch]


The patches apply ok, but I get these errors:

In file included from ../../../src/include/postgres.h:46,
 from copyto.c:15:
copyto.c: In function ‘BeginCopyTo’:
copyto.c:477:11: error: ‘is_from’ undeclared (first use in this 
function); did you mean ‘is_program’?

   Assert(!is_from);
   ^~~
../../../src/include/c.h:790:9: note: in definition of macro ‘Assert’
   if (!(condition)) \
 ^
copyto.c:477:11: note: each undeclared identifier is reported only once 
for each function it appears in

   Assert(!is_from);
   ^~~
../../../src/include/c.h:790:9: note: in definition of macro ‘Assert’
   if (!(condition)) \
 ^

Re: Online checksums verification in the backend