Re: adding partitioned tables to publications

2019-11-19 Thread Peter Eisentraut

On 2019-11-12 02:11, Amit Langote wrote:

I don't understand why you go through great lengths to ensure that the
relkinds match between publisher and subscriber.  We already ensure that
only regular tables are published and only regular tables are allowed as
subscription target.  In the future, we may want to allow further
combinations.  What situation are you trying to address here?

I'd really want to see the requirement for relkinds to have to match
go away, but as you can see, this patch doesn't modify enough of
pgoutput.c and worker.c to make that possible.  Both the code for the
initial syncing and that for the subsequent real-time replication
assume that both source and target are regular tables.  So even if
partitioned tables can now be in a publication, they're never sent in
the protocol messages, only their leaf partitions are.  Initial
syncing code can be easily modified to support any combination of
source and target relations, but changes needed for real-time
replication seem non-trivial.  Do you think we should do that before
we can say partitioned tables support logical replication?


My question was more simply why you have this check:

+   /*
+* Cannot replicate from a regular to a partitioned table or vice
+* versa.
+*/
+   if (local_relkind != pt->relkind)
+   ereport(ERROR,
+   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+errmsg("cannot use relation \"%s.%s\" as logical 
replication target",

+   rv->schemaname, rv->relname),

It doesn't seem necessary.  What happens if you remove it?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: adding partitioned tables to publications

2019-11-19 Thread Peter Eisentraut

On 2019-11-18 09:53, Amit Langote wrote:

I have spent some time hacking on this.  With the attached updated
patch, adding a partitioned table to publication results in publishing
the inserts, updates, deletes of the table's leaf partitions as
inserts, updates, deletes of the table itself (it all happens inside
pgoutput).  So, the replication target table doesn't necessarily have
to be a partitioned table and even if it is partitioned its partitions
don't have to match one-to-one.

One restriction remains though: partitioned tables on a subscriber
can't accept updates and deletes, because we'd need to map those to
updates and deletes of their partitions, including handling a tuple
possibly moving from one partition to another during an update.
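
(For illustration, a minimal sketch of the setup described above, with
made-up object names; this reflects the behavior of the patch under
discussion, not of released PostgreSQL:)

  -- Publisher: a partitioned table and one leaf partition.
  CREATE TABLE measurements (ts timestamptz, val int) PARTITION BY RANGE (ts);
  CREATE TABLE measurements_2019 PARTITION OF measurements
      FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

  -- With the patch, the partitioned table itself can be added to a
  -- publication, and changes to its leaf partitions are published as
  -- changes to "measurements".
  CREATE PUBLICATION pub_measurements FOR TABLE measurements;

  -- Subscriber: the target table need not be partitioned, nor match the
  -- publisher's partitioning; if it were partitioned, the restriction
  -- above would limit it to inserts only.
  CREATE TABLE measurements (ts timestamptz, val int);
  CREATE SUBSCRIPTION sub_measurements
      CONNECTION 'host=publisher dbname=postgres'  -- illustrative connection string
      PUBLICATION pub_measurements;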


Right.  Without that second part, the first part isn't really that 
useful yet, is it?


I'm not sure what your intent with this patch is now.  I thought the 
previous behavior -- add a partitioned table to a publication and its 
leaf tables appear in the replication output -- was pretty welcome.  Do 
we not want that anymore?


There should probably be an option to pick the behavior, like we do in 
pg_dump.


What happens when you add a leaf table directly to a publication?  Is it 
replicated under its own identity or under its ancestor partitioned 
table?  (What if both the leaf table and a partitioned table are 
publication members?)


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Do not use StdRdOptions in Access Methods

2019-11-19 Thread Michael Paquier
On Fri, Nov 15, 2019 at 10:34:55AM +0900, Michael Paquier wrote:
> It seems to me that if the plan is to have one option structure for
> each index AM, which has actually the advantage to reduce the bloat of
> each relcache entry currently relying on StdRdOptions, then we could
> have those extra assertion checks in the same patch, because the new
> macros are introduced.

I have looked at this patch, and did not like much having the
calculations of the page free space around, so I have moved that into
each AM's dedicated header.

> There is rd_rel->relam.  You can for example refer to pgstatindex.c
> which has AM-related checks to make sure that the correct index AM is
> being used.

We can do something similar for GIN and BRIN on top of the rest, so
updated the patch with that.  Nikolay, I would be fine to commit this
patch as-is.  Thanks for your patience on this stuff.
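
For reference, the per-AM option structs correspond to the reloptions set
with ordinary CREATE INDEX ... WITH (...) syntax; a quick sketch with
made-up table and index names:

  CREATE TABLE events (id bigint, payload text);

  -- BRIN: pages_per_range / autosummarize are what BrinOptions carries
  CREATE INDEX events_brin ON events USING brin (id)
      WITH (pages_per_range = 32, autosummarize = on);

  -- GIN: fastupdate / gin_pending_list_limit map to GinOptions
  CREATE INDEX events_gin ON events USING gin (to_tsvector('simple', payload))
      WITH (fastupdate = on, gin_pending_list_limit = 1024);

  -- Hash: fillfactor maps to the new HashOptions
  CREATE INDEX events_hash ON events USING hash (id) WITH (fillfactor = 75);

  -- B-tree: fillfactor and vacuum_cleanup_index_scale_factor map to BTOptions
  CREATE INDEX events_btree ON events (id)
      WITH (fillfactor = 90, vacuum_cleanup_index_scale_factor = 0.05);
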
--
Michael
diff --git a/src/include/access/brin.h b/src/include/access/brin.h
index fb351b36e0..794088867a 100644
--- a/src/include/access/brin.h
+++ b/src/include/access/brin.h
@@ -37,11 +37,15 @@ typedef struct BrinStatsData
 
 #define BRIN_DEFAULT_PAGES_PER_RANGE	128
 #define BrinGetPagesPerRange(relation) \
-	((relation)->rd_options ? \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+ relation->rd_rel->relam == BRIN_AM_OID), \
+	 (relation)->rd_options ? \
 	 ((BrinOptions *) (relation)->rd_options)->pagesPerRange : \
 	  BRIN_DEFAULT_PAGES_PER_RANGE)
 #define BrinGetAutoSummarize(relation) \
-	((relation)->rd_options ? \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+ relation->rd_rel->relam == BRIN_AM_OID), \
+	 (relation)->rd_options ? \
 	 ((BrinOptions *) (relation)->rd_options)->autosummarize : \
 	  false)
 
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 78fcd826f1..4e45552b3f 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -14,6 +14,7 @@
 #include "access/gin.h"
 #include "access/ginblock.h"
 #include "access/itup.h"
+#include "catalog/pg_am_d.h"
 #include "fmgr.h"
 #include "storage/bufmgr.h"
 #include "lib/rbtree.h"
@@ -30,10 +31,14 @@ typedef struct GinOptions
 
 #define GIN_DEFAULT_USE_FASTUPDATE	true
 #define GinGetUseFastUpdate(relation) \
-	((relation)->rd_options ? \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+ relation->rd_rel->relam == GIN_AM_OID), \
+	 (relation)->rd_options ? \
 	 ((GinOptions *) (relation)->rd_options)->useFastUpdate : GIN_DEFAULT_USE_FASTUPDATE)
 #define GinGetPendingListCleanupSize(relation) \
-	((relation)->rd_options && \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+ relation->rd_rel->relam == GIN_AM_OID), \
+	 (relation)->rd_options && \
 	 ((GinOptions *) (relation)->rd_options)->pendingListCleanupSize != -1 ? \
 	 ((GinOptions *) (relation)->rd_options)->pendingListCleanupSize : \
 	 gin_pending_list_limit)
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 24af778fdf..7c530a99a0 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -20,6 +20,7 @@
 #include "access/amapi.h"
 #include "access/itup.h"
 #include "access/sdir.h"
+#include "catalog/pg_am_d.h"
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
@@ -263,6 +264,21 @@ typedef struct HashMetaPageData
 
 typedef HashMetaPageData *HashMetaPage;
 
+typedef struct HashOptions
+{
+	int32		varlena_header_;  /* varlena header (do not touch directly!) */
+	int			fillfactor;		  /* page fill factor in percent (0..100) */
+} HashOptions;
+
+#define HashGetFillFactor(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+ relation->rd_rel->relam == HASH_AM_OID), \
+	 (relation)->rd_options ? \
+	 ((HashOptions *) (relation)->rd_options)->fillfactor :	\
+	 HASH_DEFAULT_FILLFACTOR)
+#define HashGetTargetPageUsage(relation) \
+	(BLCKSZ * HashGetFillFactor(relation) / 100)
+
 /*
  * Maximum size of a hash index item (it's okay to have only one per page)
  */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..ca24dfdc4d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -18,6 +18,7 @@
 #include "access/itup.h"
 #include "access/sdir.h"
 #include "access/xlogreader.h"
+#include "catalog/pg_am_d.h"
 #include "catalog/pg_index.h"
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
@@ -680,6 +681,23 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+typedef struct BTOptions
+{
+	int32	varlena_header_;	/* varlena header (do not touch directly!) */
+	int		fillfactor;			/* page fill factor in percent (0..100) */
+	/* fraction of newly inserted tuples prior to trigger index cleanup */
+	float8		vacuum_cleanup_index_scale_factor;
+}	BTOptions;
+
+#define BTGetFillFactor(relation) \
+	

Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Amit Kapila
On Tue, Nov 19, 2019 at 4:58 PM Amit Khandekar  wrote:
>
> On Tue, 19 Nov 2019 at 14:07, Amit Kapila  wrote:
> >
> >
> > Have you tried by injecting some error?  After getting the error
> > mentioned above in email, when I retried the same query, I got the
> > below message.
> >
> > postgres=# SELECT 1 from
> > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> > ERROR:  could not remove file
> > "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during
> > removal of pg_replslot/regression_slot/xid*: Permission denied
> >
> > And, then I tried to drop the replication slot and I got below error.
> > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot');
> > ERROR:  could not rename file "pg_replslot/regression_slot" to
> > "pg_replslot/regression_slot.tmp": Permission denied
> >
> > It might be something related to Windows
>
> Oh ok, I missed the fact that on Windows we can't delete the files
> that are already open, unlike Linux/Unix.
>

See comment in pgunlink() "We need to loop because even though
PostgreSQL uses flags that allow unlink while the file is open, other
applications might have the file
open without those flags.".  Can you once see if there is any flag
that you have missed to pass to allow this?  If there is nothing we
can do about it, then we might need to use some different API or maybe
define a new API that can handle this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pg_waldump and PREPARE

2019-11-19 Thread btkimurayuzk
I did not see any problems in this version of the patch. The information
displayed by pg_waldump for the PREPARE record is sufficient for use.


Thanks Andrey and Michael for the review! I committed the patch.

Regards,



Hi,
There is a mistake in the comment in the definition of 
xl_xact_relfilenodes.

This is a small patch to correct it.

Regards,

Yu Kimura

diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 42b76cb4dd..9d2899dea1 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -239,7 +239,7 @@ typedef struct xl_xact_subxacts
 
 typedef struct xl_xact_relfilenodes
 {
-	int			nrels;			/* number of subtransaction XIDs */
+	int			nrels;			/* number of relations */
 	RelFileNode xnodes[FLEXIBLE_ARRAY_MEMBER];
 } xl_xact_relfilenodes;
 #define MinSizeOfXactRelfilenodes offsetof(xl_xact_relfilenodes, xnodes)


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-19 Thread Kyotaro Horiguchi
I'm in the benchmarking week..

Thanks for reviewing!

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch  wrote in 
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that.  I'll check my edits against the things you
> > list here, and I'll share on-list before committing.  I've now marked the CF
> > entry Ready for Committer.

I'll look into that soon.

By the way, before finalizing this, I'd like to share the results of a
brief benchmark.

First, I measured the direct effect of WAL skipping: the time required
to run the following sequence for the COMMIT-FPW-WAL case and the
COMMIT-fsync case.  WAL and heap files are on a non-server-spec HDD.

  BEGIN;
  TRUNCATE t;
  INSERT INTO t (SELECT a FROM generate_series(1, n) a);
  COMMIT;

REPLICA means the time with wal_level = replica.
SYNC    means the time with wal_level = minimal and force file sync.
WAL     means the time with wal_level = minimal and force commit-WAL.
pages   is the number of pages of the table.
(REPLICA comes from run.sh 1, SYNC/WAL comes from run.sh 2)

pages   REPLICA    SYNC      WAL
1:   144 ms   683 ms   217 ms
3:   303 ms   995 ms   385 ms
5:   271 ms  1007 ms   217 ms
   10:   157 ms  1043 ms   224 ms
   17:   189 ms  1007 ms   193 ms
   31:   202 ms  1091 ms   230 ms
   56:   265 ms  1175 ms   226 ms
  100:   510 ms  1307 ms   270 ms
  177:   790 ms  1523 ms   524 ms
  316:  1827 ms  1643 ms   719 ms
  562:  1904 ms  2109 ms  1148 ms
 1000:  3060 ms  2979 ms  2113 ms
 1778:  6077 ms  3945 ms  3618 ms
 3162: 13038 ms  7078 ms  6734 ms

There was a crossing point around 3000 pages. (bench1() finds that by
bisecting, run.sh 3).


With multiple sessions, the crossing point moves lower, but it does not
become very small.

10 processes (run.pl 4 10) The numbers in parentheses are WAL[n]/WAL[n-1].
pagesSYNC WAL
  316:  8436 ms  4694 ms
  562: 12067 ms  9627 ms (x2.1) # WAL wins
 1000: 19154 ms 43262 ms (x4.5) # SYNC wins. WAL's slope becomes steep.
 1778: 32495 ms 63863 ms (x1.4)

100 processes (run.pl 4 100)
pagesSYNC WAL
   10: 13275 ms  1868 ms 
   17: 15919 ms  4438 ms (x2.3)
   31: 17063 ms  6431 ms (x1.5)
   56: 23193 ms 14276 ms (x2.2)  # WAL wins
  100: 35220 ms 67843 ms (x4.8)  # SYNC wins. WAL's slope becomes steep.

With 10 pgbench sessions.
pages   SYNC WAL 
1:   915 ms   301 ms
3:  1634 ms   508 ms
5:  1634 ms   293 ms
   10:  1671 ms  1043 ms
   17:  1600 ms   333 ms
   31:  1864 ms   314 ms
   56:  1562 ms   448 ms
  100:  1538 ms   394 ms
  177:  1697 ms  1047 ms
  316:  3074 ms  1788 ms
  562:  3306 ms  1245 ms
 1000:  3440 ms  2182 ms
 1778:  5064 ms  6464 ms  # WAL's slope becomes steep
 3162:  8675 ms  8165 ms


I don't think the result for 100 processes is meaningful, so excluding
that result, a candidate for wal_skip_threshold could be 1000.

Thoughts? The attached is the benchmark script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/perl

use strict;
use IPC::Open2;
use Time::HiRes qw (gettimeofday tv_interval);

my $tupperpage = 226;

my @time = ();

sub bench {
my ($header, $nprocs, $ntups, $threshold) = @_;
my @result = ();
my @rds = ();

for (my $ip = 0 ; $ip < $nprocs ; $ip++)
{
pipe(my $rd, my $wr);
$rds[$ip] = $rd;

my $pid = fork();

die "fork failed: $!\n" if ($pid < 0);
if ($pid == 0)
{
close($rd);

my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres");
print $psqlwr "SET wal_skip_threshold to $threshold;\n";
print $psqlwr "DROP TABLE IF EXISTS t$ip;";
print $psqlwr "CREATE TABLE t$ip (a int);\n";

my @st = gettimeofday();
for (my $i = 0 ; $i < 10 ; $i++)
{
print $psqlwr "BEGIN;";
print $psqlwr "TRUNCATE t$ip;";
print $psqlwr "INSERT INTO t$ip (SELECT a FROM 
generate_series(1, $ntups) a);";
print $psqlwr "COMMIT;";
}
close($psqlwr);
waitpid($pid, 0);

print $wr $ip, " ", 1000 * tv_interval(\@st, [gettimeofday]), "\n";
exit;
}
close($wr);
}

my $rpid;
while (($rpid = wait()) == 0) {}

my $sum = 0;
for (my $ip = 0 ; $ip < $nprocs ; $ip++)
{
my $ret = readline($rds[$ip]);
die "format? $ret\n" if ($ret !~ /^([0-9]+) ([0-9.]+)$/);

$sum += $2;
}

printf "$header: procs $nprocs: time 

Re: range_agg

2019-11-19 Thread Paul A Jungwirth
On Tue, Nov 19, 2019 at 1:17 AM Pavel Stehule  wrote:
> Hi
> I tested last patches. I found some issues

Thank you for the review!

> 1. you should not to try patch catversion.

I've seen discussion on pgsql-hackers going both ways, but I'll leave
it out of future patches. :-)

> 2. there is warning
>
> parse_coerce.c: In function ‘enforce_generic_type_consistency’:
> parse_coerce.c:1975:11: warning: ‘range_typelem’ may be used uninitialized in 
> this function [-Wmaybe-uninitialized]
>  1975 |   else if (range_typelem != elem_typeid)

Fixed locally, will include in my next patch.

> 3. there are problems with pg_upgrade. Regress tests fails
> . . .
> pg_restore: while PROCESSING TOC:
> pg_restore: from TOC entry 1653; 1247 17044 TYPE arrayrange pavel
> pg_restore: error: could not execute query: ERROR:  pg_type array OID value 
> not set when in binary upgrade mode

I see what's going on here. (Sorry if this verbose explanation is
obvious; it's as much for me as for anyone.) With pg_upgrade the
values of pg_type.oid and pg_type.typarray must be the same before &
after. For built-in types there's no problem, because those are fixed
by pg_type.dat. But for user-defined types we have to take extra steps
to make sure they don't change. CREATE TYPE always uses two oids: one
for the type and one for the type's array type. But now when you
create a range type we use *four*: the range type, the array of that
range, the multirange type, and the array of that multirange.
Currently when you run pg_dump in "binary mode" (i.e. as part of
pg_upgrade) it includes calls to special functions to set the next oid
to use for pg_type.oid and pg_type.typarray. Then CREATE TYPE also has
special "binary mode" code to check those variables and use those oids
(e.g. AssignTypeArrayOid). After using them once it sets them back to
InvalidOid so it doesn't keep using them. So I guess I need to add
code to pg_dump so that it also outputs calls to two new special
functions that similarly set the oid to use for the next multirange
and multirange[]. For v12->v13 it will choose high-enough oids like we
do already for arrays of domains. (For other upgrades it will use the
existing value.) And then I can change the CREATE TYPE code to check
those pre-set values when obtaining the next oid. Does that sound like
the right approach here?
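
(For reference, a sketch of the kind of binary-upgrade preamble pg_dump
already emits for the two existing OIDs, plus the two extra calls proposed
above; the multirange function names and the OID values other than 17044
are made up for illustration:)

  -- Existing calls emitted before the dumped CREATE TYPE ... AS RANGE:
  SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('17044'::pg_catalog.oid);
  SELECT pg_catalog.binary_upgrade_set_next_array_pg_type_oid('17045'::pg_catalog.oid);

  -- Proposed additions for the two further OIDs a range type now consumes
  -- (hypothetical function names):
  SELECT pg_catalog.binary_upgrade_set_next_multirange_pg_type_oid('17046'::pg_catalog.oid);
  SELECT pg_catalog.binary_upgrade_set_next_multirange_array_pg_type_oid('17047'::pg_catalog.oid);

  -- ... followed by the dumped CREATE TYPE public.arrayrange AS RANGE ( ... );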

> 4. there is a problem with doc
>
> extend.sgml:281: parser error : Opening and ending tag mismatch: para line 
> 270 and type
>  type of the ranges in an anymultirange.

Hmm, yikes, I'll fix that!

> I am not sure how correct it is to use  
> in the docs. It is used for ranges and multiranges, but not in other places.

I could use some advice here. Many operators seem best presented in
groups of four, where only their parameter types change, for example:

int8range < int8range
int8range < int8multirange
int8multirange < int8range
int8multirange < int8multirange

All I really want is to show those separated by line breaks. I
couldn't find any other examples of that happening inside a table cell
though (i.e. inside ). What is the best way
to do that?

Thanks,
Paul




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-11-19 Thread Dilip Kumar
On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila  wrote:
>
> On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar  wrote:
> >
> > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila  wrote:
> > >
> > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar  wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > Few other comments on this patch:
> > > > > 1.
> > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > +
> > > > > + /*
> > > > > + * Execute the invalidation message locally.
> > > > > + *
> > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > + * about the message itself?
> > > > > + */
> > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > + break;
> > > > >
> > > > > Here, why are we executing messages individually?  Can't we just
> > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > receive any invalidation message?
> >
> > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > commit.  Because this is required to add any committed transaction to
> > the snapshot if it has done any catalog changes.
> >
>
> Hmm, this is also used to build cid hash map (see
> ReorderBufferBuildTupleCidHash) which we need to use while streaming
> changes for the in-progress transactions.  So, I think that it would
> be required earlier (before commit) as well.
>
Oh right,  I guess I missed that part.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: backup manifests

2019-11-19 Thread Suraj Kharage
Hi,

Since we now generate a backup manifest file with each backup, it gives
us a way to validate a given backup.
Let's say we have taken a backup and, after a few days, we want to check
whether that backup is valid and corruption-free, without restarting the
server.

Please find attached a POC patch for this, based on the latest backup
manifest patch from Rushabh. With this functionality, we add a new
option to pg_basebackup, something like --verify-backup.
So, the syntax would be:
./bin/pg_basebackup --verify-backup -D 

Basically, we read the backup_manifest file line by line from the given
directory path and build the hash table, then scan the directory and
compare each file with the hash entry.

Thoughts/suggestions?

On Tue, Nov 19, 2019 at 3:30 PM Rushabh Lathia 
wrote:

>
>
> My colleague Suraj did testing and noticed the performance impact
> with the checksums.   On further testing, he found that specifically with
> sha its more of performance impact.
>
> Please find below statistics:
>
> no of tables     | without checksum | SHA-256 checksum | % overhead | md5 checksum    | % overhead | CRC checksum    | % overhead
>                  |                  |                  | (SHA-256)  |                 | (md5)      |                 | (CRC)
> -----------------+------------------+------------------+------------+-----------------+------------+-----------------+-----------
> 10 (100 MB       | real 0m10.957s   | real 0m16.816s   | 53%        | real 0m11.895s  | 8%         | real 0m11.136s  | 2%
> in each table)   | user 0m0.367s    | user 0m0.210s    |            | user 0m0.174s   |            | user 0m0.365s   |
>                  | sys  0m2.275s    | sys  0m2.067s    |            | sys  0m1.725s   |            | sys  0m2.298s   |
> 20 (100 MB       | real 0m20.610s   | real 0m31.745s   | 54%        | real 0m22.717s  | 10%        | real 0m21.075s  | 2%
> in each table)   | user 0m0.484s    | user 0m0.569s    |            | user 0m0.638s   |            | user 0m0.538s   |
>                  | sys  0m3.198s    | sys  0m4.089s    |            | sys  0m4.026s   |            | sys  0m3.417s   |
> 50 (100 MB       | real 0m49.143s   | real 1m13.683s   | 50%        | real 0m51.856s  | 6%         | real 0m49.689s  | 1%
> in each table)   | user 0m1.646s    | user 0m1.305s    |            | user 0m0.932s   |            | user 0m1.028s   |
>                  | sys  0m8.499s    | sys  0m10.541s   |            | sys  0m7.702s   |            | sys  0m6.921s   |
> 100 (100 MB      | real 1m34.308s   | real 2m22.403s   | 51%        | real 1m41.524s  | 8%         | real 1m35.045s  | 1%
> in each table)   | user 0m2.265s    | user 0m2.613s    |            | user 0m2.158s   |            | user 0m2.061s   |
>                  | sys  0m14.717s   | sys  0m20.776s   |            | sys  0m15.949s  |            | sys  0m16.308s  |
> 100 (1 GB        | real 17m18.336s  | real 24m45.942s  | 43%        | real 17m41.670s | 2%         | real 17m22.296s | ~0.5% (*)
> in each table)   | user 0m20.222s   | user 0m26.911s   |            | user 0m26.506s  |            | user 0m26.811s  |
>                  | sys  3m12.960s   | sys  3m33.501s   |            | sys  3m18.402s  |            | sys  3m56.653s  |
>
> (*) sometimes this test completes within the same time as without checksum; approx. 0.5%
>
>
> Considering the above results, I modified Robert's earlier patch and added
> a "manifest_with_checksums" option to pg_basebackup.  With the new patch,
> checksums are disabled by default and are only enabled when the
> "manifest_with_checksums" option is provided.  I also re-based the whole patch set.
>
>
>
> Regards,
>
> --
> Rushabh Lathia
> www.EnterpriseDB.com
>
> On Tue, Oct 1, 2019 at 5:43 PM Robert Haas  wrote:
>
>> On Mon, Sep 30, 2019 at 5:31 AM Jeevan Chalke
>>  wrote:
>> > Entry for directory is not added in manifest. So it might be difficult
>> > at client to get to know about the directories. Will it be good to add
>> > an entry for each directory too? May be like:
>> > Dir 
>>
>> Well, what kind of corruption would this allow us to detect that we
>> can't detect as things stand? I think the only case is an empty
>> directory. If it's not empty, we'd have some entries for the files in
>> that directory, and those files won't be able to exist unless the
>> directory does. But, how would we end up backing up an empty
>> directory, anyway?
>>
>> I don't really *mind* adding directories into the manifest, but I'm
>> not sure how much it helps.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>>
>>
>
> --
> Rushabh Lathia
>


-- 
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.


backup_validator_POC.patch
Description: Binary data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Thomas Munro
On Wed, Nov 20, 2019 at 5:44 PM Amit Kapila  wrote:
> On Wed, Nov 20, 2019 at 12:28 AM Thomas Munro  wrote:
> > +   if (GetLastError() == ERROR_HANDLE_EOF)
> > +   return 0;

> Yes, this works for me.

Thanks, pushed.




Re: [HACKERS] Block level parallel vacuum

2019-11-19 Thread Masahiko Sawada
On Mon, 18 Nov 2019 at 15:38, Masahiko Sawada
 wrote:
>
> On Mon, 18 Nov 2019 at 15:34, Amit Kapila  wrote:
> >
> > On Mon, Nov 18, 2019 at 11:37 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Wed, 13 Nov 2019 at 14:31, Amit Kapila  wrote:
> > > >
> > > >
> > > > Based on these needs, we came up with a way to allow users to specify
> > > > this information for IndexAm's. Basically, Indexam will expose a
> > > > variable amparallelvacuumoptions which can have below options
> > > >
> > > > VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither bulkdelete nor
> > > > vacuumcleanup) can't be performed in parallel
> > >
> > > I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs who don't
> > > want to support parallel vacuum don't have to set anything.
> > >
> >
> > make sense.
> >
> > > > VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be done in
> > > > parallel (Indexes nbtree, hash, gin, gist, spgist, bloom will set this
> > > > flag)
> > > > VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup can be
> > > > done in parallel if bulkdelete is not performed (Indexes nbtree, brin,
> > > > gin, gist,
> > > > spgist, bloom will set this flag)
> > > > VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can be done in
> > > > parallel even if bulkdelete is already performed (Indexes gin, brin,
> > > > and bloom will set this flag)
> > >
> > > I think gin and bloom don't need to set both but should set only
> > > VACUUM_OPTION_PARALLEL_CLEANUP.
> > >
> > > And I'm going to disallow index AMs to set both
> > > VACUUM_OPTION_PARALLEL_COND_CLEANUP and VACUUM_OPTION_PARALLEL_CLEANUP
> > > by assertions, is that okay?
> > >
> >
> > Sounds reasonable to me.
> >
> > Are you planning to include the changes related to I/O throttling
> > based on the discussion in the nearby thread [1]?  I think you can do
> > that if you agree with the conclusion in the last email[1], otherwise,
> > we can explore it separately.
>
> Yes I agreed. I'm going to include that changes in the next version
> patches. And I think we will be able to do more discussion based on
> the patch.
>

I've attached the latest version of the patch set. It includes all the
discussed points regarding index AM options as well as the shared cost
balance. I also added some test cases that exercise all types of index AM.
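
(For context, the user-facing side of the 0002/0003 patches, as a
hypothetical usage sketch; the exact syntax is whatever those patches
implement and is not settled here:)

  -- Request parallel index vacuuming with up to 4 workers (0002 patch):
  VACUUM (PARALLEL 4, VERBOSE) my_big_table;
  -- vacuumdb gains a corresponding -P/--parallel option (0003 patch).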

During development I had one concern about the number of parallel
workers to launch. In the current design each index AM can choose
whether it participates in parallel bulk-deletion and/or parallel
cleanup. That also means the number of parallel workers to launch might
differ between the parallel bulk-deletion and parallel cleanup phases.
In the current patch the leader always launches as many workers as
there are indexes that support either one, but that would not be
efficient in some cases. For example, if we have 3 indexes supporting
only parallel bulk-deletion and 2 indexes supporting only parallel
index cleanup, we would launch 5 workers for each phase, but some
workers would do nothing at all. To deal with this problem, I wonder if
we can improve the parallel query machinery so that the leader process
creates a parallel context sized for the maximum number of indexes and
can launch only a part of the workers instead of all of them.

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v33-0001-Add-index-AM-field-and-callback-for-parallel-ind.patch
Description: Binary data


v33-0003-Add-paralell-P-option-to-vacuumdb-command.patch
Description: Binary data


v33-0002-Add-parallel-option-to-VACUUM-command.patch
Description: Binary data


Re: backup manifests

2019-11-19 Thread Rushabh Lathia
On Tue, Nov 19, 2019 at 7:19 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

>
> On 11/19/19 5:00 AM, Rushabh Lathia wrote:
> >
> >
> > My colleague Suraj did testing and noticed the performance impact
> > with the checksums.   On further testing, he found that specifically with
> > sha its more of performance impact.
> >
> >
>
> I admit I haven't been following along closely, but why do we need a
> cryptographic checksum here instead of, say, a CRC? Do we think that
> somehow the checksum might be forged? Use of cryptographic hashes as
> general purpose checksums has become far too common IMNSHO.
>

Yeah, maybe.  I was thinking of giving the user an option to choose the
checksum algorithm (SHA256, CRC, MD5, etc.), so that they are free to choose
whatever suits their environment.

If we decide to do that, then we need to store the checksum algorithm
information in the manifest file.

Thoughts?



>
> cheers
>
>
> andrew
>
>
> --
> Andrew Dunstan  https://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>

-- 
Rushabh Lathia


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Amit Kapila
On Tue, Nov 19, 2019 at 4:58 PM Amit Khandekar  wrote:
>
> On Tue, 19 Nov 2019 at 14:07, Amit Kapila  wrote:
> >
> > On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar  
> > wrote:
> > >
> > > For the API's that use VFDs (like PathNameOpenFile), the files opened
> > > are always recorded in the VfdCache array. So it is not required to do
> > > the cleanup at (sub)transaction end, because the kernel fds get closed
> > > dynamically in ReleaseLruFiles() whenever they reach max_safe_fds
> > > limit. So if a transaction aborts, the fds might remain open, but
> > > those will get cleaned up whenever we require more fds, through
> > > ReleaseLruFiles(). Whereas, for files opened through
> > > OpenTransientFile(), VfdCache is not involved, so this needs
> > > transaction end cleanup.
> > >
> >
> > Have you tried by injecting some error?  After getting the error
> > mentioned above in email, when I retried the same query, I got the
> > below message.
> >
> > postgres=# SELECT 1 from
> > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> > ERROR:  could not remove file
> > "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during
> > removal of pg_replslot/regression_slot/xid*: Permission denied
> >
> > And, then I tried to drop the replication slot and I got below error.
> > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot');
> > ERROR:  could not rename file "pg_replslot/regression_slot" to
> > "pg_replslot/regression_slot.tmp": Permission denied
> >
> > It might be something related to Windows
>
> Oh ok, I missed the fact that on Windows we can't delete the files
> that are already open, unlike Linux/Unix.
> I guess, I may have to use FD_CLOSE_AT_EOXACT flags; or simply use
> OpenTemporaryFile().
>

I think setting FD_CLOSE_AT_EOXACT won't work unless you also set
have_xact_temporary_files because it checks that flag in
CleanupTempFiles.  Also, OpenTemporaryFile() doesn't take the input
file path, so how will you use it?

> I wonder though if this same issue might come up
> for the other use-case of PathNameOpenFile() :
> logical_rewrite_log_mapping().
>

It is possible, but I haven't tested that path.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Thomas Munro
On Wed, Nov 20, 2019 at 4:54 PM Amit Khandekar  wrote:
> - BufFileLoadBuffer() seems to deliberately ignore FileRead()'s return
> value if it is -1
> if (file->nbytes < 0) file->nbytes = 0;

Ok, that's a different problem we need to fix then.  But it does
explain how we didn't know.  And sure enough there is "unrecognized
win32 error code: 38" LOG-spam on the build farm, at places where
tuplestores are expected:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=whelk=2019-11-20%2002%3A41%3A41=check




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Amit Kapila
On Wed, Nov 20, 2019 at 12:28 AM Thomas Munro  wrote:
>
> On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha
>  wrote:
> > On Tue, Nov 19, 2019 at 12:49 PM Thomas Munro  
> > wrote:
> >> On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar  
> >> wrote:
> >> > On Windows, it is documented that ReadFile() (which is called by
> >> > pg_pread) will return false on EOF but only when the file is open for
> >> > asynchronous reads/writes. But here we are just dealing with usual
> >> > synchronous reads. So pg_pread() code should indeed return 0 on EOF on
> >> > Windows. Not yet able to figure out how FileRead() managed to return
> >> > this error on Windows. But from your symptoms, it does look like
> >> > pg_pread()=>ReadFile() returned false (despite doing asynchronous
> >> > reads), and so _dosmaperr() gets called, and then it does not find the
> >> > eof error in doserrors[], so the "unrecognized win32 error code"
> >> > message is printed. May have to dig up more on this.
> >>
> >> Hmm.  See also this report:
> >>
> >> https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com
> >>
> >
> > The files from pgwin32_open() are open for synchronous access, while 
> > pg_pread() uses the asynchronous functionality to offset the read. Under 
> > these circumstances, a read past EOF will return ERROR_HANDLE_EOF (38), as 
> > explained in:
>
> Oh, thanks.
>
> > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863
>
> !?!
>
> Amit, since it looks like you are Windows-enabled and have a repro,
> would you mind confirming that this fixes the problem?
>
> --- a/src/port/pread.c
> +++ b/src/port/pread.c
> @@ -41,6 +41,9 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
> overlapped.Offset = offset;
> if (!ReadFile(handle, buf, size, , ))
> {
> +   if (GetLastError() == ERROR_HANDLE_EOF)
> +   return 0;
> +
> _dosmaperr(GetLastError());
> return -1;
> }

Yes, this works for me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




RE: Should we add xid_current() or a int8->xid cast?

2019-11-19 Thread imai.yoshik...@fujitsu.com
Hi Thomas,

Please let me ask something about wraparound problem.

+static FullTransactionId
+convert_xid(TransactionId xid, FullTransactionId next_fxid)
 {
-   uint64  epoch;
+   TransactionId next_xid = XidFromFullTransactionId(next_fxid);
+   uint32 epoch = EpochFromFullTransactionId(next_fxid);
 
...
 
-   /* xid can be on either side when near wrap-around */
-   epoch = (uint64) state->epoch;
-   if (xid > state->last_xid &&
-   TransactionIdPrecedes(xid, state->last_xid))
+   if (xid > next_xid)
epoch--;
-   else if (xid < state->last_xid &&
-TransactionIdFollows(xid, state->last_xid))
-   epoch++;
 
-   return (epoch << 32) | xid;
+   return FullTransactionIdFromEpochAndXid(epoch, xid);


ISTM the code handling wraparound has been deleted. Is that correct?
I haven't been able to read all the related threads about using
FullTransactionId, but does using FullTransactionId avoid the wraparound
problem?

If we consider the conditions below, we can say it's difficult to hit
wraparound with current disks like SSDs (2 GB/s) or memory like DDR4
(34 GB/s), but if we can use higher-spec hardware like HBM3 (2048 GB/s),
we could see wraparound. Or am I saying silly things?

* 10-year system lifetime ( < 2^4 years )
* 1 year = 31536000 ( = 60 * 60 * 24 * 365) secs  ( < 2^25 )
* so we can consume up to 2^35 ( = 2^64 / 2^4 / 2^25) transactions per second
* which means writing at (2^5 * 2^30 * n) bytes/sec = (32 * n) GB/sec if we
  use 'n' bytes for each transaction.

Is there any agreement that we can throw the wraparound problem away if we
adopt FullTransactionId?


Thanks
--
Yoshikazu Imai



Re: Allow CREATE OR REPLACE VIEW to rename the columns

2019-11-19 Thread btkimurayuzk

Barring any objection, I'm thinking to commit this patch.

Regards,


The build and all tests have passed.
Looks good to me.

Regards,




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Amit Khandekar
On Wed, 20 Nov 2019 at 01:05, Thomas Munro  wrote:
>
> On Wed, Nov 20, 2019 at 7:58 AM Thomas Munro  wrote:
> > On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha
> > > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863
> >
> > !?!

Thanks Juan and Thomas for pointing to these links where already this
was discussed.

>
> One thing I don't understand (besides, apparently, the documentation):
> how did this problem escape detection by check-world for such a long
> time?  Surely we expect to hit the end of various temporary files in
> various tests.  Is it intermittent, or dependent on Windows version,
> or something like that?

Possibly there aren't any callers who try to pread() at end-of-file
using FileRead/pg_pread :

- mdread() seems to read only from an offset that it already knows is
within the file, including the whole BLCKSZ.
- BufFileLoadBuffer() seems to deliberately ignore FileRead()'s return
value if it is -1
if (file->nbytes < 0) file->nbytes = 0;
- XLogPageRead() also seems to know that the offset is a valid offset.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: initdb SegFault

2019-11-19 Thread Kyotaro Horiguchi
At Tue, 19 Nov 2019 12:06:50 -0500, Tom Lane  wrote in 
> Andres Freund  writes:
> > Agreed wrt this specific failure scenario. It does however seem not
> > great that callsites for PQExpBuffer ought to check every call for
> > allocation failures, in the general case.
> 
> It is possible to check just once at the end, using the PQExpBufferBroken
> API, and I believe that libpq for instance is fairly careful about that.

FWIW, I looked though the callers of PQExpBuffer.

pqGetErrorNotice3 seems to ignore OOM on the message buffer when !isError,
and then sets res->errMsg to NULL. getParameterStatus doesn't check that
before use, either.

Most of the callers of PQExpBufferDataBroken use libpq_gettext("out of
memory"). And some of them do strdup(libpq_gettext()).

Not restricting to libpq functions, 

dblink_connstr_check complains as "password is required" when
PQconninfoParse hits OOM.

libpqrcv_check_conninfo() will show '(null)' or maybe get SEGV on some
platforms when PQconninfoParse() hits OOM, since it uses err without
null checking. pg_basebackup, pg_dumpall and pg_isready do the
same thing.


> I agree that programs that just need to print something and exit could
> perhaps ask pqexpbuffer.c to handle that for them.  (But initdb still
> doesn't fall in that category, because of its very nontrivial atexit
> handler :-(.)
> 
> > I wonder if, for frontend paths, a simplified error handling path would
> > be worthwhile for OOM paths. Doing only a write() or such to print an
> > error message.
> 
> Perhaps.  You wouldn't get any translation --- but then, gettext is
> probably going to fail anyway under such conditions.

I think we should refrain from translating in those cases.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-19 Thread Justin Pryzby
On Tue, Nov 19, 2019 at 04:49:10PM -0600, Justin Pryzby wrote:
> On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> > Perhaps we should not panic if we failed to open (not fsync) the file,
> > but it's not the root problem here which is that somehow we thought we
> > should be fsyncing a file that had apparently been removed already
> > (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

Note, the ALTER was (I think) building index in a parallel process:

 2019-11-15 22:16:08.752-05 |  5976 | terminating connection because of crash 
of another server process
 2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was 
terminated by signal 6: Aborted
 2019-11-15 22:16:08.751-05 | 27384 | terminating any other active server 
processes
 2019-11-15 22:16:07.098-05 | 27388 | could not fsync file 
"base/16491/1731839470.2": No such file or directory
 2019-11-15 22:15:59.592-05 | 19860 | duration: 220283.831 ms  statement: ALTER 
TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolulscell TYPE integer 
USING pmradiothpvolulscell::integer
 2019-11-15 22:15:59.459-05 | 19860 | temporary file: path 
"base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/1.0", size 5144576
 2019-11-15 22:15:59.458-05 | 19860 | temporary file: path 
"base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/2.0", size 6463488
 2019-11-15 22:15:59.456-05 | 19860 | temporary file: path 
"base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/0.0", size 4612096

FYI, that table is *currently* (5 days later):
ts=# \dti+ child.eric_enodeb_cell_201811*
 child  | eric_enodeb_cell_201811          | table | telsasoft |                          | 2595 MB |
 child  | eric_enodeb_cell_201811_idx      | index | telsasoft | eric_enodeb_cell_201811  | 120 kB  |
 child  | eric_enodeb_cell_201811_site_idx | index | telsasoft | eric_enodeb_cell_201811  | 16 MB   |

I don't know if that table is likely to be the one with relfilenode 1731839470
(but it certainly wasn't its index), or if that was maybe a table (or index)
from an earlier ALTER.  I tentatively think we wouldn't have had any other
tables being dropped, partitions pruned or maintenance commands running.

Checkpoint logs for good measure:
 2019-11-15 22:18:26.168-05 | 10388 | checkpoint complete: wrote 2915 buffers 
(3.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=30.022 s, sync=0.472 
s, total=32.140 s; sync files=107, longest=0.364 s, average=0.004 s; 
distance=297471 kB, estimate=297471 kB
 2019-11-15 22:17:54.028-05 | 10388 | checkpoint starting: time
 2019-11-15 22:16:53.753-05 | 10104 | checkpoint complete: wrote 98275 buffers 
(100.0%); 0 WAL file(s) added, 0 removed, 43 recycled; write=11.040 s, 
sync=0.675 s, total=11.833 s; sync files=84, longest=0.335 s, average=0.008 s
; distance=698932 kB, estimate=698932 kB
 2019-11-15 22:16:41.921-05 | 10104 | checkpoint starting: end-of-recovery 
immediate
 2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was 
terminated by signal 6: Aborted
 2019-11-15 22:15:33.03-05  | 27388 | checkpoint starting: time
 2019-11-15 22:15:03.62-05  | 27388 | checkpoint complete: wrote 5436 buffers 
(5.5%); 0 WAL file(s) added, 0 removed, 45 recycled; write=28.938 s, sync=0.355 
s, total=29.711 s; sync files=22, longest=0.174 s, average=0.016 s; d
istance=740237 kB, estimate=740237 kB
 2019-11-15 22:14:33.908-05 | 27388 | checkpoint starting: time

I was trying to reproduce what was happening:
set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int 
unique); INSERT INTO t SELECT generate_series(1,99)"; echo "begin;SELECT 
pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot& 
sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`; 
PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER 
i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap" |head -44;

Under v12, with or without the CHECKPOINT command, it fails:
|pg_dump: error: query failed: ERROR:  cache lookup failed for index 0
But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails 
like:
|pg_dump: [archiver (db)] query failed: ERROR:  cache lookup failed for index 
16391
With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
needed, or why it would behave differently (or if it's related to this crash).




RE: Planning counters in pg_stat_statements (using pgss_store)

2019-11-19 Thread imai.yoshik...@fujitsu.com
On Tue, Nov 19, 2019 at 2:27 PM, Julien Rouhaud wrote:
> On Fri, Nov 15, 2019 at 2:00 AM imai.yoshik...@fujitsu.com 
>  wrote:
> >
> > Actually I also don't have strong opinion but I thought someone would 
> > complain about renaming of those columns and
> also some tools like monitoring which use those columns will not work. If we 
> use {total, min, max, mean, stddev}_time,
> someone might mistakenly understand {total, min, max, mean, stddev}_time mean 
> {total, min, max, mean, stddev} of planning
> and execution.
> > If I need to choose {total, min, max, mean, stddev}_time or {total, min, 
> > max, mean, stddev}_exec_time, I choose former
> one because choosing best name is not worth destructing the existing scripts 
> or tools.
> 
> We could definitely keep (plan|exec)_time for the SRF, and have the {total, 
> min, max, mean, stddev}_time created by
> the view to be a sum of planning + execution for each counter

I might be misunderstanding, but if we define {total, min, max, mean,
stddev}_time as just a sum of planning + execution for each counter, like
"select total_plan_time + total_exec_time as total_time from
pg_stat_statements", I wonder whether we can calculate stddev_time
correctly. If we keep dedicated variables in the code to accumulate those
values, then yes, we can calculate them correctly, even total_stddev.
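
(To illustrate the point with a sketch: the totals and bounds can be
combined at the view level, but a combined stddev cannot be derived from
the two per-phase stddevs alone; the *_plan_time / *_exec_time column
names below are the ones being discussed and are not final:)

  -- Hypothetical view over the pg_stat_statements SRF being discussed:
  SELECT queryid,
         total_plan_time + total_exec_time AS total_time,
         min_plan_time   + min_exec_time   AS min_time,  -- only a lower bound
         max_plan_time   + max_exec_time   AS max_time   -- only an upper bound
         -- stddev_time cannot be derived here: Var(plan+exec) also needs the
         -- covariance of the two, so it has to be accumulated separately in C.
  FROM pg_stat_statements;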

> and it doesn't sound like a bad idea if having even
> more columns in the view is not an issue.

I also wondered whether having many columns in the view is OK, but if it is,
I agree with having all of those columns in the view. The only problem I can
come up with is that the view will look bad with many columns, but it
already looks bad because query column values tend to be long and each row
can't fit on one line in the console.

--
Yoshikazu Imai


Re: Role membership and DROP

2019-11-19 Thread Laurenz Albe
On Tue, 2019-11-19 at 13:21 -0500, Tom Lane wrote:
> Laurenz Albe  writes:
> > On Fri, 2019-11-15 at 13:41 -0500, Tom Lane wrote:
> > > Laurenz Albe  writes:
> > > > On Wed, 2019-11-13 at 17:17 -0500, Tom Lane wrote:
> > > > > It might be worth clarifying this point in section 5.7,
> > > > > https://www.postgresql.org/docs/devel/ddl-priv.html
> > I like your second sentence, but I think that "the right ... is inherent
> > in being the ... owner" is unnecessarily complicated.
> > Removing the "always" and "only" makes the apparent contradiction between
> > the sentences less jarring to me.
> 
> I think it's important to emphasize that this is implicit in object
> ownership.
> 
> Looking at the page again, I notice that there's a para a little further
> down that overlaps quite a bit with what we're discussing here, but it's
> about implicit grant options rather than the right to DROP.  In the
> attached, I reworded that too, and moved it because it's not fully
> intelligible until we've explained grant options.  Thoughts?

I am fine with that.

Yours,
Laurenz Albe





Re: Proposal- GUC for max dead rows before autovacuum

2019-11-19 Thread Laurenz Albe
On Tue, 2019-11-19 at 15:35 -0700, Michael Lewis wrote:
> To mitigate the need for per-table tuning of autovacuum configuration, I'd 
> like to propose a new GUC for autovacuum_vacuum_threshold_max.
> 
> Currently, it seems that I can either set autovacuum_vacuum_scale_factor much 
> smaller than default on tables with millions of rows,
> or set a value globally that means small tables are auto vacuumed rarely.
> 
> The default value for this new setting value could be -1 or 0 to disable the 
> feature, or something like 100,000 perhaps
> so that tables with more than 500,000 tuples are candidates for an 
> autovacuum before they would with current default values.

I think this is unnecessary.
Usually you have problems only with a few tables, and it is no problem
to set autovacuum parameters on these individually.
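
For example (a sketch with a made-up table name):

  -- Tighten autovacuum for just the large, high-churn table instead of globally:
  ALTER TABLE big_history_table
      SET (autovacuum_vacuum_scale_factor = 0.01,
           autovacuum_vacuum_threshold = 10000);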

Yours,
Laurenz Albe





Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-19 Thread Justin Pryzby
On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> Perhaps we should not panic if we failed to open (not fsync) the file,
> but it's not the root problem here which is that somehow we thought we
> should be fsyncing a file that had apparently been removed already
> (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

FYI, I *do* have scripts which CLUSTER and(or) VAC FULL (and REINDEX), but they
were disabled due to crash in 12.0 (which was resolved in 12.1) and it wouldn't
have run anyway, since they run after the backup script (which failed) in a
shell with set -e.

I think TRUNCATE does that too..but I don't think that any processes which do
TRUNCATE are running on that server.

Thanks,
Justin




Proposal- GUC for max dead rows before autovacuum

2019-11-19 Thread Michael Lewis
To mitigate the need for per-table tuning of autovacuum configuration, I'd
like to propose a new GUC for autovacuum_vacuum_threshold_max.

Currently, it seems that I can either set autovacuum_vacuum_scale_factor
much smaller than default on tables with millions of rows, or set a value
globally that means small tables are auto vacuumed rarely.

The default value for this new setting value could be -1 or 0 to disable
the feature, or something like 100,000 perhaps so that tables with more
than 500,000 tuples are candidates for an autovacuum before they would
with current default values.


Re: Invisible PROMPT2

2019-11-19 Thread David Fetter
On Tue, Nov 19, 2019 at 04:02:48PM +1300, Thomas Munro wrote:
> On Tue, Nov 19, 2019 at 12:09 PM Tom Lane  wrote:
> > You should follow the logic in pg_wcswidth: compute PQmblen() first,
> > and bail out if it's more than the remaining string length, otherwise
> > it's ok to apply PQdsplen().
> 
> Got it.  I was worried that it wasn't safe to call even PQmblen(),
> because I didn't know a fact about all encodings: as described in the
> comment of pg_gb18030_mblen(), all implementations read only the first
> byte to determine the length, except for GB18030 which reads the
> second byte too, and that's OK because there's always a null
> terminator.
> 
> > It might be a good idea to explicitly initialize last_prompt1_width to
> > zero, for clarity.
> >
> > Should the user docs explicitly say "of the same width as the most recent
> > output of PROMPT1", as you have in the comments?  That seems a more
> > precise specification, and it will eliminate some questions people will
> > otherwise ask.
> >
> > LGTM otherwise.
> 
> Done, and pushed.  I also skipped negative results from PQdsplen like
> pg_wcswidth() does (that oversight explained why a non-readline build
> showed the correct alignment for PROMPT1 '%[%033[1m%]%M
> %n@%/%R%[%033[0m%]%# ' by strange coincidence).
> 
> Thanks all for the feedback.  I think the new bikeshed colour looks good.

Please find attached some polka dots for the bike shed :)

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From fdbb6272338ed71f8cecf2a755ef53b037acc251 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Tue, 19 Nov 2019 11:39:47 -0800
Subject: [PATCH v1] Make visible_length() available to the rest of psql
To: hackers


diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 90f6380170..3245f3ebae 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -2443,3 +2443,50 @@ recognized_connection_string(const char *connstr)
 {
 	return uri_prefix_length(connstr) != 0 || strchr(connstr, '=') != NULL;
 }
+
+/*
+ * Return visible width of the string
+ */
+int
+visible_length(char *string)
+{
+	int		visible_length = 0;
+	char   *p = string;
+	char   *end = p + strlen(p);
+	bool	visible = true;
+
+	while (*p)
+	{
+#if defined(USE_READLINE) && defined(RL_PROMPT_START_IGNORE)
+		if (*p == RL_PROMPT_START_IGNORE)
+		{
+			visible = false;
+			++p;
+		}
+		else if (*p == RL_PROMPT_END_IGNORE)
+		{
+			visible = true;
+			++p;
+		}
+		else
+#endif
+		{
+			int		chlen,
+	chwidth;
+
+			chlen = PQmblen(p, pset.encoding);
+			if (p + chlen > end)
+break;
+
+			if (visible)
+			{
+chwidth = PQdsplen(p, pset.encoding);
+if (chwidth > 0)
+	visible_length += chwidth;
+			}
+
+			p += chlen;
+		}
+	}
+	return visible_length;
+}
diff --git a/src/bin/psql/common.h b/src/bin/psql/common.h
index 282a520116..56e28915ce 100644
--- a/src/bin/psql/common.h
+++ b/src/bin/psql/common.h
@@ -44,4 +44,5 @@ extern void expand_tilde(char **filename);
 
 extern bool recognized_connection_string(const char *connstr);
 
+extern int visible_length(char *string);
 #endif			/* COMMON_H */
diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c
index 41c6f21ecf..9c0e2c7567 100644
--- a/src/bin/psql/prompt.c
+++ b/src/bin/psql/prompt.c
@@ -347,46 +347,7 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
 
 	/* Compute the visible width of PROMPT1, for PROMPT2's %w */
 	if (prompt_string == pset.prompt1)
-	{
-		char	   *p = destination;
-		char	   *end = p + strlen(p);
-		bool		visible = true;
-
-		last_prompt1_width = 0;
-		while (*p)
-		{
-#if defined(USE_READLINE) && defined(RL_PROMPT_START_IGNORE)
-			if (*p == RL_PROMPT_START_IGNORE)
-			{
-visible = false;
-++p;
-			}
-			else if (*p == RL_PROMPT_END_IGNORE)
-			{
-visible = true;
-++p;
-			}
-			else
-#endif
-			{
-int			chlen,
-			chwidth;
-
-chlen = PQmblen(p, pset.encoding);
-if (p + chlen > end)
-	break;		/* Invalid string */
-
-if (visible)
-{
-	chwidth = PQdsplen(p, pset.encoding);
-	if (chwidth > 0)
-		last_prompt1_width += chwidth;
-}
-
-p += chlen;
-			}
-		}
-	}
+		last_prompt1_width = visible_length(destination);
 
 	return destination;
 }





Re: backup manifests

2019-11-19 Thread David Steele
On 11/19/19 5:00 AM, Rushabh Lathia wrote:
> 
> My colleague Suraj did testing and noticed the performance impact
> with the checksums.   On further testing, he found that specifically with
> sha its more of performance impact.  

We have found that SHA1 adds about 3% overhead when the backup is also
compressed (gzip -6), which is what most people want to do.  This
percentage goes down even more if the backup is being transferred over a
network or to an object store such as S3.

We judged that the lower collision rate of SHA1 justified the additional
expense.

That said, making SHA256 optional seems reasonable.  We decided not to
make our SHA1 checksums optional to reduce the test matrix and because
parallelism largely addressed performance concerns.

Regards,
-- 
-David
da...@pgmasters.net




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-11-19 Thread Alexander Korotkov
On Sun, Nov 17, 2019 at 9:18 PM Alexander Korotkov
 wrote:
> So, I've put this explanation into README patch.  I just change
> designation to better match with Lehman & Yao paper and did some minor
> enchantments.
>
> I'm going to push this patchset if no objections.

So, pushed with minor changes during backpatching.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-19 Thread Thomas Munro
On Wed, Nov 20, 2019 at 12:58 AM Justin Pryzby  wrote:
> < 2019-11-15 22:16:07.098 EST  >PANIC:  could not fsync file 
> "base/16491/1731839470.2": No such file or directory
> < 2019-11-15 22:16:08.751 EST  >LOG:  checkpointer process (PID 27388) was 
> terminated by signal 6: Aborted
>
> /dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered)
> Centos 7.7 qemu/KVM
> Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 
> x86_64 x86_64 x86_64 GNU/Linux
> There's no added tablespaces.
>
> Copying Thomas since I wonder if this is related:
> 3eb77eba Refactor the fsync queue for wider use.

It could be, since it changed some details about the way that queue
worked, and another relevant change is:

9ccdd7f6 PANIC on fsync() failure.

Perhaps we should not panic if we failed to open (not fsync) the file,
but it's not the root problem here which is that somehow we thought we
should be fsyncing a file that had apparently been removed already
(due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).  Usually, if
a file is in the fsync queue and then is later removed, we handle that
by searching the queue for cancellation messages (since one should
always be sent before the file is unlinked), and I think your core
file with "failures = 1" tells us that it didn't find a cancellation
message.  So this seems to indicate a problem, somewhere, in that
protocol.  That could either be a defect in 3eb77eba or it could have
been a pre-existing problem that became a bigger problem due to
9ccdd7f6.

Looking into it.
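
A minimal sketch of the ordering that protocol relies on, using the
PG 12 sync.c names from memory (the real call sites live in md.c; the
function below is purely illustrative):

#include "postgres.h"

#include <errno.h>
#include <unistd.h>

#include "storage/sync.h"

/*
 * Sketch only: queue a "forget" message before unlinking, so the
 * checkpointer drops any pending fsync request for the file.  If the
 * unlink happens without the forget message being seen, the
 * checkpointer can later PANIC trying to fsync a missing file.
 */
static void
drop_relation_segment(const FileTag *ftag, const char *path)
{
	/* 1. Cancel any queued fsync request for this file ... */
	RegisterSyncRequest(ftag, SYNC_FORGET_REQUEST, true /* retryOnError */ );

	/* 2. ... and only then remove the file itself. */
	if (unlink(path) < 0 && errno != ENOENT)
		elog(WARNING, "could not remove file \"%s\": %m", path);
}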

> I can't find any relation with filenode nor OID matching 1731839470, nor a 
> file
> named like that (which is maybe no surprise, since it's exactly the issue
> checkpointer had last week).
>
> A backup job would've started at 22:00 and probably would've run until 22:27,
> except that its backend was interrupted "because of crash of another server
> process".  That uses pg_dump --snapshot.
>
> This shows a gap of OIDs between 1721850297 and 1746136569; the tablenames
> indicate that would've been between 2019-11-15 01:30:03,192 and 04:31:19,348.
> |SELECT oid, relname FROM pg_class ORDER BY 1 DESC;

By the way, it's relfilenode, not oid, that is used in these names
(though they start out the same).  In a rewrite, the relfilenode
changes but the oid stays the same.
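
(As an aside: while the file still exists, the mapping can be checked
with the standard catalog and function below; nothing here is specific
to this thread.)

-- the number in "base/16491/1731839470.2" is a relfilenode
SELECT oid, relname, relfilenode
FROM pg_class
WHERE relfilenode = 1731839470;

-- or let the server do the mapping (0 means the default tablespace)
SELECT pg_filenode_relation(0, 1731839470);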

> Ah, I found a maybe relevant log:
> |2019-11-15 22:15:59.592-05 | duration: 220283.831 ms  statement: ALTER TABLE 
> child.eric_enodeb_cell_201811 ALTER pmradiothpvolul
>
> So we altered that table (and 100+ others) with a type-promoting alter,
> starting at 2019-11-15 21:20:51,942.  That involves DETACHing all but the most
> recent partitions, altering the parent, and then iterating over historic
> children to ALTER and reATTACHing them.  (We do this to avoid locking the 
> table
> for long periods, and to avoid worst-case disk usage).

Hmm.




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Thomas Munro
On Wed, Nov 20, 2019 at 7:58 AM Thomas Munro  wrote:
> On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha
> > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863
>
> !?!

One thing I don't understand (besides, apparently, the documentation):
how did this problem escape detection by check-world for such a long
time?  Surely we expect to hit the end of various temporary files in
various tests.  Is it intermittent, or dependent on Windows version,
or something like that?




planner support functions: handle GROUP BY estimates ?

2019-11-19 Thread Justin Pryzby
Tom implemented "Planner support functions":
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
https://www.postgresql.org/docs/12/xfunc-optimization.html

I wondered whether there was any consideration to extend that to allow
providing improved estimates of "group by".  That currently requires manually
creating an expression index, which is only possible if the function is
IMMUTABLE (not true for e.g. date_trunc of timestamptz).

ts=# explain analyze SELECT date_trunc('day', start_time) FROM 
child.alu_amms_201911 GROUP BY 1;
 HashAggregate  (cost=87.34..98.45 rows=889 width=8) (actual time=1.476..1.482 
rows=19 loops=1)

ts=# explain analyze SELECT date_trunc('year', start_time) FROM 
child.alu_amms_201911 GROUP BY 1;
 HashAggregate  (cost=87.34..98.45 rows=889 width=8) (actual time=1.499..1.500 
rows=1 loops=1)

ts=# CREATE INDEX ON child.alu_amms_201911 (date_trunc('year',start_time));
ts=# ANALYZE child.alu_amms_201911;
ts=# explain analyze SELECT date_trunc('year', start_time) FROM 
child.alu_amms_201911 GROUP BY 1;
 HashAggregate  (cost=87.34..87.35 rows=1 width=8) (actual time=1.414..1.414 
rows=1 loops=1)
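
For context, today's support-function hook is a C callback that answers
specific request nodes.  A rough sketch of its shape is below; note that
SupportRequestRows is currently only consulted for set-returning
functions, so the GROUP BY estimate wondered about here would need a new
request type.  The function name and the estimate are purely illustrative.

#include "postgres.h"
#include "fmgr.h"
#include "nodes/supportnodes.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(date_trunc_support);

Datum
date_trunc_support(PG_FUNCTION_ARGS)
{
	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
	Node	   *ret = NULL;

	if (IsA(rawreq, SupportRequestRows))
	{
		SupportRequestRows *req = (SupportRequestRows *) rawreq;

		req->rows = 365;		/* made-up estimate */
		ret = (Node *) req;
	}

	PG_RETURN_POINTER(ret);
}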




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Thomas Munro
On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha
 wrote:
> On Tue, Nov 19, 2019 at 12:49 PM Thomas Munro  wrote:
>> On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar  
>> wrote:
>> > On Windows, it is documented that ReadFile() (which is called by
>> > pg_pread) will return false on EOF but only when the file is open for
>> > asynchronous reads/writes. But here we are just dealing with usual
>> > synchronous reads. So pg_pread() code should indeed return 0 on EOF on
>> > Windows. Not yet able to figure out how FileRead() managed to return
>> > this error on Windows. But from your symptoms, it does look like
>> > pg_pread()=>ReadFile() returned false (despite doing asynchronous
>> > reads), and so _dosmaperr() gets called, and then it does not find the
>> > eof error in doserrors[], so the "unrecognized win32 error code"
>> > message is printed. May have to dig up more on this.
>>
>> Hmm.  See also this report:
>>
>> https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com
>>
>
> The files from pgwin32_open() are open for synchronous access, while 
> pg_pread() uses the asynchronous functionality to offset the read. Under 
> these circunstances, a read past EOF will return ERROR_HANDLE_EOF (38), as 
> explained in:

Oh, thanks.

> https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863

!?!

Amit, since it looks like you are Windows-enabled and have a repro,
would you mind confirming that this fixes the problem?

--- a/src/port/pread.c
+++ b/src/port/pread.c
@@ -41,6 +41,9 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
overlapped.Offset = offset;
if (!ReadFile(handle, buf, size, &result, &overlapped))
{
+   if (GetLastError() == ERROR_HANDLE_EOF)
+   return 0;
+
_dosmaperr(GetLastError());
return -1;
}




Re: Role membership and DROP

2019-11-19 Thread Tom Lane
Laurenz Albe  writes:
> On Fri, 2019-11-15 at 13:41 -0500, Tom Lane wrote:
>> Laurenz Albe  writes:
>>> On Wed, 2019-11-13 at 17:17 -0500, Tom Lane wrote:
 It might be worth clarifying this point in section 5.7,
 https://www.postgresql.org/docs/devel/ddl-priv.html

> I like your second sentence, but I think that "the right ... is inherent
> in being the ... owner" is unnecessarily complicated.
> Removing the "always" and "only" makes the apparent contradiction between
> the sentences less jarring to me.

I think it's important to emphasize that this is implicit in object
ownership.

Looking at the page again, I notice that there's a para a little further
down that overlaps quite a bit with what we're discussing here, but it's
about implicit grant options rather than the right to DROP.  In the
attached, I reworded that too, and moved it because it's not fully
intelligible until we've explained grant options.  Thoughts?

regards, tom lane

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 9d6ec2c..0be0774 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1578,8 +1578,10 @@ ALTER TABLE products RENAME TO items;
   
 
   
-   The right to modify or destroy an object is always the privilege of
-   the owner only.
+   The right to modify or destroy an object is inherent in being the
+   object's owner, and cannot be granted or revoked in itself.
+   (However, like all privileges, that right can be inherited by
+   members of the owning role; see .)
   
 
   
@@ -1614,17 +1616,11 @@ GRANT UPDATE ON accounts TO joe;
   
 
   
-   To revoke a privilege, use the fittingly named
+   To revoke a previously-granted privilege, use the fittingly named
 command:
 
 REVOKE ALL ON accounts FROM PUBLIC;
 
-   The special privileges of the object owner (i.e., the right to do
-   DROP, GRANT, REVOKE, etc.)
-   are always implicit in being the owner,
-   and cannot be granted or revoked.  But the object owner can choose
-   to revoke their own ordinary privileges, for example to make a
-   table read-only for themselves as well as others.
   
 
   
@@ -1639,6 +1635,13 @@ REVOKE ALL ON accounts FROM PUBLIC;
   
 
   
+   An object's owner can choose to revoke their own ordinary privileges,
+   for example to make a table read-only for themselves as well as others.
+   But owners are always treated as holding all grant options, so they
+   can always re-grant their own privileges.
+  
+
+  
The available privileges are:
 



Re: initdb SegFault

2019-11-19 Thread Tom Lane
Andres Freund  writes:
> Agreed wrt this specific failure scenario. It does however seem not
> great that callsites for PQExpBuffer ought to check every call for
> allocation failures, in the general case.

It is possible to check just once at the end, using the PQExpBufferBroken
API, and I believe that libpq for instance is fairly careful about that.
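
A minimal sketch of that style, using a stack buffer so the initial
allocation cannot return NULL and an OOM anywhere just marks the buffer
broken (the function and names are illustrative, not taken from initdb):

#include "postgres_fe.h"
#include "pqexpbuffer.h"

static char *
build_start_command(const char *pgdata)
{
	PQExpBufferData buf;
	char	   *result;

	initPQExpBuffer(&buf);
	appendPQExpBuffer(&buf, "pg_ctl -D %s", pgdata);
	appendPQExpBufferStr(&buf, " -l logfile start");

	if (PQExpBufferBroken(&buf))	/* one check covers all appends above */
	{
		termPQExpBuffer(&buf);
		return NULL;			/* caller decides how to report OOM */
	}

	result = pg_strdup(buf.data);	/* pg_strdup itself exits on OOM */
	termPQExpBuffer(&buf);
	return result;
}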

I agree that programs that just need to print something and exit could
perhaps ask pqexpbuffer.c to handle that for them.  (But initdb still
doesn't fall in that category, because of its very nontrivial atexit
handler :-(.)

> I wonder if, for frontend paths, a simplified error handling path would
> be worthwhile for OOM paths. Doing only a write() or such to print an
> error message.

Perhaps.  You wouldn't get any translation --- but then, gettext is
probably going to fail anyway under such conditions.

regards, tom lane




Re: initdb SegFault

2019-11-19 Thread Andres Freund
Hi,

On 2019-11-19 10:16:02 -0500, Tom Lane wrote:
> vignesh C  writes:
> > createPQExpBuffer allocates memory and returns the pointer, there is a
> > possibility that createPQExpBuffer can return NULL pointer in case of
> > malloc failiure, but initdb's main function does not check this
> > condition. During malloc failure when pointer is accessed it results
> > in segmentation fault. Made changes to check and exit if
> > createPQExpBuffer return's NULL pointer. Patch for the same is
> > attached.
> 
> I can't get excited about this, for several reasons.
> 
> 1) The probability of it happening in the field is not
> distinguishable from zero, surely.  I imagine you forced this
> failure by making a debugging malloc fail occasionally.

Agreed wrt this specific failure scenario. It does however seem not
great that callsites for PQExpBuffer ought to check every call for
allocation failures, in the general case.

I do think it'd be reasonable to move the cases where "graceful" dealing
with OOM isn't necessary ought to really use an interface that
internally errors out on memory allocation failures. Kinda thinking we
ought to slowly move such paths towards stringinfo...


> 2) If we really are out of memory at this point, we'd have just as good
> odds that some allocation request inside pg_log_error() would fail.
> There's no practical way to ensure that that code path remains free
> of malloc attempts.  (Not to mention cleanup_directories_atexit().)

I wonder if, for frontend paths, a simplified error handling path would
be worthwhile for OOM paths. Doing only a write() or such to print an
error message.

Greetings,

Andres Freund




Re: initdb SegFault

2019-11-19 Thread Tom Lane
vignesh C  writes:
> createPQExpBuffer allocates memory and returns the pointer, there is a
> possibility that createPQExpBuffer can return NULL pointer in case of
> malloc failiure, but initdb's main function does not check this
> condition. During malloc failure when pointer is accessed it results
> in segmentation fault. Made changes to check and exit if
> createPQExpBuffer return's NULL pointer. Patch for the same is
> attached.

I can't get excited about this, for several reasons.

1) The probability of it happening in the field is not
distinguishable from zero, surely.  I imagine you forced this
failure by making a debugging malloc fail occasionally.

2) If we really are out of memory at this point, we'd have just as good
odds that some allocation request inside pg_log_error() would fail.
There's no practical way to ensure that that code path remains free
of malloc attempts.  (Not to mention cleanup_directories_atexit().)

3) In the end, an initdb failure is an initdb failure.  This change
doesn't improve robustness by any useful metric, it just adds an
untestable code path.  If we could recover somehow, it'd be more
interesting to spend time on.

BTW, looking at the small minority of places that bother to test
for createPQExpBuffer failure, the correct test for that seems
to be PQExpBufferBroken().

regards, tom lane




Re: PITR on DROP DATABASE, deleting of the database directory despite the recovery_target_time set before.

2019-11-19 Thread Nicolas Lutic


On 11/19/19 1:40 AM, Craig Ringer wrote:
> On Mon, 18 Nov 2019 at 18:48, Nicolas Lutic  > wrote:
> 
> Dear Hackers,
> 
> After a drop database
> 
> 
> with FORCE?
No, we tested with PostgreSQL v 11 and we don't have this option.
>  
> 
> , he tried to recover the data on the last inserted transaction by
> using the recovery_target_time.
> The issue is the database is present in the system catalog but the
> directory was still deleted.
> Here the technical information of the database
> version 11
> default  postgresql.conf except for this options
>     wal_level = replica
>     archive_mode = on
>     archive_command = 'cp %p /tmp/wal_archive/%f '
>     log_statement = 'all'
>     log_min_messages = debug5
> 
>   
> The following method was used 
> 
>   * create cluster
> 
>   * create database
> 
>   * create 1 table 
> 
>   * create 1 index on 1 column
> 
>   * insert 1 rows
> 
>   * backup with pg_base_backup
> 
>   * insert 2 rows
> 
> autocommit? 

Yes, I forgot to mention it.

> 
>   * drop database
> 
> force?
>  
> 
>   *     Change recovery behaviour in that case to prevent all xact
> operation to perform until COMMIT timestamp is checked against
> recovery_time bound (but it seems to be difficult as
> state 
> https://www.postgresql.org/message-id/flat/20141125160629.GC21475%40msg.df7cb.de which
> also identifies the problem and tries to give some solutions. 
> Maybe another way, as a trivial guess (all apologises) is to
> buffer immediate xacts until we have the commit for each and
> apply the whole buffer xact once the timestamp known (and
> checked agains recovery_target_time value);
> 
>   *     The other way to improve this is to update PostgreSQL
> documentation  by specifying that recovery_target_time cannot be
> used in this case.There should be multiple places where it can
> be stated. The best one (if only one) seems to be in 
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=doc/src/sgml/config.sgml;h=
> 
> 
> 
> 
> If this only happens when a DB is dropped under load with force, I lean
> toward just documenting it as a corner case.

This can happen in the case of a non-transactional instruction; DROP
DATABASE (with or without FORCE) is one case, but there may be other cases?

The documentation modification has to mention this case and list the
other operations most likely to be affected.

An idea, without insider knowledge of the code, in the case of
recovery_target_time (only), would be to move forward each record for an
xact.

Each record that is «timestamped» can be applied, but once we encounter a
non-timestamped record we could buffer the following records for any
xacts until we see a timestamped commit/rollback for the transaction in
which that non-transactional op appears. Once the commit/rollback records
are found, there are two options:
1) the commit/rollback timestamp is inside the "replay" bound, then the
whole buffer can be applied;
2) the commit/rollback timestamp is beyond the upper time bound for
"replay", then the whole buffer for that transaction could be canceled.
Can this only be done for the DROP DATABASE "DELETE" operation?
Maybe this will lead to skewed pages and is the wrong way to do such
a thing.

Another assumption is that the "DROP DATABASE" sequence could be changed
so that this operation performs correctly.

We are aware that this part is tricky and will have little effect on
normal operations, as best practice is to use xid_target or lsn_target.
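
A sketch of that safer practice, in v11 recovery.conf terms (the actual
parameter names are recovery_target_xid and recovery_target_lsn; the
values below are placeholders):

restore_command = 'cp /tmp/wal_archive/%f %p'
# prefer an exact stopping point over a timestamp:
recovery_target_lsn = '0/16B3F58'
# or: recovery_target_xid = '1234'
recovery_target_inclusive = false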


> 
> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/
>  2ndQuadrant - PostgreSQL Solutions for the Enterprise

Best regards

-- 
LOXODATA https://www.loxodata.com/
Consulting - Training - Support
Nicolas Lutic
Consultant trainer





initdb SegFault

2019-11-19 Thread vignesh C
Hi,

While checking initdb code, I found one segmentation fault, stack
trace for the same is:
Core was generated by `./initdb -D data6'.
Program terminated with signal 11, Segmentation fault.
#0  0x0040ea22 in main (argc=3, argv=0x7ffc82237308) at initdb.c:3340
3340            printf(_("\nSuccess. You can now start the database server
using:\n\n"

Analysis for the same is given below:
createPQExpBuffer allocates memory and returns a pointer; there is a
possibility that createPQExpBuffer can return a NULL pointer in case of
malloc failure, but initdb's main function does not check this
condition. On malloc failure, when the pointer is accessed it results
in a segmentation fault. Made changes to check and exit if
createPQExpBuffer returns a NULL pointer. A patch for the same is
attached.

Let me know your thoughts for the same. Similar issue exists in few
other places, if changes are ok, I can check and fix the issue in
other places also.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From c1787a74deeb8b0162684219136819a36a771e3e Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Tue, 19 Nov 2019 19:48:38 +0530
Subject: [PATCH] initdb crash fix when createPQExpBuffer returns NULL pointer.

createPQExpBuffer allocates memory and returns a pointer; there is a
possibility that createPQExpBuffer can return a NULL pointer in case of malloc
failure, but initdb's main function does not check this condition. Made
changes to check and exit if createPQExpBuffer returns a NULL pointer.
---
 src/bin/initdb/initdb.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 88a261d..a428a91 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -3318,6 +3318,11 @@ main(int argc, char *argv[])
 	 * Build up a shell command to tell the user how to start the server
 	 */
 	start_db_cmd = createPQExpBuffer();
+	if (!start_db_cmd)
+	{
+		pg_log_error("out of memory");
+		exit(1);
+	}
 
 	/* Get directory specification used to start initdb ... */
 	strlcpy(pg_ctl_path, argv[0], sizeof(pg_ctl_path));
-- 
1.8.3.1



Re: Planning counters in pg_stat_statements (using pgss_store)

2019-11-19 Thread Julien Rouhaud
On Fri, Nov 15, 2019 at 2:00 AM imai.yoshik...@fujitsu.com
 wrote:
>
> Actually I also don't have strong opinion but I thought someone would 
> complain about renaming of those columns and also some tools like monitoring 
> which use those columns will not work. If we use {total, min, max, mean, 
> stddev}_time, someone might mistakenly understand {total, min, max, mean, 
> stddev}_time mean {total, min, max, mean, stddev} of planning and execution.
> If I need to choose {total, min, max, mean, stddev}_time or {total, min, max, 
> mean, stddev}_exec_time, I choose former one because choosing best name is 
> not worth destructing the existing scripts or tools.

We could definitely keep (plan|exec)_time for the SRF, and have the
{total, min, max, mean, stddev}_time columns created by the view be the
sum of planning + execution for each counter; that doesn't sound like a
bad idea if having even more columns in the view is not an issue.
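
A rough SQL sketch of that idea; the *_plan_time / *_exec_time column
names are assumptions here, not necessarily what the patch ends up with:

CREATE VIEW pg_stat_statements AS
    SELECT userid, dbid, queryid, query,
           total_plan_time + total_exec_time AS total_time,
           total_plan_time,
           total_exec_time,
           calls, rows
    FROM pg_stat_statements(true);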




Re: physical slot xmin dependency on logical slot?

2019-11-19 Thread Jeremy Finzel
>
> I expect that you created the replica in a manner that preserved the
> logical replication slot on it. You also had hot_standby_feedback enabled.
>

As per both you and Andres' replies, we wanted the backup to have the
logical slots on it, because we wanted to allow decoding from the slots on
our backup.  However, what we should have done is drop the slot of the
backup on the master.


> PostgreSQL standbys send the global xmin and (in Pg10+) catalog_xmin to
> the upstream when hot_standby_feedback is enabled. If there's a slot
> holding the catalog_xmin on the replica down, that'll be passed on via
> hot_standby_feedback to the upstream. On Pg 9.6 or older, or if the replica
> isn't using a physical replication slot, the catalog_xmin is treated as a
> regular xmin since there's nowhere in PGPROC or PGXACT to track the
> separate catalog_xmin. If the standby uses a physical slot, then on pg10+
> the catalog_xmin sent by the replica is stored as the catalog_xmin on the
> physical slot instead.
>
> Either way, if you have hot_standby_feedback enabled on a standby, that
> feedback includes the requirements of any replication slots on the standby.
>

Thank you for the thorough explanation.  As I noted in my reply to Andres,
we routinely and intentionally create snapshots with replication slots
intact (but we normally drop the slot on the master immediately), so our
own use case is rare and it's not surprising that we don't find a thorough
explanation of this scenario in the docs.
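
For anyone diagnosing the same situation, the slot-driven horizons are
visible on the primary with a standard catalog view (nothing below is
specific to this setup):

-- a physical slot with an old catalog_xmin is what holds back catalog
-- vacuuming when a standby relays its own slots via hot_standby_feedback
SELECT slot_name, slot_type, active, xmin, catalog_xmin, restart_lsn
FROM pg_replication_slots;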

Thanks,
Jeremy


Re: backup manifests

2019-11-19 Thread Andrew Dunstan


On 11/19/19 5:00 AM, Rushabh Lathia wrote:
>
>
> My colleague Suraj did testing and noticed the performance impact
> with the checksums.   On further testing, he found that specifically with
> sha its more of performance impact.  
>
>

I admit I haven't been following along closely, but why do we need a
cryptographic checksum here instead of, say, a CRC? Do we think that
somehow the checksum might be forged? Use of cryptographic hashes as
general purpose checksums has become far too common IMNSHO.
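
For reference, the tree already ships shared CRC-32C routines that could
serve that purpose; a minimal sketch (not part of any patch in this
thread):

#include "postgres_fe.h"

#include "port/pg_crc32c.h"

static uint32
checksum_buffer_crc32c(const char *buf, size_t len)
{
	pg_crc32c	crc;

	INIT_CRC32C(crc);
	COMP_CRC32C(crc, buf, len);
	FIN_CRC32C(crc);

	return crc;
}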


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Hypothetical indexes using BRIN broken since pg10

2019-11-19 Thread Michael Paquier
On Tue, Nov 19, 2019 at 08:37:04AM +0100, Julien Rouhaud wrote:
> None from me.  I'm obviously biased, but I hope that it can get
> backpatched.  BRIN is probably seldom used, but we shouldn't make it
> harder to use it, even if it's that's only for hypothetical usage, and
> even if it'll still be quite inexact.

Re-reading the thread.  Any design change should IMO just happen on
master so that we don't take any risks with potential ABI breakage.
Even if there is not much field demand for it, that's not worth the
risk.  Thinking harder, I don't actually quite see why it would be an
issue to provide default stats for a hypothetical BRIN index using
the best estimations we can do down to 10 with the infra in
place.  Taking the case of hypopg, one finishes with an annoying
"could not open relation with OID %u", which is not that nice from the
user perspective.  Let's wait a bit and see if others have more
arguments to offer.
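
For context, the hypopg usage in question looks roughly like this (table
and query are illustrative; hypopg_create_index() is the extension's
documented entry point):

CREATE EXTENSION hypopg;
SELECT * FROM hypopg_create_index(
    'CREATE INDEX ON measurements USING brin (logdate)');
EXPLAIN SELECT * FROM measurements
WHERE logdate > now() - interval '1 day';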
--
Michael


signature.asc
Description: PGP signature


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Juan José Santamaría Flecha
On Tue, Nov 19, 2019 at 12:49 PM Thomas Munro 
wrote:

> On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar 
> wrote:
> > On Tue, 19 Nov 2019 at 14:07, Amit Kapila 
> wrote:
> > > No, I got this before applying the patch.  However, after applying the
> > > patch, I got below error in the same test:
> > >
> > > postgres=# SELECT 1 from
> > > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> > > ERROR:  could not read from reorderbuffer spill file: Invalid argument
> > >
> > > It seems to me that FileRead API used in the patch can return value <
> > > 0 on  EOF.  See the API usage in BufFileLoadBuffer.  I got this error
> > > on a windows machine and in the server log the message was "LOG:
> > > unrecognized win32 error code: 38" which indicates "Reached the end of
> > > the file."
> >
> > On Windows, it is documented that ReadFile() (which is called by
> > pg_pread) will return false on EOF but only when the file is open for
> > asynchronous reads/writes. But here we are just dealing with usual
> > synchronous reads. So pg_pread() code should indeed return 0 on EOF on
> > Windows. Not yet able to figure out how FileRead() managed to return
> > this error on Windows. But from your symptoms, it does look like
> > pg_pread()=>ReadFile() returned false (despite doing asynchronous
> > reads), and so _dosmaperr() gets called, and then it does not find the
> > eof error in doserrors[], so the "unrecognized win32 error code"
> > message is printed. May have to dig up more on this.
>
> Hmm.  See also this report:
>
>
> https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com
>
>
The files from pgwin32_open() are open for synchronous access, while
pg_pread() uses the asynchronous functionality to offset the read. Under
these circumstances, a read past EOF will return ERROR_HANDLE_EOF (38), as 
explained in:

https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863


Regards,

Juan José Santamaría Flecha


checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-19 Thread Justin Pryzby
I (finally) noticed this morning on a server running PG12.1:

< 2019-11-15 22:16:07.098 EST  >PANIC:  could not fsync file 
"base/16491/1731839470.2": No such file or directory
< 2019-11-15 22:16:08.751 EST  >LOG:  checkpointer process (PID 27388) was 
terminated by signal 6: Aborted

/dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered)
Centos 7.7 qemu/KVM
Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux
There's no added tablespaces.

Copying Thomas since I wonder if this is related:
3eb77eba Refactor the fsync queue for wider use.

I can't find any relation with filenode nor OID matching 1731839470, nor a file
named like that (which is maybe no surprise, since it's exactly the issue
checkpointer had last week).

A backup job would've started at 22:00 and probably would've run until 22:27,
except that its backend was interrupted "because of crash of another server
process".  That uses pg_dump --snapshot.

This shows a gap of OIDs between 1721850297 and 1746136569; the tablenames
indicate that would've been between 2019-11-15 01:30:03,192 and 04:31:19,348.
|SELECT oid, relname FROM pg_class ORDER BY 1 DESC;

Ah, I found a maybe relevant log:
|2019-11-15 22:15:59.592-05 | duration: 220283.831 ms  statement: ALTER TABLE 
child.eric_enodeb_cell_201811 ALTER pmradiothpvolul

So we altered that table (and 100+ others) with a type-promoting alter,
starting at 2019-11-15 21:20:51,942.  That involves DETACHing all but the most
recent partitions, altering the parent, and then iterating over historic
children to ALTER and reATTACHing them.  (We do this to avoid locking the table
for long periods, and to avoid worst-case disk usage).

FYI, that server ran PG12.0 since Oct 7 with no issue.
I installed pg12.1 at:
$ ps -O lstart 27384
  PID  STARTED S TTY  TIME COMMAND
27384 Fri Nov 15 08:13:08 2019 S ?00:05:54 /usr/pgsql-12/bin/postmaster 
-D /var/lib/pgsql/12/data/

Core was generated by `postgres: checkpointer '.
Program terminated with signal 6, Aborted.

(gdb) bt
#0  0x7efc9d8b3337 in raise () from /lib64/libc.so.6
#1  0x7efc9d8b4a28 in abort () from /lib64/libc.so.6
#2  0x0087752a in errfinish (dummy=) at elog.c:552
#3  0x0075c8ec in ProcessSyncRequests () at sync.c:398
#4  0x00734dd9 in CheckPointBuffers (flags=flags@entry=256) at 
bufmgr.c:2588
#5  0x005095e1 in CheckPointGuts (checkPointRedo=26082542473320, 
flags=flags@entry=256) at xlog.c:9006
#6  0x0050ff86 in CreateCheckPoint (flags=flags@entry=256) at 
xlog.c:8795
#7  0x006e4092 in CheckpointerMain () at checkpointer.c:481
#8  0x0051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, 
argv=argv@entry=0x7ffda8a24340) at bootstrap.c:461
#9  0x006ee680 in StartChildProcess (type=CheckpointerProcess) at 
postmaster.c:5392
#10 0x006ef9ca in reaper (postgres_signal_arg=) at 
postmaster.c:2973
#11 
#12 0x7efc9d972933 in __select_nocancel () from /lib64/libc.so.6
#13 0x004833d4 in ServerLoop () at postmaster.c:1668
#14 0x006f106f in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x1601280) at postmaster.c:1377
#15 0x00484cd3 in main (argc=3, argv=0x1601280) at main.c:228

bt f:
#3  0x0075c8ec in ProcessSyncRequests () at sync.c:398
path = 
"base/16491/1731839470.2\000m\000e\001\000\000\000\000hЭи\027\000\000\032\364q\000\000\000\000\000\251\202\214\000\000\000\000\000\004\000\000\000\000\000\000\000\251\202\214\000\000\000\000\000@S`\001\000\000\000\000\000>\242\250\375\177\000\000\002\000\000\000\000\000\000\000\340\211b\001\000\000\000\000\002\346\241\000\000\000\000\000\200>\242\250\375\177\000\000\225ኝ\374~\000\000C\000_US.UTF-8\000\374~\000\000\000\000\000\000\000\000\000\000\306ފ\235\374~\000\000LC_MESSAGES/postgres-12.mo\000\000\000\000\000\000\001\000\000\000\000\000\000\000.ފ\235\374~"...
failures = 1
sync_in_progress = true
hstat = {hashp = 0x1629e00, curBucket = 122, curEntry = 0x0}
entry = 0x1658590
absorb_counter = 
processed = 43





Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-11-19 Thread Amit Kapila
On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila  wrote:
>
> On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila  wrote:
> >
> > Some notes before commit:
> > --
> > 1.
> > Commit message need to be changed for the first patch
> > -
> > A.
> > > The memory limit is defined by a new logical_decoding_work_mem GUC, so 
> > > for example we can do this
> >
> > SET logical_decoding_work_mem = '128kB'
> >
> > > to trigger very aggressive streaming. The minimum value is 64kB.
> >
> > I think this patch doesn't contain streaming, so we either need to
> > reword it or remove it.
> >
> > B.
> > > The logical_decoding_work_mem may be set either in postgresql.conf, in 
> > > which case it serves as the default for all publishers on that instance, 
> > > or when creating the
> > > subscription, using a work_mem paramemter in the WITH clause (specifies 
> > > number of kilobytes).
> >
> > We need to reword this as we have decided to remove the setting from
> > the subscription side as of now.
> >
> > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
> >
>
> I have made these modifications and additionally ran pgindent.
>
> > 4. I think we can combine both patches and commit as one patch, but it
> > is okay to commit them separately as well.
> >
>
> I am not sure if this is a good idea, so still kept them as separate.
>

I have committed the first patch.  I will commit the second one
related to stats of spilled xacts on Thursday.  The second patch needs
catalog version bump as well because we are modifying the catalog
contents in that patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-11-19 Thread Amit Kapila
On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar  wrote:
>
> On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila  wrote:
> >
> > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar  wrote:
> > >
> > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila  
> > > wrote:
> > > >
> > > >
> > > > Few other comments on this patch:
> > > > 1.
> > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > +
> > > > + /*
> > > > + * Execute the invalidation message locally.
> > > > + *
> > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > + * the other fields added to ReorderBufferChange, or just
> > > > + * about the message itself?
> > > > + */
> > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > + break;
> > > >
> > > > Here, why are we executing messages individually?  Can't we just
> > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > receive any invalidation message?
>
> I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> commit.  Because this is required to add any committed transaction to
> the snapshot if it has done any catalog changes.
>

Hmm, this is also used to build cid hash map (see
ReorderBufferBuildTupleCidHash) which we need to use while streaming
changes for the in-progress transactions.  So, I think that it would
be required earlier (before commit) as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Thomas Munro
On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar  wrote:
> On Tue, 19 Nov 2019 at 14:07, Amit Kapila  wrote:
> > No, I got this before applying the patch.  However, after applying the
> > patch, I got below error in the same test:
> >
> > postgres=# SELECT 1 from
> > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> > ERROR:  could not read from reorderbuffer spill file: Invalid argument
> >
> > It seems to me that FileRead API used in the patch can return value <
> > 0 on  EOF.  See the API usage in BufFileLoadBuffer.  I got this error
> > on a windows machine and in the server log the message was "LOG:
> > unrecognized win32 error code: 38" which indicates "Reached the end of
> > the file."
>
> On Windows, it is documented that ReadFile() (which is called by
> pg_pread) will return false on EOF but only when the file is open for
> asynchronous reads/writes. But here we are just dealing with usual
> synchronous reads. So pg_pread() code should indeed return 0 on EOF on
> Windows. Not yet able to figure out how FileRead() managed to return
> this error on Windows. But from your symptoms, it does look like
> pg_pread()=>ReadFile() returned false (despite doing asynchronous
> reads), and so _dosmaperr() gets called, and then it does not find the
> eof error in doserrors[], so the "unrecognized win32 error code"
> message is printed. May have to dig up more on this.

Hmm.  See also this report:

https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Amit Khandekar
On Tue, 19 Nov 2019 at 14:07, Amit Kapila  wrote:
>
> On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar  wrote:
> >
> > On Mon, 18 Nov 2019 at 17:20, Amit Kapila  wrote:
> > > I see that you have made changes in ReorderBufferRestoreChanges to use
> > > PathNameOpenFile, but not in ReorderBufferSerializeTXN.  Is there a
> > > reason for the same?  In my test environment, with the test provided
> > > by you, I got the error (reported in this thread) via
> > > ReorderBufferSerializeTXN.
> >
> > You didn't get this error with the patch applied, did you ?
> >
>
> No, I got this before applying the patch.  However, after applying the
> patch, I got below error in the same test:
>
> postgres=# SELECT 1 from
> pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> ERROR:  could not read from reorderbuffer spill file: Invalid argument
>
> It seems to me that FileRead API used in the patch can return value <
> 0 on  EOF.  See the API usage in BufFileLoadBuffer.  I got this error
> on a windows machine and in the server log the message was "LOG:
> unrecognized win32 error code: 38" which indicates "Reached the end of
> the file."

On Windows, it is documented that ReadFile() (which is called by
pg_pread) will return false on EOF but only when the file is open for
asynchronous reads/writes. But here we are just dealing with usual
synchronous reads. So pg_pread() code should indeed return 0 on EOF on
Windows. Not yet able to figure out how FileRead() managed to return
this error on Windows. But from your symptoms, it does look like
pg_pread()=>ReadFile() returned false (despite doing asynchronous
reads), and so _dosmaperr() gets called, and then it does not find the
eof error in doserrors[], so the "unrecognized win32 error code"
message is printed. May have to dig up more on this.


>
> > If you were debugging this without the patch applied, I suspect that
> > the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is
> > generating this error is because the max limit must be already crossed
> > because of earlier calls to ReorderBufferRestoreChanges().
> >
> > Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is
> > sufficient because the code in that function has made sure the fd gets
> > closed there itself.
> >
>
> Okay, then we might not need it there, but we should at least add a
> comment in ReorderBufferRestoreChanges to explain why we have used a
> different function to operate on the file at that place.

Yeah, that might make sense.

>
> >
> > For the API's that use VFDs (like PathNameOpenFile), the files opened
> > are always recorded in the VfdCache array. So it is not required to do
> > the cleanup at (sub)transaction end, because the kernel fds get closed
> > dynamically in ReleaseLruFiles() whenever they reach max_safe_fds
> > limit. So if a transaction aborts, the fds might remain open, but
> > those will get cleaned up whenever we require more fds, through
> > ReleaseLruFiles(). Whereas, for files opened through
> > OpenTransientFile(), VfdCache is not involved, so this needs
> > transaction end cleanup.
> >
>
> Have you tried by injecting some error?  After getting the error
> mentioned above in email, when I retried the same query, I got the
> below message.
>
> postgres=# SELECT 1 from
> pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> ERROR:  could not remove file
> "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during
> removal of pg_replslot/regression_slot/xid*: Permission denied
>
> And, then I tried to drop the replication slot and I got below error.
> postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot');
> ERROR:  could not rename file "pg_replslot/regression_slot" to
> "pg_replslot/regression_slot.tmp": Permission denied
>
> It might be something related to Windows

Oh ok, I missed the fact that on Windows we can't delete the files
that are already open, unlike Linux/Unix.
I guess, I may have to use FD_CLOSE_AT_EOXACT flags; or simply use
OpenTemporaryFile(). I wonder though if this same issue might come up
for the other use-case of PathNameOpenFile() :
logical_rewrite_log_mapping().

> but you can once try by
> injecting some error after reading a few files in the code path and
> see the behavior.
Yeah, will check the behaviour, although on Linux, I think I won't get
this error. But yes, like I mentioned above, I think we might have to
arrange for something.



--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Protect syscache from bloating with negative cache entries

2019-11-19 Thread Kyotaro Horiguchi
I'd like to throw in food for discussion on how much SearchSysCacheN
suffers degradation from some choices of how we can insert code into
the SearchSysCacheN code path.

I ran the run2.sh script attached, which runs catcachebench2(), which
asks SearchSysCache3() for cached entries (almost) 24 times per
run.  Each number shown is the mean of 3 runs, with the
stddev. Lines are in "time" order and edited to fit here. "gen_tbl.pl
| psql" creates a database for the benchmark. catcachebench2() runs
the shortest of the three paths in the attached benchmark program.

(pg_ctl start)
$ perl gen_tbl.pl | psql ...
(pg_ctl stop)


0. Baseline (0001-benchmark.patch, 0002-Base-change.patch)

At first, I made two binaries from literally the same source. For the
benchmark's sake the source is already modified a bit: specifically, it
has the SetCatCacheClock needed by the benchmark, which is actually not
called in this benchmark.


  time(ms)|stddev(ms)
not patched | 7750.42 |  23.83   # 0.6% faster than 7775.23
not patched | 7864.73 |  43.21
not patched | 7866.80 | 106.47
not patched | 7952.06 |  63.14
master  | 7775.23 |  35.76
master  | 7870.42 | 120.31
master  | 7876.76 | 109.04
master  | 7963.04 |   9.49

So, it seems to me that we cannot say anything about differences
below about 80ms (about 1%) for now.


1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch)

 This is the most straightforward way to add an alternative feature.

pattern 1 | 8459.73 |  28.15  # 9% (>> 1%) slower than 7757.58
pattern 1 | 8504.83 |  55.61
pattern 1 | 8541.81 |  41.56
pattern 1 | 8552.20 |  27.99
master| 7757.58 |  22.65
master| 7801.32 |  20.64
master| 7839.57 |  25.28
master| 7925.30 |  38.84

 It's so slow that it cannot be used.


2. Making SearchCatCacheInternal be an indirect function.
   (CatCache_Pattern_2.patch)

Next, I made the workhorse routine be called indirectly. The "inline"
on the function actually lets the compiler optimize the SearchCatCacheN
routines as described in the comment, but the effect doesn't seem so
large, at least for this case.

pattern 2 | 7976.22 |  46.12  (2.6% slower > 1%)
pattern 2 | 8103.03 |  51.57
pattern 2 | 8144.97 |  68.46
pattern 2 | 8353.10 |  34.89
master| 7768.40 |  56.00
master| 7772.02 |  29.05
master| 7775.05 |  27.69
master| 7830.82 |  13.78


3. Making SearchCatCacheN be indirect functions. (CatCache_Pattern_3.patch)

As far as gcc/linux/x86 goes, SearchSysCacheN is compiled into the
following instructions:

 0x00866c20 <+0>:   movslq %edi,%rdi
 0x00866c23 <+3>:   mov0xd3da40(,%rdi,8),%rdi
 0x00866c2b <+11>:  jmpq   0x856ee0 

If we made SearchCatCacheN be indirect functions as the patch, it
changes just one instruction as:

 0x00866c50 <+0>:   movslq %edi,%rdi
 0x00866c53 <+3>:   mov0xd3da60(,%rdi,8),%rdi
 0x00866c5b <+11>:  jmpq   *0x4c0caf(%rip) # 0xd27910 


pattern 3 | 7836.26 |  48.66 (2% slower > 1%)
pattern 3 | 7963.74 |  67.88
pattern 3 | 7966.65 | 101.07
pattern 3 | 8214.57 |  71.93
master| 7679.74 |  62.20
master| 7756.14 |  77.19
master| 7867.14 |  73.33
master| 7893.97 |  47.67

I expected this to run in almost the same time. I'm not sure if it is
the result of the spectre_v2 mitigation, but I show the status of my
environment below.


# uname -r
4.18.0-80.11.2.el8_0.x86_64
# cat /proc/cpuinfo
...
model name  : Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
stepping: 12
microcode   : 0xae
bugs: spectre_v1 spectre_v2 spec_store_bypass mds
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: 
disabled, RSB filling


I am using CentOS 8 and I haven't found a handy (or on-the-fly) way to
disable them.

Attached are:

0001-benchmark.patch: catcache benchmark extension (and core side fix)
0002-Base-change.patch  : baseline change in this series of benchmark
CatCache_Pattern_1.patch: naive branching
CatCache_Pattern_2.patch: indirect SearchCatCacheInternal
CatCache_Pattern_3.patch: indirect SearchCatCacheN

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 245e88e1b43df74273fbaa1b22f4f64621ffe9d5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 14 Nov 2019 19:24:36 +0900
Subject: [PATCH 1/2] benchmark

---
 contrib/catcachebench/Makefile   |  17 +
 contrib/catcachebench/catcachebench--0.0.sql |  14 +
 contrib/catcachebench/catcachebench.c| 330 +++
 contrib/catcachebench/catcachebench.control  |   6 +
 src/backend/utils/cache/catcache.c   |  33 ++
 src/backend/utils/cache/syscache.c   |   2 +-
 6 files changed, 401 insertions(+), 1 deletion(-)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 

Re: fix for BUG #3720: wrong results at using ltree

2019-11-19 Thread Benjie Gillam
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, failed

This is my first PostgreSQL commitfest and review, guidance welcome.

This patch is straightforward, it applies cleanly, and it includes tests (I've 
also tested the feature manually).

The (existing) documentation states "The length of a label path must be less 
than 65kB"; I believe the 65kB mentioned here should instead be 64kB - 
perhaps the patch could be updated with this single-character fix? At first I 
thought the 65kB limit applied to the label path string (e.g. 
'Top.Countries.Europe.Russia' would be 27 bytes), but it seems the limit 
applies to the number of labels in the path - perhaps `kB` is not the right 
unit here and it should explicitly state 65536?

It is not stated in the documentation what should happen if the label path 
length is greater than 65535, so raising an error makes sense (but may be a 
breaking change).

The new status of this patch is: Waiting on Author


Re: backup manifests

2019-11-19 Thread Rushabh Lathia
My colleague Suraj did testing and noticed the performance impact
of the checksums.  On further testing, he found that SHA specifically
has more of a performance impact.

Please find below statistics:

10 tables (100 MB each):
  no checksum:  real 0m10.957s   user 0m0.367s   sys 0m2.275s
  SHA-256:      real 0m16.816s   user 0m0.210s   sys 0m2.067s   (53% overhead)
  MD5:          real 0m11.895s   user 0m0.174s   sys 0m1.725s   (8% overhead)
  CRC:          real 0m11.136s   user 0m0.365s   sys 0m2.298s   (2% overhead)

20 tables (100 MB each):
  no checksum:  real 0m20.610s   user 0m0.484s   sys 0m3.198s
  SHA-256:      real 0m31.745s   user 0m0.569s   sys 0m4.089s   (54% overhead)
  MD5:          real 0m22.717s   user 0m0.638s   sys 0m4.026s   (10% overhead)
  CRC:          real 0m21.075s   user 0m0.538s   sys 0m3.417s   (2% overhead)

50 tables (100 MB each):
  no checksum:  real 0m49.143s   user 0m1.646s   sys 0m8.499s
  SHA-256:      real 1m13.683s   user 0m1.305s   sys 0m10.541s  (50% overhead)
  MD5:          real 0m51.856s   user 0m0.932s   sys 0m7.702s   (6% overhead)
  CRC:          real 0m49.689s   user 0m1.028s   sys 0m6.921s   (1% overhead)

100 tables (100 MB each):
  no checksum:  real 1m34.308s   user 0m2.265s   sys 0m14.717s
  SHA-256:      real 2m22.403s   user 0m2.613s   sys 0m20.776s  (51% overhead)
  MD5:          real 1m41.524s   user 0m2.158s   sys 0m15.949s  (8% overhead)
  CRC:          real 1m35.045s   user 0m2.061s   sys 0m16.308s  (1% overhead)

100 tables (1 GB each):
  no checksum:  real 17m18.336s  user 0m20.222s  sys 3m12.960s
  SHA-256:      real 24m45.942s  user 0m26.911s  sys 3m33.501s  (43% overhead)
  MD5:          real 17m41.670s  user 0m26.506s  sys 3m18.402s  (2% overhead)
  CRC:          real 17m22.296s  user 0m26.811s  sys 3m56.653s  (approx. 0.5%
                overhead; sometimes this test completes in the same time as
                without checksum)


Considering the above results, I modified Robert's earlier patch and
added a "manifest_with_checksums" option to pg_basebackup.  With the new
patch, checksums are disabled by default and are only enabled when the
"manifest_with_checksums" option is provided.  I also re-based the whole
patch set.



Regards,

-- 
Rushabh Lathia
www.EnterpriseDB.com

On Tue, Oct 1, 2019 at 5:43 PM Robert Haas  wrote:

> On Mon, Sep 30, 2019 at 5:31 AM Jeevan Chalke
>  wrote:
> > Entry for directory is not added in manifest. So it might be difficult
> > at client to get to know about the directories. Will it be good to add
> > an entry for each directory too? May be like:
> > Dir 
>
> Well, what kind of corruption would this allow us to detect that we
> can't detect as things stand? I think the only case is an empty
> directory. If it's not empty, we'd have some entries for the files in
> that directory, and those files won't be able to exist unless the
> directory does. But, how would we end up backing up an empty
> directory, anyway?
>
> I don't really *mind* adding directories into the manifest, but I'm
> not sure how much it helps.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>

-- 
Rushabh Lathia
From 75bc29edb4d84697901d257ac98514c186c29508 Mon Sep 17 00:00:00 2001
From: Rushabh Lathia 
Date: Wed, 13 Nov 2019 15:19:22 +0530
Subject: [PATCH] Reduce code duplication and eliminate weird macro tricks.

---
 src/bin/pg_basebackup/pg_basebackup.c | 1005 +
 1 file changed, 507 insertions(+), 498 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index a9d162a..0565212 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -57,6 +57,40 @@ typedef struct TablespaceList
 	TablespaceListCell *tail;
 } TablespaceList;
 
+typedef struct WriteTarState
+{
+	int			tablespacenum;
+	char		filename[MAXPGPATH];
+	FILE	   *tarfile;
+	char		tarhdr[512];
+	bool		basetablespace;
+	bool		in_tarhdr;
+	bool		skip_file;
+	bool		is_recovery_guc_supported;
+	bool		is_postgresql_auto_conf;
+	bool		found_postgresql_auto_conf;
+	int			file_padding_len;
+	size_t		tarhdrsz;
+	pgoff_t		filesz;
+#ifdef HAVE_LIBZ
+	gzFile		ztarfile;
+#endif
+}			WriteTarState;
+
+typedef struct UnpackTarState
+{
+	int			tablespacenum;
+	char		current_path[MAXPGPATH];
+	char		filename[MAXPGPATH];
+	const char *mapped_tblspc_path;
+	pgoff_t		current_len_left;
+	int			current_padding;
+	FILE	   *file;
+}			UnpackTarState;
+
+typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
+   void *callback_data);
+
 /*
  * pg_xlog has been renamed to pg_wal in version 10.  This version number
  * should be compared with PQserverVersion().
@@ -142,7 +176,10 @@ static void verify_dir_is_empty_or_create(char *dirname, bool *created, bool *fo
 static void progress_report(int tablespacenum, const char *filename, bool force);
 
 static void ReceiveTarFile(PGconn *conn, PGresult *res, int rownum);
+static void ReceiveTarCopyChunk(size_t r, char *copybuf, void *callback_data);
 static void ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum);
+static void ReceiveTarAndUnpackCopyChunk(size_t r, char *copybuf,
+		 void *callback_data);
 static void BaseBackup(void);
 
 static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
@@ -874,42 +911,78 @@ parse_max_rate(char *src)
 }
 
 /*
+ * 

Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

2019-11-19 Thread Jinbao Chen
I think we have the same understanding of this issue.

Sometimes using a smaller cost for scanning the chain in a bucket, like
below, would be better.
run_cost += outer_path_rows * some_small_probe_cost;
run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
In some versions of Greenplum (a database based on Postgres), we just
disabled the cost of scanning the bucket chain. In most cases this gives
a better query plan. But I am worried that it will be worse in some cases.

The planner only gets the wrong plan when the small table's distinct count
is much smaller than the bucket number, and much smaller than the distinct
count of the large table.

For example, suppose the inner table has 100 distinct values and 3000
rows, the hash table has 1000 buckets, and the outer table has 1 distinct
values. We can assume that all 100 distinct values of the inner table are
included in the 1 distinct values of the outer table. So
(100/1)*outer_rows tuples will probe buckets that have chains, and
(9900/1)*outer_rows tuples will probe all the 1000 buckets randomly, so
(9900/1)*outer_rows*(900/1000) tuples will probe empty buckets. So the
cost of scanning bucket chains is

hash_qual_cost.per_tuple*innerbucketsize*outer_rows*
(1 - ((outer_distinct - inner_distinct)/outer_distinct)*((buckets_num -
inner_distinct)/buckets_num))

Do you think this assumption is reasonable?
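
Written out as C in the style of final_cost_hashjoin(), the proposal
above amounts to something like the sketch below (variable names are
illustrative; the distinct-value inputs would come from the planner's
existing estimates):

static double
chain_probe_cost(double per_tuple_qual_cost, double innerbucketsize,
				 double outer_path_rows, double outer_distinct,
				 double inner_distinct, double buckets_num)
{
	double		nonempty_probe_frac;

	/* fraction of outer probes that land on a non-empty bucket */
	nonempty_probe_frac =
		1.0 - ((outer_distinct - inner_distinct) / outer_distinct) *
		((buckets_num - inner_distinct) / buckets_num);

	/* charge chain-walking only for probes that actually meet a chain */
	return per_tuple_qual_cost * innerbucketsize *
		outer_path_rows * nonempty_probe_frac;
}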


On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro  wrote:

> On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen  wrote:
> > In the test case above, the small table has 3000 tuples and 100 distinct
> values on column ‘a’.
> > If we use small table as inner table.  The chan length of the bucket is
> 30. And we need to
> > search the whole chain on probing the hash table. So the cost of probing
> is bigger than build
> > hash table, and we need to use big table as inner.
> >
> > But in fact this is not true. We initialized 620,000 buckets in
> hashtable. But only 100 buckets
> > has chains with length 30. Other buckets are empty. Only hash values
> need to be compared.
> > Its costs are very small. We have 100,000 distinct key and 100,000,000
> tuple on outer table.
> > Only (100/10)* tuple_num tuples will search the whole chain. The
> other tuples
> > (number = (98900/10)*tuple_num*) in outer
> > table just compare with the hash value. So the actual cost is much
> smaller than the planner
> > calculated. This is the reason why using a small table as inner is
> faster.
>
> So basically we think that if t_big is on the outer side, we'll do
> 100,000,000 probes and each one is going to scan a t_small bucket with
> chain length 30, so that looks really expensive.  Actually only a
> small percentage of its probes find tuples with the right hash value,
> but final_cost_hash_join() doesn't know that.  So we hash t_big
> instead, which we estimated pretty well and it finishes up with
> buckets of length 1,000 (which is actually fine in this case, they're
> not unwanted hash collisions, they're duplicate keys that we need to
> emit) and we probe them 3,000 times (which is also fine in this case),
> but we had to do a bunch of memory allocation and/or batch file IO and
> that turns out to be slower.
>
> I am not at all sure about this but I wonder if it would be better to
> use something like:
>
>   run_cost += outer_path_rows * some_small_probe_cost;
>   run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>
> If we can estimate how many tuples will actually match accurately,
> that should also be the number of times we have to run the quals,
> since we don't usually expect hash collisions (bucket collisions, yes,
> but hash collisions where the key doesn't turn out to be equal, no*).
>
> * ... but also yes as you approach various limits, so you could also
> factor in bucket chain length that is due to being prevented from
> expanding the number of buckets by arbitrary constraints, and perhaps
> also birthday_problem(hash size, key space) to factor in unwanted hash
> collisions that start to matter once you get to billions of keys and
> expect collisions with short hashes.
>


Re: range_agg

2019-11-19 Thread Pavel Stehule
Hi

On Thu, Nov 7, 2019 at 3:36 Paul A Jungwirth <
p...@illuminatedcomputing.com> wrote:

> On Wed, Nov 6, 2019 at 3:02 PM Paul A Jungwirth
>  wrote:
> > On Thu, Sep 26, 2019 at 2:13 PM Alvaro Herrera 
> wrote:
> > > Hello Paul, I've started to review this patch.  Here's a few minor
> > > things I ran across -- mostly compiler warnings (is my compiler too
> > > ancient?).
> > I just opened this thread to post a rebased set patches (especially
> > because of the `const` additions to range functions). Maybe it's not
> > that helpful since they don't include your changes yet but here they
> > are anyway. I'll post some more with your changes shortly.
>
> Here is another batch of patches incorporating your improvements. It
> seems like almost all the warnings were about moving variable
> declarations above any other statements. For some reason I don't get
> warnings about that on my end (compiling on OS X):
>
> platter:postgres paul$ gcc --version
> Configured with:
> --prefix=/Applications/Xcode.app/Contents/Developer/usr
> --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple clang version 11.0.0 (clang-1100.0.33.12)
> Target: x86_64-apple-darwin18.6.0
> Thread model: posix
> InstalledDir:
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
>
> For configure I'm saying this:
>
> ./configure CFLAGS=-ggdb 5-Og -g3 -fno-omit-frame-pointer
> --enable-tap-tests --enable-cassert --enable-debug
> --prefix=/Users/paul/local
>
> Any suggestions to get better warnings? On my other patch I got
> feedback about the very same kind. I could just compile on Linux but
> it's nice to work on this away from my desk on the laptop. Maybe
> installing a real gcc is the way to go.
>

I tested the latest patches. I found some issues:

1. you should not try to patch catversion.

2. there is a warning

parse_coerce.c: In function ‘enforce_generic_type_consistency’:
parse_coerce.c:1975:11: warning: ‘range_typelem’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
 1975 |   else if (range_typelem != elem_typeid)

3. there are problems with pg_upgrade. Regression tests fail:

command:
"/home/pavel/src/postgresql.master/tmp_install/usr/local/pgsql/bin/pg_restore"
--host /home/pavel/src/postgresql.master/src/b
pg_restore: connecting to database for restore
pg_restore: creating DATABASE "regression"
pg_restore: connecting to new database "regression"
pg_restore: connecting to database "regression" as user "pavel"
pg_restore: creating DATABASE PROPERTIES "regression"
pg_restore: connecting to new database "regression"
pg_restore: connecting to database "regression" as user "pavel"
pg_restore: creating pg_largeobject "pg_largeobject"
pg_restore: creating SCHEMA "fkpart3"
pg_restore: creating SCHEMA "fkpart4"
pg_restore: creating SCHEMA "fkpart5"
pg_restore: creating SCHEMA "fkpart6"
pg_restore: creating SCHEMA "mvtest_mvschema"
pg_restore: creating SCHEMA "regress_indexing"
pg_restore: creating SCHEMA "regress_rls_schema"
pg_restore: creating SCHEMA "regress_schema_2"
pg_restore: creating SCHEMA "testxmlschema"
pg_restore: creating TRANSFORM "TRANSFORM FOR integer LANGUAGE "sql""
pg_restore: creating TYPE "public.aggtype"
pg_restore: creating TYPE "public.arrayrange"
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 1653; 1247 17044 TYPE arrayrange pavel
pg_restore: error: could not execute query: ERROR:  pg_type array OID value
not set when in binary upgrade mode
Command was:.
-- For binary upgrade, must preserve pg_type oid
SELECT
pg_catalog.binary_upgrade_set_next_pg_type_oid('17044'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type array oid
SELECT
pg_catalog.binary_upgrade_set_next_array_pg_type_oid('17045'::pg_catalog.oid);

CREATE TYPE "public"."arrayrange" AS RANGE (
subtype = integer[]
);

4. There is a problem with the docs:

  echo ""; \
  echo ""; \
} > version.sgml
'/usr/bin/perl' ./mk_feature_tables.pl YES
../../../src/backend/catalog/sql_feature_packages.txt
../../../src/backend/catalog/sql_features.txt > features-supported.sgml
'/usr/bin/perl' ./mk_feature_tables.pl NO
../../../src/backend/catalog/sql_feature_packages.txt
../../../src/backend/catalog/sql_features.txt > features-unsupported.sgml
'/usr/bin/perl' ./generate-errcodes-table.pl
../../../src/backend/utils/errcodes.txt > errcodes-table.sgml
'/usr/bin/perl' ./generate-keywords-table.pl . > keywords-table.sgml
/usr/bin/xmllint --path . --noout --valid postgres.sgml
extend.sgml:281: parser error : Opening and ending tag mismatch: para line
270 and type
 type of the ranges in an anymultirange.
 ^
extend.sgml:281: parser error : Opening and ending tag mismatch: sect2 line
270 and type
 type of the ranges in an anymultirange.
 ^
extend.sgml:282: parser error : Opening and ending tag mismatch: sect1 line
270 and para

   ^
extend.sgml:324: parser error : Opening and ending tag mismatch: 

Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Amit Kapila
On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar  wrote:
>
> On Mon, 18 Nov 2019 at 17:20, Amit Kapila  wrote:
> > I see that you have made changes in ReorderBufferRestoreChanges to use
> > PathNameOpenFile, but not in ReorderBufferSerializeTXN.  Is there a
> > reason for the same?  In my test environment, with the test provided
> > by you, I got the error (reported in this thread) via
> > ReorderBufferSerializeTXN.
>
> You didn't get this error with the patch applied, did you?
>

No, I got this before applying the patch.  However, after applying the
patch, I got the below error in the same test:

postgres=# SELECT 1 from
pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
ERROR:  could not read from reorderbuffer spill file: Invalid argument

It seems to me that the FileRead API used in the patch can return a
value < 0 at EOF; see how it is used in BufFileLoadBuffer.  I got this
error on a Windows machine, and in the server log the message was
"LOG: unrecognized win32 error code: 38", which indicates "Reached the
end of the file."
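
To illustrate, here is roughly the EOF handling that BufFileLoadBuffer
does, adapted to the spill-file read.  This is only a sketch: 'file'
and 'curOffset' stand in for whatever the patch keeps in its iterator
state, they are not the patch's actual names.

	int			readBytes;

	readBytes = FileRead(file, rb->outbuf,
						 sizeof(ReorderBufferDiskChange),
						 curOffset, WAIT_EVENT_REORDER_BUFFER_READ);

	/* As BufFileLoadBuffer does, treat a negative return as end of file. */
	if (readBytes < 0)
		readBytes = 0;

	if (readBytes == 0)
		break;					/* no more changes in this spill file */
	if (readBytes != sizeof(ReorderBufferDiskChange))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes",
						readBytes, (uint32) sizeof(ReorderBufferDiskChange))));

	curOffset += readBytes;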

> If you were debugging this without the patch applied, I suspect that
> the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is
> generating this error is because the max limit must be already crossed
> because of earlier calls to ReorderBufferRestoreChanges().
>
> Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is
> sufficient because the code in that function has made sure the fd gets
> closed there itself.
>

Okay, then we might not need it there, but we should at least add a
comment in ReorderBufferRestoreChanges explaining why we use a
different function to operate on the file there.
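
Something along these lines would do, I think.  The wording is only a
sketch, and the variable names in the call are mine, not necessarily
the patch's:

	/*
	 * Use PathNameOpenFile(), not OpenTransientFile(), because the spill
	 * file stays open across calls while we iterate over the changes.
	 * Going through the VFD layer lets fd.c transparently close and
	 * reopen the underlying kernel fd via ReleaseLruFiles() when we get
	 * close to max_safe_fds, so we cannot exceed maxAllocatedDescs here.
	 */
	file->vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);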

>
> For the API's that use VFDs (like PathNameOpenFile), the files opened
> are always recorded in the VfdCache array. So it is not required to do
> the cleanup at (sub)transaction end, because the kernel fds get closed
> dynamically in ReleaseLruFiles() whenever they reach max_safe_fds
> limit. So if a transaction aborts, the fds might remain open, but
> those will get cleaned up whenever we require more fds, through
> ReleaseLruFiles(). Whereas, for files opened through
> OpenTransientFile(), VfdCache is not involved, so this needs
> transaction end cleanup.
>

Have you tried injecting an error?  After getting the error mentioned
above, I retried the same query and got the message below.

postgres=# SELECT 1 from
pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
ERROR:  could not remove file
"pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during
removal of pg_replslot/regression_slot/xid*: Permission denied

And then I tried to drop the replication slot and got the error below.
postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot');
ERROR:  could not rename file "pg_replslot/regression_slot" to
"pg_replslot/regression_slot.tmp": Permission denied

It might be something Windows-specific, but you could try injecting an
error after reading a few files in that code path and see how the
cleanup behaves (something like the throwaway hack below, for example).
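
A minimal sketch of what I mean, purely a throwaway hack (the counter
name is made up and this is obviously not for commit), dropped into
the spill-file reading loop:

	/* Throwaway fault injection: fail after a couple of files. */
	static int	faultinject_files_read = 0;

	if (++faultinject_files_read == 3)
		elog(ERROR, "fault injection: simulating failure while restoring spill files");

That should make it easy to see whether the open VFDs and the leftover
pg_replslot spill files get cleaned up properly after the ERROR.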

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com