Re: 001_rep_changes.pl stalls

2020-04-16 Thread Noah Misch
On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > This seems to have made the following race condition easier to hit:
> > https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
> > https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us
> 
> Yeah, I just came to the same guess in the other thread.
> 
> > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > creating more buildfarm noise is itself bad.  I am inclined to revert the 
> > fix
> > after a week.  Not immediately, in case it uncovers lower-probability bugs.
> > I'd then re-commit it after one of those threads fixes the other bug.  Would
> > anyone like to argue for a revert earlier, later, or not at all?
> 
> I don't think you should revert.  Those failures are (just) often enough
> to be annoying but I do not think that a proper fix is very far away.

That works for me, but an actual defect may trigger a revert.  Fujii Masao
reported that the patch causes idle physical walsenders to spin at 100% CPU.
When caught up, the
WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
until more WAL is available.  XLogSendPhysical() just returns when caught up.
No amount of WAL is too small for physical replication to dispatch, but
logical replication needs the full xl_tot_len of a record.  Some options:

1. Make XLogSendPhysical() more like XLogSendLogical(), calling
   WalSndWaitForWal() when no WAL is available.  A quick version of this
   passes tests, but I'll need to audit WalSndWaitForWal() for things that are
   wrong for physical replication.

2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
   insufficient WAL is available.  This complicates the xlogreader.h API to
   pass back "wait for this XLogRecPtr", and we'd then persist enough state to
   resume decoding.  This doesn't have any advantages to make up for those.

3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
   WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This risks
   further drift between the two wait sites; on the other hand, one could
   refactor later to help avoid that.

4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
   minimal fix, but it crudely punches through the abstraction between
   WalSndLoop() and its WalSndSendDataCallback.

I'm favoring (1).  Other preferences?
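
To make the shape of (1) concrete, the change would be roughly along these
lines (a sketch only, reusing the existing walsender.c names; the real work is
the WalSndWaitForWal() audit mentioned above):

    /*
     * Sketch of option (1), at the bottom of XLogSendPhysical(): instead of
     * returning immediately when there is nothing to send, sleep until more
     * WAL has been flushed, the way the logical path already does.  Shutdown
     * handling and the audit of WalSndWaitForWal() for physical-replication
     * assumptions are omitted here.
     */
    if (WalSndCaughtUp && !streamingDoneSending)
        (void) WalSndWaitForWal(sentPtr + 1);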




Re: snapshot too old issues, first around wraparound and then more.

2020-04-16 Thread Thomas Munro
On Fri, Apr 17, 2020 at 3:37 PM Thomas Munro  wrote:
> On Mon, Apr 13, 2020 at 5:14 PM Andres Freund  wrote:
> > FWIW, I think the part that is currently harder to fix is the time->xmin
> > mapping and some related pieces. Second comes the test
> > infrastructure. Compared to those, adding additional checks for old
> > snapshots wouldn't be too hard - although I'd argue that the approach of
> > sprinkling these tests everywhere isn't that scalable...
>
> Just trying out some ideas here...
> ... so I guess maybe I'll
> need to go and figure out how to write some perl.

Here's a very rough sketch of what I mean.  Patches 0001-0003 are
stolen directly from Robert.  I think 0005's t/001_truncate.pl
demonstrates that the map is purged of old xids as appropriate.  I
suppose this style of testing based on manually advancing the hands of
time should also allow for testing early pruning, but it may be Monday
before I can try that so I'm sharing what I have so far in case it's
useful...  I think this really wants to be in src/test/modules, not
contrib, but I just bolted it on top of what Robert posted.
From b78644a0f9580934b136ca8413366de91198203f Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Thu, 16 Apr 2020 09:37:31 -0400
Subject: [PATCH v1 1/5] Expose oldSnapshotControl.

---
 src/backend/utils/time/snapmgr.c | 55 +--
 src/include/utils/old_snapshot.h | 75 
 2 files changed, 77 insertions(+), 53 deletions(-)
 create mode 100644 src/include/utils/old_snapshot.h

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..abaaea569a 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -63,6 +63,7 @@
 #include "storage/spin.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/old_snapshot.h"
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/snapmgr.h"
@@ -74,59 +75,7 @@
  */
 int			old_snapshot_threshold; /* number of minutes, -1 disables */
 
-/*
- * Structure for dealing with old_snapshot_threshold implementation.
- */
-typedef struct OldSnapshotControlData
-{
-	/*
-	 * Variables for old snapshot handling are shared among processes and are
-	 * only allowed to move forward.
-	 */
-	slock_t		mutex_current;	/* protect current_timestamp */
-	TimestampTz current_timestamp;	/* latest snapshot timestamp */
-	slock_t		mutex_latest_xmin;	/* protect latest_xmin and next_map_update */
-	TransactionId latest_xmin;	/* latest snapshot xmin */
-	TimestampTz next_map_update;	/* latest snapshot valid up to */
-	slock_t		mutex_threshold;	/* protect threshold fields */
-	TimestampTz threshold_timestamp;	/* earlier snapshot is old */
-	TransactionId threshold_xid;	/* earlier xid may be gone */
-
-	/*
-	 * Keep one xid per minute for old snapshot error handling.
-	 *
-	 * Use a circular buffer with a head offset, a count of entries currently
-	 * used, and a timestamp corresponding to the xid at the head offset.  A
-	 * count_used value of zero means that there are no times stored; a
-	 * count_used value of OLD_SNAPSHOT_TIME_MAP_ENTRIES means that the buffer
-	 * is full and the head must be advanced to add new entries.  Use
-	 * timestamps aligned to minute boundaries, since that seems less
-	 * surprising than aligning based on the first usage timestamp.  The
-	 * latest bucket is effectively stored within latest_xmin.  The circular
-	 * buffer is updated when we get a new xmin value that doesn't fall into
-	 * the same interval.
-	 *
-	 * It is OK if the xid for a given time slot is from earlier than
-	 * calculated by adding the number of minutes corresponding to the
-	 * (possibly wrapped) distance from the head offset to the time of the
-	 * head entry, since that just results in the vacuuming of old tuples
-	 * being slightly less aggressive.  It would not be OK for it to be off in
-	 * the other direction, since it might result in vacuuming tuples that are
-	 * still expected to be there.
-	 *
-	 * Use of an SLRU was considered but not chosen because it is more
-	 * heavyweight than is needed for this, and would probably not be any less
-	 * code to implement.
-	 *
-	 * Persistence is not needed.
-	 */
-	int			head_offset;	/* subscript of oldest tracked time */
-	TimestampTz head_timestamp; /* time corresponding to head xid */
-	int			count_used;		/* how many slots are in use */
-	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
-} OldSnapshotControlData;
-
-static volatile OldSnapshotControlData *oldSnapshotControl;
+volatile OldSnapshotControlData *oldSnapshotControl;
 
 
 /*
diff --git a/src/include/utils/old_snapshot.h b/src/include/utils/old_snapshot.h
new file mode 100644
index 00..284af7d508
--- /dev/null
+++ b/src/include/utils/old_snapshot.h
@@ -0,0 +1,75 @@
+/*-
+ *
+ * old_snapshot.h
+ *		Data structures for 'snapshot too old'
+ *
+ * Portions 

Re: It is not documented that pg_promote can exit standby mode

2020-04-16 Thread Michael Paquier
On Fri, Apr 17, 2020 at 01:40:02PM +0900, Fujii Masao wrote:
> Thanks for the report and the patch! It looks good to me.
> Barring any objection, I will commit this patch.

+1.
--
Michael




Re: It is not documented that pg_promote can exit standby mode

2020-04-16 Thread Fujii Masao




On 2020/04/17 13:11, ikedamsh wrote:

Hi,

The document (high-availability.sgml) says that there are only two ways to exit
standby mode.

 26.2.2. Standby Server Operation
 Standby mode is exited and the server switches to normal operation when pg_ctl 
promote is run or a trigger file is found (promote_trigger_file).

But there is another way: calling the pg_promote function.
I think we need to document it too, don't we?

I attached a patch. Please review and let me know your thoughts.


Thanks for the report and the patch! It looks good to me.
Barring any objection, I will commit this patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: ERROR: could not open file "pg_tblspc/ issue with replication setup.

2020-04-16 Thread Michael Paquier
On Thu, Apr 16, 2020 at 01:56:47PM +0530, Rajkumar Raghuwanshi wrote:
> While testing for a feature I got this tablespace related error while
> running script.

Primary and standby are running on the same host, so they would
interact with each other as the tablespace path used by both clusters
would be the same (primary uses the path defined by the DDL, which is
registered in the WAL record the standby replays).  What you are
looking for here is to create the tablespace before taking the base
backup, and then use the option --tablespace-mapping with
pg_basebackup to avoid the issue.
--
Michael




It is not documented that pg_promote can exit standby mode

2020-04-16 Thread ikedamsh

Hi, 

The document (high-availability.sgml) says that there are only two ways
to exit standby mode.


 26.2.2. Standby Server Operation
 Standby mode is exited and the server switches to normal operation when 
pg_ctl promote is run or a trigger file is found (promote_trigger_file).


But there is another way: calling the pg_promote function.
I think we need to document it too, don't we?

I attached a patch. Please review and let me know your thoughts.

Regards,
Masahiro Ikeda

From 719b754c36f4537aaab7c6de1951d7b7ec4759f6 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Tue, 7 Apr 2020 08:34:55 +0900
Subject: [PATCH] fix doc about the way to exit standby mode

---
 doc/src/sgml/high-availability.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 4659b9e..88bf960 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -644,8 +644,8 @@ protocol to make nodes agree on a serializable transactional order.
 

 Standby mode is exited and the server switches to normal operation
-when pg_ctl promote is run or a trigger file is found
-(promote_trigger_file). Before failover,
+when pg_ctl promote is run, pg_promote is called,
+or a trigger file is found(promote_trigger_file). Before failover,
 any WAL immediately available in the archive or in pg_wal will be
 restored, but no attempt is made to connect to the master.

-- 
1.8.3.1



Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-16 Thread Hamid Akhtar
Thank you everyone for the detailed feedback.

On Fri, Apr 17, 2020 at 5:40 AM Michael Paquier  wrote:

> On Thu, Apr 16, 2020 at 03:11:51PM -0400, Tom Lane wrote:
> > Even if I liked the core idea, loading the functionality onto VACUUM
> seems
> > like a fairly horrid design choice.  It's quite unrelated to what that
> > command does.  In the autovac code path, it's going to lead to multiple
> > autovac workers all complaining simultaneously about the same problem.
> > But having manual vacuums complain about issues unrelated to the task at
> > hand is also a seriously poor bit of UX design.  Moreover, that won't do
> > all that much to surface problems, since most(?) installations never run
> > manual vacuums; or if they do, the "manual" runs are really done by a
> cron
> > job or the like, which is not going to notice the warnings.  So you still
> > need a log-scraping tool.
>
> +1.
>
> > If we were going to go down the path of periodically logging warnings
> > about old prepared transactions, some single-instance background task
> > like the checkpointer would be a better place to do the work in.  But
> > I'm not really recommending that, because I agree with Robert that
> > we just plain don't want this functionality.
>
> I am not sure that the checkpointer is a good place to do that either,
> joining back with your argument in the first paragraph of this email
> related to vacuum.  One potential approach would be a contrib module
> that works as a background worker?  However, I would think that
> finding a minimum set of requirements that we think are generic enough
> for most users would be something hard to draft a list of.  If we had
> a small, minimal contrib/ module in core that people could easily
> extend for their own needs and that we would intentionally keep as
> minimal, in the same spirit as say passwordcheck, perhaps..
> --
> Michael
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: snapshot too old issues, first around wraparound and then more.

2020-04-16 Thread Thomas Munro
On Mon, Apr 13, 2020 at 5:14 PM Andres Freund  wrote:
> FWIW, I think the part that is currently harder to fix is the time->xmin
> mapping and some related pieces. Second comes the test
> infrastructure. Compared to those, adding additional checks for old
> snapshots wouldn't be too hard - although I'd argue that the approach of
> sprinkling these tests everywhere isn't that scalable...

Just trying out some ideas here...  I suppose the wrapping problem
just requires something along the lines of the attached, but now I'm
wondering how to write decent tests for it.  Using the
pg_clobber_current_snapshot_timestamp() function I mentioned in
Robert's time->xmin thread, it's easy to build up a time map without
resorting to sleeping etc, with something like:

select pg_clobber_current_snapshot_timestamp('3000-01-01 00:00:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:01:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:02:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:03:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:04:00Z');

Then of course  frozenXID can be advanced with eg update pg_database
set datallowconn = 't' where datname = 'template0', then vacuumdb
--freeze --all, and checked before and after with Robert's
pg_old_snapshot_time_mapping() SRF to see that it's truncated.  But
it's not really the level of stuff we'd ideally mess with in
pg_regress tests and I don't see any precedent, so I guess maybe I'll
need to go and figure out how to write some perl.
From 69d0f7d843a8145fc8cdbd6a56b948ee04d486b9 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 17 Apr 2020 15:18:49 +1200
Subject: [PATCH] Truncate old snapshot XIDs before truncating CLOG.

---
 src/backend/commands/vacuum.c|  3 +++
 src/backend/utils/time/snapmgr.c | 21 +
 src/include/utils/snapmgr.h  |  1 +
 3 files changed, 25 insertions(+)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5a110edb07..37ead45fa5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1627,6 +1627,9 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	AdvanceOldestCommitTsXid(frozenXID);
 
+	/* Make sure snapshot_too_old drops old XIDs. */
+	TruncateOldSnapshotTimeMapping(frozenXID);
+
 	/*
 	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
 	 */
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 72b2c61a07..d604e69270 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1998,6 +1998,27 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 }
 
 
+/*
+ * Remove old xids from the timing map, so the CLOG can be truncated.
+ */
+void
+TruncateOldSnapshotTimeMapping(TransactionId frozenXID)
+{
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
+	while (oldSnapshotControl->count_used > 0 &&
+		   TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[oldSnapshotControl->head_offset],
+ frozenXID))
+	{
+		oldSnapshotControl->head_timestamp += USECS_PER_MINUTE;
+		oldSnapshotControl->head_offset =
+			(oldSnapshotControl->head_offset + 1) %
+			OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+		oldSnapshotControl->count_used--;
+	}
+	LWLockRelease(OldSnapshotTimeMapLock);
+}
+
+
 /*
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..4f53aad956 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -135,6 +135,7 @@ extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmi
 		 Relation relation);
 extern void MaintainOldSnapshotTimeMapping(TimestampTz whenTaken,
 		   TransactionId xmin);
+extern void TruncateOldSnapshotTimeMapping(TransactionId frozenXID);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
-- 
2.20.1



Re: where should I stick that backup?

2020-04-16 Thread Robert Haas
On Wed, Apr 15, 2020 at 7:55 PM Robert Haas  wrote:
> Yeah. I think we really need to understand the performance
> characteristics of pipes better. If they're slow, then anything that
> needs to be fast has to work some other way (but we could still
> provide a pipe-based slow way for niche uses).

Hmm. Could we learn what we need to know about this by doing something
as simple as taking a basebackup of a cluster with some data in it (say, created
by pgbench -i -s 400 or something) and then comparing the speed of cat
< base.tar | gzip > base.tgz to the speed of gzip < base.tar >
base.tgz? It seems like there's no difference between those except
that the first one relays through an extra process and an extra pipe.

I don't know exactly how to do the equivalent of this on Windows, but
I bet somebody does.
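
If it helps, here is a rough, self-contained sketch of the kind of measurement
I mean for the pipe side of it on POSIX systems: a child process writes a fixed
amount of data into a pipe and the parent reads it back, so the elapsed time
approximates the cost of relaying data through an extra process and an extra
pipe (the total size and chunk size are arbitrary):

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define CHUNK (64 * 1024)
    #define TOTAL (4LL * 1024 * 1024 * 1024)    /* 4 GB, arbitrary */

    /* writes to a pipe can be partial for chunks larger than PIPE_BUF */
    static int
    write_all(int fd, const char *buf, size_t len)
    {
        while (len > 0)
        {
            ssize_t n = write(fd, buf, len);

            if (n < 0)
                return -1;
            buf += n;
            len -= (size_t) n;
        }
        return 0;
    }

    int
    main(void)
    {
        int         fds[2];
        pid_t       pid;
        char        buf[CHUNK];
        long long   total = 0;
        struct timespec t0, t1;
        double      secs;

        if (pipe(fds) != 0 || (pid = fork()) < 0)
        {
            perror("pipe/fork");
            return 1;
        }

        if (pid == 0)
        {
            /* child: pump TOTAL bytes of zeroes into the pipe */
            memset(buf, 0, sizeof(buf));
            close(fds[0]);
            for (long long sent = 0; sent < TOTAL; sent += CHUNK)
                if (write_all(fds[1], buf, CHUNK) != 0)
                    _exit(1);
            close(fds[1]);
            _exit(0);
        }

        /* parent: read it all back and time it */
        close(fds[1]);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (;;)
        {
            ssize_t n = read(fds[0], buf, sizeof(buf));

            if (n <= 0)
                break;
            total += n;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        waitpid(pid, NULL, 0);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("relayed %lld MB in %.2f s (%.0f MB/s)\n",
               total / (1024 * 1024), secs, total / (1024.0 * 1024.0) / secs);
        return 0;
    }

Comparing that number against the throughput of gzip alone would tell us how
much of the cost is the relay itself.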

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-16 Thread Thomas Munro
On Fri, Apr 17, 2020 at 5:46 AM Andres Freund  wrote:
> On 2020-04-16 13:34:39 -0400, Robert Haas wrote:
> > On Thu, Apr 16, 2020 at 1:14 PM Andres Freund  wrote:
> > > I still think we need a way to test this without waiting for hours to
> > > hit various edge cases. You argued against a fixed binning of
> > > old_snapshot_threshold/100 arguing its too coarse. How about a 1000 or
> > > so? For 60 days, the current max for old_snapshot_threshold, that'd be a
> > > granularity of 01:26:24, which seems fine.  The best way I can think of
> > > that'd keep current GUC values sensible is to change
> > > old_snapshot_threshold to be float. Ugly, but ...?
> >
> > Yeah, 1000 would be a lot better. However, if we switch to a fixed
> > number of bins, it's going to be a lot more code churn.
>
> Given the number of things that need to be addressed around the feature,
> I am not too concerned about that.
>
>
> > What did you think of my suggestion of making head_timestamp
> > artificially move backward to simulate the passage of time?
>
> I don't think it allows to exercise the various cases well enough. We
> need to be able to test this feature both interactively as well as in a
> scripted manner. Edge cases like wrapping around in the time mapping imo
> can not easily be tested by moving the head timestamp back.

What about a contrib function that lets you clobber
oldSnapshotControl->current_timestamp?  It looks like all times in
this system come ultimately from GetSnapshotCurrentTimestamp(), which
uses that variable to make sure that time never goes backwards.
Perhaps you could abuse that, like so, from test scripts:

postgres=# select * from pg_old_snapshot_time_mapping();
 array_offset |     end_timestamp      | newest_xmin
--------------+------------------------+-------------
            0 | 3000-01-01 13:00:00+13 |         490
(1 row)

postgres=# select pg_clobber_current_snapshot_timestamp('3000-01-01 00:01:00Z');
 pg_clobber_current_snapshot_timestamp
----------------------------------------
 
(1 row)

postgres=# select * from pg_old_snapshot_time_mapping();
 array_offset |     end_timestamp      | newest_xmin
--------------+------------------------+-------------
            0 | 3000-01-01 13:01:00+13 |         490
            1 | 3000-01-01 13:02:00+13 |         490
(2 rows)
From 76fe56b732cdc420aeb7cb3b2adcc1e45343b0f7 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 17 Apr 2020 14:10:35 +1200
Subject: [PATCH 3/3] Add pg_clobber_current_snapshot_timestamp().

---
 contrib/old_snapshot/old_snapshot--1.0.sql |  5 +
 contrib/old_snapshot/time_mapping.c| 13 +
 2 files changed, 18 insertions(+)

diff --git a/contrib/old_snapshot/old_snapshot--1.0.sql b/contrib/old_snapshot/old_snapshot--1.0.sql
index 9ebb8829e3..aacf1704b5 100644
--- a/contrib/old_snapshot/old_snapshot--1.0.sql
+++ b/contrib/old_snapshot/old_snapshot--1.0.sql
@@ -11,4 +11,9 @@ RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'pg_old_snapshot_time_mapping'
 LANGUAGE C STRICT;
 
+CREATE FUNCTION pg_clobber_current_snapshot_timestamp(now timestamptz)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'pg_clobber_current_snapshot_timestamp'
+LANGUAGE C STRICT;
+
 -- XXX. Do we want REVOKE commands here?
diff --git a/contrib/old_snapshot/time_mapping.c b/contrib/old_snapshot/time_mapping.c
index 37e0055a00..8728c4ddb5 100644
--- a/contrib/old_snapshot/time_mapping.c
+++ b/contrib/old_snapshot/time_mapping.c
@@ -36,6 +36,7 @@ typedef struct
 
 PG_MODULE_MAGIC;
 PG_FUNCTION_INFO_V1(pg_old_snapshot_time_mapping);
+PG_FUNCTION_INFO_V1(pg_clobber_current_snapshot_timestamp);
 
 static OldSnapshotTimeMapping *GetOldSnapshotTimeMapping(void);
 static TupleDesc MakeOldSnapshotTimeMappingTupleDesc(void);
@@ -157,3 +158,15 @@ MakeOldSnapshotTimeMappingTuple(TupleDesc tupdesc, OldSnapshotTimeMapping *mappi
 
 	return heap_form_tuple(tupdesc, values, nulls);
 }
+
+Datum
+pg_clobber_current_snapshot_timestamp(PG_FUNCTION_ARGS)
+{
+	TimestampTz new_current_timestamp = PG_GETARG_TIMESTAMPTZ(0);
+
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
+	oldSnapshotControl->current_timestamp = new_current_timestamp;
+	LWLockRelease(OldSnapshotTimeMapLock);
+
+	PG_RETURN_NULL();
+}
-- 
2.20.1



Re: Making openssl_tls_init_hook OpenSSL specific

2020-04-16 Thread Michael Paquier
On Thu, Apr 16, 2020 at 02:17:33PM +0200, Daniel Gustafsson wrote:
> Commit 896fcdb230e72 (sorry for chiming in too late, I missed that thread)
> added a TLS init hook which is OpenSSL specific: openssl_tls_init_hook.  Since
> the rest of the TLS support in the backend is library agnostic, we should IMO
> make this hook follow that pattern, else this will make a non-OpenSSL backend
> not compile.

Better sooner than later, thanks for the report.

> If we make the hook generic, extension authors must have a way to tell which
> backend invoked it, so maybe the best option is to simply wrap this hook in
> USE_OPENSSL ifdefs and keep the name/signature?  Looking at the Secure
> Transport patch I wrote, there is really no equivalent callsite; the same goes
> for a libnss patch which I haven't yet submitted.
> 
> The attached adds USE_OPENSSL guards.

I agree that this looks like an oversight of the original commit
introducing the hook as it gets called in the OpenSSL code path of
be_tls_init(), so I think that your patch is right (though I would
have just used #ifdef USE_OPENSSL here).  And if the future proves
that this hook has more uses for other SSL implementations, we could
always rework it at that point, if necessary.  Andrew, would you
prefer fixing that yourself?
--
Michael




Re: [PATHC] Fix minor memory leak in pg_basebackup

2020-04-16 Thread Michael Paquier
On Thu, Apr 16, 2020 at 10:54:09AM +, Zhang, Jie wrote:
> So much the better!

Thanks for confirming.  Fixed, then.
--
Michael




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-04-16 Thread Peter Geoghegan
On Thu, Apr 16, 2020 at 3:49 PM Andres Freund  wrote:
> I think we should really just stop being miserly and update the xid to be
> FrozenTransactionId or InvalidTransactionId when vacuum encounters one
> that's from before the xid cutoff used by vacuum (i.e. what could
> become the new relfrozenxid).  That seems like it'd be a few lines, not
> more.

Okay.

> > There is absolutely no reason why we have to delay recycling for very
> > long, even in cases with long running transactions or whatever. I
> > agree that it's just an accident that it works that way. VACUUM could
> > probably remember deleted pages, and then revisited those pages at the
> > end of the index vacuuming -- that might make a big difference in a
> > lot of workloads. Or it could chain them together as a linked list
> > which can be accessed much more eagerly in some cases.
>
> I think it doesn't really help meaningfully for vacuum to be a bit
> smarter about when to recognize pages as being recyclable. IMO the big
> issue is that vacuum won't be very frequent, so we'll grow the index
> until that time, even if there's many "effectively empty" pages.

(It seems like you're talking about the big picture now, not the
problems in Postgres 11 and 12 features in this area -- you're talking
about what happens to empty pages, not what happens to deleted pages.)

I'll say some more things about the less ambitious goal of more eager
recycling of pages, as they are deleted:

An individual VACUUM operation cannot recycle a page right after
_bt_pagedel() is called to delete the page. VACUUM will both set a
target leaf page half dead and delete it all at once in _bt_pagedel()
(it does that much in the simple and common case). Again, this is
because recycling very soon after the call to _bt_pagedel() will
introduce races with concurrent index scans -- they could fail to
observe the deletion in the parent (i.e. see its downlink, since child
isn't even half dead), and land on a concurrently recycled page
(VACUUM concurrently marks the page half-dead, fully dead/deleted, and
then even goes as far as recycling it). So the current design makes a
certain amount of sense -- we can't be super aggressive like that.
(Actually, maybe it doesn't make sense to not just put the page in the
FSM there and then -- see "Thinks some more" below.)

Even still, nothing stops the same VACUUM operation from (for example)
remembering a list of pages it has deleted during the current scan,
and then coming back at the end of the bulk scan of the index to
reconsider if it can recycle the pages now (2 minutes later instead of
2 months later). With a new RecentGlobalXmin (or something that's
conceptually like a new RecentGlobalXmin).
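
In pseudo-C, that "remember and revisit" idea is something like the following
(a sketch only; the pending_deleted array and the recomputed horizon are made
up, and locking/WAL details are ignored):

    /*
     * After the bulk scan: recheck the pages this same VACUUM deleted
     * earlier, now that a newer horizon may already make them recyclable.
     */
    for (int i = 0; i < npending; i++)
    {
        Buffer      buf = _bt_getbuf(rel, pending_deleted[i], BT_READ);
        Page        page = BufferGetPage(buf);

        if (_bt_page_recyclable(page))
            RecordFreeIndexPage(rel, pending_deleted[i]);
        _bt_relbuf(rel, buf);
    }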

Similarly, we could do limited VACUUMs that only visit previously
deleted pages, once VACUUM is taught to chain deleted pages together
to optimize recycling. We don't have to repeat another pass over the
entire index to recycle the pages because of this special deleted page
linking. This is something that we use when we have to recycle pages,
but it's a " INDEX_CLEANUP = off" index VACUUM -- we don't really want
to do most of the stuff that index vacuuming needs to do, but we must
still visit the metapage to check btm_oldest_btpo_xact, and then maybe
walk the deleted page linked list.

*** Thinks some more ***

As you pointed out, _bt_getbuf() already distrusts the FSM -- it has
its own _bt_page_recyclable() check, probably because the FSM isn't
crash safe. Maybe we could improve matters by teaching _bt_pagedel()
to put a page it deleted in the FSM immediately -- no need to wait
until the next index VACUUM for the RecordFreeIndexPage() call. It
still isn't quite enough that _bt_getbuf() distrusts the FSM, so we'd
also have to teach _bt_getbuf() some heuristics that made it
understand that VACUUM is now designed to put stuff in the FSM
immediately, so we don't have to wait for the next VACUUM operation to
get to it. Maybe _bt_getbuf() should try the FSM a few times before
giving up and allocating a new page, etc.

This wouldn't make VACUUM delete any more pages any sooner, but it
would make those pages reclaimable much sooner. Also, it wouldn't
solve the wraparound problem, but that is a bug, not a
performance/efficiency issue.

> I.e. even if the killtuples logic allows us to recognize that all actual
> index tuples are fully dead, we'll not benefit from that unless there's
> a new insertion that belongs onto the "empty" page. That's fine for
> indexes that are updated roughly evenly across the value range, but
> terrible for indexes that "grow" mostly on one side, and "shrink" on the
> other.

That could be true, but there are certain things about B-Tree space
utilization that might surprise you:

https://www.drdobbs.com/reexamining-b-trees/184408694?pgno=3

> I'd bet it be beneficial if we were to either have scans unlink such
> pages directly, or if they just entered the page into the FSM and have
> _bt_getbuf() do the unlinking.

That won't work, 

Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-16 Thread Kyotaro Horiguchi
At Thu, 16 Apr 2020 14:46:00 +0200, Magnus Hagander  wrote 
in 
> On Thu, Apr 16, 2020 at 7:05 AM Kyotaro Horiguchi 
> wrote:
> 
> > At Wed, 15 Apr 2020 15:58:05 +0500, "Andrey M. Borodin" <
> > x4...@yandex-team.ru> wrote in
> > > > On 15 Apr 2020, at 15:25, Magnus Hagander  wrote:
> > > > I think that makes perfect sense. The documentation explicitly says
> > "can read all pg_stat_* views", which is clearly wrong -- so either the
> > code or the docs should be fixed, and it looks like it's the code that
> > should be fixed to me.
> > > Should it be bug or v14 feature?
> > >
> > > Also pgstatfuncs.c contains a lot more checks of
> > has_privs_of_role(GetUserId(), beentry->st_userid).
> > > Maybe grant them all?
> > >
> > > > As for the patch, one could argue that we should just store the
> > resulting boolean instead of re-running the check (e.g. have a "bool
> > has_stats_privilege" or such), but perhaps that's an unnecessary
> > micro-optimization, like the attached.
> > >
> > > Looks good to me.
> >
> > pg_stat_get_activty checks (has_privs_of_role() ||
> > is_member_of_role()) in-place for every entry.  It's not necessary but
> > I suppose that doing the same thing for pg_stat_progress_info might be
> > better.
> >
> 
> From a result perspective, it shouldn't make a difference though, should
> it? It's a micro-optimization, but it might not have an actual performance
> effect in reality as well, but the result should always be the same?
> 
> (FWIW, pg_stat_statements has a coding pattern similar to the one I
> suggested in the patch)

As a principle, I prefer the "optimized" (or pg_stat_statements') pattern
because that style suggests that the privilege is (or should be) the same
for all entries, not because it might be a bit faster.  My suggestion above
was just about keeping the same style as the nearby code.  But the v2 code
at least introduces a third style (a mixture of in-place and pre-evaluated
checks), which seems somewhat ad hoc.

> > It's another issue, but pg_stat_get_backend_* functions don't consider
> > pg_read_all_stats. I suppose that the functions should work under the
> > same criteria to pg_stat views, and maybe explicitly documented?
> >
> 
> That's a good question. They haven't been documented to do so, but it
> certainly seems *weird* that the same information should be available
> through a view like pg_stat_activity, but not through the functions.
> 
> I would guess this was simply forgotten in 25fff40798f -- I don't recall
> any discussion about it. The commit message specifically says
> pg_database_size() and pg_tablespace_size(), but mentions nothing about
> pg_stat_*.

Yeah. pg_database_size() ERRORs out for insufficient privileges.  On
the other hand pg_stat_* returns "<insufficient privilege>" instead of
ERRORing out.

For example, pg_stat_get_backend_wait_event_type is documented as

https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-BACKEND-FUNCS-TABLE

"Wait event type name if backend is currently waiting, otherwise
 NULL. See Table 27.4 for details."

I would read this as "If the function returns a non-null value, the
returned value represents the wait event type mentioned in Table
27.4", but "<insufficient privilege>" is not a wait event type.  I
think something like "text-returning functions may return
out-of-domain strings such as "<insufficient privilege>" under the
corresponding conditions" should be added to the documentation.

> > If we do that, it may be better that we define "PGSTAT_VIEW_PRIV()" or
> > something like and replace the all occurances of the idiomatic
> > condition with it.
> >
> 
> You mean something like the attached?

Exactly.  It looks good to me. Thanks!
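
(For anyone reading along without the attachment, the idea is roughly a macro
along these lines in pgstatfuncs.c -- a sketch of the concept, not the attached
patch itself:)

    /* replace the repeated inline role checks with one helper */
    #define HAS_PGSTAT_PERMISSIONS(role) \
        (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || \
         has_privs_of_role(GetUserId(), role))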

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-16 Thread James Coleman
On Thu, Apr 16, 2020 at 1:10 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > On Fri, Apr 10, 2020 at 10:12 AM James Coleman  wrote:
> >> One thing I just noticed and had a question about: in
> >> preparePresortedCols (which sets up a function call context), do we
> >> need to call pg_proc_aclcheck?
>
> > Background: this came up because I noticed that pg_proc_aclcheck is
> > called in the scalar array op case in execExpr.c.
>
> > However grepping through the source code I see several places where a
> > function (including an equality op for an ordering op, like the case
> > we have here) gets looked up without calling pg_proc_aclcheck, but
> > then other places where the acl check is invoked.
>
> Rule of thumb is that we don't apply ACL checks to functions/ops
> we get out of an opclass; adding a function to an opclass is tantamount
> to giving public execute permission on it.  If the function/operator
> reference came directly from the SQL query it must be checked.

All right, in that case I believe we're OK here without modification.
We're looking up the equality op based on the ordering op the planner
has already selected for sorting the query, and I'm assuming that
looking that up via the op family is in the same category as "getting
out of an opclass" (since opclasses are part of an opfamily).

Thanks for the explanation.

> > In addition, I haven't been able to discern a reason for why sometimes
> > InvokeFunctionExecuteHook gets called with the function after lookup,
> > but not others.
>
> I would not stand here and say that that hook infrastructure is worth
> anything at all.  Maybe the coverage is sufficient for some use-cases,
> but who's to say?

Interesting. It does look to be particularly underused. Just grepping
for that hook invocation macro shows, for example, that it's not used
in nodeSort.c or tuplesort.c, so clearly it's not executed for the
functions we'd use in regular sort. Given that...I think we can
proceed without it here too.

James




Re: sqlsmith crash incremental sort

2020-04-16 Thread Tom Lane
Tomas Vondra  writes:
> I think we have essentially three options:
> 1) assuming there's just a single group
> 2) assuming each row is a separate group
> 3) something in between
> If (1) and (2) are worst/best-case scenarios, maybe we should pick
> something in between. We have DEFAULT_NUM_DISTINCT (200) which
> essentially says "we don't know what the number of groups is" so maybe
> we should use that.

I wouldn't recommend picking either the best or worst cases.

Possibly DEFAULT_NUM_DISTINCT is a sane choice, though it's fair to
wonder if it's quite applicable to the case where we already know
we've grouped by some columns.

regards, tom lane




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-04-16 Thread Andy Fan
On Thu, Apr 16, 2020 at 8:36 PM Ashutosh Bapat 
wrote:

> On Thu, Apr 16, 2020 at 7:47 AM Andy Fan  wrote:
>
> > (9 rows)
> >
> > With this feature:
> > explain analyze select a, sum(c) from grp2 group by a;
> >                                                     QUERY PLAN
> > --------------------------------------------------------------------------------------------------------------------
> >  GroupAggregate  (cost=0.00..553031.57 rows=1023 width=12) (actual time=0.044..13209.485 rows=1000 loops=1)
> >    Group Key: a
> >    ->  Seq Scan on grp2  (cost=0.00..403031.23 rows=1023 width=8) (actual time=0.023..4938.171 rows=1000 loops=1)
> >  Planning Time: 0.400 ms
> >  Execution Time: 13749.121 ms
> > (5 rows)
> >
>
> Applying the patch gives a white space warning
> git am /tmp/v6-000*
> Applying: Introduce UniqueKeys to determine RelOptInfo unique properties
> .git/rebase-apply/patch:545: indent with spaces.
> /* Fast path */
> warning: 1 line adds whitespace errors.
> Applying: Skip DISTINCT / GROUP BY if input is already unique
>
> Compiling the patch causes one warning
> nodeAgg.c:2134:3: warning: enumeration value ‘AGG_UNIQUE’ not handled
> in switch [-Wswitch]
>
>
Thanks, I will fix them, together with any detailed review suggestions.
(I know the review needs a lot of time, so I appreciate it.)


> I have not looked at the patch. The numbers above look good. The time
> spent in summing up a column in each row (we are summing only one
> number per group) is twice the time it took to read those rows from
> the table. That looks odd. But it may not be something unrelated to
> your patch. I also observed that for explain analyze select a from
> grp2 group by a; we just produce a plan containing seq scan node,
> which is a good thing.
>

Great and welcome back Ashutosh:)


> --
> Best Wishes,
> Ashutosh Bapat
>


Re: [DOC] Document concurrent index builds waiting on each other

2020-04-16 Thread James Coleman
On Thu, Apr 16, 2020 at 6:12 PM Andres Freund  wrote:
>
> Hi,
>
> On 2020-04-15 21:44:48 -0400, James Coleman wrote:
> > On Wed, Apr 15, 2020 at 6:31 PM Andres Freund  wrote:
> > > If it's about the xmin horizon for vacuum: I think we could probably
> > > avoid that using the same flag. As vacuum cannot be run against a table
> > > that has a CIC running (although it'd theoretically be possible to allow
> > > that), it should be safe to ignore PROC_IN_CIC backends in vacuum's
> > > GetOldestXmin() call. That might not be true for system relations, but
> > > we don't allow CIC on those.
> >
> > Yeah, I mean that if I have a CIC running on table X then vacuum can't
> > remove dead tuples (from after the CIC's snapshot) on table Y.
>
> For me "blocking" evokes waiting for a lock, which is why I thought
> you'd not mean that issue.

It was sloppy choice of language on my part; for better or worse at
work we've taken to talking about "blocking vacuum" when that's really
shorthand for "blocking [or you'd prefer preventing] vacuuming dead
tuples".

> > That's a pretty significant danger, given the combination of:
> > 1. Index builds on very large tables can take many days, and
>
> We at least don't hold a single snapshot over the multiple phases...

For sure. And text sorting improvements have made this better also,
still, as you often point out re: xid size, databases are only getting
larger (and more TPS).

> > 2. The well understood problems of high update tables with dead tuples
> > and poor plans.
>
> Which specific problem are you referring to? The planner probing the end
> of the index for values outside of the histogram? I'd hope
> 3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd improved the situation there a
> bit?

Yes, and other commits too, IIRC from the time we spent debugging
exactly the scenario mentioned in that commit.

But by "poor plans" I don't mean specifically "poor planning time" but
that we can still end up choosing the "wrong" plan, right? And dead
tuples can make an index scan be significantly worse than it would
otherwise be. Same for a seq scan: you can end up looking at millions
of dead tuples in a nominally 500 row table.

> > > [description why we could ignore CIC for vacuum horizon on other tables ]
>
> > I've previously discussed this with other hackers and the reasoning
> > they'd understood way that we couldn't always safely ignore
> > PROC_IN_CIC backends in the vacuum's oldest xmin call because of
> > function indexes, and the fact that (despite clear recommendations to
> > the contrary), there's nothing actually preventing someone from adding
> > a function index on table X that queries table Y.
>
> Well, even if we consider this an actual problem, we could still use
> PROC_IN_CIC for non-expression non-partial indexes (index operator
> themselves better ensure this isn't a problem, or they're ridiculously
> broken already - they can get called during vacuum).

Agreed. It'd be unfortunate to have to limit it though.

> Even when expressions are involved, I don't think that necessarily would
> have to mean that we need to use the same snapshot to run expressions in
> for the whole scan. So we could occasionally take a new snapshot for the
> purpose of computing expressions.
>
> The hard part presumably would be that we'd need to advertise one xmin
> for the expression snapshot to protect tuples potentially accessed from
> being removed, but at the same time we also need to advertise the xmin
> of the snapshot used by CIC, to avoid HOT pruning in other session from
> removing tuple versions from the table the index is being created
> on.
>
> There's not really infrastructure for doing so. I think we'd basically
> have to start publicizing multiple xmin values (as long as PGXACT->xmin
> is <= new xmin for expressions, only GetOldestXmin() would need to care,
> and it's not that performance critical). Not pretty.

In other words, pretty invasive.

> > I'm not sure I buy that we should care about people doing something
> > clearly so dangerous, but...I grant that it'd be nice not to cause new
> > crashes.
>
> I don't think it's just dangerous expressions that would be
> affected. Normal expression indexes need to be able to do syscache
> lookups etc, and they can't safely do so if tuple versions can be
> removed in the middle of a scan. We could avoid that by not ignoring
> PROC_IN_CIC backend in GetOldestXmin() calls for catalog tables (yuck).

At first glance this sounds a lot less invasive, but I also agree it's gross.

James




Re: sqlsmith crash incremental sort

2020-04-16 Thread James Coleman
On Thu, Apr 16, 2020 at 8:54 PM Tomas Vondra
 wrote:
>
> On Wed, Apr 15, 2020 at 11:26:12AM -0400, James Coleman wrote:
> >On Wed, Apr 15, 2020 at 10:47 AM Tomas Vondra
> > wrote:
> >>
> >> ...
> >>
> >> Yeah. And I'm not even sure having that information would allow good
> >> estimates e.g. for UNIONs of multiple relations etc.
> >>
> >> >> Another option is to use something as simple as checking for Vars with
> >> >> varno==0 in cost_incremental_sort() and ignoring them somehow. We could
> >> >> simply use some arbitrary estimate - by assuming the rows are unique or
> >> >> something like that. Yes, I agree it's pretty ugly and I'd much rather
> >> >> find a way to generate something sensible, but I'm not even sure we can
> >> >> generate good estimate when doing UNION of data from different relations
> >> >> and so on. The attached (ugly) patch does this.
> >> >
> >> >...therefore I think this is worth proceeding with.
> >> >
> >>
> >> OK, then the question is what estimate to use in this case. Should we
> >> assume 1 group or uniqueness? I'd assume a single group produces costs
> >> slightly above regular sort, right?
> >
> >Originally I'd intuitively leaned towards assuming they were unique.
> >But that would be the best case for memory/disk space usage, for
> >example, and the costing for incremental sort is always (at least
> >mildly) higher than regular sort if the number of groups is 1. That
> >also guarantees the startup cost is higher than regular sort also.
> >
> >So I think using a number of groups estimate of 1, we just wouldn't
> >choose an incremental sort ever in this case.
> >
> >Maybe that's the right choice? It'd certainly be the conservative
> >choice. What are your thoughts on the trade-offs there?
> >
>
> I think we have essentially three options:
>
> 1) assuming there's just a single group
>
> This should produce cost estimate higher than plain sort, disabling
> incremental sort. I'd say this is "worst case" assumption. I think this
> might be overly pessimistic, though.
>
> 2) assuming each row is a separate group
>
> If (1) is worst case scenario, then this is probably the best case,
> particularly when the query is sensitive to startup cost.
>
>
> 3) something in between
>
> If (1) and (2) are worst/best-case scenarios, maybe we should pick
> something in between. We have DEFAULT_NUM_DISTINCT (200) which
> essentially says "we don't know what the number of groups is" so maybe
> we should use that. Another option would be nrows/10, which is the cap
> we use in estimate_num_groups without extended stats.
>
>
> I was leaning towards (1) as "worst case" choice seems natural to
> prevent possible regressions. But consider this:
>
>create table small (a int);
>create table large (a int);
>
>insert into small select mod(i,10) from generate_series(1,100) s(i);
>insert into large select mod(i,10) from generate_series(1,100000) s(i);
>
>analyze small;
>analyze large;
>
>explain select i from large union select i from small;
>
>                                  QUERY PLAN
> ---------------------------------------------------------------------------
>  Unique  (cost=11260.35..11760.85 rows=100100 width=4)
>    ->  Sort  (cost=11260.35..11510.60 rows=100100 width=4)
>          Sort Key: large.i
>          ->  Append  (cost=0.00..2946.50 rows=100100 width=4)
>                ->  Seq Scan on large  (cost=0.00..1443.00 rows=100000 width=4)
>                ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=4)
> (6 rows)
>
> The estimate of the number of groups is clearly bogus - we know there are
> only 10 groups in each relation, but here we end up with 100100.
>
> So perhaps we should do (2) to keep the behavior consistent?

First of all, I agree that we shouldn't (at this point in the cycle)
try to apply pathkey reordering etc. We already have ideas about
additional ways to apply and improve usage of incremental sort, and it
seem like this naturally fits into that list.

Second of all, I like the idea of keeping it consistent (even if
consistency boils down to "we're just guessing").

James




Re: sqlsmith crash incremental sort

2020-04-16 Thread Tomas Vondra

On Wed, Apr 15, 2020 at 11:26:12AM -0400, James Coleman wrote:

On Wed, Apr 15, 2020 at 10:47 AM Tomas Vondra
 wrote:


...

Yeah. And I'm not even sure having that information would allow good
estimates e.g. for UNIONs of multiple relations etc.

>> Another option is to use something as simple as checking for Vars with
>> varno==0 in cost_incremental_sort() and ignoring them somehow. We could
>> simply use some arbitrary estimate - by assuming the rows are unique or
>> something like that. Yes, I agree it's pretty ugly and I'd much rather
>> find a way to generate something sensible, but I'm not even sure we can
>> generate good estimate when doing UNION of data from different relations
>> and so on. The attached (ugly) patch does this.
>
>...therefore I think this is worth proceeding with.
>

OK, then the question is what estimate to use in this case. Should we
assume 1 group or uniqueness? I'd assume a single group produces costs
slightly above regular sort, right?


Originally I'd intuitively leaned towards assuming they were unique.
But that would be the best case for memory/disk space usage, for
example, and the costing for incremental sort is always (at least
mildly) higher than regular sort if the number of groups is 1. That
also guarantees the startup cost is higher than regular sort also.

So I think using a number of groups estimate of 1, we just wouldn't
choose an incremental sort ever in this case.

Maybe that's the right choice? It'd certainly be the conservative
choice. What are your thoughts on the trade-offs there?



I think we have essentially three options:

1) assuming there's just a single group

This should produce cost estimate higher than plain sort, disabling
incremental sort. I'd say this is "worst case" assumption. I think this
might be overly pessimistic, though.

2) assuming each row is a separate group

If (1) is worst case scenario, then this is probably the best case,
particularly when the query is sensitive to startup cost.


3) something in between

If (1) and (2) are worst/best-case scenarios, maybe we should pick
something in between. We have DEFAULT_NUM_DISTINCT (200) which
essentially says "we don't know what the number of groups is" so maybe
we should use that. Another option would be nrows/10, which is the cap
we use in estimate_num_groups without extended stats.


I was leaning towards (1) as "worst case" choice seems natural to
prevent possible regressions. But consider this:

  create table small (a int);
  create table large (a int);
  
  insert into small select mod(i,10) from generate_series(1,100) s(i);

   insert into large select mod(i,10) from generate_series(1,100000) s(i);

  analyze small;
  analyze large;

  explain select i from large union select i from small;

                                 QUERY PLAN
   ---------------------------------------------------------------------------
    Unique  (cost=11260.35..11760.85 rows=100100 width=4)
      ->  Sort  (cost=11260.35..11510.60 rows=100100 width=4)
            Sort Key: large.i
            ->  Append  (cost=0.00..2946.50 rows=100100 width=4)
                  ->  Seq Scan on large  (cost=0.00..1443.00 rows=100000 width=4)
                  ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=4)
   (6 rows)

The estimate of the number of groups is clearly bogus - we know there are
only 10 groups in each relation, but here we end up with 100100.

So perhaps we should do (2) to keep the behavior consistent?
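
For reference, the shape of the fallback from the "ugly" patch mentioned above
is roughly this (a sketch; "unsafe_vars" stands for the varno==0 check, and the
fallback value is whichever of options (1)-(3) we settle on):

    /*
     * In cost_incremental_sort(): skip estimate_num_groups() when the
     * presorted expressions contain Vars we cannot estimate (varno 0, as in
     * the setop case that crashed), and use a fixed guess instead.
     */
    if (unsafe_vars)
        input_groups = input_tuples;    /* option (2): each row its own group */
    else
        input_groups = estimate_num_groups(root, presortedExprs,
                                           input_tuples, NULL);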


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-16 Thread Michael Paquier
On Thu, Apr 16, 2020 at 03:11:51PM -0400, Tom Lane wrote:
> Even if I liked the core idea, loading the functionality onto VACUUM seems
> like a fairly horrid design choice.  It's quite unrelated to what that
> command does.  In the autovac code path, it's going to lead to multiple
> autovac workers all complaining simultaneously about the same problem.
> But having manual vacuums complain about issues unrelated to the task at
> hand is also a seriously poor bit of UX design.  Moreover, that won't do
> all that much to surface problems, since most(?) installations never run
> manual vacuums; or if they do, the "manual" runs are really done by a cron
> job or the like, which is not going to notice the warnings.  So you still
> need a log-scraping tool.

+1.

> If we were going to go down the path of periodically logging warnings
> about old prepared transactions, some single-instance background task
> like the checkpointer would be a better place to do the work in.  But
> I'm not really recommending that, because I agree with Robert that
> we just plain don't want this functionality.

I am not sure that the checkpointer is a good place to do that either,
joining back with your argument in the first paragraph of this email
related to vacuum.  One potential approach would be a contrib module
that works as a background worker?  However, I would think that
finding a minimum set of requirements that we think are generic enough
for most users would be something hard to draft a list of.  If we had
a small, minimal contrib/ module in core that people could easily
extend for their own needs and that we would intentionally keep as
minimal, in the same spirit as say passwordcheck, perhaps..
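
To illustrate the scale I have in mind, a minimal sketch of such a module could
look like the following (modeled on worker_spi; the module name, target
database, threshold, and sleep interval are all made up, and signal handling,
GUCs, and error handling are left out):

    /* orphaned_2pc_warn.c -- hypothetical module; load via
     * shared_preload_libraries. */
    #include "postgres.h"

    #include "access/xact.h"
    #include "executor/spi.h"
    #include "fmgr.h"
    #include "miscadmin.h"
    #include "pgstat.h"
    #include "postmaster/bgworker.h"
    #include "storage/latch.h"
    #include "utils/snapmgr.h"

    PG_MODULE_MAGIC;

    void        _PG_init(void);
    void        orphaned_2pc_main(Datum main_arg);

    void
    orphaned_2pc_main(Datum main_arg)
    {
        BackgroundWorkerUnblockSignals();
        BackgroundWorkerInitializeConnection("postgres", NULL, 0);

        for (;;)
        {
            StartTransactionCommand();
            SPI_connect();
            PushActiveSnapshot(GetTransactionSnapshot());

            /* warn about prepared transactions older than an arbitrary day */
            if (SPI_execute("SELECT gid FROM pg_prepared_xacts"
                            " WHERE prepared < now() - interval '1 day'",
                            true, 0) == SPI_OK_SELECT)
            {
                for (uint64 i = 0; i < SPI_processed; i++)
                    ereport(WARNING,
                            (errmsg("prepared transaction \"%s\" is more than one day old",
                                    SPI_getvalue(SPI_tuptable->vals[i],
                                                 SPI_tuptable->tupdesc, 1))));
            }

            SPI_finish();
            PopActiveSnapshot();
            CommitTransactionCommand();

            /* sleep for an hour, or until the latch is set */
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             60 * 60 * 1000L, PG_WAIT_EXTENSION);
            ResetLatch(MyLatch);
        }
    }

    void
    _PG_init(void)
    {
        BackgroundWorker worker;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = 60;
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "orphaned_2pc_warn");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "orphaned_2pc_main");
        snprintf(worker.bgw_name, BGW_MAXLEN, "orphaned prepared transaction watcher");
        snprintf(worker.bgw_type, BGW_MAXLEN, "orphaned_2pc_warn");
        RegisterBackgroundWorker(&worker);
    }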
--
Michael




Re: Problem with logical replication (crash with REPLICA IDENTITY FULL and cascading replication)

2020-04-16 Thread Justin Pryzby
Confirmed and added to opened items - this is a live bug back to v10.

for a in data data1; do ./tmp_install/usr/local/pgsql/bin/initdb -D $a --no-sync& done; wait
echo "wal_level = logical" >> data/postgresql.conf; echo "port=5433" >> data1/postgresql.conf
for a in data data1; do ./tmp_install/usr/local/pgsql/bin/postmaster -D $a& done
for a in "CREATE TABLE pgbench_accounts(a int primary key, b int)" "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL" "CREATE PUBLICATION mypub FOR TABLE pgbench_accounts"; do \
	for p in 5432 5433; do psql -d postgres -h /tmp -p "$p" -c "$a"; done;
done

psql -d postgres -h /tmp -p 5433 -c "CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432 dbname=postgres' PUBLICATION mypub"
psql -d postgres -h /tmp -p 5432 -c "insert into pgbench_accounts values(1,2); update pgbench_accounts set b=30 where a=1;"

-- 
Justin




[PATCH] Fix possible overflow on tuplesort.c

2020-04-16 Thread Ranier Vilela
Hi,

When multiplying variables, the overflow will take place anyway, and only
then will the meaningless product be explicitly promoted to type int64.
It is one of the operands that should have been cast instead to avoid the
overflow.
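
For anyone wanting a standalone illustration of the failure mode (the operand
values are made up; signed overflow is undefined behavior in C, but on common
platforms it silently wraps):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        int         tuple_width = 1024 * 1024;     /* made-up operands */
        int         ntuples = 4096;

        /* Broken: the product is computed in 32-bit int and wraps before the
         * cast, so the cast merely promotes an already-wrong value. */
        int64_t     wrong = (int64_t) (tuple_width * ntuples);

        /* Fixed: casting one operand first makes the whole multiplication
         * happen in 64-bit arithmetic. */
        int64_t     right = (int64_t) tuple_width * ntuples;

        printf("wrong = %lld, right = %lld\n",
               (long long) wrong, (long long) right);
        return 0;
    }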

regards,
Ranier Vilela


tuplesort_avoid_possible_overflow.patch
Description: Binary data


Re: sqlsmith crash incremental sort

2020-04-16 Thread Tomas Vondra

On Thu, Apr 16, 2020 at 08:44:16PM +0200, Tomas Vondra wrote:

On Thu, Apr 16, 2020 at 12:04:03PM -0400, James Coleman wrote:

On Thu, Apr 16, 2020 at 8:22 AM Richard Guo  wrote:



On Thu, Apr 16, 2020 at 6:35 PM Richard Guo  wrote:



On Wed, Apr 15, 2020 at 10:47 PM Tomas Vondra  
wrote:



Well, yeah. The problem is the Unique simply compares the columns in the
order it sees them, and it does not match the column order desired by
incremental sort. But we don't push down this information at all :-(



This is a nice optimization better to have. Since the 'Sort and Unique'
would unique-ify the result of a UNION by sorting on all columns, why
not we adjust the sort order trying to match parse->sortClause so that
we can avoid the final sort node?

Doing that we can transform plan from:

# explain (costs off) select * from foo union select * from foo order by 1,3;
              QUERY PLAN
 ----------------------------------------
  Incremental Sort
    Sort Key: foo.a, foo.c
    Presorted Key: foo.a
    ->  Unique
          ->  Sort
                Sort Key: foo.a, foo.b, foo.c
                ->  Append
                      ->  Seq Scan on foo
                      ->  Seq Scan on foo foo_1
 (9 rows)

To:

# explain (costs off) select * from foo union select * from foo order by 1,3;
            QUERY PLAN
 -----------------------------------
  Unique
    ->  Sort
          Sort Key: foo.a, foo.c, foo.b
          ->  Append
                ->  Seq Scan on foo
                ->  Seq Scan on foo foo_1
 (6 rows)



Attached is what I'm thinking about this optimization. Does it make any
sense?


Shouldn't this go one either a new thread or on the thread for the
patch Tomas was referencing (by Teodor I believe)?



FWIW the optimization I had in mind is this:

 https://commitfest.postgresql.org/21/1651/

I now realize that was about GROUP BY, but it's not all that different
and the concerns will / should be fairly similar, I think.

IMO simply tweaking the sort keys to match the upper parts of the plan
is probably way too simplistic, I'm afraid. For example, if the Unique
significantly reduces cardinality, then the cost of the additional sort
is much less important. It may be much better to optimize the "large"
sort of the whole data set, either by reordering the columns as proposed
by Teodor in his patch (by number of distinct values and/or cost of the
comparison function).

Furthermore, this is one of the places that is not using incremental
sort yet - I can easily imagine doing something like this:


   Sort
     ->  Unique
           ->  Incremental Sort
                 ->  ...

could be a massive win. So I think we can't just rejigger the sort keys
abitrarily, we should / need to consider those alternatives.


Or are you saying you believe this patch guarantees we never see this
problem in incremental sort costing?



Yeah, that's not entirely clear to me. But maybe it shows us where to
get the unprocessed target list?



I think at the very least this needs to apply the same change also to
generate_nonunion_paths, because otherwise this fails because of the
same issue:

  set enable_hashagg = off;
  explain select * from foo except select * from foo order by 1, 3;

I'm still of the opinion that this is really an optimization and
behavior change, and I feel rather uneasy about pushing it post feature
freeze without appropriate review. I also think we really ought to
consider how would this work with the other optimizations I outlined
elsewhere in this thread (comparison costs, ...).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Poll: are people okay with function/operator table redesign?

2020-04-16 Thread Alvaro Herrera
v1 is good.

I like your v2 even better.  If it becomes possible to remove or soften
the "inter-row" horizontal line with CSS tricks afterwards, that would
be swell, but even without that, I cast my vote for using this table
format.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[PATCH] Tiny optimization on nbtinsert.c

2020-04-16 Thread Ranier Vilela
Hi,

Avoiding some calls and variable assignments when they are not necessary.

best regards,
Ranier Vilela


nbtinsert_tiny_optimization.patch
Description: Binary data


Re: xid wraparound danger due to INDEX_CLEANUP false

2020-04-16 Thread Andres Freund
Hi,

On 2020-04-16 13:28:00 -0700, Peter Geoghegan wrote:
> > And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> > INDEX_CLEANUP on might not even visit the index. As there very well
> > might not be many dead heap tuples around anymore (previous vacuums with
> > cleanup off will have removed them), the
> > vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
> > contrast to the normal situations where the btm_oldest_btpo_xact check
> > will prevent that from becoming a problem.
> 
> I guess that they should visit the metapage to see if they need to do
> that much. That would allow us to fix the problem while mostly
> honoring INDEX_CLEANUP off, I think.

Yea. _bt_vacuum_needs_cleanup() needs to check if
metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
vacuum_set_xid_limits(), and vacuum the index if so even if INDEX_CLEANUP is
false.
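
For illustration only, the check could look roughly like this (freeze_limit
is a placeholder for however the limit from vacuum_set_xid_limits() gets
passed down to the AM, which doesn't happen today):

    if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
        TransactionIdPrecedes(metad->btm_oldest_btpo_xact, freeze_limit))
        return true;    /* do the cleanup scan despite INDEX_CLEANUP off */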


> BTW, a lot of people get confused about what half-dead pages are. I
> would like to make something clear that may not be obvious: While it's
> bad that the implementation leaks pages that should go in the FSM,
> it's not the end of the world. They should get evicted from
> shared_buffers pretty quickly if there is any pressure, and impose no
> real cost on index scans.

Yea, half-dead pages aren't the main problem. It's pages that contain
only dead tuples, but aren't unlinked from the tree. Without a vacuum
scan we'll never reuse them - even if we know they're all dead.

Note that the page being in the FSM is not protection against wraparound
:(. We recheck whether a page is recyclable when getting it from the FSM
(probably required, due to the FSM not being crashsafe). It's of course
much less likely to happen at that stage, because the pages can get
reused.

I think we should really just stop being miserly and update the xid to be
FrozenTransactionId or InvalidTransactionId when vacuum encounters one
that's from before the xid cutoff used by vacuum (i.e. what could
become the new relfrozenxid).  That seems like it'd be a few lines, not
more.
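
Roughly, for a deleted page seen during the vacuum scan, something along
these lines (illustrative only; xid_cutoff is a placeholder for that cutoff,
and WAL-logging of the change is left out):

    if (P_ISDELETED(opaque) &&
        TransactionIdIsValid(opaque->btpo.xact) &&
        TransactionIdPrecedes(opaque->btpo.xact, xid_cutoff))
    {
        /* old enough that no scan can still care; defuse the wraparound */
        opaque->btpo.xact = FrozenTransactionId;
        MarkBufferDirty(buf);
    }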


> Another thing that's worth pointing out is that this whole
> RecentGlobalXmin business is how we opted to implement what Lanin &
> Sasha call "the drain technique". It is rather different to the usual
> ways in which we use RecentGlobalXmin. We're only using it as a proxy
> (an absurdly conservative proxy) for whether or not there might be an
> in-flight index scan that lands on a concurrently recycled index page
> and gets completely confused. So it is purely about the integrity of
> the data structure itself. It is a consequence of doing so little
> locking when descending the tree -- our index scans don't need to
> couple buffer locks on the way down the tree at all. So we make VACUUM
> worry about that, rather than making index scans worry about VACUUM
> (though the latter design is a reasonable and common one).
>
> There is absolutely no reason why we have to delay recycling for very
> long, even in cases with long running transactions or whatever. I
> agree that it's just an accident that it works that way. VACUUM could
> probably remember deleted pages, and then revisited those pages at the
> end of the index vacuuming -- that might make a big difference in a
> lot of workloads. Or it could chain them together as a linked list
> which can be accessed much more eagerly in some cases.

I think it doesn't really help meaningfully for vacuum to be a bit
smarter about when to recognize pages as being recyclable. IMO the big
issue is that vacuum won't be very frequent, so we'll grow the index
until that time, even if there's many "effectively empty" pages.

I.e. even if the killtuples logic allows us to recognize that all actual
index tuples are fully dead, we'll not benefit from that unless there's
a new insertion that belongs onto the "empty" page. That's fine for
indexes that are updated roughly evenly across the value range, but
terrible for indexes that "grow" mostly on one side, and "shrink" on the
other.

I'd bet it'd be beneficial if we were to either have scans unlink such
pages directly, or if they just entered the page into the FSM and have
_bt_getbuf() do the unlinking.  I'm not sure if the current locking
model assumes anywhere that there is only one process (vacuum) unlinking
pages though?

Greetings,

Andres Freund




Re: remove_useless_groupby_columns does not need to record constraint dependencies

2020-04-16 Thread David Rowley
On Fri, 17 Apr 2020 at 02:53, Tom Lane  wrote:
>
> David Rowley  writes:
> > I've reworded the comment in the attached version.
>
> LGTM.

Thanks for reviewing. Pushed.




Re: [PATCH'] Variables assigned with values that is never used.

2020-04-16 Thread Ranier Vilela
On Sat, Mar 28, 2020 at 10:33 AM, Ranier Vilela
wrote:

> Hi,
>
> These variables are assigned values that are never used, and the
> assignments can safely be removed.
>
1.
https://github.com/postgres/postgres/commit/f0ca378d4c139eda99ef14998115c1674dac3fc5

diff --git a/src/backend/access/nbtree/nbtsplitloc.c
b/src/backend/access/nbtree/nbtsplitloc.c
index 8ba055be9e..15ac106525 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -812,7 +812,6 @@ _bt_bestsplitloc(FindSplitData *state, int
perfectpenalty,

  if (penalty <= perfectpenalty)
  {
- bestpenalty = penalty;
  lowsplit = i;
  break;
  }

Coincidence? I think not.

regards,
Ranier Vilela


Re: documenting the backup manifest file format

2020-04-16 Thread Jehan-Guillaume de Rorthais
On Wed, 15 Apr 2020 18:54:14 -0400
David Steele  wrote:

> On 4/15/20 6:43 PM, Jehan-Guillaume de Rorthais wrote:
> > On Wed, 15 Apr 2020 12:03:28 -0400
> > Robert Haas  wrote:
> >   
> >> On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
> >>  wrote:  
> >>> But for backup_manifest, it's kind of a shame we have to check the checksum
> >>> against a transformed version of the file. Did you consider creating eg.
> >>> a separate backup_manifest.sha256 file?
> >>>
> >>> I'm very sorry in advance if this has been discussed previously.  
> >>
> >> It was briefly mentioned in the original (lengthy) discussion, but I
> >> think there was one vote in favor and two votes against or something
> >> like that, so it didn't go anywhere.  
> > 
> > Argh.
> >   
> >> I didn't realize that there were handy command-line tools for manipulating
> >> json like that, or I probably would have considered that idea more
> >> strongly.  
> > 
> > That was indeed a lengthy thread with various details discussed. I'm sorry I
> > didn't catch the ball back then.  
> 
> One of the reasons to use JSON was to be able to use command line tools 
> like jq to do tasks (I use it myself).

That's perfectly fine. I was only wondering about having the manifest checksum
outside of the manifest itself.

> But I think only the pg_verifybackup tool should be used to verify the
> internal checksum.

true.

> Two thoughts:
> 
> 1) You can always generate an external checksum when you generate the 
> backup if you want to do your own verification without running 
> pg_verifybackup.

Sure, but by the time I want to produce an external checksum, the manifest
would have traveled around quite a bit, with various dangers along the way that
could corrupt it. Checksumming it from the original process that produced it
sounds safer.

> 2) Perhaps it would be good if the pg_verifybackup command had a 
> --verify-manifest-checksum option (or something) to check that the 
> manifest file looks valid without checking any files. That's not going 
> to happen for PG13, but it's possible for PG14.

Sure.

I just liked the idea of being able to check the manifest using an external
command-line tool implementing the same standardized checksum algorithm,
without editing the manifest first. But I understand it's too late to discuss
this now.

Regards,




Re: [DOC] Document concurrent index builds waiting on each other

2020-04-16 Thread Andres Freund
Hi,

On 2020-04-15 21:44:48 -0400, James Coleman wrote:
> On Wed, Apr 15, 2020 at 6:31 PM Andres Freund  wrote:
> > If it's about the xmin horizon for vacuum: I think we could probably
> > avoid that using the same flag. As vacuum cannot be run against a table
> > that has a CIC running (although it'd theoretically be possible to allow
> > that), it should be safe to ignore PROC_IN_CIC backends in vacuum's
> > GetOldestXmin() call. That might not be true for system relations, but
> > we don't allow CIC on those.
>
> Yeah, I mean that if I have a CIC running on table X then vacuum can't
> remove dead tuples (from after the CIC's snapshot) on table Y.

For me "blocking" evokes waiting for a lock, which is why I thought
you'd not mean that issue.


> That's a pretty significant danger, given the combination of:
> 1. Index builds on very large tables can take many days, and

We at least don't hold a single snapshot over the multiple phases...


> 2. The well understood problems of high update tables with dead tuples
> and poor plans.

Which specific problem are you referring to? The planner probing the end
of the index for values outside of the histogram? I'd hope
3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd improved the situation there a
bit?


> > [description why we could ignore CIC for vacuum horizon on other tables ]

> I've previously discussed this with other hackers and the reasoning
> they'd understood way that we couldn't always safely ignore
> PROC_IN_CIC backends in the vacuum's oldest xmin call because of
> function indexes, and the fact that (despite clear recommendations to
> the contrary), there's nothing actually preventing someone from adding
> a function index on table X that queries table Y.

Well, even if we consider this an actual problem, we could still use
PROC_IN_CIC for non-expression non-partial indexes (index operators
themselves had better ensure this isn't a problem, or they're ridiculously
broken already - they can get called during vacuum).

Even when expressions are involved, I don't think that necessarily would
have to mean that we need to use the same snapshot to run expressions in
for the whole scan. So we could occasionally take a new snapshot for the
purpose of computing expressions.

The hard part presumably would be that we'd need to advertise one xmin
for the expression snapshot to protect tuples potentially accessed from
being removed, but at the same time we also need to advertise the xmin
of the snapshot used by CIC, to avoid HOT pruning in other sessions from
removing tuple versions from the table the index is being created
on.

There's not really infrastructure for doing so. I think we'd basically
have to start publicizing multiple xmin values (as long as PGXACT->xmin
is <= new xmin for expressions, only GetOldestXmin() would need to care,
and it's not that performance critical). Not pretty.


> I'm not sure I buy that we should care about people doing something
> clearly so dangerous, but...I grant that it'd be nice not to cause new
> crashes.

I don't think it's just dangerous expressions that would be
affected. Normal expression indexes need to be able to do syscache
lookups etc, and they can't safely do so if tuple versions can be
removed in the middle of a scan. We could avoid that by not ignoring
PROC_IN_CIC backends in GetOldestXmin() calls for catalog tables (yuck).
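
Sketch of what that could look like inside GetOldestXmin()'s walk over the
proc array (PROC_IN_CIC is the flag being proposed here and does not exist
yet, so this is purely illustrative):

    if ((pgxact->vacuumFlags & PROC_IN_CIC) != 0 &&
        rel != NULL && !IsCatalogRelation(rel))
        continue;       /* CIC backends don't hold back non-catalog horizons */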

Regards,

Andres




Re: Poll: are people okay with function/operator table redesign?

2020-04-16 Thread Andreas Karlsson

On 4/14/20 4:52 PM, Tom Lane wrote:

Andreas Karlsson  writes:

That said, I agree that quite a few of our tables right now are
ugly, but I prefer ugly to hard to read. For me the mix of having every
third row split into two fields makes the tables very hard to read. I
have a hard time seeing which rows belong to which function.


Did you look at the variants without that discussed downthread?


Yeah, I did. Some of them are quite readable, for example your latest two
screenshots of table 9.10.


Andreas




Re: cleaning perl code

2020-04-16 Thread Mark Dilger



> On Apr 16, 2020, at 2:07 PM, Andrew Dunstan  
> wrote:
> 
> 
> On 4/16/20 11:12 AM, Alvaro Herrera wrote:
>> On 2020-Apr-16, Hamlin, Garick L wrote:
>> 
>>> With the old expression, 'something;' would be stripped away.
>>> Is that an issue where this is used?  Why are we parsing
>>> these headers?
>> These are files from which bootstrap catalog data is generated, which is
>> why we parse from Perl; but also where C structs are declared, which is
>> why they're C.
>> 
>> I think switching to non-greedy is a win in itself.  Non-capturing
>> parens is probably a wash (this doesn't run often so the performance
>> argument isn't very interesting).
> 
> 
> Yeah, I'm inclined to fix this independently of the perlcritic stuff.
> The change is more readable and more correct as well as being perlcritic
> friendly.
> 
> 
> I might take a closer look at Catalog.pm.
> 
> 
> Meanwhile, the other regex highlighted in the patch, in Solution.pm:
> 
> 
> if (/^AC_INIT\(\[([^\]]+)\], \[([^\]]+)\], \[([^\]]+)\], \[([^\]]*)\],
> \[([^\]]+)\]/)
> 
> 
> is sufficiently horrid that I think we should see if we can rewrite it,

my $re = qr/
        \[      # literal opening bracket
        (       # capture anything but a closing bracket
            (?> # without backtracking
                [^\]]+
            )
        )
        \]      # literal closing bracket
    /x;
if (/^AC_INIT\($re, $re, $re, $re, $re/)



> maybe as an extended regex. And a better fix here instead of marking the
> fourth group as non-capturing would be simply to get rid of the parens
> altogether. They serve no purpose at all.

But then you'd have to use something else in position 4, which complicates the 
code.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-16 Thread Tomas Vondra

On Mon, Apr 13, 2020 at 05:20:39PM +0530, Dilip Kumar wrote:

On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh  wrote:


On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar  wrote:
>
> I have rebased the patch on the latest head.  I haven't yet changed
> anything for xid assignment thing because it is not yet concluded.
>
Some review comments from 0001-Immediately-WAL-log-*.patch,

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
IMHO, it's important to reduce the complexity of this function since
it's been called for every WAL insertion. During the lifespan of a
transaction, any of these if conditions will only be evaluated if
previous conditions are true. So, we can maintain some state machine
to avoid multiple evaluations of a condition inside a transaction. But
if the overhead is not much, it's not worth it, I guess.


Yeah, maybe. In some cases we can avoid checking multiple conditions by
maintaining that state.  But that state will have to be at the
transaction level, and I am not sure how worthwhile it will be to
add one extra condition to skip a few if checks; it will also add
code complexity.  And, in some cases where logical decoding is not
enabled, it may add one extra check - I mean, first check the state, and
that will take you to the first if check anyway.



Perhaps. I think we should only do that if we can demonstrate it's an
issue in practice. Otherwise it's just unnecessary complexity.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: cleaning perl code

2020-04-16 Thread Andrew Dunstan


On 4/16/20 11:12 AM, Alvaro Herrera wrote:
> On 2020-Apr-16, Hamlin, Garick L wrote:
>
>> With the old expression, 'something;' would be stripped away.
>> Is that an issue where this is used?  Why are we parsing
>> these headers?
> These are files from which bootstrap catalog data is generated, which is
> why we parse from Perl; but also where C structs are declared, which is
> why they're C.
>
> I think switching to non-greedy is a win in itself.  Non-capturing
> parens is probably a wash (this doesn't run often so the performance
> argument isn't very interesting).


Yeah, I'm inclined to fix this independently of the perlcritic stuff.
The change is more readable and more correct as well as being perlcritic
friendly.


I might take a closer look at Catalog.pm.


Meanwhile, the other regex highlighted in the patch, in Solution.pm:


if (/^AC_INIT\(\[([^\]]+)\], \[([^\]]+)\], \[([^\]]+)\], \[([^\]]*)\],
\[([^\]]+)\]/)


is sufficiently horrid that I think we should see if we can rewrite it,
maybe as an extended regex. And a better fix here instead of marking the
fourth group as non-capturing would be simply to get rid of the parens
altogether. They serve no purpose at all.


>
> An example.  This eval in Catalog.pm
>
> +   ## no critic (ProhibitStringyEval)
> +   ## no critic (RequireCheckingReturnValueOfEval)
> +   eval '$hash_ref = ' . $_;
>
> is really weird stuff generally speaking, and the fact that we have to
> mark it specially for critic is a good indicator of that -- it serves as
> documentation.  Catalog.pm is all a huge weird hack, but it's a critically
> important hack.  Heck, what about RequireCheckingReturnValueOfEval --
> should we instead consider actually checking the return value of eval?
> It would seem to make sense, would it not?  (Not for this patch, though
> -- I would be fine with just adding the nocritic line now, and removing
> it later while fixing that).


+1


>
> All in all, I think it's a positive value in having this code be checked
> with a bit more strength -- checks that are pointless in, say, t/00*.pl
> prove files.



thanks


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: xid wraparound danger due to INDEX_CLEANUP false

2020-04-16 Thread Peter Geoghegan
On Thu, Apr 16, 2020 at 11:27 AM Andres Freund  wrote:
> Sure, there is some pre-existing wraparound danger for individual
> pages. But it's a pretty narrow corner case before INDEX_CLEANUP
> off.

It's a matter of degree. Hard to judge something like that.

> And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> INDEX_CLEANUP on might not even visit the index. As there very well
> might not be many dead heap tuples around anymore (previous vacuums with
> cleanup off will have removed them), the
> vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
> contrast to the normal situations where the btm_oldest_btpo_xact check
> will prevent that from becoming a problem.

I guess that they should visit the metapage to see if they need to do
that much. That would allow us to fix the problem while mostly
honoring INDEX_CLEANUP off, I think.

> Peter, as far as I can tell, with INDEX_CLEANUP off, nbtree will never
> be able to recycle half-dead pages? And thus would effectively never
> recycle any dead space? Is that correct?

I agree. The fact that btm_oldest_btpo_xact is an all-or-nothing thing
(with wraparound hazards) is bad in itself, and introduced new risk to
v11 compared to previous versions (without the INDEX_CLEANUP = off
feature entering into it).  The simple fact that we don't even check
it with INDEX_CLEANUP = off is a bigger problem, though, and one that
now seems unrelated.

BTW, a lot of people get confused about what half-dead pages are. I
would like to make something clear that may not be obvious: While it's
bad that the implementation leaks pages that should go in the FSM,
it's not the end of the world. They should get evicted from
shared_buffers pretty quickly if there is any pressure, and impose no
real cost on index scans.

There are (roughly) 3 types of pages that we're concerned about here
in the common case where we're just deleting a leaf page:

* A half-dead page -- no downlink in its parent, marked dead.

* A deleted page -- now no sidelinks, either. Not initially safe to recycle.

* A deleted page in the FSM -- this is what we have the interlock for.

Half-dead pages are pretty rare, because VACUUM really has to have a
hard crash for that to happen (that might not be 100% true, but it's
at least 99% true). That's always been the case, and we don't really
need to talk about them here at all. We're just concerned with deleted
pages in the context of this discussion (and whether or not they can
be recycled without confusing in-flight index scans). These are the
only pages that are marked with an XID at all.

Another thing that's worth pointing out is that this whole
RecentGlobalXmin business is how we opted to implement what Lanin &
Sasha call "the drain technique". It is rather different to the usual
ways in which we use RecentGlobalXmin. We're only using it as a proxy
(an absurdly conservative proxy) for whether or not there might be an
in-flight index scan that lands on a concurrently recycled index page
and gets completely confused. So it is purely about the integrity of
the data structure itself. It is a consequence of doing so little
locking when descending the tree -- our index scans don't need to
couple buffer locks on the way down the tree at all. So we make VACUUM
worry about that, rather than making index scans worry about VACUUM
(though the latter design is a reasonable and common one).

There is absolutely no reason why we have to delay recycling for very
long, even in cases with long running transactions or whatever. I
agree that it's just an accident that it works that way. VACUUM could
probably remember deleted pages, and then revisited those pages at the
end of the index vacuuming -- that might make a big difference in a
lot of workloads. Or it could chain them together as a linked list
which can be accessed much more eagerly in some cases.

--
Peter Geoghegan




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-16 Thread Kuntal Ghosh
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar  wrote:
>

Few review comments from 0006-Add-support-for-streaming*.patch

+ subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
lseek can return a negative value in case of error, right?
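
For example (illustrative only, message wording aside):

    off_t       off = lseek(stream_fd, 0, SEEK_END);

    if (off < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not seek to end of streaming transaction file: %m")));
    subxacts[nsubxacts].offset = off;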

+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ *
+ * Don't check for error from mkdir; it could fail if the directory
+ * already exists (maybe someone else just did the same thing).  If
+ * it doesn't work then we'll bomb out when opening the file
+ */
+ mkdir(tempdirpath, S_IRWXU);
If that's the only reason, perhaps we can use something like the following:

if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not create directory \"%s\": %m", tempdirpath)));

+
+ CloseTransientFile(stream_fd);
This might fail to close the file; we should handle that case.
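
For example (illustrative only):

    if (CloseTransientFile(stream_fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close streaming transaction file: %m")));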

Also, I think we need some implementation in dumpSubscription() to
dump the (streaming = 'on') option.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-16 Thread Tom Lane
Robert Haas  writes:
> On Thu, Apr 16, 2020 at 2:17 PM Hamid Akhtar  wrote:
>> So is the concern performance overhead rather than the need for such a 
>> feature?

> No, I don't think this would have any significant overhead. My concern
> is that I think it's the wrong way to solve the problem.

FWIW, I agree with Robert that this patch is a bad idea.  His
recommendation is to use an external monitoring tool, which is not a
self-contained solution, but this isn't either: you'd need to add an
external log-scraping tool to spot the warnings.

Even if I liked the core idea, loading the functionality onto VACUUM seems
like a fairly horrid design choice.  It's quite unrelated to what that
command does.  In the autovac code path, it's going to lead to multiple
autovac workers all complaining simultaneously about the same problem.
But having manual vacuums complain about issues unrelated to the task at
hand is also a seriously poor bit of UX design.  Moreover, that won't do
all that much to surface problems, since most(?) installations never run
manual vacuums; or if they do, the "manual" runs are really done by a cron
job or the like, which is not going to notice the warnings.  So you still
need a log-scraping tool.

If we were going to go down the path of periodically logging warnings
about old prepared transactions, some single-instance background task
like the checkpointer would be a better place to do the work in.  But
I'm not really recommending that, because I agree with Robert that
we just plain don't want this functionality.

regards, tom lane




Re: "cache reference leak" issue happened when using sepgsql module

2020-04-16 Thread Tom Lane
Michael Luo  writes:
> When using sepgsql module, I got warning "WARNING:  cache reference leak”.
> ...
> The patch attached to fix this issue, please check it.

Right you are, fix pushed.

regards, tom lane




Re: sqlsmith crash incremental sort

2020-04-16 Thread Tomas Vondra

On Thu, Apr 16, 2020 at 12:04:03PM -0400, James Coleman wrote:

On Thu, Apr 16, 2020 at 8:22 AM Richard Guo  wrote:



On Thu, Apr 16, 2020 at 6:35 PM Richard Guo  wrote:



On Wed, Apr 15, 2020 at 10:47 PM Tomas Vondra  
wrote:



Well, yeah. The problem is the Unique simply compares the columns in the
order it sees them, and it does not match the column order desired by
incremental sort. But we don't push down this information at all :-(



This is a nice optimization to have. Since the 'Sort and Unique'
would unique-ify the result of a UNION by sorting on all columns, why
not adjust the sort order, trying to match parse->sortClause, so that
we can avoid the final sort node?

Doing that we can transform plan from:

# explain (costs off) select * from foo union select * from foo order by 1,3;
  QUERY PLAN
---
 Incremental Sort
   Sort Key: foo.a, foo.c
   Presorted Key: foo.a
   ->  Unique
 ->  Sort
   Sort Key: foo.a, foo.b, foo.c
   ->  Append
 ->  Seq Scan on foo
 ->  Seq Scan on foo foo_1
(9 rows)

To:

# explain (costs off) select * from foo union select * from foo order by 1,3;
   QUERY PLAN
-
 Unique
   ->  Sort
 Sort Key: foo.a, foo.c, foo.b
 ->  Append
   ->  Seq Scan on foo
   ->  Seq Scan on foo foo_1
(6 rows)



Attached is what I'm thinking about this optimization. Does it make any
sense?


Shouldn't this go on either a new thread or on the thread for the
patch Tomas was referencing (by Teodor I believe)?



FWIW the optimization I had in mind is this:

  https://commitfest.postgresql.org/21/1651/

I now realize that was about GROUP BY, but it's not all that different
and the concerns will / should be fairly similar, I think.

IMO simply tweaking the sort keys to match the upper parts of the plan
is probably way too simplistic, I'm afraid. For example, if the Unique
significantly reduces cardinality, then the cost of the additional sort
is much less important. It may be much better to optimize the "large"
sort of the whole data set, either by reordering the columns as proposed
by Teodor in his patch (by number of distinct values and/or cost of the
comparison function).

Furthermore, this is one of the places that is not using incremental
sort yet - I can easily imagine doing something like this:


   Sort
     -> Unique
        -> Incremental Sort
           -> ...

That could be a massive win. So I think we can't just rejigger the sort keys
arbitrarily; we should / need to consider those alternatives.


Or are you saying you believe this patch guarantees we never see this
problem in incremental sort costing?



Yeah, that's not entirely clear to me. But maybe it shows us where we need to
get the unprocessed target list?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-04-16 Thread Andres Freund
Hi,

On 2020-04-15 18:11:40 -0700, Peter Geoghegan wrote:
> On Wed, Apr 15, 2020 at 4:57 PM Robert Haas  wrote:
> > I seem to recall Simon raising this issue at the time that the patch
> > was being discussed, and I thought that we had eventually decided that
> > it was OK for some reason. But I don't remember the details, and it is
> > possible that we got it wrong. :-(
> 
> It must be unreliable because it's based on something that is known to
> be unreliable:
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=c5b0a30e4ebd4fe3bd4a6f8192284c452d1170b9;hb=refs/heads/REL_12_STABLE#l331

Sure, there is some pre-existing wraparound danger for individual
pages. But it's a pretty narrow corner case before INDEX_CLEANUP
off.

That comment says something about "shared-memory free space map", making
it sound like any crash would lose the page. But it's a normal FSM
these days. Vacuum will insert the deleted page into the free space
map. So either the FSM would need to be corrupted to not find the
inserted page anymore, or the index would need to grow slowly enough to
not use a page before the wraparound.  And then such wrapped around xids
would exist on individual pages. Not on all deleted pages, like with
INDEX_CLEANUP false.

And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
INDEX_CLEANUP on might not even visit the index. As there very well
might not be many dead heap tuples around anymore (previous vacuums with
cleanup off will have removed them), the
vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
contrast to the normal situations where the btm_oldest_btpo_xact check
will prevent that from becoming a problem.


Peter, as far as I can tell, with INDEX_CLEANUP off, nbtree will never
be able to recycle half-dead pages? And thus would effectively never
recycle any dead space? Is that correct?

Greetings,

Andres Freund




Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-16 Thread Robert Haas
On Thu, Apr 16, 2020 at 2:17 PM Hamid Akhtar  wrote:
> So is the concern performance overhead rather than the need for such a 
> feature?

No, I don't think this would have any significant overhead. My concern
is that I think it's the wrong way to solve the problem. If you need
to check for prepared transactions that got missed, the right way to
do that is to use a monitoring tool that runs an appropriate query
against the server on a regular basis and alerts based on the output.
Such a tool can be used for many things, of which this is just one,
and the queries can be customized to the needs of a particular
environment, whereas this feature is much less flexible in that way
because it is hard-coded into the server.

To put that another way, any problem you can solve with this feature,
you can also solve without this feature. And you can solve it any
released branch, without waiting for a release that would
hypothetically contain this patch, and you can solve it in a more
flexible way than this patch allows, because you can tailor the query
any way you like. The right place for a feature like this is in something
like check_postgres.pl, not the server. It looks like they added it in
2009:

https://bucardo.org/pipermail/check_postgres/2009-April/000349.html

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-16 Thread Hamid Akhtar
On Thu, Apr 16, 2020 at 5:20 PM Robert Haas  wrote:

> On Thu, Apr 16, 2020 at 4:43 AM Hamid Akhtar 
> wrote:
> > My real question is whether vacuum should be preemptively complaining
> about prepared transactions or stale replication slots rather than waiting
> for transaction id to exceed the safe limit. I presume by the time safe
> limit is exceeded, vacuum's work would already have been significantly
> impacted.
>
> Yeah, for my part, I agree that letting things go until the point
> where VACUUM starts to complain is usually bad. Generally, you want to
> know a lot sooner. That being said, I think the solution to that is to
> run a monitoring tool, not to overload the autovacuum worker with
> additional duties.
>

So is the concern performance overhead rather than the need for such a
feature?

Any server running with prepared transactions enabled, more likely than
not, requires a monitoring tool for tracking orphaned prepared
transactions. For such environments, surely such a feature implemented in the
server will create lower overhead than their monitoring tool.


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Tom Lane
Fujii Masao  writes:
> On 2020/04/14 22:52, Tom Lane wrote:
>> *Yes it does*.  The existing code can deliver entirely broken results
>> if some walsender exits between where we examine the priorities and
>> where we fetch the WAL pointers.

> IMO the broken results can be delivered when a walsender marked
> as sync exits *and* a new walsender comes in at that moment. If this new
> walsender uses the WalSnd slot that the exited walsender used,
> SyncRepGetOldestSyncRecPtr() wrongly calculates the oldest lsn based
> on this new walsender (i.e., a different walsender from the one marked as sync).

Right, exactly, sorry that I was not more specific.

> BTW, since the patch changes the API of SyncRepGetSyncStandbys(),
> it should not be back-patched to avoid ABI break. Right?

Anything that is using that is just as broken as the core code is, for the
same reasons, so I don't have a problem with changing its API.  Maybe we
should rename it while we're at it, just to make it clear that we are
breaking any external callers.  (If there are any, which seems somewhat
unlikely.)

The only concession to ABI that I had in mind was to not re-order
the fields of WalSnd in the back branches.

regards, tom lane




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-04-16 Thread Andres Freund
Hi,

On 2020-04-16 16:30:02 +0900, Masahiko Sawada wrote:
> For btree indexes, IIRC skipping index cleanup could not be a cause of
> corruption, but be a cause of index bloat since it leaves recyclable
> pages which are not marked as recyclable. The index bloat is the main
> side effect of skipping index cleanup. When user executes VACUUM with
> INDEX_CLEANUP to reclaim index garbage, such pages will also be
> recycled sooner or later? Or skipping index cleanup can be a cause of
> recyclable page never being recycled?

Well, it depends on what you define as "never". Once the xids on the
pages have wrapped around, the page level xids will appear to be from
the future for a long time. And the metapage xid appearing to be from
the future will prevent some vacuums from actually doing the scan too,
even if INDEX_CLEANUP is reenabled.  So a VACUUM, even with
INDEX_CLEANUP on, will not be able to recycle those pages anymore.  At
some point the wrapped around xids will be "current" again, if there's
enough new xids.


It's not ok for vacuumlazy.c to make decisions like this. I think the
INDEX_CLEANUP logic clearly needs to be pushed down into the
amvacuumcleanup callbacks, and it needs to be left to the index AMs to
decide what the correct behaviour is.


You can't just change things like this without reviewing the
consequences for the AMs and documenting them.

Greetings,

Andres Freund




Re: Should we add xid_current() or a int8->xid cast?

2020-04-16 Thread Thomas Munro
On Fri, Apr 17, 2020 at 3:44 AM Mark Dilger
 wrote:
> Hmm, I should have spoken sooner...
>
> src/backend/replication/walsender.c:static bool 
> TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
> src/backend/utils/adt/xid8funcs.c:TransactionIdInRecentPast(FullTransactionId 
> fxid, TransactionId *extracted_xid)
>
> I don't care much for having two different functions with the same name and 
> related semantics but different argument types.

Maybe that's not ideal, but it's not because of this patch.  Those
functions are from 5737c12df05 and 857ee8e391f.




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-16 Thread Andres Freund
Hi,

On 2020-04-16 13:34:39 -0400, Robert Haas wrote:
> On Thu, Apr 16, 2020 at 1:14 PM Andres Freund  wrote:
> > I still think we need a way to test this without waiting for hours to
> > hit various edge cases. You argued against a fixed binning of
> > old_snapshot_threshold/100 arguing its too coarse. How about a 1000 or
> > so? For 60 days, the current max for old_snapshot_threshold, that'd be a
> > granularity of 01:26:24, which seems fine.  The best way I can think of
> > that'd keep current GUC values sensible is to change
> > old_snapshot_threshold to be float. Ugly, but ...?
> 
> Yeah, 1000 would be a lot better. However, if we switch to a fixed
> number of bins, it's going to be a lot more code churn.

Given the number of things that need to be addressed around the feature,
I am not too concerned about that.


> What did you think of my suggestion of making head_timestamp
> artificially move backward to simulate the passage of time?

I don't think it allows us to exercise the various cases well enough. We
need to be able to test this feature both interactively and in a
scripted manner. Edge cases like wrapping around in the time mapping imo
cannot easily be tested by moving the head timestamp back.

Greetings,

Andres Freund




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-16 Thread Robert Haas
On Thu, Apr 16, 2020 at 1:14 PM Andres Freund  wrote:
> I still think we need a way to test this without waiting for hours to
> hit various edge cases. You argued against a fixed binning of
> old_snapshot_threshold/100 arguing it's too coarse. How about a 1000 or
> so? For 60 days, the current max for old_snapshot_threshold, that'd be a
> granularity of 01:26:24, which seems fine.  The best way I can think of
> that'd keep current GUC values sensible is to change
> old_snapshot_threshold to be float. Ugly, but ...?

Yeah, 1000 would be a lot better. However, if we switch to a fixed
number of bins, it's going to be a lot more code churn. What did you
think of my suggestion of making head_timestamp artificially move
backward to simulate the passage of time?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Fujii Masao




On 2020/04/14 22:52, Tom Lane wrote:

Kyotaro Horiguchi  writes:

SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so
sync_standby_priority of any walsender can be changed while the
function is scanning welsenders. The issue is we already have
inconsistent walsender information before we enter the function.  Thus
how many times we scan on the array doesn't make any difference.


*Yes it does*.  The existing code can deliver entirely broken results
if some walsender exits between where we examine the priorities and
where we fetch the WAL pointers.


So, in this case, the oldest lsn that SyncRepGetOldestSyncRecPtr()
calculates may also be based on the lsn of an already-exited walsender.
Is this what you mean by "broken results"? If yes, ISTM that this issue still
remains even after applying your patch. No? The walsender marked
as sync may still exit just before SyncRepGetOldestSyncRecPtr()
calculates the oldest lsn.

IMO the broken results can be delivered when a walsender marked
as sync exits *and* a new walsender comes in at that moment. If this new
walsender uses the WalSnd slot that the exited walsender used,
SyncRepGetOldestSyncRecPtr() wrongly calculates the oldest lsn based
on this new walsender (i.e., a different walsender from the one marked as sync).
If this is actually what you meant by "broken results", your patch
seems fine and fixes the issue.

BTW, since the patch changes the API of SyncRepGetSyncStandbys(),
it should not be back-patched to avoid ABI break. Right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-16 Thread Andres Freund
Hi,

On 2020-04-16 12:41:55 -0400, Robert Haas wrote:
> I'm starting a new thread for this, because the recent discussion of
> problems with old_snapshot_threshold[1] touched on a lot of separate
> issues, and I think it will be too confusing if we discuss all of them
> on one thread. Attached are three patches.

Cool.


> 0003 attempts to fix bugs in MaintainOldSnapshotTimeMapping() so that
> it produces a sensible mapping. I encountered and tried to fix two
> issues here:
> 
> First, as previously discussed, the branch that advances the mapping
> should not categorically do "oldSnapshotControl->head_timestamp = ts;"
> assuming that the head_timestamp is supposed to be the timestamp for
> the oldest bucket rather than the newest one. Rather, there are three
> cases: (1) resetting the mapping resets head_timestamp, (2) extending
> the mapping by an entry without dropping an entry leaves
> head_timestamp alone, and (3) overwriting the previous head with a new
> entry advances head_timestamp by 1 minute.

> Second, the calculation of the number of entries by which the mapping
> should advance is incorrect. It thinks that it should advance by the
> number of minutes between the current head_timestamp and the incoming
> timestamp. That would be correct if head_timestamp were the most
> recent entry in the mapping, but it's actually the oldest. As a
> result, without this fix, every time we move into a new minute, we
> advance the mapping much further than we actually should. Instead of
> advancing by 1, we advance by the number of entries that already exist
> in the mapping - which means we now have entries that correspond to
> times which are in the future, and don't advance the mapping again
> until those future timestamps are in the past.
> 
> With these fixes, I seem to get reasonably sensible mappings, at least
> in light testing.  I tried running this in one window with \watch 10:
> 
> select *, age(newest_xmin), clock_timestamp()  from
> pg_old_snapshot_time_mapping();
> 
> And in another window I ran:
> 
> pgbench -T 300 -R 10
> 
> And the age does in fact advance by ~600 transactions per minute.

I still think we need a way to test this without waiting for hours to
hit various edge cases. You argued against a fixed binning of
old_snapshot_threshold/100 arguing it's too coarse. How about a 1000 or
so? For 60 days, the current max for old_snapshot_threshold, that'd be a
granularity of 01:26:24, which seems fine.  The best way I can think of
that'd keep current GUC values sensible is to change
old_snapshot_threshold to be float. Ugly, but ...?


Greetings,

Andres Freund




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-16 Thread Tom Lane
James Coleman  writes:
> On Fri, Apr 10, 2020 at 10:12 AM James Coleman  wrote:
>> One thing I just noticed and had a question about: in
>> preparePresortedCols (which sets up a function call context), do we
>> need to call pg_proc_aclcheck?

> Background: this came up because I noticed that pg_proc_aclcheck is
> called in the scalar array op case in execExpr.c.

> However grepping through the source code I see several places where a
> function (including an equality op for an ordering op, like the case
> we have here) gets looked up without calling pg_proc_aclcheck, but
> then other places where the acl check is invoked.

Rule of thumb is that we don't apply ACL checks to functions/ops
we get out of an opclass; adding a function to an opclass is tantamount
to giving public execute permission on it.  If the function/operator
reference came directly from the SQL query it must be checked.
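
For a reference that does come straight from the query, the usual idiom is
roughly this (illustrative only; funcid is just whatever OID the parser
resolved):

    AclResult   aclresult;

    aclresult = pg_proc_aclcheck(funcid, GetUserId(), ACL_EXECUTE);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, OBJECT_FUNCTION, get_func_name(funcid));
    InvokeFunctionExecuteHook(funcid);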

> In addition, I haven't been able to discern a reason for why sometimes
> InvokeFunctionExecuteHook gets called with the function after lookup,
> but not others.

I would not stand here and say that that hook infrastructure is worth
anything at all.  Maybe the coverage is sufficient for some use-cases,
but who's to say?

regards, tom lane




Re: Poll: are people okay with function/operator table redesign?

2020-04-16 Thread Tom Lane
Pierre Giraud  writes:
> On 16/04/2020 at 16:43, Tom Lane wrote:
>> Pierre Giraud  writes:
>>> The screenshot attached uses a  tag for the description/example block.

>> That looks about right, perhaps, but could you be a little clearer about
>> how you accomplished that?

> Attached you will find the HTML structure with associated styles.
> Sorry I haven't tried to do this from the DocBook sources.
> I hope this helps though.

After a bit of poking at it, I couldn't find another way to do that
than using a  structure.  Which is an annoying amount
of markup to be adding to each table cell, but I guess we could live
with it.  A bigger problem is that docbook applies styles to the
 structure that, at least by default, add a LOT of vertical space.
Doesn't seem real workable unless we can undo that.

regards, tom lane




Re: Include sequence relation support in logical replication

2020-04-16 Thread Cary Huang
Hi Craig, Andres



Thank you guys so much for your reviews and comments. Really helpful. Yes, you
guys are right: sequences do not guarantee freedom from gaps, and replicating
sequences is useful for failover cases, so there will be no problem with a
subscriber getting a future value 32 increments ahead. I will do more analysis
on my end based on your comments and refine the patch with better test cases.
Much appreciated.



Best regards



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca




 On Wed, 15 Apr 2020 22:18:28 -0700 Craig Ringer  
wrote 



On Thu, 16 Apr 2020 at 07:44, Andres Freund  wrote:


> I would like to ask if you have some suggestions or ideas that can make
> the subscriber receive the current value without the need to
>
> disable the 32-increment behavior?
 
 It simply shouldn't expect that to be the case.  What do you need it
 for?
 
 As far as I can tell replicating sequence values is useful to allow
 failover, by ensuring all sequences will return sensible values going
 forward. That does not require knowing precise values.


Totally agree. Code that relies on getting specific sequence values is broken 
code. Alas, very common, but still broken.



Cary, by way of background a large part of why this wasn't supported by logical 
decoding back in 9.4 is that until the pg_catalog.pg_sequence relation was 
introduced in PostgreSQL 10, the sequence relfilenode intermixed a bunch of 
transactional and non-transactional state in a very messy way. This made it 
very hard to achieve sensible behaviour for logical decoding.



As it is, make sure your regression tests carefully cover the following cases, 
as TAP tests in src/test/recovery, probably a new module for logical decoding 
of sequences:



1.



* Begin txn

* Create sequence

* Call nextval() on sequence over generate_series() and discard results

* Rollback

* Issue a dummy insert+commit to some other table to force logical decoding to 
send something

* Ensure subscription catches up successfully



This checks that we cope with advances for a sequence that doesn't get created.



2.

 

* Begin 1st txn

* Create a sequence

* Use the sequence to populate a temp table with enough rows to ensure sequence 
updates are written

* Begin a 2nd txn

* Issue a dummy insert+commit to some other table to force logical decoding to 
send something



* Commit the 2nd txn

* Commit the 1st txn

* Wait for subscription catchup

* Check that the sequence value on the subscriber reflects the value after 
sequence advance, not the value at creation time



This makes sure that sequence advances are handled sensibly when they arrive 
for a sequence that does not yet exist in the catalogs.



You'll need to run psql in an IPC::Run background session for that. We should 
really have a helper for this. I'll see if I'm allowed to post the one I use 
for my own TAP tests to the list.

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-16 Thread James Coleman
On Fri, Apr 10, 2020 at 10:12 AM James Coleman  wrote:
>
> One thing I just noticed and had a question about: in
> preparePresortedCols (which sets up a function call context), do we
> need to call pg_proc_aclcheck?

Background: this came up because I noticed that pg_proc_aclcheck is
called in the scalar array op case in execExpr.c.

However grepping through the source code I see several places where a
function (including an equality op for an ordering op, like the case
we have here) gets looked up without calling pg_proc_aclcheck, but
then other places where the acl check is invoked.

In addition, I haven't been able to discern a reason for why sometimes
InvokeFunctionExecuteHook gets called with the function after lookup,
but not others.

So I'm not sure if either of these needed to be added to the equality
op/function lookup code in nodeIncrementalSort's preparePresortedCols
or not.

James




fixing old_snapshot_threshold's time->xid mapping

2020-04-16 Thread Robert Haas
Hi,

I'm starting a new thread for this, because the recent discussion of
problems with old_snapshot_threshold[1] touched on a lot of separate
issues, and I think it will be too confusing if we discuss all of them
on one thread. Attached are three patches.

0001 makes oldSnapshotControl "extern" rather than "static" and
exposes the struct definition via a header.

0002 adds a contrib module called old_snapshot which makes it possible
to examine the time->XID mapping via SQL. As Andres said, the comments
are not really adequate in the existing code, and the code itself is
buggy, so it was a little hard to be sure that I was understanding the
intended meaning of the different fields correctly. However, I gave it
a shot.

0003 attempts to fix bugs in MaintainOldSnapshotTimeMapping() so that
it produces a sensible mapping. I encountered and tried to fix two
issues here:

First, as previously discussed, the branch that advances the mapping
should not categorically do "oldSnapshotControl->head_timestamp = ts;"
assuming that the head_timestamp is supposed to be the timestamp for
the oldest bucket rather than the newest one. Rather, there are three
cases: (1) resetting the mapping resets head_timestamp, (2) extending
the mapping by an entry without dropping an entry leaves
head_timestamp alone, and (3) overwriting the previous head with a new
entry advances head_timestamp by 1 minute.

Second, the calculation of the number of entries by which the mapping
should advance is incorrect. It thinks that it should advance by the
number of minutes between the current head_timestamp and the incoming
timestamp. That would be correct if head_timestamp were the most
recent entry in the mapping, but it's actually the oldest. As a
result, without this fix, every time we move into a new minute, we
advance the mapping much further than we actually should. Instead of
advancing by 1, we advance by the number of entries that already exist
in the mapping - which means we now have entries that correspond to
times which are in the future, and don't advance the mapping again
until those future timestamps are in the past.
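
To illustrate the arithmetic (using the field names from snapmgr.c's
OldSnapshotControlData; this is not the patch itself): since head_timestamp
is the oldest bucket, the newest bucket's timestamp is

    TimestampTz newest_ts = oldSnapshotControl->head_timestamp +
        (oldSnapshotControl->count_used - 1) * USECS_PER_MINUTE;

and moving into a new minute should advance the map by roughly
(ts - newest_ts) / USECS_PER_MINUTE entries, not by the distance from
head_timestamp.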

With these fixes, I seem to get reasonably sensible mappings, at least
in light testing.  I tried running this in one window with \watch 10:

select *, age(newest_xmin), clock_timestamp()  from
pg_old_snapshot_time_mapping();

And in another window I ran:

pgbench -T 300 -R 10

And the age does in fact advance by ~600 transactions per minute.

I'm not proposing to commit anything here right now. These patches
haven't had enough testing for that, and their interaction with other
bugs in the feature needs to be considered before we do anything.
However, I thought it might be useful to put them out for review and
comment, and I also thought that having the contrib module from 0002
might permit other people to do some better testing of this feature
and these fixes.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://postgr.es/m/20200401064008.qob7bfnnbu4w5...@alap3.anarazel.de


v1-0001-Expose-oldSnapshotControl.patch
Description: Binary data


v1-0003-Fix-bugs-in-MaintainOldSnapshotTimeMapping.patch
Description: Binary data


v1-0002-contrib-old_snapshot-time-xid-mapping.patch
Description: Binary data


Re: sqlsmith crash incremental sort

2020-04-16 Thread James Coleman
On Thu, Apr 16, 2020 at 8:22 AM Richard Guo  wrote:
>
>
> On Thu, Apr 16, 2020 at 6:35 PM Richard Guo  wrote:
>>
>>
>> On Wed, Apr 15, 2020 at 10:47 PM Tomas Vondra  
>> wrote:
>>>
>>>
>>> Well, yeah. The problem is the Unique simply compares the columns in the
>>> order it sees them, and it does not match the column order desired by
>>> incremental sort. But we don't push down this information at all :-(
>>
>>
>> This is a nice optimization to have. Since the 'Sort and Unique'
>> would unique-ify the result of a UNION by sorting on all columns, why
>> not adjust the sort order, trying to match parse->sortClause, so that
>> we can avoid the final sort node?
>>
>> Doing that we can transform plan from:
>>
>> # explain (costs off) select * from foo union select * from foo order by 1,3;
>>   QUERY PLAN
>> ---
>>  Incremental Sort
>>Sort Key: foo.a, foo.c
>>Presorted Key: foo.a
>>->  Unique
>>  ->  Sort
>>Sort Key: foo.a, foo.b, foo.c
>>->  Append
>>  ->  Seq Scan on foo
>>  ->  Seq Scan on foo foo_1
>> (9 rows)
>>
>> To:
>>
>> # explain (costs off) select * from foo union select * from foo order by 1,3;
>>QUERY PLAN
>> -
>>  Unique
>>->  Sort
>>  Sort Key: foo.a, foo.c, foo.b
>>  ->  Append
>>->  Seq Scan on foo
>>->  Seq Scan on foo foo_1
>> (6 rows)
>>
>
> Attached is what I'm thinking about this optimization. Does it make any
> sense?

Shouldn't this go on either a new thread or on the thread for the
patch Tomas was referencing (by Teodor I believe)?

Or are you saying you believe this patch guarantees we never see this
problem in incremental sort costing?

James




Re: Should we add xid_current() or a int8->xid cast?

2020-04-16 Thread Mark Dilger



> On Apr 6, 2020, at 5:14 PM, Thomas Munro  wrote:
> 
> On Sun, Apr 5, 2020 at 11:31 AM Thomas Munro  wrote:
>> On Sun, Apr 5, 2020 at 10:34 AM Mark Dilger
>>  wrote:
>>> The "xid8_" warts are partly motivated by having a type named "xid8", which 
>>> is a bit of a wart in itself.
>> 
>> Just a thought for the future, not sure if it's a good one: would it
>> seem less warty in years to come if we introduced xid4 as an alias for
>> xid, and preferred the name xid4?  Then it wouldn't look so much like
>> xid is the "real" transaction ID type and xid8 is some kind of freaky
>> extended version; instead it would look like xid4 and xid8 are narrow
>> and wide transaction IDs, and xid is just a historical name for xid4.
> 
> I'll look into proposing that for PG14.  One reason I like that idea
> is that system view names like backend_xid could potentially retain
> their names while switching to xid8 type, (maybe?) breaking fewer
> queries and avoiding ugly names, on the theory that _xid doesn't
> specify whether it's xid4 or an xid8.
> 
 pg_current_xact_id()
 pg_current_xact_id_if_assigned()
 pg_xact_status(xid8)
> 
 pg_current_snapshot()
 pg_snapshot_xmin(pg_snapshot)
 pg_snapshot_xmax(pg_snapshot)
 pg_snapshot_xip(pg_snapshot)
 pg_visible_in_snapshot(xid8, pg_snapshot)
> 
>>> As such, I'm ±0 for the change.
>> 
>> I'll let this sit for another day and see if some more reactions appear.
> 
> Hearing no objections, pushed.  Happy to reconsider these names before
> release if someone finds a problem with this scheme.

Hmm, I should have spoken sooner...

src/backend/replication/walsender.c:static bool 
TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
src/backend/utils/adt/xid8funcs.c:TransactionIdInRecentPast(FullTransactionId 
fxid, TransactionId *extracted_xid)

I don't care much for having two different functions with the same name and 
related semantics but different argument types.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Tom Lane
Kyotaro Horiguchi  writes:
> [ syncrep-fixes-4.patch ]

I agree that we could probably improve the clarity of this code with
further rewriting, but I'm still very opposed to the idea of having
callers know that the first num_sync array elements are the active
ones.  It's wrong (or at least different from current behavior) for
quorum mode, where there might be more than num_sync walsenders to
consider.  And it might not generalize very well to other syncrep
selection rules we might add in future, which might also not have
exactly num_sync interesting walsenders.  So I much prefer an API
definition that uses bool flags in an array that has no particular
ordering (so far as the callers know, anyway).  If you don't like
is_sync_standby, how about some more-neutral name like is_active
or is_interesting or include_position?

I dislike the proposed comment revisions in SyncRepReleaseWaiters,
too, particularly the change to say that what we're "announcing"
is the ability to release waiters.  You did not change the actual
log messages, and you would have gotten a lot of pushback if
you tried, because the current messages make sense to users and
something like that would not.  But by the same token this new
comment isn't too helpful to somebody reading the code.

(Actually, I wonder why we even have the restriction that only
sync standbys can release waiters.  It's not like they are
going to get different results from SyncRepGetSyncRecPtr than
any other walsender would.  Maybe we should just drop all the
am_sync logic?)

regards, tom lane




Re: cleaning perl code

2020-04-16 Thread Alvaro Herrera
On 2020-Apr-16, Hamlin, Garick L wrote:

> With the old expression, 'something;' would be stripped away.
> Is that an issue where this is used?  Why are we parsing
> these headers?

These are files from which bootstrap catalog data is generated, which is
why we parse them from Perl; but they are also where C structs are declared,
which is why they're C.

I think switching to non-greedy is a win in itself.  Non-capturing
parens is probably a wash (this doesn't run often so the performance
argument isn't very interesting).

An example.  This eval in Catalog.pm

+   ## no critic (ProhibitStringyEval)
+   ## no critic (RequireCheckingReturnValueOfEval)
+   eval '$hash_ref = ' . $_;

is really weird stuff generally speaking, and the fact that we have to
mark it specially for critic is a good indicator of that -- it serves as
documentation.  Catalog.pm is all a huge weird hack, but it's a critically
important hack.  Heck, what about RequireCheckingReturnValueOfEval --
should we instead consider actually checking the return value of eval?
It would seem to make sense, would it not?  (Not for this patch, though
-- I would be fine with just adding the nocritic line now, and removing
it later while fixing that).
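
For what it's worth, a minimal sketch of the kind of check that policy is
after (illustrative only -- the error text is made up, this is not part of
any proposed patch, and $_ is assumed to hold the data line as in the
surrounding loop):

	my $hash_ref;
	my $rv = eval '$hash_ref = ' . $_;
	if (!defined $rv)
	{
		die "could not parse catalog data entry: $@";
	}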

All in all, I think there's positive value in having this code checked
a bit more strictly -- checks that are pointless in, say, t/00*.pl
prove files.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Poll: are people okay with function/operator table redesign?

2020-04-16 Thread Pierre Giraud


On 16/04/2020 at 16:43, Tom Lane wrote:
> Pierre Giraud  writes:
>> On 16/04/2020 at 00:18, Tom Lane wrote:
>>> The main disadvantage I can see to the v2 design is that we're back
>>> to having two  per function, which is inevitably going to result
>>> in PDF builds putting page breaks between those rows.  But you can't
>>> have everything ... and maybe we could find a way to discourage such
>>> breaks if we tried.
> 
> Further experimentation shows that the PDF toolchain is perfectly willing
> to put a page break *within* a multi-line ; if there is any
> preference to break between rows instead, it's pretty weak.  So that
> argument is a red herring and we shouldn't waste time chasing it.
> However, there'd still be some advantage in not being dependent on CSS
> hackery to make it look nice in HTML.
> 
> What we're down to wanting, at this point, is basically a para with
> hanging indent.
> 
>> What about putting everything into one  and use a block with
>> some left padding/margin for description + example.
>> This would solve the PDF page break issue as well as the column
>> separation border one.
>> The screenshot attached uses a  tag for the description/example block.
> 
> That looks about right, perhaps, but could you be a little clearer about
> how you accomplished that?

Attached you will find the HTML structure with associated styles.
Sorry I haven't tried to do this from the DocBook sources.
I hope this helps though.

Regards
# HTML

  

  

  Function
  
Description 
Example
  

  


  

  age(timestamp, timestamp) → interval
  
Subtract arguments, producing a “symbolic” result that uses years and months, rather 
than just days



  age(timestamp '2001-04-10', timestamp 
'1957-06-13') → 43 years 9 mons 27 days


  

  

  

# CSS

#docContent table.table code.literal {
font-style: italic;
}

code.returnvalue {
font-weight: bold;
}

code.function {
font-weight: bold;
}

.replaceable > code {
font-weight: normal !important;
}

#docContent table.table dl {
margin-left: 30px;
margin-top: 5px;
}

#docContent table.table dl > dd {
margin-bottom: 0;
}


"cache reference leak" issue happened when using sepgsql module

2020-04-16 Thread Michael Luo
Hi !

When using the sepgsql module, I got the warning "WARNING:  cache reference leak".
```
postgres=# UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 
returning (range_parted), *;
WARNING:  cache reference leak: cache pg_attribute (7), tuple 38/54 has count 1
WARNING:  cache reference leak: cache pg_attribute (7), tuple 39/56 has count 1
WARNING:  cache reference leak: cache pg_attribute (7), tuple 53/51 has count 1
WARNING:  cache reference leak: cache pg_attribute (7), tuple 53/50 has count 1
 range_parted  | a | b  | c  | d  | e
---+---++++---
 (b,15,95,16,) | b | 15 | 95 | 16 |
 (b,17,95,19,) | b | 17 | 95 | 19 |
(2 rows)

UPDATE 2
postgres=#
```
I am using the code from the Postgres REL_12_STABLE branch.
This issue can be reproduced with the SQL below; I tested it on CentOS 7 with
SELinux in "permissive" mode.

```
CREATE TABLE range_parted (a text, b bigint, c numeric, d int, e varchar) 
PARTITION BY RANGE (b);
CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) 
PARTITION BY RANGE (c);
ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM (10) 
TO (20);
CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int);
ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD 
COLUMN a text;
ALTER TABLE part_c_100_200 DROP COLUMN b;
ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO 
(100);
ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM 
(100) TO (200);

\set init_range_parted 'truncate range_parted; insert into range_parted 
VALUES(''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 
105, 19)'
:init_range_parted;
UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning 
(range_parted), *;
```

The attached patch fixes this issue; please check it.

```
--- a/contrib/sepgsql/dml.c
+++ b/contrib/sepgsql/dml.c
@@ -69,7 +69,10 @@ fixup_whole_row_references(Oid relOid, Bitmapset *columns)
  continue;

  if (((Form_pg_attribute) GETSTRUCT(tuple))->attisdropped)
+ {
+ ReleaseSysCache(tuple);
  continue;
+ }

  index = attno - FirstLowInvalidHeapAttributeNumber;






骆政丞 / Michael Luo
成都文武信息技术有限公司 / ChengDu WenWu Information Technology Co.,Ltd.
Address: Building 5, Room 1705, Zone D, Tianfu Software Park, Chengdu Hi-Tech Zone    Website: http://w3.ww-it.cn





sepgsql_bug_fix.patch
Description: sepgsql_bug_fix.patch


Re: remove_useless_groupby_columns does not need to record constraint dependencies

2020-04-16 Thread Tom Lane
David Rowley  writes:
> I've reworded the comment in the attached version.

LGTM.

regards, tom lane




Re: Poll: are people okay with function/operator table redesign?

2020-04-16 Thread Tom Lane
Pierre Giraud  writes:
> On 16/04/2020 at 00:18, Tom Lane wrote:
>> The main disadvantage I can see to the v2 design is that we're back
>> to having two  per function, which is inevitably going to result
>> in PDF builds putting page breaks between those rows.  But you can't
>> have everything ... and maybe we could find a way to discourage such
>> breaks if we tried.

Further experimentation shows that the PDF toolchain is perfectly willing
to put a page break *within* a multi-line ; if there is any
preference to break between rows instead, it's pretty weak.  So that
argument is a red herring and we shouldn't waste time chasing it.
However, there'd still be some advantage in not being dependent on CSS
hackery to make it look nice in HTML.

What we're down to wanting, at this point, is basically a para with
hanging indent.

> What about putting everything into one  and use a block with
> some left padding/margin for description + example.
> This would solve the PDF page break issue as well as the column
> separation border one.
> The screenshot attached uses a  tag for the description/example block.

That looks about right, perhaps, but could you be a little clearer about
how you accomplished that?

regards, tom lane




Re: cleaning perl code

2020-04-16 Thread Andrew Dunstan


On 4/16/20 10:20 AM, Hamlin, Garick L wrote:
> On Thu, Apr 16, 2020 at 08:50:35AM -0400, Andrew Dunstan wrote:
>> It would also be more robust using non-greedy matching:
> This seems more important.
> I don't know how/where this is being used, but if it has input like:
>
> /* one */ 
> something;
> /* two */
>
> With the old expression, 'something;' would be stripped away.
> Is that an issue where this is used?  Why are we parsing
> these headers?
>



It's not quite as bad as that, because we're doing it line by line
rather than on a whole file that's been slurped in. Multiline comments
are handled using some redo logic. But


    /* one */ something(); /* two */


would all be removed. Of course, we hope we don't have anything so
horrible, but still ...
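
Purely for illustration (this snippet is not from the tree), a tiny demo of
the difference on that exact line -- greedy matching eats everything from
the first /* to the last */, while non-greedy keeps the code in between
(Catalog.pm uses (.|\n)* rather than a plain .*, but on a single line the
effect is the same):

    $_ = '/* one */ something(); /* two */';
    (my $greedy    = $_) =~ s;/\*.*\*/;;g;     # leaves ""
    (my $nongreedy = $_) =~ s;/\*.*?\*/;;g;    # leaves " something(); "
    print "[$greedy] [$nongreedy]\n";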


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: cleaning perl code

2020-04-16 Thread Hamlin, Garick L
On Thu, Apr 16, 2020 at 08:50:35AM -0400, Andrew Dunstan wrote:
> 
> It would also be more robust using non-greedy matching:

This seems more important.
I don't know how/where this is being used, but if it has input like:

/* one */ 
something;
/* two */

With the old expression, 'something;' would be stripped away.
Is that an issue where this is used?  Why are we parsing
these headers?

Garick



Re: Autovacuum on partitioned table (autoanalyze)

2020-04-16 Thread Justin Pryzby
On Thu, Apr 16, 2020 at 06:16:45PM +0900, yuzuko wrote:
> > I think it ought to be possible to configure this feature such that an
> > auto-analyze on any child partition would trigger analyze of the parent.  I
> > think that would be important for maintaining accurate stats of the 
> > partition
> > key column for many cases involving RANGE-partitioned tables, which are 
> > likely
> > to rely on histogram rather than MCVs.
>
> I read your previous email and understand that it would be necessary to
> analyze partitioned tables automatically when any of their children are
> analyzed.  In my first patch, auto-analyze on partitioned tables worked
> like this but there were some comments about the performance of autovacuum,
> especially when partitioned tables have a lot of children.

I reread that part.  There was also confusion between autovacuum vacuum and
autovacuum analyze.

I agree that it *might* be a problem to analyze the parent every time any child
is analyzed.

But it might also be what's needed for this feature to be useful.

> The latest patch lets users set a different autovacuum configuration for
> each partitioned table, like this:
>   create table p3(i int) partition by range(i) with
>(autovacuum_analyze_scale_factor=0.0005, autovacuum_analyze_threshold=100);
> so users can configure those parameters according to partitioning strategies
> and other requirements.
> 
> So I think this patch can solve the problem you mentioned.

I don't think that adequately allows what's needed.

I think it ought to be possible to get the "analyze parent whenever a child is
analyzed" behavior easily, without having to compute new threshold parameters
every time one adds partitions, detaches partitions, loads 10x more data into
one of the partitions, loads only 10% as much data into the latest partition,
etc.

For example, say a new customer has a bunch of partitioned tables which each
currently have only one partition (for the current month), and that's expected
to grow to at least 20+ partitions (2+ years of history).  How does one set the
partitioned table's auto-analyze parameters to analyze whenever any child is
analyzed ?  I don't think it should be needed to update it every month after
computing sum(child tuples).

Possibly you could allow that behavior for some special values of the
threshold.  Like if autovacuum_analyze_threshold=-2, then analyze the parent
whenever any of its children are analyzed.

I think that use case and that need would be common, but I'd like to hear what
others think.

-- 
Justin




Re: cleaning perl code

2020-04-16 Thread Tom Lane
Andrew Dunstan  writes:
> On 4/15/20 11:01 PM, Noah Misch wrote:
>> It would be an unpleasant surprise to cause a perlcritic buildfarm failure by
>> moving a function, verbatim, from a non-strategic file to a strategic file.
>> Having two Perl style regimes in one tree is itself a liability.

> Honestly, I think you're reaching here.

I think that argument is wrong, actually.  Moving a function from a single
use-case into a library (with, clearly, the intention for it to have more
use-cases) is precisely the time when any weaknesses in its original
implementation might be exposed.  So extra scrutiny seems well warranted.

Whether the "extra scrutiny" involved in perlcritic's higher levels
is actually worth anything is a different debate, though, and so far
it's not looking like it's worth much :-(

regards, tom lane




Re: sqlsmith crash incremental sort

2020-04-16 Thread Tomas Vondra

On Thu, Apr 16, 2020 at 04:44:10PM +0800, Richard Guo wrote:

On Mon, Apr 13, 2020 at 8:09 AM Tomas Vondra 
wrote:



I've been messing with this the whole day, without much progress :-(

I'm 99.% sure it's the same issue described by the quoted comment,
because the plan looks like this:

  Nested Loop Left Join
->  Sample Scan on pg_namespace
  Sampling: system ('7.2'::real)
->  Incremental Sort
  Sort Key: ...
  Presorted Key: ...
  ->  Unique
->  Sort
  Sort Key: ...
  ->  Append
->  Nested Loop
...
->  Nested Loop
...

so yeah, the plan does have set operations, and generate_append_tlist
does generate Vars with varno == 0, causing this issue.



After some digging I believe here is what happened.

1. For the UNION query, we build an upper rel of UPPERREL_SETOP and
generate Append path for it. Since Append doesn't actually evaluate its
targetlist, we generate 'varno 0' Vars for its targetlist. (setrefs.c
would just replace them with OUTER_VAR when adjusting the final plan so
this usually does not cause problems.)

2. To remove duplicates for UNION, we use hash/sort to unique-ify the
result. If sort is chosen, we add Sort path and then Unique path above
Append path, with pathkeys made from Append's targetlist.

3. Also the Append's targetlist would be built into
root->processed_tlist and with that we calculate root->sort_pathkeys.

4. When handling ORDER BY clause, we figure out the pathkeys of
Unique->Sort->Append path share some same prefix with
root->sort_pathkeys and thus incremental sort would be considered.

5. When calculating cost for incremental sort, estimate_num_groups does
not cope with 'varno 0' Vars extracted from root->sort_pathkeys.



Right.



With this scenario, here is a simple recipe:

create table foo(a int, b int, c int);
set enable_hashagg to off;
explain select * from foo union select * from foo order by 1,3;



Yep, that's a much simpler query / plan. Thanks.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: cleaning perl code

2020-04-16 Thread Andrew Dunstan


On 4/15/20 11:01 PM, Noah Misch wrote:
> On Wed, Apr 15, 2020 at 03:43:36PM -0400, Andrew Dunstan wrote:
>> On 4/14/20 4:44 PM, Alvaro Herrera wrote:
>>> On 2020-Apr-14, Andrew Dunstan wrote:
 One of the things that's a bit sad is that perlcritic doesn't generally
 let you apply policies to a given set of files or files matching some
 pattern. It would be nice, for instance, to be able to apply some
 additional standards to strategic library files like PostgresNode.pm,
 TestLib.pm and Catalog.pm. There are good reasons as suggested upthread
 to apply higher standards to library files than to, say, a TAP test
 script. The only easy way I can see to do that would be to have two
 different perlcriticrc files and adjust pgperlcritic to make two runs.
 If people think that's worth it I'll put a little work into it. If not,
 I'll just leave things here.
>>> I think being more strict about it in strategic files (I'd say that's
>>> Catalog.pm plus src/test/perl/*.pm) might be a good idea.  Maybe give it
>>> a try and see what comes up.
>> OK, in fact those files are in reasonably good shape. I also took a pass
>> through the library files in src/tools/msvc, which had a few more issues.
> It would be an unpleasant surprise to cause a perlcritic buildfarm failure by
> moving a function, verbatim, from a non-strategic file to a strategic file.
> Having two Perl style regimes in one tree is itself a liability.


Honestly, I think you're reaching here.


>
>> --- a/src/backend/catalog/Catalog.pm
>> +++ b/src/backend/catalog/Catalog.pm
>> @@ -67,7 +67,7 @@ sub ParseHeader
>>  if (!$is_client_code)
>>  {
>>  # Strip C-style comments.
>> -s;/\*(.|\n)*\*/;;g;
>> +s;/\*(?:.|\n)*\*/;;g;
> This policy against unreferenced groups makes the code harder to read, and the
> chance of preventing a bug is too low to justify that.



Non-capturing groups are also more efficient, and are something perl
programmers should be familiar with.


In fact, there's a much better reworking of the semantics of this
particular instance, which is to make . match \n using the s modifier:


    s;/\*.*\*/;;gs;


It would also be more robust using non-greedy matching:


    s;/\*.*?\*/;;gs


After I wrote the above I went and looked at what we do in the buildfarm
code to strip comments when looking for typedefs, and it's exactly that,
so at least I'm consistent :-)


I don't care that much if we throw this whole thing away. This was sent
in response to Alvaro's suggestion to "give it a try and see what comes up".


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-16 Thread Magnus Hagander
On Thu, Apr 16, 2020 at 7:05 AM Kyotaro Horiguchi 
wrote:

> At Wed, 15 Apr 2020 15:58:05 +0500, "Andrey M. Borodin" <
> x4...@yandex-team.ru> wrote in
> > > On 15 Apr 2020, at 15:25, Magnus Hagander wrote:
> > > I think that makes perfect sense. The documentation explicitly says
> "can read all pg_stat_* views", which is clearly wrong -- so either the
> code or the docs should be fixed, and it looks like it's the code that
> should be fixed to me.
> > Should it be bug or v14 feature?
> >
> > Also pgstatfuncs.c contains a lot more checks of
> has_privs_of_role(GetUserId(), beentry->st_userid).
> > Maybe grant them all?
> >
> > > As for the patch, one could argue that we should just store the
> resulting boolean instead of re-running the check (e.g. have a "bool
> has_stats_privilege" or such), but perhaps that's an unnecessary
> micro-optimization, like the attached.
> >
> > Looks good to me.
>
> pg_stat_get_activity checks (has_privs_of_role() ||
> is_member_of_role()) in-place for every entry.  It's not necessary but
> I suppose that doing the same thing for pg_stat_progress_info might be
> better.
>

From a result perspective, it shouldn't make a difference though, should
it? It's a micro-optimization, and it might not even have an actual
performance effect in reality, but the result should always be the same?

(FWIW, pg_stat_statements has a coding pattern similar to the one I
suggested in the patch)



>
> It's another issue, but pg_stat_get_backend_* functions don't consider
> pg_read_all_stats. I suppose that the functions should work under the
> same criteria to pg_stat views, and maybe explicitly documented?
>

That's a good question. They haven't been documented to do so, but it
certainly seems *weird* that the same information should be available
through a view like pg_stat_activity, but not through the functions.

I would guess this was simply forgotten in 25fff40798f -- I don't recall
any discussion about it. The commit message specifically says
pg_database_size() and pg_tablespace_size(), but mentions nothing about
pg_stat_*.


>
> If we do that, it may be better that we define "PGSTAT_VIEW_PRIV()" or
> something like and replace the all occurances of the idiomatic
> condition with it.
>

You mean something like the attached?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 175f4fd26b..446044609e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,6 +33,8 @@
 
 #define UINT32_ACCESS_ONCE(var)		 ((uint32)(*((volatile uint32 *)&(var))))
 
+#define HAS_PGSTAT_PERMISSIONS(role)	 (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
+
 /* Global bgwriter statistics, from bgwriter.c */
 extern PgStat_MsgBgWriter bgwriterStats;
 
@@ -537,7 +539,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 		values[1] = ObjectIdGetDatum(beentry->st_databaseid);
 
 		/* show rest of the values including relid only to role members */
-		if (has_privs_of_role(GetUserId(), beentry->st_userid))
+		if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
 		{
 			values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
 			for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
@@ -669,8 +671,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			nulls[16] = true;
 
 		/* Values only available to role member or pg_read_all_stats */
-		if (has_privs_of_role(GetUserId(), beentry->st_userid) ||
-			is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))
+		if (HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
 		{
 			SockAddr	zero_clientaddr;
 			char	   *clipped_activity;
@@ -1007,7 +1008,7 @@ pg_stat_get_backend_activity(PG_FUNCTION_ARGS)
 
 	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
 		activity = "";
-	else if (!has_privs_of_role(GetUserId(), beentry->st_userid))
+	else if (!HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
 		activity = "";
 	else if (*(beentry->st_activity_raw) == '\0')
 		activity = "";
@@ -1031,7 +1032,7 @@ pg_stat_get_backend_wait_event_type(PG_FUNCTION_ARGS)
 
 	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
 		wait_event_type = "";
-	else if (!has_privs_of_role(GetUserId(), beentry->st_userid))
+	else if (!HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
 		wait_event_type = "";
 	else if ((proc = BackendPidGetProc(beentry->st_procpid)) != NULL)
 		wait_event_type = pgstat_get_wait_event_type(proc->wait_event_info);
@@ -1052,7 +1053,7 @@ pg_stat_get_backend_wait_event(PG_FUNCTION_ARGS)
 
 	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
 		wait_event = "";
-	else if (!has_privs_of_role(GetUserId(), beentry->st_userid))
+	else if (!HAS_PGSTAT_PERMISSIONS(beentry->st_userid))
 		wait_event = "";
 	else if ((proc = BackendPidGetProc(beentry->st_procpid)) != NULL)
 		wait_event = 

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-04-16 Thread Ashutosh Bapat
On Thu, Apr 16, 2020 at 7:47 AM Andy Fan  wrote:

> (9 rows)
>
> With this feature:
> explain analyze select a, sum(c) from grp2 group by a;
> QUERY PLAN
> --
>  GroupAggregate  (cost=0.00..553031.57 rows=1023 width=12) (actual 
> time=0.044..13209.485 rows=1000 loops=1)
>Group Key: a
>->  Seq Scan on grp2  (cost=0.00..403031.23 rows=1023 width=8) (actual 
> time=0.023..4938.171 rows=1000 loops=1)
>  Planning Time: 0.400 ms
>  Execution Time: 13749.121 ms
> (5 rows)
>

Applying the patch gives a white space warning
git am /tmp/v6-000*
Applying: Introduce UniqueKeys to determine RelOptInfo unique properties
.git/rebase-apply/patch:545: indent with spaces.
/* Fast path */
warning: 1 line adds whitespace errors.
Applying: Skip DISTINCT / GROUP BY if input is already unique

Compiling the patch causes one warning
nodeAgg.c:2134:3: warning: enumeration value ‘AGG_UNIQUE’ not handled
in switch [-Wswitch]

I have not looked at the patch. The numbers above look good. The time
spent in summing up a column in each row (we are summing only one
number per group) is twice the time it took to read those rows from
the table. That looks odd. But it may not be something unrelated to
your patch. I also observed that for explain analyze select a from
grp2 group by a; we just produce a plan containing seq scan node,
which is a good thing.


--
Best Wishes,
Ashutosh Bapat




Re: sqlsmith crash incremental sort

2020-04-16 Thread Richard Guo
On Thu, Apr 16, 2020 at 6:35 PM Richard Guo  wrote:

>
> On Wed, Apr 15, 2020 at 10:47 PM Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>>
>> Well, yeah. The problem is the Unique simply compares the columns in the
>> order it sees them, and it does not match the column order desired by
>> incremental sort. But we don't push down this information at all :-(
>>
>
> This seems like a nice optimization to have. Since the 'Sort and Unique'
> would unique-ify the result of a UNION by sorting on all columns, why
> don't we adjust the sort order to try to match parse->sortClause so that
> we can avoid the final sort node?
>
> Doing that we can transform plan from:
>
> # explain (costs off) select * from foo union select * from foo order by
> 1,3;
>   QUERY PLAN
> ---
>  Incremental Sort
>Sort Key: foo.a, foo.c
>Presorted Key: foo.a
>->  Unique
>  ->  Sort
>Sort Key: foo.a, foo.b, foo.c
>->  Append
>  ->  Seq Scan on foo
>  ->  Seq Scan on foo foo_1
> (9 rows)
>
> To:
>
> # explain (costs off) select * from foo union select * from foo order by
> 1,3;
>QUERY PLAN
> -
>  Unique
>->  Sort
>  Sort Key: foo.a, foo.c, foo.b
>  ->  Append
>->  Seq Scan on foo
>->  Seq Scan on foo foo_1
> (6 rows)
>
>
Attached is what I'm thinking about this optimization. Does it make any
sense?

Thanks
Richard


0001-Fix-incremental-sort-crash.patch
Description: Binary data


Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-16 Thread Robert Haas
On Thu, Apr 16, 2020 at 4:43 AM Hamid Akhtar  wrote:
> My real question is whether vacuum should be preemptively complaining about 
> prepared transactions or stale replication slots rather than waiting for 
> transaction id to exceed the safe limit. I presume by the time safe limit is 
> exceeded, vacuum's work would already have been significantly impacted.

Yeah, for my part, I agree that letting things go until the point
where VACUUM starts to complain is usually bad. Generally, you want to
know a lot sooner. That being said, I think the solution to that is to
run a monitoring tool, not to overload the autovacuum worker with
additional duties.
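
For example, such an external check can be quite small -- a rough sketch,
assuming DBI and DBD::Pg are available (the one-day cutoff and the message
wording are arbitrary, and connection parameters are site-specific):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=postgres', undef, undef,
        { RaiseError => 1 });

    # complain about prepared transactions that have been pending too long
    my $rows = $dbh->selectall_arrayref(
        q{SELECT gid, prepared FROM pg_prepared_xacts
           WHERE prepared < now() - interval '1 day'});

    warn "orphaned prepared transaction \"$_->[0]\" since $_->[1]\n"
        for @$rows;

    $dbh->disconnect;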

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Making openssl_tls_init_hook OpenSSL specific

2020-04-16 Thread Daniel Gustafsson
Commit 896fcdb230e72 (sorry for chiming in too late, I missed that thread)
added a TLS init hook which is OpenSSL specific: openssl_tls_init_hook.  Since
the rest of the TLS support in the backend is library agnostic, we should IMO
make this hook follow that pattern, else this will make a non-OpenSSL backend
not compile.

If we make the hook generic, extension authors must have a way to tell which
backend invoked it, so maybe the best option is to simply wrap this hook in
USE_OPENSSL ifdefs and keep the name/signature?  Looking at the Secure
Transport patch I wrote, there is really no equivalent callsite; the same goes
for a libnss patch which I haven't yet submitted.

The attached adds USE_OPENSSL guards.

cheers ./daniel



openssl_hook_guards.patch
Description: Binary data


RE: [PATHC] Fix minor memory leak in pg_basebackup

2020-04-16 Thread Zhang, Jie
Hi  Michael

so much the better!

-Original Message-
From: Michael Paquier [mailto:mich...@paquier.xyz] 
Sent: Thursday, April 16, 2020 2:31 PM
To: Zhang, Jie/张 杰 
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: [PATHC] Fix minor memory leak in pg_basebackup

On Wed, Apr 15, 2020 at 10:06:52AM +, Zhang, Jie wrote:
> In some cases , PGresult is not cleared.
> 
> File: src\bin\pg_basebackup\streamutil.c
> 
> bool
> RetrieveWalSegSize(PGconn *conn)
> {
>   PGresult   *res;

RetrieveWalSegSize() gets called only once at the beginning of pg_basebackup
and pg_receivewal, so that's not an issue with major effects, but it's still
an issue.  The first PQclear() is needed where you say.  Now for the second
one, I would just move it to after the code is done with the query result,
aka after calling PQgetvalue().
What do you think?  Please see the attached.
--
Michael




Re: sqlsmith crash incremental sort

2020-04-16 Thread Richard Guo
On Wed, Apr 15, 2020 at 10:47 PM Tomas Vondra 
wrote:

>
> Well, yeah. The problem is the Unique simply compares the columns in the
> order it sees them, and it does not match the column order desired by
> incremental sort. But we don't push down this information at all :-(
>

This seems like a nice optimization to have. Since the 'Sort and Unique'
would unique-ify the result of a UNION by sorting on all columns, why
don't we adjust the sort order to try to match parse->sortClause so that
we can avoid the final sort node?

Doing that we can transform plan from:

# explain (costs off) select * from foo union select * from foo order by
1,3;
  QUERY PLAN
---
 Incremental Sort
   Sort Key: foo.a, foo.c
   Presorted Key: foo.a
   ->  Unique
 ->  Sort
   Sort Key: foo.a, foo.b, foo.c
   ->  Append
 ->  Seq Scan on foo
 ->  Seq Scan on foo foo_1
(9 rows)

To:

# explain (costs off) select * from foo union select * from foo order by
1,3;
   QUERY PLAN
-
 Unique
   ->  Sort
 Sort Key: foo.a, foo.c, foo.b
 ->  Append
   ->  Seq Scan on foo
   ->  Seq Scan on foo foo_1
(6 rows)

Thanks
Richard


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-16 Thread Erik Rijkers

On 2020-04-16 11:33, Dilip Kumar wrote:

On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers  wrote:


On 2020-04-14 12:10, Dilip Kumar wrote:

> v14-0001-Immediately-WAL-log-assignments.patch +
> v14-0002-Issue-individual-invalidations-with.patch +
> v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch   +
> v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> v14-0007-Track-statistics-for-streaming.patch  +
> v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch  +
> v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch

applied on top of 8128b0c (a few hours ago)




I've added your new patch

[bugfix_replica_identity_full_on_subscriber.patch]

on top of all those above but the crash (apparently the same crash) that 
I had earlier still occurs (and pretty soon).


server process (PID 1721) was terminated by signal 11: Segmentation 
fault


I'll try to isolate it better and get a stacktrace



Hi Erik,

While setting up the cascading replication I have hit one issue on
base code[1].  After fixing that I have got one crash with streaming
on patch.  I am not sure whether you are facing any of these 2 issues
or any other issue.  If your issue is not any of these then plese
share the callstack and steps to reproduce.

[1]
https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com





Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-16 Thread Dilip Kumar
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers  wrote:
>
> On 2020-04-14 12:10, Dilip Kumar wrote:
>
> > v14-0001-Immediately-WAL-log-assignments.patch +
> > v14-0002-Issue-individual-invalidations-with.patch +
> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch   +
> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> > v14-0007-Track-statistics-for-streaming.patch  +
> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch  +
> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>
> applied on top of 8128b0c (a few hours ago)


Hi Erik,

While setting up the cascading replication I have hit one issue on
base code[1].  After fixing that I have got one crash with streaming
on patch.  I am not sure whether you are facing any of these 2 issues
or any other issue.  If your issue is not any of these then please
share the callstack and steps to reproduce.

[1] 
https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


bugfix_in_schema_sent.patch
Description: Binary data


Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Kyotaro Horiguchi
At Thu, 16 Apr 2020 16:48:28 +0900, Masahiko Sawada 
 wrote in 
> This is just a notice; I'm reading your latest patch but it seems to
> include unrelated changes:
> 
> $ git diff --stat
>  src/backend/replication/syncrep.c   | 475
> +---
>  src/backend/replication/walsender.c |  40 ++-
>  src/bin/pg_dump/compress_io.c   |  12 ++
>  src/bin/pg_dump/pg_backup_directory.c   |  48 ++-
>  src/include/replication/syncrep.h   |  20 +-
>  src/include/replication/walsender_private.h |  16 
>  6 files changed, 274 insertions(+), 337 deletions(-)

Ugh.  I failed to clean up the working directory.  I didn't notice, as I
made the file with git diff. Thanks for letting me know.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ffd5b31eb2..99a7bbbc86 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -103,19 +103,21 @@ static int	SyncRepWakeQueue(bool all, int mode);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
  XLogRecPtr *flushPtr,
- XLogRecPtr *applyPtr,
- bool *am_sync);
+ XLogRecPtr *applyPtr);
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 	   XLogRecPtr *flushPtr,
 	   XLogRecPtr *applyPtr,
-	   List *sync_standbys);
+	   SyncRepStandbyData *sync_standbys,
+	   int num_standbys,
+	   uint8 nsyncs);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 		  XLogRecPtr *flushPtr,
 		  XLogRecPtr *applyPtr,
-		  List *sync_standbys, uint8 nth);
+		  SyncRepStandbyData *sync_standbys,
+		  int num_standbys,
+		  uint8 nth);
 static int	SyncRepGetStandbyPriority(void);
-static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
-static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -406,9 +408,10 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
-		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&MyWalSnd->mutex);
 		MyWalSnd->sync_standby_priority = priority;
-		LWLockRelease(SyncRepLock);
+		SpinLockRelease(&MyWalSnd->mutex);
+
 		ereport(DEBUG1,
 (errmsg("standby \"%s\" now has synchronous standby priority %u",
 		application_name, priority)));
@@ -429,8 +432,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_recptr;
-	bool		am_sync;
+	bool		release_waiters;
 	int			numwrite = 0;
 	int			numflush = 0;
 	int			numapply = 0;
@@ -458,16 +460,24 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the synced
-	 * positions among all sync standbys.
+	 * Check whether we may release any of waiter processes, and calculate the
+	 * synced positions.
 	 */
-	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
+	release_waiters = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr);
+
+	/* Return if nothing to do. */
+	if (!release_waiters)
+	{
+		LWLockRelease(SyncRepLock);
+		announce_next_takeover = true;
+		return;
+	}
 
 	/*
-	 * If we are managing a sync standby, though we weren't prior to this,
-	 * then announce we are now a sync standby.
+	 * If this walsender becomes to be able to release waiter processes,
+	 * announce about that.
 	 */
-	if (announce_next_takeover && am_sync)
+	if (announce_next_takeover)
 	{
 		announce_next_takeover = false;
 
@@ -481,17 +491,6 @@ SyncRepReleaseWaiters(void)
 			application_name)));
 	}
 
-	/*
-	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
-	 */
-	if (!got_recptr || !am_sync)
-	{
-		LWLockRelease(SyncRepLock);
-		announce_next_takeover = !am_sync;
-		return;
-	}
-
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
 	 * this location.
@@ -523,43 +522,62 @@ SyncRepReleaseWaiters(void)
 /*
  * Calculate the synced Write, Flush and Apply positions among sync standbys.
  *
- * The caller must hold SyncRepLock.
- *
- * Return false if the number of sync standbys is less than
- * synchronous_standby_names specifies. Otherwise return true and
- * store the positions into *writePtr, *flushPtr and *applyPtr.
- *
- * On return, *am_sync is set to true if this walsender is connecting to
- * sync standby. Otherwise it's set to false.
+ * Return false if this walsender cannot release any of waiteres. Otherwise
+ * return 

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-16 Thread Etsuro Fujita
On Mon, Apr 6, 2020 at 8:43 PM Ashutosh Bapat
 wrote:
> On Fri, 3 Apr 2020 at 20:45, Etsuro Fujita  wrote:
> But it will be good to have following addition I suggested in my patches to 
> make sure nparts is set to 0 for an unpartitioned relation as per the comment 
> on nparts in RelOptInfo.
> @@ -1653,6 +1663,8 @@ build_joinrel_partition_info(RelOptInfo *joinrel, 
> RelOptInfo *outer_rel,
> jointype, restrictlist))
> {
> Assert(!IS_PARTITIONED_REL(joinrel));
> +   /* Join is not partitioned. */
> +   joinrel->nparts = 0;
> return;
> }

I didn't modify that function as proposed, because I thought that 1)
there would be no need to do so, and that 2) it would be better to set
joinrel->nparts only when we set joinrel->part_scheme, for
consistency.

>> >> 3) I think the for nparts comment is somewhat misleading:
>> >>
>> >>int nparts;  /* number of partitions; 0 = not partitioned;
>> >>  * -1 = not yet set */
>> >>
>> >> which says that nparts=0 means not partitioned. But then there are
>> >> conditions like this:
>> >>
>> >>  /* Nothing to do, if the join relation is not partitioned. */
>> >>  if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>> >>  return;
>> >>
>> >> which (ignoring the obsolete comment) seems to say we can have nparts==0
>> >> even for partitioned tables, no?
>>
>> Yeah, I think I was a bit hasty.  I fixed this.

> For a non-join relation, nparts = 0 and nparts = -1 both have the same 
> meaning. Although we never set nparts = 0 for a non-join relation?

I don't think so.  Consider this:

create table prt (a int, b int) partition by range (a);
create table prt_p1 partition of prt for values from (0) to (250);
create table prt_p2 partition of prt for values from (250) to (500);
drop table prt_p1;
drop table prt_p2;
select count(*) from prt;

For this query, we would have nparts=0 for the partitioned table prt.

Thanks!  Sorry for the delay.

Best regards,
Etsuro Fujita




Re: pg_restore: could not close data file: Success

2020-04-16 Thread Kyotaro Horiguchi
At Thu, 16 Apr 2020 14:40:09 +0900, Michael Paquier  wrote 
in 
> On Thu, Apr 16, 2020 at 12:08:09PM +0900, Kyotaro Horiguchi wrote:
> > I'm surprised to find an old thread about the same issue.
> > 
> > https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp
> > 
> > But I don't think it's not acceptable to use a fake errno for gzclose,
> > but cfclose properly passes through the error code from gzclose, so it
> > is enough that the caller should recognize the difference.
> 
> A problem with this patch is that we may forget again to add this
> special error handling if more code paths use cfclose().

Definitely.  The reason for the patch is that the error codes are different
depending on the caller, and some callers don't even check the error
(seemingly intentionally).

> As of HEAD, there are three code paths where cfclose() is called but
> it does not generate an error: two when ending a blob and one when
> ending a data file.  Perhaps it would make sense to just move all this
> error within the routine itself?  Note that it would also mean
> registering file names in lclContext or equivalent as that's an
> important piece of the error message.

Hmm. Sounds reasonable.  I'm going to do that.  Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Autovacuum on partitioned table (autoanalyze)

2020-04-16 Thread yuzuko
Hi Justin,

Thank you for the comments.

On Tue, Apr 7, 2020 at 12:32 PM Justin Pryzby  wrote:
>
> Not sure if you saw my earlier message ?
>
I'm sorry, I didn't notice for a while.

> I think it ought to be possible to configure this feature such that an
> auto-analyze on any child partition would trigger analyze of the parent.  I
> think that would be important for maintaining accurate stats of the partition
> key column for many cases involving RANGE-partitioned tables, which are likely
> to rely on histogram rather than MCVs.
>
I read your previous email and understand that it would be necessary to analyze
partitioned tables automatically when any of their children are analyzed.  In my
first patch, auto-analyze on partitioned tables worked like this but there were
some comments about the performance of autovacuum, especially when partitioned
tables have a lot of children.

The latest patch lets users set a different autovacuum configuration for
each partitioned table, like this:
  create table p3(i int) partition by range(i) with
   (autovacuum_analyze_scale_factor=0.0005, autovacuum_analyze_threshold=100);
so users can configure those parameters according to partitioning strategies
and other requirements.

So I think this patch can solve the problem you mentioned.

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center




Problem with logical replication

2020-04-16 Thread Dilip Kumar
While trying to set up cascading replication, I have observed that if we
set the REPLICA IDENTITY to FULL on the subscriber side then an Assert
is hit.

After analysis I have found that, when we set the REPLICA IDENTITY to
FULL on the subscriber side (because I wanted to make this subscriber a
publisher for another subscriber), relation->rd_replidindex gets set to
InvalidOid; refer to the code snippet below.
RelationGetIndexList()
{

if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
relation->rd_replidindex = candidateIndex;
else
relation->rd_replidindex = InvalidOid;
}

But while applying the update, if the table has an index, we hit this
assert in build_replindex_scan_key:

static bool
build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
TupleTableSlot *searchslot)
{
...
Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
}

To me it appears that this assert is not correct.  The attached patch
removes this assert and things work fine.


#0  0x7ff2a0c8d5d7 in raise () from /lib64/libc.so.6
#1  0x7ff2a0c8ecc8 in abort () from /lib64/libc.so.6
#2  0x00aa7c7d in ExceptionalCondition (conditionName=0xc1bb30
"RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel)",
errorType=0xc1bad9 "FailedAssertion",
fileName=0xc1bb1c "execReplication.c", lineNumber=60) at assert.c:67
#3  0x007153c3 in build_replindex_scan_key
(skey=0x7fff25711560, rel=0x7ff2a1b0b800, idxrel=0x7ff2a1b0bd98,
searchslot=0x21328c8) at execReplication.c:60
#4  0x007156ac in RelationFindReplTupleByIndex
(rel=0x7ff2a1b0b800, idxoid=16387, lockmode=LockTupleExclusive,
searchslot=0x21328c8, outslot=0x2132bb0) at execReplication.c:141
#5  0x008aeba5 in FindReplTupleInLocalRel (estate=0x2150170,
localrel=0x7ff2a1b0b800, remoterel=0x214a7c8, remoteslot=0x21328c8,
localslot=0x7fff25711f28) at worker.c:989
#6  0x008ae6f2 in apply_handle_update_internal
(relinfo=0x21327b0, estate=0x2150170, remoteslot=0x21328c8,
newtup=0x7fff25711fd0, relmapentry=0x214a7c8) at worker.c:820
#7  0x008ae609 in apply_handle_update (s=0x7fff25719560) at worker.c:788
#8  0x008af8b1 in apply_dispatch (s=0x7fff25719560) at worker.c:1362
#9  0x008afd52 in LogicalRepApplyLoop (last_received=22926832)
at worker.c:1570
#10 0x008b0c3a in ApplyWorkerMain (main_arg=0) at worker.c:2114
#11 0x00869c15 in StartBackgroundWorker () at bgworker.c:813
#12 0x0087d28f in do_start_bgworker (rw=0x20a07a0) at postmaster.c:5852
#13 0x0087d63d in maybe_start_bgworkers () at postmaster.c:6078
#14 0x0087c685 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5247
#15 
#16 0x7ff2a0d458d3 in __select_nocancel () from /lib64/libc.so.6
#17 0x00878153 in ServerLoop () at postmaster.c:1691
#18 0x00877b42 in PostmasterMain (argc=3, argv=0x2079120) at
postmaster.c:1400
#19 0x0077f256 in main (argc=3, argv=0x2079120) at main.c:210

To reproduce this issue, run start1.sh, then execute the commands below
on the publisher.
insert into pgbench_accounts values(1,2);
update pgbench_accounts set b=30 where a=1;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
pkill -9 postgres
rm -rf data data1 data2
./initdb -D data
./initdb -D data1
echo "wal_level = logical">> data/postgresql.conf
echo "port=5433" >> data1/postgresql.conf
#cp postgresql.conf data/
#cp postgresql.conf_sub data1/postgresql.conf
./pg_ctl -D data -c start
./pg_ctl -D data1 -c start
./psql -p 5432 -d postgres -c "create table pgbench_accounts(a int primary key, b int)"
./psql -p 5433 -d postgres -c "create table pgbench_accounts(a int primary key, b int)"
./psql -p 5432 -d postgres -c "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL;"
./psql -p 5433 -d postgres -c "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL;"
./psql -p 5432 -d postgres -c "CREATE PUBLICATION mypub FOR TABLE pgbench_accounts"
./psql -p 5433 -d postgres -c "CREATE PUBLICATION mypub1 FOR TABLE pgbench_accounts"
./psql -d postgres -p 5433 -c "CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432 dbname=postgres' PUBLICATION mypub"



bugfix_replica_identity_full_on_subscriber.patch
Description: Binary data


Re: sqlsmith crash incremental sort

2020-04-16 Thread Richard Guo
On Mon, Apr 13, 2020 at 8:09 AM Tomas Vondra 
wrote:

>
> I've been messing with this the whole day, without much progress :-(
>
> I'm 99.% sure it's the same issue described by the quoted comment,
> because the plan looks like this:
>
>   Nested Loop Left Join
> ->  Sample Scan on pg_namespace
>   Sampling: system ('7.2'::real)
> ->  Incremental Sort
>   Sort Key: ...
>   Presorted Key: ...
>   ->  Unique
> ->  Sort
>   Sort Key: ...
>   ->  Append
> ->  Nested Loop
> ...
> ->  Nested Loop
> ...
>
> so yeah, the plan does have set operations, and generate_append_tlist
> does generate Vars with varno == 0, causing this issue.
>

After some digging I believe here is what happened.

1. For the UNION query, we build an upper rel of UPPERREL_SETOP and
generate Append path for it. Since Append doesn't actually evaluate its
targetlist, we generate 'varno 0' Vars for its targetlist. (setrefs.c
would just replace them with OUTER_VAR when adjusting the final plan so
this usually does not cause problems.)

2. To remove duplicates for UNION, we use hash/sort to unique-ify the
result. If sort is chosen, we add Sort path and then Unique path above
Append path, with pathkeys made from Append's targetlist.

3. Also the Append's targetlist would be built into
root->processed_tlist and with that we calculate root->sort_pathkeys.

4. When handling ORDER BY clause, we figure out the pathkeys of
Unique->Sort->Append path share some same prefix with
root->sort_pathkeys and thus incremental sort would be considered.

5. When calculating cost for incremental sort, estimate_num_groups does
not cope with 'varno 0' Vars extracted from root->sort_pathkeys.


With this scenario, here is a simple recipe:

create table foo(a int, b int, c int);
set enable_hashagg to off;
explain select * from foo union select * from foo order by 1,3;

Thanks
Richard


Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-16 Thread Hamid Akhtar
This patch actually does not discard any prepared transactions and only
throws a warning for each orphaned one. So, there is no behaviour change
except for getting some warnings in the log or emitting some warning to a
client executing a vacuum command.

I hear all the criticism which I don't disagree with. Obviously, scripts
and other solutions could provide a lot more flexibility.

Also, I believe most of us agree that vacuum needs to be smarter.

src/backend/commands/vacuum.c does throw warnings for upcoming wraparound
issues with one warning in particular mentioning prepared transactions and
stale replication slots. So, throwing warnings is not unprecedented. There
are 3 warnings in this file which I believe can also be handled by external
tools. I'm not debating the merit of these warnings, nor am I trying to
justify the addition of new warnings based on these.

My real question is whether vacuum should be preemptively complaining about
prepared transactions or stale replication slots rather than waiting for the
transaction ID to exceed the safe limit. I presume that by the time the safe
limit is exceeded, vacuum's work would already have been significantly impacted.

AFAICT, my patch actually doesn't break anything and doesn't add any
significant overhead to the vacuum process. It does supplement the current
warnings though which might be useful.

On Thu, Apr 16, 2020 at 10:32 AM Craig Ringer  wrote:

> On Thu, 16 Apr 2020 at 13:23, Craig Ringer  wrote:
>
>>
>> Just discarding the prepared xacts is not the answer though.
>>
>
> ... however, I have wondered a few times about making vacuum smarter about
> cases where the xmin is held down by prepared xacts or by replication
> slots. If we could record the oldest *and newest* xid needed by such
> resource retention markers we could potentially teach vacuum to remove
> intermediate dead rows. For high-churn workloads like like workqueue
> applications that could be a really big win.
>
> We wouldn't need to track a fine-grained snapshot with an in-progress list
> (or inverted in-progress list like historic snapshots) for these. We'd just
> remember the needed xid range in [xmin,xmax] form. And we could even do the
> same for live backends' PGXACT - it might not be worth the price there, but
> if you have workloads that have batch xacts + high churn rate xacts it'd be
> pretty appealing.
>
> It wouldn't help with xid wraparound concerns, but it could help a lot
> with bloat caused by old snapshots for some very common workloads.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  2ndQuadrant - PostgreSQL Solutions for the Enterprise
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


ERROR: could not open file "pg_tblspc/ issue with replication setup.

2020-04-16 Thread Rajkumar Raghuwanshi
Hi,

While testing a feature I got this tablespace-related error while
running a script.
The issue is not reproducible every time, but if I run the same set
of commands, after 2-3 runs I am able to reproduce the same error.

--first run - pass
# master slave setup
+ mkdir /tmp/test_bkp/tblsp01
+ ./psql postgres -p 5432 -c 'create tablespace tblsp01 location
'\''/tmp/test_bkp/tblsp01'\'';'
CREATE TABLESPACE
+ ./psql postgres -p 5432 -c 'create table test (a text) tablespace
tblsp01;'
CREATE TABLE
#cleanup

--next
#master-slave setup
+ mkdir /tmp/test_bkp/tblsp01
+ ./psql postgres -p 5432 -c 'create tablespace tblsp01 location
'\''/tmp/test_bkp/tblsp01'\'';'
CREATE TABLESPACE
+ ./psql postgres -p 5432 -c 'create table test (a text) tablespace
tblsp01;'
ERROR:  could not open file "pg_tblspc/16384/PG_13_202004074/13530/16388":
No such file or directory


Attaching command and script which help to reproduce it.
[edb@localhost bin]$ while sh pg_tblsp_wal.sh; do :; done

Thanks & Regards,
Rajkumar Raghuwanshi


pg_tblsp_wal.sh
Description: application/shellscript


Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Masahiko Sawada
On Thu, 16 Apr 2020 at 16:22, Kyotaro Horiguchi  wrote:
>
> At Wed, 15 Apr 2020 11:31:49 -0400, Tom Lane  wrote in
> > Kyotaro Horiguchi  writes:
> > > At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane  wrote in
> > > +   stby->is_sync_standby = true;   /* might change below */
> >
> > > I'm uneasy with that.  In quorum mode all running standbys are marked
> > > as "sync" and that's bogus.
> >
> > I don't follow that?  The existing coding of SyncRepGetSyncStandbysQuorum
> > returns all the candidates in its list, so this is isomorphic to that.
>
> The existing code actually does that. On the other hand
> SyncRepGetSyncStandbysPriority returns standbys that *are known to be*
> synchronous, but *Quorum returns standbys that *can be* synchronous.
> What the two functions return are different from each other.  So it
> should be is_sync_standby for -Priority and is_sync_candidate for
> -Quorum.
>
> > Possibly a different name for the flag would be more suited?
> >
> > > On the other hand sync_standbys is already sorted in priority order so I 
> > > think we can get rid of the member by setting *am_sync as the follows.
> >
> > > SyncRepGetSyncRecPtr:
> > >   if (sync_standbys[i].is_me)
> > >   {
> > >   *am_sync = (i < SyncRepConfig->num_sync);
> > >   break;
> > >   }
> >
> > I disagree with this, it will change the behavior in the quorum case.
>
> Oops, you're right.  I find the whole thing there (and me) is a bit
> confusing. syncrep_method affects how some values (specifically
> am_sync and sync_standbys) are translated at several calling depths.
> And the *am_sync informs nothing in quorum mode.
>
> > In any case, a change like this will cause callers to know way more than
> > they ought to about the ordering of the array.  In my mind, the fact that
> > SyncRepGetSyncStandbysPriority is sorting the array is an internal
> > implementation detail; I do not want it to be part of the API.
>
> Anyway the am_sync and is_sync_standby is utterly useless in quorum
> mode.  That discussion is pretty persuasive if not, but actually the
> upper layers (SyncRepReleaseWaiters and SyncRepGetSyncRecPtr) refer
> to syncrep_method to differentiate the interpretation of the am_sync
> flag and sync_standbys list.  So anyway the difference is actually a
> part of API.
>
> After thinking some more, I concluded that some of the variables are
> wrongly named or considered, and redundant. The function of am_sync is
> covered by got_recptr in SyncRepReleaseWaiters, so it's enough that
> SyncRepGetSyncRecPtr just reports to the caller whether the caller may
> release some of the waiter processes. This simplifies the related
> functions and make it (to me) clearer.
>
> Please find the attached.
>
>
> > (Apropos to that, I realized from working on this patch that there's
> > another, completely undocumented assumption in the existing code, that
> > the integer list will be sorted by walsender index for equal priorities.
> > I don't like that either, and not just because it's undocumented.)
>
> That seems accidental. Sorting by priority is the designed and
> documented behavior; in contrast, entries of the same priority are
> ordered by walsender index only by accident and that is not documented,
> which means it can be changed anytime.  I think we don't define
> everything in such detail.
>

This is just a notice; I'm reading your latest patch but it seems to
include unrelated changes:

$ git diff --stat
 src/backend/replication/syncrep.c           | 475 +---
 src/backend/replication/walsender.c         |  40 ++-
 src/bin/pg_dump/compress_io.c               |  12 ++
 src/bin/pg_dump/pg_backup_directory.c       |  48 ++-
 src/include/replication/syncrep.h           |  20 +-
 src/include/replication/walsender_private.h |  16
 6 files changed, 274 insertions(+), 337 deletions(-)


Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum o/p with (full 1, parallel 0) option throwing an error

2020-04-16 Thread Masahiko Sawada
On Thu, 16 Apr 2020 at 15:02, Amit Kapila  wrote:
>
> On Wed, Apr 15, 2020 at 9:12 AM Amit Kapila  wrote:
> >
> > On Wed, Apr 15, 2020 at 9:03 AM Justin Pryzby  wrote:
> > >
> > > On Wed, Apr 15, 2020 at 08:54:17AM +0530, Amit Kapila wrote:
> > > > On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option
> > > > > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp 
> > > > > > tables
> > > > > >  WARNING:  disabling parallel option of vacuum on "tmp" --- cannot  
> > > > > > vacuum temporary tables in parallel
> > > > > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel 
> > > > > > disabled (even though that's implied by FULL)
> > > > > >
> > > > > > To fully close the gap in the tests, I would also add a test for
> > > > > > (PARALLEL 1, FULL false) where FULL directly specified, even if that
> > > > > > sounds like a nit.  That's fine to test even on a temporary table.
> > > > > >
> > > > >
> > > > > Okay, I will do this once we agree on the error message stuff.
> > > > >
> > > >
> > > > I have changed one of the existing tests to test the option suggested
> > > > by you.  Additionally, I have changed the docs as per suggestion from
> > > > Sawada-san.  I haven't changed the error message.  Let me know if you
> > > > have any more comments?
> > >
> > > You did:
> > > |...then the number of workers is determined based on the number of
> > > |indexes that support parallel vacuum operation on the 
> > > [-relation,-]{+relation+} and is further
> > > |limited by .
> > >
> > > I'd suggest to say instead:
> > > |...then the number of workers is determined based on the number of
> > > |indexes ON THE RELATION that support parallel vacuum operation, and is 
> > > further
> > > |limited by .
> > >
> >
> > I have not changed this now but I find your suggestion better than
> > existing wording.  I'll change this before committing the patch unless
> > there are more comments.
> >
>
> Pushed.

Thanks! I've updated the Open Items page.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: spin_delay() for ARM

2020-04-16 Thread Pavel Stehule
On Thu, 16 Apr 2020 at 9:18, Amit Khandekar  wrote:

> On Mon, 13 Apr 2020 at 20:16, Amit Khandekar 
> wrote:
> > On Sat, 11 Apr 2020 at 04:18, Tom Lane  wrote:
> > >
> > > I wrote:
> > > > A more useful test would be to directly experiment with contended
> > > > spinlocks.  As I recall, we had some test cases laying about when
> > > > we were fooling with the spin delay stuff on Intel --- maybe
> > > > resurrecting one of those would be useful?
> > >
> > > The last really significant performance testing we did in this area
> > > seems to have been in this thread:
> > >
> > >
> https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> > >
> > > A relevant point from that is Haas' comment
> > >
> > > I think optimizing spinlocks for machines with only a few CPUs is
> > > probably pointless.  Based on what I've seen so far, spinlock
> > > contention even at 16 CPUs is negligible pretty much no matter what
> > > you do.  Whether your implementation is fast or slow isn't going to
> > > matter, because even an inefficient implementation will account for
> > > only a negligible percentage of the total CPU time - much less
> than 1%
> > > - as opposed to a 64-core machine, where it's not that hard to find
> > > cases where spin-waits consume the *majority* of available CPU time
> > > (recall previous discussion of lseek).
> >
> > Yeah, will check if I find some machines with large cores.
>
> I got hold of a 32-CPU VM (actually it was 16-core, but with
> hyperthreading the CPU count was 32).
> It was an Intel Xeon, 3GHz CPU, with 15GB available memory. Hypervisor:
> KVM. Single NUMA node.
> PG parameters changed : shared_buffer: 8G ; max_connections : 1000
>
> I compared pgbench results with HEAD versus PAUSE removed like this :
>  perform_spin_delay(SpinDelayStatus *status)
>  {
> -   /* CPU-specific delay each time through the loop */
> -   SPIN_DELAY();
>
> Ran with increasing number of parallel clients :
> pgbench -S -c $num -j $num -T 60 -M prepared
> But couldn't find any significant change in the TPS numbers with or
> without PAUSE:
>
> Clients   HEAD      Without_PAUSE
> 8         26        247264
> 16        399939    399549
> 24        454189    453244
> 32        1097592   1098844
> 40        1090424   1087984
> 48        1068645   1075173
> 64        1035035   1039973
> 96        976578    970699
>
> Maybe it will indeed show some difference only with around 64 cores,
> or perhaps a bare-metal machine will help; but as of now I couldn't get
> such a machine. Anyway, I thought why not archive the results with
> whatever I have.
>
> Not relevant to the PAUSE stuff ... Note that when the parallel
> clients go from 24 to 32 (which equals the machine's CPU count), the TPS
> shoots from 454189 to 1097592, which is more than double the speed
> with just a 30% increase in parallel sessions. I was not expecting
> this much speed gain, because in the contended scenario the pgbench
> processes are already taking around 20% of the total CPU time of the
> pgbench run. Maybe later on I will get a chance to run with a
> customized pgbench script that runs a server function which keeps
> running an index scan on pgbench_accounts, so as to make the pgbench
> clients almost idle.
>

As far as I know, pgbench cannot be used for testing spinlock problems.

Maybe you can see this issue when you a) use a much higher number of
clients - hundreds or thousands - and b) decrease shared memory, so there
will be pressure on the related spinlock.

Regards

Pavel


> Thanks
> -Amit Khandekar
>
>
>


Re: xid wraparound danger due to INDEX_CLEANUP false

2020-04-16 Thread Masahiko Sawada
On Thu, Apr 16, 2020 at 8:38 AM Andres Freund  wrote:
>
> Hi,
>
> commit a96c41feec6b6616eb9d5baee9a9e08c20533c38
> Author: Robert Haas 
> Date:   2019-04-04 14:58:53 -0400
>
> Allow VACUUM to be run with index cleanup disabled.
>
> This commit adds a new reloption, vacuum_index_cleanup, which
> controls whether index cleanup is performed for a particular
> relation by default.  It also adds a new option to the VACUUM
> command, INDEX_CLEANUP, which can be used to override the
> reloption.  If neither the reloption nor the VACUUM option is
> used, the default is true, as before.
>
> Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
> Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
> The wording of the documentation is mostly due to me.
>
> Discussion: 
> http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naypuouffvsrzutymz29ylz...@mail.gmail.com
>
> made the index scan that is part of vacuum optional.  I'm afraid that it
> is not safe to do so unconditionally. Until this commit indexes could
> rely on at least the amvacuumcleanup callback being called once per
> vacuum. Which guaranteed that an index could ensure that there are no
> too-old xids anywhere in the index.
>
> But now that's not the case anymore:
>
> vacrelstats->useindex = (nindexes > 0 &&
>  
> params->index_cleanup == VACOPT_TERNARY_ENABLED);
> ...
> /* Do post-vacuum cleanup */
> if (vacrelstats->useindex)
> lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, 
> nindexes);
>
> E.g. btree has xids both in the metapage contents, as well as using it
> on normal index pages as part of page deletion.
>
> The slightly older feature to avoid unnecessary scans during cleanup
> protects against this issue by skipping the scan inside the index AM:
>
> /*
>  * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
>  *  btbulkdelete() wasn't called.
>  */
> static bool
> _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
> {
> ...
> else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
>  TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
>
> RecentGlobalXmin))
> {
> /*
>  * If oldest btpo.xact in the deleted pages is older than
>  * RecentGlobalXmin, then at least one deleted page can be 
> recycled.
>  */
> result = true;
> }
>
> which will afaict result in all such xids getting removed (or at least
> give the AM the choice to do so).
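
(To make the wraparound hazard concrete: xids are compared circularly, so an
xid left behind in an index for more than about two billion transactions
starts to compare as being "in the future".  Below is a minimal, hedged
sketch of that modulo-2^32 comparison -- illustrative only, not the actual
TransactionIdPrecedes() implementation, which also special-cases permanent
xids:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    /*
     * Circular comparison: id1 "precedes" id2 if the signed 32-bit distance
     * from id1 to id2 is positive.
     */
    static bool
    xid_precedes(TransactionId id1, TransactionId id2)
    {
        int32_t     diff = (int32_t) (id1 - id2);

        return diff < 0;
    }

    int
    main(void)
    {
        TransactionId old_xid = 100;

        /* Shortly afterwards, old_xid still correctly looks old. */
        printf("%d\n", xid_precedes(old_xid, 200));                       /* 1 */

        /* More than 2^31 transactions later, the comparison inverts. */
        printf("%d\n", xid_precedes(old_xid, old_xid + (1U << 31) + 1));  /* 0 */

        return 0;
    }

Once the comparison inverts, code that consults such a stale xid to decide
whether a deleted page can be recycled may draw the wrong conclusion, which
is why the cleanup pass that removes or refreshes those xids matters.)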

For btree indexes, IIRC skipping index cleanup cannot be a cause of
corruption, but it can be a cause of index bloat, since it leaves
recyclable pages which are not marked as recyclable. That bloat is the
main side effect of skipping index cleanup. When a user later executes
VACUUM with INDEX_CLEANUP to reclaim index garbage, will such pages also
be recycled sooner or later? Or can skipping index cleanup cause a
recyclable page to never be recycled?

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Kyotaro Horiguchi
At Wed, 15 Apr 2020 11:31:49 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane  wrote in 
> > +   stby->is_sync_standby = true;   /* might change below */
> 
> > I'm uneasy with that.  In quorum mode all running standbys are marked
> > as "sync" and that's bogus.
> 
> I don't follow that?  The existing coding of SyncRepGetSyncStandbysQuorum
> returns all the candidates in its list, so this is isomorphic to that.

The existing code actually does that. On the other hand
SyncRepGetSyncStandbysPriority returns standbys that *are known to be*
synchronous, but *Quorum returns standbys that *can be* synchronous.
What the two functions return is different from each other.  So it
should be is_sync_standby for -Priority and is_sync_candidate for
-Quorum.

> Possibly a different name for the flag would be more suited?
> 
> > On the other hand sync_standbys is already sorted in priority order so I 
> > think we can get rid of the member by setting *am_sync as the follows.
> 
> > SyncRepGetSyncRecPtr:
> >   if (sync_standbys[i].is_me)
> >   {
> >   *am_sync = (i < SyncRepConfig->num_sync);
> >   break;
> >   }
> 
> I disagree with this, it will change the behavior in the quorum case.

Oops, you're right.  I find the whole thing there (and myself) a bit
confusing. syncrep_method affects how some values (specifically
am_sync and sync_standbys) are translated at several calling depths.
And *am_sync conveys nothing in quorum mode.

> In any case, a change like this will cause callers to know way more than
> they ought to about the ordering of the array.  In my mind, the fact that
> SyncRepGetSyncStandbysPriority is sorting the array is an internal
> implementation detail; I do not want it to be part of the API.

Anyway, am_sync and is_sync_standby are utterly useless in quorum
mode.  That discussion would be pretty persuasive if they were not, but
actually the upper layers (SyncRepReleaseWaiters and SyncRepGetSyncRecPtr)
refer to syncrep_method to differentiate the interpretation of the am_sync
flag and the sync_standbys list.  So the difference is actually a part of
the API.

After thinking some more, I concluded that some of the variables are
wrongly named or considered, and redundant. The function of am_sync is
covered by got_recptr in SyncRepReleaseWaiters, so it's enough that
SyncRepGetSyncRecPtr just reports to the caller whether the caller may
release some of the waiter processes. This simplifies the related
functions and makes it (to me) clearer.

Please find the attached.


> (Apropos to that, I realized from working on this patch that there's
> another, completely undocumented assumption in the existing code, that
> the integer list will be sorted by walsender index for equal priorities.
> I don't like that either, and not just because it's undocumented.)

That seems accidental. Sorting by priority is the designed and
documented behavior; in contrast, entries of the same priority are
ordered by walsender index only by accident and that is not documented,
which means it can be changed anytime.  I think we don't define
everything in such detail.
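
As a side note on that ordering question, the attached patch declares a
standby_priority_comparator(), which is where such a rule could live.  A
minimal sketch of the idea -- sorting by priority and breaking ties
explicitly by walsender index -- is below; the field names are assumptions
for illustration, not necessarily the exact members of SyncRepStandbyData:

    #include <stdlib.h>

    /* Illustrative stand-in for the per-standby data gathered by the walsender. */
    typedef struct SyncRepStandbyData
    {
        int         priority;       /* sync_standby_priority; lower = preferred */
        int         walsnd_index;   /* slot in the WalSnd array, used as tie-break */
        /* ... write/flush/apply positions, is_me flag, etc. ... */
    } SyncRepStandbyData;

    /*
     * Order by ascending priority; for equal priorities fall back to the
     * walsender index, so the ordering is deterministic rather than an
     * accident of how the array was built.
     */
    static int
    standby_priority_comparator(const void *a, const void *b)
    {
        const SyncRepStandbyData *sa = (const SyncRepStandbyData *) a;
        const SyncRepStandbyData *sb = (const SyncRepStandbyData *) b;

        if (sa->priority != sb->priority)
            return (sa->priority < sb->priority) ? -1 : 1;

        if (sa->walsnd_index != sb->walsnd_index)
            return (sa->walsnd_index < sb->walsnd_index) ? -1 : 1;

        return 0;
    }

A caller would then qsort() the gathered array with this comparator and, in
priority mode, treat the first num_sync entries as the synchronous ones.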

regards.

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ffd5b31eb2..99a7bbbc86 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -103,19 +103,21 @@ static int	SyncRepWakeQueue(bool all, int mode);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
  XLogRecPtr *flushPtr,
- XLogRecPtr *applyPtr,
- bool *am_sync);
+ XLogRecPtr *applyPtr);
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 	   XLogRecPtr *flushPtr,
 	   XLogRecPtr *applyPtr,
-	   List *sync_standbys);
+	   SyncRepStandbyData *sync_standbys,
+	   int num_standbys,
+	   uint8 nsyncs);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 		  XLogRecPtr *flushPtr,
 		  XLogRecPtr *applyPtr,
-		  List *sync_standbys, uint8 nth);
+		  SyncRepStandbyData *sync_standbys,
+		  int num_standbys,
+		  uint8 nth);
 static int	SyncRepGetStandbyPriority(void);
-static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
-static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -406,9 +408,10 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
-		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&MyWalSnd->mutex);
 		MyWalSnd->sync_standby_priority = priority;
-		LWLockRelease(SyncRepLock);
+		SpinLockRelease(&MyWalSnd->mutex);
+
 		ereport(DEBUG1,
 (errmsg("standby \"%s\" now has synchronous standby priority %u",
 		application_name, priority)));
@@ -429,8 +432,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	

Re: spin_delay() for ARM

2020-04-16 Thread Amit Khandekar
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar  wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane  wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks.  As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> > I think optimizing spinlocks for machines with only a few CPUs is
> > probably pointless.  Based on what I've seen so far, spinlock
> > contention even at 16 CPUs is negligible pretty much no matter what
> > you do.  Whether your implementation is fast or slow isn't going to
> > matter, because even an inefficient implementation will account for
> > only a negligible percentage of the total CPU time - much less than 1%
> > - as opposed to a 64-core machine, where it's not that hard to find
> > cases where spin-waits consume the *majority* of available CPU time
> > (recall previous discussion of lseek).
>
> Yeah, will check if I find some machines with large cores.

I got hold of a 32-CPU VM (actually it was 16-core, but with
hyperthreading the CPU count was 32).
It was an Intel Xeon, 3GHz CPU, with 15GB available memory. Hypervisor:
KVM. Single NUMA node.
PG parameters changed : shared_buffer: 8G ; max_connections : 1000

I compared pgbench results with HEAD versus PAUSE removed like this :
 perform_spin_delay(SpinDelayStatus *status)
 {
-   /* CPU-specific delay each time through the loop */
-   SPIN_DELAY();
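
(For context on what is being removed: SPIN_DELAY() is the per-iteration
CPU relaxation hint -- PAUSE on x86 -- issued while spinning on a contended
lock.  Below is a simplified, hedged sketch of the kind of
test-and-test-and-set loop such a hint sits in; it is illustrative only,
not the actual s_lock()/perform_spin_delay() code, and the AArch64 hint
used here is just one conventional choice, not this thread's conclusion:

    #include <stdatomic.h>

    /* CPU relaxation hint; what to use here on ARM is the thread's question. */
    #if defined(__x86_64__) || defined(__i386__)
    #define cpu_spin_hint() __asm__ __volatile__("pause")
    #elif defined(__aarch64__)
    #define cpu_spin_hint() __asm__ __volatile__("yield")
    #else
    #define cpu_spin_hint() ((void) 0)
    #endif

    typedef atomic_int spinlock_t;  /* 0 = free, 1 = held */

    static void
    spin_lock(spinlock_t *lock)
    {
        for (;;)
        {
            /* Uncontended fast path: try the atomic exchange directly. */
            if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
                return;

            /*
             * Contended path: spin on plain loads, issuing the hint each
             * iteration, and retry the exchange only once the lock looks
             * free.  This inner loop is roughly where SPIN_DELAY() (via
             * perform_spin_delay()) sits in the real code.
             */
            while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
                cpu_spin_hint();
        }
    }

    static void
    spin_unlock(spinlock_t *lock)
    {
        atomic_store_explicit(lock, 0, memory_order_release);
    }
)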

Ran with increasing number of parallel clients :
pgbench -S -c $num -j $num -T 60 -M prepared
But couldn't find any significant change in the TPS numbers with or
without PAUSE:

Clients   HEAD      Without_PAUSE
8         26        247264
16        399939    399549
24        454189    453244
32        1097592   1098844
40        1090424   1087984
48        1068645   1075173
64        1035035   1039973
96        976578    970699

Maybe it will indeed show some difference only with around 64 cores,
or perhaps a bare-metal machine will help; but as of now I couldn't get
such a machine. Anyway, I thought why not archive the results with
whatever I have.

Not relevant to the PAUSE stuff ... Note that when the parallel
clients go from 24 to 32 (which equals the machine's CPU count), the TPS
shoots from 454189 to 1097592, which is more than double the speed
with just a 30% increase in parallel sessions. I was not expecting
this much speed gain, because in the contended scenario the pgbench
processes are already taking around 20% of the total CPU time of the
pgbench run. Maybe later on I will get a chance to run with a
customized pgbench script that runs a server function which keeps
running an index scan on pgbench_accounts, so as to make the pgbench
clients almost idle.

Thanks
-Amit Khandekar




Re: [PATHC] Fix minor memory leak in pg_basebackup

2020-04-16 Thread Michael Paquier
On Wed, Apr 15, 2020 at 10:06:52AM +, Zhang, Jie wrote:
> In some cases , PGresult is not cleared.
> 
> File: src\bin\pg_basebackup\streamutil.c
> 
> bool
> RetrieveWalSegSize(PGconn *conn)
> {
>   PGresult   *res;

RetrieveWalSegSize() gets called only once at the beginning of
pg_basebackup and pg_receivewal, so it's not an issue with major
effects, but it is still an issue.  The first PQclear() is needed
where you say.  As for the second one, I would just move it to where
the code is done with the query result, i.e. right after calling
PQgetvalue().  What do you think?  Please see the attached.
--
Michael
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 1b9005722a..410116492e 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -313,9 +313,12 @@ RetrieveWalSegSize(PGconn *conn)
 	if (sscanf(PQgetvalue(res, 0, 0), "%d%s", &xlog_val, xlog_unit) != 2)
 	{
 		pg_log_error("WAL segment size could not be parsed");
+		PQclear(res);
 		return false;
 	}
 
+	PQclear(res);
+
 	/* set the multiplier based on unit to convert xlog_val to bytes */
 	if (strcmp(xlog_unit, "MB") == 0)
 		multiplier = 1024 * 1024;
@@ -334,7 +337,6 @@ RetrieveWalSegSize(PGconn *conn)
 		return false;
 	}
 
-	PQclear(res);
 	return true;
 }
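
The discipline the fix above enforces is simply that every exit path taken
after PQexec() frees the PGresult exactly once, and as soon as the code is
done reading from it.  A minimal, hedged sketch of that pattern, using a
hypothetical helper and query rather than pg_basebackup's actual code:

    #include <stdbool.h>
    #include <string.h>
    #include <libpq-fe.h>

    /*
     * Fetch a single text value and copy it into buf.  Returns false on any
     * failure; the PGresult is cleared on every path, including errors.
     */
    static bool
    fetch_setting(PGconn *conn, const char *sql, char *buf, size_t buflen)
    {
        PGresult   *res = PQexec(conn, sql);

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            PQclear(res);           /* error paths must clear too */
            return false;
        }

        if (PQntuples(res) != 1 || PQnfields(res) < 1)
        {
            PQclear(res);
            return false;
        }

        strncpy(buf, PQgetvalue(res, 0, 0), buflen - 1);
        buf[buflen - 1] = '\0';

        /* Done with the result as soon as the value has been copied out. */
        PQclear(res);
        return true;
    }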
 


signature.asc
Description: PGP signature


Re: Poll: are people okay with function/operator table redesign?

2020-04-16 Thread Pierre Giraud


On 16/04/2020 at 00:18, Tom Lane wrote:
> As I threatened to do earlier, I made a pass at converting table 9.10
> to a couple of the styles under discussion.  (This is just a
> draft-quality patch, so it might have some minor bugs --- the point
> is just to see what these styles look like.)
> 
> I've concluded after looking around that the ideas involving not having
> a <table> at all, but just a <variablelist> or the like, are not very
> well-advised.  That would eliminate, or at least greatly degrade, the
> visual distinction between the per-function material and the surrounding
> commentary.  Which does not seem like a winner to me; for example it
> would make it quite hard to skip over the detailed material when you're
> just trying to skim the docs.
> 
> We did have a number of people suggesting that just reordering things as
> "description, signature, examples" might be a good idea, so I gave that
> a try; attached is a rendition of a portion of 9.10 in that style (the
> "v1" image).  It's not bad, but there's still going to be a lot of
> wasted whitespace in tables that include even one long function name.
> (9.10's longest is "regexp_split_to_array", so it's showing this problem
> significantly.)
> 
> I also experimented with Jonathan's idea of dropping the separate
> function name and allowing the function signature to span left into
> that column -- see "v2" images.  This actually works really well,
> and would work even better (IMO) if we could get rid of the inter-row
> and inter-column rules within a function entry.  I failed to
> accomplish that with rowsep/colsep annotations, but from remarks
> upthread I suppose there might be a CSS way to accomplish it.  (But
> the rowsep/colsep annotations *do* work in PDF output, so I kept them;
> that means we only need a CSS fix and not some kind of flow-object
> magic for PDF.)
> 
> To allow direct comparison of these 9.10 images against the situation
> in HEAD, I've also attached an extract of 9.10 as rendered by my
> browser with "STYLE=website".  As you can see this is *not* quite
> identical to how it renders on postgresql.org, so there is still some
> unexplained differential in font or margins or something.  But if you
> look at those three PNGs you can see that either v1 or v2 has a pretty
> substantial advantage over HEAD in terms of the amount of space
> needed.  v2 would be even further ahead if we could eliminate some of
> the vertical space around the intra-function row split, which again
> might be doable with CSS magic.
> 
> The main disadvantage I can see to the v2 design is that we're back
> to having two <row>s per function, which is inevitably going to result
> in PDF builds putting page breaks between those rows.  But you can't
> have everything ... and maybe we could find a way to discourage such
> breaks if we tried.

What about putting everything into one <row> and using a block with
some left padding/margin for the description + example?
This would solve the PDF page break issue as well as the column
separation border one.

The attached screenshot uses a <div> tag for the description/example block.

> 
> Another issue is that v2 won't adapt real well to operator tables;
> the operator name won't be at the left.  I don't have a lot of faith
> in the proposal to fix that with font tricks.  Maybe we could stick
> to something close to the layout that table 9.30 has in HEAD (ie
> repeating the operator name in column 1), since we won't have long
> operator names messing up the format.  Again, CSS'ing our way
> out of the internal lines and extra vertical space within a single
> logical table cell would make that layout look nicer.
> 
> On balance I quite like the v2 layout and would prefer to move forward
> with that, assuming we can solve the remaining issues via CSS or style
> sheets.
> 
> In addition to screenshots, I've attached patches against HEAD that
> convert both tables 9.10 and 9.33 into v1 and v2 styles.
> 
>   regards, tom lane
> 


Re: Vacuum o/p with (full 1, parallel 0) option throwing an error

2020-04-16 Thread Amit Kapila
On Wed, Apr 15, 2020 at 9:12 AM Amit Kapila  wrote:
>
> On Wed, Apr 15, 2020 at 9:03 AM Justin Pryzby  wrote:
> >
> > On Wed, Apr 15, 2020 at 08:54:17AM +0530, Amit Kapila wrote:
> > > On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier  
> > > > wrote:
> > > > >
> > > >
> > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option
> > > > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables
> > > > >  WARNING:  disabling parallel option of vacuum on "tmp" --- cannot  
> > > > > vacuum temporary tables in parallel
> > > > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled 
> > > > > (even though that's implied by FULL)
> > > > >
> > > > > To fully close the gap in the tests, I would also add a test for
> > > > > (PARALLEL 1, FULL false) where FULL directly specified, even if that
> > > > > sounds like a nit.  That's fine to test even on a temporary table.
> > > > >
> > > >
> > > > Okay, I will do this once we agree on the error message stuff.
> > > >
> > >
> > > I have changed one of the existing tests to test the option suggested
> > > by you.  Additionally, I have changed the docs as per suggestion from
> > > Sawada-san.  I haven't changed the error message.  Let me know if you
> > > have any more comments?
> >
> > You did:
> > |...then the number of workers is determined based on the number of
> > |indexes that support parallel vacuum operation on the 
> > [-relation,-]{+relation+} and is further
> > |limited by .
> >
> > I'd suggest to say instead:
> > |...then the number of workers is determined based on the number of
> > |indexes ON THE RELATION that support parallel vacuum operation, and is 
> > further
> > |limited by .
> >
>
> I have not changed this now but I find your suggestion better than
> existing wording.  I'll change this before committing the patch unless
> there are more comments.
>

Pushed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com