Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-20 Thread Fujii Masao




On 2020/04/18 0:31, Tom Lane wrote:

Kyotaro Horiguchi  writes:

At Fri, 17 Apr 2020 16:03:34 +0900, Fujii Masao  
wrote in

I agree that it might be worth considering the removal of am_sync for
the master branch or v14. But I think that it should not be
back-patched.



Ah! Agreed.


Yeah, that's not necessary to fix the bug.  I'd be inclined to leave
it for v14 at this point.

I don't much like the patch Fujii-san posted, though.  An important part
of the problem, IMO, is that SyncRepGetSyncStandbysPriority is too
complicated and it's unclear what dependencies it has on the set of
priorities in shared memory being consistent.  His patch does not improve
that situation; if anything it makes it worse.


Understood.



If we're concerned about not breaking ABI in the back branches, what
I propose we do about that is just leave SyncRepGetSyncStandbys in
place but not used by the core code, and remove it only in HEAD.
We can do an absolutely minimal fix for the assertion failure, in
case anybody is calling that code, by just dropping the Assert and
letting SyncRepGetSyncStandbys return NIL if it falls out.  (Or we
could let it return the incomplete list, which'd be the behavior
you get today in a non-assert build.)

Also, I realized while re-reading my patch that Kyotaro-san is onto
something about the is_sync_standby flag not being necessary: instead
we can just have the new function SyncRepGetCandidateStandbys return
a reduced count.  I'd initially believed that it was necessary for
that function to return the rejected candidate walsenders along with
the accepted ones, but that was a misunderstanding.  I still don't
want its API spec to say anything about ordering of the result array,
but we don't need to.

So that leads me to the attached.  I propose applying this to the
back branches except for the rearrangement of WALSnd field order.
In HEAD, I'd remove SyncRepGetSyncStandbys and subroutines altogether.


Thanks for making and committing the patch!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-19 Thread Masahiko Sawada
On Sun, 19 Apr 2020 at 01:00, Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > On Sat, 18 Apr 2020 at 00:31, Tom Lane  wrote:
> >> +   /* Quick out if not even configured to be synchronous */
> >> +   if (SyncRepConfig == NULL)
> >> +   return false;
>
> > I felt strange a bit that we do the above check in
> > SyncRepGetSyncRecPtr() because SyncRepReleaseWaiters() which is the
> > only caller says the following before calling it:
>
> Notice there was such a test in SyncRepGetSyncRecPtr already --- I just
> moved it to be before doing some work instead of after.
>
> > Can we either change it to an assertion, move it to before acquiring
> > SyncRepLock in SyncRepReleaseWaiters or just remove it?
>
> I have no objection to that in principle, but it seems like it's a
> change in SyncRepGetSyncRecPtr's API that is not necessary to fix
> this bug.  So I'd rather leave it to happen along with the larger
> API changes (getting rid of am_sync) that are proposed for v14.

Agreed.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-18 Thread Tom Lane
Masahiko Sawada  writes:
> On Sat, 18 Apr 2020 at 00:31, Tom Lane  wrote:
>> +   /* Quick out if not even configured to be synchronous */
>> +   if (SyncRepConfig == NULL)
>> +   return false;

> I felt strange a bit that we do the above check in
> SyncRepGetSyncRecPtr() because SyncRepReleaseWaiters() which is the
> only caller says the following before calling it:

Notice there was such a test in SyncRepGetSyncRecPtr already --- I just
moved it to be before doing some work instead of after.

> Can we either change it to an assertion, move it to before acquiring
> SyncRepLock in SyncRepReleaseWaiters or just remove it?

I have no objection to that in principle, but it seems like it's a
change in SyncRepGetSyncRecPtr's API that is not necessary to fix
this bug.  So I'd rather leave it to happen along with the larger
API changes (getting rid of am_sync) that are proposed for v14.

regards, tom lane




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Masahiko Sawada
On Sat, 18 Apr 2020 at 00:31, Tom Lane  wrote:
>
> Kyotaro Horiguchi  writes:
> > At Fri, 17 Apr 2020 16:03:34 +0900, Fujii Masao 
> >  wrote in
> >> I agree that it might be worth considering the removal of am_sync for
> >> the master branch or v14. But I think that it should not be
> >> back-patched.
>
> > Ah! Agreed.
>
> Yeah, that's not necessary to fix the bug.  I'd be inclined to leave
> it for v14 at this point.
>
> I don't much like the patch Fujii-san posted, though.  An important part
> of the problem, IMO, is that SyncRepGetSyncStandbysPriority is too
> complicated and it's unclear what dependencies it has on the set of
> priorities in shared memory being consistent.  His patch does not improve
> that situation; if anything it makes it worse.
>
> If we're concerned about not breaking ABI in the back branches, what
> I propose we do about that is just leave SyncRepGetSyncStandbys in
> place but not used by the core code, and remove it only in HEAD.
> We can do an absolutely minimal fix for the assertion failure, in
> case anybody is calling that code, by just dropping the Assert and
> letting SyncRepGetSyncStandbys return NIL if it falls out.  (Or we
> could let it return the incomplete list, which'd be the behavior
> you get today in a non-assert build.)

+1

>
> Also, I realized while re-reading my patch that Kyotaro-san is onto
> something about the is_sync_standby flag not being necessary: instead
> we can just have the new function SyncRepGetCandidateStandbys return
> a reduced count.  I'd initially believed that it was necessary for
> that function to return the rejected candidate walsenders along with
> the accepted ones, but that was a misunderstanding.  I still don't
> want its API spec to say anything about ordering of the result array,
> but we don't need to.
>
> So that leads me to the attached.  I propose applying this to the
> back branches except for the rearrangement of WALSnd field order.
> In HEAD, I'd remove SyncRepGetSyncStandbys and subroutines altogether.
>

+   /* Quick out if not even configured to be synchronous */
+   if (SyncRepConfig == NULL)
+   return false;

I felt strange a bit that we do the above check in
SyncRepGetSyncRecPtr() because SyncRepReleaseWaiters() which is the
only caller says the following before calling it:

/*
 * We're a potential sync standby. Release waiters if there are enough
 * sync standbys and we are considered as sync.
 */
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);

Can we either change it to an assertion, move it to before acquiring
SyncRepLock in SyncRepReleaseWaiters or just remove it?

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Fri, 17 Apr 2020 17:03:11 +0900, Masahiko Sawada 
>  wrote in 
>> On Fri, 17 Apr 2020 at 14:58, Kyotaro Horiguchi  
>> wrote:
>> Just for confirmation, since the new approach doesn't change that
>> walsenders reload new config at their convenient timing, it still can
>> happen that a walsender releases waiters according to the old config
>> that defines fewer number of sync standbys, during walsenders

> Right.

>> absorbing a change in the set of synchronous walsenders. In the worst
>> case where the master crashes in the middle, we cannot be sure how
>> many sync servers the data has been replicated to. Is that right?

> Wal senders can set a stupid value as priority or in a worse case the
> shared walsender information might be of another walsender that is
> launched just now. In any case SyncRepGetSyncStandbys can return a set
> of walsenders with descending priority (in priority mode).  What can
> be happen in the worst case is some transactions are released by a bit
> wrong LSN information.  Such inconsistency also can be happen when the
> oldest sync standby in priority mode goes out and sync-LSN goes back
> even if the wal-sender list is strictly kept consistent.

I don't really see a problem here.  It's true that transactions might
be released based on either the old or the new value of num_sync,
depending on whether the particular walsender executing the release
logic has noticed the SIGHUP yet.  But if a transaction was released,
then there were at least num_sync confirmed transmissions of data
to someplace, so it's not like you've got no redundancy at all.

The only thing that seems slightly odd is that there could in principle
be some transactions released on the basis of the new num_sync, and
then slightly later some transactions released on the basis of the old
num_sync.  But I don't think it's really going to be possible to avoid
that, given that the GUC update is propagated in an asynchronous
fashion.

I spent a few moments wondering if we could avoid such cases by having
SyncRepReleaseWaiters check for GUC updates after it's acquired
SyncRepLock.  But that wouldn't really guarantee much, since the
postmaster can't deliver SIGHUP to all the walsenders simultaneously.
I think the main practical effect would be to allow some possibly-slow
processing to happen while holding SyncRepLock, which surely isn't a
great idea.

BTW, it might be worth documenting in this thread that my proposed
patch intentionally doesn't move SyncRepReleaseWaiters' acquisition
of SyncRepLock.  With the patch, SyncRepGetSyncRecPtr does not require
SyncRepLock so one could consider acquiring that lock only while updating 
walsndctl and releasing waiters.  My concern about that is that then
it'd be possible for a later round of waiter-releasing to happen on the
basis of slightly older SyncRepGetSyncRecPtr results, if a walsender that
had done SyncRepGetSyncRecPtr first were only able to acquire the lock
second.  Perhaps that would be okay, but I'm not sure, so I left it
alone.

regards, tom lane




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Fri, 17 Apr 2020 16:03:34 +0900, Fujii Masao  
> wrote in 
>> I agree that it might be worth considering the removal of am_sync for
>> the master branch or v14. But I think that it should not be
>> back-patched.

> Ah! Agreed.

Yeah, that's not necessary to fix the bug.  I'd be inclined to leave
it for v14 at this point.

I don't much like the patch Fujii-san posted, though.  An important part
of the problem, IMO, is that SyncRepGetSyncStandbysPriority is too
complicated and it's unclear what dependencies it has on the set of
priorities in shared memory being consistent.  His patch does not improve
that situation; if anything it makes it worse.

If we're concerned about not breaking ABI in the back branches, what
I propose we do about that is just leave SyncRepGetSyncStandbys in
place but not used by the core code, and remove it only in HEAD.
We can do an absolutely minimal fix for the assertion failure, in
case anybody is calling that code, by just dropping the Assert and
letting SyncRepGetSyncStandbys return NIL if it falls out.  (Or we
could let it return the incomplete list, which'd be the behavior
you get today in a non-assert build.)

Also, I realized while re-reading my patch that Kyotaro-san is onto
something about the is_sync_standby flag not being necessary: instead
we can just have the new function SyncRepGetCandidateStandbys return
a reduced count.  I'd initially believed that it was necessary for
that function to return the rejected candidate walsenders along with
the accepted ones, but that was a misunderstanding.  I still don't
want its API spec to say anything about ordering of the result array,
but we don't need to.

So that leads me to the attached.  I propose applying this to the
back branches except for the rearrangement of WALSnd field order.
In HEAD, I'd remove SyncRepGetSyncStandbys and subroutines altogether.

regards, tom lane

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ffd5b31..b47c7fa 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -108,14 +108,18 @@ static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 	   XLogRecPtr *flushPtr,
 	   XLogRecPtr *applyPtr,
-	   List *sync_standbys);
+	   SyncRepStandbyData *sync_standbys,
+	   int num_standbys);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 		  XLogRecPtr *flushPtr,
 		  XLogRecPtr *applyPtr,
-		  List *sync_standbys, uint8 nth);
+		  SyncRepStandbyData *sync_standbys,
+		  int num_standbys,
+		  uint8 nth);
 static int	SyncRepGetStandbyPriority(void);
 static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
 static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -406,9 +410,10 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
-		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SpinLockAcquire(>mutex);
 		MyWalSnd->sync_standby_priority = priority;
-		LWLockRelease(SyncRepLock);
+		SpinLockRelease(>mutex);
+
 		ereport(DEBUG1,
 (errmsg("standby \"%s\" now has synchronous standby priority %u",
 		application_name, priority)));
@@ -523,8 +528,6 @@ SyncRepReleaseWaiters(void)
 /*
  * Calculate the synced Write, Flush and Apply positions among sync standbys.
  *
- * The caller must hold SyncRepLock.
- *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
  * store the positions into *writePtr, *flushPtr and *applyPtr.
@@ -536,27 +539,41 @@ static bool
 SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	 XLogRecPtr *applyPtr, bool *am_sync)
 {
-	List	   *sync_standbys;
-
-	Assert(LWLockHeldByMe(SyncRepLock));
+	SyncRepStandbyData *sync_standbys;
+	int			num_standbys;
+	int			i;
 
+	/* Initialize default results */
 	*writePtr = InvalidXLogRecPtr;
 	*flushPtr = InvalidXLogRecPtr;
 	*applyPtr = InvalidXLogRecPtr;
 	*am_sync = false;
 
+	/* Quick out if not even configured to be synchronous */
+	if (SyncRepConfig == NULL)
+		return false;
+
 	/* Get standbys that are considered as synchronous at this moment */
-	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+	num_standbys = SyncRepGetCandidateStandbys(_standbys);
+
+	/* Am I among the candidate sync standbys? */
+	for (i = 0; i < num_standbys; i++)
+	{
+		if (sync_standbys[i].is_me)
+		{
+			*am_sync = true;
+			break;
+		}
+	}
 
 	/*
-	 * Quick exit if we are not managing a sync standby or there are not
-	 * enough synchronous standbys.
+	 * Nothing more to do if we are not managing a sync standby or there are
+	 * not enough synchronous 

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Kyotaro Horiguchi
At Fri, 17 Apr 2020 17:03:11 +0900, Masahiko Sawada 
 wrote in 
> On Fri, 17 Apr 2020 at 14:58, Kyotaro Horiguchi  
> wrote:
> > The attached is baed on syncrep-fixes-1.patch + am_sync elimination.
> >
> 
> Just for confirmation, since the new approach doesn't change that
> walsenders reload new config at their convenient timing, it still can
> happen that a walsender releases waiters according to the old config
> that defines fewer number of sync standbys, during walsenders

Right.

> absorbing a change in the set of synchronous walsenders. In the worst
> case where the master crashes in the middle, we cannot be sure how
> many sync servers the data has been replicated to. Is that right?

Wal senders can set a stupid value as priority or in a worse case the
shared walsender information might be of another walsender that is
launched just now. In any case SyncRepGetSyncStandbys can return a set
of walsenders with descending priority (in priority mode).  What can
be happen in the worst case is some transactions are released by a bit
wrong LSN information.  Such inconsistency also can be happen when the
oldest sync standby in priority mode goes out and sync-LSN goes back
even if the wal-sender list is strictly kept consistent.

In quorum mode, we cannot even know which servers that endorsed the
master's commit after a crash.

I don't come up of clean solution for such inconsistency or
unrecoverability(?) for now..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Kyotaro Horiguchi
At Fri, 17 Apr 2020 16:03:34 +0900, Fujii Masao  
wrote in 
> I agree that it might be worth considering the removal of am_sync for
> the master branch or v14. But I think that it should not be
> back-patched.

Ah! Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Masahiko Sawada
On Fri, 17 Apr 2020 at 14:58, Kyotaro Horiguchi  wrote:
>
> At Thu, 16 Apr 2020 11:39:06 -0400, Tom Lane  wrote in
> > Kyotaro Horiguchi  writes:
> > > [ syncrep-fixes-4.patch ]
> >
> > I agree that we could probably improve the clarity of this code with
> > further rewriting, but I'm still very opposed to the idea of having
> > callers know that the first num_sync array elements are the active
> > ones.  It's wrong (or at least different from current behavior) for
> > quorum mode, where there might be more than num_sync walsenders to
> > consider.  And it might not generalize very well to other syncrep
> > selection rules we might add in future, which might also not have
> > exactly num_sync interesting walsenders.  So I much prefer an API
> > definition that uses bool flags in an array that has no particular
> > ordering (so far as the callers know, anyway).  If you don't like
> > is_sync_standby, how about some more-neutral name like is_active
> > or is_interesting or include_position?
>
> I'm convinced that each element has is_sync_standby.  I agree to the
> name is_sync_standby since I don't come up with a better name.
>
> > I dislike the proposed comment revisions in SyncRepReleaseWaiters,
> > too, particularly the change to say that what we're "announcing"
> > is the ability to release waiters.  You did not change the actual
> > log messages, and you would have gotten a lot of pushback if
> > you tried, because the current messages make sense to users and
> > something like that would not.  But by the same token this new
> > comment isn't too helpful to somebody reading the code.
>
> The current log messages look perfect to me.  I don't insist on the
> comment change since I might take the definition of "sync standby" too
> strictly.
>
> > (Actually, I wonder why we even have the restriction that only
> > sync standbys can release waiters.  It's not like they are
> > going to get different results from SyncRepGetSyncRecPtr than
> > any other walsender would.  Maybe we should just drop all the
> > am_sync logic?)
>
> I thought the same thing, though I didn't do that in the last patch.
>
> am_sync seems intending to reduce spurious wakeups but actually
> spurious wakeup won't increase even without it.  Thus the only
> remaining task of am_sync is the trigger for the log messages and that
> fact is the sign that the log messages should be emitted within
> SyncRepGetSyncRecPtr. That eliminates references to SyncRepConfig in
> SyncRepReleaseWaiters, which make me feel ease.
>
> The attached is baed on syncrep-fixes-1.patch + am_sync elimination.
>

Just for confirmation, since the new approach doesn't change that
walsenders reload new config at their convenient timing, it still can
happen that a walsender releases waiters according to the old config
that defines fewer number of sync standbys, during walsenders
absorbing a change in the set of synchronous walsenders. In the worst
case where the master crashes in the middle, we cannot be sure how
many sync servers the data has been replicated to. Is that right?

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Fujii Masao



On 2020/04/17 3:00, Tom Lane wrote:

Fujii Masao  writes:

On 2020/04/14 22:52, Tom Lane wrote:

*Yes it does*.  The existing code can deliver entirely broken results
if some walsender exits between where we examine the priorities and
where we fetch the WAL pointers.



IMO that the broken results can be delivered when walsender marked
as sync exits *and* new walsender comes at that moment. If this new
walsender uses the WalSnd slot that the exited walsender used,
SyncRepGetOldestSyncRecPtr() wronly calculates the oldest lsn based
on this new walsender (i.e., different walsender from one marked as sync).


Right, exactly, sorry that I was not more specific.


BTW, since the patch changes the API of SyncRepGetSyncStandbys(),
it should not be back-patched to avoid ABI break. Right?


Anything that is using that is just as broken as the core code is, for the
same reasons, so I don't have a problem with changing its API.  Maybe we
should rename it while we're at it, just to make it clear that we are
breaking any external callers.  (If there are any, which seems somewhat
unlikely.)


I agree to change the API if that's the only way to fix the bug. But ISTM that
we can fix the bug without changing the API, like the attached patch does.

Your patch changes the logic to pick up sync standbys, e.g., use qsort(),
in addition to the bug fix. This might be an improvement and I agree that
it's worth considering that idea for the master branch or v14. But I'm not
fan of adding such changes into the back branches if they are not
necessary for the bug fix. I like to basically keep the current logic as it is,
at least for the back branch, like the attached patch does.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/syncrep.c 
b/src/backend/replication/syncrep.c
index ffd5b31eb2..6add59f18c 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -108,14 +108,20 @@ static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
   
XLogRecPtr *flushPtr,
   
XLogRecPtr *applyPtr,
-  List 
*sync_standbys);
+  List 
*sync_standbys,
+  
SyncRepStandbyData *sdata);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,

  XLogRecPtr *flushPtr,

  XLogRecPtr *applyPtr,
-   
  List *sync_standbys, uint8 nth);
+   
  List *sync_standbys,
+   
  SyncRepStandbyData *sdata, uint8 nth);
 static int SyncRepGetStandbyPriority(void);
-static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
-static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static List *SyncRepGetSyncStandbysInternal(bool *am_sync,
+   
SyncRepStandbyData *sdata);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync,
+   
SyncRepStandbyData *sdata);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync,
+   
  SyncRepStandbyData *sdata);
 static int cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -537,6 +543,7 @@ SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr 
*flushPtr,
 XLogRecPtr *applyPtr, bool *am_sync)
 {
List   *sync_standbys;
+   static SyncRepStandbyData   *sdata = NULL;
 
Assert(LWLockHeldByMe(SyncRepLock));
 
@@ -545,8 +552,12 @@ SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr 
*flushPtr,
*applyPtr = InvalidXLogRecPtr;
*am_sync = false;
 
+   if (sdata == NULL)
+   sdata = (SyncRepStandbyData *)
+   palloc(max_wal_senders * sizeof(SyncRepStandbyData));
+
/* Get standbys that are considered as synchronous at this moment */
-   sync_standbys = SyncRepGetSyncStandbys(am_sync);
+   sync_standbys = SyncRepGetSyncStandbysInternal(am_sync, sdata);
 
/*
 * Quick exit if we are not managing a sync standby or there are not
@@ -576,12 +587,13 @@ 

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Fujii Masao




On 2020/04/17 14:58, Kyotaro Horiguchi wrote:

At Thu, 16 Apr 2020 11:39:06 -0400, Tom Lane  wrote in

Kyotaro Horiguchi  writes:

[ syncrep-fixes-4.patch ]


I agree that we could probably improve the clarity of this code with
further rewriting, but I'm still very opposed to the idea of having
callers know that the first num_sync array elements are the active
ones.  It's wrong (or at least different from current behavior) for
quorum mode, where there might be more than num_sync walsenders to
consider.  And it might not generalize very well to other syncrep
selection rules we might add in future, which might also not have
exactly num_sync interesting walsenders.  So I much prefer an API
definition that uses bool flags in an array that has no particular
ordering (so far as the callers know, anyway).  If you don't like
is_sync_standby, how about some more-neutral name like is_active
or is_interesting or include_position?


I'm convinced that each element has is_sync_standby.  I agree to the
name is_sync_standby since I don't come up with a better name.


I dislike the proposed comment revisions in SyncRepReleaseWaiters,
too, particularly the change to say that what we're "announcing"
is the ability to release waiters.  You did not change the actual
log messages, and you would have gotten a lot of pushback if
you tried, because the current messages make sense to users and
something like that would not.  But by the same token this new
comment isn't too helpful to somebody reading the code.


The current log messages look perfect to me.  I don't insist on the
comment change since I might take the definition of "sync standby" too
strictly.


(Actually, I wonder why we even have the restriction that only
sync standbys can release waiters.  It's not like they are
going to get different results from SyncRepGetSyncRecPtr than
any other walsender would.  Maybe we should just drop all the
am_sync logic?)


I thought the same thing, though I didn't do that in the last patch.

am_sync seems intending to reduce spurious wakeups but actually
spurious wakeup won't increase even without it.  Thus the only
remaining task of am_sync is the trigger for the log messages and that
fact is the sign that the log messages should be emitted within
SyncRepGetSyncRecPtr. That eliminates references to SyncRepConfig in
SyncRepReleaseWaiters, which make me feel ease.

The attached is baed on syncrep-fixes-1.patch + am_sync elimination.


I agree that it might be worth considering the removal of am_sync for
the master branch or v14. But I think that it should not be back-patched.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-17 Thread Kyotaro Horiguchi
At Thu, 16 Apr 2020 11:39:06 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > [ syncrep-fixes-4.patch ]
> 
> I agree that we could probably improve the clarity of this code with
> further rewriting, but I'm still very opposed to the idea of having
> callers know that the first num_sync array elements are the active
> ones.  It's wrong (or at least different from current behavior) for
> quorum mode, where there might be more than num_sync walsenders to
> consider.  And it might not generalize very well to other syncrep
> selection rules we might add in future, which might also not have
> exactly num_sync interesting walsenders.  So I much prefer an API
> definition that uses bool flags in an array that has no particular
> ordering (so far as the callers know, anyway).  If you don't like
> is_sync_standby, how about some more-neutral name like is_active
> or is_interesting or include_position?

I'm convinced that each element has is_sync_standby.  I agree to the
name is_sync_standby since I don't come up with a better name.

> I dislike the proposed comment revisions in SyncRepReleaseWaiters,
> too, particularly the change to say that what we're "announcing"
> is the ability to release waiters.  You did not change the actual
> log messages, and you would have gotten a lot of pushback if
> you tried, because the current messages make sense to users and
> something like that would not.  But by the same token this new
> comment isn't too helpful to somebody reading the code.

The current log messages look perfect to me.  I don't insist on the
comment change since I might take the definition of "sync standby" too
strictly.

> (Actually, I wonder why we even have the restriction that only
> sync standbys can release waiters.  It's not like they are
> going to get different results from SyncRepGetSyncRecPtr than
> any other walsender would.  Maybe we should just drop all the
> am_sync logic?)

I thought the same thing, though I didn't do that in the last patch.

am_sync seems intending to reduce spurious wakeups but actually
spurious wakeup won't increase even without it.  Thus the only
remaining task of am_sync is the trigger for the log messages and that
fact is the sign that the log messages should be emitted within
SyncRepGetSyncRecPtr. That eliminates references to SyncRepConfig in
SyncRepReleaseWaiters, which make me feel ease.

The attached is baed on syncrep-fixes-1.patch + am_sync elimination.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ffd5b31eb2..eb0616121a 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -103,19 +103,21 @@ static int	SyncRepWakeQueue(bool all, int mode);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
  XLogRecPtr *flushPtr,
- XLogRecPtr *applyPtr,
- bool *am_sync);
+ XLogRecPtr *applyPtr);
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 	   XLogRecPtr *flushPtr,
 	   XLogRecPtr *applyPtr,
-	   List *sync_standbys);
+	   SyncRepStandbyData *sync_standbys,
+	   int num_standbys);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 		  XLogRecPtr *flushPtr,
 		  XLogRecPtr *applyPtr,
-		  List *sync_standbys, uint8 nth);
+		  SyncRepStandbyData *sync_standbys,
+		  int num_standbys,
+		  uint8 nth);
 static int	SyncRepGetStandbyPriority(void);
-static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
-static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static void SyncRepGetSyncStandbysPriority(SyncRepStandbyData *standbys, int n);
+static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -406,9 +408,10 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
-		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SpinLockAcquire(>mutex);
 		MyWalSnd->sync_standby_priority = priority;
-		LWLockRelease(SyncRepLock);
+		SpinLockRelease(>mutex);
+
 		ereport(DEBUG1,
 (errmsg("standby \"%s\" now has synchronous standby priority %u",
 		application_name, priority)));
@@ -430,7 +433,6 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
 	bool		got_recptr;
-	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
 	int			numapply = 0;
@@ -461,34 +463,14 @@ SyncRepReleaseWaiters(void)
 	 * Check whether we are a sync standby or not, and calculate the synced
 	 * positions among all sync standbys.
 	 */
-	got_recptr = SyncRepGetSyncRecPtr(, , , _sync);
+	got_recptr = SyncRepGetSyncRecPtr(, , );
 
 	/*
-	 * If we are managing a sync standby, though we weren't prior to this,
-	 * then announce we are now a sync standby.
+	 * If we don't have enough sync standbys, just leave.
 	 */
-	

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Tom Lane
Fujii Masao  writes:
> On 2020/04/14 22:52, Tom Lane wrote:
>> *Yes it does*.  The existing code can deliver entirely broken results
>> if some walsender exits between where we examine the priorities and
>> where we fetch the WAL pointers.

> IMO that the broken results can be delivered when walsender marked
> as sync exits *and* new walsender comes at that moment. If this new
> walsender uses the WalSnd slot that the exited walsender used,
> SyncRepGetOldestSyncRecPtr() wronly calculates the oldest lsn based
> on this new walsender (i.e., different walsender from one marked as sync).

Right, exactly, sorry that I was not more specific.

> BTW, since the patch changes the API of SyncRepGetSyncStandbys(),
> it should not be back-patched to avoid ABI break. Right?

Anything that is using that is just as broken as the core code is, for the
same reasons, so I don't have a problem with changing its API.  Maybe we
should rename it while we're at it, just to make it clear that we are
breaking any external callers.  (If there are any, which seems somewhat
unlikely.)

The only concession to ABI that I had in mind was to not re-order
the fields of WalSnd in the back branches.

regards, tom lane




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Fujii Masao




On 2020/04/14 22:52, Tom Lane wrote:

Kyotaro Horiguchi  writes:

SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so
sync_standby_priority of any walsender can be changed while the
function is scanning welsenders. The issue is we already have
inconsistent walsender information before we enter the function.  Thus
how many times we scan on the array doesn't make any difference.


*Yes it does*.  The existing code can deliver entirely broken results
if some walsender exits between where we examine the priorities and
where we fetch the WAL pointers.


So, in this case, the oldest lsn that SyncRepGetOldestSyncRecPtr()
calculates may be based on also the lsn of already-exited walsender.
This is what you say "broken results"? If yes, ISTM that this issue still
remains even after applying your patch. No? The walsender marked
as sync may still exit just before SyncRepGetOldestSyncRecPtr()
calculates the oldest lsn.

IMO that the broken results can be delivered when walsender marked
as sync exits *and* new walsender comes at that moment. If this new
walsender uses the WalSnd slot that the exited walsender used,
SyncRepGetOldestSyncRecPtr() wronly calculates the oldest lsn based
on this new walsender (i.e., different walsender from one marked as sync).
If this is actually what you tried to say "broken results", your patch
seems fine and fixes the issue.

BTW, since the patch changes the API of SyncRepGetSyncStandbys(),
it should not be back-patched to avoid ABI break. Right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Tom Lane
Kyotaro Horiguchi  writes:
> [ syncrep-fixes-4.patch ]

I agree that we could probably improve the clarity of this code with
further rewriting, but I'm still very opposed to the idea of having
callers know that the first num_sync array elements are the active
ones.  It's wrong (or at least different from current behavior) for
quorum mode, where there might be more than num_sync walsenders to
consider.  And it might not generalize very well to other syncrep
selection rules we might add in future, which might also not have
exactly num_sync interesting walsenders.  So I much prefer an API
definition that uses bool flags in an array that has no particular
ordering (so far as the callers know, anyway).  If you don't like
is_sync_standby, how about some more-neutral name like is_active
or is_interesting or include_position?

I dislike the proposed comment revisions in SyncRepReleaseWaiters,
too, particularly the change to say that what we're "announcing"
is the ability to release waiters.  You did not change the actual
log messages, and you would have gotten a lot of pushback if
you tried, because the current messages make sense to users and
something like that would not.  But by the same token this new
comment isn't too helpful to somebody reading the code.

(Actually, I wonder why we even have the restriction that only
sync standbys can release waiters.  It's not like they are
going to get different results from SyncRepGetSyncRecPtr than
any other walsender would.  Maybe we should just drop all the
am_sync logic?)

regards, tom lane




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Kyotaro Horiguchi
At Thu, 16 Apr 2020 16:48:28 +0900, Masahiko Sawada 
 wrote in 
> This is just a notice; I'm reading your latest patch but it seems to
> include unrelated changes:
> 
> $ git diff --stat
>  src/backend/replication/syncrep.c   | 475
> +---
>  src/backend/replication/walsender.c |  40 ++-
>  src/bin/pg_dump/compress_io.c   |  12 ++
>  src/bin/pg_dump/pg_backup_directory.c   |  48 ++-
>  src/include/replication/syncrep.h   |  20 +-
>  src/include/replication/walsender_private.h |  16 
>  6 files changed, 274 insertions(+), 337 deletions(-)

Ugg.  I failed to clean up working directory..  I didn't noticed as I
made the file by git diff. Thanks for noticing me of that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ffd5b31eb2..99a7bbbc86 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -103,19 +103,21 @@ static int	SyncRepWakeQueue(bool all, int mode);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
  XLogRecPtr *flushPtr,
- XLogRecPtr *applyPtr,
- bool *am_sync);
+ XLogRecPtr *applyPtr);
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 	   XLogRecPtr *flushPtr,
 	   XLogRecPtr *applyPtr,
-	   List *sync_standbys);
+	   SyncRepStandbyData *sync_standbys,
+	   int num_standbys,
+	   uint8 nsyncs);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 		  XLogRecPtr *flushPtr,
 		  XLogRecPtr *applyPtr,
-		  List *sync_standbys, uint8 nth);
+		  SyncRepStandbyData *sync_standbys,
+		  int num_standbys,
+		  uint8 nth);
 static int	SyncRepGetStandbyPriority(void);
-static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
-static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -406,9 +408,10 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
-		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SpinLockAcquire(>mutex);
 		MyWalSnd->sync_standby_priority = priority;
-		LWLockRelease(SyncRepLock);
+		SpinLockRelease(>mutex);
+
 		ereport(DEBUG1,
 (errmsg("standby \"%s\" now has synchronous standby priority %u",
 		application_name, priority)));
@@ -429,8 +432,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_recptr;
-	bool		am_sync;
+	bool		release_waiters;
 	int			numwrite = 0;
 	int			numflush = 0;
 	int			numapply = 0;
@@ -458,16 +460,24 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the synced
-	 * positions among all sync standbys.
+	 * Check whether we may release any of waiter processes, and calculate the
+	 * synced positions.
 	 */
-	got_recptr = SyncRepGetSyncRecPtr(, , , _sync);
+	release_waiters = SyncRepGetSyncRecPtr(, , );
+
+	/* Return if nothing to do. */
+	if (!release_waiters)
+	{
+		LWLockRelease(SyncRepLock);
+		announce_next_takeover = true;
+		return;
+	}
 
 	/*
-	 * If we are managing a sync standby, though we weren't prior to this,
-	 * then announce we are now a sync standby.
+	 * If this walsender becomes to be able to release waiter processes,
+	 * announce about that.
 	 */
-	if (announce_next_takeover && am_sync)
+	if (announce_next_takeover)
 	{
 		announce_next_takeover = false;
 
@@ -481,17 +491,6 @@ SyncRepReleaseWaiters(void)
 			application_name)));
 	}
 
-	/*
-	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
-	 */
-	if (!got_recptr || !am_sync)
-	{
-		LWLockRelease(SyncRepLock);
-		announce_next_takeover = !am_sync;
-		return;
-	}
-
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
 	 * this location.
@@ -523,43 +522,62 @@ SyncRepReleaseWaiters(void)
 /*
  * Calculate the synced Write, Flush and Apply positions among sync standbys.
  *
- * The caller must hold SyncRepLock.
- *
- * Return false if the number of sync standbys is less than
- * synchronous_standby_names specifies. Otherwise return true and
- * store the positions into *writePtr, *flushPtr and *applyPtr.
- *
- * On return, *am_sync is set to true if this walsender is connecting to
- * sync standby. Otherwise it's set to false.
+ * Return false if this walsender cannot release any of waiteres. Otherwise
+ * return 

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Masahiko Sawada
On Thu, 16 Apr 2020 at 16:22, Kyotaro Horiguchi  wrote:
>
> At Wed, 15 Apr 2020 11:31:49 -0400, Tom Lane  wrote in
> > Kyotaro Horiguchi  writes:
> > > At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane  wrote in
> > > +   stby->is_sync_standby = true;   /* might change below */
> >
> > > I'm uneasy with that.  In quorum mode all running standbys are marked
> > > as "sync" and that's bogus.
> >
> > I don't follow that?  The existing coding of SyncRepGetSyncStandbysQuorum
> > returns all the candidates in its list, so this is isomorphic to that.
>
> The existing code actully does that. On the other hand
> SyncRepGetSyncStandbysPriority returns standbys that *are known to be*
> synchronous, but *Quorum returns standbys that *can be* synchronous.
> What the two functions return are different from each other.  So it
> should be is_sync_standby for -Priority and is_sync_candidate for
> -Quorum.
>
> > Possibly a different name for the flag would be more suited?
> >
> > > On the other hand sync_standbys is already sorted in priority order so I 
> > > think we can get rid of the member by setting *am_sync as the follows.
> >
> > > SyncRepGetSyncRecPtr:
> > >   if (sync_standbys[i].is_me)
> > >   {
> > >   *am_sync = (i < SyncRepConfig->num_sync);
> > >   break;
> > >   }
> >
> > I disagree with this, it will change the behavior in the quorum case.
>
> Oops, you're right.  I find the whole thing there (and me) is a bit
> confusing. syncrep_method affects how some values (specifically
> am_sync and sync_standbys) are translated at several calling depths.
> And the *am_sync informs nothing in quorum mode.
>
> > In any case, a change like this will cause callers to know way more than
> > they ought to about the ordering of the array.  In my mind, the fact that
> > SyncRepGetSyncStandbysPriority is sorting the array is an internal
> > implementation detail; I do not want it to be part of the API.
>
> Anyway the am_sync and is_sync_standby is utterly useless in quorum
> mode.  That discussion is pretty persuasive if not, but actually the
> upper layers (SyncRepReleaseWaiters and SyncRepGetSyncRecPtr) referes
> to syncrep_method to differentiate the interpretation of the am_sync
> flag and sync_standbys list.  So anyway the difference is actually a
> part of API.
>
> After thinking some more, I concluded that some of the variables are
> wrongly named or considered, and redundant. The fucntion of am_sync is
> covered by got_recptr in SyncRepReleaseWaiters, so it's enough that
> SyncRepGetSyncRecPtr just reports to the caller whether the caller may
> release some of the waiter processes. This simplifies the related
> functions and make it (to me) clearer.
>
> Please find the attached.
>
>
> > (Apropos to that, I realized from working on this patch that there's
> > another, completely undocumented assumption in the existing code, that
> > the integer list will be sorted by walsender index for equal priorities.
> > I don't like that either, and not just because it's undocumented.)
>
> That seems accidentally. Sorting by priority is the disigned behavior
> and documented, in contrast, entries of the same priority are ordered
> in index order by accident and not documented, that means it can be
> changed anytime.  I think we don't define everyting in such detail.
>

This is just a notice; I'm reading your latest patch but it seems to
include unrelated changes:

$ git diff --stat
 src/backend/replication/syncrep.c   | 475
+---
 src/backend/replication/walsender.c |  40 ++-
 src/bin/pg_dump/compress_io.c   |  12 ++
 src/bin/pg_dump/pg_backup_directory.c   |  48 ++-
 src/include/replication/syncrep.h   |  20 +-
 src/include/replication/walsender_private.h |  16 
 6 files changed, 274 insertions(+), 337 deletions(-)


Regards,

--
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-16 Thread Kyotaro Horiguchi
At Wed, 15 Apr 2020 11:31:49 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane  wrote in 
> > +   stby->is_sync_standby = true;   /* might change below */
> 
> > I'm uneasy with that.  In quorum mode all running standbys are marked
> > as "sync" and that's bogus.
> 
> I don't follow that?  The existing coding of SyncRepGetSyncStandbysQuorum
> returns all the candidates in its list, so this is isomorphic to that.

The existing code actully does that. On the other hand
SyncRepGetSyncStandbysPriority returns standbys that *are known to be*
synchronous, but *Quorum returns standbys that *can be* synchronous.
What the two functions return are different from each other.  So it
should be is_sync_standby for -Priority and is_sync_candidate for
-Quorum.

> Possibly a different name for the flag would be more suited?
> 
> > On the other hand sync_standbys is already sorted in priority order so I 
> > think we can get rid of the member by setting *am_sync as the follows.
> 
> > SyncRepGetSyncRecPtr:
> >   if (sync_standbys[i].is_me)
> >   {
> >   *am_sync = (i < SyncRepConfig->num_sync);
> >   break;
> >   }
> 
> I disagree with this, it will change the behavior in the quorum case.

Oops, you're right.  I find the whole thing there (and me) is a bit
confusing. syncrep_method affects how some values (specifically
am_sync and sync_standbys) are translated at several calling depths.
And the *am_sync informs nothing in quorum mode.

> In any case, a change like this will cause callers to know way more than
> they ought to about the ordering of the array.  In my mind, the fact that
> SyncRepGetSyncStandbysPriority is sorting the array is an internal
> implementation detail; I do not want it to be part of the API.

Anyway the am_sync and is_sync_standby is utterly useless in quorum
mode.  That discussion is pretty persuasive if not, but actually the
upper layers (SyncRepReleaseWaiters and SyncRepGetSyncRecPtr) referes
to syncrep_method to differentiate the interpretation of the am_sync
flag and sync_standbys list.  So anyway the difference is actually a
part of API.

After thinking some more, I concluded that some of the variables are
wrongly named or considered, and redundant. The fucntion of am_sync is
covered by got_recptr in SyncRepReleaseWaiters, so it's enough that
SyncRepGetSyncRecPtr just reports to the caller whether the caller may
release some of the waiter processes. This simplifies the related
functions and make it (to me) clearer.

Please find the attached.


> (Apropos to that, I realized from working on this patch that there's
> another, completely undocumented assumption in the existing code, that
> the integer list will be sorted by walsender index for equal priorities.
> I don't like that either, and not just because it's undocumented.)

That seems accidentally. Sorting by priority is the disigned behavior
and documented, in contrast, entries of the same priority are ordered
in index order by accident and not documented, that means it can be
changed anytime.  I think we don't define everyting in such detail.

regards.

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ffd5b31eb2..99a7bbbc86 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -103,19 +103,21 @@ static int	SyncRepWakeQueue(bool all, int mode);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
  XLogRecPtr *flushPtr,
- XLogRecPtr *applyPtr,
- bool *am_sync);
+ XLogRecPtr *applyPtr);
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 	   XLogRecPtr *flushPtr,
 	   XLogRecPtr *applyPtr,
-	   List *sync_standbys);
+	   SyncRepStandbyData *sync_standbys,
+	   int num_standbys,
+	   uint8 nsyncs);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 		  XLogRecPtr *flushPtr,
 		  XLogRecPtr *applyPtr,
-		  List *sync_standbys, uint8 nth);
+		  SyncRepStandbyData *sync_standbys,
+		  int num_standbys,
+		  uint8 nth);
 static int	SyncRepGetStandbyPriority(void);
-static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
-static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -406,9 +408,10 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
-		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SpinLockAcquire(>mutex);
 		MyWalSnd->sync_standby_priority = priority;
-		LWLockRelease(SyncRepLock);
+		SpinLockRelease(>mutex);
+
 		ereport(DEBUG1,
 (errmsg("standby \"%s\" now has synchronous standby priority %u",
 		application_name, priority)));
@@ -429,8 +432,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-15 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane  wrote in 
> + stby->is_sync_standby = true;   /* might change below */

> I'm uneasy with that.  In quorum mode all running standbys are marked
> as "sync" and that's bogus.

I don't follow that?  The existing coding of SyncRepGetSyncStandbysQuorum
returns all the candidates in its list, so this is isomorphic to that.

Possibly a different name for the flag would be more suited?

> On the other hand sync_standbys is already sorted in priority order so I 
> think we can get rid of the member by setting *am_sync as the follows.

> SyncRepGetSyncRecPtr:
>   if (sync_standbys[i].is_me)
>   {
>   *am_sync = (i < SyncRepConfig->num_sync);
>   break;
>   }

I disagree with this, it will change the behavior in the quorum case.

In any case, a change like this will cause callers to know way more than
they ought to about the ordering of the array.  In my mind, the fact that
SyncRepGetSyncStandbysPriority is sorting the array is an internal
implementation detail; I do not want it to be part of the API.

(Apropos to that, I realized from working on this patch that there's
another, completely undocumented assumption in the existing code, that
the integer list will be sorted by walsender index for equal priorities.
I don't like that either, and not just because it's undocumented.)

regards, tom lane




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-15 Thread Kyotaro Horiguchi
At Wed, 15 Apr 2020 13:01:02 +0900, Masahiko Sawada 
 wrote in 
> On Tue, 14 Apr 2020 at 18:35, Kyotaro Horiguchi  
> wrote:
> >
> > At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada 
> >  wrote in
> > > On Tue, 14 Apr 2020 at 10:34, Tom Lane  wrote:
> > > >
> > > > Kyotaro Horiguchi  writes:
> > > > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane  
> > > > > wrote in
> > > > >> What I think we should do about this is, essentially, to get rid of
> > > > >> SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
> > > > >> whether *it* believes that it is a sync standby, based on its last
> > > > >> evaluation of the relevant GUCs.  This would be a bool that it'd
> > > > >> compute and set alongside sync_standby_priority.  (Hm, maybe we'd not
> > > >
> > > > > Mmm..  SyncRepGetStandbyPriority returns the "priority" that a
> > > > > walsender thinks it is at, among synchronous_standby_names.  Then to
> > > > > decide "I am a sync standby" we need to know how many walsenders with
> > > > > higher priority are alive now.  SyncRepGetSyncStandbyPriority does the
> > > > > judgment now and suffers from the inconsistency of priority values.
> > > >
> > > > Yeah.  After looking a bit closer, I think that the current definition
> > > > of sync_standby_priority (that is, as the result of local evaluation
> > > > of SyncRepGetStandbyPriority()) is OK.  The problem is what we're doing
> > > > with it.  I suggest that what we should do in SyncRepGetSyncRecPtr()
> > > > is make one sweep across the WalSnd array, collecting PID,
> > > > sync_standby_priority, *and* the WAL pointers from each valid entry.
> > > > Then examine that data and decide which WAL value we need, without 
> > > > assuming
> > > > that the sync_standby_priority values are necessarily totally 
> > > > consistent.
> > > > But in any case we must examine each entry just once while holding its
> > > > mutex, not go back to it later expecting it to still be the same.
> >
> > SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so
> > sync_standby_priority of any walsender can be changed while the
> > function is scanning welsenders. The issue is we already have
> > inconsistent walsender information before we enter the function.  Thus
> > how many times we scan on the array doesn't make any difference.
> >
> > I think we need to do one of the followings.
> >
> >  A) prevent SyncRepGetSyncStandbysPriority from being entered while
> > walsender priority is inconsistent.
> >
> >  B) make SyncRepGetSyncStandbysPriority be tolerant of priority
> > inconsistency.
> >
> >  C) protect walsender priority array from beinig inconsistent.
> >
> > The (B) is the band aids. To achieve A we need to central controller
> > of priority config handling.  C is:
> >
> > > Can we have a similar approach of sync_standby_defined for
> > > sync_standby_priority? That is, checkpionter is responsible for
> > > changing sync_standby_priority of all walsenders when SIGHUP. That
> > > way, all walsenders can see a consistent view of
> > > sync_standby_priority. And when a walsender starts, it sets
> > > sync_standby_priority by itself. The logic to decide who's a sync
> > > standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
> > > having higher priority along with their WAL positions.
> >
> > Yeah, it works if we do , but the problem of that way is that to
> > determin priority of walsenders, we need to know what walsenders are
> > running. That is, when new walsender comes the process needs to aware
> > of the arrival (or leaving) right away and reassign the priority of
> > every wal senders again.
> 
> I think we don't need to reassign the priority when new walsender
> comes or leaves. IIUC The priority is calculated based on only
> synchronous_standby_names. Coming or leaving a walsender doesn't
> affect other's priorities.

Sorry, the "priority" in this area is a bit confusing.  The "priority"
defined by synchronous_standby_names is determined in isolation from
the presence of walsenders.  The "priority" in
walsnd->sync_standby_priority needs walsender presence to determine.
I thought of the latter in the discussion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-14 Thread Masahiko Sawada
On Tue, 14 Apr 2020 at 18:35, Kyotaro Horiguchi  wrote:
>
> At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada 
>  wrote in
> > On Tue, 14 Apr 2020 at 10:34, Tom Lane  wrote:
> > >
> > > Kyotaro Horiguchi  writes:
> > > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane  wrote 
> > > > in
> > > >> What I think we should do about this is, essentially, to get rid of
> > > >> SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
> > > >> whether *it* believes that it is a sync standby, based on its last
> > > >> evaluation of the relevant GUCs.  This would be a bool that it'd
> > > >> compute and set alongside sync_standby_priority.  (Hm, maybe we'd not
> > >
> > > > Mmm..  SyncRepGetStandbyPriority returns the "priority" that a
> > > > walsender thinks it is at, among synchronous_standby_names.  Then to
> > > > decide "I am a sync standby" we need to know how many walsenders with
> > > > higher priority are alive now.  SyncRepGetSyncStandbyPriority does the
> > > > judgment now and suffers from the inconsistency of priority values.
> > >
> > > Yeah.  After looking a bit closer, I think that the current definition
> > > of sync_standby_priority (that is, as the result of local evaluation
> > > of SyncRepGetStandbyPriority()) is OK.  The problem is what we're doing
> > > with it.  I suggest that what we should do in SyncRepGetSyncRecPtr()
> > > is make one sweep across the WalSnd array, collecting PID,
> > > sync_standby_priority, *and* the WAL pointers from each valid entry.
> > > Then examine that data and decide which WAL value we need, without 
> > > assuming
> > > that the sync_standby_priority values are necessarily totally consistent.
> > > But in any case we must examine each entry just once while holding its
> > > mutex, not go back to it later expecting it to still be the same.
>
> SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so
> sync_standby_priority of any walsender can be changed while the
> function is scanning welsenders. The issue is we already have
> inconsistent walsender information before we enter the function.  Thus
> how many times we scan on the array doesn't make any difference.
>
> I think we need to do one of the followings.
>
>  A) prevent SyncRepGetSyncStandbysPriority from being entered while
> walsender priority is inconsistent.
>
>  B) make SyncRepGetSyncStandbysPriority be tolerant of priority
> inconsistency.
>
>  C) protect walsender priority array from beinig inconsistent.
>
> The (B) is the band aids. To achieve A we need to central controller
> of priority config handling.  C is:
>
> > Can we have a similar approach of sync_standby_defined for
> > sync_standby_priority? That is, checkpionter is responsible for
> > changing sync_standby_priority of all walsenders when SIGHUP. That
> > way, all walsenders can see a consistent view of
> > sync_standby_priority. And when a walsender starts, it sets
> > sync_standby_priority by itself. The logic to decide who's a sync
> > standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
> > having higher priority along with their WAL positions.
>
> Yeah, it works if we do , but the problem of that way is that to
> determin priority of walsenders, we need to know what walsenders are
> running. That is, when new walsender comes the process needs to aware
> of the arrival (or leaving) right away and reassign the priority of
> every wal senders again.

I think we don't need to reassign the priority when new walsender
comes or leaves. IIUC The priority is calculated based on only
synchronous_standby_names. Coming or leaving a walsender doesn't
affect other's priorities.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-14 Thread Noah Misch
On Tue, Apr 14, 2020 at 04:32:40PM -0400, Tom Lane wrote:
> I wrote:
> > It doesn't seem to me to be that hard to implement the desired
> > semantics for synchronous_standby_names with inconsistent info.
> > In FIRST mode you basically just need to take the N smallest
> > priorities you see in the array, but without assuming there are no
> > duplicates or holes.  It might be a good idea to include ties at the
> > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
> > standbys, include the first three of them in the calculation until
> > the inconsistency is resolved.  In ANY mode I don't see that
> > inconsistent priorities matter at all.
> 
> Concretely, I think we ought to do the attached, or something pretty
> close to it.
> 
> I'm not really happy about breaking ties based on walsnd_index,
> but I see that there are several TAP test cases that fail if we
> do something else.  I'm inclined to think those tests are bogus ...
> but I won't argue to change them right now.

This passes the test battery I wrote in preparation for the 2020-02 thread.




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-14 Thread Kyotaro Horiguchi
At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane  wrote in 
> I wrote:
> > It doesn't seem to me to be that hard to implement the desired
> > semantics for synchronous_standby_names with inconsistent info.
> > In FIRST mode you basically just need to take the N smallest
> > priorities you see in the array, but without assuming there are no
> > duplicates or holes.  It might be a good idea to include ties at the
> > end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
> > standbys, include the first three of them in the calculation until
> > the inconsistency is resolved.  In ANY mode I don't see that
> > inconsistent priorities matter at all.
> 
> Concretely, I think we ought to do the attached, or something pretty
> close to it.

Looking SyncRepGetSyncStandbys, I agree that it's good not assuming
lowest_priority, which I thought as the culprit of the assertion
failure.  The current code intends to use less memory.  I don't think
there is a case where only 3 out of 1000 standbys are required to be
sync-standby so collecting all wal senders then sorting them seems
reasonable strategy.  The new code looks clearer.

+   stby->is_sync_standby = true;   /* might change below */

I'm uneasy with that.  In quorum mode all running standbys are marked
as "sync" and that's bogus.

The only users of the flag seems to be:

SyncRepGetSyncRecPtr:
+   *am_sync = sync_standbys[i].is_sync_standby;

and

SyncRepGetOldestSyncRecPtr:
+   /* Ignore candidates that aren't considered synchronous */
+   if (!sync_standbys[i].is_sync_standby)
+   continue;

On the other hand sync_standbys is already sorted in priority order so I think 
we can get rid of the member by setting *am_sync as the follows.


SyncRepGetSyncRecPtr:
  if (sync_standbys[i].is_me)
  {
  *am_sync = (i < SyncRepConfig->num_sync);
  break;
  }

And the second user can be as the follows.

SyncRepGetOldestSyncRecPtr:
   /* Ignore candidates that aren't considered synchronous */
   if (i >= SyncRepConfig->num_sync)
   break;

> I'm not really happy about breaking ties based on walsnd_index,
> but I see that there are several TAP test cases that fail if we
> do something else.  I'm inclined to think those tests are bogus ...
> but I won't argue to change them right now.

Agreed about the tie-breaker.

I'm looking this more closer.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-14 Thread Kyotaro Horiguchi
At Tue, 14 Apr 2020 09:52:42 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so
> > sync_standby_priority of any walsender can be changed while the
> > function is scanning welsenders. The issue is we already have
> > inconsistent walsender information before we enter the function.  Thus
> > how many times we scan on the array doesn't make any difference.
> 
> *Yes it does*.  The existing code can deliver entirely broken results
> if some walsender exits between where we examine the priorities and
> where we fetch the WAL pointers.  While that doesn't seem to be the

Ah. I didn't take that as inconsistency. Actually walsender exit
inactivate the corresponding slot by setting pid = 0.  In a bad case
(as you mentioned upthread) the entry can be occupied by another wal
sender. However, sync_standby_priority cannot be updated until the
whole work is finished.

> exact issue we're seeing in the buildfarm, it's still another obvious
> bug in this code.  I will not accept a "fix" that doesn't fix that.

I think that the "inconsistency" that can be observed in a process is
disagreement between SyncRepConfig->nmembers and
->sync_standby_priority.  If any one of walsenders
regards its priority as lower (larger in value) than nmembers in the
"current" process, the assertion fires.  If that is the issue, the
issue is not dynamic inconsistency.

# It's the assumption of my band-aid.

> > I think we need to do one of the followings.
> 
> >  A) prevent SyncRepGetSyncStandbysPriority from being entered while
> > walsender priority is inconsistent.
> >  B) make SyncRepGetSyncStandbysPriority be tolerant of priority
> > inconsistency.
> >  C) protect walsender priority array from beinig inconsistent.
> 
> (B) seems like the only practical solution from here.  We could
> probably arrange for synchronous update of the priorities when
> they change in response to a GUC change, but it doesn't seem to
> me to be practical to do that in response to walsender exit.
> You'd end up finding that an unexpected walsender exit results
> in panic'ing the system, which is no better than where we are now.

I agree to you as a whole.  I thought of several ways to keep the
consistency of the array but all of them looked too much.

> It doesn't seem to me to be that hard to implement the desired
> semantics for synchronous_standby_names with inconsistent info.
> In FIRST mode you basically just need to take the N smallest
> priorities you see in the array, but without assuming there are no
> duplicates or holes.  It might be a good idea to include ties at the
> end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
> standbys, include the first three of them in the calculation until
> the inconsistency is resolved.  In ANY mode I don't see that
> inconsistent priorities matter at all.

Mmm, the priority lists like 1,2,2,4 are not thought as inconsistency
at all in the context of walsender priority.  That happenes stablly if
any two or more walreceivers reported the same application_name. I
believe the existing code is already taking that case into
consideration.

> > If we accept to share variable-length information among processes,
> > sharing sync_standby_names or parsed SyncRepConfigData among processes
> > would work.
> 
> Not sure that we really need more than what's being shared now,
> ie each process's last-known index in the sync_standby_names list.

If we take the (B), we don't need anymore.  (A) and (C) would need
more.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-14 Thread Tom Lane
I wrote:
> It doesn't seem to me to be that hard to implement the desired
> semantics for synchronous_standby_names with inconsistent info.
> In FIRST mode you basically just need to take the N smallest
> priorities you see in the array, but without assuming there are no
> duplicates or holes.  It might be a good idea to include ties at the
> end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
> standbys, include the first three of them in the calculation until
> the inconsistency is resolved.  In ANY mode I don't see that
> inconsistent priorities matter at all.

Concretely, I think we ought to do the attached, or something pretty
close to it.

I'm not really happy about breaking ties based on walsnd_index,
but I see that there are several TAP test cases that fail if we
do something else.  I'm inclined to think those tests are bogus ...
but I won't argue to change them right now.

regards, tom lane

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ffd5b31..c66c371 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -108,14 +108,17 @@ static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 	   XLogRecPtr *flushPtr,
 	   XLogRecPtr *applyPtr,
-	   List *sync_standbys);
+	   SyncRepStandbyData *sync_standbys,
+	   int num_standbys);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 		  XLogRecPtr *flushPtr,
 		  XLogRecPtr *applyPtr,
-		  List *sync_standbys, uint8 nth);
+		  SyncRepStandbyData *sync_standbys,
+		  int num_standbys,
+		  uint8 nth);
 static int	SyncRepGetStandbyPriority(void);
-static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
-static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static void SyncRepGetSyncStandbysPriority(SyncRepStandbyData *standbys, int n);
+static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -406,9 +409,10 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
-		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SpinLockAcquire(>mutex);
 		MyWalSnd->sync_standby_priority = priority;
-		LWLockRelease(SyncRepLock);
+		SpinLockRelease(>mutex);
+
 		ereport(DEBUG1,
 (errmsg("standby \"%s\" now has synchronous standby priority %u",
 		application_name, priority)));
@@ -523,8 +527,6 @@ SyncRepReleaseWaiters(void)
 /*
  * Calculate the synced Write, Flush and Apply positions among sync standbys.
  *
- * The caller must hold SyncRepLock.
- *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
  * store the positions into *writePtr, *flushPtr and *applyPtr.
@@ -536,27 +538,43 @@ static bool
 SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	 XLogRecPtr *applyPtr, bool *am_sync)
 {
-	List	   *sync_standbys;
-
-	Assert(LWLockHeldByMe(SyncRepLock));
+	SyncRepStandbyData *sync_standbys;
+	int			num_standbys;
+	int			i;
 
+	/* Initialize default results */
 	*writePtr = InvalidXLogRecPtr;
 	*flushPtr = InvalidXLogRecPtr;
 	*applyPtr = InvalidXLogRecPtr;
 	*am_sync = false;
 
+	/* Quick out if not even configured to be synchronous */
+	if (SyncRepConfig == NULL)
+		return false;
+
 	/* Get standbys that are considered as synchronous at this moment */
-	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+	num_standbys = SyncRepGetSyncStandbys(_standbys);
+
+	/* Am I among the candidate sync standbys? */
+	for (i = 0; i < num_standbys; i++)
+	{
+		if (sync_standbys[i].is_me)
+		{
+			*am_sync = sync_standbys[i].is_sync_standby;
+			break;
+		}
+	}
 
 	/*
-	 * Quick exit if we are not managing a sync standby or there are not
-	 * enough synchronous standbys.
+	 * Nothing more to do if we are not managing a sync standby or there are
+	 * not enough synchronous standbys.  (Note: if there are least num_sync
+	 * candidates, then at least num_sync of them will be marked as
+	 * is_sync_standby; we don't need to count them here.)
 	 */
 	if (!(*am_sync) ||
-		SyncRepConfig == NULL ||
-		list_length(sync_standbys) < SyncRepConfig->num_sync)
+		num_standbys < SyncRepConfig->num_sync)
 	{
-		list_free(sync_standbys);
+		pfree(sync_standbys);
 		return false;
 	}
 
@@ -576,15 +594,16 @@ SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY)
 	{
 		SyncRepGetOldestSyncRecPtr(writePtr, flushPtr, applyPtr,
-   sync_standbys);
+   sync_standbys, num_standbys);
 	}
 	else
 	{
 		SyncRepGetNthLatestSyncRecPtr(writePtr, flushPtr, applyPtr,
-	  sync_standbys, SyncRepConfig->num_sync);
+	  sync_standbys, num_standbys,
+	  SyncRepConfig->num_sync);

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-14 Thread Tom Lane
Kyotaro Horiguchi  writes:
> SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so
> sync_standby_priority of any walsender can be changed while the
> function is scanning welsenders. The issue is we already have
> inconsistent walsender information before we enter the function.  Thus
> how many times we scan on the array doesn't make any difference.

*Yes it does*.  The existing code can deliver entirely broken results
if some walsender exits between where we examine the priorities and
where we fetch the WAL pointers.  While that doesn't seem to be the
exact issue we're seeing in the buildfarm, it's still another obvious
bug in this code.  I will not accept a "fix" that doesn't fix that.

> I think we need to do one of the followings.

>  A) prevent SyncRepGetSyncStandbysPriority from being entered while
> walsender priority is inconsistent.
>  B) make SyncRepGetSyncStandbysPriority be tolerant of priority
> inconsistency.
>  C) protect walsender priority array from beinig inconsistent.

(B) seems like the only practical solution from here.  We could
probably arrange for synchronous update of the priorities when
they change in response to a GUC change, but it doesn't seem to
me to be practical to do that in response to walsender exit.
You'd end up finding that an unexpected walsender exit results
in panic'ing the system, which is no better than where we are now.

It doesn't seem to me to be that hard to implement the desired
semantics for synchronous_standby_names with inconsistent info.
In FIRST mode you basically just need to take the N smallest
priorities you see in the array, but without assuming there are no
duplicates or holes.  It might be a good idea to include ties at the
end, that is if you see 1,2,2,4 or 1,3,3,4 and you want 2 sync
standbys, include the first three of them in the calculation until
the inconsistency is resolved.  In ANY mode I don't see that
inconsistent priorities matter at all.

> If we accept to share variable-length information among processes,
> sharing sync_standby_names or parsed SyncRepConfigData among processes
> would work.

Not sure that we really need more than what's being shared now,
ie each process's last-known index in the sync_standby_names list.

regards, tom lane




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-14 Thread Kyotaro Horiguchi
At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada 
 wrote in 
> On Tue, 14 Apr 2020 at 10:34, Tom Lane  wrote:
> >
> > Kyotaro Horiguchi  writes:
> > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane  wrote in
> > >> What I think we should do about this is, essentially, to get rid of
> > >> SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
> > >> whether *it* believes that it is a sync standby, based on its last
> > >> evaluation of the relevant GUCs.  This would be a bool that it'd
> > >> compute and set alongside sync_standby_priority.  (Hm, maybe we'd not
> >
> > > Mmm..  SyncRepGetStandbyPriority returns the "priority" that a
> > > walsender thinks it is at, among synchronous_standby_names.  Then to
> > > decide "I am a sync standby" we need to know how many walsenders with
> > > higher priority are alive now.  SyncRepGetSyncStandbyPriority does the
> > > judgment now and suffers from the inconsistency of priority values.
> >
> > Yeah.  After looking a bit closer, I think that the current definition
> > of sync_standby_priority (that is, as the result of local evaluation
> > of SyncRepGetStandbyPriority()) is OK.  The problem is what we're doing
> > with it.  I suggest that what we should do in SyncRepGetSyncRecPtr()
> > is make one sweep across the WalSnd array, collecting PID,
> > sync_standby_priority, *and* the WAL pointers from each valid entry.
> > Then examine that data and decide which WAL value we need, without assuming
> > that the sync_standby_priority values are necessarily totally consistent.
> > But in any case we must examine each entry just once while holding its
> > mutex, not go back to it later expecting it to still be the same.

SyncRepGetSyncStandbysPriority() is runing holding SyncRepLock so
sync_standby_priority of any walsender can be changed while the
function is scanning welsenders. The issue is we already have
inconsistent walsender information before we enter the function.  Thus
how many times we scan on the array doesn't make any difference.

I think we need to do one of the followings.

 A) prevent SyncRepGetSyncStandbysPriority from being entered while
walsender priority is inconsistent.

 B) make SyncRepGetSyncStandbysPriority be tolerant of priority
inconsistency.

 C) protect walsender priority array from beinig inconsistent.

The (B) is the band aids. To achieve A we need to central controller
of priority config handling.  C is:

> Can we have a similar approach of sync_standby_defined for
> sync_standby_priority? That is, checkpionter is responsible for
> changing sync_standby_priority of all walsenders when SIGHUP. That
> way, all walsenders can see a consistent view of
> sync_standby_priority. And when a walsender starts, it sets
> sync_standby_priority by itself. The logic to decide who's a sync
> standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
> having higher priority along with their WAL positions.

Yeah, it works if we do , but the problem of that way is that to
determin priority of walsenders, we need to know what walsenders are
running. That is, when new walsender comes the process needs to aware
of the arrival (or leaving) right away and reassign the priority of
every wal senders again.

If we accept to share variable-length information among processes,
sharing sync_standby_names or parsed SyncRepConfigData among processes
would work.


> >
> > Another thing that I'm finding interesting is that I now see this is
> > not at all new code.  It doesn't look like SyncRepGetSyncStandbysPriority
> > has changed much since 2016.  So how come we didn't detect this problem
> > long ago?  I searched the buildfarm logs for assertion failures in
> > syncrep.c, looking back one year, and here's what I found:
...
> > The line numbers vary in the back branches, but all of these crashes are
> > at that same Assert.  So (a) yes, this does happen in the back branches,
> > but (b) some fairly recent change has made it a whole lot more probable.
> > Neither syncrep.c nor 007_sync_rep.pl have changed much in some time,
> > so whatever the change was was indirect.  Curious.  Is it just timing?
> 
> Interesting. It's happening on certain animals, not all. Especially
> tests with HEAD on sidewinder and curculio, which are NetBSD 7 and
> OpenBSD 5.9 respectively, started to fail at a high rate since a
> couple of days ago.

Coundn't this align the timing of config reloading? (I didn't checked
anything yet.)

| commit 421685812290406daea58b78dfab0346eb683bbb
| Author: Noah Misch 
| Date:   Sat Apr 11 10:30:00 2020 -0700
| 
| When WalSndCaughtUp, sleep only in WalSndWaitForWal().
|Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
|< sentPtr.  That is important in logical replication.  When the latest
|physical LSN yields no logical replication messages (a common case),
|that keepalive elicits a reply, and processing the reply updates
|pg_stat_replication.replay_lsn.  WalSndLoop() lacks that; 

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-13 Thread Masahiko Sawada
On Tue, 14 Apr 2020 at 10:34, Tom Lane  wrote:
>
> Kyotaro Horiguchi  writes:
> > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane  wrote in
> >> What I think we should do about this is, essentially, to get rid of
> >> SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
> >> whether *it* believes that it is a sync standby, based on its last
> >> evaluation of the relevant GUCs.  This would be a bool that it'd
> >> compute and set alongside sync_standby_priority.  (Hm, maybe we'd not
>
> > Mmm..  SyncRepGetStandbyPriority returns the "priority" that a
> > walsender thinks it is at, among synchronous_standby_names.  Then to
> > decide "I am a sync standby" we need to know how many walsenders with
> > higher priority are alive now.  SyncRepGetSyncStandbyPriority does the
> > judgment now and suffers from the inconsistency of priority values.
>
> Yeah.  After looking a bit closer, I think that the current definition
> of sync_standby_priority (that is, as the result of local evaluation
> of SyncRepGetStandbyPriority()) is OK.  The problem is what we're doing
> with it.  I suggest that what we should do in SyncRepGetSyncRecPtr()
> is make one sweep across the WalSnd array, collecting PID,
> sync_standby_priority, *and* the WAL pointers from each valid entry.
> Then examine that data and decide which WAL value we need, without assuming
> that the sync_standby_priority values are necessarily totally consistent.
> But in any case we must examine each entry just once while holding its
> mutex, not go back to it later expecting it to still be the same.

Can we have a similar approach of sync_standby_defined for
sync_standby_priority? That is, checkpionter is responsible for
changing sync_standby_priority of all walsenders when SIGHUP. That
way, all walsenders can see a consistent view of
sync_standby_priority. And when a walsender starts, it sets
sync_standby_priority by itself. The logic to decide who's a sync
standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
having higher priority along with their WAL positions.

>
> Another thing that I'm finding interesting is that I now see this is
> not at all new code.  It doesn't look like SyncRepGetSyncStandbysPriority
> has changed much since 2016.  So how come we didn't detect this problem
> long ago?  I searched the buildfarm logs for assertion failures in
> syncrep.c, looking back one year, and here's what I found:
>
>   sysname   |branch |  snapshot   | stage |   
> l
> +---+-+---+---
>  nightjar   | REL_10_STABLE | 2019-08-13 23:04:41 | recoveryCheck | TRAP: 
> FailedAssertion("!(((bool) 0))", File: 
> "/pgbuild/root/REL_10_STABLE/pgsql.build/../pgsql/src/backend/replication/syncrep.c",
>  Line: 940)
>  hoverfly   | REL9_6_STABLE | 2019-11-07 17:19:12 | recoveryCheck | TRAP: 
> FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
>  hoverfly   | HEAD  | 2019-11-22 12:15:08 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: "syncrep.c", Line: 951)
>  francolin  | HEAD  | 2020-01-16 23:10:06 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: 
> "/home/andres/build/buildfarm-francolin/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c",
>  Line: 951)
>  hoverfly   | REL_11_STABLE | 2020-02-29 01:34:55 | recoveryCheck | TRAP: 
> FailedAssertion("!(0)", File: "syncrep.c", Line: 946)
>  hoverfly   | REL9_6_STABLE | 2020-03-26 13:51:15 | recoveryCheck | TRAP: 
> FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
>  hoverfly   | REL9_6_STABLE | 2020-04-07 21:52:07 | recoveryCheck | TRAP: 
> FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
>  curculio   | HEAD  | 2020-04-11 18:30:21 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: "syncrep.c", Line: 951)
>  sidewinder | HEAD  | 2020-04-11 18:45:39 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: "syncrep.c", Line: 951)
>  curculio   | HEAD  | 2020-04-11 20:30:26 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: "syncrep.c", Line: 951)
>  sidewinder | HEAD  | 2020-04-11 21:45:48 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: "syncrep.c", Line: 951)
>  sidewinder | HEAD  | 2020-04-13 10:45:35 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: "syncrep.c", Line: 951)
>  conchuela  | HEAD  | 2020-04-13 16:00:18 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: 
> "/home/pgbf/buildroot/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c",
>  Line: 951)
>  sidewinder | HEAD  | 2020-04-13 18:45:34 | recoveryCheck | TRAP: 
> FailedAssertion("false", File: "syncrep.c", Line: 951)
> (14 rows)
>
> The line numbers vary in 

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-13 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane  wrote in 
>> What I think we should do about this is, essentially, to get rid of
>> SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
>> whether *it* believes that it is a sync standby, based on its last
>> evaluation of the relevant GUCs.  This would be a bool that it'd
>> compute and set alongside sync_standby_priority.  (Hm, maybe we'd not

> Mmm..  SyncRepGetStandbyPriority returns the "priority" that a
> walsender thinks it is at, among synchronous_standby_names.  Then to
> decide "I am a sync standby" we need to know how many walsenders with
> higher priority are alive now.  SyncRepGetSyncStandbyPriority does the
> judgment now and suffers from the inconsistency of priority values.

Yeah.  After looking a bit closer, I think that the current definition
of sync_standby_priority (that is, as the result of local evaluation
of SyncRepGetStandbyPriority()) is OK.  The problem is what we're doing
with it.  I suggest that what we should do in SyncRepGetSyncRecPtr()
is make one sweep across the WalSnd array, collecting PID,
sync_standby_priority, *and* the WAL pointers from each valid entry.
Then examine that data and decide which WAL value we need, without assuming
that the sync_standby_priority values are necessarily totally consistent.
But in any case we must examine each entry just once while holding its
mutex, not go back to it later expecting it to still be the same.

Another thing that I'm finding interesting is that I now see this is
not at all new code.  It doesn't look like SyncRepGetSyncStandbysPriority
has changed much since 2016.  So how come we didn't detect this problem
long ago?  I searched the buildfarm logs for assertion failures in
syncrep.c, looking back one year, and here's what I found:

  sysname   |branch |  snapshot   | stage | 
  l 
  
+---+-+---+---
 nightjar   | REL_10_STABLE | 2019-08-13 23:04:41 | recoveryCheck | TRAP: 
FailedAssertion("!(((bool) 0))", File: 
"/pgbuild/root/REL_10_STABLE/pgsql.build/../pgsql/src/backend/replication/syncrep.c",
 Line: 940)
 hoverfly   | REL9_6_STABLE | 2019-11-07 17:19:12 | recoveryCheck | TRAP: 
FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
 hoverfly   | HEAD  | 2019-11-22 12:15:08 | recoveryCheck | TRAP: 
FailedAssertion("false", File: "syncrep.c", Line: 951)
 francolin  | HEAD  | 2020-01-16 23:10:06 | recoveryCheck | TRAP: 
FailedAssertion("false", File: 
"/home/andres/build/buildfarm-francolin/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c",
 Line: 951)
 hoverfly   | REL_11_STABLE | 2020-02-29 01:34:55 | recoveryCheck | TRAP: 
FailedAssertion("!(0)", File: "syncrep.c", Line: 946)
 hoverfly   | REL9_6_STABLE | 2020-03-26 13:51:15 | recoveryCheck | TRAP: 
FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
 hoverfly   | REL9_6_STABLE | 2020-04-07 21:52:07 | recoveryCheck | TRAP: 
FailedAssertion("!(((bool) 0))", File: "syncrep.c", Line: 723)
 curculio   | HEAD  | 2020-04-11 18:30:21 | recoveryCheck | TRAP: 
FailedAssertion("false", File: "syncrep.c", Line: 951)
 sidewinder | HEAD  | 2020-04-11 18:45:39 | recoveryCheck | TRAP: 
FailedAssertion("false", File: "syncrep.c", Line: 951)
 curculio   | HEAD  | 2020-04-11 20:30:26 | recoveryCheck | TRAP: 
FailedAssertion("false", File: "syncrep.c", Line: 951)
 sidewinder | HEAD  | 2020-04-11 21:45:48 | recoveryCheck | TRAP: 
FailedAssertion("false", File: "syncrep.c", Line: 951)
 sidewinder | HEAD  | 2020-04-13 10:45:35 | recoveryCheck | TRAP: 
FailedAssertion("false", File: "syncrep.c", Line: 951)
 conchuela  | HEAD  | 2020-04-13 16:00:18 | recoveryCheck | TRAP: 
FailedAssertion("false", File: 
"/home/pgbf/buildroot/HEAD/pgsql.build/../pgsql/src/backend/replication/syncrep.c",
 Line: 951)
 sidewinder | HEAD  | 2020-04-13 18:45:34 | recoveryCheck | TRAP: 
FailedAssertion("false", File: "syncrep.c", Line: 951)
(14 rows)

The line numbers vary in the back branches, but all of these crashes are
at that same Assert.  So (a) yes, this does happen in the back branches,
but (b) some fairly recent change has made it a whole lot more probable.
Neither syncrep.c nor 007_sync_rep.pl have changed much in some time,
so whatever the change was was indirect.  Curious.  Is it just timing?

I'm giving the side-eye to Noah's recent changes 328c70997 and 421685812,
but this isn't enough evidence to say definitely that that's what boosted
the failure rate.

regards, tom lane




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-13 Thread Kyotaro Horiguchi
At Mon, 13 Apr 2020 15:31:01 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane  wrote in 
> > The point that I was trying to make originally is that it seems quite
> > insane to imagine that a walsender's sync_standby_priority value is
> > somehow more stable than the very existence of the process.  Yet we
> > only require a walsender to lock its own mutex while claiming or
> > disowning its WalSnd entry (by setting or clearing the pid field).
> > So I think it's nuts to define those fields as being protected by
> > the global SyncRepLock.
> 
> Right. FWIW, furthermore, even SyncRepConfig->syncrep_method can be
> inconsistent among walsenders.  I haven't thought that it can be
> relied on as always consistent and it is enough that it makes a
> consistent result only while the setting and the set of walsenders is
> stable.

Yes, the sentene "and (I haven't thought that) it is enough .." is a
mistake of "and I have thought that it is enough that..".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-13 Thread Kyotaro Horiguchi
At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane  wrote in 
> Masahiko Sawada  writes:
> > On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
> >  wrote:
> >>> Therefore, the band-aid fix seems to be to set the lowest priority to
> >>> very large number at the beginning of SyncRepGetSyncStandbysPriority().
> 
> >> I think we can use max_wal_senders.
> 
> > Sorry, that's not true. We need another number large enough.
> 
> The buildfarm had another three failures of this type today, so that
> motivated me to look at it some more.  I don't think this code needs
> a band-aid fix; I think "nuke from orbit" is more nearly the right
> level of response.
> 
> The point that I was trying to make originally is that it seems quite
> insane to imagine that a walsender's sync_standby_priority value is
> somehow more stable than the very existence of the process.  Yet we
> only require a walsender to lock its own mutex while claiming or
> disowning its WalSnd entry (by setting or clearing the pid field).
> So I think it's nuts to define those fields as being protected by
> the global SyncRepLock.

Right. FWIW, furthermore, even SyncRepConfig->syncrep_method can be
inconsistent among walsenders.  I haven't thought that it can be
relied on as always consistent and it is enough that it makes a
consistent result only while the setting and the set of walsenders is
stable.

> Even without considering the possibility that a walsender has just
> started or stopped, we have the problem Fujii-san described that after
> a change in the synchronous_standby_names GUC setting, different
> walsenders will update their values of sync_standby_priority at
> different instants.  (BTW, I now notice that Noah had previously
> identified this problem at [1].)
> 
> Thus, even while holding SyncRepLock, we do not have a guarantee that
> we'll see a consistent set of sync_standby_priority values.  In fact
> we don't even know that the walsnd array entries still belong to the
> processes that last set those values.  This is what is breaking
> SyncRepGetSyncStandbysPriority, and what it means is that there's
> really fundamentally no chance of that function producing trustworthy
> results.  The "band aid" fixes discussed here might avoid crashing on
> the Assert, but they won't fix the problems that (a) the result is
> possibly wrong and (b) it can become stale immediately even if it's
> right when returned.

Agreed. And I thought that it's not a problem if we had wrong result
temporarily. And the unstability persists for the standby-reply
interval at most (unless the next cause of instability comes).

> Now, there are only two callers of SyncRepGetSyncStandbys:
> SyncRepGetSyncRecPtr and pg_stat_get_wal_senders.  The latter is
> mostly cosmetic (which is a good thing, because to add insult to
> injury, it continues to use the list after releasing SyncRepLock;
> not that continuing to hold that lock would make things much safer).
> If I'm reading the code correctly, the former doesn't really care
> exactly which walsenders are sync standbys: all it cares about is
> to collect their WAL position pointers.

Agreed.  To find the sync standby with the largest delay.

> What I think we should do about this is, essentially, to get rid of
> SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
> whether *it* believes that it is a sync standby, based on its last
> evaluation of the relevant GUCs.  This would be a bool that it'd
> compute and set alongside sync_standby_priority.  (Hm, maybe we'd not

Mmm..  SyncRepGetStandbyPriority returns the "priority" that a
walsender thinks it is at, among synchronous_standby_names.  Then to
decide "I am a sync standby" we need to know how many walsenders with
higher priority are alive now.  SyncRepGetSyncStandbyPriority does the
judgment now and suffers from the inconsistency of priority values.

In short, it seems to me like moving the problem into another
place. But I think that there might be a smarter way to find "I am
sync".

> even need to have that field anymore?  Not sure.)  We should also
> redefine that flag, and sync_standby_priority if it survives, as being
> protected by the per-walsender mutex not SyncRepLock.  Then, what
> SyncRepGetSyncRecPtr would do is just sweep through the walsender
> array and collect WAL position pointers from the walsenders that
> claim to be sync standbys at the instant that it's inspecting them.
> pg_stat_get_wal_senders could also use those flags instead of the
> list from SyncRepGetSyncStandbys.
> 
> It's likely that this definition would have slightly different
> behavior from the current implementation during the period where
> the system is absorbing a change in the set of synchronous
> walsenders.  However, since the current implementation is visibly
> wrong during that period anyway, I'm not sure how this could be
> worse.  And at least we can be certain that SyncRepGetSyncRecPtr
> will not include WAL positions from already-dead walsenders in
> its 

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-11 Thread Tom Lane
Masahiko Sawada  writes:
> On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
>  wrote:
>>> Therefore, the band-aid fix seems to be to set the lowest priority to
>>> very large number at the beginning of SyncRepGetSyncStandbysPriority().

>> I think we can use max_wal_senders.

> Sorry, that's not true. We need another number large enough.

The buildfarm had another three failures of this type today, so that
motivated me to look at it some more.  I don't think this code needs
a band-aid fix; I think "nuke from orbit" is more nearly the right
level of response.

The point that I was trying to make originally is that it seems quite
insane to imagine that a walsender's sync_standby_priority value is
somehow more stable than the very existence of the process.  Yet we
only require a walsender to lock its own mutex while claiming or
disowning its WalSnd entry (by setting or clearing the pid field).
So I think it's nuts to define those fields as being protected by
the global SyncRepLock.

Even without considering the possibility that a walsender has just
started or stopped, we have the problem Fujii-san described that after
a change in the synchronous_standby_names GUC setting, different
walsenders will update their values of sync_standby_priority at
different instants.  (BTW, I now notice that Noah had previously
identified this problem at [1].)

Thus, even while holding SyncRepLock, we do not have a guarantee that
we'll see a consistent set of sync_standby_priority values.  In fact
we don't even know that the walsnd array entries still belong to the
processes that last set those values.  This is what is breaking
SyncRepGetSyncStandbysPriority, and what it means is that there's
really fundamentally no chance of that function producing trustworthy
results.  The "band aid" fixes discussed here might avoid crashing on
the Assert, but they won't fix the problems that (a) the result is
possibly wrong and (b) it can become stale immediately even if it's
right when returned.

Now, there are only two callers of SyncRepGetSyncStandbys:
SyncRepGetSyncRecPtr and pg_stat_get_wal_senders.  The latter is
mostly cosmetic (which is a good thing, because to add insult to
injury, it continues to use the list after releasing SyncRepLock;
not that continuing to hold that lock would make things much safer).
If I'm reading the code correctly, the former doesn't really care
exactly which walsenders are sync standbys: all it cares about is
to collect their WAL position pointers.

What I think we should do about this is, essentially, to get rid of
SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
whether *it* believes that it is a sync standby, based on its last
evaluation of the relevant GUCs.  This would be a bool that it'd
compute and set alongside sync_standby_priority.  (Hm, maybe we'd not
even need to have that field anymore?  Not sure.)  We should also
redefine that flag, and sync_standby_priority if it survives, as being
protected by the per-walsender mutex not SyncRepLock.  Then, what
SyncRepGetSyncRecPtr would do is just sweep through the walsender
array and collect WAL position pointers from the walsenders that
claim to be sync standbys at the instant that it's inspecting them.
pg_stat_get_wal_senders could also use those flags instead of the
list from SyncRepGetSyncStandbys.

It's likely that this definition would have slightly different
behavior from the current implementation during the period where
the system is absorbing a change in the set of synchronous
walsenders.  However, since the current implementation is visibly
wrong during that period anyway, I'm not sure how this could be
worse.  And at least we can be certain that SyncRepGetSyncRecPtr
will not include WAL positions from already-dead walsenders in
its calculations, which *is* a hazard in the code as it stands.

I also estimate that this would be noticeably more efficient than
the current code, since the logic to decide who's a sync standby
would only run when we're dealing with walsender start/stop or
SIGHUP, rather than every time SyncRepGetSyncRecPtr runs.

Don't especially want to code this myself, though.  Anyone?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-03-31 Thread Masahiko Sawada
On Tue, 31 Mar 2020 at 23:16, Masahiko Sawada
 wrote:
>
> On Fri, 27 Mar 2020 at 13:54, Fujii Masao  wrote:
> >
> >
> >
> > On 2020/03/27 10:26, Tom Lane wrote:
> > > Twice in the past month [1][2], buildfarm member hoverfly has managed
> > > to reach the "unreachable" Assert(false) at the end of
> > > SyncRepGetSyncStandbysPriority.
> >
> > When I search the past discussions, I found that Noah Misch reported
> > the same issue.
> > https://www.postgresql.org/message-id/20200206074552.gb3326...@rfd.leadboat.com
> >
> > > What seems likely to me, after quickly eyeballing the code, is that
> > > hoverfly is hitting the blatantly-obvious race condition in that function.
> > > Namely, that the second loop supposes that the state of the walsender
> > > array hasn't changed since the first loop.
> > >
> > > The minimum fix for this, I suppose, would have the first loop capture
> > > the sync_standby_priority value for each walsender along with what it's
> > > already capturing.  But I wonder if the whole function shouldn't be
> > > rewritten from scratch, because it seems like the algorithm is both
> > > expensively brute-force and unintelligible, which is a sad combination.
> > > It's likely that the number of walsenders would never be high enough
> > > that efficiency could matter, but then couldn't we use an algorithm
> > > that is less complicated and more obviously correct?
> >
> > +1 to rewrite the function with better algorithm.
> >
> > > (Because the
> > > alternative conclusion, if you reject the theory that a race is happening,
> > > is that the algorithm is just flat out buggy; something that's not too
> > > easy to disprove either.)
> > >
> > > Another fairly dubious thing here is that whether or not *am_sync
> > > gets set depends not only on whether MyWalSnd is claiming to be
> > > synchronous but on how many lower-numbered walsenders are too.
> > > Is that really the right thing?
> > >
> > > But worse than any of that is that the return value seems to be
> > > a list of walsender array indexes, meaning that the callers cannot
> > > use it without making even stronger assumptions about the array
> > > contents not having changed since the start of this function.
> > >
> > > It sort of looks like the design is based on the assumption that
> > > the array contents can't change while SyncRepLock is held ... but
> > > if that's the plan then why bother with the per-walsender spinlocks?
> > > In any case this assumption seems to be failing, suggesting either
> > > that there's a caller that's not holding SyncRepLock when it calls
> > > this function, or that somebody is failing to take that lock while
> > > modifying the array.
> >
> > As far as I read the code, that assumption seems still valid. But the 
> > problem
> > is that each walsender updates MyWalSnd->sync_standby_priority at each
> > convenient timing, when SIGHUP is signaled. That is, at a certain moment,
> > some walsenders (also their WalSnd entries in shmem) work based on
> > the latest configuration but the others (also their WalSnd entries) work 
> > based
> > on the old one.
> >
> > lowest_priority = SyncRepConfig->nmembers;
> > next_highest_priority = lowest_priority + 1;
> >
> > SyncRepGetSyncStandbysPriority() calculates the lowest priority among
> > all running walsenders as the above, by using the configuration info that
> > this walsender is based on. But this calculated lowest priority would be
> > invalid if other walsender is based on different (e.g., old) configuraiton.
> > This can cause the (other) walsender to have lower priority than
> > the calculated lowest priority and the second loop in
> > SyncRepGetSyncStandbysPriority() to unexpectedly end.
>
> I have the same understanding. Since sync_standby_priroity is
> protected by SyncRepLock these values of each walsender are not
> changed through two loops in SyncRepGetSyncStandbysPriority().
> However, as Fujii-san already mentioned the true lowest priority can
> be lower than lowest_priority, nmembers, when only part of walsenders
> reloaded the configuration, which in turn could be the cause of
> leaving entries in the pending list at the end of the function.
>
> > Therefore, the band-aid fix seems to be to set the lowest priority to
> > very large number at the beginning of SyncRepGetSyncStandbysPriority().
>
> I think we can use max_wal_senders.

Sorry, that's not true. We need another number large enough.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-03-31 Thread Masahiko Sawada
On Fri, 27 Mar 2020 at 13:54, Fujii Masao  wrote:
>
>
>
> On 2020/03/27 10:26, Tom Lane wrote:
> > Twice in the past month [1][2], buildfarm member hoverfly has managed
> > to reach the "unreachable" Assert(false) at the end of
> > SyncRepGetSyncStandbysPriority.
>
> When I search the past discussions, I found that Noah Misch reported
> the same issue.
> https://www.postgresql.org/message-id/20200206074552.gb3326...@rfd.leadboat.com
>
> > What seems likely to me, after quickly eyeballing the code, is that
> > hoverfly is hitting the blatantly-obvious race condition in that function.
> > Namely, that the second loop supposes that the state of the walsender
> > array hasn't changed since the first loop.
> >
> > The minimum fix for this, I suppose, would have the first loop capture
> > the sync_standby_priority value for each walsender along with what it's
> > already capturing.  But I wonder if the whole function shouldn't be
> > rewritten from scratch, because it seems like the algorithm is both
> > expensively brute-force and unintelligible, which is a sad combination.
> > It's likely that the number of walsenders would never be high enough
> > that efficiency could matter, but then couldn't we use an algorithm
> > that is less complicated and more obviously correct?
>
> +1 to rewrite the function with better algorithm.
>
> > (Because the
> > alternative conclusion, if you reject the theory that a race is happening,
> > is that the algorithm is just flat out buggy; something that's not too
> > easy to disprove either.)
> >
> > Another fairly dubious thing here is that whether or not *am_sync
> > gets set depends not only on whether MyWalSnd is claiming to be
> > synchronous but on how many lower-numbered walsenders are too.
> > Is that really the right thing?
> >
> > But worse than any of that is that the return value seems to be
> > a list of walsender array indexes, meaning that the callers cannot
> > use it without making even stronger assumptions about the array
> > contents not having changed since the start of this function.
> >
> > It sort of looks like the design is based on the assumption that
> > the array contents can't change while SyncRepLock is held ... but
> > if that's the plan then why bother with the per-walsender spinlocks?
> > In any case this assumption seems to be failing, suggesting either
> > that there's a caller that's not holding SyncRepLock when it calls
> > this function, or that somebody is failing to take that lock while
> > modifying the array.
>
> As far as I read the code, that assumption seems still valid. But the problem
> is that each walsender updates MyWalSnd->sync_standby_priority at each
> convenient timing, when SIGHUP is signaled. That is, at a certain moment,
> some walsenders (also their WalSnd entries in shmem) work based on
> the latest configuration but the others (also their WalSnd entries) work based
> on the old one.
>
> lowest_priority = SyncRepConfig->nmembers;
> next_highest_priority = lowest_priority + 1;
>
> SyncRepGetSyncStandbysPriority() calculates the lowest priority among
> all running walsenders as the above, by using the configuration info that
> this walsender is based on. But this calculated lowest priority would be
> invalid if other walsender is based on different (e.g., old) configuraiton.
> This can cause the (other) walsender to have lower priority than
> the calculated lowest priority and the second loop in
> SyncRepGetSyncStandbysPriority() to unexpectedly end.

I have the same understanding. Since sync_standby_priroity is
protected by SyncRepLock these values of each walsender are not
changed through two loops in SyncRepGetSyncStandbysPriority().
However, as Fujii-san already mentioned the true lowest priority can
be lower than lowest_priority, nmembers, when only part of walsenders
reloaded the configuration, which in turn could be the cause of
leaving entries in the pending list at the end of the function.

> Therefore, the band-aid fix seems to be to set the lowest priority to
> very large number at the beginning of SyncRepGetSyncStandbysPriority().

I think we can use max_wal_senders. And we can change the second loop
so that we exit from the function if the pending list is empty.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-03-30 Thread Kyotaro Horiguchi
At Fri, 27 Mar 2020 13:54:25 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/03/27 10:26, Tom Lane wrote:
> > Twice in the past month [1][2], buildfarm member hoverfly has managed
> > to reach the "unreachable" Assert(false) at the end of
> > SyncRepGetSyncStandbysPriority.
> 
> When I search the past discussions, I found that Noah Misch reported
> the same issue.
> https://www.postgresql.org/message-id/20200206074552.gb3326...@rfd.leadboat.com
> 
> > What seems likely to me, after quickly eyeballing the code, is that
> > hoverfly is hitting the blatantly-obvious race condition in that
> > function.
> > Namely, that the second loop supposes that the state of the walsender
> > array hasn't changed since the first loop.
> > The minimum fix for this, I suppose, would have the first loop capture
> > the sync_standby_priority value for each walsender along with what
> > it's
> > already capturing.  But I wonder if the whole function shouldn't be
> > rewritten from scratch, because it seems like the algorithm is both
> > expensively brute-force and unintelligible, which is a sad
> > combination.
> > It's likely that the number of walsenders would never be high enough
> > that efficiency could matter, but then couldn't we use an algorithm
> > that is less complicated and more obviously correct?
> 
> +1 to rewrite the function with better algorithm.
> 
> > (Because the
> > alternative conclusion, if you reject the theory that a race is
> > happening,
> > is that the algorithm is just flat out buggy; something that's not too
> > easy to disprove either.)
> > Another fairly dubious thing here is that whether or not *am_sync
> > gets set depends not only on whether MyWalSnd is claiming to be
> > synchronous but on how many lower-numbered walsenders are too.
> > Is that really the right thing?
> > But worse than any of that is that the return value seems to be
> > a list of walsender array indexes, meaning that the callers cannot
> > use it without making even stronger assumptions about the array
> > contents not having changed since the start of this function.
> > It sort of looks like the design is based on the assumption that
> > the array contents can't change while SyncRepLock is held ... but
> > if that's the plan then why bother with the per-walsender spinlocks?
> > In any case this assumption seems to be failing, suggesting either
> > that there's a caller that's not holding SyncRepLock when it calls
> > this function, or that somebody is failing to take that lock while
> > modifying the array.
> 
> As far as I read the code, that assumption seems still valid. But the
> problem
> is that each walsender updates MyWalSnd->sync_standby_priority at each
> convenient timing, when SIGHUP is signaled. That is, at a certain
> moment,
> some walsenders (also their WalSnd entries in shmem) work based on
> the latest configuration but the others (also their WalSnd entries)
> work based
> on the old one.
> 
>   lowest_priority = SyncRepConfig->nmembers;
>   next_highest_priority = lowest_priority + 1;
> 
> SyncRepGetSyncStandbysPriority() calculates the lowest priority among
> all running walsenders as the above, by using the configuration info
> that
> this walsender is based on. But this calculated lowest priority would
> be
> invalid if other walsender is based on different (e.g., old)
> configuraiton.
> This can cause the (other) walsender to have lower priority than
> the calculated lowest priority and the second loop in
> SyncRepGetSyncStandbysPriority() to unexpectedly end.
> 
> Therefore, the band-aid fix seems to be to set the lowest priority to
> very large number at the beginning of
> SyncRepGetSyncStandbysPriority().

Or just ignore impossible priorities as non-sync standby. Anyway the
confused state is fixed after all walsenders have loaded the new
configuration.

I remember that I posted a bandaid for maybe the same issue.

https://www.postgresql.org/message-id/20200207.125251.146972241588695685.horikyota@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-03-26 Thread Fujii Masao




On 2020/03/27 10:26, Tom Lane wrote:

Twice in the past month [1][2], buildfarm member hoverfly has managed
to reach the "unreachable" Assert(false) at the end of
SyncRepGetSyncStandbysPriority.


When I search the past discussions, I found that Noah Misch reported
the same issue.
https://www.postgresql.org/message-id/20200206074552.gb3326...@rfd.leadboat.com


What seems likely to me, after quickly eyeballing the code, is that
hoverfly is hitting the blatantly-obvious race condition in that function.
Namely, that the second loop supposes that the state of the walsender
array hasn't changed since the first loop.

The minimum fix for this, I suppose, would have the first loop capture
the sync_standby_priority value for each walsender along with what it's
already capturing.  But I wonder if the whole function shouldn't be
rewritten from scratch, because it seems like the algorithm is both
expensively brute-force and unintelligible, which is a sad combination.
It's likely that the number of walsenders would never be high enough
that efficiency could matter, but then couldn't we use an algorithm
that is less complicated and more obviously correct?


+1 to rewrite the function with better algorithm.


(Because the
alternative conclusion, if you reject the theory that a race is happening,
is that the algorithm is just flat out buggy; something that's not too
easy to disprove either.)

Another fairly dubious thing here is that whether or not *am_sync
gets set depends not only on whether MyWalSnd is claiming to be
synchronous but on how many lower-numbered walsenders are too.
Is that really the right thing?

But worse than any of that is that the return value seems to be
a list of walsender array indexes, meaning that the callers cannot
use it without making even stronger assumptions about the array
contents not having changed since the start of this function.

It sort of looks like the design is based on the assumption that
the array contents can't change while SyncRepLock is held ... but
if that's the plan then why bother with the per-walsender spinlocks?
In any case this assumption seems to be failing, suggesting either
that there's a caller that's not holding SyncRepLock when it calls
this function, or that somebody is failing to take that lock while
modifying the array.


As far as I read the code, that assumption seems still valid. But the problem
is that each walsender updates MyWalSnd->sync_standby_priority at each
convenient timing, when SIGHUP is signaled. That is, at a certain moment,
some walsenders (also their WalSnd entries in shmem) work based on
the latest configuration but the others (also their WalSnd entries) work based
on the old one.

lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;

SyncRepGetSyncStandbysPriority() calculates the lowest priority among
all running walsenders as the above, by using the configuration info that
this walsender is based on. But this calculated lowest priority would be
invalid if other walsender is based on different (e.g., old) configuraiton.
This can cause the (other) walsender to have lower priority than
the calculated lowest priority and the second loop in
SyncRepGetSyncStandbysPriority() to unexpectedly end.

Therefore, the band-aid fix seems to be to set the lowest priority to
very large number at the beginning of SyncRepGetSyncStandbysPriority().

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters