Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-14 Thread Petr Jelinek
On 14/06/17 20:57, Andres Freund wrote:
> Hi,
> 
> On 2017-06-13 00:50:20 -0700, Andres Freund wrote:
>> Just to be clear: The patch, after the first point above (which I did),
>> looks ok.  I'm just looking for comments.
> 
> And pushed.
> 

Thanks!

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-14 Thread Andres Freund
Hi,

On 2017-06-13 00:50:20 -0700, Andres Freund wrote:
> Just to be clear: The patch, after the first point above (which I did),
> looks ok.  I'm just looking for comments.

And pushed.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-13 Thread Andres Freund
On 2017-06-13 00:32:40 -0700, Andres Freund wrote:
> On 2017-06-09 22:28:00 +0200, Petr Jelinek wrote:
> > On 09/06/17 03:20, Petr Jelinek wrote:
> > > On 09/06/17 01:06, Andres Freund wrote:
> > >> Hi,
> > >>
> > >> On 2017-06-03 04:50:00 +0200, Petr Jelinek wrote:
> > >>> One thing I don't like is the GetLastLocalTransactionId() that I had to
> > >>> add because we clear the proc->lxid before we get to AtEOXact_Snapshot()
> > >>> but I don't have better solution there.
> > >>
> > >> I dislike that quite a bit - how about we instead just change the
> > >> information we keep in exportedSnapshots?  I.e. store a pointer to an
> > >> struct ExportedSnapshot
> > >> {
> > >> char *exportedAs;
> > >> Snapshot snapshot;
> > >> }
> > >>
> > >> Then we can get rid of that relatively ugly list_length() business too.
> > >>
> > > 
> > > That sounds reasonable, I'll do that.
> > > 
> > 
> > And here it is, seems better (the 0002 is same as before).
> 
> I spent a chunk of time with this.  One surprising, but easy to fix issue was
> that this patch had the exported file name as:
> 
> > /*
> > +* Generate file path for the snapshot.  We start numbering of snapshots
> > +* inside the transaction from 1.
> > +*/
> > +   snprintf(path, sizeof(path), SNAPSHOT_EXPORT_DIR "/%08X-%d",
> > +topXid, list_length(exportedSnapshots) + 1);
> > +
> 
> I.e. just as before, unlike the previous version of the patch which used
> virtualxids.  Given that we don't require topXid to be set, this seems
> to largely break the patch.
> 
> Secondly, because of
> 
> > /*
> > -* This will assign a transaction ID if we do not yet have one.
> > +* Get our transaction ID if there is one, to include in the snapshot.
> >  */
> > -   topXid = GetTopTransactionId();
> > +   topXid = GetTopTransactionIdIfAny();
> >  
> 
> which is perfectly correct on its face, we actually added a substantial
> feature: Previously exporting snapshots was only possible on primaries,
> now it's also possible on standbys.  The only thing that made that error
> out before was the check during xid assignment.  Because snapshots work
> somewhat differntly on standbys, I spent a good chunk of time staring at
> code trying to see whether it'd break anything.  Neither code-reading
> nor testing seems to suggest any problems.   Were we earlier in the
> release cycle I'd be perfectly fine with an accidental feature, but I'm
> wondering a bit whether we should just make it error out at this point,
> because it'd be a new feature.  I'm -0.5 on that, but I think it should
> be raised.


Just to be clear: The patch, after the first point above (which I did),
looks ok.  I'm just looking for comments.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-13 Thread Andres Freund
On 2017-06-09 22:28:00 +0200, Petr Jelinek wrote:
> On 09/06/17 03:20, Petr Jelinek wrote:
> > On 09/06/17 01:06, Andres Freund wrote:
> >> Hi,
> >>
> >> On 2017-06-03 04:50:00 +0200, Petr Jelinek wrote:
> >>> One thing I don't like is the GetLastLocalTransactionId() that I had to
> >>> add because we clear the proc->lxid before we get to AtEOXact_Snapshot()
> >>> but I don't have better solution there.
> >>
> >> I dislike that quite a bit - how about we instead just change the
> >> information we keep in exportedSnapshots?  I.e. store a pointer to an
> >> struct ExportedSnapshot
> >> {
> >> char *exportedAs;
> >> Snapshot snapshot;
> >> }
> >>
> >> Then we can get rid of that relatively ugly list_length() business too.
> >>
> > 
> > That sounds reasonable, I'll do that.
> > 
> 
> And here it is, seems better (the 0002 is same as before).

I spent a chunk of time with this.  One surprising, but easy to fix issue was
that this patch had the exported file name as:

>   /*
> +  * Generate file path for the snapshot.  We start numbering of snapshots
> +  * inside the transaction from 1.
> +  */
> + snprintf(path, sizeof(path), SNAPSHOT_EXPORT_DIR "/%08X-%d",
> +  topXid, list_length(exportedSnapshots) + 1);
> +

I.e. just as before, unlike the previous version of the patch which used
virtualxids.  Given that we don't require topXid to be set, this seems
to largely break the patch.

Secondly, because of

>   /*
> -  * This will assign a transaction ID if we do not yet have one.
> +  * Get our transaction ID if there is one, to include in the snapshot.
>*/
> - topXid = GetTopTransactionId();
> + topXid = GetTopTransactionIdIfAny();
>  

which is perfectly correct on its face, we actually added a substantial
feature: Previously exporting snapshots was only possible on primaries,
now it's also possible on standbys.  The only thing that made that error
out before was the check during xid assignment.  Because snapshots work
somewhat differntly on standbys, I spent a good chunk of time staring at
code trying to see whether it'd break anything.  Neither code-reading
nor testing seems to suggest any problems.   Were we earlier in the
release cycle I'd be perfectly fine with an accidental feature, but I'm
wondering a bit whether we should just make it error out at this point,
because it'd be a new feature.  I'm -0.5 on that, but I think it should
be raised.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-09 Thread Andres Freund
On 2017-06-09 22:28:00 +0200, Petr Jelinek wrote:
> And here it is, seems better (the 0002 is same as before).

Cool, looks good on a quick scan.


>  /* Define pathname of exported-snapshot files */
>  #define SNAPSHOT_EXPORT_DIR "pg_snapshots"
> -#define XactExportFilePath(path, xid, num, suffix) \
> - snprintf(path, sizeof(path), SNAPSHOT_EXPORT_DIR "/%08X-%d%s", \
> -  xid, num, suffix)
>  
> -/* Current xact's exported snapshots (a list of Snapshot structs) */
> +/* Structure holding info about exported snapshot. */
> +typedef struct ExportedSnapshot
> +{
> + char *snapfile;
> + Snapshot snapshot;
> +} ExportedSnapshot;
> +
> +/* Current xact's exported snapshots (a list of ExportedSnapshot structs) */
>  static List *exportedSnapshots = NIL;

trival quibble: *pointers to


Will take care of it over the weekend.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-09 Thread Petr Jelinek
On 09/06/17 03:20, Petr Jelinek wrote:
> On 09/06/17 01:06, Andres Freund wrote:
>> Hi,
>>
>> On 2017-06-03 04:50:00 +0200, Petr Jelinek wrote:
>>> One thing I don't like is the GetLastLocalTransactionId() that I had to
>>> add because we clear the proc->lxid before we get to AtEOXact_Snapshot()
>>> but I don't have better solution there.
>>
>> I dislike that quite a bit - how about we instead just change the
>> information we keep in exportedSnapshots?  I.e. store a pointer to an
>> struct ExportedSnapshot
>> {
>> char *exportedAs;
>> Snapshot snapshot;
>> }
>>
>> Then we can get rid of that relatively ugly list_length() business too.
>>
> 
> That sounds reasonable, I'll do that.
> 

And here it is, seems better (the 0002 is same as before).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From f6df47e8257cbce4a78ceeeac504c9fe86527b05 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 3 Jun 2017 02:07:49 +0200
Subject: [PATCH 2/2] Don't assign xid to logical decoding snapshots

---
 src/backend/replication/logical/snapbuild.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 8848f5b..e06aa09 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -262,7 +262,7 @@ static bool ExportInProgress = false;
 static void SnapBuildPurgeCommittedTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, TransactionId xid);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
 
 static void SnapBuildFreeSnapshot(Snapshot snap);
 
@@ -463,7 +463,7 @@ SnapBuildSnapDecRefcount(Snapshot snap)
  * and ->subxip/subxcnt values.
  */
 static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder, TransactionId xid)
+SnapBuildBuildSnapshot(SnapBuild *builder)
 {
 	Snapshot	snapshot;
 	Size		ssize;
@@ -562,7 +562,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	if (TransactionIdIsValid(MyPgXact->xmin))
 		elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid");
 
-	snap = SnapBuildBuildSnapshot(builder, GetTopTransactionId());
+	snap = SnapBuildBuildSnapshot(builder);
 
 	/*
 	 * We know that snap->xmin is alive, enforced by the logical xmin
@@ -679,7 +679,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder, TransactionId xid)
 	/* only build a new snapshot if we don't have a prebuilt one */
 	if (builder->snapshot == NULL)
 	{
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		builder->snapshot = SnapBuildBuildSnapshot(builder);
 		/* increase refcount for the snapshot builder */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 	}
@@ -743,7 +743,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
 		/* only build a new snapshot if we don't have a prebuilt one */
 		if (builder->snapshot == NULL)
 		{
-			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+			builder->snapshot = SnapBuildBuildSnapshot(builder);
 			/* increase refcount for the snapshot builder */
 			SnapBuildSnapIncRefcount(builder->snapshot);
 		}
@@ -1061,7 +1061,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		if (builder->snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		builder->snapshot = SnapBuildBuildSnapshot(builder);
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1831,7 +1831,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	{
 		SnapBuildSnapDecRefcount(builder->snapshot);
 	}
-	builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidTransactionId);
+	builder->snapshot = SnapBuildBuildSnapshot(builder);
 	SnapBuildSnapIncRefcount(builder->snapshot);
 
 	ReorderBufferSetRestartPoint(builder->reorder, lsn);
-- 
2.7.4

From aaf6a332bd5010a6f4f6fd113c703a26533d5737 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 3 Jun 2017 02:06:22 +0200
Subject: [PATCH 1/2] Use virtual transaction instead of normal ones in
 exported snapshots

---
 doc/src/sgml/ref/set_transaction.sgml |   8 +--
 src/backend/storage/ipc/procarray.c   |  11 ++--
 src/backend/storage/lmgr/predicate.c  |  26 
 src/backend/utils/time/snapmgr.c  | 121 ++
 src/include/storage/predicate.h   |   4 +-
 src/include/storage/procarray.h   |   2 +-
 6 files changed, 109 insertions(+), 63 deletions(-)

diff --git a/doc/src/sgml/ref/set_transaction.sgml b/doc/src/sgml/ref/set_transaction.sgml
index ca55a5b..116315f 100644
--- a/doc/src/sgml/ref/set_transaction.sgml
+++ b/doc/src/sgml/ref/set_transaction.sgml
@@ -221,9 +221,9 @@ SET SESSION 

Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-08 Thread Petr Jelinek
On 09/06/17 01:06, Andres Freund wrote:
> Hi,
> 
> On 2017-06-03 04:50:00 +0200, Petr Jelinek wrote:
>> One thing I don't like is the GetLastLocalTransactionId() that I had to
>> add because we clear the proc->lxid before we get to AtEOXact_Snapshot()
>> but I don't have better solution there.
> 
> I dislike that quite a bit - how about we instead just change the
> information we keep in exportedSnapshots?  I.e. store a pointer to an
> struct ExportedSnapshot
> {
> char *exportedAs;
> Snapshot snapshot;
> }
> 
> Then we can get rid of that relatively ugly list_length() business too.
> 

That sounds reasonable, I'll do that.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-08 Thread Andres Freund
Hi,

On 2017-06-03 04:50:00 +0200, Petr Jelinek wrote:
> One thing I don't like is the GetLastLocalTransactionId() that I had to
> add because we clear the proc->lxid before we get to AtEOXact_Snapshot()
> but I don't have better solution there.

I dislike that quite a bit - how about we instead just change the
information we keep in exportedSnapshots?  I.e. store a pointer to an
struct ExportedSnapshot
{
char *exportedAs;
Snapshot snapshot;
}

Then we can get rid of that relatively ugly list_length() business too.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-02 Thread Petr Jelinek
On 03/06/17 03:25, Andres Freund wrote:
> 
>> That should solve the original problem reported here.
> 
> Did you verify that?
>

Yes, I tried to manually create multiple exported logical decoding
snapshots in parallel and read data from them and it worked fine, while
it blocked before.

> 
>> @@ -1741,17 +1741,17 @@ GetSerializableTransactionSnapshotInt(Snapshot 
>> snapshot,
>>  } while (!sxact);
>>  
>>  /* Get the snapshot, or check that it's safe to use */
>> -if (!TransactionIdIsValid(sourcexid))
>> +if (!sourcevxid)
>>  snapshot = GetSnapshotData(snapshot);
>> -else if (!ProcArrayInstallImportedXmin(snapshot->xmin, sourcexid))
>> +else if (!ProcArrayInstallImportedXmin(snapshot->xmin, sourcevxid))
>>  {
>>  ReleasePredXact(sxact);
>>  LWLockRelease(SerializableXactHashLock);
>>  ereport(ERROR,
>>  
>> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>>   errmsg("could not import the requested 
>> snapshot"),
>> -   errdetail("The source transaction %u is not running 
>> anymore.",
>> - sourcexid)));
>> +   errdetail("The source virtual transaction %d/%u is not running 
>> anymore.",
>> + sourcevxid->backendId, 
>> sourcevxid->localTransactionId)));
> 
> Hm, this is a harder to read.  Wonder if we should add a pid field, that'd
> make it a bit easier to interpret?
> 

Agreed, see attached. We have to pass the pid around a bit but I don't
think it's too bad.

One thing I don't like is the GetLastLocalTransactionId() that I had to
add because we clear the proc->lxid before we get to AtEOXact_Snapshot()
but I don't have better solution there.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


0002-Don-t-assign-xid-to-logical-decoding-snapshots.patch
Description: invalid/octet-stream


0001-Use-virtual-transaction-instead-of-normal-ones-in-ex.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-02 Thread Andres Freund
Hi,

On 2017-06-03 03:03:20 +0200, Petr Jelinek wrote:
> All right, here is my rough attempt on doing what Andres has suggested,
> going straight from xid to vxid, seems to work fine in my tests and
> parallel pg_dump seems happy with it as well.

Cool!


> The second patch removes the xid parameter from SnapBuildBuildSnapshot
> as it's not used for anything and skips the GetTopTransactionId() call
> as well.

Makes sense.


> That should solve the original problem reported here.

Did you verify that?


> From 4f7eb5b8fdbab2539731414e81d7a7419a83e5b1 Mon Sep 17 00:00:00 2001
> From: Petr Jelinek 
> Date: Sat, 3 Jun 2017 02:06:22 +0200
> Subject: [PATCH 1/2] Use virtual transaction instead of normal ones in
>  exported snapshots

> @@ -1817,8 +1818,10 @@ ProcArrayInstallImportedXmin(TransactionId xmin, 
> TransactionId sourcexid)
>   if (pgxact->vacuumFlags & PROC_IN_VACUUM)
>   continue;
>  
> - xid = pgxact->xid;  /* fetch just once */

Hm, I don't quite understand what the 'fetch just once' bit was trying
to achieve here (not to speak of the fact that it's simply not
guaranteed this way).  Therefore I think you're right to simply
disregard it.


> @@ -1741,17 +1741,17 @@ GetSerializableTransactionSnapshotInt(Snapshot 
> snapshot,
>   } while (!sxact);
>  
>   /* Get the snapshot, or check that it's safe to use */
> - if (!TransactionIdIsValid(sourcexid))
> + if (!sourcevxid)
>   snapshot = GetSnapshotData(snapshot);
> - else if (!ProcArrayInstallImportedXmin(snapshot->xmin, sourcexid))
> + else if (!ProcArrayInstallImportedXmin(snapshot->xmin, sourcevxid))
>   {
>   ReleasePredXact(sxact);
>   LWLockRelease(SerializableXactHashLock);
>   ereport(ERROR,
>   
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>errmsg("could not import the requested 
> snapshot"),
> -errdetail("The source transaction %u is not running 
> anymore.",
> -  sourcexid)));
> +errdetail("The source virtual transaction %d/%u is not running 
> anymore.",
> +  sourcevxid->backendId, 
> sourcevxid->localTransactionId)));

Hm, this is a harder to read.  Wonder if we should add a pid field, that'd
make it a bit easier to interpret?


- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-02 Thread Petr Jelinek
On 02/06/17 03:38, Robert Haas wrote:
> On Thu, Jun 1, 2017 at 2:28 PM, Andres Freund  wrote:
>> On 2017-06-01 14:17:44 +0200, Petr Jelinek wrote:
>>> Thinking more about this, I am not convinced it's a good idea to change
>>> exports this late in the cycle. I still think it's best to do the xid
>>> assignment only when the snapshot is actually exported but don't assign
>>> xid when the export is only used by the local session (the new option in
>>> PG10). That's one line change which impacts only logical
>>> replication/decoding as opposed to everything else which uses exported
>>> snapshots.
>>
>> I'm not quite convinced by this argument.  Exported snapshot contents
>> are ephemeral, we can change the format at any time.  The wait is fairly
>> annoying for every user of logical decoding.  For me the combination of
>> those two fact implies that we should rather fix this properly.
> 
> +1.  The change Andres is proposing doesn't sound like it will be
> terribly high-impact, and I think we'll be happier in the long run if
> we install real fixes instead of kludging it.
> 

All right, here is my rough attempt on doing what Andres has suggested,
going straight from xid to vxid, seems to work fine in my tests and
parallel pg_dump seems happy with it as well.

The second patch removes the xid parameter from SnapBuildBuildSnapshot
as it's not used for anything and skips the GetTopTransactionId() call
as well. That should solve the original problem reported here.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


0002-Don-t-assign-xid-to-logical-decoding-snapshots.patch
Description: invalid/octet-stream


0001-Use-virtual-transaction-instead-of-normal-ones-in-ex.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-01 Thread Robert Haas
On Thu, Jun 1, 2017 at 2:28 PM, Andres Freund  wrote:
> On 2017-06-01 14:17:44 +0200, Petr Jelinek wrote:
>> Thinking more about this, I am not convinced it's a good idea to change
>> exports this late in the cycle. I still think it's best to do the xid
>> assignment only when the snapshot is actually exported but don't assign
>> xid when the export is only used by the local session (the new option in
>> PG10). That's one line change which impacts only logical
>> replication/decoding as opposed to everything else which uses exported
>> snapshots.
>
> I'm not quite convinced by this argument.  Exported snapshot contents
> are ephemeral, we can change the format at any time.  The wait is fairly
> annoying for every user of logical decoding.  For me the combination of
> those two fact implies that we should rather fix this properly.

+1.  The change Andres is proposing doesn't sound like it will be
terribly high-impact, and I think we'll be happier in the long run if
we install real fixes instead of kludging it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-01 Thread Andres Freund
On 2017-06-01 14:17:44 +0200, Petr Jelinek wrote:
> Thinking more about this, I am not convinced it's a good idea to change
> exports this late in the cycle. I still think it's best to do the xid
> assignment only when the snapshot is actually exported but don't assign
> xid when the export is only used by the local session (the new option in
> PG10). That's one line change which impacts only logical
> replication/decoding as opposed to everything else which uses exported
> snapshots.

I'm not quite convinced by this argument.  Exported snapshot contents
are ephemeral, we can change the format at any time.  The wait is fairly
annoying for every user of logical decoding.  For me the combination of
those two fact implies that we should rather fix this properly.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-06-01 Thread Petr Jelinek
On 31/05/17 11:21, Petr Jelinek wrote:
> On 31/05/17 09:24, Andres Freund wrote:
>> On 2017-05-29 23:49:33 +0200, Petr Jelinek wrote:
>>> I am not quite sure I understand (both the vxid suggestion and for the
>>> session dying badly). Maybe we can discuss bit more when you get to
>>> computer so it's easier for you to expand on what you mean.
>>
>> The xid interlock when exporting a snapshot is required because
>> otherwise it's not generally guaranteed that all resourced required for
>> the snapshot are reserved.  In the logical replication case we could
>> guarantee that otherwise, but there'd be weird-ish edge cases when
>> erroring out just after exporting a snapshot.
>>
>> The problem with using the xid as that interlock is that it requires
>> acquiring an xid - which is something we're going to block against when
>> building a new catalog snapshot.  Afaict that's not entirely required -
>> all that we need to verify is that the snapshot in the source
>> transaction is still running.  The easiest way for the importer to check
>> that the source is still alive seems to be export the virtual
>> transaction id instead of the xid.  Normally we can't store things like
>> virtual xids on disk, but that concern isn't valid here because exported
>> snapshots are ephemeral, there's also already a precedent in
>> predicate.c.
>>
>> It seems like it'd be fairly easy to change things around that way, but
>> maybe I'm missing something.
>>
> 
> Okay, thanks for explanation. Code-wise it does seem simple enough
> indeed. I admit I don't know enough about the exported snapshots and
> snapshot management in general to be able to answer the question of
> safety here. That said, it does seem to me like it should work as the
> exported snapshots are just on disk representation of in-memory state
> that becomes invalid once the in-memory state does.
> 

Thinking more about this, I am not convinced it's a good idea to change
exports this late in the cycle. I still think it's best to do the xid
assignment only when the snapshot is actually exported but don't assign
xid when the export is only used by the local session (the new option in
PG10). That's one line change which impacts only logical
replication/decoding as opposed to everything else which uses exported
snapshots.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-31 Thread Petr Jelinek
On 31/05/17 09:24, Andres Freund wrote:
> On 2017-05-29 23:49:33 +0200, Petr Jelinek wrote:
>> I am not quite sure I understand (both the vxid suggestion and for the
>> session dying badly). Maybe we can discuss bit more when you get to
>> computer so it's easier for you to expand on what you mean.
> 
> The xid interlock when exporting a snapshot is required because
> otherwise it's not generally guaranteed that all resourced required for
> the snapshot are reserved.  In the logical replication case we could
> guarantee that otherwise, but there'd be weird-ish edge cases when
> erroring out just after exporting a snapshot.
> 
> The problem with using the xid as that interlock is that it requires
> acquiring an xid - which is something we're going to block against when
> building a new catalog snapshot.  Afaict that's not entirely required -
> all that we need to verify is that the snapshot in the source
> transaction is still running.  The easiest way for the importer to check
> that the source is still alive seems to be export the virtual
> transaction id instead of the xid.  Normally we can't store things like
> virtual xids on disk, but that concern isn't valid here because exported
> snapshots are ephemeral, there's also already a precedent in
> predicate.c.
> 
> It seems like it'd be fairly easy to change things around that way, but
> maybe I'm missing something.
> 

Okay, thanks for explanation. Code-wise it does seem simple enough
indeed. I admit I don't know enough about the exported snapshots and
snapshot management in general to be able to answer the question of
safety here. That said, it does seem to me like it should work as the
exported snapshots are just on disk representation of in-memory state
that becomes invalid once the in-memory state does.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-31 Thread Andres Freund
On 2017-05-29 23:49:33 +0200, Petr Jelinek wrote:
> I am not quite sure I understand (both the vxid suggestion and for the
> session dying badly). Maybe we can discuss bit more when you get to
> computer so it's easier for you to expand on what you mean.

The xid interlock when exporting a snapshot is required because
otherwise it's not generally guaranteed that all resourced required for
the snapshot are reserved.  In the logical replication case we could
guarantee that otherwise, but there'd be weird-ish edge cases when
erroring out just after exporting a snapshot.

The problem with using the xid as that interlock is that it requires
acquiring an xid - which is something we're going to block against when
building a new catalog snapshot.  Afaict that's not entirely required -
all that we need to verify is that the snapshot in the source
transaction is still running.  The easiest way for the importer to check
that the source is still alive seems to be export the virtual
transaction id instead of the xid.  Normally we can't store things like
virtual xids on disk, but that concern isn't valid here because exported
snapshots are ephemeral, there's also already a precedent in
predicate.c.

It seems like it'd be fairly easy to change things around that way, but
maybe I'm missing something.


- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Petr Jelinek
On 29/05/17 21:44, Andres Freund wrote:
> 
> 
> On May 29, 2017 12:41:26 PM PDT, Petr Jelinek  
> wrote:
>> On 29/05/17 21:28, Andres Freund wrote:
>>>
>>> On May 29, 2017 12:25:35 PM PDT, Petr Jelinek
>>  wrote:

 Looking at the code more, the xid is only used as parameter for
 SnapBuildBuildSnapshot() which never does anything with that
>> parameter,
 I wonder if it's really needed then.
>>>
>>> Not at a computer, but by memory that'll trigger the snapshot export
>> routine to include it.  Import in turn requires the xid to check if the
>> source is still alive.  But there's better ways, e.g. using the virtual
>> xactid.
>>>
>>
>> It does, and while that's unfortunate the logical replication does not
>> actually export the snapshots. It uses the USE_SNAPSHOT option where
>> the
>> snapshot is just installed into current transaction but not exported.
>> So
>> not calling the GetTopTransactionId() would solve it at least for that
>> in-core use-case. I don't see any bad side effects from doing so yet,
>> so
>> it might be good enough solution for PG10.
> 
> In the general case you can't do so, because of vacuum and such.  Even for LR 
> we need to make sure the exporting session didn't die badly, deleting the 
> slot.  Hence suggestion to use the virtual xid.
> 

I am not quite sure I understand (both the vxid suggestion and for the
session dying badly). Maybe we can discuss bit more when you get to
computer so it's easier for you to expand on what you mean.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Andres Freund


On May 29, 2017 12:41:26 PM PDT, Petr Jelinek  
wrote:
>On 29/05/17 21:28, Andres Freund wrote:
>> 
>> On May 29, 2017 12:25:35 PM PDT, Petr Jelinek
> wrote:
>>>
>>> Looking at the code more, the xid is only used as parameter for
>>> SnapBuildBuildSnapshot() which never does anything with that
>parameter,
>>> I wonder if it's really needed then.
>> 
>> Not at a computer, but by memory that'll trigger the snapshot export
>routine to include it.  Import in turn requires the xid to check if the
>source is still alive.  But there's better ways, e.g. using the virtual
>xactid.
>> 
>
>It does, and while that's unfortunate the logical replication does not
>actually export the snapshots. It uses the USE_SNAPSHOT option where
>the
>snapshot is just installed into current transaction but not exported.
>So
>not calling the GetTopTransactionId() would solve it at least for that
>in-core use-case. I don't see any bad side effects from doing so yet,
>so
>it might be good enough solution for PG10.

In the general case you can't do so, because of vacuum and such.  Even for LR 
we need to make sure the exporting session didn't die badly, deleting the slot. 
 Hence suggestion to use the virtual xid.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Petr Jelinek
On 29/05/17 21:28, Andres Freund wrote:
> 
> On May 29, 2017 12:25:35 PM PDT, Petr Jelinek  
> wrote:
>>
>> Looking at the code more, the xid is only used as parameter for
>> SnapBuildBuildSnapshot() which never does anything with that parameter,
>> I wonder if it's really needed then.
> 
> Not at a computer, but by memory that'll trigger the snapshot export routine 
> to include it.  Import in turn requires the xid to check if the source is 
> still alive.  But there's better ways, e.g. using the virtual xactid.
> 

It does, and while that's unfortunate the logical replication does not
actually export the snapshots. It uses the USE_SNAPSHOT option where the
snapshot is just installed into current transaction but not exported. So
not calling the GetTopTransactionId() would solve it at least for that
in-core use-case. I don't see any bad side effects from doing so yet, so
it might be good enough solution for PG10.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Petr Jelinek
On 29/05/17 21:23, Andres Freund wrote:
> 
> 
> On May 29, 2017 12:21:50 PM PDT, Petr Jelinek  
> wrote:
>> On 29/05/17 20:59, Andres Freund wrote:
>>>
>>>
>>> On May 29, 2017 11:58:05 AM PDT, Petr Jelinek
>>  wrote:
 On 27/05/17 17:17, Andres Freund wrote:
>
>
> On May 27, 2017 9:48:22 AM EDT, Petr Jelinek
  wrote:
>> Actually, I guess it's the pid 47457 (COPY process) who is
>> actually
>> running the xid 73322726. In that case that's the same thing
 Masahiko
>> Sawada reported [1]. Which basically is result of snapshot builder
>> waiting for transaction to finish, that's normal if there is a
>> long
>> transaction running when the snapshot is being created (and the
>> COPY
 is
>> a long transaction).
>
> Hm.  I suspect the issue is that the exported snapshot needs an xid
 for some crosscheck, and that's what we're waiting for.  Could you
 check what happens if you don't assign one and just content the
>> error
 checks out?   Not at my computer, just theorizing.
>

 I don't think that's it, in my opinion it's the parallelization of
 table
 data copy where we create snapshot for one process but then the next
 one
 has to wait for the first one to finish. Before we fixed the
 snapshotting, the second one would just use the ondisk snapshot so
>> it
 would work fine (except the snapshot was corrupted of course). I
>> wonder
 if we could somehow give it a hint to ignore the read-only txes, but
 then we have no way to enforce the txes to stay read-only so it does
 not
 seem safe.
>>>
>>> Read-only txs have no xid ...
>>>
>>
>> That's what I mean by hinting, normally they don't but building initial
>> snapshot in snapshot builder calls GetTopTransactionId() (see
>> SnapBuildInitialSnapshot()) which will assign it xid.
> 
> That's precisely what I pointed out a few emails above, and what I suggest 
> changing.
> 

Ah didn't realize that's what you meant. I can try playing with it.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Andres Freund


On May 29, 2017 12:25:35 PM PDT, Petr Jelinek  
wrote:
>On 29/05/17 21:21, Petr Jelinek wrote:
>> On 29/05/17 20:59, Andres Freund wrote:
>>>
>>>
>>> On May 29, 2017 11:58:05 AM PDT, Petr Jelinek
> wrote:
 On 27/05/17 17:17, Andres Freund wrote:
>
>
> On May 27, 2017 9:48:22 AM EDT, Petr Jelinek
  wrote:
>> Actually, I guess it's the pid 47457 (COPY process) who is
>actually
>> running the xid 73322726. In that case that's the same thing
 Masahiko
>> Sawada reported [1]. Which basically is result of snapshot
>builder
>> waiting for transaction to finish, that's normal if there is a
>long
>> transaction running when the snapshot is being created (and the
>COPY
 is
>> a long transaction).
>
> Hm.  I suspect the issue is that the exported snapshot needs an
>xid
 for some crosscheck, and that's what we're waiting for.  Could you
 check what happens if you don't assign one and just content the
>error
 checks out?   Not at my computer, just theorizing.
>

 I don't think that's it, in my opinion it's the parallelization of
 table
 data copy where we create snapshot for one process but then the
>next
 one
 has to wait for the first one to finish. Before we fixed the
 snapshotting, the second one would just use the ondisk snapshot so
>it
 would work fine (except the snapshot was corrupted of course). I
>wonder
 if we could somehow give it a hint to ignore the read-only txes,
>but
 then we have no way to enforce the txes to stay read-only so it
>does
 not
 seem safe.
>>>
>>> Read-only txs have no xid ...
>>>
>> 
>> That's what I mean by hinting, normally they don't but building
>initial
>> snapshot in snapshot builder calls GetTopTransactionId() (see
>> SnapBuildInitialSnapshot()) which will assign it xid.
>> 
>
>Looking at the code more, the xid is only used as parameter for
>SnapBuildBuildSnapshot() which never does anything with that parameter,
>I wonder if it's really needed then.

Not at a computer, but by memory that'll trigger the snapshot export routine to 
include it.  Import in turn requires the xid to check if the source is still 
alive.  But there's better ways, e.g. using the virtual xactid.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Petr Jelinek
On 29/05/17 21:21, Petr Jelinek wrote:
> On 29/05/17 20:59, Andres Freund wrote:
>>
>>
>> On May 29, 2017 11:58:05 AM PDT, Petr Jelinek  
>> wrote:
>>> On 27/05/17 17:17, Andres Freund wrote:


 On May 27, 2017 9:48:22 AM EDT, Petr Jelinek
>>>  wrote:
> Actually, I guess it's the pid 47457 (COPY process) who is actually
> running the xid 73322726. In that case that's the same thing
>>> Masahiko
> Sawada reported [1]. Which basically is result of snapshot builder
> waiting for transaction to finish, that's normal if there is a long
> transaction running when the snapshot is being created (and the COPY
>>> is
> a long transaction).

 Hm.  I suspect the issue is that the exported snapshot needs an xid
>>> for some crosscheck, and that's what we're waiting for.  Could you
>>> check what happens if you don't assign one and just content the error
>>> checks out?   Not at my computer, just theorizing.

>>>
>>> I don't think that's it, in my opinion it's the parallelization of
>>> table
>>> data copy where we create snapshot for one process but then the next
>>> one
>>> has to wait for the first one to finish. Before we fixed the
>>> snapshotting, the second one would just use the ondisk snapshot so it
>>> would work fine (except the snapshot was corrupted of course). I wonder
>>> if we could somehow give it a hint to ignore the read-only txes, but
>>> then we have no way to enforce the txes to stay read-only so it does
>>> not
>>> seem safe.
>>
>> Read-only txs have no xid ...
>>
> 
> That's what I mean by hinting, normally they don't but building initial
> snapshot in snapshot builder calls GetTopTransactionId() (see
> SnapBuildInitialSnapshot()) which will assign it xid.
> 

Looking at the code more, the xid is only used as parameter for
SnapBuildBuildSnapshot() which never does anything with that parameter,
I wonder if it's really needed then.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Andres Freund


On May 29, 2017 12:21:50 PM PDT, Petr Jelinek  
wrote:
>On 29/05/17 20:59, Andres Freund wrote:
>> 
>> 
>> On May 29, 2017 11:58:05 AM PDT, Petr Jelinek
> wrote:
>>> On 27/05/17 17:17, Andres Freund wrote:


 On May 27, 2017 9:48:22 AM EDT, Petr Jelinek
>>>  wrote:
> Actually, I guess it's the pid 47457 (COPY process) who is
>actually
> running the xid 73322726. In that case that's the same thing
>>> Masahiko
> Sawada reported [1]. Which basically is result of snapshot builder
> waiting for transaction to finish, that's normal if there is a
>long
> transaction running when the snapshot is being created (and the
>COPY
>>> is
> a long transaction).

 Hm.  I suspect the issue is that the exported snapshot needs an xid
>>> for some crosscheck, and that's what we're waiting for.  Could you
>>> check what happens if you don't assign one and just content the
>error
>>> checks out?   Not at my computer, just theorizing.

>>>
>>> I don't think that's it, in my opinion it's the parallelization of
>>> table
>>> data copy where we create snapshot for one process but then the next
>>> one
>>> has to wait for the first one to finish. Before we fixed the
>>> snapshotting, the second one would just use the ondisk snapshot so
>it
>>> would work fine (except the snapshot was corrupted of course). I
>wonder
>>> if we could somehow give it a hint to ignore the read-only txes, but
>>> then we have no way to enforce the txes to stay read-only so it does
>>> not
>>> seem safe.
>> 
>> Read-only txs have no xid ...
>> 
>
>That's what I mean by hinting, normally they don't but building initial
>snapshot in snapshot builder calls GetTopTransactionId() (see
>SnapBuildInitialSnapshot()) which will assign it xid.

That's precisely what I pointed out a few emails above, and what I suggest 
changing.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Petr Jelinek
On 29/05/17 20:59, Andres Freund wrote:
> 
> 
> On May 29, 2017 11:58:05 AM PDT, Petr Jelinek  
> wrote:
>> On 27/05/17 17:17, Andres Freund wrote:
>>>
>>>
>>> On May 27, 2017 9:48:22 AM EDT, Petr Jelinek
>>  wrote:
 Actually, I guess it's the pid 47457 (COPY process) who is actually
 running the xid 73322726. In that case that's the same thing
>> Masahiko
 Sawada reported [1]. Which basically is result of snapshot builder
 waiting for transaction to finish, that's normal if there is a long
 transaction running when the snapshot is being created (and the COPY
>> is
 a long transaction).
>>>
>>> Hm.  I suspect the issue is that the exported snapshot needs an xid
>> for some crosscheck, and that's what we're waiting for.  Could you
>> check what happens if you don't assign one and just content the error
>> checks out?   Not at my computer, just theorizing.
>>>
>>
>> I don't think that's it, in my opinion it's the parallelization of
>> table
>> data copy where we create snapshot for one process but then the next
>> one
>> has to wait for the first one to finish. Before we fixed the
>> snapshotting, the second one would just use the ondisk snapshot so it
>> would work fine (except the snapshot was corrupted of course). I wonder
>> if we could somehow give it a hint to ignore the read-only txes, but
>> then we have no way to enforce the txes to stay read-only so it does
>> not
>> seem safe.
> 
> Read-only txs have no xid ...
> 

That's what I mean by hinting, normally they don't but building initial
snapshot in snapshot builder calls GetTopTransactionId() (see
SnapBuildInitialSnapshot()) which will assign it xid.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Andres Freund


On May 29, 2017 11:58:05 AM PDT, Petr Jelinek  
wrote:
>On 27/05/17 17:17, Andres Freund wrote:
>> 
>> 
>> On May 27, 2017 9:48:22 AM EDT, Petr Jelinek
> wrote:
>>> Actually, I guess it's the pid 47457 (COPY process) who is actually
>>> running the xid 73322726. In that case that's the same thing
>Masahiko
>>> Sawada reported [1]. Which basically is result of snapshot builder
>>> waiting for transaction to finish, that's normal if there is a long
>>> transaction running when the snapshot is being created (and the COPY
>is
>>> a long transaction).
>> 
>> Hm.  I suspect the issue is that the exported snapshot needs an xid
>for some crosscheck, and that's what we're waiting for.  Could you
>check what happens if you don't assign one and just content the error
>checks out?   Not at my computer, just theorizing.
>> 
>
>I don't think that's it, in my opinion it's the parallelization of
>table
>data copy where we create snapshot for one process but then the next
>one
>has to wait for the first one to finish. Before we fixed the
>snapshotting, the second one would just use the ondisk snapshot so it
>would work fine (except the snapshot was corrupted of course). I wonder
>if we could somehow give it a hint to ignore the read-only txes, but
>then we have no way to enforce the txes to stay read-only so it does
>not
>seem safe.

Read-only txs have no xid ...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Petr Jelinek
On 27/05/17 17:17, Andres Freund wrote:
> 
> 
> On May 27, 2017 9:48:22 AM EDT, Petr Jelinek  
> wrote:
>> Actually, I guess it's the pid 47457 (COPY process) who is actually
>> running the xid 73322726. In that case that's the same thing Masahiko
>> Sawada reported [1]. Which basically is result of snapshot builder
>> waiting for transaction to finish, that's normal if there is a long
>> transaction running when the snapshot is being created (and the COPY is
>> a long transaction).
> 
> Hm.  I suspect the issue is that the exported snapshot needs an xid for some 
> crosscheck, and that's what we're waiting for.  Could you check what happens 
> if you don't assign one and just content the error checks out?   Not at my 
> computer, just theorizing.
> 

I don't think that's it, in my opinion it's the parallelization of table
data copy where we create snapshot for one process but then the next one
has to wait for the first one to finish. Before we fixed the
snapshotting, the second one would just use the ondisk snapshot so it
would work fine (except the snapshot was corrupted of course). I wonder
if we could somehow give it a hint to ignore the read-only txes, but
then we have no way to enforce the txes to stay read-only so it does not
seem safe.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Andres Freund
On 2017-05-29 11:38:20 -0700, Jeff Janes wrote:
> Related, but not the same.  It would be nice if they didn't block, but if
> they do have to block, shouldn't it wait on a semaphore, rather than doing
> a tight loop?  It looks like maybe a latch didn't get reset when it should
> have or something.

The code certainly is trying to just block using a lock (on the xid of
the running xact), there shouldn't be any busy looping going on...
There's no latch involved, so something is certainly weird here.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-29 Thread Jeff Janes
On Sat, May 27, 2017 at 6:48 AM, Petr Jelinek 
wrote:

>
>
> Actually, I guess it's the pid 47457 (COPY process) who is actually
> running the xid 73322726. In that case that's the same thing Masahiko
> Sawada reported [1].


Related, but not the same.  It would be nice if they didn't block, but if
they do have to block, shouldn't it wait on a semaphore, rather than doing
a tight loop?  It looks like maybe a latch didn't get reset when it should
have or something.


[1]
> https://www.postgresql.org/message-id/flat/CAD21AoC2KJdavS7MFffmSsRc1dn3V
> g_0xmuc%3DUpBrZ-_MUxh-Q%40mail.gmail.com


Cheers,

Jeff


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-27 Thread Andres Freund


On May 27, 2017 9:48:22 AM EDT, Petr Jelinek  
wrote:
>Actually, I guess it's the pid 47457 (COPY process) who is actually
>running the xid 73322726. In that case that's the same thing Masahiko
>Sawada reported [1]. Which basically is result of snapshot builder
>waiting for transaction to finish, that's normal if there is a long
>transaction running when the snapshot is being created (and the COPY is
>a long transaction).

Hm.  I suspect the issue is that the exported snapshot needs an xid for some 
crosscheck, and that's what we're waiting for.  Could you check what happens if 
you don't assign one and just content the error checks out?   Not at my 
computer, just theorizing.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-27 Thread Petr Jelinek
On 27/05/17 15:44, Petr Jelinek wrote:
> On 27/05/17 01:25, Jeff Janes wrote:
>> When I create a subscription in the disabled state, and then later doing
>> "alter subscription sub enable;", on the master I sometimes get a tight
>> loop of the deadlock detector:
>>
>> (log_lock_waits is on, of course)
>>
>> deadlock_timeout is set to 1s, so I don't know why it seems to be
>> running several times per millisecond.
>>
>> .
>>
>> And so on out to "after 9616.814", when it finally acquires the lock.
>>
>> The other process, 47457, is doing the initial COPY of another table as
>> part of the same publisher/subscriber set.
>>
> 
> We lock wait for running transactions in snapshot builder while the
> snapshot is being built so I guess that's what you are seeing. I am not
> quite sure why the snapshot builder would hold the xid lock for
> prolonged period of time though, the XactLockTableWait releases the lock
> immediately after acquiring it. In fact AFAICS everything that acquires
> ShareLock on xid releases it immediately after acquiring as it's only
> used for waits.
> 

Actually, I guess it's the pid 47457 (COPY process) who is actually
running the xid 73322726. In that case that's the same thing Masahiko
Sawada reported [1]. Which basically is result of snapshot builder
waiting for transaction to finish, that's normal if there is a long
transaction running when the snapshot is being created (and the COPY is
a long transaction).

[1]
https://www.postgresql.org/message-id/flat/CAD21AoC2KJdavS7MFffmSsRc1dn3Vg_0xmuc%3DUpBrZ-_MUxh-Q%40mail.gmail.com

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical replication busy-waiting on a lock

2017-05-27 Thread Petr Jelinek
On 27/05/17 01:25, Jeff Janes wrote:
> When I create a subscription in the disabled state, and then later doing
> "alter subscription sub enable;", on the master I sometimes get a tight
> loop of the deadlock detector:
> 
> (log_lock_waits is on, of course)
> 
> deadlock_timeout is set to 1s, so I don't know why it seems to be
> running several times per millisecond.
> 
> .
> 
> And so on out to "after 9616.814", when it finally acquires the lock.
> 
> The other process, 47457, is doing the initial COPY of another table as
> part of the same publisher/subscriber set.
> 

We lock wait for running transactions in snapshot builder while the
snapshot is being built so I guess that's what you are seeing. I am not
quite sure why the snapshot builder would hold the xid lock for
prolonged period of time though, the XactLockTableWait releases the lock
immediately after acquiring it. In fact AFAICS everything that acquires
ShareLock on xid releases it immediately after acquiring as it's only
used for waits.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] logical replication busy-waiting on a lock

2017-05-26 Thread Jeff Janes
When I create a subscription in the disabled state, and then later doing
"alter subscription sub enable;", on the master I sometimes get a tight
loop of the deadlock detector:

(log_lock_waits is on, of course)

deadlock_timeout is set to 1s, so I don't know why it seems to be running
several times per millisecond.

47462 idle in transaction 2017-05-26 16:05:20.505 PDT LOG:  logical
decoding found initial starting point at 1B/7BAC9D50
47462 idle in transaction 2017-05-26 16:05:20.505 PDT DETAIL:  Waiting for
transactions (approximately 9) older than 73326615 to end.
47462 idle in transaction waiting 2017-05-26 16:05:21.505 PDT LOG:  process
47462 still waiting for ShareLock on transaction 73322726 after 1000.060 ms
47462 idle in transaction waiting 2017-05-26 16:05:21.505 PDT DETAIL:
 Process holding the lock: 47457. Wait queue: 47462.
47462 idle in transaction waiting 2017-05-26 16:05:21.506 PDT LOG:  process
47462 still waiting for ShareLock on transaction 73322726 after 1000.398 ms
47462 idle in transaction waiting 2017-05-26 16:05:21.506 PDT DETAIL:
 Process holding the lock: 47457. Wait queue: 47462.
47462 idle in transaction waiting 2017-05-26 16:05:21.506 PDT LOG:  process
47462 still waiting for ShareLock on transaction 73322726 after 1000.574 ms
47462 idle in transaction waiting 2017-05-26 16:05:21.506 PDT DETAIL:
 Process holding the lock: 47457. Wait queue: 47462.
47462 idle in transaction waiting 2017-05-26 16:05:21.506 PDT LOG:  process
47462 still waiting for ShareLock on transaction 73322726 after 1000.816 ms
47462 idle in transaction waiting 2017-05-26 16:05:21.506 PDT DETAIL:
 Process holding the lock: 47457. Wait queue: 47462.
47462 idle in transaction waiting 2017-05-26 16:05:21.506 PDT LOG:  process
47462 still waiting for ShareLock on transaction 73322726 after 1001.180 ms
47462 idle in transaction waiting 2017-05-26 16:05:21.506 PDT DETAIL:
 Process holding the lock: 47457. Wait queue: 47462.
47462 idle in transaction waiting 2017-05-26 16:05:21.507 PDT LOG:  process
47462 still waiting for ShareLock on transaction 73322726 after 1001.284 ms
47462 idle in transaction waiting 2017-05-26 16:05:21.507 PDT DETAIL:
 Process holding the lock: 47457. Wait queue: 47462.
47462 idle in transaction waiting 2017-05-26 16:05:21.507 PDT LOG:  process
47462 still waiting for ShareLock on transaction 73322726 after 1001.493 ms
.

And so on out to "after 9616.814", when it finally acquires the lock.

The other process, 47457, is doing the initial COPY of another table as
part of the same publisher/subscriber set.

Cheers,

Jeff