Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

2017-09-15 Thread i . kartyshov

Alexander Korotkov wrote 2017-09-05 00:33:

I have the following comments on this patch:
1) You shouldn't use ">=" to compare xids.  You should use
TransactionIdFollowsOrEquals() function which handles transaction id
wraparound correctly.
I fixed it, and in addition I added a GUC parameter that can turn off 
this behavior. This parameter can be set in the postgresql.conf 
configuration file or with the ALTER SYSTEM command. For 
the changes to take effect, call the pg_reload_conf() function or reload 
the database server.

Changes are in vacuum_lazy_truncate_v3.patch
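
As a minimal sketch of that check (the GUC name below is hypothetical; the 
real name in vacuum_lazy_truncate_v3.patch may differ):

/* Hypothetical boolean GUC -- the actual name in the patch may differ. */
bool		truncate_only_when_idle = true;

	if (should_attempt_truncation(vacrelstats))
	{
		/*
		 * TransactionIdFollowsOrEquals() compares xids modulo wraparound,
		 * unlike a plain ">=" on the raw values.
		 */
		if (!truncate_only_when_idle ||
			TransactionIdFollowsOrEquals(OldestXmin,
										 ShmemVariableCache->latestCompletedXid))
			lazy_truncate_heap(onerel, vacrelstats);
	}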


2) As I understood, this patch makes heap truncate only when no
concurrent transactions are running on both master and replicas with
hot_standby_feedback enabled.  For a busy system, it would literally be
"never do heap truncate after vacuum".
Yes, the difficulty is that if we have a lot of replicas and the requests 
on them overlap each other, then truncation will not occur on a loaded 
system.



It's certainly a bad idea to disable replaying
AccessExclusiveLock *every time*; we should skip taking
AccessExclusiveLock only while writing the XLOG_SMGR_TRUNCATE record.
Besides that, this code violates our coding style.
The patch (standby_truncate_skip_v2.patch) fixes this behaviour: it now 
skips writing XLOG_STANDBY_LOCK records (with the AccessExclusiveLock) 
only while autovacuum truncates the table.
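
Conceptually, the master-side change is shaped like the following sketch 
(IsAutoVacuumWorkerProcess() and LogAccessExclusiveLock() are existing 
backend functions, but the exact guard and call site in 
standby_truncate_skip_v2.patch may differ):

	/*
	 * Sketch: suppress the XLOG_STANDBY_LOCK record for the
	 * AccessExclusiveLock taken by an autovacuum worker that is about
	 * to truncate the heap, so standbys never replay that lock.
	 */
	if (XLogStandbyInfoActive() && !RecoveryInProgress() &&
		!IsAutoVacuumWorkerProcess())
		LogAccessExclusiveLock(locktag.locktag_field1,	/* database OID */
							   locktag.locktag_field2); /* relation OID */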


Amit Kapila wrote 2017-09-05 12:52:

I think one deficiency of this solution is that it will keep on
generating extra WAL even if standby doesn't need it (as standby has
successfully truncated the relation).  I don't know if we can just get
away by saying that an additional WAL record per vacuum cycle is
harmless especially when that doesn't serve any purpose (like for the
cases when standby is always able to truncate the heap on first WAL
record).  Am I missing some part of solution which avoids it?
You are right: this implementation generates extra WAL records, and in 
some cases they serve no purpose. If you have any ideas on how to avoid 
that, I'll be glad to hear them.


Simon Riggs wrote 2017-09-05 14:44:

If we defer an action on standby replay, when and by whom will it be
applied? What happens if the standby is shut down or crashes while an
action is pending?

Perhaps altering the way truncation requires an AccessExclusiveLock
would help workloads on both master and standby? If a Relation knew it
had a pending truncation then scans wouldn't need to go past newsize.
Once older lockers have gone we could simply truncate without the
lock. Would need a few pushups but seems less scary than trying to
make pending standby actions work well enough to commit.
Yes, it is the first idea that came to my mind, and the most difficult 
one. I think that in a few days I'll finish a patch implementing these 
ideas.
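
As a rough illustration of that direction (every name below is 
hypothetical; nothing like this exists in core), a relation could carry a 
pending truncation target that scans honour, and the physical truncate 
would happen later without the lock:

/* Hypothetical structure -- for illustration only. */
typedef struct PendingTruncation
{
	BlockNumber		newsize;	/* scans must not look past this block */
	TransactionId	horizon;	/* truncate once older lockers are gone */
} PendingTruncation;

/* Hypothetical helper, e.g. called when the oldest running xid advances. */
static void
maybe_apply_pending_truncation(Relation rel, PendingTruncation *pt,
							   TransactionId oldest_running_xid)
{
	/*
	 * Once no transaction that predates the truncation remains, the file
	 * can be shortened without AccessExclusiveLock, because no scan will
	 * read past newsize anyway.
	 */
	if (TransactionIdPrecedes(pt->horizon, oldest_running_xid))
		RelationTruncate(rel, pt->newsize);
}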



--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde0..a6dc369 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -31,6 +31,9 @@
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "storage/lmgr.h"
+#include "storage/procarray.h"
+#include "access/transam.h"
 
 /*
  * We keep a list of all relations (represented as RelFileNode values)
@@ -495,9 +498,21 @@ smgr_redo(XLogReaderState *record)
 		xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
 		SMgrRelation reln;
 		Relation	rel;
+		LOCKTAG		locktag;
+		VirtualTransactionId *backends;
+		VirtualTransactionId *backends2;
 
 		reln = smgropen(xlrec->rnode, InvalidBackendId);
 
+		SET_LOCKTAG_RELATION(locktag, reln->smgr_rnode.node.dbNode,
+			 reln->smgr_rnode.node.relNode);
+
+		backends = GetLockConflicts(&locktag, AccessExclusiveLock);
+		backends2 = GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid);
+
+		if (!VirtualTransactionIdIsValid(*backends) && !VirtualTransactionIdIsValid(*backends2))
+		{
+
 		/*
 		 * Forcibly create relation if it doesn't exist (which suggests that
 		 * it was dropped somewhere later in the WAL sequence).  As in
@@ -542,6 +557,10 @@ smgr_redo(XLogReaderState *record)
 			visibilitymap_truncate(rel, xlrec->blkno);
 
 		FreeFakeRelcacheEntry(rel);
+		} else
+		{
+			XLogFlush(lsn);
+		}
 	}
 	else
 		elog(PANIC, "smgr_redo: unknown op code %u", info);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..3c25907 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -61,6 +61,8 @@
 #include "utils/pg_rusage.h"
 #include "utils/timestamp.h"
 #include "utils/tqual.h"
+#include "catalog/storage_xlog.h"
+#include "access/rmgr.h"
 
 
 /*
@@ -317,6 +319,32 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 		new_rel_pages = vacrelstats->old_rel_pages;
 		new_rel_tuples = vacrelstats->old_rel_tuples;
 	}
+	else
+	{
+		if (RelationNeedsWAL(onerel))

Re: [HACKERS] make async slave to wait for lsn to be replayed

2017-09-04 Thread i . kartyshov

I forgot to include the patch in my last letter.

Craig Ringer wrote 2017-08-15 05:00:

That ship sailed a long time ago unfortunately, they're all over
pg_stat_replication and pg_replication_slots and so on. They're
already routinely used for monitoring replication lag in bytes,
waiting for a peer to catch up, etc.

--

 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


The pg_replication_slots view exists only on the master, while waitlsn is 
an asynchronous hot standby replication function. It allows us to wait 
until an insert made on the master has been replayed on the replica.


--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 01acc2e..6792eb0 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -181,6 +181,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+<!ENTITY waitlsn           SYSTEM "waitlsn.sgml">
 
 
 
diff --git a/doc/src/sgml/ref/waitlsn.sgml b/doc/src/sgml/ref/waitlsn.sgml
new file mode 100644
index 000..077e869
--- /dev/null
+++ b/doc/src/sgml/ref/waitlsn.sgml
@@ -0,0 +1,119 @@
+
+
+
+ 
+  WAITLSN
+ 
+
+ 
+  WAITLSN
+  7
+  SQL - Language Statements
+ 
+
+ 
+  WAITLSN
+  wait until target LSN has been replayed
+ 
+
+ 
+
+WAITLSN 'LSN' [ , delay ]
+WAITLSN_INFINITE 'LSN'
+WAITLSN_NO_WAIT 'LSN'
+
+ 
+
+ 
+  Description
+
+  
+   WAITLSN waits until the target LSN has
+   been replayed, with an optional delay in milliseconds (by default
+   it waits indefinitely).
+  
+
+  
+   WAITLSN provides a simple
+   interprocess LSN wait mechanism for backends on a slave
+   in a master-slave PostgreSQL replication scheme.
+  
+
+  
+   The WAITLSN_INFINITE form waits until the target LSN has
+   been replayed on the slave, with no timeout.
+  
+
+  
+   The WAITLSN_NO_WAIT form reports whether the target LSN has been replayed, without waiting.
+  
+ 
+
+ 
+  Parameters
+
+  
+   
+LSN
+
+ 
+  Target log sequence number to wait for.
+ 
+
+   
+   
+delay
+
+ 
+  Time in milliseconds to wait for the LSN to be replayed.
+ 
+
+   
+  
+ 
+
+ 
+  Notes
+
+  
+   The delay to wait for the LSN to be replayed must be an integer; by
+   default it is infinite. Waiting can be interrupted using Ctrl+C, or
+   by postmaster death.
+  
+ 
+
+ 
+  Examples
+
+  
+   Configure and execute a waitlsn from
+   psql:
+
+
+WAITLSN '0/3F07A6B1', 1;
+NOTICE:  LSN is not reached. Try to make bigger delay.
+WAITLSN
+
+WAITLSN '0/3F07A611';
+WAITLSN
+
+WAITLSN '0/3F0FF791', 50;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to make bigger delay.
+ERROR:  canceling statement due to user request
+
+
+ 
+
+ 
+  Compatibility
+
+  
+   There is no WAITLSN statement in the SQL
+   standard.
+  
+ 
+
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index 9000b3a..0c5951a 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -209,6 +209,7 @@



+   &waitlsn;
 
  
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index df4843f..9dfb59d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -147,7 +148,6 @@ const struct config_enum_entry sync_method_options[] = {
 	{NULL, 0, false}
 };
 
-
 /*
  * Although only "on", "off", and "always" are documented,
  * we accept all the likely variants of "on" and "off".
@@ -7237,6 +7237,15 @@ StartupXLOG(void)
 	break;
 }
 
+/*
+ * After updating lastReplayedEndRecPtr, set the latches in the SHMEM array
+ */
+if (XLogCtl->lastReplayedEndRecPtr >= GetMinWaitLSN())
+{
+
+	WaitLSNSetLatch(XLogCtl->lastReplayedEndRecPtr);
+}
+
 /* Else, try to fetch the next WAL record */
 record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
 			} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 4a6c99e..0d10117 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -20,6 +20,6 @@ OBJS = amcmds.o aggregatecmds.o alter.o analyze.o async.o cluster.o comment.o \
 	policy.o portalcmds.o prepare.o proclang.o publicationcmds.o \
 	schemacmds.o seclabel.o sequence.o statscmds.o subscriptioncmds.o \
 	tablecmds.o tablespace.o trigger.o tsearchcmds.o typecmds.o user.o \
-	vacuum.o vacuumlazy.o variable.o view.o
+	vacuum.o vacuumlazy.o variable.o view.o waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index bacc08e..7d6011f 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -139,7 +139,6 @@
 #include "utils/ps_status.h"
 #include "utils/timestamp.h"
 
-
 /*
  

Re: [HACKERS] make async slave to wait for lsn to be replayed

2017-09-04 Thread i . kartyshov

Hello, thank you for your comments on the main idea and the code.

On 13.03.2017 05:20, Thomas Munro wrote:
1)

First, I'll restate my view of the syntax-vs-function question:  I
think an fmgr function is the wrong place to do this, because it
doesn't work for our 2 higher isolation levels as mentioned.  Most
people probably use READ COMMITTED most of the time so the extension
version you've proposed is certainly useful for many people and I like
it, but I will vote against inclusion in core of any feature that
doesn't consider all three of our isolation levels, especially if
there is no way to extend support to other levels later.  I don't want
PostgreSQL to keep adding features that eventually force everyone to
use READ COMMITTED because they want to use feature X, Y or Z.


Waiting for an LSN is expected to be used on hot standby, i.e. read-only 
replicas.

Only READ COMMITTED and REPEATABLE READ are allowed on hot standby.


Maybe someone can think of a clever way for an extension to insert a
wait for a user-supplied LSN *before* acquiring a snapshot so it can
work for the higher levels, or maybe the hooks should go into core
PostgreSQL so that the extension can exist as an external project not
requiring a patched PostgreSQL installation, or maybe this should be
done with new core syntax that extends transaction commands.  Do other
people have views on this?


I think it is a good idea, but it is not clear to me how to do it 
properly.


2)

+wl_lsn_updated_hook(void)
+{
+uint i;
+/*
+ * After update lastReplayedEndRecPtr set Latches in SHMEM array
+ */
+if (counter_waitlsn % count_waitlsn == 0
+	|| TimestampDifferenceExceeds(time_waitlsn, GetCurrentTimestamp(), interval_waitlsn))

+{

Doesn't this mean that if you are waiting for LSN 1234, and the
primary sends that LSN and then doesn't send anything for another
hour, a standby waiting in pg_waitlsn is quite likely to skip that
update (either because of count_waitlsn or interval_waitlsn), and then
finish up waiting for a very long time?

I'm not sure if this is a good idea, but it's an idea:  You could keep
your update skipping logic, but make sure you always catch the
important case where recovery hits the end of WAL, by invoking your
callback from WaitForWALToBecomeAvailable.  It could have a boolean
parameter that means 'don't skip this one!'.  In other words, it's OK
to skip updates, but not if you know there is no more WAL available to
apply (since you have no idea how long it will be for more to arrive).

Calling GetCurrentTimestamp() at high frequency (after every WAL
record is replayed) may be a bad idea.  It's a slow system call on
some operating systems.  Can we use an existing timestamp for free,
like recoveryLastXTime?  Remembering that the motivation for using
this feature is to wait for *whole transactions* to be replayed and
become visible, there is no sensible reason to wait for random WAL
positions inside a transaction, so if you used that then you would
effectively skip all non-COMMIT records and also skip some COMMIT
records that are coming down the pipe too fast.


Yes, I applied this idea and prepared a new patch.
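
For reference, the shape suggested above could look like this sketch (the 
force flag and the call from WaitForWALToBecomeAvailable() are the 
suggestion; counter_waitlsn, count_waitlsn and WaitLSNSetLatch() are names 
from the patch):

/*
 * Sketch: called after each replayed record, and with force = true from
 * WaitForWALToBecomeAvailable() when recovery runs out of WAL to apply.
 */
static void
wl_lsn_updated_hook(XLogRecPtr replayed_upto, bool force)
{
	/* Skipping wakeups is fine while more WAL keeps arriving... */
	if (!force && counter_waitlsn++ % count_waitlsn != 0)
		return;

	/*
	 * ...but never skip at end of WAL, or a backend waiting for an
	 * already-replayed LSN could sleep until the next record arrives,
	 * which might take a very long time.
	 */
	WaitLSNSetLatch(replayed_upto);
}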

3)

+static void
+wl_own_latch(void)
+{
+SpinLockAcquire(&state->l_arr[MyBackendId].slock);
+OwnLatch(&state->l_arr[MyBackendId].latch);
+is_latch_owned = true;
+
+if (state->backend_maxid < MyBackendId)
+state->backend_maxid = MyBackendId;
+
+state->l_arr[MyBackendId].pid = MyProcPid;
+SpinLockRelease(&state->l_arr[MyBackendId].slock);
+}

What is the point of using extra latches for this?  Why not just use
the standard latch?  Syncrep and many other things do that.  I'm not
actually sure if there is ever a reason to create more latches in
regular backends.  SIGUSR1 will be delivered and set the main latch
anyway.

There are cases of specialised latches in the system, like the wal
receiver latch, and I'm reliably informed that that solves problems
like delivering a wakeup message without having to know which backend
is currently the wal receiver (a problem we might consider solving
today with a condition variable?)  I don't think anything like that
applies here.


In my case I create a bunch of shared latches, one per backend; I'll think 
about how to use standard latches in an efficient way. As for specialized 
latches on the standby, they are already busy with the WAL-replay 
functions.
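
For comparison, a minimal sketch of the standard-latch pattern described 
above (the target-LSN check and the 100 ms timeout are illustrative; 
WaitLatch()'s wait_event_info argument assumes a PostgreSQL 10 tree):

/* Sketch: wait until the standby has replayed target_lsn. */
static void
wait_for_lsn(XLogRecPtr target_lsn)
{
	for (;;)
	{
		int			rc;

		/* Re-check the condition after every wakeup. */
		if (GetXLogReplayRecPtr(NULL) >= target_lsn)
			break;

		/* The startup process's SIGUSR1 already sets MyLatch. */
		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
					   100L /* ms */, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);

		if (rc & WL_POSTMASTER_DEATH)
			proc_exit(1);

		CHECK_FOR_INTERRUPTS();
	}
}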


4)

+for (i = 0; i <= state->backend_maxid; i++)
+{
+SpinLockAcquire(&state->l_arr[i].slock);
+if (state->l_arr[i].pid != 0)
+SetLatch(&state->l_arr[i].latch);
+SpinLockRelease(&state->l_arr[i].slock);
+}

Once we get through the update-skipping logic above, we hit this loop
which acquires spinlocks for every backend one after another and sets
the latches of every backend, no matter whether they are waiting for
the LSN that has been applied.  Assuming we go with this
scan-the-whole-array approach, why not include the LSN waited for in
the array slots, 

[HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

2017-09-04 Thread i . kartyshov
Our clients complain about this issue and therefore I want to raise the 
discussion and suggest several solutions to this problem:


I. Why does PG use FATAL when ERROR is enough to release the lock that 
raises the lock conflict?

"If (RecoveryConflictPending && DoingCommandRead)"

II. Do we really need to truncate the table on the hot standby at exactly 
the same time as the truncate on the master occurs?


In my case the conflict happens when autovacuum truncates table tbl1 on 
the master while a backend on the replica is performing a long transaction 
involving the same table tbl1. This happens because truncate takes an 
AccessExclusiveLock. To tackle this issue we have several options:


1. We can postpone the truncate on the master until all the replicas 
have finished their transactions (in this case, feedback requests to the 
master should be sent frequently)

Patch 1
vacuum_lazy_truncate.patch

2. Maybe there is a way not to send the AccessExclusiveLock and not to 
truncate the table on the replica right away. We could wait a little and 
then try to truncate tbl1 on the replica again.


Here is a patch that makes the replica skip the truncate WAL record if 
some transaction using the same table is already running on the replica. 
It also forces the master to send a truncate WAL record with the actual 
table length to the replica every time autovacuum runs.

Patch 2
standby_truncate_skip_v1.patch

In this case the transaction which has been running for several hours 
won’t be interrupted because of the truncate. Even if for some reason we 
haven’t truncated table tbl1 right away, nothing terrible will happen. The 
next truncate WAL record will reduce the file to its actual length (if 
some inserts or updates have been performed on the master).


If you have any ideas or concerns about suggested solution I’ll be glad 
to hear them.

Thanks!

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..bf73a1b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -280,7 +280,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	 * Optionally truncate the relation.
 	 */
 	if (should_attempt_truncation(vacrelstats))
-		lazy_truncate_heap(onerel, vacrelstats);
+		if (OldestXmin >= ShmemVariableCache->latestCompletedXid)
+			lazy_truncate_heap(onerel, vacrelstats);
 
 	/* Report that we are now doing final cleanup */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde0..a6dc369 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -31,6 +31,9 @@
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "storage/lmgr.h"
+#include "storage/procarray.h"
+#include "access/transam.h"
 
 /*
  * We keep a list of all relations (represented as RelFileNode values)
@@ -495,9 +498,21 @@ smgr_redo(XLogReaderState *record)
 		xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
 		SMgrRelation reln;
 		Relation	rel;
+		LOCKTAG		locktag;
+		VirtualTransactionId *backends;
+		VirtualTransactionId *backends2;
 
 		reln = smgropen(xlrec->rnode, InvalidBackendId);
 
+		SET_LOCKTAG_RELATION(locktag, reln->smgr_rnode.node.dbNode,
+			 reln->smgr_rnode.node.relNode);
+
+		backends = GetLockConflicts(&locktag, AccessExclusiveLock);
+		backends2 = GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid);
+
+		if (!VirtualTransactionIdIsValid(*backends) && !VirtualTransactionIdIsValid(*backends2))
+		{
+
 		/*
 		 * Forcibly create relation if it doesn't exist (which suggests that
 		 * it was dropped somewhere later in the WAL sequence).  As in
@@ -542,6 +557,10 @@ smgr_redo(XLogReaderState *record)
 			visibilitymap_truncate(rel, xlrec->blkno);
 
 		FreeFakeRelcacheEntry(rel);
+		} else
+		{
+			XLogFlush(lsn);
+		}
 	}
 	else
 		elog(PANIC, "smgr_redo: unknown op code %u", info);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..fd91db0 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -61,6 +61,8 @@
 #include "utils/pg_rusage.h"
 #include "utils/timestamp.h"
 #include "utils/tqual.h"
+#include "catalog/storage_xlog.h"
+#include "access/rmgr.h"
 
 
 /*
@@ -317,6 +319,32 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 		new_rel_pages = vacrelstats->old_rel_pages;
 		new_rel_tuples = vacrelstats->old_rel_tuples;
 	}
+	else
+	{
+		if (RelationNeedsWAL(onerel))
+		{
+			/* Get the current relation length */
+			LockRelationForExtension(onerel, ExclusiveLock);
+
+			/*
+			 * Make an XLOG entry reporting the file truncation.
+			 */
+			XLogRecPtr	lsn;
+			xl_smgr_truncate xlrec;
+
+			xlrec.blkno = vacrelstats->rel_pages;
+			xlrec.rnode = onerel->rd_node;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+