Re: [HACKERS] pg_basebackup stream xlog to tar

2016-10-23 Thread Michael Paquier
On Sun, Oct 23, 2016 at 10:28 PM, Magnus Hagander  wrote:
> It also broke the tests and invalidated some documentation. But it was easy
> enough to fix.
>
> I've now applied this, so next time you get to do the merging :P Joking
> aside, please review and let me know if you can spot something I messed up
> in the final merge.

Just had another look at it.
+static int
+tar_fsync(Walfile f)
+{
+   Assert(f != NULL);
+   tar_clear_error();
+
+   /*
+* Always sync the whole tarfile, because that's all we can do. This makes
+* no sense on compressed files, so just ignore those.
+*/
+   if (tar_data->compression)
+   return 0;
+
+   return fsync(tar_data->fd);
+}
fsync() should not be called here if --no-sync is used.

+   /* sync the empty blocks as well, since they're after the last file */
+   fsync(tar_data->fd);
Similarly, the fsync call of tar_finish() should not happen when
--no-sync is used.

+   if (format == 'p')
+   stream.walmethod = CreateWalDirectoryMethod(param->xlog, do_sync);
+   else
+   stream.walmethod = CreateWalTarMethod(param->xlog,
compresslevel, do_sync);
LogStreamerMain() exits immediately once it is done, but I think that
we had better be tidy here and clean up the WAL methods that have been
allocated. I am thinking here about a potential retry on failure,
though the best shot in this area would be with ReceiveXlogStream().

Attached is a patch addressing those points.
-- 
Michael


pg_basebackup-tar-fixes.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] issue with track_commit_timestamp and server restart

2016-10-23 Thread Craig Ringer
On 24 October 2016 at 12:58, Craig Ringer  wrote:

> The attached patch adds a TAP test to src/test/recovery to show this.

Added to CF.

https://commitfest.postgresql.org/11/834/

... but IMO this should go in the next bugfix release.

I've also applied nearly the same fix for the same bug in the original
commit timestamp implementation in the BDR Postgres tree.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] issue with track_commit_timestamp and server restart

2016-10-23 Thread Michael Paquier
(thread hijacking)

On Mon, Oct 24, 2016 at 1:58 PM, Craig Ringer  wrote:
> To save time running the recovery suite, just
>
>rm src/test/recovery/00[0-8]*.pl
>
> (It'd be nice to have a prove_check target to run just one test file).

Agreed! Or multiple chosen files. I usually remove those files
manually from time to time when working on a fix just to run one
dedicated test repeatedly.
-- 
Michael




Re: [HACKERS] issue with track_commit_timestamp and server restart

2016-10-23 Thread Craig Ringer
On 22 October 2016 at 19:51, Julien Rouhaud  wrote:
> I just noticed that if track_commit_timestamp is enabled, the
> oldestCommitTsXid and newestCommitTsXid don't persist after a server
> restart, so you can't ask for the commit ts of a transaction that
> committed before the last server start, although the information is
> still available (unless of course a freeze occurred).  AFAICT it has
> always behaved this way.

I initially couldn't see this when testing on 8f1fb7d with a
src/test/recovery/t test script. It turns out it's OK on immediate
shutdown and crash recovery, but not on clean shutdown and restart.

The attached patch adds a TAP test to src/test/recovery to show this.
If you run the TAP test before recompiling with the fix it'll fail.
"make" to apply the fix, then re-run and it'll succeed. Or just
temporarily roll back the fix with:

git checkout HEAD^1 -- src/backend/access/transam/commit_ts.c
git reset src/backend/access/transam/commit_ts.c

and rebuild to see it fail.

To save time running the recovery suite, just

   rm src/test/recovery/00[0-8]*.pl

(It'd be nice to have a prove_check target to run just one test file).
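As a sanity check of that glob, here is a small sketch against a scratch directory (nothing touches a real source tree): `00[0-8]*.pl` removes the pre-existing tests 000 through 008 while leaving the new 009 test in place.

```shell
#!/bin/sh
# Sanity-check of the glob in a scratch directory: 00[0-8]*.pl matches
# the pre-existing tests 000-008 but leaves the new 009 test alone.
set -e
tmp=$(mktemp -d)
for i in 001 002 003 008 009; do
    touch "$tmp/${i}_dummy.pl"
done
rm "$tmp"/00[0-8]*.pl
remaining=$(ls "$tmp")
echo "$remaining"        # prints: 009_dummy.pl
rm -rf "$tmp"
```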

This would explain a few issues I've seen reported with BDR from the
community which I've so far been unable to reproduce, so thanks very
much for the report.

Álvaro, would you mind checking this and committing to HEAD and 9.6?
The commits.c change only should also be committed to 9.5, but not the
TAP test.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 485ecc6c579073cf71ac88746c53668733bc5cac Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 24 Oct 2016 12:16:41 +0800
Subject: [PATCH] Preserve access to commit timestamps across restart

An oversight in setting the boundaries of known commit timestamps during
startup caused old commit timestamps to become inaccessible after a server
restart.

Report and bugfix by Julien Rouhaud
Review and commit timestamp tests by Craig Ringer
---
 src/backend/access/transam/commit_ts.c |   2 +
 src/test/recovery/t/009_committs.pl| 103 +
 2 files changed, 105 insertions(+)
 create mode 100644 src/test/recovery/t/009_committs.pl

diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index a8d275f..7746578 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -844,6 +844,8 @@ SetCommitTsLimit(TransactionId oldestXact, TransactionId newestXact)
 	else
 	{
 		Assert(ShmemVariableCache->newestCommitTsXid == InvalidTransactionId);
+		ShmemVariableCache->oldestCommitTsXid = oldestXact;
+		ShmemVariableCache->newestCommitTsXid = newestXact;
 	}
 	LWLockRelease(CommitTsLock);
 }
diff --git a/src/test/recovery/t/009_committs.pl b/src/test/recovery/t/009_committs.pl
new file mode 100644
index 000..48f8899
--- /dev/null
+++ b/src/test/recovery/t/009_committs.pl
@@ -0,0 +1,103 @@
+# Testing of commit timestamp preservation across server restarts
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+		'postgresql.conf', qq(
+track_commit_timestamp = on
+));
+$node_master->start;
+
+my ($ret, $stdout, $stderr);
+
+($ret, $stdout, $stderr) = $node_master->psql('postgres', qq[SELECT pg_xact_commit_timestamp('0');]);
+is($ret, 3, 'getting ts of InvalidTransactionId reports error');
+like($stderr, qr/cannot retrieve commit timestamp for transaction/, 'expected error from InvalidTransactionId');
+
+($ret, $stdout, $stderr) = $node_master->psql('postgres', qq[SELECT pg_xact_commit_timestamp('1');]);
+is($ret, 3, 'getting ts of BootstrapTransactionId reports error');
+like($stderr, qr/cannot retrieve commit timestamp for transaction/, 'expected error from BootstrapTransactionId');
+
+($ret, $stdout, $stderr) = $node_master->psql('postgres', qq[SELECT pg_xact_commit_timestamp('2');]);
+is($ret, 3, 'getting ts of FrozenTransactionId reports error');
+like($stderr, qr/cannot retrieve commit timestamp for transaction/, 'expected error from FrozenTransactionId');
+
+# Since FirstNormalTransactionId will've occurred during initdb, long before we
+# enabled commit timestamps, it'll be null since we have no cts data for it but
+# cts are enabled.
+is($node_master->safe_psql('postgres', qq[SELECT pg_xact_commit_timestamp('3');]), '',
+	'committs for FirstNormalTransactionId is null');
+
+$node_master->safe_psql('postgres', qq[CREATE TABLE committs_test(x integer, y timestamp with time zone);]);
+
+my $xid = $node_master->safe_psql('postgres', qq[
+	BEGIN;
+	INSERT INTO committs_test(x, y) VALUES (1, current_timestamp);
+	SELECT txid_current();
+	COMMIT;
+]);
+
+my $before_restart_ts = $node_master->safe_psql('postgres', qq[SELECT 

Re: [HACKERS] [COMMITTERS] pgsql: Remove extra comma at end of enum list

2016-10-23 Thread Michael Paquier
(Moved to -hackers)

On Sun, Oct 23, 2016 at 10:57 PM, Magnus Hagander  wrote:
> Remove extra comma at end of enum list
>
> C99-specific feature, and wasn't intentional in the first place.
>
> Per buildfarm member mylodon

This is the second time I've seen that in the last couple of weeks.
Would it be better to use some custom CFLAGS to detect that earlier in
the review process? I have never used gcc's -std options or clang
extensions much in this area. Does anybody have any recommendations?
-- 
Michael




Re: [HACKERS] pg_basebackup stream xlog to tar

2016-10-23 Thread Michael Paquier
On Mon, Oct 24, 2016 at 1:38 PM, Andres Freund  wrote:
> On 2016-10-17 14:37:05 +0900, Michael Paquier wrote:
>> 2) Add an option to pg_xlogdump to be able to output its output to a
>> file. That would be awkward to rely on grabbing the output data from a
>> pipe... On Windows particularly. Thinking about it, would that
>> actually be useful to others? That's not a complicated patch.
>
> Hm? Just redirecting output seems less complicated? And afaik works on
> windows as well?

In the TAP suite STDOUT is already redirected to the log files. Perhaps
we could just do a SELECT FILE; to redirect the output of pg_xlogdump
temporarily into a custom location just for the sake of a test like
that.
-- 
Michael




Re: [HACKERS] [BUG] pg_basebackup from disconnected standby fails

2016-10-23 Thread Amit Kapila
On Mon, Oct 24, 2016 at 9:18 AM, Kyotaro HORIGUCHI
 wrote:
> Thank you for looking into and restating this.
>
> At Fri, 21 Oct 2016 13:02:21 -0400, Robert Haas  wrote 
> in 
>> On Sun, Oct 2, 2016 at 8:52 AM, Michael Paquier
>>  wrote:
>> >> So, if I understand correctly, then we can mark the version posted by
>> >> you upthread [1] which includes a test along with Kyotaro's fix can be
>> >> marked as Ready for committer.  If so, then please change the status
>> >> of patch accordingly.
>> >
>> > Patch moved to next CF 2016-11, still with status "ready for committer".
>> >
>> > IMO, this thread has reached as conclusion to use this patch to
>> > address the problem reported:
>> > cab7npqtv5gmkqcndofgtgqoqxz2xlz4rrw247oqojzztvy6...@mail.gmail.com
>>
>> I spent some time reading this thread today and trying to understand
>> exactly what's going on here.  Here's my attempt to restate the
>> problem:
>>
>> 1. If a restartpoint flushes no dirty buffers, then the redo location
>> advances but the minimum recovery point does not; therefore, the
>> minimum recovery point can fall behind the redo location.  That's
>> correct behavior, because if the standby is subsequently restarted, it
>> can correctly begin at the checkpoint record associated with the
>> restartpoint and need not replay any further.
>
> Yes.
>
>> 2. However, it causes a problem when a base backup with the "include
>> WAL" option is taken from a standby because -- on standby servers only
>> -- the end location for a backup as returned by do_pg_stop_backup() is
>> the minimum recovery point.  If this precedes the start point for the
>> backup -- which is the restart point -- perform_base_backup() will
>> conclude that no WAL files are required for the backup and fail an
>> internal sanity check.
>
> Yes. The option implies that the copied $PGDATA is a standalone
> backup, that is, usable to start a standby instantly so at least
> one WAL segment is needed, as mentioned in 3 below.
>
>> 3. Removing the sanity check wouldn't help, because it can never be
>> right to end up with zero WAL files as a result of a base backup.  At
>> a minimum, we need the WAL file which contains the checkpoint record
>> from which the control file specifies that redo should start.
>> Actually, I think that checkpoint record could be spread across
>> multiple files: it might be big.  We need them all, but the current
>> code won't ensure that we get them.
>
> Yes. By "checkpoint records" I mean the records written during a checkpoint.
>
>> The consensus solution on this thread seems to be that we should have
>> pg_do_stop_backup() return the last-replayed XLOG location as the
>> backup end point.  If the control file has been updated with a newer
>> redo location, then the associated checkpoint record has surely been
>> replayed, so we'll definitely have enough WAL to replay that
>> checkpoint record (and we don't need to replay anything else, because
>> we're supposing that this is the case where the minimum recovery point
>> precedes the redo location).  Although this will work, it might
>> include more WAL in the backup than is strictly necessary.  If the
>> additional WAL replayed beyond the minimum recovery point does NOT
>> include a checkpoint, we don't need any of it; if it does, we need
>> only the portion up to and including the last checkpoint record, and
>> not anything beyond that.
>
> StartupXLOG mandates the existence of the start record of the
> latest checkpoint. pg_start_backup() returns the start point
> (redo location) of the restartpoint that it requested, and
> minRecoveryPoint is always after the replayed checkpoint, so it
> works as long as everything goes well.
>
> But if the checkpoint contains no updates, the restart point
> exceeds minRecoveryPoint, and then no WAL is copied.
>
>> I can think of two solutions that would be "tighter":
>>
>> 1. When performing a restartpoint, update the minimum recovery point
>> to just beyond the checkpoint record.  I think this can't hurt anyone
>> who is actually restarting recovery on the same machine, because
>> that's exactly the point where they are going to start recovery.  A
>> minimum recovery point which precedes the point at which they will
>> start recovery is no better than one which is equal to the point at
>> which they will start recovery.
>
> This is a candidate that I thought of, but I avoided changing the
> behavior of minRecoveryPoint, which (as I understand it) describes
> only buffer state. So my understanding is that a checkpoint with no
> updates doesn't advance minRecoveryPoint.
>

I think what you are saying is not completely right, because we do
update minRecoveryPoint when we don't perform a new restart point.
When we do perform a restart point, the code assumes that flushing the
buffers will take care of updating minRecoveryPoint.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: 

Re: [HACKERS] pg_basebackup stream xlog to tar

2016-10-23 Thread Andres Freund
On 2016-10-17 14:37:05 +0900, Michael Paquier wrote:
> 2) Add an option to pg_xlogdump to be able to output its output to a
> file. That would be awkward to rely on grabbing the output data from a
> pipe... On Windows particularly. Thinking about it, would that
> actually be useful to others? That's not a complicated patch.

Hm? Just redirecting output seems less complicated? And afaik works on
windows as well?

Greetings,

Andres Freund




Re: [HACKERS] pg_basebackup stream xlog to tar

2016-10-23 Thread Michael Paquier
On Sun, Oct 23, 2016 at 10:52 PM, Michael Paquier
 wrote:
> On Sun, Oct 23, 2016 at 10:30 PM, Magnus Hagander  wrote:
>> I think both of those would be worthwhile. Just for the testability in
>> itself, but such a flag to pg_xlogdump would probably be useful in other
>> cases as well, beyond just the testing.
>
> Looking quickly at the code, it does not seem that complicated... I
> may just send patches tomorrow for all those things and be done with
> it, all of that on a new dedicated thread.

Actually, it's not that simple: pg_xlogdump emulates some of the
backend's StringInfo routines and at the end calls vprintf() to output
everything to stdout, which is ugly. The cleanest solution here would
be to make StringInfo a bit more portable and allow it in frontends,
something that may be useful for any utility playing with rm_desc. A
less clean solution would be to store a fd pointing to a file (or
stdout) in compat.c and output to it. I'd slightly prefer the first
solution, but that does not seem worth the effort just for pg_xlogdump
and one test.
-- 
Michael




Re: [HACKERS] [BUG] pg_basebackup from disconnected standby fails

2016-10-23 Thread Amit Kapila
On Fri, Oct 21, 2016 at 10:32 PM, Robert Haas  wrote:
> On Sun, Oct 2, 2016 at 8:52 AM, Michael Paquier
>  wrote:
>>> So, if I understand correctly, then we can mark the version posted by
>>> you upthread [1] which includes a test along with Kyotaro's fix can be
>>> marked as Ready for committer.  If so, then please change the status
>>> of patch accordingly.
>>
>> Patch moved to next CF 2016-11, still with status "ready for committer".
>>
>> IMO, this thread has reached as conclusion to use this patch to
>> address the problem reported:
>> cab7npqtv5gmkqcndofgtgqoqxz2xlz4rrw247oqojzztvy6...@mail.gmail.com
>
> I spent some time reading this thread today and trying to understand
> exactly what's going on here.  Here's my attempt to restate the
> problem:
>
> 1. If a restartpoint flushes no dirty buffers, then the redo location
> advances but the minimum recovery point does not; therefore, the
> minimum recovery point can fall behind the redo location.  That's
> correct behavior, because if the standby is subsequently restarted, it
> can correctly begin at the checkpoint record associated with the
> restartpoint and need not replay any further.
>
> 2. However, it causes a problem when a base backup with the "include
> WAL" option is taken from a standby because -- on standby servers only
> -- the end location for a backup as returned by do_pg_stop_backup() is
> the minimum recovery point.  If this precedes the start point for the
> backup -- which is the restart point -- perform_base_backup() will
> conclude that no WAL files are required for the backup and fail an
> internal sanity check.
>
> 3. Removing the sanity check wouldn't help, because it can never be
> right to end up with zero WAL files as a result of a base backup.  At
> a minimum, we need the WAL file which contains the checkpoint record
> from which the control file specifies that redo should start.
> Actually, I think that checkpoint record could be spread across
> multiple files: it might be big.  We need them all, but the current
> code won't ensure that we get them.
>
> The consensus solution on this thread seems to be that we should have
> pg_do_stop_backup() return the last-replayed XLOG location as the
> backup end point.  If the control file has been updated with a newer
> redo location, then the associated checkpoint record has surely been
> replayed, so we'll definitely have enough WAL to replay that
> checkpoint record (and we don't need to replay anything else, because
> we're supposing that this is the case where the minimum recovery point
> precedes the redo location).  Although this will work, it might
> include more WAL in the backup than is strictly necessary.  If the
> additional WAL replayed beyond the minimum recovery point does NOT
> include a checkpoint, we don't need any of it; if it does, we need
> only the portion up to and including the last checkpoint record, and
> not anything beyond that.
>

You are right that it will include more WAL than strictly necessary,
but that can happen today as well, because minRecoveryPoint could be
updated after you have established the restart point for
do_pg_start_backup().  Are you saying that, even without the patch, in
some cases we already include more WAL in the backup than is strictly
necessary, so it is better to improve the situation?

> I can think of two solutions that would be "tighter":
>
> 1. When performing a restartpoint, update the minimum recovery point
> to just beyond the checkpoint record.  I think this can't hurt anyone
> who is actually restarting recovery on the same machine, because
> that's exactly the point where they are going to start recovery.  A
> minimum recovery point which precedes the point at which they will
> start recovery is no better than one which is equal to the point at
> which they will start recovery.
>

I think this will work, but it will cause an unnecessary control file
update in the cases where the buffer flush would do it anyway.  It can
work if we do the update only when required (minRecoveryPoint is less
than the last checkpoint) after flushing the buffers.

> 2. In perform_base_backup(), if the endptr returned by
> do_pg_stop_backup() precedes the end of the checkpoint record returned
> by do_pg_start_backup(), use the latter value instead.  Unfortunately,
> that's not so easy: we can't just say if (endptr < starptr) startptr =
> endptr; because startptr is the *start* of the checkpoint record, not
> the end.  I suspect somebody could figure out a solution to this
> problem, though.
>

With this approach, don't we need something similar for the
pg_stop_backup()/pg_stop_backup_v2() APIs?


> If we decide we don't want to aim for one of these tighter solutions
> and just adopt the previously-discussed patch, then at the very least
> it needs better comments.
>

+1.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] FSM corruption leading to errors

2016-10-23 Thread Michael Paquier
On Sat, Oct 22, 2016 at 7:31 AM, Michael Paquier
 wrote:
> On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby  wrote:
>> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>>
>>> 2) If anything is found, stop the server and delete the files manually.
>>> 3) Re-start the server.
>>> OK, that's troublesome and costly for large relations, but we know
>>> that's the safest way to go for any versions, and there is no need to
>>> complicate the code with any one-time repairing extensions.
>>
>> Wouldn't you need to run around to all your replicas and do that as well?
>
> Yeah...

Release notes refer to this page for methods to fix corrupted instances:
https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
I just wrote something up based on Pavan's method above; feel free to
jump in and improve it as necessary.
-- 
Michael




Re: [HACKERS] [BUG] pg_basebackup from disconnected standby fails

2016-10-23 Thread Kyotaro HORIGUCHI
Thank you for looking into and restating this.

At Fri, 21 Oct 2016 13:02:21 -0400, Robert Haas  wrote 
in 
> On Sun, Oct 2, 2016 at 8:52 AM, Michael Paquier
>  wrote:
> >> So, if I understand correctly, then we can mark the version posted by
> >> you upthread [1] which includes a test along with Kyotaro's fix can be
> >> marked as Ready for committer.  If so, then please change the status
> >> of patch accordingly.
> >
> > Patch moved to next CF 2016-11, still with status "ready for committer".
> >
> > IMO, this thread has reached as conclusion to use this patch to
> > address the problem reported:
> > cab7npqtv5gmkqcndofgtgqoqxz2xlz4rrw247oqojzztvy6...@mail.gmail.com
> 
> I spent some time reading this thread today and trying to understand
> exactly what's going on here.  Here's my attempt to restate the
> problem:
> 
> 1. If a restartpoint flushes no dirty buffers, then the redo location
> advances but the minimum recovery point does not; therefore, the
> minimum recovery point can fall behind the redo location.  That's
> correct behavior, because if the standby is subsequently restarted, it
> can correctly begin at the checkpoint record associated with the
> restartpoint and need not replay any further.

Yes.

> 2. However, it causes a problem when a base backup with the "include
> WAL" option is taken from a standby because -- on standby servers only
> -- the end location for a backup as returned by do_pg_stop_backup() is
> the minimum recovery point.  If this precedes the start point for the
> backup -- which is the restart point -- perform_base_backup() will
> conclude that no WAL files are required for the backup and fail an
> internal sanity check.

Yes. The option implies that the copied $PGDATA is a standalone
backup, that is, usable to start a standby instantly so at least
one WAL segment is needed, as mentioned in 3 below.

> 3. Removing the sanity check wouldn't help, because it can never be
> right to end up with zero WAL files as a result of a base backup.  At
> a minimum, we need the WAL file which contains the checkpoint record
> from which the control file specifies that redo should start.
> Actually, I think that checkpoint record could be spread across
> multiple files: it might be big.  We need them all, but the current
> code won't ensure that we get them.

Yes. By "checkpoint records" I mean the records written during a checkpoint.

> The consensus solution on this thread seems to be that we should have
> pg_do_stop_backup() return the last-replayed XLOG location as the
> backup end point.  If the control file has been updated with a newer
> redo location, then the associated checkpoint record has surely been
> replayed, so we'll definitely have enough WAL to replay that
> checkpoint record (and we don't need to replay anything else, because
> we're supposing that this is the case where the minimum recovery point
> precedes the redo location).  Although this will work, it might
> include more WAL in the backup than is strictly necessary.  If the
> additional WAL replayed beyond the minimum recovery point does NOT
> include a checkpoint, we don't need any of it; if it does, we need
> only the portion up to and including the last checkpoint record, and
> not anything beyond that.

StartupXLOG mandates the existence of the start record of the
latest checkpoint. pg_start_backup() returns the start point
(redo location) of the restartpoint that it requested, and
minRecoveryPoint is always after the replayed checkpoint, so it
works as long as everything goes well.

But if the checkpoint contains no updates, the restart point
exceeds minRecoveryPoint, and then no WAL is copied.

> I can think of two solutions that would be "tighter":
> 
> 1. When performing a restartpoint, update the minimum recovery point
> to just beyond the checkpoint record.  I think this can't hurt anyone
> who is actually restarting recovery on the same machine, because
> that's exactly the point where they are going to start recovery.  A
> minimum recovery point which precedes the point at which they will
> start recovery is no better than one which is equal to the point at
> which they will start recovery.

This is a candidate that I thought of, but I avoided changing the
behavior of minRecoveryPoint, which (as I understand it) describes
only buffer state. So my understanding is that a checkpoint with no
updates doesn't advance minRecoveryPoint.

> 2. In perform_base_backup(), if the endptr returned by
> do_pg_stop_backup() precedes the end of the checkpoint record returned
> by do_pg_start_backup(), use the latter value instead.  Unfortunately,
> that's not so easy: we can't just say if (endptr < starptr) startptr =
> endptr; because startptr is the *start* of the checkpoint record, not
> the end.  I suspect somebody could figure out a solution to this
> problem, though.

Yes, and the reason I rejected it was that it is not logical,

[HACKERS] Add radiustimeout parameter for RADIUS HBA

2016-10-23 Thread Samuel D. Leslie
Hello everyone,

I’d like to submit the attached patch for feedback from the PostgreSQL 
community and potential future inclusion in the codebase. The patch adds a new 
parameter to the RADIUS authentication method named “radiustimeout”, allowing 
the database administrator to configure the timeout in seconds to wait for 
responses from a configured RADIUS server. Until now, this has been hardcoded 
to three seconds by the RADIUS_TIMEOUT define in auth.c. While this is usually 
sufficient for typical RADIUS server configurations, there are some more 
unusual configurations where a higher timeout is required. Examples include:
 - Authenticating against a RADIUS server over a high latency link
 - Authenticating against a RADIUS server that is performing additional 
out-of-band authentication

The latter case is applicable to a server I admin and spurred the development 
of this patch. We implemented multi-factor authentication for user access to a 
sensitive database via a RADIUS server implementation which performs the 
standard username & password verification, and if it succeeds, subsequently 
performs a second factor of authentication via a configured mobile app. The 
RADIUS response confirming successful authentication is only returned after 
both authentication factors have completed. In our deployment, a timeout of 60 
seconds seems to work well, but certainly three seconds is not at all workable.
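For illustration, a pg_hba.conf line using the proposed parameter might look like the following. The syntax mirrors the existing radiusserver/radiussecret options; radiustimeout is the new option this patch proposes, and the host names and values are placeholders.

```
# Multi-factor RADIUS auth that needs a generous response timeout:
host  sensitivedb  all  10.0.0.0/8  radius  radiusserver=radius.example.com radiussecret=topsecret radiustimeout=60
```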

Thanks in advance for any and all feedback.


Kind regards,
-SDL


radiustimeout.patch
Description: radiustimeout.patch



[HACKERS] Minor code improvement to postgresGetForeignJoinPaths

2016-10-23 Thread Tatsuro Yamada
Hi,

The last argument of create_foreignscan_path called by
postgresGetForeignJoinPaths is set to NULL, but it would be more
suitable to set it to NIL, because the argument's type is List.

Please find attached a patch.

Tatsuro Yamada
NTT Open Source Software Center
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index daf0438..a5de611 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -4331,7 +4331,7 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	   NIL,		/* no pathkeys */
 	   NULL,	/* no required_outer */
 	   epq_path,
-	   NULL);	/* no fdw_private */
+	   NIL);	/* no fdw_private */
 
 	/* Add generated path into joinrel by add_path(). */
 	add_path(joinrel, (Path *) joinpath);



Re: [HACKERS] LLVM Address Sanitizer (ASAN) and valgrind support

2016-10-23 Thread Noah Misch
On Thu, Oct 20, 2016 at 06:14:28PM +0100, Greg Stark wrote:
> On Oct 20, 2016 5:27 PM, "Noah Misch"  wrote:
> > On Wed, Oct 19, 2016 at 11:08:39AM +0100, Greg Stark wrote:
> > > The MEMPOOL_FREE doesn't take any size argument and mcxt.c doesn't
> > > have convenient access to a size argument. It could call the
> > > GetChunkSpace method but that will include the allocation overhead and
> >
> > That is indeed a problem for making VALGRIND_MEMPOOL_FREE() an alias of
> > VALGRIND_MAKE_MEM_NOACCESS() under ASAN as I suggested.  Calling
> > GetMemoryChunkSpace() in the macro would cause memdebug.h to embed an
> > assumption of mcxt.c, which is messy.  Including the allocation overhead is
> > fine, though.
> 
> I think the way out is to simply have aset.c make the memory
> undefined/noaccess even if it's redundant under valgrind. It's a bit
> unfortunate that the macros would have different semantics under the two.
> If it's too confusing or we're worried about the performance overhead we
> could make a MAKE_MEM_{UNDEFINED,NOACCESS}_IF_NO_MEMPOOL() but I don't
> think it's worth it myself.

I don't expect much performance overhead.  When I last benchmarked Valgrind
Memcheck of "make installcheck", a !USE_VALGRIND build (no client requests at
all) saved about 5% of runtime.  A single new client request should be cheap
enough.  (Marking otherwise-redundant calls may still be good, though.)

> There are a couple build oddities both with gcc and clang before I can
> commit anything though. And I can't test valgrind to be sure the redundant
> calls aren't causing a problem.

When you submit your patch to a CommitFest, mention that you're blocked on
having a reviewer who can test Valgrind.  Many reviewers can help.

Thanks,
nm




Re: [HACKERS] On conflict update & hint bits

2016-10-23 Thread Peter Geoghegan
On Sun, Oct 23, 2016 at 2:46 PM, Tom Lane  wrote:
> What's bothering me is (a) it's less than 24 hours to release wrap and
> (b) this patch changes SSI-relevant behavior and hasn't been approved
> by Kevin.  I'm not familiar enough with that logic to commit a change
> in it on my own authority, especially with so little time for problems
> to be uncovered.

I should point out that I learned that the next set of point releases
had been moved forward much later than you did. I had to work on this fix
during the week, which was pretty far from ideal for me for my own
reasons.

> I'm okay with adding an explicit buffer lock/unlock pair, and in fact
> plan to go have a look at that in a bit.  I'm not okay with doing a
> refactoring that might change the behavior in more ways than that
> under these circumstances.

Fair enough. As long as we do that much, I'm comfortable.

-- 
Peter Geoghegan




Re: [HACKERS] On conflict update & hint bits

2016-10-23 Thread Tom Lane
Peter Geoghegan  writes:
> On Sun, Oct 23, 2016 at 1:42 PM, Tom Lane  wrote:
>> "Rarely" is not "never".  A bigger problem though is that heap_fetch
>> does more than just lock the buffer: there are also PredicateLockTuple
>> and CheckForSerializableConflictOut calls in there.  It's possible that
>> those are no-ops in this usage (because after all we already fetched
>> the tuple once), or maybe they're even desirable because they would help
>> resolve Kevin's concerns.  But it hasn't been analyzed and so I'm not
>> prepared to risk a behavioral change in this already under-tested area
>> the day before a release wrap.

> I'm surprised at how you've assessed the risk here.

What's bothering me is (a) it's less than 24 hours to release wrap and
(b) this patch changes SSI-relevant behavior and hasn't been approved
by Kevin.  I'm not familiar enough with that logic to commit a change
in it on my own authority, especially with so little time for problems
to be uncovered.

I'm okay with adding an explicit buffer lock/unlock pair, and in fact
plan to go have a look at that in a bit.  I'm not okay with doing a
refactoring that might change the behavior in more ways than that
under these circumstances.

regards, tom lane




Re: [HACKERS] On conflict update & hint bits

2016-10-23 Thread Peter Geoghegan
On Sun, Oct 23, 2016 at 1:42 PM, Tom Lane  wrote:
> "Rarely" is not "never".  A bigger problem though is that heap_fetch
> does more than just lock the buffer: there are also PredicateLockTuple
> and CheckForSerializableConflictOut calls in there.  It's possible that
> those are no-ops in this usage (because after all we already fetched
> the tuple once), or maybe they're even desirable because they would help
> resolve Kevin's concerns.  But it hasn't been analyzed and so I'm not
> prepared to risk a behavioral change in this already under-tested area
> the day before a release wrap.

The heap_fetch() contract doesn't ask its callers to consider any of
that. Besides, those actions (PredicateLockTuple +
CheckForSerializableConflictOut) are very probably redundant no-ops as
you say. (They won't help with Kevin's additional concerns, BTW,
because Kevin is concerned about excessive serialization failures with
SSI.)

> AFAICT, it's very hard to get to an actual failure from the lack of
> buffer lock here.  You would need to have a situation where the tuple
> was not hintable when originally fetched but has become so by the time
> ON CONFLICT is rechecking it.  That's possible, eg if we're using
> async commit and the previous transaction's commit record has gotten
> flushed to disk in between, but it's not likely.  Even then, no actual
> problem would ensue (in non-assert builds) from setting a hint bit without
> buffer lock, except in very hard-to-hit race conditions.  So the buffer
> lock issue doesn't seem urgent enough to me to justify making a fix under
> time pressure.
>
> The business about not throwing a serialization failure due to actions
> of our own xact does seem worth fixing now, but if I understand correctly,
> we can deal with that by adding a test for xmin-is-our-own-xact into
> the existing code structure.  I propose doing that much (and adding
> Munro's regression test case) and calling it good for today.

I'm surprised at how you've assessed the risk here. I think that the
risk of not proceeding with fixing the buffer lock issue is greater
than the risk of breaking something else with the proposed fix. Even
if the former risk isn't such a big one.

If there are a non-trivial number of users that are particularly
attached to the precise behavior of higher isolation levels in master
(assuming it would be altered by the proposed fix), then it's
surprising that it took so many months for someone to complain about
the ON CONFLICT DO NOTHING behavior being blatantly broken.

-- 
Peter Geoghegan




Re: [HACKERS] On conflict update & hint bits

2016-10-23 Thread Tom Lane
Peter Geoghegan  writes:
> On Sat, Oct 22, 2016 at 9:38 AM, Tom Lane  wrote:
>> 1. Why are you replacing ExecOnConflictUpdate's ExecCheckHeapTupleVisible
>> call with ExecCheckTIDVisible?  That appears to demand a re-fetch of the
>> tuple ExecOnConflictUpdate already has its hands on.  Wouldn't it be
>> better to just put a buffer lock/unlock around the existing code?

> I thought that that was less than idiomatic within nodeModifyTable.c
> -- executor modules rarely directly acquire buffer locks.

"Rarely" is not "never".  A bigger problem though is that heap_fetch
does more than just lock the buffer: there are also PredicateLockTuple
and CheckForSerializableConflictOut calls in there.  It's possible that
those are no-ops in this usage (because after all we already fetched
the tuple once), or maybe they're even desirable because they would help
resolve Kevin's concerns.  But it hasn't been analyzed and so I'm not
prepared to risk a behavioral change in this already under-tested area
the day before a release wrap.

AFAICT, it's very hard to get to an actual failure from the lack of
buffer lock here.  You would need to have a situation where the tuple
was not hintable when originally fetched but has become so by the time
ON CONFLICT is rechecking it.  That's possible, eg if we're using
async commit and the previous transaction's commit record has gotten
flushed to disk in between, but it's not likely.  Even then, no actual
problem would ensue (in non-assert builds) from setting a hint bit without
buffer lock, except in very hard-to-hit race conditions.  So the buffer
lock issue doesn't seem urgent enough to me to justify making a fix under
time pressure.

The business about not throwing a serialization failure due to actions
of our own xact does seem worth fixing now, but if I understand correctly,
we can deal with that by adding a test for xmin-is-our-own-xact into
the existing code structure.  I propose doing that much (and adding
Munro's regression test case) and calling it good for today.

regards, tom lane




Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-10-23 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera
>  wrote:
> > Robert Haas wrote:
> >> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane  wrote:
> >> >> This seems like a might subtle thing to backpatch. If we really want to
> >> >> go there, ISTM that the relevant code should stew in an unreleased
> >> >> branch for a while, before being backpatched.
> >> >
> >> > I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
> >> > awhile.  If it survives six months or so then we could discuss it again.
> >>
> >> I agree with Tom.
> >
> > Okay, several months have passed with this in the development branch and
> > now seems a good time to backpatch this all the way back to 9.4.
> >
> > Any objections?
> 
> Although the code has now been in the tree for six months, it's only
> been in a released branch for three weeks, which is something about
> which to think.

The objection above was "stew in an unreleased branch", which it has.

> I guess what's really bothering me is that I don't think this is a bug
> fix.  It seems like a performance optimization.  And I think that we
> generally have a policy that we don't back-patch performance
> optimizations.  Of course, there is some fuzziness because when the
> performance gets sufficiently bad, we sometimes decide that it amounts
> to a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7.  Maybe this
> case qualifies for similar treatment, but I'm not sure.

I have seen a number of situations in which the standby strangely lags
behind, seemingly at random and sometimes for very long periods of time,
without any apparent cause or possible remedy, and it makes the standby
unusable.  If the user happens to monitor the lag they may notice it,
and some may decide not to run certain queries.  In other cases the I/O
load is so bad that queries that otherwise run without a hitch are stuck
for a long time.

In my opinion, this is a serious problem.

> Why are you talking about back-patching this to 9.4 rather than all
> supported branches?

As far as I understand this is dependent on catalog scans being MVCC,
so it cannot be backpatched any further than that.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_xlog error on the master

2016-10-23 Thread Venkata B Nagothi
On Sunday, 23 October 2016, Michael Paquier 
wrote:

> On Sun, Oct 23, 2016 at 5:05 PM, Venkata B Nagothi 
> wrote:
> > I just did did a "git pull" to test one of my patches and i get the
> > following error :
> >
> > 2016-10-23 18:51:47.679 AEDT [31930] FATAL:  could not open archive
> status
> > directory "pg_xlog/archive_status": No such file or directory
> > 2016-10-23 18:51:47.679 AEDT [31841] LOG:  archiver process (PID 31930)
> > exited with exit code 1
> >
> > is it expected ?
>
> No.
>
> > I notice that pg_xlog's name has been changed to pg_wal. I am not sure
> about
> > this.
>
> WAL archiving works correctly here, and tests in src/test/recovery/
> work. Are you sure that you cleaned up up your source tree before
> recompiling?


Oops. My bad. All works fine now. Sorry for the confusion.

Thanks,
Venkata B N


[HACKERS] Assertion failures due to testing visibility without buffer lock

2016-10-23 Thread Tom Lane
Looking at the issue raised in
https://www.postgresql.org/message-id/flat/57EE93C8.8080504%40postgrespro.ru
prompted me to wonder whether there were any other places in which we
were calling tqual.c routines without holding buffer content lock.
I tried adding

Assert(BufferIsLocal(buffer) ||
       LWLockHeldByMe(BufferDescriptorGetContentLock(
                          GetBufferDescriptor(buffer - 1))));

to all the tqual.c tuple-visibility-test routines, and sure enough a bunch
of regression tests fell over, as a result of this bit in RI_FKey_check:

/*
 * We should not even consider checking the row if it is no longer valid,
 * since it was either deleted (so the deferred check should be skipped)
 * or updated (in which case only the latest version of the row should be
 * checked).  Test its liveness according to SnapshotSelf.
 *
 * NOTE: The normal coding rule is that one must acquire the buffer
 * content lock to call HeapTupleSatisfiesVisibility.  We can skip that
 * here because we know that AfterTriggerExecute just fetched the tuple
 * successfully, so there cannot be a VACUUM compaction in progress on the
 * page (either heap_fetch would have waited for the VACUUM, or the
 * VACUUM's LockBufferForCleanup would be waiting for us to drop pin). And
 * since this is a row inserted by our open transaction, no one else can
 * be entitled to change its xmin/xmax.
 */
Assert(new_row_buf != InvalidBuffer);
if (!HeapTupleSatisfiesVisibility(new_row, SnapshotSelf, new_row_buf))
return PointerGetDatum(NULL);

Now, you'll notice that that bit of argumentation fails to cover the
question of hint bits.  It might still be okay, if HeapTupleSatisfiesSelf
would never attempt to set a hint bit on a row inserted by our own open
transaction ... but a quick perusal of that function shows that it will
attempt to do so, if xmax points to an aborted subtransaction.  That
led me to try this test case:

create table pp(f1 int primary key);
insert into pp values(1);
create table cc(f1 int references pp deferrable initially deferred);
begin;
insert into cc values(1);
savepoint x;
delete from cc;
rollback to x;
commit;

and sure enough, that dies with an assertion failure in
MarkBufferDirtyHint, or predecessor code, in every release back to 8.2
or thereabouts.

We might be able to band-aid around that by not attempting to set the
hint bit in the "deleting subtransaction must have aborted" path in
HeapTupleSatisfiesSelf --- but it's easy to think of cases where not
doing so would be a net loss performance-wise.  And TBH that code in
RI_FKey_check seems to me to be cantilevered way too far over the
canyon's edge anyway, especially when one compares the cost of one buffer
lock cycle to everything else the function is going to do.  I think
we should just make it take the buffer lock and be done.
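In code terms, that minimal fix would amount to something like the following
fragment (a sketch against PostgreSQL internals, not compilable on its own,
and not necessarily the exact form a committed fix would take):

```c
/* Sketch: hold the buffer content lock across the SnapshotSelf test,
 * since HeapTupleSatisfiesVisibility() may try to set hint bits. */
LockBuffer(new_row_buf, BUFFER_LOCK_SHARE);
valid = HeapTupleSatisfiesVisibility(new_row, SnapshotSelf, new_row_buf);
LockBuffer(new_row_buf, BUFFER_LOCK_UNLOCK);
if (!valid)
    return PointerGetDatum(NULL);
```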

BTW, I was also able to reproduce the problem complained of in the
aforementioned thread, in an existing test in postgres_fdw:

#2  0x00808cdd in ExceptionalCondition (
conditionName=, errorType=, 
fileName=, lineNumber=)
at assert.c:54
#3  0x0083e3a4 in HeapTupleSatisfiesMVCC (htup=, 
snapshot=0x2198b70, buffer=335) at tqual.c:981
#4  0x0060635c in ExecCheckHeapTupleVisible (
estate=, tuple=, 
buffer=) at nodeModifyTable.c:197
#5  0x0060845f in ExecCheckTIDVisible (node=0x21bc6e0)
at nodeModifyTable.c:222
#6  ExecInsert (node=0x21bc6e0) at nodeModifyTable.c:413
#7  ExecModifyTable (node=0x21bc6e0) at nodeModifyTable.c:1496
#8  0x005ebf98 in ExecProcNode (node=0x21bc6e0) at execProcnode.c:396
#9  0x005e99ba in ExecutePlan (queryDesc=0x21b3908, 
direction=, count=0) at execMain.c:1567
#10 standard_ExecutorRun (queryDesc=0x21b3908, 
direction=, count=0) at execMain.c:338

(gdb) p debug_query_string 
$1 = 0x21b36f8 "INSERT INTO \"S 1\".\"T 1\"(\"C 1\", c2, c3, c4, c5, c6, c7, 
c8) VALUES ($1, $2, $3, $4, $5, $6, $7, $8) ON CONFLICT DO NOTHING"

So we do have coverage of this case, sort of.  I don't think I want
to propose putting in these Asserts as a permanent test, because
it seems pretty ugly to have such code outside bufmgr.c (it required
adding #include "storage/buf_internals.h" to tqual.c).  On the other
hand, if we'd had them we would never have shipped either of these bugs.

regards, tom lane




Re: [HACKERS] condition variables

2016-10-23 Thread Peter Geoghegan
On Tue, Oct 4, 2016 at 12:12 PM, Thomas Munro
 wrote:
> Here's a rebased patch.  ConditionVariableSleep now takes
> wait_event_info.  Anyone using this in patches for core probably needs
> to add enumerators to the WaitEventXXX enums in pgstat.h to describe
> their wait points.

I think that there are at least 2 patches from EDB people that are
already dependent on this one. I myself could easily adopt parallel
CREATE INDEX to use it, too. Why the continued delay in committing it?
I think it's fairly clear that this mechanism is widely useful.


-- 
Peter Geoghegan




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-23 Thread Petr Jelinek
On 23/10/16 16:26, Tomas Vondra wrote:
> On 10/22/2016 08:30 PM, Tomas Vondra wrote:
>> On 10/20/2016 04:43 PM, Robert Haas wrote:
>>>
>>> ...
>>>
>>> The sb_alloc allocator I proposed a couple of years ago would work
>>> well for this case, I think.
>>>
>>
>> Maybe, but it does not follow the Memory Context design at all, if I
>> understand it correctly. I was willing to give it a spin anyway and see
>> how it compares to the two other allocators, but this is a significant
>> paradigm shift and certainly much larger step than what I proposed.
>>
>> I'm not even sure it's possible to implement a MemoryContext based on
>> the same ideas as sb_alloc(), because one of the important points of
>> sb_alloc design seems to be throwing away the chunk header. While that
>> may be possible, it would certainly affect the whole tree (not just the
>> reorderbuffer bit), and it'd require way more work.
>>
>> Moreover, the two allocators I proposed significantly benefit from the
>> "same lifespan" assumption. I don't think sb_alloc can do that.
>>
> 
> I've given the sb_alloc patch another try - essentially hacking it into
> reorderbuffer, ignoring the issues mentioned yesterday. And yes, it's
> faster than the allocators discussed in this thread. Based on a few very
> quick tests on my laptop, the difference is usually ~5-10%.
> 
> That might seem like a significant improvement, but it's negligible
> compared to the "master -> slab/gen" improvement, which improves
> performance by orders of magnitude (at least for the tested cases).
> 
> Moreover, the slab/gen allocators proposed here seem like a better fit
> for reorderbuffer, e.g. because they release memory. I haven't looked at
> sb_alloc too closely, but I think it behaves more like AllocSet in this
> regard (i.e. keeping the memory indefinitely).
> 

For reorderbuffer, from what I've seen in practice, I'd prefer proper
freeing to a 5% performance gain, as I have seen walsenders taking GBs
of memory due to reorderbuffer allocations that are never properly freed.

About your actual patch: I do like both the Slab and the Gen allocators
and think that we should proceed with them for the moment. You
definitely need to rename the Gen one (don't ask me to what, though), as
it sounds like "generic", and do some finishing touches, but I think
it's the way to go. I don't see any point in GenSlab anymore.
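For readers following along, the core idea of a slab allocator — all chunks
the same size, freed chunks recycled through a free list, so both alloc and
free are O(1) and memory is genuinely reusable — can be sketched in a few
dozen lines. This is an illustration only; the patch's SlabContext
integrates with PostgreSQL's MemoryContext API and differs in the details
(block management, header layout), and all names here are invented:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Illustrative slab allocator: every chunk has the same size, and freed
 * chunks are recycled through a singly linked free list.
 */
typedef struct SlabChunk
{
	struct SlabChunk *next;		/* valid only while on the free list */
} SlabChunk;

typedef struct Slab
{
	size_t		chunk_size;		/* fixed chunk size, >= sizeof(SlabChunk) */
	int			chunks_per_block;
	SlabChunk  *freelist;		/* chunks available for reuse */
	void	  **blocks;			/* malloc'd blocks, kept for cleanup */
	int			nblocks;
} Slab;

static Slab *
slab_create(size_t chunk_size, int chunks_per_block)
{
	Slab	   *slab = calloc(1, sizeof(Slab));

	slab->chunk_size = chunk_size > sizeof(SlabChunk) ? chunk_size : sizeof(SlabChunk);
	slab->chunks_per_block = chunks_per_block;
	return slab;
}

static void *
slab_alloc(Slab *slab)
{
	SlabChunk  *chunk;

	if (slab->freelist == NULL)
	{
		/* no free chunk: carve a fresh block into chunks */
		char	   *block = malloc(slab->chunk_size * slab->chunks_per_block);
		int			i;

		slab->blocks = realloc(slab->blocks, sizeof(void *) * (slab->nblocks + 1));
		slab->blocks[slab->nblocks++] = block;
		for (i = 0; i < slab->chunks_per_block; i++)
		{
			chunk = (SlabChunk *) (block + i * slab->chunk_size);
			chunk->next = slab->freelist;
			slab->freelist = chunk;
		}
	}
	chunk = slab->freelist;
	slab->freelist = chunk->next;
	return chunk;
}

static void
slab_free(Slab *slab, void *ptr)
{
	SlabChunk  *chunk = (SlabChunk *) ptr;

	chunk->next = slab->freelist;	/* freed memory is immediately reusable */
	slab->freelist = chunk;
}

static void
slab_destroy(Slab *slab)
{
	int			i;

	for (i = 0; i < slab->nblocks; i++)
		free(slab->blocks[i]);
	free(slab->blocks);
	free(slab);
}
```

The same-size assumption is what makes this shape possible: no per-chunk
size header is needed while a chunk is allocated, and a freed chunk can be
handed straight to the next request.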

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-23 Thread Tomas Vondra

On 10/22/2016 08:30 PM, Tomas Vondra wrote:
> On 10/20/2016 04:43 PM, Robert Haas wrote:
>>
>> ...
>>
>> The sb_alloc allocator I proposed a couple of years ago would work
>> well for this case, I think.
>>
> 
> Maybe, but it does not follow the Memory Context design at all, if I
> understand it correctly. I was willing to give it a spin anyway and see
> how it compares to the two other allocators, but this is a significant
> paradigm shift and certainly much larger step than what I proposed.
> 
> I'm not even sure it's possible to implement a MemoryContext based on
> the same ideas as sb_alloc(), because one of the important points of
> sb_alloc design seems to be throwing away the chunk header. While that
> may be possible, it would certainly affect the whole tree (not just the
> reorderbuffer bit), and it'd require way more work.
> 
> Moreover, the two allocators I proposed significantly benefit from the
> "same lifespan" assumption. I don't think sb_alloc can do that.

I've given the sb_alloc patch another try - essentially hacking it into 
reorderbuffer, ignoring the issues mentioned yesterday. And yes, it's 
faster than the allocators discussed in this thread. Based on a few very 
quick tests on my laptop, the difference is usually ~5-10%.


That might seem like a significant improvement, but it's negligible 
compared to the "master -> slab/gen" improvement, which improves 
performance by orders of magnitude (at least for the tested cases).


Moreover, the slab/gen allocators proposed here seem like a better fit 
for reorderbuffer, e.g. because they release memory. I haven't looked at 
sb_alloc too closely, but I think it behaves more like AllocSet in this 
regard (i.e. keeping the memory indefinitely).


FWIW I'm not making any conclusions about sb_alloc benefits outside 
reorderbuffer.c - it might easily be worth pursuing, no doubt about 
that. The amount of remaining work, however, seems quite high.


Attached is the modified sb_alloc patch that I used - it's mostly v1 
with removed uses in nbtree etc. FWIW the patch does not implement 
sb_destroy_private_allocator (it's only defined in the header), which 
seems like a bug.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/contrib/test_freepage/Makefile b/contrib/test_freepage/Makefile
new file mode 100644
index 000..b482fe9
--- /dev/null
+++ b/contrib/test_freepage/Makefile
@@ -0,0 +1,17 @@
+# contrib/test_freepage/Makefile
+
+MODULES = test_freepage
+
+EXTENSION = test_freepage
+DATA = test_freepage--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_freepage
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_freepage/test_freepage--1.0.sql b/contrib/test_freepage/test_freepage--1.0.sql
new file mode 100644
index 000..5d3191e
--- /dev/null
+++ b/contrib/test_freepage/test_freepage--1.0.sql
@@ -0,0 +1,15 @@
+/* contrib/test_freepage/test_freepage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_freepage" to load this file. \quit
+
+CREATE FUNCTION init(size pg_catalog.int8) RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+CREATE FUNCTION get(pages pg_catalog.int8) RETURNS pg_catalog.int8
+	AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+CREATE FUNCTION inquire_largest() RETURNS pg_catalog.int8
+	AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+CREATE FUNCTION put(first_page pg_catalog.int8, npages pg_catalog.int8)
+	RETURNS pg_catalog.void AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
+CREATE FUNCTION dump() RETURNS pg_catalog.text
+AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
diff --git a/contrib/test_freepage/test_freepage.c b/contrib/test_freepage/test_freepage.c
new file mode 100644
index 000..074cf56
--- /dev/null
+++ b/contrib/test_freepage/test_freepage.c
@@ -0,0 +1,113 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_freepage.c
+ *		Test harness code for free page manager.
+ *
+ * Copyright (C) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/test_freepage/test_freepage.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
+#include "utils/freepage.h"
+
+PG_MODULE_MAGIC;
+PG_FUNCTION_INFO_V1(init);
+PG_FUNCTION_INFO_V1(get);
+PG_FUNCTION_INFO_V1(inquire_largest);
+PG_FUNCTION_INFO_V1(put);
+PG_FUNCTION_INFO_V1(dump);
+
+Datum		init(PG_FUNCTION_ARGS);
+Datum		get(PG_FUNCTION_ARGS);
+Datum		inquire_largest(PG_FUNCTION_ARGS);
+Datum		put(PG_FUNCTION_ARGS);
+Datum		dump(PG_FUNCTION_ARGS);
+
+char *space;
+FreePageManager *fpm;
+
+Datum
+init(PG_FUNCTION_ARGS)
+{
+	int64 size = PG_GETARG_INT64(0);
+	Size	

Re: [HACKERS] pg_basebackup stream xlog to tar

2016-10-23 Thread Michael Paquier
On Sun, Oct 23, 2016 at 10:30 PM, Magnus Hagander  wrote:
> On Mon, Oct 17, 2016 at 7:37 AM, Michael Paquier 
> wrote:
>> > But independent of this patch, actually putting that test in for non-tar
>> > mode would probably not be a bad idea -- if that breaks, it's likely
>> > both
>> > break, after all.
>>
>> Agreed (you were able to break only tar upthread with your patch). One
>> way to do that elegantly would be to:
>> 1) extend slurp_dir to return only files that have a matching pattern.
>> That's not difficult to do:
>> --- a/src/test/perl/TestLib.pm
>> +++ b/src/test/perl/TestLib.pm
>> @@ -184,10 +184,14 @@ sub generate_ascii_string
>>
>>  sub slurp_dir
>>  {
>> -   my ($dir) = @_;
>> +   my ($dir, $match_pattern) = @_;
>> opendir(my $dh, $dir)
>>   or die "could not opendir \"$dir\": $!";
>> my @direntries = readdir $dh;
>> +   if (defined($match_pattern))
>> +   {
>> +   @direntries = grep($match_pattern, @direntries);
>> +   }
>> closedir $dh;
>> return @direntries;
>>  }
>> Sorting them at the same time may be a good idea..
>> 2) Add an option to pg_xlogdump to be able to output its output to a
>> file. That would be awkward to rely on grabbing the output data from a
>> pipe... On Windows particularly. Thinking about it, would that
>> actually be useful to others? That's not a complicated patch.
>
> I think both of those would be worthwhile. Just for the testability in
> itself, but such a flag to pg_xlogdump would probably be useful in other
> cases as well, beyond just the testing.

Looking quickly at the code, it does not seem that complicated... I
may just send patches tomorrow for all those things and be done with
it, all of that on a new dedicated thread.
-- 
Michael




Re: [HACKERS] pg_basebackup stream xlog to tar

2016-10-23 Thread Magnus Hagander
On Mon, Oct 17, 2016 at 7:37 AM, Michael Paquier 
wrote:

> On Sat, Oct 15, 2016 at 8:51 AM, Magnus Hagander 
> wrote:
> > Fixed.
>
> Ok, I had a extra look on the patch:
> +   The transactionn log files are written to a separate file
> +called pg_xlog.tar.
> +   
> s/transactionn/transaction/, and the  markup should be on its own
> line.
>
> +   if (dir_data->sync)
> +   {
> +   if (fsync_fname(tmppath, false, progname) != 0)
> +   {
> +   close(fd);
> +   return NULL;
> +   }
> +   if (fsync_parent_path(tmppath, progname) != 0)
> +   {
> +   close(fd);
> +   return NULL;
> +   }
> +   }
> Nit: squashing both things together would simplify the code.
>
> +   else if (method == CLOSE_UNLINK
> +   )
> Your finger slipped here.
>
> Except that it looks in pretty good to me, so I am switching that as
> ready for committer.
>

I incorporated those changes before pushing.
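As an aside, the durability pattern in the hunk quoted above — fsync the
newly written file, then fsync its parent directory so the directory entry
itself survives a crash — can be sketched in portable C roughly as follows.
This is a simplified, hedged analog of what fsync_fname()/fsync_parent_path()
do, not the actual frontend code:

```c
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <string.h>
#include <unistd.h>

/*
 * Sync a freshly written file and then its parent directory, so that
 * both the file's data and its directory entry are durable.  Returns 0
 * on success, -1 on failure (error handling deliberately minimal).
 */
static int
fsync_file_and_parent(const char *path)
{
	char		parent[1024];
	int			fd;

	/* 1. sync the file itself */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);

	/* 2. sync the directory holding its entry (dirname may modify its
	 * argument, so work on a copy) */
	strncpy(parent, path, sizeof(parent) - 1);
	parent[sizeof(parent) - 1] = '\0';
	fd = open(dirname(parent), O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}
```

Skipping the second step is the classic way to lose a file rename or
creation on power failure even though the data blocks themselves were
synced.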



> > But independent of this patch, actually putting that test in for non-tar
> > mode would probably not be a bad idea -- if that breaks, it's likely both
> > break, after all.
>
> Agreed (you were able to break only tar upthread with your patch). One
> way to do that elegantly would be to:
> 1) extend slurp_dir to return only files that have a matching pattern.
> That's not difficult to do:
> --- a/src/test/perl/TestLib.pm
> +++ b/src/test/perl/TestLib.pm
> @@ -184,10 +184,14 @@ sub generate_ascii_string
>
>  sub slurp_dir
>  {
> -   my ($dir) = @_;
> +   my ($dir, $match_pattern) = @_;
> opendir(my $dh, $dir)
>   or die "could not opendir \"$dir\": $!";
> my @direntries = readdir $dh;
> +   if (defined($match_pattern))
> +   {
> +   @direntries = grep($match_pattern, @direntries);
> +   }
> closedir $dh;
> return @direntries;
>  }
> Sorting them at the same time may be a good idea..
> 2) Add an option to pg_xlogdump to be able to output its output to a
> file. That would be awkward to rely on grabbing the output data from a
> pipe... On Windows particularly. Thinking about it, would that
> actually be useful to others? That's not a complicated patch.
>

I think both of those would be worthwhile. Just for the testability in
itself, but such a flag to pg_xlogdump would probably be useful in other
cases as well, beyond just the testing.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] pg_basebackup stream xlog to tar

2016-10-23 Thread Magnus Hagander
On Fri, Oct 21, 2016 at 2:02 PM, Michael Paquier 
wrote:

> On Mon, Oct 17, 2016 at 2:37 PM, Michael Paquier
>  wrote:
> > Except that it looks in pretty good to me, so I am switching that as
> > ready for committer.
>
> +   /*
> +* Create pg_xlog/archive_status (and thus pg_xlog) so we can
> write to
> +* basedir/pg_xlog as the directory entry in the tar file may
> arrive
> +* later.
> +*/
> +   snprintf(statusdir, sizeof(statusdir), "%s/pg_xlog/archive_status",
> +basedir);
>
> This part conflicts with f82ec32, where you need make pg_basebackup
> aware of the backend version.. I promise that's the last conflict, at
> least I don't have more patches planned in the area.
>

It also broke the tests and invalidated some documentation. But it was easy
enough to fix.

I've now applied this, so next time you get to do the merging :P Joking
aside, please review and let me know if you can spot something I messed up
in the final merge.

Thanks for your repeated reviews!

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-23 Thread Michael Paquier
On Sun, Oct 23, 2016 at 5:18 PM, Vik Fearing  wrote:
> On 10/22/2016 06:00 PM, David Steele wrote:
>> On 10/22/16 6:58 PM, Bruce Momjian wrote:
>>> On Sat, Oct 22, 2016 at 07:33:56AM +0900, Michael Paquier wrote:
 On Sat, Oct 22, 2016 at 4:29 AM, Alvaro Herrera

> Also +1 to renaming pg_subtrans to pg_subxact.

 Nice suggestion, good naming for consistency with the rest.
>>>
>>> Agreed.
>>
>> +1
>
> +1 from me, too.

OK, glad to see that we are reaching a conclusion to this thread.
Attached are two patches. 0001 renames pg_clog to pg_xact. I found a
comment in xact.c that was not yet updated. And 0002 renames
pg_subtrans to pg_subxact.
-- 
Michael
From 1711e73fe59b37c0626c1382c6f044b28c01f3d9 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Sun, 23 Oct 2016 20:52:22 +0900
Subject: [PATCH 1/2] Rename pg_clog to pg_xact

pg_upgrade is updated to handle the transfer correctly to post-10
clusters.
---
 doc/src/sgml/backup.sgml   |  4 ++--
 doc/src/sgml/catalogs.sgml |  4 ++--
 doc/src/sgml/config.sgml   |  2 +-
 doc/src/sgml/func.sgml |  2 +-
 doc/src/sgml/maintenance.sgml  |  8 
 doc/src/sgml/ref/pg_resetxlog.sgml |  4 ++--
 doc/src/sgml/ref/pg_rewind.sgml|  2 +-
 doc/src/sgml/storage.sgml  | 10 +-
 doc/src/sgml/wal.sgml  |  2 +-
 src/backend/access/heap/heapam.c   |  4 ++--
 src/backend/access/transam/README  | 10 +-
 src/backend/access/transam/clog.c  |  2 +-
 src/backend/access/transam/commit_ts.c |  2 +-
 src/backend/access/transam/multixact.c |  2 +-
 src/backend/access/transam/subtrans.c  |  5 ++---
 src/backend/access/transam/transam.c   |  2 +-
 src/backend/access/transam/twophase.c  |  4 ++--
 src/backend/access/transam/xact.c  | 18 +-
 src/backend/access/transam/xlog.c  |  2 +-
 src/backend/commands/vacuum.c  | 10 +-
 src/backend/postmaster/autovacuum.c|  2 +-
 src/backend/storage/buffer/README  |  2 +-
 src/backend/storage/ipc/procarray.c|  4 ++--
 src/backend/utils/time/tqual.c |  6 +++---
 src/bin/initdb/initdb.c|  2 +-
 src/bin/pg_upgrade/exec.c  |  6 ++
 src/bin/pg_upgrade/pg_upgrade.c| 30 ++
 src/include/access/slru.h  |  4 ++--
 28 files changed, 83 insertions(+), 72 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 6eaed1e..d22af0d 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -382,10 +382,10 @@ tar -cf backup.tar /usr/local/pgsql/data
   directories. This will not work because the
   information contained in these files is not usable without
   the commit log files,
-  pg_clog/*, which contain the commit status of
+  pg_xact/*, which contain the commit status of
   all transactions. A table file is only usable with this
   information. Of course it is also impossible to restore only a
-  table and the associated pg_clog data
+  table and the associated pg_xact data
   because that would render all other tables in the database
   cluster useless.  So file system backups only work for complete
   backup and restoration of an entire database cluster.
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 29738b0..839b52a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1847,7 +1847,7 @@
All transaction IDs before this one have been replaced with a permanent
(frozen) transaction ID in this table.  This is used to track
whether the table needs to be vacuumed in order to prevent transaction
-   ID wraparound or to allow pg_clog to be shrunk.  Zero
+   ID wraparound or to allow pg_xact to be shrunk.  Zero
(InvalidTransactionId) if the relation is not a table.
   
  
@@ -2514,7 +2514,7 @@
All transaction IDs before this one have been replaced with a permanent
(frozen) transaction ID in this database.  This is used to
track whether the database needs to be vacuumed in order to prevent
-   transaction ID wraparound or to allow pg_clog to be shrunk.
+   transaction ID wraparound or to allow pg_xact to be shrunk.
It is the minimum of the per-table
pg_class.relfrozenxid values.
   
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..8a5d202 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5834,7 +5834,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 

 Vacuum also allows removal of old files from the
-pg_clog subdirectory, which is why the default
+pg_xact subdirectory, which is why the default
 is a relatively low 200 million transactions.
 This parameter can only be set at server start, but the 

Re: [HACKERS] pg_xlog error on the master

2016-10-23 Thread Michael Paquier
On Sun, Oct 23, 2016 at 5:05 PM, Venkata B Nagothi  wrote:
> I just did a "git pull" to test one of my patches and I get the
> following error :
>
> 2016-10-23 18:51:47.679 AEDT [31930] FATAL:  could not open archive status
> directory "pg_xlog/archive_status": No such file or directory
> 2016-10-23 18:51:47.679 AEDT [31841] LOG:  archiver process (PID 31930)
> exited with exit code 1
>
> is it expected ?

No.

> I notice that pg_xlog's name has been changed to pg_wal. I am not sure about
> this.

WAL archiving works correctly here, and tests in src/test/recovery/
work. Are you sure that you cleaned up your source tree before
recompiling?
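
For what it's worth, the symptom (a stale archiver looking for pg_xlog/archive_status while initdb created pg_wal) can be confirmed with a quick look at the data directory. A hedged sketch, assuming $PGDATA points at the freshly-initialized cluster:

```shell
#!/bin/sh
# Sketch: show which WAL directory name exists on disk.  A post-rename
# initdb creates pg_wal; a pre-rename binary built from a dirty tree
# still expects pg_xlog, hence the FATAL above.  PGDATA is an assumed
# location, adjust as needed.
PGDATA=${PGDATA:-$HOME/pgdata}
for d in pg_wal pg_xlog; do
    if [ -d "$PGDATA/$d" ]; then
        echo "found $d"
    fi
done
```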
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-23 Thread Vik Fearing
On 10/22/2016 06:00 PM, David Steele wrote:
> On 10/22/16 6:58 PM, Bruce Momjian wrote:
>> On Sat, Oct 22, 2016 at 07:33:56AM +0900, Michael Paquier wrote:
>>> On Sat, Oct 22, 2016 at 4:29 AM, Alvaro Herrera
>>>
 Also +1 to renaming pg_subtrans to pg_subxact.
>>>
>>> Nice suggestion, good naming for consistency with the rest.
>>
>> Agreed.
> 
> +1

+1 from me, too.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




[HACKERS] pg_xlog error on the master

2016-10-23 Thread Venkata B Nagothi
I just did a "git pull" to test one of my patches and I get the
following error :

2016-10-23 18:51:47.679 AEDT [31930] FATAL:  could not open archive status
directory "pg_xlog/archive_status": No such file or directory
2016-10-23 18:51:47.679 AEDT [31841] LOG:  archiver process (PID 31930)
exited with exit code 1

is it expected ?

I notice that pg_xlog's name has been changed to pg_wal. I am not sure
about this.

Regards,

Venkata B N

Database Consultant / Architect