date:20180708

Re: Non-reserved replication slots and slot advancing

2018-07-08 Thread Michael Paquier

On Mon, Jul 09, 2018 at 03:13:04PM +0900, Kyotaro HORIGUCHI wrote:
> Looking the attached patch, I noticed that both "WAL" and "wal"
> are used in similar ERROR messages. Grepping the source tree
> showed me that it is always in upper case letters except in the
> case it is a part of other words like variable/column/function
> names or "walsender". This is the same with the word "lsn".

Thanks for the lookup.

I see.  Indeed, let's fix at the same time the error message close by.
xlog.c uses "WAL location (LSN)" for the same thing, so I am sticking
with that as per the attached.  I'll go commit that if there are no
objections.  If you see any others which you would like to fix, please
feel free to send a patch.
--
Michael
diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out
index 21e9d56f73..3a1a7e8514 100644
--- a/contrib/test_decoding/expected/slot.out
+++ b/contrib/test_decoding/expected/slot.out
@@ -131,3 +131,20 @@ SELECT pg_drop_replication_slot('regression_slot1');
 ERROR:  replication slot "regression_slot1" does not exist
 SELECT pg_drop_replication_slot('regression_slot2');
 ERROR:  replication slot "regression_slot2" does not exist
+-- slot advance with physical slot, error with non-reserved slot
+SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3');
+slot_name 
+--
+ regression_slot3
+(1 row)
+
+SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN
+ERROR:  invalid target WAL location (LSN)
+SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
+ERROR:  cannot move slot not reserving WAL
+SELECT pg_drop_replication_slot('regression_slot3');
+ pg_drop_replication_slot 
+--
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql
index 706340c1d8..24cdf7155d 100644
--- a/contrib/test_decoding/sql/slot.sql
+++ b/contrib/test_decoding/sql/slot.sql
@@ -68,3 +68,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot1', 'test_
 -- both should error as they should be dropped on error
 SELECT pg_drop_replication_slot('regression_slot1');
 SELECT pg_drop_replication_slot('regression_slot2');
+
+-- slot advance with physical slot, error with non-reserved slot
+SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3');
+SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN
+SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
+SELECT pg_drop_replication_slot('regression_slot3');
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3ed9021c2f..4851bc2e24 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9867,7 +9867,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
   The address (LSN) of oldest WAL which still
   might be required by the consumer of this slot and thus won't be
-  automatically removed during checkpoints.
+  automatically removed during checkpoints.  NULL
+  if the LSN of this slot has never been reserved.
   
  
 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..b55dadfe2f 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -465,7 +465,7 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 
 	if (XLogRecPtrIsInvalid(moveto))
 		ereport(ERROR,
-(errmsg("invalid target wal lsn")));
+(errmsg("invalid target WAL location (LSN)")));
 
 	/* Build a tuple descriptor for our result type */
 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
@@ -483,6 +483,15 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	/* Acquire the slot so we "own" it */
 	ReplicationSlotAcquire(NameStr(*slotname), true);
 
+	/* A slot whose restart_lsn has never been reserved cannot be advanced */
+	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
+	{
+		ReplicationSlotRelease();
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot move slot not reserving WAL")));
+	}
+
 	/*
 	 * Check if the slot is not moving backwards.  Physical slots rely simply
 	 * on restart_lsn as a minimum point, while logical slots have confirmed


signature.asc
Description: PGP signature

Re: Non-reserved replication slots and slot advancing

2018-07-08 Thread Kyotaro HORIGUCHI

Hello.

At Mon, 9 Jul 2018 14:18:51 +0900, Michael Paquier  wrote 
in <20180709051851.ga30...@paquier.xyz>
> On Fri, Jul 06, 2018 at 03:37:57PM +0900, Kyotaro HORIGUCHI wrote:
> > I'm not so in favor of the word "reserve" in first place since it
> > doesn't seem intuitive for me, but "active" is already used for
> > the state where the connection with the peer is made. (The word
> > "reserve" may be misused since in the source code "reserve" is
> > used as "to reserve WAL", but used as "reserve a slot" in
> > documentation.)
> 
> That's the term used for now three releases, counting v11 in the pack,
> so I would not change that now.  The concept of non-initialized slots is
> fuzzy as well as it could be attached to some other meta-data.
> 
> So, chewing on all that, I would suggest the following error message as
> the attached patch and just move on:
> +ERROR:  cannot move slot not reserving WAL

I'm fine with this. Thank you.

Looking the attched patch, I noticed that both "WAL" and "wal"
are used in similar ERROR messages. Grepping the source tree
showed me that it is always in upper case letters except in the
case it is a part of other words like variable/column/function
names or "walsender". This is the same with the word "lsn".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-07-08 Thread Haribabu Kommi

On Mon, Jul 9, 2018 at 3:39 PM Haribabu Kommi 
wrote:

>
> On Mon, Jul 9, 2018 at 12:24 PM Michael Paquier 
> wrote:
>
>> On Fri, Jul 06, 2018 at 05:10:18PM -0400, Alvaro Herrera wrote:
>> > Ugh, it's true :-(
>> >
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=25fff40798fc4ac11a241bfd9ab0c45c085e2212#patch8
>> >
>> > Dave, Simon, any comments?
>>
>> The offending line:
>> contrib/pg_stat_statements/pg_stat_statements--1.4--1.5.sql:
>> GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO pg_read_all_stats;
>>
>> This will need a new version bump down to REL_10_STABLE...
>>
>
> Hearing no objections, attached patch removes all permissions from PUBLIC
> as before this change went in. Or do we need to add command for revoke only
> from pg_read_all_stats?
>

Revoke all doesn't work, so patch updated with revoke from
pg_read_all_stats.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Revoke-pg_stat_statements_reset-permissions_v2.patch
Description: Binary data

Re: [HACKERS] Restricting maximum keep segments by repslots

2018-07-08 Thread Kyotaro HORIGUCHI

Hello. Sawada-san.

Thank you for the comments.

At Thu, 5 Jul 2018 15:43:56 +0900, Masahiko Sawada  
wrote in 
> On Wed, Jul 4, 2018 at 5:28 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello.
> >
> > At Tue, 26 Jun 2018 16:26:59 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <20180626.162659.223208514.horiguchi.kyot...@lab.ntt.co.jp>
> >> The previous patche files doesn't have version number so I let
> >> the attached latest version be v2.
> >>
> >>
> >> v2-0001-Add-WAL-releaf-vent-for-replication-slots.patch
> >>   The body of the limiting feature
> >>
> >> v2-0002-Add-monitoring-aid-for-max_replication_slots.patch
> >>   Shows the status of WAL rataining in pg_replication_slot view
> >>
> >> v2-0003-TAP-test-for-the-slot-limit-feature.patch
> >>   TAP test for this feature
> >>
> >> v2-0004-Documentation-for-slot-limit-feature.patch
> >>   Documentation, as the name.
> >
> > Travis (test_decoding test) showed that GetOldestXLogFileSegNo
> > added by 0002 forgets to close temporarily opened pg_wal
> > directory. This is the fixed version v3.
> >
> 
> Thank you for updating the patch! I looked at v3 patches. Here is
> review comments.
> 
> ---
> +   {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
> +   gettext_noop("Sets the maximum size of extra
> WALs kept by replication slots."),
> +NULL,
> +GUC_UNIT_MB
> +   },
> +   &max_slot_wal_keep_size_mb,
> +   0, 0, INT_MAX,
> +   NULL, NULL, NULL
> +   },
> 
> Maybe MAX_KILOBYTES would be better instead of INT_MAX like wal_max_size.

The MAX_KILOBYTES is maximum value of size in kB, which fits long
or Size/size_t variables after convreted into bytes. Applying the
limit there means that we assume that the _mb variable can be
converted not into bytes but kB. So applying it to
max/min_wal_size seems somewhat wrong but doesn't harm since they
are not acutually converted into bytes.

max_slot_wal_keep_size is not converted into bytes so capping
with INT_MAX is no problem. However it doesn't need to be larger
than MAX_KILOBYTES, I follow that in order to make it same with
max/min_wal_size.

> ---
> Once the following WARNING emitted this message is emitted whenever we
> execute CHECKPOINT even if we don't change anything. Is that expected
> behavior? I think it would be better to emit this message only when we
> remove WAL segements that are required by slots.
> 
> WARNING:  some replication slots have lost required WAL segments
> DETAIL:  The mostly affected slot has lost 153 segments.

I didn't consider the situation the number of lost segments
doesn't change. Changed to mute the message when the number of
lost segments is not changed.

> ---
> +   Assert(wal_keep_segments >= 0);
> +   Assert(max_slot_wal_keep_size_mb >= 0);
> 
> These assertions are not meaningful because these parameters are
> ensured >= 0 by those definition.

Yeah, that looks a bit being paranoid. Removed.

> ---
> +/* slots aren't useful, consider only wal_keep_segments */
> +if (slotpos == InvalidXLogRecPtr)
> +{
> 
> Perhaps XLogRecPtrIsInvalid(slotpos) is better.

Agreed. It is changed to "slotpos != InvalidXLogRecPtr" after
changing the function by the comments below. I think that the
double negation !XLogRecPtrInvalid() is not fine.

> ---
> +if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + 
> wal_keep_segments)
> +slotpos = InvalidXLogRecPtr;
> +
> +/* slots aren't useful, consider only wal_keep_segments */
> +if (slotpos == InvalidXLogRecPtr)
> +{
> 
> This logic is hard to read to me. The slotpos can be any of: valid,
> valid but then become invalid in halfway or invalid from beginning of
> this function. Can we convert this logic to following?
> 
> if (XLogRecPtrIsInvalid(slotpos) ||
> currSeg <= slotSeg + wal_keep_segments)

Right. But it is removed.

> ---
> +keepSegs = wal_keep_segments +
> +ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
> 
> Why do we need to keep (wal_keep_segment + max_slot_wal_keep_size) WAL
> segments? I think what this feature does is, if wal_keep_segments is
> not useful (that is, slotSeg < (currSeg - wal_keep_segment) then we
> normally choose slotSeg as lower boundary but max_slot_wal_keep_size
> restrict the lower boundary so that it doesn't get lower than the
> threshold. So I thought what this function should do is to calculate
> min(currSeg - wal_keep_segment, max(currSeg - max_slot_wal_keep_size,
> slotSeg)), I might be missing something though.

You're right that wal_keep_segments should not been added, but
should give lower limit of segments to keep as the current
KeepLogSeg() does. Fixed that.

Since the amount is specified in mega bytes, silently rounding
down to segment bounds may not be proper in general and this
feature used to use the fragments to show users something. But
there's no loner a place where the fragments are per

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-07-08 Thread Haribabu Kommi

On Mon, Jul 9, 2018 at 12:24 PM Michael Paquier  wrote:

> On Fri, Jul 06, 2018 at 05:10:18PM -0400, Alvaro Herrera wrote:
> > Ugh, it's true :-(
> >
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=25fff40798fc4ac11a241bfd9ab0c45c085e2212#patch8
> >
> > Dave, Simon, any comments?
>
> The offending line:
> contrib/pg_stat_statements/pg_stat_statements--1.4--1.5.sql:
> GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO pg_read_all_stats;
>
> This will need a new version bump down to REL_10_STABLE...
>

Hearing no objections, attached patch removes all permissions from PUBLIC
as before this change went in. Or do we need to add command for revoke only
from pg_read_all_stats?

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Revoke-pg_stat_statements_reset-permissions.patch
Description: Binary data

Re: Postgres 11 release notes

2018-07-08 Thread Andres Freund

On 2018-07-09 14:18:14 +0900, Tatsuro Yamada wrote:
> Hi Bruce,
> 
> > I expect a torrent of feedback.;-)
> 
> Could you add this fix to the release note because this change affects
> an extension developer using the hook function.
> It would be better to know the change by reading the release note, not 
> compilation error.
> 
> 
>
> Add QueryEnvironment to ExplainOneQuery_hook's parameter list
> (Tatsuro Yamada, Thomas Munro)
>
> 
> Discussion: 
> https://postgr.es/m/890e8dd9-c1c7-a422-6892-874f5eaee...@lab.ntt.co.jp
> 
> I guess "E.1.3.11. Source Code" or "E.1.3.12. Additional Modules" sections are
> suitable for putting the message in the release note.

We adapt plenty of functions signatures without listing them
individually in the release notes. I don't think randomly selecting one
relatively uncommonly used hook is a good way to attack that.

Greetings,

Andres Freund

Re: Non-reserved replication slots and slot advancing

2018-07-08 Thread Michael Paquier

On Fri, Jul 06, 2018 at 03:37:57PM +0900, Kyotaro HORIGUCHI wrote:
> I'm not so in favor of the word "reserve" in first place since it
> doesn't seem intuitive for me, but "active" is already used for
> the state where the connection with the peer is made. (The word
> "reserve" may be misused since in the source code "reserve" is
> used as "to reserve WAL", but used as "reserve a slot" in
> documentation.)

That's the term used for now three releases, counting v11 in the pack,
so I would not change that now.  The concept of non-initialized slots is
fuzzy as well as it could be attached to some other meta-data.

So, chewing on all that, I would suggest the following error message as
the attached patch and just move on:
+ERROR:  cannot move slot not reserving WAL
--
Michael
diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out
index 21e9d56f73..cc3766e10e 100644
--- a/contrib/test_decoding/expected/slot.out
+++ b/contrib/test_decoding/expected/slot.out
@@ -131,3 +131,20 @@ SELECT pg_drop_replication_slot('regression_slot1');
 ERROR:  replication slot "regression_slot1" does not exist
 SELECT pg_drop_replication_slot('regression_slot2');
 ERROR:  replication slot "regression_slot2" does not exist
+-- slot advance with physical slot, error with non-reserved slot
+SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3');
+slot_name 
+--
+ regression_slot3
+(1 row)
+
+SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN
+ERROR:  invalid target wal lsn
+SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
+ERROR:  cannot move slot not reserving WAL
+SELECT pg_drop_replication_slot('regression_slot3');
+ pg_drop_replication_slot 
+--
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql
index 706340c1d8..24cdf7155d 100644
--- a/contrib/test_decoding/sql/slot.sql
+++ b/contrib/test_decoding/sql/slot.sql
@@ -68,3 +68,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot1', 'test_
 -- both should error as they should be dropped on error
 SELECT pg_drop_replication_slot('regression_slot1');
 SELECT pg_drop_replication_slot('regression_slot2');
+
+-- slot advance with physical slot, error with non-reserved slot
+SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3');
+SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN
+SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
+SELECT pg_drop_replication_slot('regression_slot3');
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3ed9021c2f..4851bc2e24 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9867,7 +9867,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
   The address (LSN) of oldest WAL which still
   might be required by the consumer of this slot and thus won't be
-  automatically removed during checkpoints.
+  automatically removed during checkpoints.  NULL
+  if the LSN of this slot has never been reserved.
   
  
 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..2f07cc37f5 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -483,6 +483,15 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	/* Acquire the slot so we "own" it */
 	ReplicationSlotAcquire(NameStr(*slotname), true);
 
+	/* A slot whose restart_lsn has never been reserved cannot be advanced */
+	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
+	{
+		ReplicationSlotRelease();
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot move slot not reserving WAL")));
+	}
+
 	/*
 	 * Check if the slot is not moving backwards.  Physical slots rely simply
 	 * on restart_lsn as a minimum point, while logical slots have confirmed


signature.asc
Description: PGP signature

Re: Postgres 11 release notes

2018-07-08 Thread Tatsuro Yamada


Hi Bruce,

> I expect a torrent of feedback.;-)

Could you add this fix to the release note because this change affects
an extension developer using the hook function.
It would be better to know the change by reading the release note, not 
compilation error.


   
Add QueryEnvironment to ExplainOneQuery_hook's parameter list
(Tatsuro Yamada, Thomas Munro)
   

Discussion: 
https://postgr.es/m/890e8dd9-c1c7-a422-6892-874f5eaee...@lab.ntt.co.jp

I guess "E.1.3.11. Source Code" or "E.1.3.12. Additional Modules" sections are
suitable for putting the message in the release note.

Thanks,
Tatsuro Yamada
NTT Open Source Software Center

Incorrect error handling for two-phase state files resulting in data loss

2018-07-08 Thread Michael Paquier

Hi all,

This is a follow-up of the following thread where I have touched the
topic of corrupted 2PC files being completely ignored by recovery:
https://www.postgresql.org/message-id/20180709012955.GD1467%40paquier.xyz
I have posted a patch on this thread, but after more reviews I have
noticed that it does not close all the holes.

What happens is that when a two-phase state file cannot be read by
recovery when the file is on disk, which is what ReadTwoPhaseFile does,
then its data is simply skipped.  It is perfectly possible that an
on-disk 2PC is missing at crash recovery if it has been already
processed, so if trying to open the file and seeing an ENOENT, then the
file can be properly skipped.  However, if you look at this API, it
skips *all* errors instead of the ones that matter.  For example, if a
file needs to be processed and is cannot be opened because of permission
issues, then it is simply skipped.  If its size, its CRC or its magic
number is incorrect, the same thing happens.  Skipping unconditionally
files this way can result in data loss.

I think that we really need to harden things, by making
ReadTwoPhaseFile() fail hard is it finds something unexpected, which is
in this case anything except trying to open a file which fails on
ENOENT, and that this stuff should be back-patched.

Thoughts?
--
Michael
From c53e3f6b604ceffcf75484975331e1b141bb6139 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 9 Jul 2018 13:59:46 +0900
Subject: [PATCH] Fail hard when facing corrupted two-phase state files at
 recovery

When a corrupted file is found by WAL replay, be it for crash recovery
or archive recovery, then the file is simply skipped and a WARNING is
logged to the user.  Facing an on-disk WAL file which is corrupted is
as likely to happen as its pair recorded in dedicated WAL records, but
WAL records are already able to fail hard if there is a CRC mismatch at
record level.

The situation got better since 978b2f6 (9.6) which has added two-phase
state information directly in WAL instead of using on-disk files, so the
risk is limited to two-phase transactions which live across at least one
checkpoint.

Reported-by: Michael Paquier
Author: Michael Paquier
Discussion: https://postgr.es/m/XXX
---
 src/backend/access/transam/twophase.c | 131 +++---
 1 file changed, 53 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e8d4e37fe3..3a048f9a4b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1206,10 +1206,11 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
  * Read and validate the state file for xid.
  *
  * If it looks OK (has a valid magic number and CRC), return the palloc'd
- * contents of the file.  Otherwise return NULL.
+ * contents of the file.  If missing_ok is true, do not issue an ERROR and
+ * return NULL.
  */
 static char *
-ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
+ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@@ -1225,12 +1226,12 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 	{
-		if (give_warnings)
-			ereport(WARNING,
-	(errcode_for_file_access(),
-	 errmsg("could not open two-phase state file \"%s\": %m",
-			path)));
-		return NULL;
+		if (missing_ok && errno == ENOENT)
+			return NULL;
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not open two-phase state file \"%s\": %m",
+		path)));
 	}
 
 	/*
@@ -1240,36 +1241,26 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	 * even on a valid file.
 	 */
 	if (fstat(fd, &stat))
-	{
-		int			save_errno = errno;
-
-		CloseTransientFile(fd);
-		if (give_warnings)
-		{
-			errno = save_errno;
-			ereport(WARNING,
-	(errcode_for_file_access(),
-	 errmsg("could not stat two-phase state file \"%s\": %m",
-			path)));
-		}
-		return NULL;
-	}
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not stat two-phase state file \"%s\": %m",
+		path)));
 
 	if (stat.st_size < (MAXALIGN(sizeof(TwoPhaseFileHeader)) +
 		MAXALIGN(sizeof(TwoPhaseRecordOnDisk)) +
 		sizeof(pg_crc32c)) ||
 		stat.st_size > MaxAllocSize)
-	{
-		CloseTransientFile(fd);
-		return NULL;
-	}
+		ereport(ERROR,
+(errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("incorrect size of two-phase state file \"%s\": %zu bytes",
+		path, stat.st_size)));
 
 	crc_offset = stat.st_size - sizeof(pg_crc32c);
 	if (crc_offset != MAXALIGN(crc_offset))
-	{
-		CloseTransientFile(fd);
-		return NULL;
-	}
+		ereport(ERROR,
+(errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("incorrect alignment of CRC offset for two-phase state file \"%s\"",
+		path)));
 
 	/*
 	 * OK, slurp in the file.
@@ -1278,32 +1269,26 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 
 	pgstat_report_wait_start(W

RE: Locking B-tree leafs immediately in exclusive mode

2018-07-08 Thread Imai, Yoshikazu

On Mon, June 11, 2018 at 4:31 PM, Alexander Korotkov wrote:
> On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs  wrote:
> > On 5 June 2018 at 14:45, Alexander Korotkov 
> > wrote:
> > > Currently _bt_search() always locks leaf buffer in shared mode (aka
> > > BT_READ), while caller can relock it later.  However, I don't see
> > > what prevents
> > > _bt_search()
> > > from locking leaf immediately in exclusive mode (aka BT_WRITE) when
> > > required.
> > > When traversing downlink from non-leaf page of level 1 (lowest level
> > > of non-leaf pages just prior to leaf pages), we know that next page
> > > is going to be leaf.  In this case, we can immediately lock next page
> > > in exclusive mode if required.
> > > For sure, it might happen that we didn't manage to exclusively lock
> > > leaf in this way when _bt_getroot() points us to leaf page.  But
> > > this case could be handled in _bt_search() by relock.  Please, find
> > > implementation of this approach in the attached patch.
> >
> > It's a good idea. How does it perform with many duplicate entries?
> 
> Our B-tree is currently maintaining duplicates unordered.  So, during
> insertion we can traverse rightlinks in order to find page, which would
> fit new index tuple.
> However, in this case we're traversing pages in exclusive lock mode, and
> that happens already after re-lock.  _bt_search() doesn't do anything
> with that.
> So, I assume there shouldn't be any degradation in the case of many
> duplicate entries.
> 
> But this case definitely worth testing, and I'm planning to do it.
> 


Hi, I'm reviewing this.

Firstly, I did performance tests on 72-cores machines(AWS c5.18xlarge) same as 
you did.

> # postgresql.conf
> max_connections = 300
> shared_buffers = 32GB
> fsync = off
> synchronous_commit = off
> 
> 
> 
> #  DDL:
> CREATE UNLOGGED TABLE ordered (id serial primary key, value text not null);
> CREATE UNLOGGED TABLE unordered (i integer not null, value text not null);
> 
> 
> 
> # script_ordered.sql
> INSERT INTO ordered (value) VALUES ('abcdefghijklmnoprsqtuvwxyz');
> 
> 
> 
> # script_unordered.sql
> \set i random(1, 100)
> INSERT INTO unordered VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');
> 
> 
> 
> # commands
> pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c 150 -j 150 postgres
> pgbench -T 60 -P 1 -M prepared -f script_unordered.sql -c 150 -j 150 postgres
> 
> 
> 
> # results
> ordered, master: 157473 TPS
> ordered, patched 231374 TPS
> unordered, master: 232372 TPS
> unordered, patched: 232535 TPS

# my results
ordered, master: 186045 TPS
ordered, patched:265842 TPS
unordered, master:   313011 TPS
unordered, patched:  309636 TPS

I confirmed that this patch improves ordered insertion.


In addition to your tests, I did the following tests with many duplicate entries

#  DDL
CREATE UNLOGGED TABLE duplicated (i integer not null, value text not null);

# script_duplicated.sql
INSERT INTO unordered VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');

# commands
pgbench -T 60 -P 1 -M prepared -f script_duplicated.sql -c 150 -j 150 postgres

# results
duplicated, master:  315450 TPS
duplicated, patched: 311584 TPS

It seems there are almostly no performance degression in case of many 
duplicated entries.


I'm planning to do code review and send it in the next mail.

Yoshikazu Imai

Re: no partition pruning when partitioning using array type

2018-07-08 Thread Amit Langote

Thanks for taking a look.

On 2018/07/07 9:19, Alvaro Herrera wrote:
> On 2018-May-08, Amit Langote wrote:
> 
>> I would like to revisit this as a bug fix for get_partition_operator() to
>> be applied to both PG 10 and HEAD.  In the former case, it fixes the bug
>> that constraint exclusion code will fail to prune correctly when partition
>> key is of array, enum, range, or record type due to the structural
>> mismatch between the OpExpr that partitioning code generates and one that
>> the parser generates for WHERE clauses involving partition key columns.
> 
> Interesting patchset.  Didn't read your previous v2, v3 versions; I only
> checked your latest, v1 (???).

Sorry, I think I messed up version numbering there.

> I'm wondering about the choice of OIDs in the new test.  I wonder if
> it's possible to get ANYNONARRAY (or others) by way of having a
> polymorphic function that passes its polymorphic argument in a qual.  I
> suppose it won't do anything in v10, or will it?  Worth checking :-)> Why not 
> use IsPolymorphicType?

Hmm, so IsPolymorphicType() test covers all of these pseudo-types except
RECORDOID.  I rewrote the patch to use IsPolymorphicType.

I'm not able to think of a case where the partition constraint expression
would have to contain ANYNONARRAY though.

> Also, I think it'd be good to have tests
> for all these cases (even in v10), just to make sure we don't break it
> going forward.

So, I had proposed this patch in last December, because partition pruning
using constraint exclusion was broken for these types and still is in PG
10.  I have added the tests back in the patch for PG 10 to test that
partition pruning (using constraint exclusion) works for these cases.  For
PG 11 and HEAD, we took care of that in e5dcbb88a15 (Rework code to
determine partition pruning procedure), so there does not appear to be any
need to add tests for pruning there.

> At least the array case is clearly broken today ...
> A test for the RECORDOID case would be particularly welcome, since it's
> somehow different from the other cases.  (I didn't understand why did
> you remove the test in the latest version.)

I have added the tests in the patch for PG 10.

I have marked both patches as v5.  One patch is for PG 10 and the other
applies unchanged to both HEAD and PG 11 branches.

Thanks,
Amit
From 1ec0c064adc34c12ae5615f0f2bca27a9c61c30f Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 8 May 2018 14:31:37 +0900
Subject: [PATCH v4] Fix how get_partition_operator looks up the operator

Instead of looking up using the underlying partition key's type as the
operator's lefttype and righttype, use the partition operator class
declared input type, which works reliably even in the cases where
pseudo-types are involved.  Also, make its decision whether a
RelableType is needed depend on a different criteria than what's
currently there.  That is, only add a RelabelType if the partition
key's type is different from the selected operator's input types.
---
 src/backend/catalog/partition.c| 47 ++
 src/test/regress/expected/create_table.out |  2 +-
 src/test/regress/expected/inherit.out  | 98 ++
 src/test/regress/sql/inherit.sql   | 43 +
 4 files changed, 161 insertions(+), 29 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 17704f36b9..1f50b29a3f 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1176,40 +1176,31 @@ get_partition_operator(PartitionKey key, int col, 
StrategyNumber strategy,
Oid operoid;
 
/*
-* First check if there exists an operator of the given strategy, with
-* this column's type as both its lefttype and righttype, in the
-* partitioning operator family specified for the column.
+* Get the operator in the partitioning operator family using the 
operator
+* class declared input type as both its lefttype and righttype.
 */
operoid = get_opfamily_member(key->partopfamily[col],
- 
key->parttypid[col],
- 
key->parttypid[col],
+ 
key->partopcintype[col],
+ 
key->partopcintype[col],
  strategy);
+   if (!OidIsValid(operoid))
+   elog(ERROR, "missing operator %d(%u,%u) in partition opfamily 
%u",
+strategy, key->partopcintype[col], 
key->partopcintype[col],
+key->partopfamily[col]);
 
/*
-* If one doesn't exist, we must resort to using an operator in the same
-* operator family but with the operator class declared input type.  It 
is
-* OK to do so, bec

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-07-08 Thread Michael Paquier

On Fri, Jul 06, 2018 at 05:10:18PM -0400, Alvaro Herrera wrote:
> Ugh, it's true :-(
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=25fff40798fc4ac11a241bfd9ab0c45c085e2212#patch8
> 
> Dave, Simon, any comments?

The offending line:
contrib/pg_stat_statements/pg_stat_statements--1.4--1.5.sql:
GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO pg_read_all_stats;

This will need a new version bump down to REL_10_STABLE...
--
Michael


signature.asc
Description: PGP signature

Re: Typo in Japanese translation of psql.

2018-07-08 Thread Michael Paquier

On Fri, Jul 06, 2018 at 06:14:24PM +0900, Yugo Nagata wrote:
> However, I think you have to submit the whole po file to Patch Tracker[1]
> instead of a patch according to the wiki [2].
> 
> [1] https://redmine.postgresql.org/projects/pgtranslation
> [2] https://wiki.postgresql.org/wiki/NLS

My understanding is that all this class of bugs is fixed into the
translation git tree first, and then patched in block back to Postgres.
Peter Eisentraut does that usually.
--
Michael


signature.asc
Description: PGP signature

Re: no partition pruning when partitioning using array type

2018-07-08 Thread Amit Langote

On 2018/07/07 0:13, Andrew Dunstan wrote:
> 
> 
> On 05/08/2018 02:18 AM, Amit Langote wrote:
>> On 2018/03/01 17:16, Amit Langote wrote:
>>> Added this to CF (actually moved to the September one after first having
>>> added it to the CF that is just getting started).
>>>
>>> It seems to me that we don't need to go with my originally proposed
>>> approach to teach predtest.c to strip RelabelType nodes.  Apparently, it's
>>> only partition.c that's adding the RelableType node around partition key
>>> nodes in such cases.  That is, in the case of having an array, record,
>>> enum, and range type as the partition key.  No other part of the system
>>> ends up adding one as far as I can see.  Parser's make_op(), for example,
>>> that is used to generate an OpExpr for a qual involving the partition key,
>>> won't put RelabelType around the partition key node if the operator in
>>> question has "pseudo"-types listed as declared types of its left and right
>>> arguments.
>>>
>>> So I revised the patch to drop all the predtest.c changes and instead
>>> modify get_partition_operator() to avoid generating RelabelType in such
>>> cases.  Please find it attached.
>
>> I would like to revisit this as a bug fix for get_partition_operator() to
>> be applied to both PG 10 and HEAD.  In the former case, it fixes the bug
>> that constraint exclusion code will fail to prune correctly when partition
>> key is of array, enum, range, or record type due to the structural
>> mismatch between the OpExpr that partitioning code generates and one that
>> the parser generates for WHERE clauses involving partition key columns.
>>
>> In HEAD, since we already fixed that case in e5dcbb88a15d [1] which is a
>> different piece of code anyway, the patch only serves to improve the
>> deparse output emitted by ruleutils.c for partition constraint expressions
>> where pseudo-type partition key is involved.  The change can be seen in
>> the updated test output for create_table test.
>>
>> Attached are patches for PG 10 and HEAD, which implement a slightly
>> different approach to fixing this.  Now, instead of passing the partition
>> key's type as lefttype and righttype to look up the operator, the operator
>> class declared type is passed.  Also, we now decide whether to create a
>> RelabelType node on top based on whether the partition key's type is
>> different from the selected operator's input type, with exception for
>> pseudo-type types.
>>
>> Thanks,
>> Amit
>>
>> [1]
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e5dcbb88a15d
> 
> Since this email we have branched off REL_11_STABLE. Do we want to
> consider this as a bug fix for Release 11? If so, should we add it to the
> open items list?

The code being fixed with the latest patch is not new in PG 11, it's
rather PG 10 code.  That code affects how pruning works in PG 10 (used to
affect PG 11 until we rewrote the partition pruning code).  I think it's a
good idea to apply this patch to all branches, as the code is not specific
to partition pruning and has its benefits even for HEAD and PG 11.

For PG 11 and HEAD, the benefit of this patch can be thought of as just
cosmetic, because it only affects the ruleutils.c's deparse output for
partition constraint expressions when showing it in the psql output for
example.

In PG 10, it directly affects the planner behavior whereby having a
RelabelType redundantly in a partition constraint expression limits the
planner's ability to do partition pruning with it.

Thanks,
Amit

Re: Let's remove DSM_IMPL_NONE.

2018-07-08 Thread Michael Paquier

On Fri, Jul 06, 2018 at 11:08:02PM +0200, Peter Eisentraut wrote:
> On 26.06.18 09:10, Kyotaro HORIGUCHI wrote:
>> --- a/src/include/storage/dsm_impl.h
>> +++ b/src/include/storage/dsm_impl.h
>> @@ -14,11 +14,10 @@
>>  #define DSM_IMPL_H
>>  
>>  /* Dynamic shared memory implementations. */
>> -#define DSM_IMPL_NONE   0
>> -#define DSM_IMPL_POSIX  1
>> -#define DSM_IMPL_SYSV   2
>> -#define DSM_IMPL_WINDOWS3
>> -#define DSM_IMPL_MMAP   4
>> +#define DSM_IMPL_POSIX  0
>> +#define DSM_IMPL_SYSV   1
>> +#define DSM_IMPL_WINDOWS2
>> +#define DSM_IMPL_MMAP   3
> 
> I would avoid renumbering here.  It was kind of sensible to have NONE =
> 0, so I'd keep the non-NONE ones as non-zero.

+1.
--
Michael


signature.asc
Description: PGP signature

[GSoC] working status

2018-07-08 Thread Charles Cui

Hi mentors and hackers,

   The second review is coming. Here is my working status so far. 1.
Complete the thrift compact protocol implementation using bytea interface.
2. Thrift type (binary protocol) is almost done, the only remaining part is
struct encoding and decoding. With the thrift type, you can express your
thrift struct using json, but stored using thrift bytes.
3. Set up travis CI. 4. better documents.
Here is the repo with all recent changes
(https://github.com/charles-cui/pg_thrift)
Let me know if you have any questions.

Re: Copy function for logical replication slots

2018-07-08 Thread Michael Paquier

On Mon, Jul 09, 2018 at 10:06:00AM +0900, Masahiko Sawada wrote:
> I think that this patch might be splitted but I will be able to send
> an updated patch in the next week. As you suggestion this patch needs
> more careful thoughts. I'll move this patch to the next commit fest if
> I will not be able to sent it. Is that okay?

Fine by me.  Thanks for the update.
--
Michael


signature.asc
Description: PGP signature

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

2018-07-08 Thread Michael Paquier

On Fri, Jul 06, 2018 at 09:09:27PM +0900, Michael Paquier wrote:
> Sure.  I think I can finish the set on Monday JST then.

So, I have been able to back-patch things down to 9.5, but further down
I am not really convinced that it is worth meddling with that,
particularly in light of 7cbee7c which has reworked the way partial
segments on older timelines are handled at the end of promotion.  The
portability issues I have found is related to the timeline number
exitArchiveRecovery uses which comes for the WAL reader hence this gets
set to the timeline from the last page read by the startup process.
This can actually cause quite a bit of trouble at the end of recovery
as we would get a failure when trying to copy the last segment from the
old timeline to the new timeline.  Well, it could be possible to fix
things properly by roughly back-porting 7cbee7c down to REL9_4_STABLE
but that's not really worth the risk, and moving exitArchiveRecovery()
and PrescanPreparedTransactions() around is a straight-forward move with
9.5~ thanks to this commit.

I have also done nothing yet for the detection of corrupted 2PC files
which get ignored at recovery.  While looking again at the patch I sent
upthread, the thing was actually missing some more error handling in
ReadTwoPhaseFile().  In particular, with the proposed patch we would
still not report an error if a 2PC file cannot be opened because of for
example EPERM.  In FinishPreparedTransaction, one would example get a
simply a *crash* with no hints about what happened.  I have a patch for
that which still needs some polishing, and I will start a new thread on
the matter.
--
Michael


signature.asc
Description: PGP signature

Re: [HACKERS] Secondary index access optimizations

2018-07-08 Thread David Rowley

On 22 March 2018 at 22:38, Konstantin Knizhnik
 wrote:
> Attached please find rebased version of the patch.

Hi,

I started looking over this patch and have a few comments:

I don't think this range type idea is a great one. I don't think it's
going to ever perform very well.  I also see you're not checking the
collation of the type anywhere.  As of now, no range types have
collation support, but you can't really go and code this with that
assumption. I also don't really like the sequence scan over pg_range.

Probably a better way to do this would be to add a new bt proc, like
what was done in 0a459cec for 2 new functions. Something like
BTISNEXTVAL_PROC and BTISPREVVAL_PROC. You'd then need to define
functions for all the types based on integers, making functions which
take 2 parameters of the type, and an additional collation param. The
functions should return bool.  int4_isnextval(2, 3, InvalidOid) would
return true.  You'd need to return false on wraparound.

I also think that the patch is worth doing without the additional
predicate_implied_by() smarts. In fact, I think strongly that the two
should be considered as two independent patches. Partial indexes
suffer from the same issue you're trying to address here and that
would be resolved by any patch which makes changes to
predicate_implied_by().

Probably the best place to put the code to skip the redundant quals is
inside set_append_rel_size(). There's already code there that skips
quals that are seen as const TRUE. This applies for UNION ALL targets
with expressions that can be folded to constants once the qual is
passed through adjust_appendrel_attrs()

For example:
explain select * from (select 1 as a from pg_class union all select 2
from pg_class) t where a = 1;

I've attached a patch to show what I had in mind. I had to change how
partition_qual is populated, which I was surprised to see only gets
populated for sub-partitioned tables (the top-level parent won't have
a qual since it's not partitioned)  I didn't update the partition
pruning code that assumes this is only populated for sub-partitioned
table. That will need to be done. The patch comes complete with all
the failing regression tests where the redundant quals have been
removed by the new code.

If you want to make this work for CHECK constraints too, then I think
the code for that can be added to the same location as the code I
added in the attached patch. You'd just need to fetch the check
constraint quals and just add some extra code to check if the qual is
redundant.

Some new performance benchmarks would then be useful in order to find
out how much overhead there is. We might learn that we don't want to
enable it by default if it's too expensive.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


skip_redundant_partition_quals_poc.patch
Description: Binary data

Re: Failure assertion in GROUPS mode of window function in current HEAD

2018-07-08 Thread Masahiko Sawada

On Wed, Jul 4, 2018 at 11:24 PM, Masahiko Sawada  wrote:
> Hi,
>
> I got an assertion failure when I use GROUPS mode and specifies offset
> without ORDER BY clause. The reproduction steps and the backtrace I
> got are following.
>
> =# create table test as select 1 as c;
> =# select sum(c) over (partition by c groups between 1 preceding and
> current row) from test;
> TRAP: FailedAssertion("!(ptr >= 0 && ptr < state->readptrcount)",
> File: "tuplestore.c", Line: 478)
>
> #0  0x003b3ce32495 in raise () from /lib64/libc.so.6
> #1  0x003b3ce33c75 in abort () from /lib64/libc.so.6
> #2  0x00a2ef5a in ExceptionalCondition (conditionName=0xc99f38
> "!(ptr >= 0 && ptr < state->readptrcount)", errorType=0xc99f22
> "FailedAssertion",
> fileName=0xc99ec0 "tuplestore.c", lineNumber=478) at assert.c:54
> #3  0x00a7fa7d in tuplestore_select_read_pointer
> (state=0x139e4a0, ptr=-1) at tuplestore.c:478
> #4  0x007216cd in update_frameheadpos (winstate=0x137fc30) at
> nodeWindowAgg.c:1655
> #5  0x0071fb7f in eval_windowaggregates (winstate=0x137fc30)
> at nodeWindowAgg.c:735
> #6  0x00722ac8 in ExecWindowAgg (pstate=0x137fc30) at
> nodeWindowAgg.c:2196
> #7  0x006e5776 in ExecProcNodeFirst (node=0x137fc30) at
> execProcnode.c:445
> #8  0x006da945 in ExecProcNode (node=0x137fc30) at
> ../../../src/include/executor/executor.h:237
> #9  0x006dd2fc in ExecutePlan (estate=0x137fa18,
> planstate=0x137fc30, use_parallel_mode=false, operation=CMD_SELECT,
> sendTuples=true,
> numberTuples=0, direction=ForwardScanDirection, dest=0x138b098,
> execute_once=true) at execMain.c:1726
> #10 0x006daf2c in standard_ExecutorRun (queryDesc=0x13785b8,
> direction=ForwardScanDirection, count=0, execute_once=true) at
> execMain.c:363
> #11 0x006dad48 in ExecutorRun (queryDesc=0x13785b8,
> direction=ForwardScanDirection, count=0, execute_once=true) at
> execMain.c:306
> #12 0x008c7626 in PortalRunSelect (portal=0x12f3bd8,
> forward=true, count=0, dest=0x138b098) at pquery.c:932
> #13 0x008c72b4 in PortalRun (portal=0x12f3bd8,
> count=9223372036854775807, isTopLevel=true, run_once=true,
> dest=0x138b098, altdest=0x138b098,
> completionTag=0x7fffaece38b0 "") at pquery.c:773
> #14 0x008c1288 in exec_simple_query (
> query_string=0x128e938 "select sum(c) over (partition by c groups
> between 1 preceding and current row) from test;") at postgres.c:1122
> #15 0x008c555e in PostgresMain (argc=1, argv=0x12b8258,
> dbname=0x12b80d8 "postgres", username=0x12b80b0 "masahiko") at
> postgres.c:4153
> #16 0x00822c3c in BackendRun (port=0x12b0020) at postmaster.c:4361
> #17 0x008223aa in BackendStartup (port=0x12b0020) at postmaster.c:4033
> #18 0x0081e84b in ServerLoop () at postmaster.c:1706
> #19 0x0081e17d in PostmasterMain (argc=5, argv=0x1289330) at
> postmaster.c:1379
> #20 0x007452d0 in main (argc=5, argv=0x1289330) at main.c:228
>
> The cause of this assertion failure is that we attempted to select a
> read pointer (framehead_ptr) that is not allocated. We allocate read
> pointers for both frame head and tail when begin a new partition in
> begin_partition(). The current code doesn't allocate them as follows
> if ORDER BY clause is omitted, but this behavior doesn't match to both
> update_framheadpos() and update_framtailpos() which always use each
> read pointer in GROUPS offset mode.
>
> if ((winstate->frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS)) &&
>node->ordNumCols != 0)
> {
> if (!(winstate->frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING))
> winstate->framehead_ptr =
> tuplestore_alloc_read_pointer(winstate->buffer, 0);
> if (!(winstate->frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING))
> winstate->frametail_ptr =
> tuplestore_alloc_read_pointer(winstate->buffer, 0);
> }
>
> Since this issue relates to only GROUPS mode it happen in PostgreSQL
> 11 or later. Attached patch fixes this issue and add regression tests
> for testing GROUPS mode without ORDER BY clause.

Since this problem still happens with current HEAD I've added this
item to Open Item.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: Copy function for logical replication slots

2018-07-08 Thread Masahiko Sawada

On Fri, Jul 6, 2018 at 9:21 AM, Michael Paquier  wrote:
> On Thu, Jul 05, 2018 at 05:24:48PM +0900, Masahiko Sawada wrote:
>> Yes, you're right. To guarantee that restart LSN of copied slot is
>> available, it seems to me that it's better to copy new slot while
>> holding the origin slot as you mentioned before. Since the replication
>> slot creation code assumes that a process creating a new slot doesn't
>> have any slots we should save origin slot temporary and create new
>> one, and then restore it.
>
> This will require some refactoring first I think as most of the slot
> routines assume that the process owning it is the one doing the calls,
> so this has a string smell of a patch set being splitted.
>
>> It might be a bit tricky but would work fine.
>
> Sawada-san, will you be able to rewrite this patch soon or should it be
> moved to the next commit fest?  I would suggest to do the latter as this
> is no small work, and this needs careful thoughts.

I think that this patch might be splitted but I will be able to send
an updated patch in the next week. As you suggestion this patch needs
more careful thoughts. I'll move this patch to the next commit fest if
I will not be able to sent it. Is that okay?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: WAL prefetch

2018-07-08 Thread Sean Chittenden

> Without prefetching, it's ~70GB of WAL. With prefetching, it's only about
> 30GB. Considering the 1-hour test generates about 90GB of WAL, this means the
> replay speed grew from 20GB/h to almost 60GB/h. That's rather measurable
> improvement ;-)

Thank you everyone for this reasonably in-depth thread on prefaulting.

Because this was a sprawling thread and I haven't been keeping up with this
discussion until now, let me snag a bunch of points and address them here in one
shot.  I've attempted to answer a bunch of questions that appear to have come up
during this thread, as well as provide some clarity where there were unanswered
questions.  Apologies in advance for the length.

There are a few points that I want to highlight regarding prefaulting, and I
also want to call out when prefaulting is and isn't useful.  But first, let me
introduce three terms that will help characterize this problem:

1. Hot read-modify-write - a PG page that is modified while the page is still
   contained within shared_buffers.
2. Warm read-modify-write ("RMW") - a PG page that's in the filesystem cache but
   not present in shared_buffers.
3. Cold RMW - a PG page is not in either PG's shared_buffers or the OS'es
   filesystem cache.

Prefaulting is only useful in addressing the third situation, the cold
read-modify-write.  For fast disks, or systems that have their entire dataset
held in RAM, or whose disk systems can perform a RMW fast enough for the
velocity of incoming writes, there is no benefit of prefaulting (this is why
there is a high and low-watermark in pg_prefaulter).  In these situations
prefaulting would potentially be extra constant overhead, especially for DBs
where their workload is ~100% Hot/Warm RMW.  Primaries are almost always under
the Hot RMW workload (cold restarts being the exception).

The warm RMW scenario could be solved by prefaulting into shared_buffers, but I
doubt there would be a significant performance benefit because the expense of
PostgreSQL faulting from shared_buffers to the OS cache is relatively small
compared to a disk read.  I do think there is something to be gained in the Warm
RMW case, but compared to Cold RMW, this optimization is noise and best left for
a future iteration.

The real importance of prefaulting becomes apparent in the following two
situations:

1. Priming the OS's filesystem cache, notably after an OS restart.  This is of
   value to all PostgreSQL scenarios, regardless of whether or not it's a
   primary or follower.  Reducing database startup/recovery times is very
   helpful, especially when recovering from an outage or after having performed
   planned maintenance.  Little in PostgreSQL administration is more infuriating
   than watching PostgreSQL recover and seeing the CPU 100% idle and the disk IO
   system nearly completely idle (especially during an outage or when recovering
   from an outage).
2. When the following two environmental factors are true:
   a. the volume of writes to discrete pages is high
   b. the interval between subsequent writes to a single page is long enough
  that a page is evicted from both shared_buffers and the filesystem cache

   Write-heavy workloads tend to see this problem, especially if you're
   attempting to provide consistency in your application and do not read from
   the followers (thereby priming their OS/shared_buffer cache).  If the
   workload is continuous, the follower may never be able overcome the write
   volume and the database never catches up.

The pg_prefaulter was borne out of the last workload, namely a write-heavy, 24/7
constant load with a large dataset.

What pg_prefaulter does is read in the blocks referenced from the WAL stream
(i.e. PG heap pages) and then load the referenced pages into the OS filesystem
cache (via threaded calls to pread(2)).  The WAL apply process has a cache-hit
because the filesystem cache has been primed with the heap page before the apply
process attempted to perform its read-modify-write of the heap.

It is important to highlight that this is a problem because there is only one
synchronous pread(2) call in flight at a time from the apply/recover/startup
process, which effectively acts as the speed limit for PostgreSQL.  The physics
of many workloads are such that followers are unable to keep up and are thus
destined to always fall behind (we've all seen this at some point, likely via
apply lag from a VACUUM or pg_repack).  The primary can schedule concurrent IO
from multiple client all making independent SELECTS.  Contrast that to a replica
who has zero knowledge of the IOs that the primary recently dispatched, and all
IO looks like random read and likely a cache miss.  In effect, the pg_prefaulter
raises the speed limit of the WAL apply/recovery process by priming the
filesystem cache by snooping in on the WAL stream.

PostgreSQL's WAL apply and recovery process is only capable of scheduling a
single synchronous pread(2) syscall.  As a result, even if you have an RAID10

Re: [HACKERS] Another oddity in handling of WCO constraints in postgres_fdw

2018-07-08 Thread Jeff Davis

On Tue, 2018-03-06 at 20:09 +0900, Etsuro Fujita wrote:
> Agreed.  I added a comment to that function.  I think that that
> comment 
> in combination with changes to the FDW docs in the patch would help
> FDW 
> authors understand why that is needed.  Please find attached an
> updated 
> version of the patch.
> 
> Thanks for the comments!

Committed.

I made some small modifications and added a test for the case where the
foreign table is a partition of a local table, which follows a
different code path after commit 3d956d95.

Thank you!

Regards,
Jeff Davis

"Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-07-08 Thread Peter Geoghegan

On Wed, Jul 4, 2018 at 9:43 AM, Peter Geoghegan  wrote:
> I'm starting this new thread to discuss the benefits of B-Tree suffix
> truncation, and to demonstrate how effective it can be at increasing
> fan-out and space utilization with a real-world example. I haven't
> explained why suffix truncation matters very well before now. Now that
> I have a patch that works, I can just show the benefit directly. (By
> the way, there are similar benefits to covering INCLUDE indexes, where
> suffix truncation occurs all the time, and not just
> opportunistically.)

> Explanation
> ---
>
> Pivot tuples describe how the key space will be split up, which will
> *hopefully* be balanced for future insertions. So, apart from the
> obvious issue about the size of pivot tuples, there is a more
> important issue: why unnecessarily commit to certain exact boundaries
> early on, before it's clear how well that will balance values that get
> inserted later?

Actually, this particular example does not really show why general
suffix truncation is important. My example did manage to make an index
15% smaller in a practical, workable way, but traditional suffix
truncation deserves far less credit for that than I initially claimed.
My explanation was about 99% wrong, but my example is still valid. The
true explanation for why my patch (the pending v3 of my
unique-key-heap-TID patch) avoided so much bloat is very interesting,
because it is related to a broader problem. I'll show a simple example
where the bloat that my patch can avoid is far greater still,
involving a simple single-attribute secondary index.

Before I get to that example, I'll give the real explanation. The real
reason for the marked decrease in the level of bloating is that my
patch makes insertions into secondary indexes (non-unique indexes)
jump immediately to the leaf page that the tuple unambiguously belongs
on -- since it now has a unique key, there must be one exact page that
the value is supposed to go on. We avoid trying to find a place among
pages full of logical duplicates, and so we avoid the risk of "getting
tired" [1] and giving up on finding free space that is actually
available to us. "Getting tired" tends to produce really inefficient
space utilization among leaf pages full of duplicates, at least past a
certain tipping point.

The whole "getting tired" thing is the root of the problem here, which
is why the pending v3 of my patch will remove that code completely
(_bt_findinsertloc() is streamlined). The "tipping point" nature of
getting tired is of particular concern to me, since it sounds like
something that could have been a factor in Uber's much-publicized blog
post.   :-(

I attach the test case I mentioned, which I show the output from
below, with my commentary in parenthesis (note that there are "\echo"
psql statements whose output you'll also see):

$ psql -f testcase.sql
autovcuum should probably be disabled:
 autovacuum

 off
(1 row)

psql:testcase.sql:3: NOTICE:  table "sec_idx_bloat" does not exist, skipping
DROP TABLE
CREATE TABLE

(Created empty table named "sec_idx_bloat", with a single int4 attribute.)

CREATE INDEX

(Created empty index named "bloated" on that attribute.)

Initial insert of 1e7 sequential values:
INSERT 0 1000

"bloated" size on master: 214 MB
"bloated" size with patch: 214 MB

Initial insert of the value 42 1e7 times:
INSERT 0 1000

"bloated" size on master: 486 MB
"bloated" size with patch: 430 MB

Repeated insertion of the value 42 1e7 times:
INSERT 0 1000

"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

Delete + VACUUM away most of the dups:
DELETE 18001072
VACUUM

"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

(That is, VACUUM hasn't made either case have a smaller index yet,
though it probably gave more back many more whole index pages to the
FSM in the case of the patch.)

Third insertion of the value 42 1e7 times:
INSERT 0 1000

(This is where it gets really interesting, because things truly fall
apart for master, whereas the patch case manages to reuse existing
free space in all cases.)

"bloated" size on master: 1029 MB
"bloated" size with patch: 645 MB

Fourth insertion of the value 42 1e7 times:
INSERT 0 1000

"bloated" size on master: 1301 MB
"bloated" size with patch: 688 MB

(Patch can still almost reuse all the space left behind by VACUUM,
though since it's a bit bigger than the original high watermark size
of 645 MB it's not perfectly efficient. Master flounders even further
behind, though.)

To summarize: recognizing that bloat in indexes is correlated with
bloat in tables allows us to recycle space in both structures
cooperatively, which can be very important.

While this example focusses on bloat, there are a lot of other things
to be unhappy about around buffer lock contention, and around the
number of pages that will be dirtied. Random choices about which page
to dirty leads to increased random I/O when checkpointi

Re: cost_sort() improvements

2018-07-08 Thread Tomas Vondra

Hi,

I'll do more experiments/tests next week, but let me share at least some
initial thoughts ...

On 06/28/2018 06:47 PM, Teodor Sigaev wrote:
> Hi!
> 
> Current estimation of sort cost has following issues:
>  - it doesn't differ one and many columns sort
>  - it doesn't pay attention to comparison function cost and column width
>  - it doesn't try to count number of calls of comparison function on per
> column
>    basis
> 
> At least [1] and [2] hit into to that issues and have an
> objections/questions about correctness of cost sort estimation.
> Suggested patch tries to improve current estimation and solve that issues.
> 

For those not following the two patches ("GROUP BY optimization" [1] and
"Incremental sort" [2]): This wasn't a major issue so far, because we
are not reordering the grouping keys to make the sort cheaper (which is
what [1] does), nor do we ignore some of the sort keys (which is what
[2] does, pretty much). Both patches need to estimate number of
comparisons on different columns and/or size of groups (i.e. ndistinct).

> Basic ideas:
>  - let me introduce some designations
>     i   - index of column, started with 1
>     N   - number of tuples to sort
>     Gi  - number of groups, calculated by i number columns. Obviously,
>   G0 == 1. Number of groups is caculated by estimate_num_groups().
>     NGi - number of tuples in one group. NG0 == N and NGi = N/G(i-1)
>     Fi  - cost of comparison function call of i-th column

OK, so Fi is pretty much whatever CREATE FUNCTION ... COST says, right?

>     Wi  - average width in bytes of 1-ith column.
>     Ci  - total cost of one comparison call of column calculated as
>   Fi * fine_function(Wi) where fine_function(Wi) looks like:
>   if (Wi <= sizeof(Datum))
>   return 1.0; //any type passed in Datum
>   else
>   return 1 + log16(Wi/sizeof(Datum));
>   log16 just an empirical choice
>  - So, cost for first column is 2.0 * C0 * log2(N)
>    First column is always compared, multiplier 2.0 is defined per
> regression
>    test. Seems, it estimates a cost for transferring tuple to CPU cache,
>    starting cost for unpacking tuple, cost of call qsort compare wrapper
>    function, etc. Removing this multiplier causes too optimistic
> prediction of
>    cost.

Hmm, makes sense. But doesn't that mean it's mostly a fixed per-tuple
cost, not directly related to the comparison? For example, why should it
be multiplied by C0? That is, if I create a very expensive comparator
(say, with cost 100), why should it increase the cost for transferring
the tuple to CPU cache, unpacking it, etc.?

I'd say those costs are rather independent of the function cost, and
remain rather fixed, no matter what the function cost is.

Perhaps you haven't noticed that, because the default funcCost is 1?

>  - cost for i-th columns:
>    Ci * log2(NGi) => Ci * log2(N/G(i-1))

OK. So essentially for each column we take log2(tuples_per_group), as
total number of comparisons in quick-sort is O(N * log2(N)).

>    So, here I believe, that i-th comparison function will be called only
> inside
>    group which number is defined by (i-1) leading columns. Note, following
>    discussion [1] I add multiplier 1.5 here to be closer to worst case when
>    groups are significantly differ. Right now there is no way to
> determine how
>    differ they could be. Note, this formula describes cost of first
> column too
>    except difference with multiplier.

The number of new magic constants introduced by this patch is somewhat
annoying. 2.0, 1.5, 0.125, ... :-(

>  - Final cost is cpu_operator_cost * N * sum(per column costs described
> above).
>    Note, for single column with width <= sizeof(datum) and F1 = 1 this
> formula
>    gives exactly the same result as current one.
>  - for Top-N sort empiric is close to old one: use 2.0 multiplier as
> constant
>    under log2, and use log2(Min(NGi, output_tuples)) for second and
> following
>    columns.
> 

I think compute_cpu_sort_cost is somewhat confused whether
per_tuple_cost is directly a cost, or a coefficient that will be
multiplied with cpu_operator_cost to get the actual cost.

At the beginning it does this:

per_tuple_cost = comparison_cost;

so it inherits the value passed to cost_sort(), which is supposed to be
cost. But then it does the work, which includes things like this:

per_tuple_cost += 2.0 * funcCost * LOG2(tuples);

where funcCost is pretty much pg_proc.procost. AFAIK that's meant to be
a value in units of cpu_operator_cost. And at the end it does this

per_tuple_cost *= cpu_operator_cost;

I.e. it gets multiplied with another cost. That doesn't seem right.

For most cost_sort calls this may not matter as the comparison_cost is
set to 0.0, but plan_cluster_use_sort() sets this explicitly to:

comparisonCost
= 2.0 * (indexExprCost.startup + indexExprCost.per_tuple);

which is likely to cause issues.

Also, why do we need this?

Re: cost_sort() improvements

2018-07-08 Thread Tomas Vondra




On 06/29/2018 04:36 PM, Teodor Sigaev wrote:
> 
> 
> Peter Geoghegan wrote:
>> On Thu, Jun 28, 2018 at 9:47 AM, Teodor Sigaev  wrote:
>>> Current estimation of sort cost has following issues:
>>>   - it doesn't differ one and many columns sort
>>>   - it doesn't pay attention to comparison function cost and column
>>> width
>>>   - it doesn't try to count number of calls of comparison function on
>>> per
>>> column
>>>     basis
>>
>> I've been suspicious of the arbitrary way in which I/O for
>> external sorts is costed by cost_sort() for a long time. I'm not
>> 100% sure about how we should think about this question, but I am
>> sure that it needs to be improved in *some* way.
>>
> Agree, but I think it should be separated patch to attack this
> issue. And I don't have an idea how to do it, at least right now.
> Nevertheless, I hope, issue of estimation of disk-based sort isn't a
> blocker of CPU cost estimation improvements.
> 

Yes, I agree this should be addressed separately. Peter is definitely
right that should look at costing internal vs. external sorts (after
all, it's generally foolish to argue with *him* about sorting stuff).

But the costing changes discussed here are due to my nagging from the
GROUP BY patch (and also the "incremental sort" patch). The internal vs.
external costing seems like an independent issue.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: hot_standby_feedback vs excludeVacuum and snapshots

2018-07-08 Thread Noah Misch

On Fri, Jul 06, 2018 at 04:32:56PM +0100, Simon Riggs wrote:
> On 6 July 2018 at 03:30, Thomas Munro  wrote:
> > On Thu, Jul 5, 2018 at 8:27 AM, Noah Misch  wrote:
> >>> However, 49bff5300d527 also introduced a similar bug where 
> >>> subtransaction
> >>> commit would fail to release an AccessExclusiveLock, leaving the lock 
> >>> to
> >>> be removed sometimes early and sometimes late. This commit fixes
> >>> that bug also. Backpatch to PG10 needed.
> >>
> >> Subtransaction commit is too early to release an arbitrary
> >> AccessExclusiveLock.  The primary releases every AccessExclusiveLock at
> >> top-level transaction commit, top-level transaction abort, or 
> >> subtransaction
> >> abort.  CommitSubTransaction() doesn't do that; it transfers locks to the
> >> parent sub(xact).  Standby nodes can't safely remove an arbitrary lock 
> >> earlier
> >> than the primary would.
> >
> > But we don't release locks acquired by committing subxacts until the
> > top level xact commits.  Perhaps that's what the git commit message
> > meant by "early", as opposed to "late" meaning when
> > StandbyReleaseOldLocks() next runs because an XLOG_RUNNING_XACTS
> > record is replayed?
> 
> Locks held by subtransactions were not released at the correct timing
> of top-level commit; they are now.

I read the above-quoted commit message as saying that the commit expands the
set of locks released when replaying subtransaction commit.  The only lock
that should ever be released at subxact commit is the subxact XID lock.  If
that continues to be the only lock released at subxact commit, good.

Re: XLogSegNoOffsetToRecPtr fixup

2018-07-08 Thread Andres Freund

Hi,

On 2018-07-08 14:23:45 -0400, Alvaro Herrera wrote:
> Pursuant to closing comment in 
> https://postgr.es/m/20180306214239.ospkf6ie7aa5gm4j@alvherre.pgsql
> here's a quick patch to change the order of arguments in
> XLogSegNoOffsetToRecPtr.  Commit fc49e24fa69a ("Make WAL segment size
> configurable at initdb time.") put the walsegsz as last argument,
> *after* its output argument, which is downright weird.
> 
> I propose to apply this to pg11 and master, to avoid an unnecessary API
> change in pg12.

WFM.

Thanks,

Andres

XLogSegNoOffsetToRecPtr fixup

2018-07-08 Thread Alvaro Herrera

Pursuant to closing comment in 
https://postgr.es/m/20180306214239.ospkf6ie7aa5gm4j@alvherre.pgsql
here's a quick patch to change the order of arguments in
XLogSegNoOffsetToRecPtr.  Commit fc49e24fa69a ("Make WAL segment size
configurable at initdb time.") put the walsegsz as last argument,
*after* its output argument, which is downright weird.

I propose to apply this to pg11 and master, to avoid an unnecessary API
change in pg12.

-- 
Álvaro Herrera http://www.flickr.com/photos/alvherre/
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index d6b5b05425..f251989f0b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1971,7 +1971,7 @@ XLogBytePosToRecPtr(uint64 bytepos)
seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + 
SizeOfXLogShortPHD;
}
 
-   XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, result, wal_segment_size);
+   XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, wal_segment_size, result);
 
return result;
 }
@@ -2017,7 +2017,7 @@ XLogBytePosToEndRecPtr(uint64 bytepos)
seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + 
SizeOfXLogShortPHD;
}
 
-   XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, result, wal_segment_size);
+   XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, wal_segment_size, result);
 
return result;
 }
diff --git a/src/backend/access/transam/xlogreader.c 
b/src/backend/access/transam/xlogreader.c
index dd96cef8f0..4c633c6c49 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -747,7 +747,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, 
XLogRecPtr recptr,
XLByteToSeg(recptr, segno, state->wal_segment_size);
offset = XLogSegmentOffset(recptr, state->wal_segment_size);
 
-   XLogSegNoOffsetToRecPtr(segno, offset, recaddr, 
state->wal_segment_size);
+   XLogSegNoOffsetToRecPtr(segno, offset, state->wal_segment_size, 
recaddr);
 
if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
{
diff --git a/src/backend/replication/logical/reorderbuffer.c 
b/src/backend/replication/logical/reorderbuffer.c
index d3ec137051..9b55b94227 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2782,7 +2782,7 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot 
*slot, TransactionId xid
 {
XLogRecPtr  recptr;
 
-   XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size);
+   XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, recptr);
 
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap",
 NameStr(MyReplicationSlot->data.name),
diff --git a/src/bin/pg_basebackup/pg_receivewal.c 
b/src/bin/pg_basebackup/pg_receivewal.c
index 071b32d19d..ed9d7f6378 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -352,7 +352,7 @@ FindStreamingStart(uint32 *tli)
if (!high_ispartial)
high_segno++;
 
-   XLogSegNoOffsetToRecPtr(high_segno, 0, high_ptr, WalSegSz);
+   XLogSegNoOffsetToRecPtr(high_segno, 0, WalSegSz, high_ptr);
 
*tli = high_tli;
return high_ptr;
diff --git a/src/bin/pg_resetwal/pg_resetwal.c 
b/src/bin/pg_resetwal/pg_resetwal.c
index 8cff535692..d63a3a27f6 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -933,8 +933,8 @@ RewriteControlFile(void)
 * Adjust fields as needed to force an empty XLOG starting at
 * newXlogSegNo.
 */
-   XLogSegNoOffsetToRecPtr(newXlogSegNo, SizeOfXLogLongPHD,
-   
ControlFile.checkPointCopy.redo, WalSegSz);
+   XLogSegNoOffsetToRecPtr(newXlogSegNo, SizeOfXLogLongPHD, WalSegSz,
+   
ControlFile.checkPointCopy.redo);
ControlFile.checkPointCopy.time = (pg_time_t) time(NULL);
 
ControlFile.state = DB_SHUTDOWNED;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index b4c1f827a6..1689279767 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -248,7 +248,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr 
targetPagePtr,
XLogSegNo   targetSegNo;
 
XLByteToSeg(targetPagePtr, targetSegNo, WalSegSz);
-   XLogSegNoOffsetToRecPtr(targetSegNo + 1, 0, targetSegEnd, WalSegSz);
+   XLogSegNoOffsetToRecPtr(targetSegNo + 1, 0, WalSegSz, targetSegEnd);
targetPageOff = XLogSegmentOffset(targetPagePtr, WalSegSz);
 
/*
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 5c4f38e597..d41b831b18 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1039,7 +1039,7 @@ main(int argc, char **argv)

Re: How can we submit code patches that implement our (pending) patents?

2018-07-08 Thread Andres Freund

Hi,

On 2018-07-08 11:46:51 -0400, Alvaro Herrera wrote:
> On 2018-Jul-07, David Fetter wrote:
> 
> > If they have no plans to exercise any proprietary rights, our usual
> > process where people submit things and agree to have us label them
> > with the PGDG copyright and publish them under TPL would be the
> > simplest way to accomplish it.
> 
> Eh, but if the submitting company has patents, would it not be dishonest
> to publish as PGDG copyright & license with no attached patent grant?
> Some other company deriving a proprietary fork from Postgres could later
> be sued by the submitting company, because there is no legal standing
> for them to use the patented code.

Yep. And even if the original submitter has good intent, it's not
unlikely for companies to go bancrupt and their assets sold off.

> TBH I don't understand how can we dual-license the code in a manner that
> protects those proprietary forks.  Can you (Andres) explain what is the
> idea?

https://www.apache.org/licenses/LICENSE-2.0 grants roughly the same
rights as our license. But 3) additionally grants a patent license. That
license is *not* restricted to the code in unmodified form.  By
requiring future contributions to be both under the pg license and
apache 2, downstream users, including proprietary forks, can safely use
code contributed in the future, without risk of patent mines.

There are arguments made that TPL (and BSD, MIT etc) already includes an
implicit patent grant, but while a longstanding theory, it's to my
knowledge not legally been tested.

IANAL etc.

Greetings,

Andres Freund

Re: How can we submit code patches that implement our (pending) patents?

2018-07-08 Thread Alvaro Herrera

On 2018-Jul-07, David Fetter wrote:

> If they have no plans to exercise any proprietary rights, our usual
> process where people submit things and agree to have us label them
> with the PGDG copyright and publish them under TPL would be the
> simplest way to accomplish it.

Eh, but if the submitting company has patents, would it not be dishonest
to publish as PGDG copyright & license with no attached patent grant?
Some other company deriving a proprietary fork from Postgres could later
be sued by the submitting company, because there is no legal standing
for them to use the patented code.

TBH I don't understand how can we dual-license the code in a manner that
protects those proprietary forks.  Can you (Andres) explain what is the
idea?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pglife and devel branch content

2018-07-08 Thread Bruce Momjian

Previously, pglife (http://pglife.momjian.us/) showed all commits for
the "devel" branch back back to the start of the git tree.  It now shows
only commits since the last major branch.  For example, "12 devel" now
shows only commits since we branched the git tree for PG 11.

This should make it easier to see what we have done for PG 12. 
Unfortunately, it doesn't remove backbranch commits like git_changelog
because it uses a preexisting gitweb page.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: Invisible Indexes

2018-07-08 Thread Jeff Janes

On Mon, Jun 18, 2018 at 5:57 PM, Tom Lane  wrote:

>
> I'm not sure about the "enforce constraint only" argument --- that
> sounds like a made-up use-case to me.  It's pretty hard to imagine
> a case where a unique index applies to a query and yet you don't want
> to use it.
>

I've not seen it with unique constraints, but have with EXCLUDE
constraints.  GiST index costing is not very robust, and the planner can
easily decide that a read query should use the EXCLUDE-supporting GiST
index in cases where it is not optimal.

Cheers,

Jeff

Re: [PATCH] Use access() to check file existence in GetNewRelFileNode().

2018-07-08 Thread Michael Paquier

On Fri, Jul 06, 2018 at 10:52:14PM +0200, Peter Eisentraut wrote:
> This patch looks sensible to me.  We also use access() elsewhere in the
> backend to test for file existence.

Yes, the patch made sense when I looked at it, and it still does, so
committed.
--
Michael


signature.asc
Description: PGP signature

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-07-08 Thread Fabien COELHO




Hello Alvaro,


For context: in the backend, elog() is only used for internal messages
(i.e. "can't-happen" conditions), and ereport() is used for user-facing
messages.  There are many things ereport() has that elog() doesn't, such
as additional message fields (HINT, DETAIL, etc) that I think could have
some use in pgbench as well.  If you use elog() then you can't have that.
[...]


Ok. Then forget elog, but I'm pretty against having a kind of ereport 
which looks greatly overkill to me, because:


 (1) the syntax is pretty heavy, and does not look like a function.

 (2) the implementation allocates a string buffer for the message
 this is greatly overkill for pgbench which only needs to print
 to stderr once.

This makes sense server-side because the generated message may be output 
several times (eg stderr, file logging, to the client), and the 
implementation has to work with cpp implementations which do not handle 
varags (and maybe other reasons).


So I would be in favor of having just a simpler error function. 
Incidentally, one already exists "pgbench_error" and could be improved, 
extended, replaced. There is also "syntax_error".



One thing that just came to mind is that pgbench uses some src/fe_utils
stuff.  I hope having ereport() doesn't cause a conflict with that ...


Currently ereport does not exists client-side. I do not think that this 
patch is the right moment to decide to do that. Also, there are some 
"elog" in libpq, but they are out with a "#ifndef FRONTEND".



BTW I think abort() is not the right thing, as it'll cause core dumps if
enabled.  Why not just exit(1)?


Yes, I agree and already reported that.

Conclusion:

My current opinion is that I'm pretty against bringing "ereport" to the 
front-end on this specific pgbench patch. I agree with you that "elog" 
would be misleading there as well, for the arguments you developed above.


I'd suggest to have just one clean and simple pgbench internal function to 
handle errors and possibly exit, debug... Something like


  void pgb_error(FATAL, "error %d raised", 12);

Implemented as

  void pgb_error(int/enum XXX level, const char * format, ...)
  {
 test level and maybe return immediately (eg debug);
 print to stderr;
 exit/abort/return depending;
  }

Then if some advanced error handling is introduced for front-end programs, 
possibly through some macros, then it would be time to improve upon that.


--
Fabien.

Re: Desirability of client-side expressions in psql?

2018-07-08 Thread Fabien COELHO




Hello Stephen,

[...] So, I tend to agree w/ Andrew that while this is a good topic to 
have on -hackers, it shouldn't be a CF entry.  I wouldn't draw any 
conclusions from Andrew closing it out as "not appropriate for CF".


Sure. As I had no committer feedback on the discussion for 3 months, I 
tried this as an ineffective way to get some. It did not work up to now.



As I have not received feedback from committers about the desirability of
the feature, I interpret that as "the feature is not desirable", and I will
not develop it, because of the probability that this would be time down the
drain.


Personally, I'm definitely in favor of having a lot more flexibility and
capability in psql, that's an area which I think we don't focus on
nearly enough.  Having to fight with bash or another language to make
calls to psql to get things done is downright annoying.

So, +1 from me on the overall idea.


Good, that is one committer opinion, an infinite improvement over the 
previous status:-)



The challenge here will almost certainly be in the details.


Yep. I'm fine with "your code is not good and creates problems so it is 
rejected". I'm trying to avoid "your code was a loss from the start, 
whatever you did, because we do not want such a feature".


I do like the proposal you have of building out a common set of 
capabilities which are shared between psql and pgbench.


Good.

The other big challenge here is figuring out how to distinguish between 
SQL which should be sent to the server and something which needs to be 
client-side processed.


The current great idea is to use backslash commands to define existing 
variables:


  psql> \let i  1 + 2 * 3
  psql> SELECT :i ;
  psql> \if :i >= 5
  psql>   ...
  psql> \endif

I've never liked the ':var' approach and it really sucks when you want 
to combine that variable with something else, but it's what we've got 
and therefore has history behind it.


Indeed. I do not think that changing this would make much sense.

If you can find a way to improve on that without breaking existing code, 
that'd be fantastic.  If not, definitely try to minimize the impact.


I was just planning to set existing :-variables with expressions, I have 
no great other idea. Mixing languages is always a pain.


Thanks for your feedback.

--
Fabien.

Re: Non-reserved replication slots and slot advancing

Re: Non-reserved replication slots and slot advancing

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

Re: [HACKERS] Restricting maximum keep segments by repslots

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

Re: Postgres 11 release notes

Re: Non-reserved replication slots and slot advancing

Re: Postgres 11 release notes

Incorrect error handling for two-phase state files resulting in data loss

RE: Locking B-tree leafs immediately in exclusive mode

Re: no partition pruning when partitioning using array type

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

Re: Typo in Japanese translation of psql.

Re: no partition pruning when partitioning using array type

Re: Let's remove DSM_IMPL_NONE.

[GSoC] working status

Re: Copy function for logical replication slots

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Re: [HACKERS] Secondary index access optimizations

Re: Failure assertion in GROUPS mode of window function in current HEAD

Re: Copy function for logical replication slots

Re: WAL prefetch

Re: [HACKERS] Another oddity in handling of WCO constraints in postgres_fdw

"Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

Re: cost_sort() improvements

Re: cost_sort() improvements

Re: hot_standby_feedback vs excludeVacuum and snapshots

Re: XLogSegNoOffsetToRecPtr fixup

XLogSegNoOffsetToRecPtr fixup

Re: How can we submit code patches that implement our (pending) patents?

Re: How can we submit code patches that implement our (pending) patents?

pglife and devel branch content

Re: Invisible Indexes

Re: [PATCH] Use access() to check file existence in GetNewRelFileNode().

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

Re: Desirability of client-side expressions in psql?

36 matches

Site Navigation

Mail list logo

Footer information