Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Mon, Jul 9, 2018 at 3:39 PM Haribabu Kommi wrote: > > On Mon, Jul 9, 2018 at 12:24 PM Michael Paquier > wrote: > >> On Fri, Jul 06, 2018 at 05:10:18PM -0400, Alvaro Herrera wrote: >> > Ugh, it's true :-( >> > >> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=25fff40798fc4ac11a241bfd9ab0c45c085e2212#patch8 >> > >> > Dave, Simon, any comments? >> >> The offending line: >> contrib/pg_stat_statements/pg_stat_statements--1.4--1.5.sql: >> GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO pg_read_all_stats; >> >> This will need a new version bump down to REL_10_STABLE... >> > > Hearing no objections, the attached patch removes all permissions from PUBLIC, > as was the case before this change went in. Or do we need to add a command to revoke only > from pg_read_all_stats? > REVOKE ALL doesn't work, so the patch is updated to revoke from pg_read_all_stats. Regards, Haribabu Kommi Fujitsu Australia 0001-Revoke-pg_stat_statements_reset-permissions_v2.patch Description: Binary data
Re: [HACKERS] Restricting maximum keep segments by repslots
Hello, Sawada-san. Thank you for the comments. At Thu, 5 Jul 2018 15:43:56 +0900, Masahiko Sawada wrote in > On Wed, Jul 4, 2018 at 5:28 PM, Kyotaro HORIGUCHI > wrote: > > Hello. > > > > At Tue, 26 Jun 2018 16:26:59 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI > > wrote in > > <20180626.162659.223208514.horiguchi.kyot...@lab.ntt.co.jp> > >> The previous patch files don't have version numbers, so I let > >> the attached latest version be v2. > >> > >> > >> v2-0001-Add-WAL-releaf-vent-for-replication-slots.patch > >> The body of the limiting feature > >> > >> v2-0002-Add-monitoring-aid-for-max_replication_slots.patch > >> Shows the status of WAL retention in the pg_replication_slots view > >> > >> v2-0003-TAP-test-for-the-slot-limit-feature.patch > >> TAP test for this feature > >> > >> v2-0004-Documentation-for-slot-limit-feature.patch > >> Documentation, as the name says. > > > > Travis (test_decoding test) showed that GetOldestXLogFileSegNo > > added by 0002 forgets to close the temporarily opened pg_wal > > directory. This is the fixed version, v3. > > > > Thank you for updating the patch! I looked at the v3 patches. Here are > review comments. > > --- > + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, > + gettext_noop("Sets the maximum size of extra > WALs kept by replication slots."), > +NULL, > +GUC_UNIT_MB > + }, > + _slot_wal_keep_size_mb, > + 0, 0, INT_MAX, > + NULL, NULL, NULL > + }, > > Maybe MAX_KILOBYTES would be better instead of INT_MAX, like max_wal_size. MAX_KILOBYTES is the maximum size value in kB that fits into long or Size/size_t variables after being converted into bytes. Applying the limit there means assuming that the _mb variable is converted not into bytes but into kB. So applying it to max/min_wal_size seems somewhat wrong, but does no harm since they are not actually converted into bytes. max_slot_wal_keep_size is not converted into bytes, so capping it with INT_MAX is no problem. 
However, it doesn't need to be larger than MAX_KILOBYTES, so I follow that in order to keep it consistent with max/min_wal_size. > --- > Once the following WARNING is emitted, this message is emitted whenever we > execute CHECKPOINT, even if we don't change anything. Is that expected > behavior? I think it would be better to emit this message only when we > remove WAL segments that are required by slots. > > WARNING: some replication slots have lost required WAL segments > DETAIL: The mostly affected slot has lost 153 segments. I didn't consider the situation where the number of lost segments doesn't change. Changed to mute the message when the number of lost segments has not changed. > --- > + Assert(wal_keep_segments >= 0); > + Assert(max_slot_wal_keep_size_mb >= 0); > > These assertions are not meaningful because these parameters are > ensured >= 0 by their definitions. Yeah, that looks a bit paranoid. Removed. > --- > +/* slots aren't useful, consider only wal_keep_segments */ > +if (slotpos == InvalidXLogRecPtr) > +{ > > Perhaps XLogRecPtrIsInvalid(slotpos) is better. Agreed. It is changed to "slotpos != InvalidXLogRecPtr" after changing the function per the comments below. I think that the double negation !XLogRecPtrIsInvalid() is not fine. > --- > +if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + > wal_keep_segments) > +slotpos = InvalidXLogRecPtr; > + > +/* slots aren't useful, consider only wal_keep_segments */ > +if (slotpos == InvalidXLogRecPtr) > +{ > > This logic is hard for me to read. The slotpos can be any of: valid; > valid but then becoming invalid halfway through; or invalid from the beginning of > this function. Can we convert this logic to the following? > > if (XLogRecPtrIsInvalid(slotpos) || > currSeg <= slotSeg + wal_keep_segments) Right. But it is removed. > --- > +keepSegs = wal_keep_segments + > +ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); > > Why do we need to keep (wal_keep_segments + max_slot_wal_keep_size) WAL > segments? 
I think what this feature does is: if wal_keep_segments is > not useful (that is, slotSeg < (currSeg - wal_keep_segments)), then we > normally choose slotSeg as the lower boundary, but max_slot_wal_keep_size > restricts the lower boundary so that it doesn't get lower than the > threshold. So I thought what this function should do is to calculate > min(currSeg - wal_keep_segments, max(currSeg - max_slot_wal_keep_size, > slotSeg)); I might be missing something though. You're right that wal_keep_segments should not be added, but should give the lower limit of segments to keep, as the current KeepLogSeg() does. Fixed that. Since the amount is specified in megabytes, silently rounding down to segment boundaries may not be proper in general, and this feature used to use the fragments to show users something. But there's no longer a place where the fragments are
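The min/max boundary discussed above can be sketched in isolation as follows. This is only an illustrative sketch, not the patch's actual KeepLogSeg() code: the function name and the bare segment-number arithmetic are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;

/*
 * Illustrative sketch of the boundary Sawada-san suggests:
 *   min(currSeg - wal_keep_segments,
 *       max(currSeg - max_slot_keep_segs, slotSeg))
 * i.e. slots hold back WAL recycling, but never by more than
 * max_slot_wal_keep_size, while wal_keep_segments always retains its
 * minimum amount of WAL regardless of slots.
 */
static XLogSegNo
oldest_segment_to_keep(XLogSegNo currSeg, XLogSegNo slotSeg,
                       XLogSegNo wal_keep_segments,
                       XLogSegNo max_slot_keep_segs)
{
    XLogSegNo keep = slotSeg;

    /* Cap how far back a slot may hold WAL. */
    if (currSeg > max_slot_keep_segs && keep < currSeg - max_slot_keep_segs)
        keep = currSeg - max_slot_keep_segs;

    /* wal_keep_segments is a lower bound on retention, not an addition. */
    if (currSeg > wal_keep_segments && keep > currSeg - wal_keep_segments)
        keep = currSeg - wal_keep_segments;

    return keep;
}
```

For example, with currSeg = 100, slotSeg = 50, wal_keep_segments = 10 and a 30-segment slot cap, a slot lagging far behind is capped at segment 70, while a slot close to the head cannot push the boundary past segment 90.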
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Mon, Jul 9, 2018 at 12:24 PM Michael Paquier wrote: > On Fri, Jul 06, 2018 at 05:10:18PM -0400, Alvaro Herrera wrote: > > Ugh, it's true :-( > > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=25fff40798fc4ac11a241bfd9ab0c45c085e2212#patch8 > > > > Dave, Simon, any comments? > > The offending line: > contrib/pg_stat_statements/pg_stat_statements--1.4--1.5.sql: > GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO pg_read_all_stats; > > This will need a new version bump down to REL_10_STABLE... > Hearing no objections, the attached patch removes all permissions from PUBLIC, as was the case before this change went in. Or do we need to add a command to revoke only from pg_read_all_stats? Regards, Haribabu Kommi Fujitsu Australia 0001-Revoke-pg_stat_statements_reset-permissions.patch Description: Binary data
Re: Postgres 11 release notes
On 2018-07-09 14:18:14 +0900, Tatsuro Yamada wrote: > Hi Bruce, > > > I expect a torrent of feedback.;-) > > Could you add this fix to the release notes, because this change affects > extension developers using the hook function. > It would be better to learn of the change by reading the release notes, rather > than from a compilation error. > > > > Add QueryEnvironment to ExplainOneQuery_hook's parameter list > (Tatsuro Yamada, Thomas Munro) > > > Discussion: > https://postgr.es/m/890e8dd9-c1c7-a422-6892-874f5eaee...@lab.ntt.co.jp > > I guess the "E.1.3.11. Source Code" or "E.1.3.12. Additional Modules" sections > would be suitable places for this in the release notes. We adapt plenty of function signatures without listing them individually in the release notes. I don't think randomly selecting one relatively uncommonly used hook is a good way to attack that. Greetings, Andres Freund
Re: Non-reserved replication slots and slot advancing
On Fri, Jul 06, 2018 at 03:37:57PM +0900, Kyotaro HORIGUCHI wrote: > I'm not so in favor of the word "reserve" in the first place since it > doesn't seem intuitive to me, but "active" is already used for > the state where the connection with the peer is made. (The word > "reserve" may be misused, since in the source code "reserve" is > used as "to reserve WAL", but as "to reserve a slot" in the > documentation.) That's the term used for what are now three releases, counting v11 in the pack, so I would not change it now. The concept of non-initialized slots is fuzzy as well, as it could be attached to some other meta-data. So, chewing on all that, I would suggest the following error message in the attached patch, and just move on: +ERROR: cannot move slot not reserving WAL -- Michael diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 21e9d56f73..cc3766e10e 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -131,3 +131,20 @@ SELECT pg_drop_replication_slot('regression_slot1'); ERROR: replication slot "regression_slot1" does not exist SELECT pg_drop_replication_slot('regression_slot2'); ERROR: replication slot "regression_slot2" does not exist +-- slot advance with physical slot, error with non-reserved slot +SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3'); +slot_name +-- + regression_slot3 +(1 row) + +SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN +ERROR: invalid target wal lsn +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error +ERROR: cannot move slot not reserving WAL +SELECT pg_drop_replication_slot('regression_slot3'); + pg_drop_replication_slot +-- + +(1 row) + diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 706340c1d8..24cdf7155d 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib/test_decoding/sql/slot.sql @@ -68,3 +68,9 @@ SELECT 'init' FROM 
pg_create_logical_replication_slot('regression_slot1', 'test_ -- both should error as they should be dropped on error SELECT pg_drop_replication_slot('regression_slot1'); SELECT pg_drop_replication_slot('regression_slot2'); + +-- slot advance with physical slot, error with non-reserved slot +SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3'); +SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error +SELECT pg_drop_replication_slot('regression_slot3'); diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 3ed9021c2f..4851bc2e24 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9867,7 +9867,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx The address (LSN) of oldest WAL which still might be required by the consumer of this slot and thus won't be - automatically removed during checkpoints. + automatically removed during checkpoints. NULL + if the LSN of this slot has never been reserved. diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..2f07cc37f5 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -483,6 +483,15 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) /* Acquire the slot so we "own" it */ ReplicationSlotAcquire(NameStr(*slotname), true); + /* A slot whose restart_lsn has never been reserved cannot be advanced */ + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) + { + ReplicationSlotRelease(); + ereport(ERROR, +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot move slot not reserving WAL"))); + } + /* * Check if the slot is not moving backwards. Physical slots rely simply * on restart_lsn as a minimum point, while logical slots have confirmed signature.asc Description: PGP signature
Re: Postgres 11 release notes
Hi Bruce, > I expect a torrent of feedback.;-) Could you add this fix to the release notes, because this change affects extension developers using the hook function. It would be better to learn of the change by reading the release notes, rather than from a compilation error. Add QueryEnvironment to ExplainOneQuery_hook's parameter list (Tatsuro Yamada, Thomas Munro) Discussion: https://postgr.es/m/890e8dd9-c1c7-a422-6892-874f5eaee...@lab.ntt.co.jp I guess the "E.1.3.11. Source Code" or "E.1.3.12. Additional Modules" sections would be suitable places for this in the release notes. Thanks, Tatsuro Yamada NTT Open Source Software Center
Incorrect error handling for two-phase state files resulting in data loss
Hi all, This is a follow-up of the following thread, where I have touched the topic of corrupted 2PC files being completely ignored by recovery: https://www.postgresql.org/message-id/20180709012955.GD1467%40paquier.xyz I have posted a patch on this thread, but after more reviews I have noticed that it does not close all the holes. What happens is that when a two-phase state file cannot be read by recovery when the file is on disk, which is what ReadTwoPhaseFile does, then its data is simply skipped. It is perfectly possible that an on-disk 2PC file is missing at crash recovery if it has already been processed, so if trying to open the file results in ENOENT, then the file can be properly skipped. However, if you look at this API, it skips *all* errors instead of only the ones that matter. For example, if a file needs to be processed and cannot be opened because of permission issues, then it is simply skipped. If its size, its CRC or its magic number is incorrect, the same thing happens. Unconditionally skipping files this way can result in data loss. I think that we really need to harden things by making ReadTwoPhaseFile() fail hard if it finds something unexpected, which is in this case anything except trying to open a file which fails with ENOENT, and that this stuff should be back-patched. Thoughts? -- Michael From c53e3f6b604ceffcf75484975331e1b141bb6139 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 9 Jul 2018 13:59:46 +0900 Subject: [PATCH] Fail hard when facing corrupted two-phase state files at recovery When a corrupted file is found by WAL replay, be it for crash recovery or archive recovery, the file is simply skipped and a WARNING is logged to the user. A corrupted on-disk two-phase state file is as likely to happen as corruption of its counterpart recorded in dedicated WAL records, but WAL records are already able to fail hard if there is a CRC mismatch at record level. 
The situation got better since 978b2f6 (9.6) which has added two-phase state information directly in WAL instead of using on-disk files, so the risk is limited to two-phase transactions which live across at least one checkpoint. Reported-by: Michael Paquier Author: Michael Paquier Discussion: https://postgr.es/m/XXX --- src/backend/access/transam/twophase.c | 131 +++--- 1 file changed, 53 insertions(+), 78 deletions(-) diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index e8d4e37fe3..3a048f9a4b 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1206,10 +1206,11 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info, * Read and validate the state file for xid. * * If it looks OK (has a valid magic number and CRC), return the palloc'd - * contents of the file. Otherwise return NULL. + * contents of the file. If missing_ok is true, do not issue an ERROR and + * return NULL. */ static char * -ReadTwoPhaseFile(TransactionId xid, bool give_warnings) +ReadTwoPhaseFile(TransactionId xid, bool missing_ok) { char path[MAXPGPATH]; char *buf; @@ -1225,12 +1226,12 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings) fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); if (fd < 0) { - if (give_warnings) - ereport(WARNING, - (errcode_for_file_access(), - errmsg("could not open two-phase state file \"%s\": %m", - path))); - return NULL; + if (missing_ok && errno == ENOENT) + return NULL; + ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not open two-phase state file \"%s\": %m", + path))); } /* @@ -1240,36 +1241,26 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings) * even on a valid file. 
*/ if (fstat(fd, )) - { - int save_errno = errno; - - CloseTransientFile(fd); - if (give_warnings) - { - errno = save_errno; - ereport(WARNING, - (errcode_for_file_access(), - errmsg("could not stat two-phase state file \"%s\": %m", - path))); - } - return NULL; - } + ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not stat two-phase state file \"%s\": %m", + path))); if (stat.st_size < (MAXALIGN(sizeof(TwoPhaseFileHeader)) + MAXALIGN(sizeof(TwoPhaseRecordOnDisk)) + sizeof(pg_crc32c)) || stat.st_size > MaxAllocSize) - { - CloseTransientFile(fd); - return NULL; - } + ereport(ERROR, +(errcode(ERRCODE_DATA_CORRUPTED), + errmsg("incorrect size of two-phase state file \"%s\": %zu bytes", + path, stat.st_size))); crc_offset = stat.st_size - sizeof(pg_crc32c); if (crc_offset != MAXALIGN(crc_offset)) - { - CloseTransientFile(fd); - return NULL; - } + ereport(ERROR, +(errcode(ERRCODE_DATA_CORRUPTED), + errmsg("incorrect alignment of CRC offset for two-phase state file \"%s\"", + path))); /* * OK, slurp in the file. @@ -1278,32 +1269,26 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
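The error-handling policy the patch argues for — tolerate only ENOENT, and only when the caller says the file may legitimately be missing; fail hard on everything else — can be sketched in isolation like this. The function name and the plain perror()/exit() error path are stand-ins for PostgreSQL's OpenTransientFile()/ereport() machinery, not the actual patch code.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sketch of the hardened policy proposed for ReadTwoPhaseFile():
 * a missing file is acceptable only when missing_ok says so; any
 * other open() failure (EPERM, EIO, ...) aborts instead of silently
 * skipping the two-phase state file.  perror()/exit() stand in for
 * ereport(ERROR, ...).
 */
static int
open_twophase_file(const char *path, int missing_ok)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        if (missing_ok && errno == ENOENT)
            return -1;          /* caller skips this transaction */
        perror(path);
        exit(EXIT_FAILURE);     /* fail hard on anything unexpected */
    }
    return fd;
}
```

The key point is the narrowness of the tolerated case: every other failure mode that the old code swallowed with a WARNING (or no message at all) becomes a hard error, so corruption can no longer masquerade as an already-processed transaction.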
RE: Locking B-tree leafs immediately in exclusive mode
On Mon, June 11, 2018 at 4:31 PM, Alexander Korotkov wrote: > On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs wrote: > > On 5 June 2018 at 14:45, Alexander Korotkov > > wrote: > > > Currently _bt_search() always locks the leaf buffer in shared mode (aka > > > BT_READ), while the caller can relock it later. However, I don't see > > > what prevents > > > _bt_search() > > > from locking the leaf immediately in exclusive mode (aka BT_WRITE) when > > > required. > > > When traversing the downlink from a non-leaf page of level 1 (the lowest level > > > of non-leaf pages, just prior to leaf pages), we know that the next page > > > is going to be a leaf. In this case, we can immediately lock the next page > > > in exclusive mode if required. > > > For sure, it might happen that we didn't manage to exclusively lock the > > > leaf in this way when _bt_getroot() points us to a leaf page. But > > > this case could be handled in _bt_search() by relocking. Please find an > > > implementation of this approach in the attached patch. > > > > It's a good idea. How does it perform with many duplicate entries? > > Our B-tree currently maintains duplicates unordered. So, during > insertion we can traverse rightlinks in order to find the page which would > fit the new index tuple. > However, in this case we're traversing pages in exclusive lock mode, and > that happens already after the re-lock. _bt_search() doesn't do anything > with that. > So, I assume there shouldn't be any degradation in the case of many > duplicate entries. > > But this case is definitely worth testing, and I'm planning to do it. > Hi, I'm reviewing this. First, I ran performance tests on a 72-core machine (AWS c5.18xlarge), same as you did. 
> # postgresql.conf > max_connections = 300 > shared_buffers = 32GB > fsync = off > synchronous_commit = off > > > > # DDL: > CREATE UNLOGGED TABLE ordered (id serial primary key, value text not null); > CREATE UNLOGGED TABLE unordered (i integer not null, value text not null); > > > > # script_ordered.sql > INSERT INTO ordered (value) VALUES ('abcdefghijklmnoprsqtuvwxyz'); > > > > # script_unordered.sql > \set i random(1, 100) > INSERT INTO unordered VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz'); > > > > # commands > pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c 150 -j 150 postgres > pgbench -T 60 -P 1 -M prepared -f script_unordered.sql -c 150 -j 150 postgres > > > > # results > ordered, master: 157473 TPS > ordered, patched: 231374 TPS > unordered, master: 232372 TPS > unordered, patched: 232535 TPS # my results ordered, master: 186045 TPS ordered, patched: 265842 TPS unordered, master: 313011 TPS unordered, patched: 309636 TPS I confirmed that this patch improves ordered insertion. In addition to your tests, I did the following tests with many duplicate entries # DDL CREATE UNLOGGED TABLE duplicated (i integer not null, value text not null); # script_duplicated.sql INSERT INTO duplicated VALUES (1, 'abcdefghijklmnoprsqtuvwxyz'); # commands pgbench -T 60 -P 1 -M prepared -f script_duplicated.sql -c 150 -j 150 postgres # results duplicated, master: 315450 TPS duplicated, patched: 311584 TPS It seems there is almost no performance degradation in the case of many duplicated entries. I'm planning to do a code review and send my comments in the next mail. Yoshikazu Imai
Re: no partition pruning when partitioning using array type
Thanks for taking a look. On 2018/07/07 9:19, Alvaro Herrera wrote: > On 2018-May-08, Amit Langote wrote: > >> I would like to revisit this as a bug fix for get_partition_operator() to >> be applied to both PG 10 and HEAD. In the former case, it fixes the bug >> that constraint exclusion code will fail to prune correctly when the partition >> key is of array, enum, range, or record type, due to the structural >> mismatch between the OpExpr that the partitioning code generates and the one that >> the parser generates for WHERE clauses involving partition key columns. > > Interesting patchset. Didn't read your previous v2, v3 versions; I only > checked your latest, v1 (???). Sorry, I think I messed up version numbering there. > I'm wondering about the choice of OIDs in the new test. I wonder if > it's possible to get ANYNONARRAY (or others) by way of having a > polymorphic function that passes its polymorphic argument in a qual. I > suppose it won't do anything in v10, or will it? Worth checking :-) > Why not > use IsPolymorphicType? Hmm, so the IsPolymorphicType() test covers all of these pseudo-types except RECORDOID. I rewrote the patch to use IsPolymorphicType. I'm not able to think of a case where the partition constraint expression would have to contain ANYNONARRAY though. > Also, I think it'd be good to have tests > for all these cases (even in v10), just to make sure we don't break it > going forward. So, I had proposed this patch last December, because partition pruning using constraint exclusion was broken for these types and still is in PG 10. I have added the tests back in the patch for PG 10 to test that partition pruning (using constraint exclusion) works for these cases. For PG 11 and HEAD, we took care of that in e5dcbb88a15 (Rework code to determine partition pruning procedure), so there does not appear to be any need to add tests for pruning there. > At least the array case is clearly broken today ... 
> A test for the RECORDOID case would be particularly welcome, since it's > somehow different from the other cases. (I didn't understand why > you removed the test in the latest version.) I have added the tests in the patch for PG 10. I have marked both patches as v5. One patch is for PG 10 and the other applies unchanged to both HEAD and PG 11 branches. Thanks, Amit From 1ec0c064adc34c12ae5615f0f2bca27a9c61c30f Mon Sep 17 00:00:00 2001 From: amit Date: Tue, 8 May 2018 14:31:37 +0900 Subject: [PATCH v4] Fix how get_partition_operator looks up the operator Instead of looking up using the underlying partition key's type as the operator's lefttype and righttype, use the partition operator class declared input type, which works reliably even in the cases where pseudo-types are involved. Also, make its decision on whether a RelabelType is needed depend on a different criterion than what's currently there. That is, only add a RelabelType if the partition key's type is different from the selected operator's input types. --- src/backend/catalog/partition.c| 47 ++ src/test/regress/expected/create_table.out | 2 +- src/test/regress/expected/inherit.out | 98 ++ src/test/regress/sql/inherit.sql | 43 + 4 files changed, 161 insertions(+), 29 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 17704f36b9..1f50b29a3f 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -1176,40 +1176,31 @@ get_partition_operator(PartitionKey key, int col, StrategyNumber strategy, Oid operoid; /* -* First check if there exists an operator of the given strategy, with -* this column's type as both its lefttype and righttype, in the -* partitioning operator family specified for the column. +* Get the operator in the partitioning operator family using the operator +* class declared input type as both its lefttype and righttype. 
*/ operoid = get_opfamily_member(key->partopfamily[col], - key->parttypid[col], - key->parttypid[col], + key->partopcintype[col], + key->partopcintype[col], strategy); + if (!OidIsValid(operoid)) + elog(ERROR, "missing operator %d(%u,%u) in partition opfamily %u", +strategy, key->partopcintype[col], key->partopcintype[col], +key->partopfamily[col]); /* -* If one doesn't exist, we must resort to using an operator in the same -* operator family but with the operator class declared input type. It is -* OK to do so,
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Fri, Jul 06, 2018 at 05:10:18PM -0400, Alvaro Herrera wrote: > Ugh, it's true :-( > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=25fff40798fc4ac11a241bfd9ab0c45c085e2212#patch8 > > Dave, Simon, any comments? The offending line: contrib/pg_stat_statements/pg_stat_statements--1.4--1.5.sql: GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO pg_read_all_stats; This will need a new version bump down to REL_10_STABLE... -- Michael signature.asc Description: PGP signature
Re: Typo in Japanese translation of psql.
On Fri, Jul 06, 2018 at 06:14:24PM +0900, Yugo Nagata wrote: > However, I think you have to submit the whole po file to the Patch Tracker[1] > instead of a patch, according to the wiki [2]. > > [1] https://redmine.postgresql.org/projects/pgtranslation > [2] https://wiki.postgresql.org/wiki/NLS My understanding is that this whole class of bugs is fixed in the translation git tree first, and then patched back into Postgres in a batch. Peter Eisentraut usually does that. -- Michael signature.asc Description: PGP signature
Re: no partition pruning when partitioning using array type
On 2018/07/07 0:13, Andrew Dunstan wrote: > > > On 05/08/2018 02:18 AM, Amit Langote wrote: >> On 2018/03/01 17:16, Amit Langote wrote: >>> Added this to CF (actually moved to the September one after first having >>> added it to the CF that is just getting started). >>> >>> It seems to me that we don't need to go with my originally proposed >>> approach to teach predtest.c to strip RelabelType nodes. Apparently, it's >>> only partition.c that's adding the RelabelType node around partition key >>> nodes in such cases. That is, in the case of having an array, record, >>> enum, or range type as the partition key. No other part of the system >>> ends up adding one as far as I can see. The parser's make_op(), for example, >>> which is used to generate an OpExpr for a qual involving the partition key, >>> won't put a RelabelType around the partition key node if the operator in >>> question has "pseudo"-types listed as the declared types of its left and right >>> arguments. >>> >>> So I revised the patch to drop all the predtest.c changes and instead >>> modify get_partition_operator() to avoid generating a RelabelType in such >>> cases. Please find it attached. > >> I would like to revisit this as a bug fix for get_partition_operator() to >> be applied to both PG 10 and HEAD. In the former case, it fixes the bug >> that constraint exclusion code will fail to prune correctly when the partition >> key is of array, enum, range, or record type, due to the structural >> mismatch between the OpExpr that the partitioning code generates and the one that >> the parser generates for WHERE clauses involving partition key columns. >> >> In HEAD, since we already fixed that case in e5dcbb88a15d [1], which is a >> different piece of code anyway, the patch only serves to improve the >> deparse output emitted by ruleutils.c for partition constraint expressions >> where a pseudo-type partition key is involved. The change can be seen in >> the updated test output for the create_table test. 
>> >> Attached are patches for PG 10 and HEAD, which implement a slightly >> different approach to fixing this. Now, instead of passing the partition >> key's type as lefttype and righttype to look up the operator, the operator >> class declared type is passed. Also, we now decide whether to create a >> RelabelType node on top based on whether the partition key's type is >> different from the selected operator's input type, with an exception for >> pseudo-types. >> >> Thanks, >> Amit >> >> [1] >> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e5dcbb88a15d > > Since this email we have branched off REL_11_STABLE. Do we want to > consider this as a bug fix for Release 11? If so, should we add it to the > open items list? The code being fixed by the latest patch is not new in PG 11; rather, it's PG 10 code. That code affects how pruning works in PG 10 (it used to affect PG 11 until we rewrote the partition pruning code). I think it's a good idea to apply this patch to all branches, as the code is not specific to partition pruning and has its benefits even for HEAD and PG 11. For PG 11 and HEAD, the benefit of this patch can be thought of as just cosmetic, because it only affects ruleutils.c's deparse output for partition constraint expressions when showing them in the psql output, for example. In PG 10, it directly affects the planner behavior, whereby having a redundant RelabelType in a partition constraint expression limits the planner's ability to do partition pruning with it. Thanks, Amit
Re: Let's remove DSM_IMPL_NONE.
On Fri, Jul 06, 2018 at 11:08:02PM +0200, Peter Eisentraut wrote: > On 26.06.18 09:10, Kyotaro HORIGUCHI wrote: >> --- a/src/include/storage/dsm_impl.h >> +++ b/src/include/storage/dsm_impl.h >> @@ -14,11 +14,10 @@ >> #define DSM_IMPL_H >> >> /* Dynamic shared memory implementations. */ >> -#define DSM_IMPL_NONE 0 >> -#define DSM_IMPL_POSIX 1 >> -#define DSM_IMPL_SYSV 2 >> -#define DSM_IMPL_WINDOWS3 >> -#define DSM_IMPL_MMAP 4 >> +#define DSM_IMPL_POSIX 0 >> +#define DSM_IMPL_SYSV 1 >> +#define DSM_IMPL_WINDOWS2 >> +#define DSM_IMPL_MMAP 3 > > I would avoid renumbering here. It was kind of sensible to have NONE = > 0, so I'd keep the non-NONE ones as non-zero. +1. -- Michael signature.asc Description: PGP signature
[GSoC] working status
Hi mentors and hackers, The second review is coming. Here is my working status so far:

1. Completed the Thrift compact protocol implementation using the bytea interface.
2. The Thrift type (binary protocol) is almost done; the only remaining part is struct encoding and decoding. With the Thrift type, you can express your Thrift struct using JSON, but store it using Thrift bytes.
3. Set up Travis CI.
4. Better documentation.

Here is the repo with all recent changes (https://github.com/charles-cui/pg_thrift). Let me know if you have any questions.
Re: Copy function for logical replication slots
On Mon, Jul 09, 2018 at 10:06:00AM +0900, Masahiko Sawada wrote: > I think that this patch might be split, but I will be able to send > an updated patch next week. As you suggested, this patch needs > more careful thought. I'll move this patch to the next commit fest if > I am not able to send it by then. Is that okay? Fine by me. Thanks for the update. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] Crash on promotion when recovery.conf is renamed
On Fri, Jul 06, 2018 at 09:09:27PM +0900, Michael Paquier wrote: > Sure. I think I can finish the set on Monday JST then. So, I have been able to back-patch things down to 9.5, but further down I am not really convinced that it is worth meddling with this, particularly in light of 7cbee7c, which has reworked the way partial segments on older timelines are handled at the end of promotion. The portability issue I have found is related to the timeline number exitArchiveRecovery() uses, which comes from the WAL reader, hence it gets set to the timeline of the last page read by the startup process. This can actually cause quite a bit of trouble at the end of recovery, as we would get a failure when trying to copy the last segment from the old timeline to the new timeline. Well, it would be possible to fix things properly by roughly back-porting 7cbee7c down to REL9_4_STABLE, but that's not really worth the risk, and moving exitArchiveRecovery() and PrescanPreparedTransactions() around is a straightforward move for 9.5~ thanks to this commit. I have also done nothing yet for the detection of corrupted 2PC files which get ignored at recovery. While looking again at the patch I sent upthread, I noticed that it was actually missing some more error handling in ReadTwoPhaseFile(). In particular, with the proposed patch we would still not report an error if a 2PC file cannot be opened because of, for example, EPERM. In FinishPreparedTransaction(), one would for example simply get a *crash* with no hint about what happened. I have a patch for that which still needs some polishing, and I will start a new thread on the matter. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] Secondary index access optimizations
On 22 March 2018 at 22:38, Konstantin Knizhnik wrote: > Attached please find rebased version of the patch. Hi, I started looking over this patch and have a few comments: I don't think this range type idea is a great one. I don't think it's going to ever perform very well. I also see you're not checking the collation of the type anywhere. As of now, no range types have collation support, but you can't really go and code this with that assumption. I also don't really like the sequence scan over pg_range. Probably a better way to do this would be to add a new bt proc, like what was done in 0a459cec for 2 new functions. Something like BTISNEXTVAL_PROC and BTISPREVVAL_PROC. You'd then need to define functions for all the types based on integers, making functions which take 2 parameters of the type, and an additional collation param. The functions should return bool. int4_isnextval(2, 3, InvalidOid) would return true. You'd need to return false on wraparound. I also think that the patch is worth doing without the additional predicate_implied_by() smarts. In fact, I think strongly that the two should be considered as two independent patches. Partial indexes suffer from the same issue you're trying to address here and that would be resolved by any patch which makes changes to predicate_implied_by(). Probably the best place to put the code to skip the redundant quals is inside set_append_rel_size(). There's already code there that skips quals that are seen as const TRUE. This applies for UNION ALL targets with expressions that can be folded to constants once the qual is passed through adjust_appendrel_attrs() For example: explain select * from (select 1 as a from pg_class union all select 2 from pg_class) t where a = 1; I've attached a patch to show what I had in mind. 
I had to change how partition_qual is populated, which I was surprised to see only gets populated for sub-partitioned tables (the top-level parent won't have a qual since it's not a partition). I didn't update the partition pruning code that assumes this is only populated for sub-partitioned tables. That will need to be done. The patch comes complete with all the failing regression tests where the redundant quals have been removed by the new code. If you want to make this work for CHECK constraints too, then I think the code for that can be added to the same location as the code I added in the attached patch. You'd just need to fetch the check constraint quals and add some extra code to check if the qual is redundant. Some new performance benchmarks would then be useful in order to find out how much overhead there is. We might learn that we don't want to enable it by default if it's too expensive. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services skip_redundant_partition_quals_poc.patch Description: Binary data
Re: Failure assertion in GROUPS mode of window function in current HEAD
On Wed, Jul 4, 2018 at 11:24 PM, Masahiko Sawada wrote: > Hi, > > I got an assertion failure when I use GROUPS mode and specifies offset > without ORDER BY clause. The reproduction steps and the backtrace I > got are following. > > =# create table test as select 1 as c; > =# select sum(c) over (partition by c groups between 1 preceding and > current row) from test; > TRAP: FailedAssertion("!(ptr >= 0 && ptr < state->readptrcount)", > File: "tuplestore.c", Line: 478) > > #0 0x003b3ce32495 in raise () from /lib64/libc.so.6 > #1 0x003b3ce33c75 in abort () from /lib64/libc.so.6 > #2 0x00a2ef5a in ExceptionalCondition (conditionName=0xc99f38 > "!(ptr >= 0 && ptr < state->readptrcount)", errorType=0xc99f22 > "FailedAssertion", > fileName=0xc99ec0 "tuplestore.c", lineNumber=478) at assert.c:54 > #3 0x00a7fa7d in tuplestore_select_read_pointer > (state=0x139e4a0, ptr=-1) at tuplestore.c:478 > #4 0x007216cd in update_frameheadpos (winstate=0x137fc30) at > nodeWindowAgg.c:1655 > #5 0x0071fb7f in eval_windowaggregates (winstate=0x137fc30) > at nodeWindowAgg.c:735 > #6 0x00722ac8 in ExecWindowAgg (pstate=0x137fc30) at > nodeWindowAgg.c:2196 > #7 0x006e5776 in ExecProcNodeFirst (node=0x137fc30) at > execProcnode.c:445 > #8 0x006da945 in ExecProcNode (node=0x137fc30) at > ../../../src/include/executor/executor.h:237 > #9 0x006dd2fc in ExecutePlan (estate=0x137fa18, > planstate=0x137fc30, use_parallel_mode=false, operation=CMD_SELECT, > sendTuples=true, > numberTuples=0, direction=ForwardScanDirection, dest=0x138b098, > execute_once=true) at execMain.c:1726 > #10 0x006daf2c in standard_ExecutorRun (queryDesc=0x13785b8, > direction=ForwardScanDirection, count=0, execute_once=true) at > execMain.c:363 > #11 0x006dad48 in ExecutorRun (queryDesc=0x13785b8, > direction=ForwardScanDirection, count=0, execute_once=true) at > execMain.c:306 > #12 0x008c7626 in PortalRunSelect (portal=0x12f3bd8, > forward=true, count=0, dest=0x138b098) at pquery.c:932 > #13 0x008c72b4 in PortalRun 
(portal=0x12f3bd8, > count=9223372036854775807, isTopLevel=true, run_once=true, > dest=0x138b098, altdest=0x138b098, > completionTag=0x7fffaece38b0 "") at pquery.c:773 > #14 0x008c1288 in exec_simple_query ( > query_string=0x128e938 "select sum(c) over (partition by c groups > between 1 preceding and current row) from test;") at postgres.c:1122 > #15 0x008c555e in PostgresMain (argc=1, argv=0x12b8258, > dbname=0x12b80d8 "postgres", username=0x12b80b0 "masahiko") at > postgres.c:4153 > #16 0x00822c3c in BackendRun (port=0x12b0020) at postmaster.c:4361 > #17 0x008223aa in BackendStartup (port=0x12b0020) at postmaster.c:4033 > #18 0x0081e84b in ServerLoop () at postmaster.c:1706 > #19 0x0081e17d in PostmasterMain (argc=5, argv=0x1289330) at > postmaster.c:1379 > #20 0x007452d0 in main (argc=5, argv=0x1289330) at main.c:228 > > The cause of this assertion failure is that we attempted to select a > read pointer (framehead_ptr) that is not allocated. We allocate read > pointers for both frame head and tail when beginning a new partition in > begin_partition(). The current code doesn't allocate them, as follows, > if the ORDER BY clause is omitted, but this behavior doesn't match > update_frameheadpos() and update_frametailpos(), which always use each > read pointer in GROUPS offset mode. > > if ((winstate->frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS)) && >node->ordNumCols != 0) > { > if (!(winstate->frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)) > winstate->framehead_ptr = > tuplestore_alloc_read_pointer(winstate->buffer, 0); > if (!(winstate->frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)) > winstate->frametail_ptr = > tuplestore_alloc_read_pointer(winstate->buffer, 0); > } > > Since this issue relates only to GROUPS mode, it happens in PostgreSQL > 11 or later. Attached patch fixes this issue and adds regression tests > for testing GROUPS mode without an ORDER BY clause. Since this problem still happens with current HEAD I've added this item to Open Items. 
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: WAL prefetch
> Without prefetching, it's ~70GB of WAL. With prefetching, it's only about > 30GB. Considering the 1-hour test generates about 90GB of WAL, this means the > replay speed grew from 20GB/h to almost 60GB/h. That's rather measurable > improvement ;-) Thank you everyone for this reasonably in-depth thread on prefaulting. Because this was a sprawling thread and I haven't been keeping up with this discussion until now, let me snag a bunch of points and address them here in one shot. I've attempted to answer a bunch of questions that appear to have come up during this thread, as well as provide some clarity where there were unanswered questions. Apologies in advance for the length. There are a few points that I want to highlight regarding prefaulting, and I also want to call out when prefaulting is and isn't useful. But first, let me introduce three terms that will help characterize this problem:

1. Hot read-modify-write - a PG page that is modified while the page is still contained within shared_buffers.

2. Warm read-modify-write ("RMW") - a PG page that's in the filesystem cache but not present in shared_buffers.

3. Cold RMW - a PG page that is not in either PG's shared_buffers or the OS's filesystem cache.

Prefaulting is only useful in addressing the third situation, the cold read-modify-write. For fast disks, or systems that have their entire dataset held in RAM, or whose disk systems can perform an RMW fast enough for the velocity of incoming writes, there is no benefit to prefaulting (this is why there is a high- and low-watermark in pg_prefaulter). In these situations prefaulting would potentially be extra constant overhead, especially for DBs whose workload is ~100% Hot/Warm RMW. Primaries are almost always under the Hot RMW workload (cold restarts being the exception). 
The warm RMW scenario could be solved by prefaulting into shared_buffers, but I doubt there would be a significant performance benefit because the expense of PostgreSQL faulting from shared_buffers to the OS cache is relatively small compared to a disk read. I do think there is something to be gained in the Warm RMW case, but compared to Cold RMW, this optimization is noise and best left for a future iteration. The real importance of prefaulting becomes apparent in the following two situations:

1. Priming the OS's filesystem cache, notably after an OS restart. This is of value to all PostgreSQL scenarios, regardless of whether it's a primary or follower. Reducing database startup/recovery times is very helpful, especially when recovering from an outage or after having performed planned maintenance. Little in PostgreSQL administration is more infuriating than watching PostgreSQL recover and seeing the CPU 100% idle and the disk IO system nearly completely idle (especially during an outage or when recovering from an outage).

2. When the following two environmental factors are true: a. the volume of writes to discrete pages is high b. the interval between subsequent writes to a single page is long enough that a page is evicted from both shared_buffers and the filesystem cache

Write-heavy workloads tend to see this problem, especially if you're attempting to provide consistency in your application and do not read from the followers (thereby priming their OS/shared_buffer cache). If the workload is continuous, the follower may never be able to overcome the write volume and the database never catches up. The pg_prefaulter was borne out of the last workload, namely a write-heavy, 24/7 constant load with a large dataset. What pg_prefaulter does is read in the blocks referenced from the WAL stream (i.e. PG heap pages) and then load the referenced pages into the OS filesystem cache (via threaded calls to pread(2)). 
The WAL apply process gets a cache hit because the filesystem cache has been primed with the heap page before the apply process attempts to perform its read-modify-write of the heap. It is important to highlight that this is a problem because there is only one synchronous pread(2) call in flight at a time from the apply/recover/startup process, which effectively acts as the speed limit for PostgreSQL. The physics of many workloads are such that followers are unable to keep up and are thus destined to always fall behind (we've all seen this at some point, likely via apply lag from a VACUUM or pg_repack). The primary can schedule concurrent IO from multiple clients all making independent SELECTs. Contrast that with a replica, which has zero knowledge of the IOs the primary recently dispatched; to it, all IO looks like random reads and likely cache misses. In effect, the pg_prefaulter raises the speed limit of the WAL apply/recovery process by priming the filesystem cache by snooping in on the WAL stream. PostgreSQL's WAL apply and recovery process is only capable of scheduling a single synchronous pread(2) syscall. As a result, even if you have a RAID10
Re: [HACKERS] Another oddity in handling of WCO constraints in postgres_fdw
On Tue, 2018-03-06 at 20:09 +0900, Etsuro Fujita wrote: > Agreed. I added a comment to that function. I think that that > comment > in combination with changes to the FDW docs in the patch would help > FDW > authors understand why that is needed. Please find attached an > updated > version of the patch. > > Thanks for the comments! Committed. I made some small modifications and added a test for the case where the foreign table is a partition of a local table, which follows a different code path after commit 3d956d95. Thank you! Regards, Jeff Davis
"Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
On Wed, Jul 4, 2018 at 9:43 AM, Peter Geoghegan wrote: > I'm starting this new thread to discuss the benefits of B-Tree suffix > truncation, and to demonstrate how effective it can be at increasing > fan-out and space utilization with a real-world example. I haven't > explained why suffix truncation matters very well before now. Now that > I have a patch that works, I can just show the benefit directly. (By > the way, there are similar benefits to covering INCLUDE indexes, where > suffix truncation occurs all the time, and not just > opportunistically.) > Explanation > --- > > Pivot tuples describe how the key space will be split up, which will > *hopefully* be balanced for future insertions. So, apart from the > obvious issue about the size of pivot tuples, there is a more > important issue: why unnecessarily commit to certain exact boundaries > early on, before it's clear how well that will balance values that get > inserted later? Actually, this particular example does not really show why general suffix truncation is important. My example did manage to make an index 15% smaller in a practical, workable way, but traditional suffix truncation deserves far less credit for that than I initially claimed. My explanation was about 99% wrong, but my example is still valid. The true explanation for why my patch (the pending v3 of my unique-key-heap-TID patch) avoided so much bloat is very interesting, because it is related to a broader problem. I'll show a simple example where the bloat that my patch can avoid is far greater still, involving a simple single-attribute secondary index. Before I get to that example, I'll give the real explanation. The real reason for the marked decrease in the level of bloating is that my patch makes insertions into secondary indexes (non-unique indexes) jump immediately to the leaf page that the tuple unambiguously belongs on -- since it now has a unique key, there must be one exact page that the value is supposed to go on. 
We avoid trying to find a place among pages full of logical duplicates, and so we avoid the risk of "getting tired" [1] and giving up on finding free space that is actually available to us. "Getting tired" tends to produce really inefficient space utilization among leaf pages full of duplicates, at least past a certain tipping point. The whole "getting tired" thing is the root of the problem here, which is why the pending v3 of my patch will remove that code completely (_bt_findinsertloc() is streamlined). The "tipping point" nature of getting tired is of particular concern to me, since it sounds like something that could have been a factor in Uber's much-publicized blog post. :-( I attach the test case I mentioned, which I show the output from below, with my commentary in parentheses (note that there are "\echo" psql statements whose output you'll also see):

$ psql -f testcase.sql
autovacuum should probably be disabled: autovacuum off (1 row)
psql:testcase.sql:3: NOTICE: table "sec_idx_bloat" does not exist, skipping
DROP TABLE
CREATE TABLE
(Created empty table named "sec_idx_bloat", with a single int4 attribute.)
CREATE INDEX
(Created empty index named "bloated" on that attribute.)
Initial insert of 1e7 sequential values:
INSERT 0 1000
"bloated" size on master: 214 MB
"bloated" size with patch: 214 MB
Initial insert of the value 42 1e7 times:
INSERT 0 1000
"bloated" size on master: 486 MB
"bloated" size with patch: 430 MB
Repeated insertion of the value 42 1e7 times:
INSERT 0 1000
"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB
Delete + VACUUM away most of the dups:
DELETE 18001072
VACUUM
"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB
(That is, VACUUM hasn't made either case have a smaller index yet, though it probably gave back many more whole index pages to the FSM in the case of the patch.) 
Third insertion of the value 42 1e7 times:
INSERT 0 1000
(This is where it gets really interesting, because things truly fall apart for master, whereas the patch case manages to reuse existing free space in all cases.)
"bloated" size on master: 1029 MB
"bloated" size with patch: 645 MB
Fourth insertion of the value 42 1e7 times:
INSERT 0 1000
"bloated" size on master: 1301 MB
"bloated" size with patch: 688 MB
(Patch can still reuse almost all the space left behind by VACUUM, though since it's a bit bigger than the original high watermark size of 645 MB it's not perfectly efficient. Master flounders even further behind, though.)
To summarize: recognizing that bloat in indexes is correlated with bloat in tables allows us to recycle space in both structures cooperatively, which can be very important. While this example focusses on bloat, there are a lot of other things to be unhappy about around buffer lock contention, and around the number of pages that will be dirtied. Random choices about which page to dirty lead to increased random I/O when
Re: cost_sort() improvements
Hi, I'll do more experiments/tests next week, but let me share at least some initial thoughts ... On 06/28/2018 06:47 PM, Teodor Sigaev wrote: > Hi! > > Current estimation of sort cost has the following issues: > - it doesn't differentiate single- and multi-column sorts > - it doesn't pay attention to comparison function cost and column width > - it doesn't try to count number of calls of comparison function on per > column > basis > > At least [1] and [2] run into those issues and have > objections/questions about correctness of cost sort estimation. > Suggested patch tries to improve current estimation and solve those issues. > For those not following the two patches ("GROUP BY optimization" [1] and "Incremental sort" [2]): This wasn't a major issue so far, because we are not reordering the grouping keys to make the sort cheaper (which is what [1] does), nor do we ignore some of the sort keys (which is what [2] does, pretty much). Both patches need to estimate number of comparisons on different columns and/or size of groups (i.e. ndistinct). > Basic ideas: > - let me introduce some designations > i - index of column, starting with 1 > N - number of tuples to sort > Gi - number of groups, calculated over the first i columns. Obviously, > G0 == 1. Number of groups is calculated by estimate_num_groups(). > NGi - number of tuples in one group. NG0 == N and NGi = N/G(i-1) > Fi - cost of comparison function call of i-th column OK, so Fi is pretty much whatever CREATE FUNCTION ... COST says, right? > Wi - average width in bytes of i-th column. > Ci - total cost of one comparison call of column calculated as > Fi * fine_function(Wi) where fine_function(Wi) looks like: > if (Wi <= sizeof(Datum)) > return 1.0; //any type passed in Datum > else > return 1 + log16(Wi/sizeof(Datum)); > log16 just an empirical choice > - So, cost for first column is 2.0 * C1 * log2(N) > First column is always compared, multiplier 2.0 is defined per > regression > test. 
Seems, it estimates a cost for transferring tuple to CPU cache, > starting cost for unpacking tuple, cost of call qsort compare wrapper > function, etc. Removing this multiplier causes too optimistic > prediction of > cost. Hmm, makes sense. But doesn't that mean it's mostly a fixed per-tuple cost, not directly related to the comparison? For example, why should it be multiplied by C0? That is, if I create a very expensive comparator (say, with cost 100), why should it increase the cost for transferring the tuple to CPU cache, unpacking it, etc.? I'd say those costs are rather independent of the function cost, and remain rather fixed, no matter what the function cost is. Perhaps you haven't noticed that, because the default funcCost is 1? > - cost for i-th columns: > Ci * log2(NGi) => Ci * log2(N/G(i-1)) OK. So essentially for each column we take log2(tuples_per_group), as total number of comparisons in quick-sort is O(N * log2(N)). > So, here I believe, that i-th comparison function will be called only > inside > group which number is defined by (i-1) leading columns. Note, following > discussion [1] I add multiplier 1.5 here to be closer to worst case when > groups are significantly differ. Right now there is no way to > determine how > differ they could be. Note, this formula describes cost of first > column too > except difference with multiplier. The number of new magic constants introduced by this patch is somewhat annoying. 2.0, 1.5, 0.125, ... :-( > - Final cost is cpu_operator_cost * N * sum(per column costs described > above). > Note, for single column with width <= sizeof(datum) and F1 = 1 this > formula > gives exactly the same result as current one. > - for Top-N sort empiric is close to old one: use 2.0 multiplier as > constant > under log2, and use log2(Min(NGi, output_tuples)) for second and > following > columns. 
> I think compute_cpu_sort_cost is somewhat confused about whether per_tuple_cost is directly a cost, or a coefficient that will be multiplied with cpu_operator_cost to get the actual cost. At the beginning it does this:

per_tuple_cost = comparison_cost;

so it inherits the value passed to cost_sort(), which is supposed to be a cost. But then it does the work, which includes things like this:

per_tuple_cost += 2.0 * funcCost * LOG2(tuples);

where funcCost is pretty much pg_proc.procost. AFAIK that's meant to be a value in units of cpu_operator_cost. And at the end it does this:

per_tuple_cost *= cpu_operator_cost;

I.e. it gets multiplied with another cost. That doesn't seem right. For most cost_sort calls this may not matter as the comparison_cost is set to 0.0, but plan_cluster_use_sort() sets this explicitly to:

comparisonCost = 2.0 * (indexExprCost.startup + indexExprCost.per_tuple);

which is likely to cause issues. Also, why do we need this?
Re: cost_sort() improvements
On 06/29/2018 04:36 PM, Teodor Sigaev wrote: > > > Peter Geoghegan wrote: >> On Thu, Jun 28, 2018 at 9:47 AM, Teodor Sigaev wrote: >>> Current estimation of sort cost has following issues: >>> - it doesn't differ one and many columns sort >>> - it doesn't pay attention to comparison function cost and column >>> width >>> - it doesn't try to count number of calls of comparison function on >>> per >>> column >>> basis >> >> I've been suspicious of the arbitrary way in which I/O for >> external sorts is costed by cost_sort() for a long time. I'm not >> 100% sure about how we should think about this question, but I am >> sure that it needs to be improved in *some* way. >> > Agree, but I think it should be separated patch to attack this > issue. And I don't have an idea how to do it, at least right now. > Nevertheless, I hope, issue of estimation of disk-based sort isn't a > blocker of CPU cost estimation improvements. > Yes, I agree this should be addressed separately. Peter is definitely right that should look at costing internal vs. external sorts (after all, it's generally foolish to argue with *him* about sorting stuff). But the costing changes discussed here are due to my nagging from the GROUP BY patch (and also the "incremental sort" patch). The internal vs. external costing seems like an independent issue. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: hot_standby_feedback vs excludeVacuum and snapshots
On Fri, Jul 06, 2018 at 04:32:56PM +0100, Simon Riggs wrote: > On 6 July 2018 at 03:30, Thomas Munro wrote: > > On Thu, Jul 5, 2018 at 8:27 AM, Noah Misch wrote: > >>> However, 49bff5300d527 also introduced a similar bug where > >>> subtransaction > >>> commit would fail to release an AccessExclusiveLock, leaving the lock > >>> to > >>> be removed sometimes early and sometimes late. This commit fixes > >>> that bug also. Backpatch to PG10 needed. > >> > >> Subtransaction commit is too early to release an arbitrary > >> AccessExclusiveLock. The primary releases every AccessExclusiveLock at > >> top-level transaction commit, top-level transaction abort, or > >> subtransaction > >> abort. CommitSubTransaction() doesn't do that; it transfers locks to the > >> parent sub(xact). Standby nodes can't safely remove an arbitrary lock > >> earlier > >> than the primary would. > > > > But we don't release locks acquired by committing subxacts until the > > top level xact commits. Perhaps that's what the git commit message > > meant by "early", as opposed to "late" meaning when > > StandbyReleaseOldLocks() next runs because an XLOG_RUNNING_XACTS > > record is replayed? > > Locks held by subtransactions were not released at the correct timing > of top-level commit; they are now. I read the above-quoted commit message as saying that the commit expands the set of locks released when replaying subtransaction commit. The only lock that should ever be released at subxact commit is the subxact XID lock. If that continues to be the only lock released at subxact commit, good.
Re: XLogSegNoOffsetToRecPtr fixup
Hi, On 2018-07-08 14:23:45 -0400, Alvaro Herrera wrote: > Pursuant to closing comment in > https://postgr.es/m/20180306214239.ospkf6ie7aa5gm4j@alvherre.pgsql > here's a quick patch to change the order of arguments in > XLogSegNoOffsetToRecPtr. Commit fc49e24fa69a ("Make WAL segment size > configurable at initdb time.") put the walsegsz as last argument, > *after* its output argument, which is downright weird. > > I propose to apply this to pg11 and master, to avoid an unnecessary API > change in pg12. WFM. Thanks, Andres
XLogSegNoOffsetToRecPtr fixup
Pursuant to closing comment in https://postgr.es/m/20180306214239.ospkf6ie7aa5gm4j@alvherre.pgsql here's a quick patch to change the order of arguments in XLogSegNoOffsetToRecPtr. Commit fc49e24fa69a ("Make WAL segment size configurable at initdb time.") put the walsegsz as last argument, *after* its output argument, which is downright weird. I propose to apply this to pg11 and master, to avoid an unnecessary API change in pg12. -- Álvaro Herrera http://www.flickr.com/photos/alvherre/ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d6b5b05425..f251989f0b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -1971,7 +1971,7 @@ XLogBytePosToRecPtr(uint64 bytepos) seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD; } - XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, result, wal_segment_size); + XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, wal_segment_size, result); return result; } @@ -2017,7 +2017,7 @@ XLogBytePosToEndRecPtr(uint64 bytepos) seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD; } - XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, result, wal_segment_size); + XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, wal_segment_size, result); return result; } diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index dd96cef8f0..4c633c6c49 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -747,7 +747,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLByteToSeg(recptr, segno, state->wal_segment_size); offset = XLogSegmentOffset(recptr, state->wal_segment_size); - XLogSegNoOffsetToRecPtr(segno, offset, recaddr, state->wal_segment_size); + XLogSegNoOffsetToRecPtr(segno, offset, state->wal_segment_size, recaddr); if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { diff --git a/src/backend/replication/logical/reorderbuffer.c 
b/src/backend/replication/logical/reorderbuffer.c index d3ec137051..9b55b94227 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -2782,7 +2782,7 @@ ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid { XLogRecPtr recptr; - XLogSegNoOffsetToRecPtr(segno, 0, recptr, wal_segment_size); + XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, recptr); snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.snap", NameStr(MyReplicationSlot->data.name), diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c index 071b32d19d..ed9d7f6378 100644 --- a/src/bin/pg_basebackup/pg_receivewal.c +++ b/src/bin/pg_basebackup/pg_receivewal.c @@ -352,7 +352,7 @@ FindStreamingStart(uint32 *tli) if (!high_ispartial) high_segno++; - XLogSegNoOffsetToRecPtr(high_segno, 0, high_ptr, WalSegSz); + XLogSegNoOffsetToRecPtr(high_segno, 0, WalSegSz, high_ptr); *tli = high_tli; return high_ptr; diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index 8cff535692..d63a3a27f6 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -933,8 +933,8 @@ RewriteControlFile(void) * Adjust fields as needed to force an empty XLOG starting at * newXlogSegNo. 
*/ - XLogSegNoOffsetToRecPtr(newXlogSegNo, SizeOfXLogLongPHD, - ControlFile.checkPointCopy.redo, WalSegSz); + XLogSegNoOffsetToRecPtr(newXlogSegNo, SizeOfXLogLongPHD, WalSegSz, + ControlFile.checkPointCopy.redo); ControlFile.checkPointCopy.time = (pg_time_t) time(NULL); ControlFile.state = DB_SHUTDOWNED; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index b4c1f827a6..1689279767 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -248,7 +248,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, XLogSegNo targetSegNo; XLByteToSeg(targetPagePtr, targetSegNo, WalSegSz); - XLogSegNoOffsetToRecPtr(targetSegNo + 1, 0, targetSegEnd, WalSegSz); + XLogSegNoOffsetToRecPtr(targetSegNo + 1, 0, WalSegSz, targetSegEnd); targetPageOff = XLogSegmentOffset(targetPagePtr, WalSegSz); /* diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 5c4f38e597..d41b831b18 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1039,7 +1039,7 @@ main(int argc, char **argv)
Re: How can we submit code patches that implement our (pending) patents?
Hi, On 2018-07-08 11:46:51 -0400, Alvaro Herrera wrote: > On 2018-Jul-07, David Fetter wrote: > > > If they have no plans to exercise any proprietary rights, our usual > > process where people submit things and agree to have us label them > > with the PGDG copyright and publish them under TPL would be the > > simplest way to accomplish it. > > Eh, but if the submitting company has patents, would it not be dishonest > to publish as PGDG copyright & license with no attached patent grant? > Some other company deriving a proprietary fork from Postgres could later > be sued by the submitting company, because there is no legal standing > for them to use the patented code. Yep. And even if the original submitter has good intent, it's not unlikely for companies to go bankrupt and their assets to be sold off. > TBH I don't understand how can we dual-license the code in a manner that > protects those proprietary forks. Can you (Andres) explain what is the > idea? https://www.apache.org/licenses/LICENSE-2.0 grants roughly the same rights as our license. But 3) additionally grants a patent license. That license is *not* restricted to the code in unmodified form. By requiring future contributions to be both under the pg license and Apache 2, downstream users, including proprietary forks, can safely use code contributed in the future, without risk of patent mines. There are arguments made that the TPL (and BSD, MIT etc.) already includes an implicit patent grant, but while that is a longstanding theory, it has to my knowledge not been tested legally. IANAL etc. Greetings, Andres Freund
Re: How can we submit code patches that implement our (pending) patents?
On 2018-Jul-07, David Fetter wrote: > If they have no plans to exercise any proprietary rights, our usual > process where people submit things and agree to have us label them > with the PGDG copyright and publish them under TPL would be the > simplest way to accomplish it. Eh, but if the submitting company has patents, would it not be dishonest to publish as PGDG copyright & license with no attached patent grant? Some other company deriving a proprietary fork from Postgres could later be sued by the submitting company, because there is no legal standing for them to use the patented code. TBH I don't understand how we can dual-license the code in a manner that protects those proprietary forks. Can you (Andres) explain what the idea is? -- Álvaro Herrera    https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pglife and devel branch content
Previously, pglife (http://pglife.momjian.us/) showed all commits for the "devel" branch back to the start of the git tree. It now shows only commits since the last major branch. For example, "12 devel" now shows only commits since we branched the git tree for PG 11. This should make it easier to see what we have done for PG 12. Unfortunately, it doesn't remove backbranch commits like git_changelog does, because it uses a preexisting gitweb page. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Invisible Indexes
On Mon, Jun 18, 2018 at 5:57 PM, Tom Lane wrote: > > I'm not sure about the "enforce constraint only" argument --- that > sounds like a made-up use-case to me. It's pretty hard to imagine > a case where a unique index applies to a query and yet you don't want > to use it. > I've not seen it with unique constraints, but have with EXCLUDE constraints. GiST index costing is not very robust, and the planner can easily decide that a read query should use the EXCLUDE-supporting GiST index in cases where it is not optimal. Cheers, Jeff
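To make the scenario Jeff describes concrete, here is a minimal sketch of an EXCLUDE constraint whose supporting GiST index the planner may also choose for read queries (the table and column names are invented for the example; btree_gist is needed for the integer equality operator):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- The EXCLUDE constraint forbids overlapping bookings of the same room.
-- It is enforced through an implicitly created GiST index, which the
-- planner can then pick for ordinary SELECTs -- sometimes suboptimally,
-- since GiST costing is less robust than btree costing.
CREATE TABLE room_booking (
    room   int,
    during tsrange,
    EXCLUDE USING gist (room WITH =, during WITH &&)
);
```

An "invisible index" feature would let the constraint keep being enforced while hiding the index from the planner's read-query plans.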
Re: [PATCH] Use access() to check file existence in GetNewRelFileNode().
On Fri, Jul 06, 2018 at 10:52:14PM +0200, Peter Eisentraut wrote: > This patch looks sensible to me. We also use access() elsewhere in the > backend to test for file existence. Yes, the patch made sense when I looked at it, and it still does, so committed. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Hello Alvaro, For context: in the backend, elog() is only used for internal messages (i.e. "can't-happen" conditions), and ereport() is used for user-facing messages. There are many things ereport() has that elog() doesn't, such as additional message fields (HINT, DETAIL, etc) that I think could have some use in pgbench as well. If you use elog() then you can't have that. [...] Ok. Then forget elog, but I'm pretty much against having a kind of ereport, which looks greatly overkill to me, because: (1) the syntax is pretty heavy, and does not look like a function. (2) the implementation allocates a string buffer for the message; this is greatly overkill for pgbench, which only needs to print to stderr once. This makes sense server-side because the generated message may be output several times (eg stderr, file logging, to the client), and the implementation has to work with cpp implementations which do not handle varargs (and maybe other reasons). So I would be in favor of having just a simpler error function. Incidentally, one already exists, "pgbench_error", and could be improved, extended, or replaced. There is also "syntax_error". One thing that just came to mind is that pgbench uses some src/fe_utils stuff. I hope having ereport() doesn't cause a conflict with that ... Currently ereport does not exist client-side. I do not think that this patch is the right moment to decide to do that. Also, there are some "elog" calls in libpq, but they are compiled out with a "#ifndef FRONTEND". BTW I think abort() is not the right thing, as it'll cause core dumps if enabled. Why not just exit(1)? Yes, I agree and already reported that. Conclusion: My current opinion is that I'm pretty much against bringing "ereport" to the front-end in this specific pgbench patch. I agree with you that "elog" would be misleading there as well, for the arguments you developed above. I'd suggest having just one clean and simple pgbench-internal function to handle errors and possibly exit, debug... 
Something like

    void pgb_error(FATAL, "error %d raised", 12);

Implemented as

    void
    pgb_error(int/enum XXX level, const char *format, ...)
    {
        test level and maybe return immediately (eg debug);
        print to stderr;
        exit/abort/return depending;
    }

Then if some advanced error handling is introduced for front-end programs, possibly through some macros, then it would be time to improve upon that. -- Fabien.
Re: Desirability of client-side expressions in psql?
Hello Stephen, [...] So, I tend to agree w/ Andrew that while this is a good topic to have on -hackers, it shouldn't be a CF entry. I wouldn't draw any conclusions from Andrew closing it out as "not appropriate for CF". Sure. As I had no committer feedback on the discussion for 3 months, I tried this as an ineffective way to get some. It did not work up to now. As I have not received feedback from committers about the desirability of the feature, I interpret that as "the feature is not desirable", and I will not develop it, because of the probability that this would be time down the drain. Personally, I'm definitely in favor of having a lot more flexibility and capability in psql, that's an area which I think we don't focus on nearly enough. Having to fight with bash or another language to make calls to psql to get things done is downright annoying. So, +1 from me on the overall idea. Good, that is one committer opinion, an infinite improvement over the previous status :-) The challenge here will almost certainly be in the details. Yep. I'm fine with "your code is not good and creates problems so it is rejected". I'm trying to avoid "your code was a loss from the start, whatever you did, because we do not want such a feature". I do like the proposal you have of building out a common set of capabilities which are shared between psql and pgbench. Good. The other big challenge here is figuring out how to distinguish between SQL which should be sent to the server and something which needs to be client-side processed. The current great idea is to use backslash commands to define existing variables:

    psql> \let i 1 + 2 * 3
    psql> SELECT :i ;
    psql> \if :i >= 5
    psql> ...
    psql> \endif

I've never liked the ':var' approach and it really sucks when you want to combine that variable with something else, but it's what we've got and therefore has history behind it. Indeed. I do not think that changing this would make much sense. 
If you can find a way to improve on that without breaking existing code, that'd be fantastic. If not, definitely try to minimize the impact. I was just planning to set existing :-variables with expressions; I have no other great idea. Mixing languages is always a pain. Thanks for your feedback. -- Fabien.