date:20180619

On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote:
> On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
> It seems to me that we still want to have the slot forwarding finish in
> this case even if this is interrupted.  Petr, isn't that the intention
> here?

I have been chewing a bit more on the proposed patch, finishing with the
attached to close the loop.  Thoughts?
--
Michael
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61588d626f..6068aefd02 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -340,6 +340,9 @@ CreateInitDecodingContext(char *plugin,
  * output_plugin_options
  *		contains options passed to the output plugin.
  *
+ * fast_forward
+ *		bypasses the generation of logical changes.
+ *
  * read_page, prepare_write, do_write, update_progress
  *		callbacks that have to be filled to perform the use-case dependent,
  *		actual work.
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..7c03bf69d3 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -346,7 +346,8 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
  * The LSN position to move to is checked by doing a per-record scan and
  * logical decoding which makes sure that confirmed_lsn is updated to a
  * LSN which allows the future slot consumer to get consistent logical
- * changes.
+ * changes.  As decoding is done with fast_forward mode, no changes are
+ * actually generated.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
@@ -389,8 +390,8 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 			startlsn = InvalidXLogRecPtr;
 
 			/*
-			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
-			 * store the description into our tuplestore.
+			 * Note that changes are not generated in fast_forward mode, but
+			 * that the slot's data still needs to be updated.
 			 */
 			if (record != NULL)
 LogicalDecodingProcessRecord(ctx, ctx->reader);


signature.asc
Description: PGP signature

Re: [HACKERS] GUC for cleanup indexes threshold.

2018-06-19 Thread Masahiko Sawada

On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov
 wrote:
> On Tue, Jun 19, 2018 at 12:25 PM Masahiko Sawada  
> wrote:
>> On Tue, Jun 19, 2018 at 5:43 PM, Alexander Korotkov
>>  wrote:
>> > On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada  
>> > wrote:
>> >> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov
>> >> > So, I'm proposing to raise maximum valus of
>> >> > vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?
>> >> >
>> >>
>> >> I agree to expand the maximum value. But if users don't want index
>> >> cleanup it would be helpful if we have an option (e.g. setting to -1)
>> >> to disable index cleanup while documenting a risk of disabling index
>> >> cleanup. It seems to me that setting very high values means the same
>> >> purpose.
>> >
>> > Yes, providing an option to completely disable b-tree index cleanup
>> > would be good.  But the problem is that we already use -1 value for
>> > "use the default" in reloption.  So, if even we will make -1 guc
>> > option to mean "never cleanup", then we still wouldn't be able to make
>> > reloption to work this way.  Probably, we should use another "magical
>> > value" in reloption for "use the default" semantics.
>>
>> Right. We can add a new reloption specifying whether we use default
>> value of vacuum_index_cleanup_scale_factor or not but it might be
>> overkill.
>
> That would be overkill, and also that would be different from how
> other reloptions behaves.

Agreed.

>
>> >> Also, your patch lacks documentation update.
>> >
>> > Good catch, thank you.
>> >
>> >> BTW, I realized that postgresql.conf.sample doesn't have
>> >> vacuum_cleanup_index_scale_factor option. Attached patch fixes it.
>> >
>> > It seems that you post a wrong attachment, because the patch you sent
>> > is exactly same as mine.
>> >
>>
>> Sorry, attached correct one.
>
> Ok.  I've rephrased comment a bit.  Also, you created "index vacuum"
> subsection in the "resource usage" section.  I think it's not
> appropriate for this option to be in "resource usage".  Ideally it
> should be grouped with other vacuum options, but we don't have single
> section for that.  "Autovacuum" section is also not appropriate,
> because this guc works not only for autovacuum.  So, most semantically
> close options, which affects vacuum in general, are
> vacuum_freeze_min_age, vacuum_freeze_table_age,
> vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
> which are located in "client connection defaults" section.  So, I
> decided to move this GUC into this section.  I also change the section
> in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.

Agreed. So should we move it to 19.11 Client Connection Defaults in
doc as well? And I think it's better if this option places next to
other vacuum options for finding easier. Attached patch changes them
so. Please review it.

> I think we ideally should have a "vacuum" section, which would have
> two subections: "client defaults" and "autovacuum".  But that should
> be a separate patch, which needs to be discussed separately.

+1

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


skip_cleanup_index_3.patch
Description: Binary data

Re: Partitioning with temp tables is broken

On Wed, Jun 20, 2018 at 01:32:58PM +0900, Amit Langote wrote:
> Just a minor nit in the last sentence:
> 
> "have to be from" -> "must be from / must belong to"

I think that both have the same meaning, but I am no native speaker so I
may be missing a nuance of some kind.
--
Michael


signature.asc
Description: PGP signature

Re: Partitioning with temp tables is broken

2018-06-19 Thread Amit Langote

On 2018/06/20 10:54, Michael Paquier wrote:
> On Tue, Jun 19, 2018 at 04:26:56PM +0530, Ashutosh Bapat wrote:
>> On Tue, Jun 19, 2018 at 4:22 PM, Michael Paquier  wrote:
>>> I was under the impression that this was implied in the precious
>>> phrasing but you guys visibly don't match with my impression.  So I
>>> would suggest this paragraph at the end:
>>> "Mixing temporary and permanent relations in the same partition tree is
>>> not allowed.  Hence, if the partitioned table is permanent, so must be
>>> its partitions at all levels and likewise if the partitioned table is
>>
>> You don't need to mention "at all levels", its implied recursively.
> 
> Okay, done on master and REL_10_STABLE.  I have adapted the tests and
> the code on v10 where default partitions do not apply.  I have also
> removed the test case for partition pruning in REL_10_STABLE as those
> have been mainly added by 8d4e70a6, master of course keeps it.

Thank you, especially for putting in the extra work for back-patching.  I
shouldn't have used default partition syntax in tests, sorry.

> I have included Ashutosh's last suggestions and finished with the
> following phrasing:

I liked both of Ashutosh's suggestions, which I see you incorporated into
the commit.

> "Mixing temporary and permanent relations in the same partition tree is
> not allowed.  Hence, if the partitioned table is permanent, so must be
> its partitions and likewise if the partitioned table is temporary.  When
> using temporary relations, all members of the partition tree have to be
> from the same session."

Just a minor nit in the last sentence:

"have to be from" -> "must be from / must belong to"

> I am not sure why this set of emails does not actually appear on
> UI interface for archives of pgsql-hackers...  All hackers are receiving
> that, right?

I evidently got your email just fine. :)

Thanks,
Amit

Re: Add necessary package list to ldap TAP's README

Michael Paquier  writes:
> On Tue, Jun 19, 2018 at 06:59:36PM -0400, Tom Lane wrote:
>> No, but it needs some adjustments.  I'll go fix what I checked.

> c992dca2 did not change anything about ldap packages.  Are the ones
> listed for RHEL and FreeBSD correct?

No, that patch was more-or-less-unrelated cleanup.  The question is still
open about what we want to list as required packages here.  It seems
somewhat reasonable to assume that people have already managed to build
with the option of interest (eg --with-ldap) before they come to these
test suites ... or am I wrong about that?

regards, tom lane

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila  wrote:
> Both values will be present in the index, but the old value will be
> delete-marked.  It is correct that we can't remove the value (index
> tuple) from the index until it is truly dead (not visible to anyone),
> but during a delete or index-update operation, we need to traverse the
> index to mark the entries as delete-marked.  See, at this stage, I
> don't want to go in too much detail discussion of how delete-marking
> will happen in zheap and also I am not sure this thread is the right
> place to discuss details of that technology.

I don't understand, but okay. I can provide feedback once a design for
delete marking is available.


-- 
Peter Geoghegan

Re: Add necessary package list to ldap TAP's README

On Wed, Jun 20, 2018 at 12:59:04PM +0900, Michael Paquier wrote:
> On Tue, Jun 19, 2018 at 06:59:36PM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> >> On Tue, Jun 19, 2018 at 06:45:54PM -0400, Tom Lane wrote:
> >>> I was a bit disturbed that you'd push information that was "just guessed",
> >>> so I went and tried to run the ldap tests on a couple different platforms.
> > 
> >> I assumed they were guessed at and at least the URLs checked out. 
> >> Should we remove the patch?
> 
> Yes, I was a bit surprised to see this patch pushed without checks about
> what I guessed.  I am not a user of those other platforms.

It was in README that only developer read, so I thought even a guess
would better than nothing.  Maybe I was wrong.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-19 Thread Andrey V. Lepikhov





On 19.06.2018 04:05, Peter Geoghegan wrote:

On Mon, Jun 18, 2018 at 2:54 PM, Peter Geoghegan  wrote:

On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov
 wrote:

Patch '0001-retail-indextuple-deletion' introduce new function
amtargetdelete() in access method interface. Patch
'0002-quick-vacuum-strategy' implements this function for an alternative
strategy of lazy index vacuum, called 'Quick Vacuum'.


My compiler shows the following warnings:


Some real feedback:

What we probably want to end up with here is new lazyvacuum.c code
that does processing for one heap page (and associated indexes) that
is really just a "partial" lazy vacuum. Though it won't do things like
advance relfrozenxid, it will do pruning for the heap page, index
tuple killing, and finally heap tuple killing. It will do all of these
things reliably, just like traditional lazy vacuum. This will be what
your background worker eventually uses.


It is final goal of the patch.


I doubt that you can use routines like index_beginscan() within
bttargetdelete() at all. I think that you need something closer to
_bt_doinsert() or _bt_pagedel(), that manages its own scan (your code
should probably live in nbtpage.c). It does not make sense to teach
external, generic routines like index_beginscan() about heap TID being
an implicit final index attribute, which is one reason for this (I'm
assuming that this patch relies on my own patch).  Another reason is
that you need to hold an exclusive buffer lock at the point that you
identify the tuple to be killed, until the point that you actually
kill it. You don't do that now.

IOW, the approach you've taken in bttargetdelete() isn't quite correct
because you imagine that it's okay to occasionally "lose" the index
tuple that you originally found, and just move on. That needs to be
100% reliable, or else we'll end up with index tuples that point to
the wrong heap tuples in rare cases with concurrent insertions. As I
said, we want a "partial" lazy vacuum here, which must mean that it's
reliable. Note that _bt_pagedel() actually calls _bt_search() when it
deletes a page. Your patch will not be the first patch that makes
nbtree vacuuming do an index scan. You should be managing your own
insertion scan key, much like _bt_pagedel() does. If you use my patch,
_bt_search() can be taught to look for a specific heap TID.

Agree with this notes. Corrections will made in the next version of the 
patch.



Finally, doing things this way would let you delete multiple
duplicates in one shot, as I described in an earlier e-mail. Only a
single descent of the tree is needed to delete quite a few index
tuples, provided that they all happen to be logical duplicates. Again,
your background worker will take advantage of this.

It is very interesting idea. According to this, I plan to change 
bttargetdelete() interface as follows:

IndexTargetDeleteStats*
amtargetdelete(IndexTargetDeleteInfo *info,
   IndexTargetDeleteStats *stats,
   Datum *values, bool *isnull);
where structure IndexTargetDeleteInfo contains a TID list of dead heap 
tuples. All index entries, corresponding to this list, may be deleted 
(or only some of it) by one call of amtargetdelete() function with 
single descent of the tree.



This code does not follow the Postgres style:


-   else
+   }
+   else {
+   if (rootoffnum != latestdead)
+   heap_prune_record_unused(prstate, latestdead);
 heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);
+   }
 }


Please be more careful about that. I find it very distracting.


Done

--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company

Re: Add necessary package list to ldap TAP's README

On Wed, Jun 20, 2018 at 11:06:52AM +1200, Thomas Munro wrote:
> I think he used Brew paths?
> 
> https://www.postgresql.org/message-id/0a7646b1-f6f6-1364-f0e3-814aed302db4%402ndquadrant.com

Yes, stuff in /usr/local/ involves homebrew.
--
Michael


signature.asc
Description: PGP signature

Re: partition -> partitioned

2018-06-19 Thread Alvaro Herrera

On 2018-Jun-20, Michael Paquier wrote:

> On Tue, Jun 19, 2018 at 06:02:22PM +0900, Amit Langote wrote:
> > Noticed that the relevant code changed, so I rebased the patch.  Also,
> > made a minor update to a nearby comment.
> 
> That looks right to me as we speak about non-leaf partitions here.
> Alvaro, as 499be013 is yours, would you fix this inconsistency or should
> I?  I could understand why things are confusing on HEAD, "partitioned"
> and "partition" have opposite meanings.

Hmm, will look.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Add necessary package list to ldap TAP's README

On Tue, Jun 19, 2018 at 06:59:36PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
>> On Tue, Jun 19, 2018 at 06:45:54PM -0400, Tom Lane wrote:
>>> I was a bit disturbed that you'd push information that was "just guessed",
>>> so I went and tried to run the ldap tests on a couple different platforms.
> 
>> I assumed they were guessed at and at least the URLs checked out. 
>> Should we remove the patch?

Yes, I was a bit surprised to see this patch pushed without checks about
what I guessed.  I am not a user of those other platforms.

> No, but it needs some adjustments.  I'll go fix what I checked.

c992dca2 did not change anything about ldap packages.  Are the ones
listed for RHEL and FreeBSD correct?

> Do you have an opinion on what to say about build-time vs test-time
> requirements?

It could be an idea to mention that for those test suites Postgres needs
to be built with the appropriate --enable-XXX switches or they are just
skipped silently.  Something in each README files across the lines is an
idea:
"This test suite requires PostgreSQL to be built with the configure
argument --enable-XXX or they are entirely skipped".
--
Michael


signature.asc
Description: PGP signature

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-19 Thread Amit Kapila

On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan  wrote:
> On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila  wrote:
>>> I imagine that retail index tuple deletion (the whole point of this
>>> project) would be run by a VACUUM-like process that kills tuples that
>>> are dead to everyone. Even with something like zheap, you cannot just
>>> delete index tuples until you establish that they're truly dead. I
>>> guess that the delete marking stuff that Robert mentioned marks tuples
>>> as dead when the deleting transaction commits.
>>>
>>
>> No, I don't think that is the case because we want to perform in-place
>> updates for indexed-column-updates.  If we won't delete-mark the index
>> tuple before performing in-place update, then we will have two tuples
>> in the index which point to the same heap-TID.
>
> How can an old MVCC snapshot that needs to find the heap tuple using
> some now-obsolete key values get to the heap tuple via an index scan
> if there are no index tuples that stick around until "recently dead"
> heap tuples become "fully dead"? How can you avoid keeping around both
> old and new index tuples at the same time?
>

Both values will be present in the index, but the old value will be
delete-marked.  It is correct that we can't remove the value (index
tuple) from the index until it is truly dead (not visible to anyone),
but during a delete or index-update operation, we need to traverse the
index to mark the entries as delete-marked.  See, at this stage, I
don't want to go in too much detail discussion of how delete-marking
will happen in zheap and also I am not sure this thread is the right
place to discuss details of that technology.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: partition -> partitioned

On Tue, Jun 19, 2018 at 06:02:22PM +0900, Amit Langote wrote:
> Noticed that the relevant code changed, so I rebased the patch.  Also,
> made a minor update to a nearby comment.

That looks right to me as we speak about non-leaf partitions here.
Alvaro, as 499be013 is yours, would you fix this inconsistency or should
I?  I could understand why things are confusing on HEAD, "partitioned"
and "partition" have opposite meanings.
--
Michael


signature.asc
Description: PGP signature

Re: Fast default stuff versus pg_upgrade

On 2018-06-19 22:41:26 -0400, Andrew Dunstan wrote:
> This unfortunately crashes and burns if we use DirectFunctionCall3 to call
> array_in, because it uses fn_extra. There is the CallerFInfoFunctionCall
> stuff, but it only has 1 and 2 arg variants, and array_in takes 3. In
> retrospect we should probably have added a 3 arg form - quite a few input
> functions take 3 args. Anything else is likely to be rather uglier.
> 
> Attaching the failing patch. I'll attack this again in the morning.

Why don't you just use OidFunctionCall3? Or simply an explicit
fmgr_info(), InitFunctionCallInfoData(), FunctionCallInvoke() combo?

Greetings,

Andres Freund

Re: Fast default stuff versus pg_upgrade




On 06/19/2018 01:19 PM, Tom Lane wrote:

Andres Freund  writes:

On 2018-06-19 12:37:52 -0400, Tom Lane wrote:

The problem here is that that function does not exist in 11beta1.
Since adding the "incoming" function is certainly going to require
initdb, we have to be able to dump from the server as it now stands,
or we'll be cutting existing beta testers adrift.

It'd probably not be too hard to write a plpgsql replacement for it,
should it come to that. Obviously it'd be nicer to not require users to
create that, but ...

After some thought, I think it's not that hard to get the support function
to accept the anyarray string form.  I was worried about issues like
whether float8 values would restore exactly, but really that's no worse
than a dump/reload today.  Basically, the support function would just need
to extract the target attribute's type and typmod from the pg_attribute
row, then call array_in().



This unfortunately crashes and burns if we use DirectFunctionCall3 to 
call array_in, because it uses fn_extra. There is the 
CallerFInfoFunctionCall stuff, but it only has 1 and 2 arg variants, and 
array_in takes 3. In retrospect we should probably have added a 3 arg 
form - quite a few input functions take 3 args. Anything else is likely 
to be rather uglier.


Attaching the failing patch. I'll attack this again in the morning.

cheers

andrew



--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
index 0c54b02..d9474db 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -11,14 +11,18 @@
 
 #include "postgres.h"
 
+#include "access/heapam.h"
+#include "access/htup_details.h"
 #include "catalog/binary_upgrade.h"
+#include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
 #include "commands/extension.h"
 #include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
-
+#include "utils/rel.h"
+#include "utils/syscache.h"
 
 #define CHECK_IS_BINARY_UPGRADE	\
 do {			\
@@ -192,3 +196,59 @@ binary_upgrade_set_record_init_privs(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+Datum
+binary_upgrade_set_missing_value(PG_FUNCTION_ARGS)
+{
+	Oid		table_id = PG_GETARG_OID(0);
+	text*attname = PG_GETARG_TEXT_P(1);
+	text*value = PG_GETARG_TEXT_P(2);
+	DatumvaluesAtt[Natts_pg_attribute];
+	bool nullsAtt[Natts_pg_attribute];
+	bool replacesAtt[Natts_pg_attribute];
+	Datummissingval;
+	Form_pg_attribute attStruct;
+	Relationattrrel;
+	HeapTuple   atttup;
+
+	CHECK_IS_BINARY_UPGRADE;
+
+	/* Lock the attribute row and get the data */
+	attrrel = heap_open(AttributeRelationId, RowExclusiveLock);
+	atttup = SearchSysCacheAttName(table_id,text_to_cstring(attname));
+	if (!HeapTupleIsValid(atttup))
+		elog(ERROR, "cache lookup failed for attribute %s of relation %u",
+			 text_to_cstring(attname), table_id);
+	attStruct = (Form_pg_attribute) GETSTRUCT(atttup);
+
+
+	/* get an array value from the value string */
+	missingval = DirectFunctionCall3(array_in,
+	 CStringGetDatum(text_to_cstring(value)),
+	 ObjectIdGetDatum(attStruct->atttypid),
+	 Int32GetDatum(attStruct->atttypmod));
+
+	/* update the tuple - set atthasmissing and attmissingval */
+	MemSet(valuesAtt, 0, sizeof(valuesAtt));
+	MemSet(nullsAtt, false, sizeof(nullsAtt));
+	MemSet(replacesAtt, false, sizeof(replacesAtt));
+
+	valuesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
+	replacesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
+	valuesAtt[Anum_pg_attribute_attmissingval - 1] = missingval;
+	replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
+	nullsAtt[Anum_pg_attribute_attmissingval - 1] = false;
+
+	atttup = heap_modify_tuple(atttup, RelationGetDescr(attrrel),
+			   valuesAtt, nullsAtt, replacesAtt);
+	CatalogTupleUpdate(attrrel, >t_self, atttup);
+
+
+	/* clean up */
+	pfree(DatumGetPointer(missingval));
+
+	heap_close(attrrel, RowExclusiveLock);
+	heap_freetuple(atttup);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ea2f022..22be3c2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8095,6 +8095,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_typstorage;
 	int			i_attnotnull;
 	int			i_atthasdef;
+	int			i_atthasmissing;
 	int			i_attidentity;
 	int			i_attisdropped;
 	int			i_attlen;
@@ -8103,6 +8104,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_attoptions;
 	int			i_attcollation;
 	int			i_attfdwoptions;
+	int			i_attmissingval;
 	PGresult   *res;
 	int			ntups;
 	bool		hasdefaults;
@@ -8132,7 +8134,33 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 
 		resetPQExpBuffer(q);
 
-		if (fout->remoteVersion >= 10)
+		if

Re: Partitioning with temp tables is broken

On Tue, Jun 19, 2018 at 04:26:56PM +0530, Ashutosh Bapat wrote:
> On Tue, Jun 19, 2018 at 4:22 PM, Michael Paquier  wrote:
>> I was under the impression that this was implied in the precious
>> phrasing but you guys visibly don't match with my impression.  So I
>> would suggest this paragraph at the end:
>> "Mixing temporary and permanent relations in the same partition tree is
>> not allowed.  Hence, if the partitioned table is permanent, so must be
>> its partitions at all levels and likewise if the partitioned table is
> 
> You don't need to mention "at all levels", its implied recursively.

Okay, done on master and REL_10_STABLE.  I have adapted the tests and
the code on v10 where default partitions do not apply.  I have also
removed the test case for partition pruning in REL_10_STABLE as those
have been mainly added by 8d4e70a6, master of course keeps it.

I have included Ashutosh's last suggestions and finished with the
following phrasing:
"Mixing temporary and permanent relations in the same partition tree is
not allowed.  Hence, if the partitioned table is permanent, so must be
its partitions and likewise if the partitioned table is temporary.  When
using temporary relations, all members of the partition tree have to be
from the same session."

I am not sure why this set of emails does not actually appear on
UI interface for archives of pgsql-hackers...  All hackers are receiving
that, right?
--
Michael

signature.asc
Description: PGP signature

Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-06-19 Thread David Rowley

On 20 June 2018 at 02:28, Robert Haas  wrote:
> On Sun, Jun 17, 2018 at 10:59 PM, David Rowley
>> Robert, do you have any objections to the proposed patch?
>
> I don't have time to study this right now, but I think the main
> possible objection is around performance.  If not flattening the
> Append is the best way to make queries run fast, then we should do it
> that way.  If making pruning capable of coping with mixed hierarchies
> is going to be faster, then we should do that.  If I were to speculate
> in the absence of data, my guess would be that failing to flatten the
> hierarchy is going to lead to a significant per-tuple cost, while the
> cost of making run-time pruning smarter is likely to be incurred once
> per rescan (i.e. a lot less).  But that might be wrong, and it might
> be impractical to get this working perfectly in v11 given the time we
> have.  But I would suggest that you performance test a query that ends
> up feeding lots of tuples through two Append nodes rather than one and
> see how much it hurts.

I've performed two tests. One to see what the overhead of the
additional append is, and one to see what the saving from pruning away
unneeded partitions is. I tried to make the 2nd test use a realistic
number of partitions. Partition pruning will become more useful with
higher numbers of partitions.

Test 1: Test overhead of pulling tuples through an additional append

create table p (a int) partition by list (a);
create table p1 partition of p for values in(1);
insert into p select 1 from generate_series(1,100);
vacuum p1;
set max_parallel_workers_per_gather=0;

select count(*) from (select * from p union all select * from p) p;

Unpatched:
tps = 8.530355 (excluding connections establishing)

Patched:
tps = 7.853939 (excluding connections establishing)

Patched version takes 108.61% of the unpatched time.

Test 2: Tests time saved from run-time partition pruning and not
scanning the index on 23 of the partitions.

create table rp (d date) partition by range (d);
select 'CREATE TABLE rp' || x::text || ' PARTITION OF rp FOR VALUES
FROM (''' || '2017-01-01'::date + (x::text || ' month')::interval ||
''') TO (''' || '2017-01-01'::date + ((x+1)::text || '
month')::interval || ''');'
from generate_Series(0,23) x;
\gexec
insert into rp select d::date from
generate_series('2017-01-01','2018-12-31', interval '10 sec') d;
create index on rp (d);

select count(*) from (select * from rp union all select * from rp) rp
where d = current_date;

Unpatched: (set enable_partition_pruning = 0; to make it work)
tps = 260.969953 (excluding connections establishing)

Patched:
tps = 301.319038 (excluding connections establishing)

Patched version takes 86.61% of the unpatched time.

So, I don't think that really concludes much.  I'd say the overhead
shown in test 1 is going to be a bit more predictable as it will
depend on how many tuples are being pulled through the additional
Append, but the savings shown in test 2 will vary.  Having run-time
pruning not magically fail to work when the partitioned table is part
of a UNION ALL certainly seems less surprising.

If I drop the index from the "d" column in test 2, the performance gap
increases significantly and is roughly proportional to the number of
partitions.

Unpatched:
tps = 0.523691 (excluding connections establishing)

Patched:
tps = 13.453964 (excluding connections establishing)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Add necessary package list to ldap TAP's README

2018-06-19 Thread Thomas Munro

On Wed, Jun 20, 2018 at 10:45 AM, Tom Lane  wrote:
> FreeBSD 12: you need openldap-client (NOT openldap24-client) to build
> with --with-ldap at all.  Then also openldap-server to run this test.

Well, it's "pkg install openldap-client" but it's
"net/openldap24-client" if you're building ports from source.  That is
confusing.

> On a somewhat related subject ... I notice that the paths for darwin in
> the test script seem to presume that the LDAP support has been installed
> from MacPorts or some such, but this isn't documented anywhere (least of
> all in this README file).  I tried to get it to work with the
> Apple-provided LDAP server and libraries, but no go --- Apple's apparently
> done something weird to their copy of slapd, such that it doesn't work
> when launched manually like this.  So the expectation of non-default LDAP
> support is probably reasonable, but I'm disturbed by the lack of docs.

I think he used Brew paths?

https://www.postgresql.org/message-id/0a7646b1-f6f6-1364-f0e3-814aed302db4%402ndquadrant.com

-- 
Thomas Munro
http://www.enterprisedb.com

Re: Add necessary package list to ldap TAP's README

Bruce Momjian  writes:
> On Tue, Jun 19, 2018 at 06:45:54PM -0400, Tom Lane wrote:
>> I was a bit disturbed that you'd push information that was "just guessed",
>> so I went and tried to run the ldap tests on a couple different platforms.

> I assumed they were guessed at and at least the URLs checked out. 
> Should we remove the patch?

No, but it needs some adjustments.  I'll go fix what I checked.

Do you have an opinion on what to say about build-time vs test-time
requirements?

regards, tom lane

Re: Add necessary package list to ldap TAP's README

On Tue, Jun 19, 2018 at 06:45:54PM -0400, Tom Lane wrote:
> I was a bit disturbed that you'd push information that was "just guessed",
> so I went and tried to run the ldap tests on a couple different platforms.

I assumed they were guessed at and at least the URLs checked out. 
Should we remove the patch?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: Add necessary package list to ldap TAP's README

Bruce Momjian  writes:
> On Mon, May 21, 2018 at 10:34:25AM +0900, Michael Paquier wrote:
>> The kerberos test suite mentions the package list to use on a set of
>> distributions, which is very useful.  However we don't do the same for
>> ldap.  Something like the attached would be adapted then to help setting
>> up a test environment?
>> 
>> I have at hand only a Debian installation, and I have just guessed the
>> RHEL and FreeBSD dependencies.

> Patch applied to head, thanks.

I was a bit disturbed that you'd push information that was "just guessed",
so I went and tried to run the ldap tests on a couple different platforms.

Fedora 28: you need openldap and openldap-devel to build with --with-ldap
at all.  Then in addition openldap-clients and openldap-servers to run
this test.  The path selections in the script work.

FreeBSD 12: you need openldap-client (NOT openldap24-client) to build
with --with-ldap at all.  Then also openldap-server to run this test.

I'm not sure whether it's worth making a clear distinction between
build-time and test-time dependencies, but if we're going to list
build-time dependencies here at all, we probably ought to be complete
about it.  (It's been well established that not everybody who builds
PG knows about the need to install -devel/-dev subpackages.)

On a somewhat related subject ... I notice that the paths for darwin in
the test script seem to presume that the LDAP support has been installed
from MacPorts or some such, but this isn't documented anywhere (least of
all in this README file).  I tried to get it to work with the
Apple-provided LDAP server and libraries, but no go --- Apple's apparently
done something weird to their copy of slapd, such that it doesn't work
when launched manually like this.  So the expectation of non-default LDAP
support is probably reasonable, but I'm disturbed by the lack of docs.

regards, tom lane

pglife and post-beta commits

FYI, you can now see all the commits since the most recent beta in
pglife (http://pglife.momjian.us/).  It is the plus sign link after "11
beta".  The link currently is:


https://git.postgresql.org/gitweb/?p=postgresql.git;a=shortlog;h=refs/heads/master;hp=586e4e6df5b85ddd28c9e881d237bd7380ffeb8e

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: Time to put context diffs in the grave





On 06/19/2018 05:45 PM, Peter Geoghegan wrote:

On Tue, Jun 19, 2018 at 2:41 PM, Tom Lane  wrote:

Not requiring them is considerably different from actually trying
to stamp them out.  To my mind, where we're trying to go here is
not having one set of people impose their diff format preferences
on other people.

Am I the only one that doesn't use context diffs and also doesn't hate
them with a passion?



No.



I don't get it.



But I do get it. I won't even start on the things that bug me :-)

cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Time to put context diffs in the grave

On Tue, Jun 19, 2018 at 2:41 PM, Tom Lane  wrote:
> Not requiring them is considerably different from actually trying
> to stamp them out.  To my mind, where we're trying to go here is
> not having one set of people impose their diff format preferences
> on other people.

Am I the only one that doesn't use context diffs and also doesn't hate
them with a passion? I don't get it.

> There's already a (perhaps underdocumented) way to make regression
> diffs look the way you want in local runs.

Actually, it is documented.

-- 
Peter Geoghegan

Re: Time to put context diffs in the grave

On 2018-06-19 17:41:00 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 06/19/2018 05:21 PM, Andres Freund wrote:
> >> Given that we've "officially" stopped relying on context diffs,
> >> I don't see why we'd not also retire them in pg_regress.
> 
> > Well I suspect at least one person would be made unhappy ;-)
> 
> Not requiring them is considerably different from actually trying
> to stamp them out.

Sure - but I'm not trying to get rid of PG_REGRESS_DIFF_OPTS, just
talking about changing the defaults.


> There's already a (perhaps underdocumented) way to make regression
> diffs look the way you want in local runs.  Andrew's idea of getting
> the buildfarm server to display diffs either way would go a long
> way towards having that same flexibility for buildfarm results, though
> it does seem like rather a lot of work for a small point :-(

Yea, if this were all already available, I'd be happy with it. But...

Greetings,

Andres Freund

Re: Time to put context diffs in the grave

Andrew Dunstan  writes:
> On 06/19/2018 05:21 PM, Andres Freund wrote:
>> Given that we've "officially" stopped relying on context diffs,
>> I don't see why we'd not also retire them in pg_regress.

> Well I suspect at least one person would be made unhappy ;-)

Not requiring them is considerably different from actually trying
to stamp them out.  To my mind, where we're trying to go here is
not having one set of people impose their diff format preferences
on other people.

There's already a (perhaps underdocumented) way to make regression
diffs look the way you want in local runs.  Andrew's idea of getting
the buildfarm server to display diffs either way would go a long
way towards having that same flexibility for buildfarm results, though
it does seem like rather a lot of work for a small point :-(

regards, tom lane

Re: PostgreSQL and Homomorphic Encryption

On Wed, May 23, 2018 at 09:05:15AM +0800, Craig Ringer wrote:
> Presumably it'd have to support some non-equality ops like < and > for b-tree
> indexing, so you can compare two encrypted texts without decryption.
> 
> If the user can supply cleartext to be compared against, this exposes
> search-based plaintext attacks where you can discover the plaintext over time
> with iterative searches over modified plaintext.
> 
> My understanding of homomorphic encryption is that it's generally more useful
> for data-modifying operations. For example, you might want to add a value to a
> balance without being able to read the balance and learn the current value. I
> haven't heard of it being used for searches before.

I have a slide about indexing encrypted data;  not sure if it is
relevant:

https://momjian.us/main/writings/crypto_hw_use.pdf#page=86

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: Time to put context diffs in the grave





On 06/19/2018 05:21 PM, Andres Freund wrote:


Given that we've "officially" stopped relying on context diffs,
I don't see why we'd not also retire them in pg_regress.




Well I suspect at least one person would be made unhappy ;-)

cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Time to put context diffs in the grave

On 2018-06-19 17:18:32 -0400, Andrew Dunstan wrote:
> On 06/19/2018 04:54 PM, Andres Freund wrote:
> > Could we please also change pg_regress' diff invocations? I find it
> > really painful to see differences in buildfarm output due to the
> > -C3. Locally I've long exported PG_REGRESS_DIFF_OPTS, but that doesn't
> > help on the BF.

> It should do if you put it on the build_env stanza of your config file. I
> just looked at your animal skink and didn't see any trace of it. What
> PG_REGRESS_DIFF_OPTS do you use?
> 
> But of course this won;t help you with other peoples' animals :-)

Oh, I was talking about the wider buildfarm, rather than just mine.


> One idea I have been playing with is breaking up the reports so that instead
> of a large text blob we have a series of files. Allowing different formats
> on the web interface for diff files wouldn't be terribly difficult once we
> had that - we'd just need to invoke filterdiff or some such. Of course, we
> couldn't supply more context that was there in the first place ;-)

I mean that would be good - I've way too often spent considerable time
looking for errors - but I don't see why that would be related to this
change. Given that we've "officially" stopped relying on context diffs,
I don't see why we'd not also retire them in pg_regress.

Greetings,

Andres Freund

Re: Time to put context diffs in the grave





On 06/19/2018 04:54 PM, Andres Freund wrote:

Hi,

On 2018-05-21 21:51:11 -0400, Andrew Dunstan wrote:

We haven't insisted on context diffs in years now, and one of my
interlocutors has just turned handsprings trying to follow the advice at
 to produce his first
patch.


Unless someone objects really violently I'm going to rip all that stuff out
and let sanity prevail.

Could we please also change pg_regress' diff invocations? I find it
really painful to see differences in buildfarm output due to the
-C3. Locally I've long exported PG_REGRESS_DIFF_OPTS, but that doesn't
help on the BF.





It should do if you put it on the build_env stanza of your config file. 
I just looked at your animal skink and didn't see any trace of it. What 
PG_REGRESS_DIFF_OPTS do you use?


But of course this won;t help you with other peoples' animals :-)

One idea I have been playing with is breaking up the reports so that 
instead of a large text blob we have a series of files. Allowing 
different formats on the web interface for diff files wouldn't be 
terribly difficult once we had that - we'd just need to invoke 
filterdiff or some such. Of course, we couldn't supply more context that 
was there in the first place ;-)


It would be a substantial change, though, so I'm still working on it 
conceptually.


cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: libpq compression

2018-06-19 Thread Robbie Harwood

Konstantin Knizhnik  writes:

> On 18.06.2018 23:34, Robbie Harwood wrote:
>
>> I also don't like that you've injected into the *startup* path -
>> before authentication takes place.  Fundamentally, authentication (if
>> it happens) consists of exchanging some combination of short strings
>> (e.g., usernames) and binary blobs (e.g., keys).  None of this will
>> compress well, so I don't see it as worth performing this negotiation
>> there - it can wait.  It's also another message in every startup.
>> I'd leave it to connection parameters, personally, but up to you.
>
>  From my point of view compression of libpq traffic is similar with SSL 
> and should be toggled at the same stage.

But that's not what you're doing.  This isn't where TLS gets toggled.

TLS negotiation happens as the very first packet: after completing the
TCP handshake, the client will send a TLS negotiation request.  If it
doesn't happen there, it doesn't happen at all.

(You *could* configure it where TLS is toggled.  This is, I think, not a
good idea.  TLS encryption is a probe: the server can reject it, at
which point the client tears everything down and connects without TLS.
So if you do the same with compression, that's another point of tearing
down an starting over.  The scaling on it isn't good either: if we add
another encryption method into the mix, you've doubled the number of
teardowns.)

> Definitely authentication parameter are not so large to be efficiently 
> compressed, by compression (may be in future password protected) can 
> somehow protect this data.
> In any case I do not think that compression of authentication data may 
> have any influence on negotiation speed.
> So I am not 100% sure that toggling compression just after receiving 
> startup package is the only right solution.
> But I am not also convinced in that there is some better place where 
> compressor should be configured.
> Do you have some concrete suggestions for it? In InitPostgres just after 
> PerformAuthentication ?

Hmm, let me try to explain this differently.

pq_configure() (as you've called it) shouldn't send a packet.  At its
callsite, we *already know* whether we want to use compression - that's
what the port->use_compression option says.  So there's no point in
having a negotiation there - it's already happened.

The other thing you do in pq_configure() is call zpq_create(), which
does a bunch of initialization for you.  I am pretty sure that all of
this can be deferred until the first time you want to send a compressed
message - i.e., when compress()/decompress() is called for the first
time from *secure_read() or *secure_write().

> Also please notice that compression is useful not only for client-server 
> communication, but also for replication channels.
> Right now it is definitely used in both cases, but if we move 
> pq_configure somewhere else, we should check that this code is invoked 
> in both for normal backends and walsender.

"We" meaning you, at the moment, since I don't think any of the rest of
us have set up tests with this code :)

If there's common code to be shared around, that's great.  But it's not
imperative; in a lot of ways, the network stacks are very different from
each other, as I'm sure you've seen.  Let's not have the desire for code
reuse get in the way of good, maintainable design.

>> Using terminology from https://facebook.github.io/zstd/zstd_manual.html :
>>
>> Right now you use streaming (ZSTD_{compress,decompress}Stream()) as the
>> basis for your API.  I think this is probably a mismatch for what we're
>> doing here - we're doing one-shot compression/decompression of packets,
>> not sending video or something.
>>
>> I think our use case is better served by the non-streaming interface, or
>> what they call the "Simple API" (ZSTD_{decompress,compress}()).
>> Documentation suggests it may be worth it to keep an explicit context
>> around and use that interface instead (i.e.,
>> ZSTD_{compressCCTx,decompressDCtx}()), but that's something you'll have
>> to figure out.
>>
>> You may find making this change helpful for addressing the next issue.
>
> Sorry, but here I completely disagree with you.
> What we are doing is really streaming compression, not compression of 
> individual messages/packages.
> Yes, it is not a video, but actually COPY data has the same nature as 
> video data.
> The main benefit of streaming compression is that we can use the same 
> dictionary for compressing all messages (and adjust this dictionary 
> based on new data).
> We do not need to write dictionary and separate header for each record. 
> Otherwize compression of libpq messages will be completely useless: 
> typical size of message is too short to be efficiently compressed. The 
> main drawback of streaming compression is that you can not decompress 
> some particular message without decompression of all previous messages.
> This is why streaming compression can not be used to compress database 
> pages  (as it is done in

Re: Time to put context diffs in the grave

Hi,

On 2018-05-21 21:51:11 -0400, Andrew Dunstan wrote:
> 
> We haven't insisted on context diffs in years now, and one of my
> interlocutors has just turned handsprings trying to follow the advice at
>  to produce his first
> patch.
> 
> 
> Unless someone objects really violently I'm going to rip all that stuff out
> and let sanity prevail.

Could we please also change pg_regress' diff invocations? I find it
really painful to see differences in buildfarm output due to the
-C3. Locally I've long exported PG_REGRESS_DIFF_OPTS, but that doesn't
help on the BF.

Greetings,

Andres Freund

Re: Time to put context diffs in the grave

On Wed, May 23, 2018 at 05:01:58PM -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > OK, that's done. Now I think we can get rid of git-external-diff.
> 
> I for one rely on that.  I won't tell anyone else what kind of diff
> they have to read, but if you try to tell me what kind of diff I have
> to read, I'm going to complain.
> 
> > While we're about it, does anyone use make_diff any more? It seems
> > rather ancient and crufty.
> 
> Not me, but judging from the README, possibly Bruce still uses it.

I have local copies, so they can be removed.  I added them years ago in
case they helped anyone else.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

2018-06-19 Thread Alvaro Herrera

Hello hackers,

Any objections fixing this on Pg11?  My opinion is that it's better to
fix it now rather than waiting for pg12.  It's already broken in pg10
(xmltable) and older (rest of the xml stuff).  As Markus advised, I'd
not backpatch fixes for fear of breaking stuff, but I'd rather release
pg11 with the correct behavior.

I intend to get it done this week if there are no serious objections.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Hi,

On 2018-06-19 17:05:48 -0300, Matheus de Oliveira wrote:
> > You should first make sure it's actually this problem - which tables are
> > holding back the xmin horizon?
> 
> 
> How can I be sure? The tables are `pg_authid` and `pg_auth_members`, the
> following message is logged every minute (which I belive is due to
> `autovacuum_naptime`, which is using the default of 1 minute):

Yes, that sounds like the issue. Basically just wanted the table names:

> ERROR:  found xmin 3460221635 from before relfrozenxid 1245633870
> CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"

which indeed are shared relations.

Greetings,

Andres Freund

Re: Invisible Indexes

Hi,

On 2018-06-19 14:05:24 -0400, Robert Haas wrote:
> Yeah, I agree that a GUC seems more powerful and easier to roll out.
> A downside is that there could be cached plans still using that old
> index.  If we did DDL on the index we could be sure they all got
> invalidated, but otherwise how do we know?

Hm - it doesn't seem too hard to force an invalidation after SIGHUP and
certain config changes. Seems like that would be a good idea for other
existing GUCs anyway?

Greetings,

Andres Freund

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-06-19 Thread Matheus de Oliveira

On Tue, Jun 19, 2018 at 12:53 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-06-19 10:26:26 -0300, Matheus de Oliveira wrote:
> > Hello friends.
> >
> > On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund 
> wrote:
> >
> > >
> > > On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
> > > > I plan to go over the change again tomorrow, and then push. Unless
> > > > somebody has comments before then, obviously.
> > >
> > > Done.
> > >
> > >
> > Sorry to bother about this, but do you have any plan to do the minor
> > release before planned due to this bug?
>
> Unclear at this point.  There's been discussion about it, without coming
> to a conclusion.
>
>
Ok. Thank you for the information.

I really hope you decide to release it soon, I'm very afraid about the
users that have hit the bug but haven't noticed the issue.


>
> > I'm pondering what is the best option to avoid a forced shutdown of this
> > server:
> > - should I just wait for a release (if it is soon, I would be fine)?
> > - build PG from the git version by myself?
> > - or is there a safer workaround to the problem? (not clear to me if
> > deleting the `global/pg_internal.init` file is really the way to go, and
> > the details, is it safe? Should I stop the server, delete, start?)
>
> You should first make sure it's actually this problem - which tables are
> holding back the xmin horizon?


How can I be sure? The tables are `pg_authid` and `pg_auth_members`, the
following message is logged every minute (which I belive is due to
`autovacuum_naptime`, which is using the default of 1 minute):

ERROR:  found xmin 3460221635 from before relfrozenxid 1245633870
CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"
ERROR:  found xmin 3460221635 from before relfrozenxid 1245633870
CONTEXT:  automatic vacuum of table
"template0.pg_catalog.pg_auth_members"

Do you need controldata or more info to validate it?


>   After that, yes, deleting the
> global/pg_internal.init file is the way to go, and I can't think of a
> case where it's problematic, even without stopping the server.
>
>
I'll try that and get back to you if it worked or not. Thank you for the
confirmation.

And thank you for all clarifications.

Best regards,
-- 
Matheus de Oliveira

Re: Add necessary package list to ldap TAP's README

On Mon, May 21, 2018 at 10:34:25AM +0900, Michael Paquier wrote:
> Hi all,
> 
> The kerberos test suite mentions the package list to use on a set of
> distributions, which is very useful.  However we don't do the same for
> ldap.  Something like the attached would be adapted then to help setting
> up a test environment?
> 
> I have at hand only a Debian installation, and I have just guessed the
> RHEL and FreeBSD dependencies.

Patch applied to head, thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: Invisible Indexes

On Tue, Jun 19, 2018 at 12:22 PM, Euler Taveira  wrote:
> If we want to test the effect of disabling an index, we could set GUC
> only on the current session. DDL will make the index invisible
> immediately. Things can go worse after that. I prefer the former. It
> is more conservative but could confuse users if the effect is not
> immediate (few words could explain cached plans x invisible indexes).

If we're going to go that way, then we better not call them invisible
indexes. Invisible indexes are generally understood to be indexes that
are "invisible" to everyone -- not just the current session.

-- 
Peter Geoghegan

Re: Invisible Indexes

2018-06-19 Thread Euler Taveira

2018-06-19 15:05 GMT-03:00 Robert Haas :
> Yeah, I agree that a GUC seems more powerful and easier to roll out.
> A downside is that there could be cached plans still using that old
> index.  If we did DDL on the index we could be sure they all got
> invalidated, but otherwise how do we know?
>
If we want to test the effect of disabling an index, we could set GUC
only on the current session. DDL will make the index invisible
immediately. Things can go worse after that. I prefer the former. It
is more conservative but could confuse users if the effect is not
immediate (few words could explain cached plans x invisible indexes).

> BTW, like you, I seem to remember somebody writing an extension that
> did added a GUC that did exactly this, and demoing it at a conference.
> Maybe Oleg or Teodor?
>
https://github.com/postgrespro/plantuner


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Re: Invisible Indexes

On Mon, Jun 18, 2018 at 6:12 PM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> On Mon, Jun 18, 2018 at 2:57 PM, Tom Lane  wrote:
>>> I think the actually desirable way to handle this sort of thing is through
>>> an "index advisor" sort of plugin, which can hide a given index from the
>>> planner without any globally visible side-effects.
>
>> The globally visible side-effects are the point, though. Some users
>> desire cheap insurance against dropping what turns out to be the wrong
>> index.
>
> Perhaps there are use-cases where you want globally visible effects,
> but the primary use-case Andrew cited (i.e. EXPLAIN experimentation)
> would not want that.
>
> Anyway, if we do it with a GUC, the user can control the scope of
> the effects.

Yeah, I agree that a GUC seems more powerful and easier to roll out.
A downside is that there could be cached plans still using that old
index.  If we did DDL on the index we could be sure they all got
invalidated, but otherwise how do we know?

BTW, like you, I seem to remember somebody writing an extension that
did added a GUC that did exactly this, and demoing it at a conference.
Maybe Oleg or Teodor?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Fast default stuff versus pg_upgrade

"David G. Johnston"  writes:
> On Tue, Jun 19, 2018 at 9:37 AM, Tom Lane  wrote:
>> The problem here is that that function does not exist in 11beta1.
>> Since adding the "incoming" function is certainly going to require
>> initdb, we have to be able to dump from the server as it now stands,
>> or we'll be cutting existing beta testers adrift.

> I was under the impression that we don't promise to support a "v10 -> beta
> -> rc -> final" upgrade path; instead, once final is released people would
> be expected to upgrade "v10 -> v11".

Well, we don't *promise* beta testers that their beta databases will be
usable into production, but ever since pg_upgrade became available we've
tried to make it possible to pg_upgrade to the next beta or production
release.  I do not offhand recall any previous case where we failed to do
so.

regards, tom lane

Re: MERGE SQL statement for PG12

On Tue, Jun 19, 2018 at 4:06 AM, Pavan Deolasee
 wrote:
> I would like to resubmit the MERGE patch for PG12. The past discussions
> about the patch can be found here [1] [2].

FWIW, I'm really glad that you're going to work on this for v12.

-- 
Peter Geoghegan

Re: Excessive CPU usage in StandbyReleaseLocks()

Andres Freund  writes:
> On 2018-06-19 13:05:48 -0400, Robert Haas wrote:
>> I don't expect you to agree with my vote, but I stand by it.

> Yea, I didn't expect (but hoped!) to change your mind... ;)

FWIW, I agree with Robert --- a PANIC here will certainly create
problems, and it's much less clear what problems it might prevent.

However, the trouble with LOG is that we would be unlikely to
notice such a message in testing, so that a test case triggering
an issue of this type would not get recognized as such.

Hence, I have a modest proposal: use elog(LOG) followed by Assert(false),
so that debug builds report such problems loudly but production builds
do not.  We've done that before, see e.g. load_relcache_init_file around
relcache.c:5718.

regards, tom lane

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila  wrote:
>> I imagine that retail index tuple deletion (the whole point of this
>> project) would be run by a VACUUM-like process that kills tuples that
>> are dead to everyone. Even with something like zheap, you cannot just
>> delete index tuples until you establish that they're truly dead. I
>> guess that the delete marking stuff that Robert mentioned marks tuples
>> as dead when the deleting transaction commits.
>>
>
> No, I don't think that is the case because we want to perform in-place
> updates for indexed-column-updates.  If we won't delete-mark the index
> tuple before performing in-place update, then we will have two tuples
> in the index which point to the same heap-TID.

How can an old MVCC snapshot that needs to find the heap tuple using
some now-obsolete key values get to the heap tuple via an index scan
if there are no index tuples that stick around until "recently dead"
heap tuples become "fully dead"? How can you avoid keeping around both
old and new index tuples at the same time?

-- 
Peter Geoghegan

Re: ToDo: show size of partitioned table

2018-06-19 Thread Jeevan Ladhe

On Fri, Jun 1, 2018 at 8:51 PM, Pavel Stehule 
wrote:

>
>
> 2018-06-01 17:15 GMT+02:00 Ashutosh Bapat  >:
>
>> On Fri, Jun 1, 2018 at 12:56 AM, Pavel Stehule 
>> wrote:
>> > Hi
>> >
>> > postgres=# SELECT count(*) from data;
>> > ┌─┐
>> > │  count  │
>> > ╞═╡
>> > │ 100 │
>> > └─┘
>> > (1 row)
>> >
>> > \dt+ can display actual size of partitioned table data - now zero is
>> > displayed
>> >
>> > postgres=# \dt+ data
>> > List of relations
>> > ┌┬──┬───┬───┬─┬─┐
>> > │ Schema │ Name │ Type  │ Owner │  Size   │ Description │
>> > ╞╪══╪═══╪═══╪═╪═╡
>> > │ public │ data │ table │ pavel │ 0 bytes │ │
>> > └┴──┴───┴───┴─┴─┘
>> > (1 row)
>>
>> I think we should at least display "Type" as "partitioned table" for a
>> partitioned table, so that it's easy to understand why the size is 0;
>> partitioned tables do not hold any data by themselves.
>>
>
> should be.
>
>
Yes, or maybe we can add that info in "Description".


> Some is missing still - there is not any total size across all partitions.
>
> maybe new command like
>
> \dtP+ .. show partitioned tables and their size
>

+1


Regards,
Jeevan Ladhe

Re: Fast default stuff versus pg_upgrade

2018-06-19 Thread David G. Johnston

On Tue, Jun 19, 2018 at 9:37 AM, Tom Lane  wrote:

> The problem here is that that function does not exist in 11beta1.
> Since adding the "incoming" function is certainly going to require
> initdb, we have to be able to dump from the server as it now stands,
> or we'll be cutting existing beta testers adrift.
>

I was under the impression that we don't promise to support a "v10 -> beta
-> rc -> final" upgrade path; instead, once final is released people would
be expected to upgrade "v10 -> v11".  Under that condition requiring users
to do "v10 -> beta2" instead of "beta1 -> beta2", while annoying, is well
within the realm of possibility and expectation.

David J.

Re: Fast default stuff versus pg_upgrade





On 06/19/2018 01:19 PM, Tom Lane wrote:

Andres Freund  writes:

On 2018-06-19 12:37:52 -0400, Tom Lane wrote:

The problem here is that that function does not exist in 11beta1.
Since adding the "incoming" function is certainly going to require
initdb, we have to be able to dump from the server as it now stands,
or we'll be cutting existing beta testers adrift.

It'd probably not be too hard to write a plpgsql replacement for it,
should it come to that. Obviously it'd be nicer to not require users to
create that, but ...

After some thought, I think it's not that hard to get the support function
to accept the anyarray string form.  I was worried about issues like
whether float8 values would restore exactly, but really that's no worse
than a dump/reload today.  Basically, the support function would just need
to extract the target attribute's type and typmod from the pg_attribute
row, then call array_in().

I wonder though whether there are any interesting corner cases along
this line:

1. Create a column with a fast default.

2. Sometime later, alter the column so that the fast default value
is no longer a legal value.  If the fast default isn't in active use
in the table, the ALTER would go through; but if it does not remove
the attmissingval entry, then ...

3. Subsequently, pg_upgrade fails when the support function tries to
pass the attmissingval entry through the type input function.

The kind of case where this might fail is reducing the allowed
max len (typmod) for a varchar column.  I think ALTER TABLE is
smart enough to not rewrite the table for that, so that there
wouldn't be anything causing the fast default to get removed.





My experimentation showed this causing a rewrite. I think it only skips 
the rewrite if you make the allowed length greater, not smaller.


cheers

andrew

Re: [WIP] [B-Tree] Retail IndexTuple deletion

On Tue, Jun 19, 2018 at 2:33 AM, Masahiko Sawada  wrote:
> I think that we do the partial lazy vacuum using visibility map even
> now. That does heap pruning, index tuple killing but doesn't advance
> relfrozenxid.

Right, that's what I was thinking. Opportunistic HOT pruning isn't
like vacuuming because it doesn't touch indexes. This patch adds an
alternative strategy for conventional lazy vacuum that is also able to
run a page at a time if needed. Perhaps page-at-a-time operation could
later be used for doing something that is opportunistic in the same
way that pruning is opportunistic, but it's too early to worry about
that.

> Since this patch adds an ability to delete small amount
> of index tuples quickly, what I'd like to do with this patch is to
> invoke autovacuum more frequently, and do the target index deletion or
> the index bulk-deletion depending on amount of garbage and index size
> etc. That is, it might be better if lazy vacuum scans heap in ordinary
> way and then plans and decides a method of index deletion based on
> costs similar to what query planning does.

That seems to be what Andrey wants to do, though right now the
prototype patch actually just always uses its alternative strategy
while doing any kind of lazy vacuuming (some simple costing code is
commented out right now). It shouldn't be too hard to add some costing
to it. Once we do that, and once we polish the patch some more, we can
do performance testing. Maybe that alone will be enough to make the
patch worth committing; "opportunistic microvacuuming" can come later,
if at all.

-- 
Peter Geoghegan

Re: Fast default stuff versus pg_upgrade

On Tue, Jun 19, 2018 at 10:12 AM, Robert Haas  wrote:
> I have to admit that I think this feature is scary. I'm not sure that
> it was adequately reviewed and tested, and I'm worried this may not be
> the only problem it causes. But this particular problem, as Andres
> says, doesn't seem like anything we can't fix with acceptable risk.

I agree with both points.

-- 
Peter Geoghegan

Re: Excessive CPU usage in StandbyReleaseLocks()

On 2018-06-19 13:05:48 -0400, Robert Haas wrote:
> On Tue, Jun 19, 2018 at 1:01 PM, Andres Freund  wrote:
> > On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
> >> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund  wrote:
> >> > This should be a PANIC imo.
> >>
> >> -1.  As a developer, I would prefer PANIC.  But as an end-user, I
> >> would much rather have replay continue (with possible problems) than
> >> have it stopped cold in its tracks with absolutely nothing that I as
> >> the administrator can do to fix it.  We should be building this
> >> product for end users.
> >
> > Except that that just means the end-user will have an undebuggable
> > problem at their hand. Which will affect them as well.
> 
> I agree, but my guess is that a PANIC will affect them more.

Sure, in the short term. I think our disagreement is based on the a
different viewpoint: I think users are served better if a problem is
surfaced as close to the origin, because that makes quick diagnosis and
fix possible - even if that means sometimes erroring out in cases where
we could have just limped on without causing noticed problems.  You
think that trying to limp along as long as possible is good, because
that removes immediate pain from the user (and then prevents the user
from redirecting the pain to you).

> I don't expect you to agree with my vote, but I stand by it.

Yea, I didn't expect (but hoped!) to change your mind... ;)

Greetings,

Andres Freund

Re: Fast default stuff versus pg_upgrade

Andres Freund  writes:
> On 2018-06-19 12:37:52 -0400, Tom Lane wrote:
>> The problem here is that that function does not exist in 11beta1.
>> Since adding the "incoming" function is certainly going to require
>> initdb, we have to be able to dump from the server as it now stands,
>> or we'll be cutting existing beta testers adrift.

> It'd probably not be too hard to write a plpgsql replacement for it,
> should it come to that. Obviously it'd be nicer to not require users to
> create that, but ...

After some thought, I think it's not that hard to get the support function
to accept the anyarray string form.  I was worried about issues like
whether float8 values would restore exactly, but really that's no worse
than a dump/reload today.  Basically, the support function would just need
to extract the target attribute's type and typmod from the pg_attribute
row, then call array_in().

I wonder though whether there are any interesting corner cases along
this line:

1. Create a column with a fast default.

2. Sometime later, alter the column so that the fast default value
is no longer a legal value.  If the fast default isn't in active use
in the table, the ALTER would go through; but if it does not remove
the attmissingval entry, then ...

3. Subsequently, pg_upgrade fails when the support function tries to
pass the attmissingval entry through the type input function.

The kind of case where this might fail is reducing the allowed
max len (typmod) for a varchar column.  I think ALTER TABLE is
smart enough to not rewrite the table for that, so that there
wouldn't be anything causing the fast default to get removed.

regards, tom lane

Re: Fast default stuff versus pg_upgrade

On Tue, Jun 19, 2018 at 12:37 PM, Tom Lane  wrote:
> The problem here is that that function does not exist in 11beta1.
> Since adding the "incoming" function is certainly going to require
> initdb, we have to be able to dump from the server as it now stands,
> or we'll be cutting existing beta testers adrift.

That would still be less disruptive than ripping the feature out,
which would be cutting those same users adrift, too, unless I'm
missing something.

I have to admit that I think this feature is scary. I'm not sure that
it was adequately reviewed and tested, and I'm worried this may not be
the only problem it causes. But this particular problem, as Andres
says, doesn't seem like anything we can't fix with acceptable risk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WAL prefetch


On 06/19/2018 06:34 PM, Konstantin Knizhnik wrote:



On 19.06.2018 18:50, Andres Freund wrote:

On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
I do not think that prefetching in shared buffers requires much more 
efforts

and make patch more envasive...
It even somehow simplify it, because there is no to maintain own 
cache of

prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...

These arguments seem bogus to me. Otherwise the startup process is going
to do that work.


There is just one process replaying WAL. Certainly it has some impact on 
hot standby query execution.
But if there will be several prefetch workers (128???) then this impact 
will be dramatically increased.




The goal of prefetching is better saturation of the storage. Which means 
less bandwidth remaining for other processes (that have to compete for 
the same storage). I don't think "startup process is going to do that 
work" is entirely true - it'd do that work, but likely over longer 
period of time.


But I don't think this is an issue - I'd expect having some GUC defining 
how many records to prefetch (just like effective_io_concurrency).


Concerning WAL perfetch I still have a serious doubt if it is needed 
at all:
if checkpoint interval is less than size of free memory at the 
system, then

redo process should not read much.

I'm confused. Didn't you propose this?  FWIW, there's a significant
number of installations where people have observed this problem in
practice.


Well, originally it was proposed by Sean - the author of pg-prefaulter. 
I just ported it from GO to C using standard PostgreSQL WAL iterator.
Then I performed some measurements and didn't find some dramatic 
improvement in performance (in case of synchronous replication) or 
reducing replication lag for asynchronous replication neither at my 
desktop (SSD, 16Gb RAM, local replication within same computer, pgbench 
scale 1000), neither at pair of two powerful servers connected by

InfiniBand and 3Tb NVME (pgbench with scale 10).
Also I noticed that read rate at replica is almost zero.
What does it mean:
1. I am doing something wrong.
2. posix_prefetch is not so efficient.
3. pgbench is not right workload to demonstrate effect of prefetch.
4. Hardware which I am using is not typical.



pgbench is a perfectly sufficient workload to demonstrate the issue, all 
you need to do is use sufficiently large scale factor (say 2*RAM) and 
large number of clients to generate writes on the primary (to actually 
saturate the storage). Then the redo on replica won't be able to keep 
up, because the redo only fetches one page at a time.


So it make me think when such prefetch may be needed... And it caused 
new questions:

I wonder how frequently checkpoint interval is much larger than OS cache?


Pretty often. Furthermore, replicas may also run queries (often large 
ones), pushing pages related to redo from RAM.


If we enforce full pages writes (let's say each after each 1Gb), how it 
affect wal size and performance?




It would improve redo performance, of course, exactly because the page 
would not need to be loaded from disk. But the amount of WAL can 
increase tremendously, causing issues for network bandwidth 
(particularly between different data centers).


Looks like it is difficult to answer the second question without 
implementing some prototype.

May be I will try to do it.


Perhaps you should prepare some examples of workloads demonstrating the 
issue, before trying implementing a solution.



And if checkpoint interval is much larger than OS cache (are there cases
when it is really needed?)

Yes, there are.  Percentage of FPWs can cause serious problems, as do
repeated writouts by the checkpointer.


One more consideration: data is written to the disk as blocks in any 
case. If you updated just few bytes on a page, then still the whole page 
has to be written in database file.
So avoiding full page writes allows to reduce WAL size and amount of 
data written to the WAL, but not amount of data written to the database 
itself.
It means that if we completely eliminate FPW and transactions are 
updating random pages, then disk traffic is reduced less than two times...




I don't follow. What do you mean by "less than two times"? Surely the 
difference can be anything between 0 and infinity, depending on how 
often you write a single page.


The other problem with just doing FPI all the time is backups. To do 
physical backups / WAL archival, you need to store all the WAL segments. 
If the amount of WAL increases 10x you're going to be unhappy.






then quite small patch (as it seems to me now) forcing full page write
when distance between page LSN and current WAL insertion point exceeds
some threshold should eliminate random reads also in this case.

I'm pretty sure that that'll hurt a significant number of installations,
that

Re: Index Skip Scan

2018-06-19 Thread Jesper Pedersen


Hi Dmitry,

On 06/19/2018 06:01 AM, Dmitry Dolgov wrote:

On 18 June 2018 at 19:31, Alexander Korotkov  wrote:
Is skip scan only possible for index-only scan?  I guess, that no.  We
could also make plain index scan to behave like a skip scan.  And it
should be useful for accelerating DISTINCT ON clause.  Thus, we might
have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
IndexOnlySkipScan.  So, I don't think I like index scan nodes to
multiply this way, and it would be probably better to keep number
nodes smaller.  But I don't insist on that, and I would like to hear
other opinions too.


In one of patches I'm working on I had similar situation, when I wanted to
split one node into two similar nodes (before I just extended it) and logically
it made perfect sense. But it turned out to be quite useless and the advantage
I've got wasn't worth it - and just to mention, those nodes had more differences
than in this patch. So I agree that probably it would be better to keep using
IndexOnlyScan.



I looked at this today, and creating a new node (T_IndexOnlySkipScan) 
would make the patch more complex.


The question is if the patch should create such a node such that future 
patches didn't have to deal with refactoring to a new node to cover 
additional functionality.


Thanks for your feedback !

Best regards,
 Jesper

Re: Excessive CPU usage in StandbyReleaseLocks()

On Tue, Jun 19, 2018 at 1:01 PM, Andres Freund  wrote:
> On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
>> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund  wrote:
>> > This should be a PANIC imo.
>>
>> -1.  As a developer, I would prefer PANIC.  But as an end-user, I
>> would much rather have replay continue (with possible problems) than
>> have it stopped cold in its tracks with absolutely nothing that I as
>> the administrator can do to fix it.  We should be building this
>> product for end users.
>
> Except that that just means the end-user will have an undebuggable
> problem at their hand. Which will affect them as well.

I agree, but my guess is that a PANIC will affect them more.

> And they could just restart with hot_standby=off, and restart again. Or
> even just restart without the GUC change, because that will rebuild the
> locking state from a later state / start becoming ready for query at a
> later stage.

True, but that can still be a sufficient operational problem.

I don't expect you to agree with my vote, but I stand by it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Excessive CPU usage in StandbyReleaseLocks()

Hi,

On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund  wrote:
> > This should be a PANIC imo.
> 
> -1.  As a developer, I would prefer PANIC.  But as an end-user, I
> would much rather have replay continue (with possible problems) than
> have it stopped cold in its tracks with absolutely nothing that I as
> the administrator can do to fix it.  We should be building this
> product for end users.

Except that that just means the end-user will have an undebuggable
problem at their hand. Which will affect them as well.

And they could just restart with hot_standby=off, and restart again. Or
even just restart without the GUC change, because that will rebuild the
locking state from a later state / start becoming ready for query at a
later stage.

Greetings,

Andres Freund

Re: Index Skip Scan

2018-06-19 Thread Jesper Pedersen


Hi Alexander,

On 06/18/2018 01:31 PM, Alexander Korotkov wrote:

 wrote:

Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
a new node (T_IndexSkipScan) be created ? If latter, then there likely
will be functionality that needs to be refactored into shared code
between the nodes.


Is skip scan only possible for index-only scan?  I guess, that no.  We
could also make plain index scan to behave like a skip scan.  And it
should be useful for accelerating DISTINCT ON clause.  Thus, we might
have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
IndexOnlySkipScan.  So, I don't think I like index scan nodes to
multiply this way, and it would be probably better to keep number
nodes smaller.  But I don't insist on that, and I would like to hear
other opinions too.



Yes, there are likely other use-cases for Index Skip Scan apart from the 
simplest form. Which sort of suggests that having dedicated nodes would 
be better in the long run.


My goal is to cover the simplest form, which can be handled by extending 
 the T_IndexOnlyScan node, or by having common functions that both use. 
We can always improve the functionality with future patches.



Which is the best way to deal with the amcanbackward functionality ? Do
people see another alternative to Robert's idea of adding a flag to the
scan.


Supporting amcanbackward seems to be basically possible, but rather
complicated and not very efficient.  So, it seems to not worth
implementing, at least in the initial version. > And then the question
should how index access method report that it supports both skip scan
and backward scan, but not both together?  What about turning
amcanbackward into a function which takes (bool skipscan) argument?
Therefore, this function will return whether backward scan is
supported depending of whether skip scan is used.



The feedback from Robert Haas seems to suggest that it was a requirement 
for the patch to be considered.



I wasn't planning on making this a patch submission for the July
CommitFest due to the reasons mentioned above, but can do so if people
thinks it is best. The patch is based on master/4c8156.


Please, register it on commitfest.  If even there wouldn't be enough
of time for this patch on July commitfest, it's no problem to move it.



Based on the feedback from Andrew and Michael I won't register this 
thread until the September CF.


Thanks for your feedback !

Best regards,
 Jesper

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Hi,

On 2018-06-19 08:25:33 -0500, Jeremy Finzel wrote:
> I have a question related to this - and specifically, preventing the error
> until we have a patch :).

The patch has been committed to postgres, although no release has been
made including it yet.

> We are encountering this error every few weeks on one very high
> transaction db, and have to restart to fix it.

> If I read you correctly, the cache may never be invalidated for these
> catalogs even if I manually VACUUM them?  I was thinking if I routinely run
> VACUUM FREEZE on these tables in every database I might avoid the
> issue.

Correct, that won't help. In fact, it'll quite possibly make it worse.

> But given the cause of the issue, would that just make no difference and I
> will still hit the error eventually?

Yes. You might be able to get away with just removing the
pg_internal.init files before and after an explicit VACUUM on shared
system tables.

Greetings,

Andres Freund

Re: Fast default stuff versus pg_upgrade

Hi,

On 2018-06-19 12:37:52 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > And if the default value bugs us,
> > we can easily add a support function that dumps the value without the
> > anyarray adornment?
> 
> The problem here is that that function does not exist in 11beta1.
> Since adding the "incoming" function is certainly going to require
> initdb, we have to be able to dump from the server as it now stands,
> or we'll be cutting existing beta testers adrift.

It'd probably not be too hard to write a plpgsql replacement for it,
should it come to that. Obviously it'd be nicer to not require users to
create that, but ...

Greetings,

Andres Freund

Re: WAL prefetch

Hi,

On 2018-06-19 18:41:24 +0200, Tomas Vondra wrote:
> I'm confused. I thought you wanted to prefetch directly to shared buffers,
> so that it also works with direct I/O in the future. But now you suggest to
> use posix_fadvise() to work around the synchronous buffer read limitation. I
> don't follow ...

Well, I have multiple goals. For one I think using prefetching without
any sort of backpressure and mechanism to see which have completed will
result in hard to monitor and random performance. For another I'm
concerned with wasting a significant amount of memory for the OS cache
of all the read data that's guaranteed to never be needed (as we'll
*always* write to the relevant page shortly down the road).  For those
reasons alone I think prefetching just into the OS cache is a bad idea,
and should be rejected.

I also would want something that's more compatible with DIO. But people
pushed back on that, so...  As long as we build something that looks
like a request queue (which my proposal does), it's also something that
can later with some reduced effort be ported onto asynchronous io.

Greetings,

Andres Freund

Re: WAL prefetch

On 2018-06-19 19:34:22 +0300, Konstantin Knizhnik wrote:
> On 19.06.2018 18:50, Andres Freund wrote:
> > On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
> > > I do not think that prefetching in shared buffers requires much more 
> > > efforts
> > > and make patch more envasive...
> > > It even somehow simplify it, because there is no to maintain own cache of
> > > prefetched pages...
> > > But it will definitely have much more impact on Postgres performance:
> > > contention for buffer locks, throwing away pages accessed by read-only
> > > queries,...
> > These arguments seem bogus to me. Otherwise the startup process is going
> > to do that work.
> 
> There is just one process replaying WAL. Certainly it has some impact on hot
> standby query execution.
> But if there will be several prefetch workers (128???) then this impact will
> be dramatically increased.

Hence me suggesting how you can do that with one process (re locking). I
still entirely fail to see how "throwing away pages accessed by
read-only queries" is meaningful here - the startup process is going to
read the data anyway, and we *do not* want to use a ringbuffer as that'd
make the situation dramatically worse.

> Well, originally it was proposed by Sean - the author of pg-prefaulter. I
> just ported it from GO to C using standard PostgreSQL WAL iterator.
> Then I performed some measurements and didn't find some dramatic improvement
> in performance (in case of synchronous replication) or reducing replication
> lag for asynchronous replication neither at my desktop (SSD, 16Gb RAM, local
> replication within same computer, pgbench scale 1000), neither at pair of
> two powerful servers connected by
> InfiniBand and 3Tb NVME (pgbench with scale 10).
> Also I noticed that read rate at replica is almost zero.

> What does it mean:
> 1. I am doing something wrong.
> 2. posix_prefetch is not so efficient.
> 3. pgbench is not right workload to demonstrate effect of prefetch.
> 4. Hardware which I am using is not typical.

I think it's probably largely a mix of 3 and 4. pgbench with random
distribution probably indeed is a bad testcase, because either
everything is in cache or just about every write ends up as a full page
write because of the scale.  You might want to try a) turn of full page
writes b) use a less random distribution.

> So it make me think when such prefetch may be needed... And it caused new
> questions:
> I wonder how frequently checkpoint interval is much larger than OS
> cache?

Extremely common.

> If we enforce full pages writes (let's say each after each 1Gb), how it
> affect wal size and performance?

Extremely badly.  If you look at stats of production servers (using
pg_waldump) you can see that large percentage of the total WAL volume is
FPWs, that FPWs are a storage / bandwidth / write issue, and that higher
FPW rates after a checkpoint correlate strongly negatively with performance.

Greetings,

Andres Freund

Re: WAL prefetch





On 06/19/2018 05:50 PM, Andres Freund wrote:

On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:

I do not think that prefetching in shared buffers requires much more efforts
and make patch more envasive...
It even somehow simplify it, because there is no to maintain own cache of
prefetched pages...



But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...


These arguments seem bogus to me. Otherwise the startup process is going
to do that work.



Also there are two points which makes prefetching into shared buffers more
complex:
1. Need to spawn multiple workers to make prefetch in parallel and somehow
distribute work between them.


I'm not even convinced that's true. It doesn't seem insane to have a
queue of, say, 128 requests that are done with posix_fadvise WILLNEED,
where the oldest requests is read into shared buffers by the
prefetcher. And then discarded from the page cache with WONTNEED.  I
think we're going to want a queue that's sorted in the prefetch process
anyway, because there's a high likelihood that we'll otherwise issue
prfetch requets for the same pages over and over again.

That gets rid of most of the disadvantages: We have backpressure
(because the read into shared buffers will block if not yet ready),
we'll prevent double buffering, we'll prevent the startup process from
doing the victim buffer search.



I'm confused. I thought you wanted to prefetch directly to shared 
buffers, so that it also works with direct I/O in the future. But now 
you suggest to use posix_fadvise() to work around the synchronous buffer 
read limitation. I don't follow ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Fast default stuff versus pg_upgrade

On 2018-06-19 12:17:59 -0400, Andrew Dunstan wrote:
> 
> 
> On 06/19/2018 12:05 PM, Andres Freund wrote:
> > Hi,
> > 
> > On 2018-06-19 11:51:16 -0400, Andrew Dunstan wrote:
> > > My initial thought was that as a fallback we should disable pg_upgrade on
> > > databases containing such values, and document the limitation in the docs
> > > and the release notes. The workaround would be to force a table rewrite
> > > which would clear them if necessary.
> > I personally would say that that's not acceptable. People will start
> > using fast defaults - and you can't even do anything against it! - and
> > suddenly pg_upgrade won't work. But they will only notice that years
> > later, after collecting terrabytes of data in such tables.
> 
> 
> Umm, barring the case that Tom mentioned by then it would just work.

Huh?


> It's not the case that if they put in fast default values today they
> will never be able to upgrade.

How? I mean upgrading and loosing your default values certainly ain't
ok?  And we can't expect users to rewrite their tables, that's why we
added fast default support and why pg_upgrade is used.


> I'd at least like to see what a solution might look like before ruling it
> out. I suspect I can come up with something in a day or so. The work
> wouldn't be wasted.

I think it'd be unacceptable to release v11 without support, but I also
think it's quite possible to just add the necessary logic for v11 if we
put some effort into it. ISTM we've resolved worse issues during beta
than this.

Greetings,

Andres Freund

Re: WAL prefetch

2018-06-19 Thread Konstantin Knizhnik





On 19.06.2018 18:50, Andres Freund wrote:

On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:

I do not think that prefetching in shared buffers requires much more efforts
and make patch more envasive...
It even somehow simplify it, because there is no to maintain own cache of
prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...

These arguments seem bogus to me. Otherwise the startup process is going
to do that work.


There is just one process replaying WAL. Certainly it has some impact on 
hot standby query execution.
But if there will be several prefetch workers (128???) then this impact 
will be dramatically increased.






Also there are two points which makes prefetching into shared buffers more
complex:
1. Need to spawn multiple workers to make prefetch in parallel and somehow
distribute work between them.

I'm not even convinced that's true. It doesn't seem insane to have a
queue of, say, 128 requests that are done with posix_fadvise WILLNEED,
where the oldest requests is read into shared buffers by the
prefetcher. And then discarded from the page cache with WONTNEED.  I
think we're going to want a queue that's sorted in the prefetch process
anyway, because there's a high likelihood that we'll otherwise issue
prfetch requets for the same pages over and over again.

That gets rid of most of the disadvantages: We have backpressure
(because the read into shared buffers will block if not yet ready),
we'll prevent double buffering, we'll prevent the startup process from
doing the victim buffer search.



Concerning WAL perfetch I still have a serious doubt if it is needed at all:
if checkpoint interval is less than size of free memory at the system, then
redo process should not read much.

I'm confused. Didn't you propose this?  FWIW, there's a significant
number of installations where people have observed this problem in
practice.


Well, originally it was proposed by Sean - the author of pg-prefaulter. 
I just ported it from GO to C using standard PostgreSQL WAL iterator.
Then I performed some measurements and didn't find some dramatic 
improvement in performance (in case of synchronous replication) or 
reducing replication lag for asynchronous replication neither at my 
desktop (SSD, 16Gb RAM, local replication within same computer, pgbench 
scale 1000), neither at pair of two powerful servers connected by

InfiniBand and 3Tb NVME (pgbench with scale 10).
Also I noticed that read rate at replica is almost zero.
What does it mean:
1. I am doing something wrong.
2. posix_prefetch is not so efficient.
3. pgbench is not right workload to demonstrate effect of prefetch.
4. Hardware which I am using is not typical.

So it make me think when such prefetch may be needed... And it caused 
new questions:

I wonder how frequently checkpoint interval is much larger than OS cache?
If we enforce full pages writes (let's say each after each 1Gb), how it 
affect wal size and performance?


Looks like it is difficult to answer the second question without 
implementing some prototype.

May be I will try to do it.

And if checkpoint interval is much larger than OS cache (are there cases
when it is really needed?)

Yes, there are.  Percentage of FPWs can cause serious problems, as do
repeated writouts by the checkpointer.


One more consideration: data is written to the disk as blocks in any 
case. If you updated just few bytes on a page, then still the whole page 
has to be written in database file.
So avoiding full page writes allows to reduce WAL size and amount of 
data written to the WAL, but not amount of data written to the database 
itself.
It means that if we completely eliminate FPW and transactions are 
updating random pages, then disk traffic is reduced less than two times...






then quite small patch (as it seems to me now) forcing full page write
when distance between page LSN and current WAL insertion point exceeds
some threshold should eliminate random reads also in this case.

I'm pretty sure that that'll hurt a significant number of installations,
that set the timeout high, just so they can avoid FPWs.

May be, but I am not so sure. This is why I will try to investigate it more.



Greetings,

Andres Freund


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fast default stuff versus pg_upgrade

On 2018-06-19 12:17:56 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > I have no idea how large an actual fix might be. I'll at least start 
> > working on it immediately. I agree it's very late in the day.
> 
> On reflection, it seems like there are two moving parts needed:
> 
> * Add a binary-upgrade support function to the backend, which would take,
> say, table oid, column name, and some representation of the default value;
> 
> * Teach pg_dump when operating in binary-upgrade mode to emit a call to
> such a function for each column that has atthasmissing true.
> 
> The hard part here is how exactly are we going to represent the default
> value.  AFAICS, the only thing that pg_dump could readily lay its hands
> on is the "anyarray" textual representation of attmissingval, which maybe
> is okay but it means more work for the support function.

Isn't that just a few lines of code? And if the default value bugs us,
we can easily add a support function that dumps the value without the
anyarray adornment?


> Too bad we did not just store the value in bytea format.

That still seems the right thing to me, not being able in areasonable
way to inspect the default values in the catalog seems worse.  We could
have added a new non-array pseudo-type as well, but that's a fair bit of
work...

Greetings,

Andres Freund

Re: Fast default stuff versus pg_upgrade





On 06/19/2018 12:05 PM, Andres Freund wrote:

Hi,

On 2018-06-19 11:51:16 -0400, Andrew Dunstan wrote:

My initial thought was that as a fallback we should disable pg_upgrade on
databases containing such values, and document the limitation in the docs
and the release notes. The workaround would be to force a table rewrite
which would clear them if necessary.

I personally would say that that's not acceptable. People will start
using fast defaults - and you can't even do anything against it! - and
suddenly pg_upgrade won't work. But they will only notice that years
later, after collecting terrabytes of data in such tables.



Umm, barring the case that Tom mentioned by then it would just work. 
It's not the case that if they put in fast default values today they 
will never be able to upgrade.





If we can't fix it properly, then imo we should revert / neuter the
feature.



Have we ever recommended use of pg_upgrade for some manual catalog fix after
release? I don't recall doing so. Certainly it hasn't been common.

No, but why does it matter? Are you arguing we can delay pg_dump support
for fast defaults to v12?




Right now I'm more or less thinking out loud, not arguing anything.

I'd at least like to see what a solution might look like before ruling 
it out. I suspect I can come up with something in a day or so. The work 
wouldn't be wasted.


cheers

andrew

--
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Fast default stuff versus pg_upgrade

Andrew Dunstan  writes:
> I have no idea how large an actual fix might be. I'll at least start 
> working on it immediately. I agree it's very late in the day.

On reflection, it seems like there are two moving parts needed:

* Add a binary-upgrade support function to the backend, which would take,
say, table oid, column name, and some representation of the default value;

* Teach pg_dump when operating in binary-upgrade mode to emit a call to
such a function for each column that has atthasmissing true.

The hard part here is how exactly are we going to represent the default
value.  AFAICS, the only thing that pg_dump could readily lay its hands
on is the "anyarray" textual representation of attmissingval, which maybe
is okay but it means more work for the support function.  Too bad we did
not just store the value in bytea format.

regards, tom lane

Re: Fast default stuff versus pg_upgrade

Andres Freund  writes:
> On 2018-06-19 11:51:16 -0400, Andrew Dunstan wrote:
>> Have we ever recommended use of pg_upgrade for some manual catalog fix after
>> release? I don't recall doing so. Certainly it hasn't been common.

> No, but why does it matter?

We absolutely have, as recently as last month:

* Fix incorrect volatility markings on a few built-in functions
  (Thomas Munro, Tom Lane)

... can be fixed by manually updating these functions' pg_proc
entries, for example ALTER FUNCTION pg_catalog.query_to_xml(text,
boolean, boolean, text) VOLATILE. (Note that that will need to be
done in each database of the installation.) Another option is to
pg_upgrade the database to a version containing the corrected
initial data.

regards, tom lane

Re: Fast default stuff versus pg_upgrade

Hi,

On 2018-06-19 11:51:16 -0400, Andrew Dunstan wrote:
> My initial thought was that as a fallback we should disable pg_upgrade on
> databases containing such values, and document the limitation in the docs
> and the release notes. The workaround would be to force a table rewrite
> which would clear them if necessary.

I personally would say that that's not acceptable. People will start
using fast defaults - and you can't even do anything against it! - and
suddenly pg_upgrade won't work. But they will only notice that years
later, after collecting terrabytes of data in such tables.

If we can't fix it properly, then imo we should revert / neuter the
feature.

> Have we ever recommended use of pg_upgrade for some manual catalog fix after
> release? I don't recall doing so. Certainly it hasn't been common.

No, but why does it matter? Are you arguing we can delay pg_dump support
for fast defaults to v12?

Greetings,

Andres Freund

Re: Concurrency bug in UPDATE of partition-key

2018-06-19 Thread Amit Khandekar

On 19 June 2018 at 13:06, Dilip Kumar  wrote:
> On Mon, Jun 18, 2018 at 6:19 PM, Amit Khandekar  
> wrote:
>> On 18 June 2018 at 17:56, Amit Kapila  wrote:
>>> On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar  wrote:
 Should we also create a test case where we can verify that some
 unnecessary or duplicate triggers are not executed?

>>>
>>> I am not sure how much value we will add by having such a test.  In
>>> general, it is good to have tests that cover various aspects of
>>> functionality, but OTOH, we have to be careful to not overdo it.
>>
>> Actually I am thinking, it's not a big deal adding a RAISE statement
>> in trigger function in the existing testcases. It will clearly show how
>> many times the trigger has executed. So I will go ahead and do that.
>
> Ok,  That makes sense to me.

Could not add RAISE statement, because the isolation test does not
seem to print those statements in the output. Instead, I have dumped
some rows in a new table through the trigger function, and did a
select from that table. Attached is v3 patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


fix_concurrency_bug_v3.patch
Description: Binary data

Re: [HACKERS] GUC for cleanup indexes threshold.

2018-06-19 Thread Alexander Korotkov

On Tue, Jun 19, 2018 at 12:25 PM Masahiko Sawada  wrote:
> On Tue, Jun 19, 2018 at 5:43 PM, Alexander Korotkov
>  wrote:
> > On Tue, Jun 19, 2018 at 11:34 AM Masahiko Sawada  
> > wrote:
> >> On Mon, Jun 18, 2018 at 1:56 PM, Alexander Korotkov
> >> > So, I'm proposing to raise maximum valus of
> >> > vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?
> >> >
> >>
> >> I agree to expand the maximum value. But if users don't want index
> >> cleanup it would be helpful if we have an option (e.g. setting to -1)
> >> to disable index cleanup while documenting a risk of disabling index
> >> cleanup. It seems to me that setting very high values means the same
> >> purpose.
> >
> > Yes, providing an option to completely disable b-tree index cleanup
> > would be good.  But the problem is that we already use -1 value for
> > "use the default" in reloption.  So, if even we will make -1 guc
> > option to mean "never cleanup", then we still wouldn't be able to make
> > reloption to work this way.  Probably, we should use another "magical
> > value" in reloption for "use the default" semantics.
>
> Right. We can add a new reloption specifying whether we use default
> value of vacuum_index_cleanup_scale_factor or not but it might be
> overkill.

That would be overkill, and also that would be different from how
other reloptions behaves.

> >> Also, your patch lacks documentation update.
> >
> > Good catch, thank you.
> >
> >> BTW, I realized that postgresql.conf.sample doesn't have
> >> vacuum_cleanup_index_scale_factor option. Attached patch fixes it.
> >
> > It seems that you post a wrong attachment, because the patch you sent
> > is exactly same as mine.
> >
>
> Sorry, attached correct one.

Ok.  I've rephrased comment a bit.  Also, you created "index vacuum"
subsection in the "resource usage" section.  I think it's not
appropriate for this option to be in "resource usage".  Ideally it
should be grouped with other vacuum options, but we don't have single
section for that.  "Autovacuum" section is also not appropriate,
because this guc works not only for autovacuum.  So, most semantically
close options, which affects vacuum in general, are
vacuum_freeze_min_age, vacuum_freeze_table_age,
vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
which are located in "client connection defaults" section.  So, I
decided to move this GUC into this section.  I also change the section
in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.

I think we ideally should have a "vacuum" section, which would have
two subections: "client defaults" and "autovacuum".  But that should
be a separate patch, which needs to be discussed separately.
--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fa3c8a7905..859ef931e7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3248,7 +3248,7 @@ static struct config_real ConfigureNamesReal[] =
 	},
 
 	{
-		{"vacuum_cleanup_index_scale_factor", PGC_USERSET, AUTOVACUUM,
+		{"vacuum_cleanup_index_scale_factor", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Number of tuple inserts prior to index cleanup as a fraction of reltuples."),
 			NULL
 		},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f43086f6d0..5269297d74 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -580,6 +580,8 @@
 #xmloption = 'content'
 #gin_fuzzy_search_limit = 0
 #gin_pending_list_limit = 4MB
+#vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
+	# before index cleanup, 0 always performs index cleanup
 
 # - Locale and Formatting -

Re: missing toast table for pg_policy

On 2018-02-18 11:18:49 -0500, Tom Lane wrote:
> Joe Conway  writes:
> > Is there really a compelling reason to not just create toast tables for
> > all system catalogs as in the attached?
> 
> What happens when you VACUUM FULL pg_class?  (The associated toast table
> would have to be nonempty for the test to prove much.)
> 
> I'm fairly suspicious of toasting anything that the toast mechanism itself
> depends on, actually, and that would include at least pg_attribute and
> pg_index as well as pg_class.  Maybe we could get away with it because
> there would never be any actual recursion only potential recursion ...
> but it seems scary.
> 
> On the whole, I'm dubious that the risk/reward ratio is attractive here.

I think we should just review the code to make sure bootstrapping works
for pg_class and fix if necessary. We've discussed this repeatedly, and
this concern has been the blocker every time - at this point I'm fairly
sure we could have easily fixed the issues.

At least the simpler case, where the bootstrapped contents themselves
aren't toasted, shouldn't be hard to support. And that restriction seems
fairly easy to support, by dint of the population mechanism for
bootstrapped catalogs not supporting toasting.

What I'm not sure will work correctly without fixes is the relation
swapping logic for VACUUM FULL / CLUSTER. There's some pg_class specific
logic in there, that might need to be adapted for pg_class's toast
table.

Greetings,

Andres Freund

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Hi,

On 2018-06-19 10:26:26 -0300, Matheus de Oliveira wrote:
> Hello friends.
> 
> On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund  wrote:
> 
> >
> > On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
> > > I plan to go over the change again tomorrow, and then push. Unless
> > > somebody has comments before then, obviously.
> >
> > Done.
> >
> >
> Sorry to bother about this, but do you have any plan to do the minor
> release before planned due to this bug?

Unclear at this point.  There's been discussion about it, without coming
to a conclusion.


> I'm pondering what is the best option to avoid a forced shutdown of this
> server:
> - should I just wait for a release (if it is soon, I would be fine)?
> - build PG from the git version by myself?
> - or is there a safer workaround to the problem? (not clear to me if
> deleting the `global/pg_internal.init` file is really the way to go, and
> the details, is it safe? Should I stop the server, delete, start?)

You should first make sure it's actually this problem - which tables are
holding back the xmin horizon?  After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.

Greetings,

Andres Freund

Re: Fast default stuff versus pg_upgrade





On 06/19/2018 10:55 AM, Tom Lane wrote:

AFAICS, the fast-default patch neglected to consider what happens if
a database containing columns with active attmissingval entries is
pg_upgraded.  I do not see any code in either pg_dump or pg_upgrade that
attempts to deal with that situation, which means the effect will be
that the "missing" values will silently revert to nulls: they're still
null in the table storage, and the restored pg_attribute entries won't
have anything saying it should be different.

The pg_upgrade regression test fails to exercise such a case.  There is
only one table in the ending state of the regression database that has
any atthasmissing columns, and it's empty :-(.  If I add a table in
which there actually are active attmissingval entries, say according
to the attached patch, I get a failure in the pg_upgrade test.

This is certainly a stop-ship issue, and in fact it's bad enough
that I think we may need to pull the feature for v11.  Designing
binary-upgrade support for this seems like a rather large task
to be starting post-beta1.  Nor do I think it's okay to wait for
v12 to make it work; what if we have to force an initdb later in
beta, or recommend use of pg_upgrade for some manual catalog fix
after release?



Ouch!

I guess I have to say mea culpa.

My initial thought was that as a fallback we should disable pg_upgrade 
on databases containing such values, and document the limitation in the 
docs and the release notes. The workaround would be to force a table 
rewrite which would clear them if necessary.


Have we ever recommended use of pg_upgrade for some manual catalog fix 
after release? I don't recall doing so. Certainly it hasn't been common.


I have no idea how large an actual fix might be. I'll at least start 
working on it immediately. I agree it's very late in the day.


cheers

andrew

Re: WAL prefetch

On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
> I do not think that prefetching in shared buffers requires much more efforts
> and make patch more envasive...
> It even somehow simplify it, because there is no to maintain own cache of
> prefetched pages...

> But it will definitely have much more impact on Postgres performance:
> contention for buffer locks, throwing away pages accessed by read-only
> queries,...

These arguments seem bogus to me. Otherwise the startup process is going
to do that work.

> Also there are two points which makes prefetching into shared buffers more
> complex:
> 1. Need to spawn multiple workers to make prefetch in parallel and somehow
> distribute work between them.

I'm not even convinced that's true. It doesn't seem insane to have a
queue of, say, 128 requests that are done with posix_fadvise WILLNEED,
where the oldest requests is read into shared buffers by the
prefetcher. And then discarded from the page cache with WONTNEED.  I
think we're going to want a queue that's sorted in the prefetch process
anyway, because there's a high likelihood that we'll otherwise issue
prfetch requets for the same pages over and over again.

That gets rid of most of the disadvantages: We have backpressure
(because the read into shared buffers will block if not yet ready),
we'll prevent double buffering, we'll prevent the startup process from
doing the victim buffer search.

> Concerning WAL perfetch I still have a serious doubt if it is needed at all:
> if checkpoint interval is less than size of free memory at the system, then
> redo process should not read much.

I'm confused. Didn't you propose this?  FWIW, there's a significant
number of installations where people have observed this problem in
practice.

> And if checkpoint interval is much larger than OS cache (are there cases
> when it is really needed?)

Yes, there are.  Percentage of FPWs can cause serious problems, as do
repeated writouts by the checkpointer.

> then quite small patch (as it seems to me now) forcing full page write
> when distance between page LSN and current WAL insertion point exceeds
> some threshold should eliminate random reads also in this case.

I'm pretty sure that that'll hurt a significant number of installations,
that set the timeout high, just so they can avoid FPWs.

Greetings,

Andres Freund

Re: Query Rewrite for Materialized Views (Postgres Extension)

2018-06-19 Thread Nico Williams

On Tue, Jun 19, 2018 at 08:46:06AM +0100, Dent John wrote:
> I’m pretty impressed anything in this space can be written entirely in
> PlPGQSL!

https://github.com/twosigma/postgresql-contrib

PG is quite powerful!

I have even implemented a COMMIT TRIGGER in pure PlPgSQL.

You'll notice I make extensive use of record/table types.

> If you did integrate your implementation, it would be easy for my
> Extension to read from a table other than the one which it gets the MV
> definition from... Although having said that, if you went down the
> route you suggest, would not you make that “regular table” into a
> first class scheme object very much like Corey’s CONTINUOUS
> MATERIALIZED VIEW object concept?

I know nothing about the CONTINUOUS MATERIALIZED VIEW concept.  What
that would imply to me seems... difficult to achieve.  There will be
view queries that are difficult or impossible to automatically write
triggers for that update an MV synchronously.

> It is interesting that you can put triggers onto the table though, as
> that leads well in to use cases where it is desirable to “stack” MVs
> upon each other. (I’m not immediately sure whether such a use case is
> still needed in face of an always-up-to-date MV feature such as is
> described, but I’ve seen it elsewhere.)

I have done exactly this sort of MV chaining.

In my use case I had an MV of a nesting group transitive closure and
then another of a join between that and user group memberships to get a
complete user group transitive closure.

The transitive closure of nesting groups being computed via a RECURSIVE
CTE...  In principle one can understand such a query and automatically
write DMLs to update the MV on the fly (I've done this _manually_), but
the moment you do any bulk updates out of sequence you can't, and then
you have to refresh the view, so you see, I don't quite believe we can
have a true continuously materialized view :(

For me the key requirement is the ability to generate incremental
updates to an external system, but also the whole thing has to be fast.

> You say you’d like to base it off a VIEW’s AST (rather than, I
> presume, you must parse the reconstructed VIEW source text as SQL?),

PG already stores the AST.  There's no need to write a new parser when
PG already has one.  At the end of the day you need to analyze an AST
for the MV's source query in order to automatically write the triggers
to keep it updated (or mark it as needing a refresh).

> and I do agree — that’s probably the right direction... it does seem
> to me there is scope to leverage the “bottom half” of the ASSERTION
> stuff from Dave Fetter that Corey linked to — i.e., the part of it
> that manages triggers. Still leaves the AST crawling deciding what to
> actually do once a change is caught.

I'll search for this.

> Really good to hear about progress in this area.

Eh, I've not actually implemented any automatic generation of triggers
to update MVs.  I've written enough such triggers manually to believe
that *some* of them could be written by software.  If you look at my
sketch for how to do it, you'll notice that many of the sorts of queries
that one would choose to materialize... are not really amenable to this
treatment -- that's precisely because those make for the sorts of slow
queries that make you reach for materialization in the first place :(

But even so, automatically-generated triggers that mark an MV as needing
a refresh are always possible, and that is a huge improvement anyways,
especially if concurrent view refreshes can be made to go faster (by
having PKs on the MVs).  The idea is to have some sort of adaptive
automatic background, concurrent MV refresh running on a frequency based
in part of the amount of time it takes to refresh the VIEW.

BTW, MERGE would be a significant optimization for concurrent MV
refresh.  Think of MERGE as a statement that can scan a source, FULL
OUTER JOIN it to a target table, and for each row do an INSERT, UPDATE,
or DELETE -- this is 3x faster than the three INSERT/UPDATE/DELETE
statements you need to do the same work without a MERGE!

Nico
--

Re: Postgres 11 release notes

2018-06-19 Thread Alexander Korotkov

On Tue, Jun 19, 2018 at 6:18 PM Daniel Gustafsson  wrote:
> > On 19 Jun 2018, at 12:40, Alexander Korotkov  
> > wrote:
> >
> > On Tue, Jun 19, 2018 at 12:15 PM Alexander Korotkov
> >  wrote:
> >> On Sat, Jun 16, 2018 at 3:57 PM Darafei "Komяpa" Praliaskouski
> >>  wrote:
> 
> > I'm not sure it is usefull in release notes since it is more about API, 
> > and not
> > user-facing change. Just in case.
> > GiST opclasses now can omit compress and decompress functions. If 
> > compress
> > function is omited, IndexOnlyScan is enabled for opclass without any 
> > extra
> > change.
> > https://github.com/postgres/postgres/commit/
> > d3a4f89d8a3e500bd7c0b7a8a8a5ce1b47859128
> 
>  Uh, we do have this for SP-GiST:
> 
> Allow SP-GiST indexes to optionally use compression (Teodor 
>  Sigaev,
> Heikki Linnakangas, Alexander Korotkov, Nikita Glukhov)
> 
>  I am unclear how far downt the API stack I should go in documenting
>  changes like this.
> >>>
> >>>
> >>> It is also a bit misleading - the idea in that change is that now index 
> >>> representation can be a lossy version of actual data type (a box instead 
> >>> of polygon as a referende, so a changelog entry can tell "Allow SP-GiST 
> >>> index creation for polygon datatype.").  There is no "decompression" for 
> >>> such thing. "compression" sounds like gzip for me in user-facing context.
> >>
> >> +1 that current wording looks confusing.  But I think we need to
> >> highlight that we have general SP-GiST improvement, not just support
> >> for particular datatype.  So, I propose following wording: "Allow
> >> SP-GiST to use lossy representation of leaf keys, and add SP-GiST
> >> support for polygon type using that".
> >
> > Oh, I missed that we have separate release notes entry for polygons
> > indexing.  Then second part of sentence isn't needed, it should be
> > just "Allow SP-GiST to use lossy representation of leaf keys".  If no
> > objections, I'm going to commit that altogether with fixes by
> > Liudmila.
>
> Speaking of typos, I think I spotted two more: s/features is/feature is/.
> Fixed in the attached diff.

Committed, thanks!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Postgres 11 release notes

2018-06-19 Thread Daniel Gustafsson

> On 19 Jun 2018, at 12:40, Alexander Korotkov  
> wrote:
> 
> On Tue, Jun 19, 2018 at 12:15 PM Alexander Korotkov
>  wrote:
>> On Sat, Jun 16, 2018 at 3:57 PM Darafei "Komяpa" Praliaskouski
>>  wrote:
 
> I'm not sure it is usefull in release notes since it is more about API, 
> and not
> user-facing change. Just in case.
> GiST opclasses now can omit compress and decompress functions. If compress
> function is omited, IndexOnlyScan is enabled for opclass without any extra
> change.
> https://github.com/postgres/postgres/commit/
> d3a4f89d8a3e500bd7c0b7a8a8a5ce1b47859128
 
 Uh, we do have this for SP-GiST:
 
Allow SP-GiST indexes to optionally use compression (Teodor Sigaev,
Heikki Linnakangas, Alexander Korotkov, Nikita Glukhov)
 
 I am unclear how far downt the API stack I should go in documenting
 changes like this.
>>> 
>>> 
>>> It is also a bit misleading - the idea in that change is that now index 
>>> representation can be a lossy version of actual data type (a box instead of 
>>> polygon as a referende, so a changelog entry can tell "Allow SP-GiST index 
>>> creation for polygon datatype.").  There is no "decompression" for such 
>>> thing. "compression" sounds like gzip for me in user-facing context.
>> 
>> +1 that current wording looks confusing.  But I think we need to
>> highlight that we have general SP-GiST improvement, not just support
>> for particular datatype.  So, I propose following wording: "Allow
>> SP-GiST to use lossy representation of leaf keys, and add SP-GiST
>> support for polygon type using that".
> 
> Oh, I missed that we have separate release notes entry for polygons
> indexing.  Then second part of sentence isn't needed, it should be
> just "Allow SP-GiST to use lossy representation of leaf keys".  If no
> objections, I'm going to commit that altogether with fixes by
> Liudmila.

Speaking of typos, I think I spotted two more: s/features is/feature is/.
Fixed in the attached diff.

cheers ./daniel



relnotes11typo.diff
Description: Binary data

Re: WAL prefetch

On 06/19/2018 04:50 PM, Konstantin Knizhnik wrote:

On 19.06.2018 16:57, Ants Aasma wrote:
On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra
mailto:tomas.von...@2ndquadrant.com>>
wrote:

Right. My point is that while spawning bgworkers probably helps, I
don't
expect it to be enough to fill the I/O queues on modern storage
systems.
Even if you start say 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more
than 16
requests in the queue.

Consider for example [1] from 2014 where Merlin reported how S3500
(Intel SATA SSD) behaves with different effective_io_concurrency
values:

[1]

https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com

Clearly, you need to prefetch 32/64 blocks or so. Consider you may
have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).'

For reference, a typical datacenter SSD needs a queue depth of 128 to
saturate a single device. [1] Multiply that appropriately for RAID
arrays.So

How it is related with results for S3500 where this is almost now
performance improvement for effective_io_concurrency >8?
Starting 128 or more workers for performing prefetch is definitely not
acceptable...

I'm not sure what you mean by "almost now performance improvement", but
I guess you meant "almost no performance improvement" instead?

If that's the case, it's not quite true - increasing the queue depth
above 8 further improved the throughput by about ~10-20% (both by
duration and peak throughput measured by iotop).

But more importantly, this is just a single device - you typically have
multiple of them in a larger arrays, to get better capacity, performance
and/or reliability. So if you have 16 such drives, and you want to send
at least 8 requests to each, suddenly you need at least 128 requests.

And as pointed out before, S3500 is about 5-years old device (it was
introduced in Q2/2013). On newer devices the difference is usually way
more significant / the required queue depth is much higher.

Obviously, this is a somewhat simplified view, ignoring various details
(e.g. that there may be multiple concurrent queries, each sending I/O
requests - what matters is the combined number of requests, of course).
But I don't think this makes a huge difference.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Custom compression methods

On Mon, Apr 23, 2018 at 12:34 PM, Konstantin Knizhnik
 wrote:
> May be. But in any cases, there are several direction where compression can
> be used:
> - custom compression algorithms
> - libpq compression
> - page level compression
> ...
>
> and  them should be somehow finally "married" with each other.

I agree that we should try to avoid multiplying the number of
compression-related APIs.  Ideally there should be one API for
registering a compression algorithms, and then there can be different
methods of selecting that compression algorithm depending on the
purpose for which it will be used.  For instance, you could select a
column compression format using some variant of ALTER TABLE ... ALTER
COLUMN, but you would obviously use some other method to select the
WAL compression format.  However, it's a little unclear to me how we
would actually make the idea of a single API work.  For column
compression, we need everything to be accessible through the catalogs.
For something like WAL compression, we need it to be completely
independent of the catalogs.  Those things are opposites, so a single
API can't have both properties.  Maybe there can be some pieces
shared, but as much as I'd like it to be otherwise, it doesn't seem
possible to share it completely.

I also agree with Ildus and Alexander that we cannot and should not
try to solve every problem in one patch.  Rather, we should just think
ahead, so that we make as much of what goes into this patch reusable
in the future as we can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Fast default stuff versus pg_upgrade

AFAICS, the fast-default patch neglected to consider what happens if
a database containing columns with active attmissingval entries is
pg_upgraded.  I do not see any code in either pg_dump or pg_upgrade that
attempts to deal with that situation, which means the effect will be
that the "missing" values will silently revert to nulls: they're still
null in the table storage, and the restored pg_attribute entries won't
have anything saying it should be different.

The pg_upgrade regression test fails to exercise such a case.  There is
only one table in the ending state of the regression database that has
any atthasmissing columns, and it's empty :-(.  If I add a table in
which there actually are active attmissingval entries, say according
to the attached patch, I get a failure in the pg_upgrade test.

This is certainly a stop-ship issue, and in fact it's bad enough
that I think we may need to pull the feature for v11.  Designing
binary-upgrade support for this seems like a rather large task
to be starting post-beta1.  Nor do I think it's okay to wait for
v12 to make it work; what if we have to force an initdb later in
beta, or recommend use of pg_upgrade for some manual catalog fix
after release?

regards, tom lane

diff --git a/src/test/regress/expected/fast_default.out b/src/test/regress/expected/fast_default.out
index ef8d04f..f3d783c 100644
*** a/src/test/regress/expected/fast_default.out
--- b/src/test/regress/expected/fast_default.out
*** DROP TABLE has_volatile;
*** 548,550 
--- 548,561 
  DROP EVENT TRIGGER has_volatile_rewrite;
  DROP FUNCTION log_rewrite;
  DROP SCHEMA fast_default;
+ -- Leave a table with an active fast default in place, for pg_upgrade testing
+ set search_path = public;
+ create table has_fast_default(f1 int);
+ insert into has_fast_default values(1);
+ alter table has_fast_default add column f2 int default 42;
+ table has_fast_default;
+  f1 | f2 
+ +
+   1 | 42
+ (1 row)
+ 
diff --git a/src/test/regress/sql/fast_default.sql b/src/test/regress/sql/fast_default.sql
index 0e66033..7b9cc47 100644
*** a/src/test/regress/sql/fast_default.sql
--- b/src/test/regress/sql/fast_default.sql
*** DROP TABLE has_volatile;
*** 369,371 
--- 369,378 
  DROP EVENT TRIGGER has_volatile_rewrite;
  DROP FUNCTION log_rewrite;
  DROP SCHEMA fast_default;
+ 
+ -- Leave a table with an active fast default in place, for pg_upgrade testing
+ set search_path = public;
+ create table has_fast_default(f1 int);
+ insert into has_fast_default values(1);
+ alter table has_fast_default add column f2 int default 42;
+ table has_fast_default;

Re: WAL prefetch

2018-06-19 Thread Konstantin Knizhnik




On 19.06.2018 16:57, Ants Aasma wrote:
On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra 
mailto:tomas.von...@2ndquadrant.com>> 
wrote:


Right. My point is that while spawning bgworkers probably helps, I
don't
expect it to be enough to fill the I/O queues on modern storage
systems.
Even if you start say 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more
than 16
requests in the queue.

Consider for example [1] from 2014 where Merlin reported how S3500
(Intel SATA SSD) behaves with different effective_io_concurrency
values:

[1]

https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com

Clearly, you need to prefetch 32/64 blocks or so. Consider you may
have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).'


For reference, a typical datacenter SSD needs a queue depth of 128 to 
saturate a single device. [1] Multiply that appropriately for RAID 
arrays.So


How it is related with results for S3500  where this is almost now 
performance improvement for effective_io_concurrency >8?
Starting 128 or more workers for performing prefetch is definitely not 
acceptable...




Regards,
Ants Aasma

[1] 
https://www.anandtech.com/show/12435/the-intel-ssd-dc-p4510-ssd-review-part-1-virtual-raid-on-cpu-vroc-scalability/3


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Excessive CPU usage in StandbyReleaseLocks()

On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund  wrote:
> This should be a PANIC imo.

-1.  As a developer, I would prefer PANIC.  But as an end-user, I
would much rather have replay continue (with possible problems) than
have it stopped cold in its tracks with absolutely nothing that I as
the administrator can do to fix it.  We should be building this
product for end users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

On Mon, Jun 11, 2018 at 10:04 AM, Tom Lane  wrote:
> Given that this has been broken since forever, and there've been
> no complaints before now, I do not think the case for back-patching
> is strong enough to justify the problems it would cause.  Just
> put it in v11 and be done.

I'm not sure I understand what problem would be created by
back-patching.  It is true that anyone writing code that must work
with any version of PostgreSQL wouldn't able to count on this being
there, but the chances that anyone is writing such software using ECPG
is remote.  In other words, nobody's going to add a version number
test to their ECPG code.  They're just going to apply the update and
then use the new function.

I don't think this is a very important issue so I'm not prepared to
fight about it, but this looks pretty low-risk to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

On Sun, Jun 17, 2018 at 10:59 PM, David Rowley
 wrote:
> Thanks for looking.
>
> Robert, do you have any objections to the proposed patch?

I don't have time to study this right now, but I think the main
possible objection is around performance.  If not flattening the
Append is the best way to make queries run fast, then we should do it
that way.  If making pruning capable of coping with mixed hierarchies
is going to be faster, then we should do that.  If I were to speculate
in the absence of data, my guess would be that failing to flatten the
hierarchy is going to lead to a significant per-tuple cost, while the
cost of making run-time pruning smarter is likely to be incurred once
per rescan (i.e. a lot less).  But that might be wrong, and it might
be impractical to get this working perfectly in v11 given the time we
have.  But I would suggest that you performance test a query that ends
up feeding lots of tuples through two Append nodes rather than one and
see how much it hurts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

On Mon, Jun 18, 2018 at 11:36 PM, Amit Kapila  wrote:
> Fixed in the attached patch.  I will wait for a day or two to see if
> Tom or Robert wants to say something and then commit.

The patch LGTM but the commit message could perhaps use a little
word-smithing, e.g. "Commit ab72716778 allowed Parallel Append paths
to be generated for a relation that is not parallel safe.  Prevent
that from happening."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WAL prefetch

2018-06-19 Thread Ants Aasma

On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra 
wrote:

> Right. My point is that while spawning bgworkers probably helps, I don't
> expect it to be enough to fill the I/O queues on modern storage systems.
> Even if you start say 16 prefetch bgworkers, that's not going to be
> enough for large arrays or SSDs. Those typically need way more than 16
> requests in the queue.
>
> Consider for example [1] from 2014 where Merlin reported how S3500
> (Intel SATA SSD) behaves with different effective_io_concurrency values:
>
> [1]
>
> https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com
>
> Clearly, you need to prefetch 32/64 blocks or so. Consider you may have
> multiple such devices in a single RAID array, and that this device is
> from 2014 (and newer flash devices likely need even deeper queues).'
>

For reference, a typical datacenter SSD needs a queue depth of 128 to
saturate a single device. [1] Multiply that appropriately for RAID arrays.

Regards,
Ants Aasma

[1]
https://www.anandtech.com/show/12435/the-intel-ssd-dc-p4510-ssd-review-part-1-virtual-raid-on-cpu-vroc-scalability/3

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-06-19 Thread Jeremy Finzel

On Tue, Jun 19, 2018 at 8:26 AM Matheus de Oliveira <
matioli.math...@gmail.com> wrote:

> Hello friends.
>
> On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund  wrote:
>
>>
>> On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
>> > I plan to go over the change again tomorrow, and then push. Unless
>> > somebody has comments before then, obviously.
>>
>> Done.
>>
>>
> Sorry to bother about this, but do you have any plan to do the minor
> release before planned due to this bug?
>
> There seem to have too many users affected by this. And worst is that many
> users may not have even noticed they have the problem, which can cause
> `age(datfrozenxid)` to keep increasing until reachs 2.1 billions and the
> system goes down.
>
> In my case, I have a server that its `age(datfrozenxid)` is already at 1.9
> billions, and I expect it to reach 2.1 billions in about 14 days.
> Fortunately, I have monitoring system over `age(datfrozenxid)`, that is why
> I found the issue in one of my servers.
>
> I'm pondering what is the best option to avoid a forced shutdown of this
> server:
> - should I just wait for a release (if it is soon, I would be fine)?
> - build PG from the git version by myself?
> - or is there a safer workaround to the problem? (not clear to me if
> deleting the `global/pg_internal.init` file is really the way to go, and
> the details, is it safe? Should I stop the server, delete, start?)
>
> Best regards,
> --
> Matheus de Oliveira
>
>
> Restarting the database has fixed the error on these pg_catalog tables,
allowing us to vacuum them and avoid wraparound.

We first noticed a restart fixed the issue because SAN snapshots did not
have the error. The only difference really being shared memory and nothing
disk-level.

Jeremy

Re: missing toast table for pg_policy

2018-06-19 Thread John Naylor

On 2/20/18, Michael Paquier  wrote:
> Regression tests of pg_upgrade are failing as follows:
> New cluster database "postgres" is not empty
> Failure, exiting
> + rm -rf /tmp/pg_upgrade_check-Xn0ZLe

I looked into this briefly. The error comes from
check_new_cluster_is_empty() in src/bin/pg_upgrade/check.c, which
contains the comment

/* pg_largeobject and its index should be skipped */

If you don't create a toast table for pg_largeobject and
pg_largeobject_metadata, the pg_upgrade regression test passes.
Revised addendum attached. Adding two more exceptions to my alternate
implementation starts to make it ugly, so I haven't updated it for
now.

-John Naylor

> Michael
>
diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index ee54487..f259890 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -46,9 +46,9 @@ extern void BootstrapToastTable(char *relName,
  */
 
 /* normal catalogs */
-DECLARE_TOAST(pg_aggregate, 4139, 4140);
+DECLARE_TOAST(pg_aggregate, 4159, 4160);
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
-DECLARE_TOAST(pg_collation, 4141, 4142);
+DECLARE_TOAST(pg_collation, 4161, 4162);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
 DECLARE_TOAST(pg_default_acl, 4143, 4144);
 DECLARE_TOAST(pg_description, 2834, 2835);
@@ -59,8 +59,6 @@ DECLARE_TOAST(pg_foreign_server, 4151, 4152);
 DECLARE_TOAST(pg_foreign_table, 4153, 4154);
 DECLARE_TOAST(pg_init_privs, 4155, 4156);
 DECLARE_TOAST(pg_language, 4157, 4158);
-DECLARE_TOAST(pg_largeobject, 4159, 4160);
-DECLARE_TOAST(pg_largeobject_metadata, 4161, 4162);
 DECLARE_TOAST(pg_namespace, 4163, 4164);
 DECLARE_TOAST(pg_partitioned_table, 4165, 4166);
 DECLARE_TOAST(pg_policy, 4167, 4168);
diff --git a/src/test/regress/expected/misc_sanity.out 
b/src/test/regress/expected/misc_sanity.out
index 0be74d2..a6dacd4 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -88,15 +88,18 @@ WHERE c.oid < 16384 AND
   relkind = 'r' AND
   attstorage != 'p'
 ORDER BY 1,2;
-   relname|attname|   atttypid   
---+---+--
- pg_attribute | attacl| aclitem[]
- pg_attribute | attfdwoptions | text[]
- pg_attribute | attoptions| text[]
- pg_class | relacl| aclitem[]
- pg_class | reloptions| text[]
- pg_class | relpartbound  | pg_node_tree
- pg_index | indexprs  | pg_node_tree
- pg_index | indpred   | pg_node_tree
-(8 rows)
+ relname |attname|   atttypid   
+-+---+--
+ pg_attribute| attacl| aclitem[]
+ pg_attribute| attfdwoptions | text[]
+ pg_attribute| attmissingval | anyarray
+ pg_attribute| attoptions| text[]
+ pg_class| relacl| aclitem[]
+ pg_class| reloptions| text[]
+ pg_class| relpartbound  | pg_node_tree
+ pg_index| indexprs  | pg_node_tree
+ pg_index| indpred   | pg_node_tree
+ pg_largeobject  | data  | bytea
+ pg_largeobject_metadata | lomacl| aclitem[]
+(11 rows)

Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-19 Thread Jeevan Chalke

On Tue, Jun 19, 2018 at 2:13 PM, Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Mon, Jun 18, 2018 at 9:27 PM, Andres Freund  wrote:
>
>> On 2018-06-18 17:10:12 +0530, Jeevan Chalke wrote:
>> > On Mon, Jun 18, 2018 at 5:02 PM, Rajkumar Raghuwanshi <
>> > rajkumar.raghuwan...@enterprisedb.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > Below test case crashed, when set enable_partitionwise_aggregate to
>> true.
>> > >
>> >
>> > I will have a look over this.
>>
>
> In the reported testcase, parallel_workers is set to 0 for all partition
> (child) relations. Which means partial parallel paths are not possible for
> child rels. However, the parent can have partial parallel paths. Thus, when
> we have a full partitionwise aggregate possible (as the group by clause
> matches with the partition key), we end-up in a situation where we do
> create a partially_grouped_rel for the parent but there won't be any
> partially_grouped_live_children.
>
> The current code was calling add_paths_to_append_rel() without making sure
> of any live children present or not (sorry, it was my fault). This causes
> an Assertion failure in add_paths_to_append_rel() as this function assumes
> that it will have non-NIL live_childrels list.
>
> I have fixed partitionwise aggregate code which is calling
> add_paths_to_append_rel() by checking the live children list correctly. And
> for robustness, I have also added an Assert() in add_paths_to_append_rel().
>
> I have verified the callers of add_paths_to_append_rel() and except one,
> all are calling it by making sure that they have a non-NIL live children
> list. The one which is calling add_paths_to_append_rel() directly is
> set_append_rel_pathlist(). And I think, at that place, we will never have
> an empty live children list, I may be wrong though. And if that's the case
> then newly added Assert() in add_paths_to_append_rel() will fail anyway to
> catch any programming error in that code path.
>
> Attached patch fixing the crash and also added a simple test-coverage for
> that.
>
> Let me know if I missed any.
>

Rajkumar offlist reported another issue related to data-loss. If few of the
partitions has parallel_workers = 0, not all, then PWA plan ended up with a
plan having children which has parallel_workers != 0. So the partitions
with parallel_workers = 0; were not scanned.

Fixed this in attached version of the patch.


>
> Thanks
>
>
>> >
>> > Thanks for reporting.
>>
>> I've added an v11 open-items entry.
>>
>> - Andres
>>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 9eea7f4a83b3c295847774da9c1164fc8d3a5587 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Tue, 19 Jun 2018 10:44:33 +0530
Subject: [PATCH] Make sure that we have live children before we append them.

In create_partitionwise_grouping_paths(), we were calling
add_paths_to_append_rel() on empty live children which causing
an Assertion failure inside it. Thus, prevent calling
add_paths_to_append_rel() when there are no live children.

In passing, add an Assert() in add_paths_to_append_rel() to
check that it receives a valid live children list.

Also, it is very well possible that we could not have a
partially_grouped_rel for some of the children for which aggregation
in partial is not feasible. In such cases, we will end up having
fewer children in the partially_grouped_live_children list. Don't
create append rel in that case.

Jeevan Chalke
---
 src/backend/optimizer/path/allpaths.c |  3 +
 src/backend/optimizer/plan/planner.c  | 25 +++-
 src/test/regress/expected/partition_aggregate.out | 73 +++
 src/test/regress/sql/partition_aggregate.sql  | 20 +++
 4 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1462988..b9dffef 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1391,6 +1391,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	bool		build_partitioned_rels = false;
 	double		partial_rows = -1;
 
+	/* We should end-up here only when we have few live child rels. */
+	Assert(live_childrels != NIL);
+
 	/*
 	 * AppendPath generated for partitioned tables must record the RT indexes
 	 * of partitioned tables that are direct or indirect children of this
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 67a2c7a..ab27ad0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7011,12 +7011,14 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	int			cnt_parts;
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
+	int			dummy_children_cnt;
 	PathTarget

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-06-19 Thread Matheus de Oliveira

Hello friends.

On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund  wrote:

>
> On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
> > I plan to go over the change again tomorrow, and then push. Unless
> > somebody has comments before then, obviously.
>
> Done.
>
>
Sorry to bother about this, but do you have any plan to do the minor
release before planned due to this bug?

There seem to have too many users affected by this. And worst is that many
users may not have even noticed they have the problem, which can cause
`age(datfrozenxid)` to keep increasing until reachs 2.1 billions and the
system goes down.

In my case, I have a server that its `age(datfrozenxid)` is already at 1.9
billions, and I expect it to reach 2.1 billions in about 14 days.
Fortunately, I have monitoring system over `age(datfrozenxid)`, that is why
I found the issue in one of my servers.

I'm pondering what is the best option to avoid a forced shutdown of this
server:
- should I just wait for a release (if it is soon, I would be fine)?
- build PG from the git version by myself?
- or is there a safer workaround to the problem? (not clear to me if
deleting the `global/pg_internal.init` file is really the way to go, and
the details, is it safe? Should I stop the server, delete, start?)

Best regards,
-- 
Matheus de Oliveira

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-06-19 Thread Jeremy Finzel

On Fri, May 25, 2018 at 3:37 PM, Andres Freund  wrote:

> Hi,
>
> Moving discussion to -hackers.  Tom, I think you worked most with this
> code, your input would be appreciated.
>
> Original discussion is around:
> http://archives.postgresql.org/message-id/20180524211311.
> tnswfnjwnii54htx%40alvherre.pgsql
>
> On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
> > On 2018-May-24, Andres Freund wrote:
> > > Then there's also:
> > > http://archives.postgresql.org/message-id/1527193504642.
> 36340%40amazon.com
> >
> > ah, so deleting the relcache file makes the problem to go away?  That's
> > definitely pretty strange.  I see no reason for the value in relcache to
> > become out of step with the catalogued value in the same database ... I
> > don't think we transmit in any way values of one database to another.
>
> I can reproduce the issue. As far as I can tell we just don't ever
> actually update nailed relcache entries in the normal course, leaving
> the "physical address" aside.  VACUUM will, via
> vac_update_relstats() -> heap_inplace_update() ->
> CacheInvalidateHeapTuple(),
> send out an invalidation. But invalidation, in my case another session,
> will essentially ignore most of that due to:
>
> static void
> RelationClearRelation(Relation relation, bool rebuild)
> ...
> /*
>  * Never, never ever blow away a nailed-in system relation,
> because we'd
>  * be unable to recover.  However, we must redo
> RelationInitPhysicalAddr
>  * in case it is a mapped relation whose mapping changed.
>  *
>  * If it's a nailed-but-not-mapped index, then we need to re-read
> the
>  * pg_class row to see if its relfilenode changed. We do that
> immediately
>  * if we're inside a valid transaction and the relation is open
> (not
>  * counting the nailed refcount).  Otherwise just mark the entry as
>  * possibly invalid, and it'll be fixed when next opened.
>  */
> if (relation->rd_isnailed)
> {
> RelationInitPhysicalAddr(relation);
>
> if (relation->rd_rel->relkind == RELKIND_INDEX ||
> relation->rd_rel->relkind ==
> RELKIND_PARTITIONED_INDEX)
> {
> relation->rd_isvalid = false;   /* needs to be
> revalidated */
> if (relation->rd_refcnt > 1 &&
> IsTransactionState())
> RelationReloadIndexInfo(relation);
> }
> return;
> }
>
> Which basically means that once running we'll never update the relcache
> data for nailed entries.  That's unproblematic for most relcache fields,
> but not for things like RelationData->rd_rel->relfrozenxid / relminmxid.
>
> This'll e.g. lead to lazy_vacuum_rel() wrongly not using aggressive
> vacuums despite being required. And it'll lead, triggering this thread,
> to wrong errors being raised during vacuum because relfrozenxid just is
> some random value from the past.  I suspect this might also be
> co-responsible for a bunch of planning issues for queries involving the
> catalog, because the planner will use wrong relcache data until the next
> time the init file is thrown away?
>
> This looks like a very longstanding bug to me.  I'm not yet quite sure
> what the best way to deal with this is.  I suspect we might get away
> with just looking up a new version of the pg_class tuple and copying
> rd_rel over?
>
> Greetings,
>
> Andres Freund
>

I have a question related to this - and specifically, preventing the error
until we have a patch :).  We are encountering this error every few weeks
on one very high transaction db, and have to restart to fix it.

If I read you correctly, the cache may never be invalidated for these
catalogs even if I manually VACUUM them?  I was thinking if I routinely run
VACUUM FREEZE on these tables in every database I might avoid the issue.
But given the cause of the issue, would that just make no difference and I
will still hit the error eventually?

Thanks,
Jeremy

Re: WAL prefetch

On 06/19/2018 02:33 PM, Konstantin Knizhnik wrote:

On 19.06.2018 14:03, Tomas Vondra wrote:

On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote:

...

>>>
Also there are two points which makes prefetching into shared buffers
more complex:
1. Need to spawn multiple workers to make prefetch in parallel and
somehow distribute work between them.
2. Synchronize work of recovery process with prefetch to prevent
prefetch to go too far and doing useless job.
The same problem exists for prefetch in OS cache, but here risk of
false prefetch is less critical.

I think the main challenge here is that all buffer reads are currently
synchronous (correct me if I'm wrong), while the posix_fadvise()
allows a to prefetch the buffers asynchronously.

Yes, this is why we have to spawn several concurrent background workers
to perfrom prefetch.

Right. My point is that while spawning bgworkers probably helps, I don't
expect it to be enough to fill the I/O queues on modern storage systems.
Even if you start say 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more than 16
requests in the queue.

Consider for example [1] from 2014 where Merlin reported how S3500
(Intel SATA SSD) behaves with different effective_io_concurrency values:

[1]
https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com

Clearly, you need to prefetch 32/64 blocks or so. Consider you may have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).

ISTM a small number of bgworkers is not going to be sufficient. It might
be enough for WAL prefetching (where we may easily run into the
redo-is-single-threaded bottleneck), but it's hardly a solution for
bitmap heap scans, for example. We'll need to invent something else for
that.

OTOH my guess is that whatever solution we'll end up implementing for
bitmap heap scans, it will be applicable for WAL prefetching too. Which
is why I'm suggesting simply using posix_fadvise is not going to make
the direct I/O patch significantly more complicated.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-06-19 Thread Etsuro Fujita


(2018/06/15 20:56), Etsuro Fujita wrote:

Actually, I've
created a patch implementing that proposal.



But I think that patch needs more work, so I'm
planning to post it next week.


Here is a patch for that.

* As I said upthread, the patch makes code much more simple; I removed 
all the changes to setrefs.c added by the partitionwise-join patch.  I 
also simplified the logic for building a tlist for a child-join rel. 
The original PWJ computes attr_needed data even for child rels, and 
build the tlist for a child-join by passing to build_joinrel_tlist that 
data for input child rels for the child-join.  But I think that's 
redundant, and it's more straightforward to apply adjust_appendrel_attrs 
to the parent-join's tlist to get the child-join's tlist.  So, I changed 
that way, which made unnecessary all the changes to build_joinrel_tlist 
and placeholder.c added by the PWJ patch, so I removed those as well.


* The patch contains all of the regression tests in the original patch 
proposed by Ashutosh.


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 8215,8222  ALTER TABLE fprt2_p1 SET (autovacuum_enabled = 'false');
  ALTER TABLE fprt2_p2 SET (autovacuum_enabled = 'false');
  INSERT INTO fprt2_p1 SELECT i, i, to_char(i/50, 'FM') FROM generate_series(0, 249, 3) i;
  INSERT INTO fprt2_p2 SELECT i, i, to_char(i/50, 'FM') FROM generate_series(250, 499, 3) i;
! CREATE FOREIGN TABLE ftprt2_p1 PARTITION OF fprt2 FOR VALUES FROM (0) TO (250)
  	SERVER loopback OPTIONS (table_name 'fprt2_p1', use_remote_estimate 'true');
  CREATE FOREIGN TABLE ftprt2_p2 PARTITION OF fprt2 FOR VALUES FROM (250) TO (500)
  	SERVER loopback OPTIONS (table_name 'fprt2_p2', use_remote_estimate 'true');
  ANALYZE fprt2;
--- 8215,8223 
  ALTER TABLE fprt2_p2 SET (autovacuum_enabled = 'false');
  INSERT INTO fprt2_p1 SELECT i, i, to_char(i/50, 'FM') FROM generate_series(0, 249, 3) i;
  INSERT INTO fprt2_p2 SELECT i, i, to_char(i/50, 'FM') FROM generate_series(250, 499, 3) i;
! CREATE FOREIGN TABLE ftprt2_p1 (b int, c varchar, a int)
  	SERVER loopback OPTIONS (table_name 'fprt2_p1', use_remote_estimate 'true');
+ ALTER TABLE fprt2 ATTACH PARTITION ftprt2_p1 FOR VALUES FROM (0) TO (250);
  CREATE FOREIGN TABLE ftprt2_p2 PARTITION OF fprt2 FOR VALUES FROM (250) TO (500)
  	SERVER loopback OPTIONS (table_name 'fprt2_p2', use_remote_estimate 'true');
  ANALYZE fprt2;
***
*** 8269,8294  SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
  
  -- with whole-row reference
  EXPLAIN (COSTS OFF)
! SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2;
!QUERY PLAN
! -
   Sort
 Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
 ->  Append
   ->  Foreign Scan
!Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
   ->  Foreign Scan
!Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
  (7 rows)
  
! SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2;
!t1   |   t2   
  +
   (0,0,) | (0,0,)
   (150,150,0003) | (150,150,0003)
   (250,250,0005) | (250,250,0005)
   (400,400,0008) | (400,400,0008)
! (4 rows)
  
  -- join with lateral reference
  EXPLAIN (COSTS OFF)
--- 8270,8305 
  
  -- with whole-row reference
  EXPLAIN (COSTS OFF)
! SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
!QUERY PLAN   
! 
   Sort
 Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
 ->  Append
   ->  Foreign Scan
!Relations: (public.ftprt1_p1 t1) FULL JOIN (public.ftprt2_p1 t2)
   ->  Foreign Scan
!Relations: (public.ftprt1_p2 t1) FULL JOIN (public.ftprt2_p2 t2)
  (7 rows)
  
! SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2 t2 WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
!wr   |   wr   
  +
   (0,0,) | (0,0,)
+  (50,50,0001)   | 
+  (100,100,0002) | 
   (150,150,0003) | (150,150,0003)
+  (200,200,0004) | 
   (250,250,0005) | (250,250,0005)
+  (300,300,0006) | 
+  (350,350,0007) | 
   (400,400,0008) | (400,400,0008)
!  (450,450,0009) | 
! | (75,75,0001)
! | (225,225,0004)
! | (325,325,0006)
!

Re: WAL prefetch

2018-06-19 Thread Konstantin Knizhnik





On 19.06.2018 14:03, Tomas Vondra wrote:



On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote:



On 18.06.2018 23:47, Andres Freund wrote:

On 2018-06-18 16:44:09 -0400, Robert Haas wrote:
On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund  
wrote:
The posix_fadvise approach is not perfect, no doubt about that. 
But it
works pretty well for bitmap heap scans, and it's about 13249x 
better

(rough estimate) than the current solution (no prefetching).
Sure, but investing in an architecture we know might not live long 
also

has it's cost. Especially if it's not that complicated to do better.

My guesses are:

- Using OS prefetching is a very small patch.
- Prefetching into shared buffers is a much bigger patch.

Why?  The majority of the work is standing up a bgworker that does
prefetching (i.e. reads WAL, figures out reads not in s_b, does
prefetch). Allowing a configurable number + some synchronization 
between

them isn't that much more work.


I do not think that prefetching in shared buffers requires much more 
efforts and make patch more envasive...
It even somehow simplify it, because there is no to maintain own 
cache of prefetched pages...
But it will definitely have much more impact on Postgres performance: 
contention for buffer locks, throwing away pages accessed by 
read-only queries,...


Also there are two points which makes prefetching into shared buffers 
more complex:
1. Need to spawn multiple workers to make prefetch in parallel and 
somehow distribute work between them.
2. Synchronize work of recovery process with prefetch to prevent 
prefetch to go too far and doing useless job.
The same problem exists for prefetch in OS cache, but here risk of 
false prefetch is less critical.




I think the main challenge here is that all buffer reads are currently 
synchronous (correct me if I'm wrong), while the posix_fadvise() 
allows a to prefetch the buffers asynchronously.


Yes, this is why we have to spawn several concurrent background workers 
to perfrom prefetch.


I don't think simply spawning a couple of bgworkers to prefetch 
buffers is going to be equal to async prefetch, unless we support some 
sort of async I/O. Maybe something has changed recently, but every 
time I looked for good portable async I/O API/library I got burned.


Now, maybe a couple of bgworkers prefetching buffers synchronously 
would be good enough for WAL refetching - after all, we only need to 
prefetch data fast enough for the recovery not to wait. But I doubt 
it's going to be good enough for bitmap heap scans, for example.


We need a prefetch that allows filling the I/O queues with hundreds of 
requests, and I don't think sync prefetch from a handful of bgworkers 
can achieve that.


regards



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: MERGE SQL statement for PG12