Re: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-06-17 Thread Thomas Munro
On Tue, Jun 12, 2018 at 2:04 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> One question I have is whether it is against project policy to
>> backport a new file and a new user-facing function.
>
> Given that this has been broken since forever, and there've been
> no complaints before now, I do not think the case for back-patching
> is strong enough to justify the problems it would cause.  Just
> put it in v11 and be done.

Done.

> Also, this bit in the proposed documentation seems quite inappropriate:
>
> (This is a change from earlier releases of
> PostgreSQL ...
>
> We don't normally put in such comments at all, and if we do, we
> specify which version we're talking about.  Two years from now
> this'd be totally confusing.  I'd just make it read
>
> (This is important only on Windows, where ...

Done.

On Tue, Jun 12, 2018 at 1:09 PM, Tsunakawa, Takayuki
 wrote:
>> From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>> Why is it OK that we do "free(outp_sqlda)" having got that pointer from
>> a statement like "exec sql fetch 1 from mycur1 into descriptor outp_sqlda"?
>> Isn't that memory allocated by libecpg.dll?
>
> My colleague is now preparing a patch for that, which adds a function 
> ECPGFreeSQLDA() in libecpg.so.  That thread is here:
>
> https://www.postgresql.org/message-id/25C1C6B2E7BE044889E4FE8643A58BA963A42097@G01JPEXMBKW03

Thanks.  I will follow up on that thread.

> I want some remedy for older releases.  Our customer worked around this 
> problem by getting a libpq connection in their ECPG application and calling 
> PQfreemem().  That's an ugly kludge, and I don't want other users to follow 
> it.
>
> I don't see a problem with back-patching as-is, because existing users who 
> just call free() or don't call free() won't be affected.  I think that most 
> serious applications somehow state their supported minor releases like "this 
> application supports (or is certified against) PostgreSQL 10.5 or later", 
> just like other applications support "RHEL 6.2 or later" or "Windows XP Sp2 
> or later."

If there is a consensus that we should do that then I'll back-patch,
but so far no one else has spoken up in support.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Partitioning with temp tables is broken

2018-06-17 Thread Michael Paquier
On Mon, Jun 18, 2018 at 01:27:51PM +0900, Amit Langote wrote:
> On 2018/06/17 22:11, Michael Paquier wrote:
> Which checks do you think are missing other than those added by the
> proposed patch?

I was just wondering how this reacts when trying to attach a foreign
table to a partition tree made of temporary tables.  Things get checked
in MergeAttributes, and you have added a check for that:
+-- do not allow a foreign table to become a partition of temporary
+-- partitioned table
+create temp table temp_parted (a int) partition by list (a);

Those tests should be upper-case I think to keep consistency with the
surrounding code.

>> I am quickly looking at forbid-temp-parts-1.patch from previous message
>> https://postgr.es/m/a6bab73c-c5a8-2c25-f858-5d6d800a8...@lab.ntt.co.jp
>> and this shines per its lack of tests.  It would be easy enough to test
>> that temp and permanent relations are not mixed within the same session
>> for multiple levels of partitioning.  Amit, could you add some?  There
>> may be tweaks needed for foreign tables or such, but I have not looked
>> close enough at the problem yet..
> 
> OK, I have added some tests.  Thanks for looking!

Okay, I have looked at this patch and tested it manually and I can
confirm the following restrictions:
- Partition trees are supported for a set of temporary relations within
the same session.
- Trying to work with temporary relations from different sessions fails.
- Trying to attach a temporary partition to a permanent parent fails.
- Trying to attach a permanent relation to a temporary parent fails.
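
For illustration, here is roughly how two of those refusals look in a
session (relation names are mine; error texts as added by the patch):
=# create table perm_parted (a int) partition by list (a);
CREATE TABLE
=# create temp table temp_part partition of perm_parted for values in (1);
ERROR:  cannot create a temporary relation as partition of permanent relation "perm_parted"
=# create temp table temp_rel (a int);
CREATE TABLE
=# alter table perm_parted attach partition temp_rel for values in (1);
ERROR:  cannot attach a temporary relation as partition of permanent relation "perm_parted"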

+   /* If the parent is permanent, so must be all of its partitions. */
+   if (is_partition &&
+       relation->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+       relpersistence == RELPERSISTENCE_TEMP)
+       ereport(ERROR,
+               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
+                       RelationGetRelationName(relation))));
I had some second thoughts about this one actually because inheritance
trees allow a temporary child for a permanent parent:
=# create table aa_parent (a int);
CREATE TABLE
=# create temp table aa_temp_child (a int) inherits (aa_parent);
NOTICE:  00000: merging column "a" with inherited definition
LOCATION:  MergeAttributes, tablecmds.c:2306
CREATE TABLE
And they also disallow a permanent child to inherit from a temporary
parent:
=# create temp table aa_temp_parent (a int);
CREATE TABLE
=# create table aa_child (a int) inherits (aa_temp_parent);
ERROR:  42809: cannot inherit from temporary relation "aa_temp_parent"
I am not seeing any regression tests for that actually, so that would be
a nice addition, along with a test for inheritance of only temporary
relations.  That's not something for the patch discussed on this thread
to tackle.

Still, the partition bound checks make that kind of thing harder to
bypass, and the use case is not obvious, so let's keep the restriction as
suggested...

-   /* Ditto for the partition */
+   /* Ditto for the partition. */
Some noise here.

Adding a test which makes sure that partition trees made of only
temporary relations work would be nice.

Documenting all those restrictions and behaviors would be nice; why not
add a paragraph in ddl.sgml, under the section for declarative
partitioning?
--
Michael




Re: Concurrency bug in UPDATE of partition-key

2018-06-17 Thread Dilip Kumar
On Mon, Jun 18, 2018 at 10:21 AM, Amit Khandekar  wrote:
> Attached is v2 version of the patch. It contains the above
> trigger-related issue fixed.
>
> The updated tuple is passed back using the existing newslot parameter
> of GetTupleForTrigger(). When ExecBRDeleteTriggers() is called using a
> new epqslot parameter, it means the caller wants to skip the trigger
> execution, because the updated tuple needs to be checked again for
> constraints. I have added comments about this behaviour in the
> ExecBRDeleteTriggers() function header.
>

Thanks for the updated patch.  I have verified the BR trigger
behaviour; it's working fine with the patch.

+  CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
+  BEGIN
+ RETURN OLD;
+  END $$ LANGUAGE PLPGSQL;
+  CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
+   FOR EACH ROW EXECUTE PROCEDURE func_footrg();

Should we also create a test case to verify that unnecessary or
duplicate triggers are not executed?  For example, in the above trigger
function, we could maintain a count in some table (e.g. how many times
the delete trigger got executed) and verify it after the test is over.
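
A rough sketch of what I have in mind, reusing footrg1 from the patch
and a hypothetical counter table:

  CREATE TABLE footrg_count (n int);
  INSERT INTO footrg_count VALUES (0);
  CREATE FUNCTION func_footrg_count() RETURNS TRIGGER AS $$
  BEGIN
     UPDATE footrg_count SET n = n + 1;  -- record each firing
     RETURN OLD;
  END $$ LANGUAGE PLPGSQL;
  CREATE TRIGGER footrg_ondel_count BEFORE DELETE ON footrg1
     FOR EACH ROW EXECUTE PROCEDURE func_footrg_count();
  -- ... run the concurrent partition-key UPDATE scenario ...
  SELECT n FROM footrg_count;  -- expect exactly one firing per moved row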

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add function to release an allocated SQLDA

2018-06-17 Thread Thomas Munro
On Wed, Jun 13, 2018 at 4:29 PM, Kato, Sho  wrote:
> I added a function called ECPGfreeSQLDA() because there is no API for releasing
> the SQLDA storing the result set.

Hello Kato-san,

Thank you for sending the patch!

+ Alternatively, use the standard C library's free() function as
in the example below.

+  If the result set contains more than one record, an SQLDA
corresponding to each record is saved as a linked list.
+  It is possible to free an allocated memory area twice and crash
the application,
+  because ECPGfreeSQLDA() releases all SQLDAs associated with the
specified SQLDA.

This is not clear to me.  ECPGfreeSQLDA() releases a whole chain, but
free() only releases a single SQLDA, so they are obviously not
interchangeable.  When exactly should a user prefer one over the
other?  If there is a good reason to free just one OR the whole chain,
shouldn't we provide a way to do both of those things without the user
having to use libc free(), for the benefit of Windows users who can't
safely use free()?
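
(For what it's worth, walking the chain by hand is simple; a minimal
sketch, assuming the native sqlda_struct whose desc_next links the
per-row descriptors and that each node is a separate allocation -- on
Windows this free() would still have to come from the same CRT as the
allocation, which is exactly the hazard at hand:)

#include <stdlib.h>
#include <sqlda.h>      /* ECPG native SQLDA */

static void
free_sqlda_chain(struct sqlda_struct *sqlda)
{
    while (sqlda != NULL)
    {
        struct sqlda_struct *next = sqlda->desc_next;   /* next row's SQLDA */

        free(sqlda);            /* must match the allocating library's CRT */
        sqlda = next;
    }
}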

- * the new partition's info into its partition descriptor.  If there is a
+ * the new partition's info into its partition descriptor.  If there is a

This seems to be a spurious hunk, but I cannot see what changed.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-17 Thread Ashutosh Bapat
On Sat, Jun 16, 2018 at 9:29 AM, Robert Haas  wrote:
>
> By that logic, we should not have suggested that anyone use table
> inheritance, because they would have to change their configuration
> when we implemented table partitioning.  Indeed, switching from table
> inheritance to table partitioning is much more intrusive than dropping
> triggers on individual partitions and adding one on the partitioned
> table.  The latter only requires DDL on existing objects, but the
> former requires creating entirely new objects and moving all of your
> data.

That's a wrong comparison. Inheritance-based partitioning still works
after the declarative partitioning feature was added, so users can
continue using inheritance-based partitioning if they don't want to
move to declarative partitioning. That's not true here. A user creates
a BEFORE ROW trigger on each partition that mimics a partitioned-table
level BEFORE ROW trigger. The proposed BEFORE ROW trigger on a
partitioned table will cascade the trigger down to each partition. If
that fails to recognize that there is already an equivalent trigger, a
partition will get two triggers doing the same thing silently, without
any warning or notice. That's fine if the trigger changes the salaries
to $50K, but not OK if each of them increases the salary by 10%.
The database has limited ability to recognize what a trigger is doing.
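
A hypothetical pair of trigger functions makes the difference concrete
(names are mine):

CREATE FUNCTION cap_salary() RETURNS trigger AS $$
BEGIN
   NEW.salary := 50000;              -- absolute: a second firing changes nothing
   RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE FUNCTION raise_salary() RETURNS trigger AS $$
BEGIN
   NEW.salary := NEW.salary * 1.10;  -- relative: two firings give +21%, not +10%
   RETURN NEW;
END $$ LANGUAGE plpgsql;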

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] GUC for cleanup indexes threshold.

2018-06-17 Thread Alexander Korotkov
Hi!

On Sat, Jun 16, 2018 at 11:23 PM Darafei "Komяpa" Praliaskouski
 wrote:
> It is cool to see this in Postgres 11. However:
>
>>
>> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
>> Default value is 0.1.  So, by default cleanup scan is triggered after 
>> increasing of
>> table size by 10%.
>
>
> vacuum_cleanup_index_scale_factor can be set to a maximum of 100.
> I imagine that on a large append-only table with a limited storage IOPS budget
> it may happen that I would want to never perform a full scan on the index.
> Roughly, with the parameter set to 100, if we first vacuum the table with 1
> tuple and 130-byte-wide rows, we'll have a full scan at 130 bytes, 12 kbytes,
> 1.2MB, 123MB, 12 GB, 1.2TB.
>
> If we happen to perform the first vacuum when there are 4 tuples in the 
> table, it becomes 52kb, 5MB, 495MB, 48GB - and both 12GB and 48GB will 
> exhaust any storage spike IOPS budget, slowing everything down rather 
> suddenly.
>
> Can the upper limit for this GUC be lifted, or have a value for "never"?

I have done some further exploration of how the statistics obtained by
B-tree index vacuum cleanup are used.

1) The collected page and tuple counts are not used directly, but are
used for an estimation of tuple density per page, while the current
number of pages is estimated using smgr (see btcostestimate()).  So,
unless the density of tuples changes significantly, having stale index
statistics doesn't affect query plans.
2) Our optimization for skipping B-tree index vacuum cleanup works
only when the user manually vacuums a table in order to update the
visibility map.  Autovacuum is not triggered for append-only tables.
So, if the user doesn't take special care of append-only tables,
they're not vacuumed until the "autovacuum to prevent wraparound".
Thus, index statistics could be very stale.  And I don't think we have
many occurrences of issues with stale index statistics.
3) We have a very safe default of vacuum_cleanup_index_scale_factor
equal to 0.1.  But as Darafei claimed, the maximum value of 100 is
probably too low for advanced users, who really need the benefits of
this optimization.

So, I'm proposing to raise the maximum value of
vacuum_cleanup_index_scale_factor to DBL_MAX.  Any objections?
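
For reference, a quick sketch of the two ways to set the knob (the
index name is hypothetical):

SET vacuum_cleanup_index_scale_factor = 10.0;
ALTER INDEX some_btree_idx SET (vacuum_cleanup_index_scale_factor = 10.0);
-- with a DBL_MAX ceiling, a huge value like 1e10 would behave as "never"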

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


vacuum_cleanup_index_scale_factor-max.patch
Description: Binary data


Re: Concurrency bug in UPDATE of partition-key

2018-06-17 Thread Amit Khandekar
On 11 June 2018 at 15:29, Amit Kapila  wrote:
> On Mon, Jun 11, 2018 at 3:02 PM, Amit Kapila  wrote:
>> On Thu, Jun 7, 2018 at 1:53 PM, Amit Khandekar  
>> wrote:
>>> On 7 June 2018 at 11:44, Amit Kapila  wrote:
 On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar 
 wrote:

 I think this will allow before row delete triggers to be fired more than
 once.  Normally, if the EvalPlanQual testing generates a new tuple, we 
 don't
 fire the triggers again.
>>>
>>> If there are BR delete triggers, the tuple will be locked using
>>> GetTupleForTrigger(). So the subsequent EvalPlanQual testing won't be
>>> run, since the tuple is already locked due to triggers having run.
>>>
>>> But that leads me to think: The same concurrency issue can occur in
>>> GetTupleForTrigger() also. Say, a concurrent session has already
>>> locked the tuple, and GetTupleForTrigger() would wait and then return
>>> the updated tuple in its last parameter newSlot. In that case, we need
>>> to pass this slot back through ExecBRDeleteTriggers(), and further
>>> through epqslot parameter of ExecDelete(). But yes, in this case, we
>>> should avoid calling this trigger function the second time.
>>>
>>> If you agree on the above, I will send an updated patch.
>>>
>>
>> Sounds reasonable to me.

Attached is v2 version of the patch. It contains the above
trigger-related issue fixed.

The updated tuple is passed back using the existing newslot parameter
of GetTupleForTrigger(). When ExecBRDeleteTriggers() is called using a
new epqslot parameter, it means the caller wants to skip the trigger
execution, because the updated tuple needs to be checked again for
constraints. I have added comments about this behaviour in the
ExecBRDeleteTriggers() function header.

Because the trigger execution is skipped the first time ExecDelete()
is called, I didn't have to explicitly add any changes for preventing
trigger execution the second time.

> Try to add a test case which covers BR trigger code path where you are
> planning to update.

Added the test cases. Also added cases where the concurrent session
causes the EvalPlanQual() test to fail, so the partition key update
does not happen.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


fix_concurrency_bug_v2.patch
Description: Binary data


Re: Fix some error handling for read() and errno

2018-06-17 Thread Michael Paquier
On Thu, Jun 14, 2018 at 09:50:33AM -0400, Robert Haas wrote:
> On Mon, Jun 11, 2018 at 6:11 PM, Alvaro Herrera
>  wrote:
>> I would go as far as suggesting to remove qualifiers that indicate what
>> the file is for (such as "relation mapping file"); relying on the path
>> as indicator of what's going wrong should be sufficient, since it's an
>> error that affects internals anyway, not anything that users can do much
>> about.  Keep variations to a minimum, to ease translator's work;
>> sometimes it's hard enough to come up with good translations for things
>> like "relation mapping file" in the first place, and they don't help the
>> end user.
> 
> +1.  I think those things are often hard to phrase even in English.
> It makes the messages long, awkward, and almost invariably the style
> differs from one message to the next depending on the author and how
> easy it is to describe the type of file involved.

Okay, so this makes two votes in favor of keeping the messages simple
without context, shaped like "could not read file %s" (Robert and
Alvaro), and two votes for keeping some context (Andres and me).
Does anybody else have an opinion to offer?

Please note that I think that some messages should keep some context
anyway; for example, the WAL-related information is useful.  An example
is this one, where the offset is good to know about:
+   ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+   (errmsg("could not read from log segment %s, offset %u: read %d bytes, expected %d",
+   fname, readOff, r, XLOG_BLCKSZ)));
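
For the plain cases, the simple shape would look roughly like this
sketch (path, buf and len are assumed locals; the wording of the
short-read variant is exactly what is being debated):

    int         r;

    errno = 0;
    r = read(fd, buf, len);
    if (r < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read file \"%s\": %m", path)));
    if (r != len)
        ereport(ERROR,
                (errmsg("could not read file \"%s\": read %d of %d",
                        path, r, (int) len)));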
--
Michael




[WIP] [B-Tree] Retail IndexTuple deletion

2018-06-17 Thread Andrey V. Lepikhov

Hi,
I have written code for quick index tuple deletion from a relation by
heap tuple TID. The code relates to the "Retail IndexTuple deletion"
enhancement of the btree index on the postgresql wiki [1].

Briefly, it includes three steps:
1. Key generation for index tuple searching.
2. Index relation search for the tuple with the given heap tuple TID.
3. Deletion of the tuple from the index relation.

Currently, index relation cleaning is performed by vacuum, which scans
the whole index relation for dead entries sequentially, tuple by tuple.
When the number of dead entries is not large, this simplistic and safe
method can be significantly outperformed by retail deletion, which uses
index scans to search for the dead entries. It can also be used by
distributed systems to reduce the cost of a global index vacuum.


Patch '0001-retail-indextuple-deletion' introduces a new function
amtargetdelete() in the access method interface. Patch
'0002-quick-vacuum-strategy' implements this function for an alternative
strategy of lazy index vacuum, called 'Quick Vacuum'.


The code requires keeping DEAD tuple storage until the scan key is
created. In this version I add a 'target_index_deletion_factor' option.
If it is more than 0, heap_page_prune() uses ItemIdMarkDead() instead of
the ItemIdSetDead() function to set the DEAD flag while keeping the
tuple storage. The next step is to develop a background worker which
will collect pairs (tid, scankey) of DEAD tuples from the
heap_page_prune() function.
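
The difference between those two macros is the crux; roughly, per
src/include/storage/itemid.h (quoted from memory), ItemIdSetDead() also
zeroes the item's offset and length, releasing the storage for reuse,
while ItemIdMarkDead() only sets the flag, so the tuple bytes survive
long enough to build the scan key:

#define ItemIdSetDead(itemId) \
( \
    (itemId)->lp_flags = LP_DEAD, \
    (itemId)->lp_off = 0, \
    (itemId)->lp_len = 0 \
)

#define ItemIdMarkDead(itemId) \
( \
    (itemId)->lp_flags = LP_DEAD \
)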


Here are a test description and some execution time measurement results
showing the benefit of these patches:


Test:
-
create table test_range(id serial primary key, value integer);
insert into test_range (value) select random()*1e7/10^N from generate_series(1, 1e7);
(N is varied from 0 to 6, matching the n column in the table below;
larger N means fewer distinct values and hence more index entries deleted.)

DELETE FROM test_range WHERE value=1;
VACUUM test_range;

Results:


| n | t1, s  | t2, s  | speedup |
|---|--------|--------|---------|
| 0 | 0.0003 | 0.4476 | 1748.4  |
| 1 | 0.0006 | 0.5367 | 855.99  |
| 2 | 0.0042 | 0.9804 | 233.99  |
| 3 | 0.0477 | 1.6493 | 34.576  |
| 4 | 0.5600 | 2.4854 | 4.4382  |
| 5 | 3.3300 | 3.9555 | 1.2012  |
| 6 | 17.700 | 5.6000 | 0.3164  |
In the table, t1 is the measured execution time of lazy_vacuum_index()
using the Quick-Vacuum strategy; t2 is the measured execution time of
lazy_vacuum_index() using the Lazy-Vacuum strategy.


Note: a guaranteed bound on the time of the index scans used for quick
deletion will be achieved by storing equal-key index tuples in physical
TID order [2] with patch [3].


[1] 
https://wiki.postgresql.org/wiki/Key_normalization#Retail_IndexTuple_deletion
[2] 
https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
[3] 
https://www.postgresql.org/message-id/CAH2-WzkVb0Kom%3DR%2B88fDFb%3DJSxZMFvbHVC6Mn9LJ2n%3DX%3DkS-Uw%40mail.gmail.com


--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3..96f1d47 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -126,6 +126,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbc..a0e06bd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -103,6 +103,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182..acf14e7 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -58,6 +58,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42eff..d7a53d2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -80,6 +80,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/hash/hash.c b/src/back

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-17 Thread Michael Paquier
On Sun, Jun 17, 2018 at 07:33:01PM -0700, Andres Freund wrote:
> On 2018-06-17 22:31:02 -0400, Tom Lane wrote:
>> Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
>> recollection is that multiple fixes were needed to get to that point,
>> so I doubt it's worth trying to fix in older branches.
> 
> There were multiple fixes, that's right - I'd personally however still
> be in favor of trying to backpatch those.

It took me some time, but I have been able to find the thread which
discusses all the fixes needed to get parallel builds to work on older
branches without having to port PostgresNode.pm, which is a much higher
level of effort:
https://www.postgresql.org/message-id/CAB7nPqQTGveO3_zvnXQAs32cMosX8c76_ewe4L1cNL=1xzm...@mail.gmail.com
Could that be considered?

> Running a check world serially for the backbranches seriously slows
> down things for me.

Me too.  That was already frustrating one year ago, and it still is.
--
Michael




Re: Partitioning with temp tables is broken

2018-06-17 Thread Amit Langote
Hi.

On 2018/06/17 22:11, Michael Paquier wrote:
> On Thu, Jun 14, 2018 at 10:38:14PM -0400, Tom Lane wrote:
>> David Rowley  writes:
>>> On 15 June 2018 at 02:42, Tom Lane  wrote:
 I think that if possible, we should still allow a partitioned table
 in which all the rels are temp tables of the current session.  What we
 have to disallow is (a) temp/permanent mixes and (b) temp tables from
 different sessions.
>>
>>> So, this used to work in v10. Is it fine to just pluck the feature out
>>> of the v11 release and assume nobody cares?
>>
>> IIUC, it worked in v10 only for small values of "work".
> 
Yeah, if we could get to the set of points mentioned above, that would be
a consistent user-facing definition.  ATExecAttachPartition() is actually
heading toward that behavior, but its set of checks is incomplete.

Which checks do you think are missing other than those added by the
proposed patch?

> I am quickly looking at forbid-temp-parts-1.patch from previous message
> https://postgr.es/m/a6bab73c-c5a8-2c25-f858-5d6d800a8...@lab.ntt.co.jp
> and this shines per its lack of tests.  It would be easy enough to test
> that temp and permanent relations are not mixed within the same session
> for multiple levels of partitioning.  Amit, could you add some?  There
> may be tweaks needed for foreign tables or such, but I have not looked
> close enough at the problem yet..

OK, I have added some tests.  Thanks for looking!

Regards,
Amit
From 189f035ba83acced1622d9edb6619e03857967ea Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 15 Jun 2018 10:20:26 +0900
Subject: [PATCH v2] Disallow mixing partitions of differing relpersistence and
 visibility

---
 src/backend/commands/tablecmds.c   | 24 +---
 src/test/regress/expected/alter_table.out  | 12 
 src/test/regress/expected/create_table.out | 10 ++
 src/test/regress/expected/foreign_data.out | 10 ++
 src/test/regress/sql/alter_table.sql   | 12 
 src/test/regress/sql/create_table.sql  | 10 ++
 src/test/regress/sql/foreign_data.sql  | 11 +++
 7 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8b848f91a7..fb7cd561e2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1985,6 +1985,16 @@ MergeAttributes(List *schema, List *supers, char relpersistence,
                                                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                                                 errmsg("inherited relation \"%s\" is not a table or foreign table",
                                                                parent->relname)));
+
+                       /* If the parent is permanent, so must be all of its partitions. */
+                       if (is_partition &&
+                               relation->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+                               relpersistence == RELPERSISTENCE_TEMP)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                                                errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
+                                                               RelationGetRelationName(relation))));
+
                        /* Permanent rels cannot inherit from temporary ones */
                        if (relpersistence != RELPERSISTENCE_TEMP &&
                                relation->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -14135,7 +14145,15 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
                                                   RelationGetRelationName(rel),
                                                   RelationGetRelationName(attachrel))));
 
-       /* Temp parent cannot have a partition that is itself not a temp */
+       /* If the parent is permanent, so must be all of its partitions. */
+       if (rel->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+               attachrel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+               ereport(ERROR,
+                               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                                errmsg("cannot attach a temporary relation as partition of permanent relation \"%s\"",
+                                               RelationGetRelationName(rel))));
+
+       /* If the parent is temporary relation, so must be all of its partitions */
        if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP &&
                attachrel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
                ereport(ERROR,
@@ -14143,14 +14161,14 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
                                 errmsg("cannot attach a permanent relation as partition of temporary relation \"%s\"",
                                                RelationGetRelationName(rel))));
 
-       /* If the parent is 

Re: Removing "Included attributes in B-tree indexes" section from docs

2018-06-17 Thread Alexander Korotkov
On Sun, Jun 17, 2018 at 9:33 PM Peter Geoghegan  wrote:
> On Sat, Jun 16, 2018 at 8:51 PM, Alvaro Herrera
>  wrote:
> > I don't necessarily object to the proposed change, but I think you
> > should generally wait a bit longer for others to react.
>
> What wait period do you think is appropriate in this case?
>
> The doc section that I removed was a last minute addition to the
> covering index commit, commit 8224de4f, something that I was heavily
> involved in as a reviewer. I felt, rightly or wrongly, that I had
> discretion to commit within a relatively short period of time (a
> little over 24 hours) because of the specific circumstances: I knew
> that the doc section was not well considered in the first place, I
> thought that the question was clear cut, and I doubted that anyone
> else would follow up at all.

FWIW, I've no objections against removing this.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-06-17 Thread David Rowley
On 18 June 2018 at 14:36, Amit Langote  wrote:
> On 2018/06/15 20:41, David Rowley wrote:
>> If the top level Append is the UNION ALL one, then there won't be any
>> partitioned_rels. If that's what you mean by no-op then, yeah. There
>> are no duplicate locks already obtained in the parent with the child
>> Append node.
>
> Yeah, that's what I meant to say.  I checked whether the locks end up
> being taken twice, once in the UNION ALL parent's ExecInitAppend and then
> again in the individual child Appends' ExecInitAppend, and found that they
> don't, so the patch is right.

Thanks for looking.

Robert, do you have any objections to the proposed patch?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-06-17 Thread Amit Langote
On 2018/06/15 20:41, David Rowley wrote:
> On 15 June 2018 at 20:37, Amit Langote  wrote:
>> select * from partitioned_table_a
>> union all
>> select * from partitioned_table_b
>>
>> The only thing that changes with the patch is that
>> ExecLockNonLeafAppendTables is called *twice* for the two nested Appends
>> corresponding to partitioned_table_a and partitioned_table_b, resp.,
>> instead of just once for the top level Append corresponding to the UNION
>> ALL parent.  In fact, when called for the top level Append,
>> ExecLockNonLeafAppendTables is now a no-op.
> 
> If the top level Append is the UNION ALL one, then there won't be any
> partitioned_rels. If that's what you mean by no-op then, yeah. There
> are no duplicate locks already obtained in the parent with the child
> Append node.

Yeah, that's what I meant to say.  I checked whether the locks end up
being taken twice, once in the UNION ALL parent's ExecInitAppend and then
again in the individual child Appends' ExecInitAppend, and found that they
don't, so the patch is right.

Thanks,
Amit




Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-17 Thread Andres Freund
Hi,

On 2018-06-17 22:31:02 -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > Trying to run regression tests in parallel in ~9.5 leads to spurious
> > failures, which is annoying...  I had a patch fixing that but I cannot
> > put my finger on the thread where this has been discussed.
> 
> Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
> recollection is that multiple fixes were needed to get to that point,
> so I doubt it's worth trying to fix in older branches.

There were multiple fixes, that's right - I'd personally however still
be in favor of trying to backpatch those. Running a check world serially
for the backbranches seriously slows down things for me.

Greetings,

Andres Freund



Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-17 Thread Tom Lane
Michael Paquier  writes:
> Trying to run regression tests in parallel in ~9.5 leads to spurious
> failures, which is annoying...  I had a patch fixing that but I cannot
> put my finger on the thread where this has been discussed.

Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
recollection is that multiple fixes were needed to get to that point,
so I doubt it's worth trying to fix in older branches.

regards, tom lane



Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development

2018-06-17 Thread Michael Paquier
On Sun, Jun 17, 2018 at 10:57:16AM -0400, Tom Lane wrote:
> If we're just leaving them undefined, isn't this purely cosmetic?
> At least, that was what I understood to be the reasoning for leaving
> such symbols out of pg_config.h.win32 in the past.
>
> I'm on board with making things more consistent in HEAD, but not sure
> I see any need for back-patching.

I think that it is also important to give those flags some
visibility so that packagers can more easily find what to switch on/off
when they come across compilation failures with a given version of
OpenSSL, so that's also a matter of documenting things.  You are right
that, as the number of people doing packaging on Windows for Postgres
can perhaps be counted on the fingers of both hands, it may not be worth
bothering.  I maintain such stuff by the way, so this counts as one
finger up.

As there is visibly a consensus for HEAD, for now I propose to just
proceed with the previous patch on only the master branch.  This will
address the open item.  Any objections to that?
--
Michael




Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-17 Thread Michael Paquier
On Wed, Jun 13, 2018 at 09:00:47AM +0900, Michael Paquier wrote:
> Note for everybody on this list: I will be out for a couple of days at
> the end of this week, and my intention is to finish wrapping this patch
> at the beginning of next week, with a back-patch down to 9.5 where
> palloc_extended has been introduced as this has been seen a couple of
> times on -bugs or -hackers.  Feel free of course to comment and look at
> the thread for more background on the matter.

Okay, done this way.

Trying to run regression tests in parallel in ~9.5 leads to spurious
failures, which is annoying...  I had a patch fixing that but I cannot
put my finger on the thread where this has been discussed.
--
Michael




Re: Removing "Included attributes in B-tree indexes" section from docs

2018-06-17 Thread Peter Geoghegan
On Sat, Jun 16, 2018 at 8:51 PM, Alvaro Herrera
 wrote:
> I don't necessarily object to the proposed change, but I think you
> should generally wait a bit longer for others to react.

What wait period do you think is appropriate in this case?

The doc section that I removed was a last minute addition to the
covering index commit, commit 8224de4f, something that I was heavily
involved in as a reviewer. I felt, rightly or wrongly, that I had
discretion to commit within a relatively short period of time (a
little over 24 hours) because of the specific circumstances: I knew
that the doc section was not well considered in the first place, I
thought that the question was clear cut, and I doubted that anyone
else would follow up at all.

-- 
Peter Geoghegan



Re: row_to_json(), NULL values, and AS

2018-06-17 Thread Tom Lane
I wrote:
> So I'm now pretty well convinced that this is a good change and we
> should slip it into v11.

I wrote a test case and was about ready to commit this, when I started
wondering about the behavior for fdresult == FUNCDETAIL_MULTIPLE.
That is, suppose that the notation f(x) matches more than one candidate
function f, and there's also a column named f.  The old code would have
chosen the column interpretation, and so would the patch I posted
previously.  But I don't really see how we can square that behavior with
documenting that "we prefer the function interpretation for functional
notation".  I think you should get a "function is not unique" error in
such a case, not silently picking the column instead.  If it's the
column you want, write it like a column.

So the attached updated patch rearranges the error handling logic to
make that happen.

BTW, while trying to make a test case I was reminded that CREATE FUNCTION
tries to prevent this situation from arising:

regression=# create table t1 (f1 int);
CREATE TABLE
regression=# create function f1(t1) returns int as 'select 42' language sql;
ERROR:  "f1" is already an attribute of type t1

That's totally non-bulletproof, though, since you can still create the
situation by renaming, or by adding the conflicting column after you
create the function; not to mention Neil's original problem where the
conflicting function is declared to take record or some such.
Furthermore, this behavior itself creates a dump/reload hazard, since
once you've done one of those things pg_dump will produce output that
fails this check.

While I've not done anything about it here, I think we should just
remove that error check.  With this patch, functions and columns can
coexist reasonably peacefully, so we shouldn't need bogus hacks that
try to prevent it.
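
To make the intended resolution concrete, a sketch reusing the example
above (the conflict is created by adding the column afterwards, since
the CREATE FUNCTION guard would otherwise reject it):

regression=# create table t1 (a int);
regression=# create function f1(t1) returns int as 'select 42' language sql;
regression=# alter table t1 add column f1 int;
regression=# select t.f1 from t1 t;   -- field-name syntax: the column
regression=# select f1(t) from t1 t;  -- function syntax: the function, per this patch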

regards, tom lane

diff --git a/doc/src/sgml/rowtypes.sgml b/doc/src/sgml/rowtypes.sgml
index 3f24293..2f924b1 100644
*** a/doc/src/sgml/rowtypes.sgml
--- b/doc/src/sgml/rowtypes.sgml
*** SELECT c.somefunc FROM inventory_item c;
*** 441,449 
  Because of this behavior, it's unwise to give a function that takes a
  single composite-type argument the same name as any of the fields of
  that composite type.  If there is ambiguity, the field-name
! interpretation will be preferred, so that such a function could not be
! called without tricks.  One way to force the function interpretation is
! to schema-qualify the function name, that is, write
  schema.func(compositevalue).
 

--- 441,452 
  Because of this behavior, it's unwise to give a function that takes a
  single composite-type argument the same name as any of the fields of
  that composite type.  If there is ambiguity, the field-name
! interpretation will be chosen if field-name syntax is used, while the
! function will be chosen if function-call syntax is used.  However,
! PostgreSQL versions before 11 always chose the
! field-name interpretation, unless the syntax of the call required it to
! be a function call.  One way to force the function interpretation in
! older versions is to schema-qualify the function name, that is, write
  schema.func(compositevalue).
 

diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 21ddd5b..abe1dbc 100644
*** a/src/backend/parser/parse_func.c
--- b/src/backend/parser/parse_func.c
*** static Node *ParseComplexProjection(Pars
*** 49,63 
   *	For historical reasons, Postgres tries to treat the notations tab.col
   *	and col(tab) as equivalent: if a single-argument function call has an
   *	argument of complex type and the (unqualified) function name matches
!  *	any attribute of the type, we take it as a column projection.  Conversely
!  *	a function of a single complex-type argument can be written like a
!  *	column reference, allowing functions to act like computed columns.
   *
   *	Hence, both cases come through here.  If fn is null, we're dealing with
!  *	column syntax not function syntax, but in principle that should not
!  *	affect the lookup behavior, only which error messages we deliver.
!  *	The FuncCall struct is needed however to carry various decoration that
!  *	applies to aggregate and window functions.
   *
   *	Also, when fn is null, we return NULL on failure rather than
   *	reporting a no-such-function error.
--- 49,65 
   *	For historical reasons, Postgres tries to treat the notations tab.col
   *	and col(tab) as equivalent: if a single-argument function call has an
   *	argument of complex type and the (unqualified) function name matches
!  *	any attribute of the type, we can interpret it as a column projection.
!  *	Conversely a function of a single complex-type argument can be written
!  *	like a column reference, allowing functions to act like computed columns.
!  *
!  *	If both interpretations are possible, we prefer the one matching

Re: Query Rewrite for Materialized Views (Postgres Extension)

2018-06-17 Thread Pavel Stehule
2018-06-16 16:21 GMT+02:00 John Dent :

> Hi folks,
>
> I thought I’d share an update to my pet project, which dynamically
> rewrites queries to target materialized views when they are available and
> can satisfy a query (or part of it) with a lower cost plan.
>
> The extension is now a regular EXTENSION and no longer tied in to the FDW
> mechanism. As a result, it may now be more generally usable, and less
> complicated to integrate into an existing system. (The FDW approach was an
> easy way for me to get started, but it ultimately added complexity and was
> rather limiting.)
>
> Same caution as before applies:
>
> **NOTE: this is not "production ready" code — if it works for you, then
> great, but use it after thorough testing, and with appropriate caution.**
>
> Built, and has rudimentary testing against Postgres 10.1..10.3.
>
> Source code: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite
> README: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite/
> blob/master/README.md
>
> Hope it is useful or interesting for someone! Questions or comments are
> very welcome.
>

good idea.

Regards

Pavel



> denty.
>
> Begin original message:
>
> *From: *Dent John 
> *Subject: **Query Rewrite for Materialized Views (FDW Extension)*
> *Date: *5 April 2018 at 14:41:15 BST
> *To: *pgsql-hackers@lists.postgresql.org
>
> Hi,
>
> I wanted to share the project I've been working on which dynamically
> rewrites queries to target materialized views when views are available that
> can satisfy part of a query with lower cost plans.
>
> I embarked upon it as an interesting side project. It took me a bit more time
> than I anticipated, but the result works for my use case. Because of that,
> I thought it worth sharing. However I would caution that my use case is not
> exactly of a commercial scale... so please heed the following obligatory
> warning:
>
> **NOTE: this is not "production ready" code — if it works for you, then
> great, but use it after thorough testing, and with appropriate caution.**
>
> There are some limitations to the rewrite opportunities it takes up, and
> it will almost certainly fail on complex materialized views composed of
> deeply nested queries.
>
> The extension does not have extensive (actually: any) documentation, but
> the few test cases should make obvious to the inclined reader how it works.
> This is deliberate at this early a stage: I don't want to encourage
> uninformed adoption because of the possibility of data loss or incorrect
> query rewrites.
>
> The extension is written against a Postgres 10.1 source tree.
>
> Source code: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite
>
> Questions or comments are very welcome.
>
> denty.
>
>
>


Re: Microoptimization of Bitmapset usage in postgres_fdw

2018-06-17 Thread Daniel Gustafsson
> On 17 Jun 2018, at 14:47, Michael Paquier  wrote:
> 
> On Thu, Jun 14, 2018 at 08:14:54PM +, Bossart, Nathan wrote:
>> I'll go ahead and mark this as Ready for Committer.
> 
> Another thing not mentioned on this thread is that bms_membership is
> faster than bms_num_members by design with many members, so this change
> makes sense to shave a couple of cycles.
> 
> /*
>  * bms_num_members - count members of set
> + *
> + * In situations where the exact count isn't required, bms_membership can be
> + * used to test if the set has 0, 1 or multiple members.
>   */
> bms_membership is a couple of lines down, I am not sure I see much point
> in duplicating what's already present.

It is indeed a bit of a duplication, but the only reason this patch came to be
was that the original author of the instances in postgres_fdw had missed said
comment on bms_membership a few lines down.

> -   if (bms_num_members(clauses_attnums) < 2)
> +   if (bms_membership(clauses_attnums) != BMS_MULTIPLE)
> For this one, the comment above directly mentions that at least two
> attnums need to be present, so it seems to me that the current coding is
easier to understand and intentional...  So I would be inclined to not
change it.

I don’t have any strong feelings either way, and will leave that call to the
committer who picks this up.  I agree that the current coding is easy to
understand but I don’t see this being much harder.
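
For readers following along, the swapped pattern in a nutshell (a
sketch; attnums is an assumed Bitmapset):

#include "postgres.h"
#include "nodes/bitmapset.h"

static bool
has_multiple_attnums(Bitmapset *attnums)
{
    /* before: bms_num_members() walks and counts every member */
    /* return bms_num_members(attnums) > 1; */

    /* after: bms_membership() can answer as soon as it sees two members */
    return bms_membership(attnums) == BMS_MULTIPLE;
}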

> I think that this should not go into the tree until REL_11_STABLE gets
> created though, so this will have to wait a bit.

100% agree.

cheers ./daniel


Re: Slow planning time for simple query

2018-06-17 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom>  * During recovery we ignore killed tuples and don't bother to kill them
 Tom>  * either. We do this because the xmin on the primary node could easily be
 Tom>  * later than the xmin on the standby node, so that what the primary
 Tom>  * thinks is killed is supposed to be visible on standby. So for correct
 Tom>  * MVCC for queries during recovery we must ignore these hints and check
 Tom>  * all tuples. Do *not* set ignore_killed_tuples to true when running in a
 Tom>  * transaction that was started during recovery. xactStartedInRecovery
 Tom>  * should not be altered by index AMs.

 Tom> but it seems to me that this is not terribly carefully thought through.

 Tom> 1. If global xmin on the primary is later than on the standby,
 Tom> VACUUM could remove tuples that should be visible on the standby,
 Tom> and that would shortly propagate to the standby. Simply ignoring
 Tom> index kill bits on the standby won't prevent that.

Right, but we have conflict detection for vacuum's tuple removal
actions, which we don't have for the index hints.

 Tom> 2. Although _bt_killitems doesn't WAL-log its setting of kill
 Tom> bits, those bits could propagate to the standby anyway, as a
 Tom> result of a subsequent WAL action on the index page that gets a
 Tom> full-page image added.

That's OK as long as we're ignoring those hints on the standby.

 Tom> I believe that in some replication modes, #1 isn't a problem
 Tom> because we have mechanisms to hold back the primary's global xmin.

That's the case if feedback is enabled, but not if it's disabled, which
is sometimes done intentionally to ensure that long-running queries on
the standby don't hold back the master's xmin horizon.

Conflict detection then comes into play to kill the aforesaid
long-running queries before vacuuming away anything they might see.

-- 
Andrew (irc:RhodiumToad)



Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development

2018-06-17 Thread Tom Lane
Andrew Dunstan  writes:
>> On Thu, Jun 14, 2018 at 10:49:52AM +0900, Michael Paquier wrote:
>> Okay, this is still an open item.  Are there any objections to the
>> previous patch applied on master and the addition of the following
>> undefined flags to pg_config.h.win32 for back-branches?  Here is the
>> list of flags which I think should be added but disabled for consistency
>> with the rest and compatibility with OpenSSL 1.1.0:
>> HAVE_ASN1_STRING_GET0_DATA
>> HAVE_BIO_GET_DATA
>> HAVE_BIO_METH_NEW
>> HAVE_OPENSSL_INIT_SSL

> Looks OK.

If we're just leaving them undefined, isn't this purely cosmetic?
At least, that was what I understood to be the reasoning for leaving
such symbols out of pg_config.h.win32 in the past.

I'm on board with making things more consistent in HEAD, but not sure
I see any need for back-patching.

regards, tom lane



Re: Slow planning time for simple query

2018-06-17 Thread Tom Lane
[ trimming cc's ]

Andrew Gierth  writes:
> See index_fetch_heap:
>   /*
>* If we scanned a whole HOT chain and found only dead tuples, tell 
> index
>* AM to kill its entry for that TID (this will take effect in the next
>* amgettuple call, in index_getnext_tid).  We do not do this when in
>* recovery because it may violate MVCC to do so.  See comments in
>* RelationGetIndexScan().
>*/
>   if (!scan->xactStartedInRecovery)
>   scan->kill_prior_tuple = all_dead;

So the referenced comment in RelationGetIndexScan says

 * During recovery we ignore killed tuples and don't bother to kill them
 * either. We do this because the xmin on the primary node could easily be
 * later than the xmin on the standby node, so that what the primary
 * thinks is killed is supposed to be visible on standby. So for correct
 * MVCC for queries during recovery we must ignore these hints and check
 * all tuples. Do *not* set ignore_killed_tuples to true when running in a
 * transaction that was started during recovery. xactStartedInRecovery
 * should not be altered by index AMs.

but it seems to me that this is not terribly carefully thought through.

1. If global xmin on the primary is later than on the standby, VACUUM could
remove tuples that should be visible on the standby, and that would shortly
propagate to the standby.  Simply ignoring index kill bits on the standby
won't prevent that.

2. Although _bt_killitems doesn't WAL-log its setting of kill bits, those
bits could propagate to the standby anyway, as a result of a subsequent
WAL action on the index page that gets a full-page image added.

I believe that in some replication modes, #1 isn't a problem because we
have mechanisms to hold back the primary's global xmin.  If that's true
always, then all of this logic is obsolete silliness that we could remove.
If it's not true always, we have bigger problems to fix.

Similarly, the possibility of #2 means the standby can't trust the kill
bits to reflect its own xmin anyway.  So if we aren't junking this entire
stack of logic, we could perhaps allow the standby to set kill bits per
its own understanding of global xmin, and then allow the kill bits to be
used in SnapshotNonVacuumable scans even though they're unsafe for regular
MVCC scans.

However, I'm still caffeine-short so maybe there's something I missed.

regards, tom lane



Re: SCRAM with channel binding downgrade attack

2018-06-17 Thread Michael Paquier
On Fri, Jun 15, 2018 at 05:23:27PM -0400, Robert Haas wrote:
> On Thu, Jun 14, 2018 at 7:43 AM, Magnus Hagander  wrote:
>> I still think that the fact that we are still discussing what is basically
>> the *basic concepts* of how this would be set up after we have released
>> beta1 is a clear sign that this should not go into 11.
> 
> +1.

Yes, that sounds right.
--
Michael




Re: Partitioning with temp tables is broken

2018-06-17 Thread Michael Paquier
On Thu, Jun 14, 2018 at 10:38:14PM -0400, Tom Lane wrote:
> David Rowley  writes:
>> On 15 June 2018 at 02:42, Tom Lane  wrote:
>>> I think that if possible, we should still allow a partitioned table
>>> in which all the rels are temp tables of the current session.  What we
>>> have to disallow is (a) temp/permanent mixes and (b) temp tables from
>>> different sessions.
> 
>> So, this used to work in v10. Is it fine to just pluck the feature out
>> of the v11 release and assume nobody cares?
> 
> IIUC, it worked in v10 only for small values of "work".

Yeah, if we could get to the set of points mentioned above, that would be
a consistent user-facing definition.  ATExecAttachPartition() is actually
heading toward that behavior, but its set of checks is incomplete.

I am quickly looking at forbid-temp-parts-1.patch from previous message
https://postgr.es/m/a6bab73c-c5a8-2c25-f858-5d6d800a8...@lab.ntt.co.jp
and this shines per its lack of tests.  It would be easy enough to test
that temp and permanent relations are not mixed within the same session
for multiple levels of partitioning.  Amit, could you add some?  There
may be tweaks needed for foreign tables or such, but I have not looked
close enough at the problem yet..
--
Michael




Re: Microoptimization of Bitmapset usage in postgres_fdw

2018-06-17 Thread Michael Paquier
On Thu, Jun 14, 2018 at 08:14:54PM +, Bossart, Nathan wrote:
> I'll go ahead and mark this as Ready for Committer.

Another thing not mentioned on this thread is that bms_membership is
faster than bms_num_members by design with many members, so this change
makes sense to shave a couple of cycles.

 /*
  * bms_num_members - count members of set
+ *
+ * In situations where the exact count isn't required, bms_membership can be
+ * used to test if the set has 0, 1 or multiple members.
   */
bms_membership is a couple of lines down, I am not sure I see much point
in duplicating what's already present.

-   if (bms_num_members(clauses_attnums) < 2)
+   if (bms_membership(clauses_attnums) != BMS_MULTIPLE)
For this one, the comment above directly mentions that at least two
attnums need to be present, so it seems to me that the current coding is
easier to understand and intentional...  So I would be inclined to not
change it.

I think that this should not go into the tree until REL_11_STABLE gets
created though, so this will have to wait a bit.
--
Michael




Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development

2018-06-17 Thread Andrew Dunstan




On 06/17/2018 08:15 AM, Michael Paquier wrote:

On Thu, Jun 14, 2018 at 10:49:52AM +0900, Michael Paquier wrote:

Thoughts?

Okay, this is still an open item.  Are there any objections to the
previous patch applied on master and the addition of the following
undefined flags to pg_config.h.win32 for back-branches?  Here is the
list of flags which I think should be added but disabled for consistency
with the rest and compatibility with OpenSSL 1.1.0:
HAVE_ASN1_STRING_GET0_DATA
HAVE_BIO_GET_DATA
HAVE_BIO_METH_NEW
HAVE_OPENSSL_INIT_SSL




Looks OK.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development

2018-06-17 Thread Michael Paquier
On Thu, Jun 14, 2018 at 10:49:52AM +0900, Michael Paquier wrote:
> Thoughts?

Okay, this is still an open item.  Are there any objections to the
previous patch applied on master and the addition of the following
undefined flags to pg_config.h.win32 for back-branches?  Here is the
list of flags which I think should be added but disabled for consistency
with the rest and compatibility with OpenSSL 1.1.0:
HAVE_ASN1_STRING_GET0_DATA
HAVE_BIO_GET_DATA
HAVE_BIO_METH_NEW
HAVE_OPENSSL_INIT_SSL
--
Michael




Re: Slow planning time for simple query

2018-06-17 Thread Andrew Gierth
> "Amit" == Amit Kapila  writes:

 >> Presumably the problem is that the standby isn't authorized to change
 >> the btree index's "entry is dead" bits,

 Amit> I don't see anything like that in the code. We use _bt_killitems
 Amit> to mark the items as dead and neither that function or any of its
 Amit> caller has any such assumption.

See index_fetch_heap:

/*
 * If we scanned a whole HOT chain and found only dead tuples, tell index
 * AM to kill its entry for that TID (this will take effect in the next
 * amgettuple call, in index_getnext_tid).  We do not do this when in
 * recovery because it may violate MVCC to do so.  See comments in
 * RelationGetIndexScan().
 */
if (!scan->xactStartedInRecovery)
    scan->kill_prior_tuple = all_dead;

(this is the only place where kill_prior_tuple can be set to true)

-- 
Andrew (irc:RhodiumToad)



Re: pgsql: Store 2PC GID in commit/abort WAL recs for logical decoding

2018-06-17 Thread Nikhil Sontakke
Hi Alvaro,


>> There was a slight oversight in the twophase_gid length calculation in
>> the XactLogCommitRecord() code path in the cf5a1890592 commit. The
>> corresponding XactLogAbortRecord() code path was ok. PFA, a small
>> patch to fix the oversight.
>
> Forgot to add: maybe it would be useful to have tests in core where
> these omissions become evident.  Do you have some?
>
Thanks for the commit.

I do have some tests. They are part of the "logical decoding of 2PC"
patch which adds the needed infrastructure to *actually* use this code
present in the core as of now. I am going to submit it in the upcoming
commitfest.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: WAL prefetch

2018-06-17 Thread Konstantin Knizhnik




On 17.06.2018 03:00, Andres Freund wrote:

On 2018-06-16 23:25:34 +0300, Konstantin Knizhnik wrote:


On 16.06.2018 22:02, Andres Freund wrote:

On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:

On 06/15/2018 08:01 PM, Andres Freund wrote:

On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:

On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
 wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache...  what about shared buffers?


It is a good question. I thought a lot about prefetching directly into
shared buffers.

I think that's definitely how this should work.  I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.


Could you elaborate why prefetching into s_b is so much better (I'm sure it
has advantages, but I suppose prefetching into page cache would be much
easier to implement).

I think there's a number of issues with just issuing prefetch requests
via fadvise etc:

- it leads to guaranteed double buffering, in a way that's just about
guaranteed to *never* be useful. Because we'd only prefetch whenever
there's an upcoming write, there's simply no benefit in the page
staying in the page cache - we'll write out the whole page back to the
OS.

Sorry, I do not completely understand this.
Prefetch is only needed for a partial update of a page - in this case we
need to first read the page from the disk

Yes.



before being able to perform the update. So before "we'll write out the
whole page back to the OS" we have to read this page.
And if the page is in the OS cache (prefetched), then this can be done
much faster.

Yes.



Please notice that at the moment of prefetch there is no double
buffering.

Sure, but as soon as it's read there is.



Since the page was not accessed before, it is not present in shared
buffers. And once the page is updated, there is really no need to keep
it in shared buffers. We can use cyclic buffers (as in the case of
sequential scan or bulk update) to prevent the redo process from
throwing away useful pages from shared buffers. So once again there
will be no double buffering.

That's a terrible idea. There's a *lot* of spatial locality of further
WAL records arriving for the same blocks.


In some cases it is true, in some cases not. In a typical OLTP system,
if a record is updated, there is a high probability that it will be
accessed again soon. So if such a system performs write requests on the
master and read-only queries on the replicas, keeping updated pages in
shared buffers on a replica can be very helpful.

But if a replica is used mostly for running analytic queries while the
master performs some updates, then it is more useful to keep indexes and
the most frequently accessed pages in the replica's cache, rather than
recent updates from the master.

So at least it seems reasonable to have such a parameter and let the DBA
choose the caching policy on replicas.







I am not so familiar with the current implementation of the
full-page-write mechanism in Postgres, so maybe my idea explained below
is stupid or already implemented (but I failed to find any traces of
this). Prefetch is needed only for WAL records performing a partial
update; a full page write doesn't require prefetch.
A full page write has to be performed when the page is updated for the
first time after a checkpoint.
But what if we slightly extend this rule and perform a full page write
also when the distance from the previous full page write exceeds some
delta (somehow related to the size of the OS cache)?

In this case, even if the checkpoint interval is larger than the OS
cache size, we can still expect that updated pages are present in the OS
cache. And no WAL prefetch is needed at all!

We could do so, but I suspect the WAL volume penalty would be
prohibitive in many cases. Worthwhile to try though.


Well, the typical size of a server's memory is now several hundred
gigabytes.
Certainly some of this memory is used for shared buffers, backends' work
memory, ...
But still there are hundreds of gigabytes of free memory which can be
used by the OS for caching.
Let's assume that the full-page-write threshold is 100GB. So one extra
8kB per 100GB of WAL!
Certainly that is an estimate for only one page, and it is more
realistic to expect that we would have to force full page writes for
most of the updated pages. But I still do not believe that it will cause
significant growth of the log size.


Another question is why we choose such a large checkpoint interval: more
than a hundred gigabytes.
Certainly frequent checkpoints have a negative impact on performance.
But 100GB is not "too frequent" in any case...
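
(For context, the prefetch primitive that pg_wal_prefetch relies on
upthread boils down to a sketch like the following; fd/block handling is
assumed, and note it schedules readahead into the OS page cache only,
which is the double-buffering concern raised above:)

#include <fcntl.h>

static void
prefetch_block(int fd, unsigned int blkno, size_t blcksz)
{
    off_t   offset = (off_t) blkno * blcksz;

    /* non-blocking hint: ask the kernel to read the range before redo needs it */
    (void) posix_fadvise(fd, offset, (off_t) blcksz, POSIX_FADV_WILLNEED);
}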