Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread Amit Langote
On 2018/07/12 14:32, Amit Langote wrote:
> Thanks Ashutosh for reporting and Dilip for the analysis and the patch.
> 
> On 2018/07/11 21:39, Dilip Kumar wrote:
>> On Wed, Jul 11, 2018 at 5:36 PM, amul sul  wrote:
>>> On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar  wrote:
>>

>>> I am not sure that I have understood the following comment:
>>>  11 +* Generate one prune step for the information derived from IS NULL,
>>>  12 +* if any.  To prune hash partitions, we must have found IS NULL
>>>  13 +* clauses for all partition keys.
>>>  14  */
>>>
>>> I am not sure that I have understood this -- no such restriction is
>>> required to prune the hash partitions, if I am not missing anything.
>>
>> Maybe it's not very clear, but this is the original comment, which I have
>> retained.  I just moved it out of the (!generate_opsteps) condition.
>>
>> Just to explain this comment, consider the example below:
>>
>> create table hp (a int, b text) partition by hash (a, b);
>> create table hp0 partition of hp for values with (modulus 4, remainder 0);
>> create table hp3 partition of hp for values with (modulus 4, remainder 3);
>> create table hp1 partition of hp for values with (modulus 4, remainder 1);
>> create table hp2 partition of hp for values with (modulus 4, remainder 2);
>>
>> postgres=# insert into hp values (1, null);
>> INSERT 0 1
>> postgres=# insert into hp values (2, null);
>> INSERT 0 1
>> postgres=# select tableoid::regclass, * from hp;
>>  tableoid | a | b
>> ----------+---+---
>>  hp1      | 1 |
>>  hp2      | 2 |
>> (2 rows)
>>
>> Now, if we query based on "b is null" then we cannot decide which
>> partitions should be pruned, whereas in the case of the other schemes such
>> a row goes to a designated (null-accepting or default) partition, so we can
>> prune all the other partitions.
> 
> That's right.  By generating a pruning step with only nullkeys set, we are
> effectively discarding OpExprs that may have been found for some partition
> keys.  That's fine for list/range partitioning, because nulls can only be
> found in a designated partition, so it's okay to prune all other
> partitions and for that it's enough to generate the pruning step like
> that.  For hash partitioning, nulls could be contained in any partition so
> it's not okay to discard OpExpr's like that.  We can generate pruning
> steps with a combination of null and non-null keys in the hash partitioning
> case if there are any OpExprs.
> 
> I think your fix is correct.  I slightly modified it along with updating
> nearby comments and added regression tests.

I updated the regression tests to reduce the number of lines.  There is no
point in repeating tests like the v2 patch did.
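
For reference, the behavior the new tests exercise can be sketched like this
(an illustration reusing the hash-partitioned table from the example quoted
above, not the committed tests themselves):

-- an IS NULL clause alone cannot prune hash partitions, because a row with
-- b IS NULL may hash into any partition
explain (costs off) select * from hp where b is null;

-- with the fix, the null key for b is combined with the equality clause on a,
-- so this should prune down to a single partition
explain (costs off) select * from hp where a = 1 and b is null;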

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index cdc61a8997..d9612bf65b 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -854,44 +854,42 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
}
 
/*
-* If generate_opsteps is set to false it means no OpExprs were directly
-* present in the input list.
+* For list and range partitioning, null partition keys can only be found
+* in one designated partition, so if there are IS NULL clauses containing
+* partition keys we should generate a pruning step that gets rid of all
+* partitions but the special null-accepting partitions.  For range
+* partitioning, that means we will end up disregarding OpExpr's that may
+* have been found for some other keys, but that's fine, because it is not
+* possible to prune range partitions with a combination of null and
+* non-null keys.
+*
+* For hash partitioning however, it is possible to combine null and non-
+* null keys in a pruning step, so do this only if *all* partition keys
+* are involved in IS NULL clauses.
 */
-   if (!generate_opsteps)
+   if (!bms_is_empty(nullkeys) &&
+   (part_scheme->strategy != PARTITION_STRATEGY_HASH ||
+bms_num_members(nullkeys) == part_scheme->partnatts))
+   {
+   PartitionPruneStep *step;
+
+   step = gen_prune_step_op(context, InvalidStrategy,
+false, NIL, NIL, nullkeys);
+   result = lappend(result, step);
+   }
+   else if (!generate_opsteps && !bms_is_empty(notnullkeys))
{
/*
-* Generate one prune step for the information derived from IS NULL,
-* if any.  To prune hash partitions, we must have found IS NULL
-* clauses for all partition keys.
+* There are no OpExpr's, but there are IS NOT NULL clauses, which
+* can be used to eliminate the null-partition-key-only partition.
 */
-   if (!bms_is_empty(nullkeys) &&
-   (part_scheme->strategy != 

Re: Add SKIP LOCKED to VACUUM and ANALYZE

2018-07-11 Thread Michael Paquier
On Wed, Jun 13, 2018 at 08:29:12PM +, Bossart, Nathan wrote:
> Previous thread: 
> https://postgr.es/m/4BC0F3CD-F4B5-4F23-AADB-80607F9E4B4E%40amazon.com
> 
> This is a new thread for tracking the work to add SKIP LOCKED to
> VACUUM and ANALYZE.  I've attached a rebased set of patches.

I am beginning to look at this stuff, and committed patch 4 and 5 on the
way as they make sense independently.  I am still reviewing the rest,
which applies with some fuzz, and the tests proposed still pass.
--
Michael




Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread Amit Langote
Thanks Ashutosh for reporting and Dilip for the analysis and the patch.

On 2018/07/11 21:39, Dilip Kumar wrote:
> On Wed, Jul 11, 2018 at 5:36 PM, amul sul  wrote:
>> On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar  wrote:
> 
>>>
>> I am not sure that I have understood the following comment:
>>  11 +* Generate one prune step for the information derived from IS NULL,
>>  12 +* if any.  To prune hash partitions, we must have found IS NULL
>>  13 +* clauses for all partition keys.
>>  14  */
>>
>> I am not sure that I have understood this -- no such restriction is
>> required to prune the hash partitions, if I am not missing anything.
> 
> Maybe it's not very clear, but this is the original comment, which I have
> retained.  I just moved it out of the (!generate_opsteps) condition.
> 
> Just to explain this comment, consider the example below:
> 
> create table hp (a int, b text) partition by hash (a, b);
> create table hp0 partition of hp for values with (modulus 4, remainder 0);
> create table hp3 partition of hp for values with (modulus 4, remainder 3);
> create table hp1 partition of hp for values with (modulus 4, remainder 1);
> create table hp2 partition of hp for values with (modulus 4, remainder 2);
> 
> postgres=# insert into hp values (1, null);
> INSERT 0 1
> postgres=# insert into hp values (2, null);
> INSERT 0 1
> postgres=# select tableoid::regclass, * from hp;
>  tableoid | a | b
> ----------+---+---
>  hp1      | 1 |
>  hp2      | 2 |
> (2 rows)
> 
> Now, if we query based on "b is null" then we cannot decide which
> partitions should be pruned, whereas in the case of the other schemes such
> a row goes to a designated (null-accepting or default) partition, so we can
> prune all the other partitions.

That's right.  By generating a pruning step with only nullkeys set, we are
effectively discarding OpExprs that may have been found for some partition
keys.  That's fine for list/range partitioning, because nulls can only be
found in a designated partition, so it's okay to prune all other
partitions and for that it's enough to generate the pruning step like
that.  For hash partitioning, nulls could be contained in any partition so
it's not okay to discard OpExpr's like that.  We can generate pruning
steps with a combination of null and non-null keys in the hash partitioning
case if there are any OpExprs.

I think your fix is correct.  I slightly modified it along with updating
nearby comments and added regression tests.

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index cdc61a8997..d9612bf65b 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -854,44 +854,42 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
}
 
/*
-* If generate_opsteps is set to false it means no OpExprs were directly
-* present in the input list.
+* For list and range partitioning, null partition keys can only be found
+* in one designated partition, so if there are IS NULL clauses containing
+* partition keys we should generate a pruning step that gets rid of all
+* partitions but the special null-accepting partitions.  For range
+* partitioning, that means we will end up disregarding OpExpr's that may
+* have been found for some other keys, but that's fine, because it is not
+* possible to prune range partitions with a combination of null and
+* non-null keys.
+*
+* For hash partitioning however, it is possible to combine null and non-
+* null keys in a pruning step, so do this only if *all* partition keys
+* are involved in IS NULL clauses.
 */
-   if (!generate_opsteps)
+   if (!bms_is_empty(nullkeys) &&
+   (part_scheme->strategy != PARTITION_STRATEGY_HASH ||
+bms_num_members(nullkeys) == part_scheme->partnatts))
+   {
+   PartitionPruneStep *step;
+
+   step = gen_prune_step_op(context, InvalidStrategy,
+false, NIL, NIL, nullkeys);
+   result = lappend(result, step);
+   }
+   else if (!generate_opsteps && !bms_is_empty(notnullkeys))
{
/*
-* Generate one prune step for the information derived from IS NULL,
-* if any.  To prune hash partitions, we must have found IS NULL
-* clauses for all partition keys.
+* There are no OpExpr's, but there are IS NOT NULL clauses, which
+* can be used to eliminate the null-partition-key-only partition.
 */
-   if (!bms_is_empty(nullkeys) &&
-   (part_scheme->strategy != PARTITION_STRATEGY_HASH ||
-bms_num_members(nullkeys) == part_scheme->partnatts))
-   {
-   PartitionPruneStep *step;
+   

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-11 Thread Ashutosh Bapat
On Thu, Jul 12, 2018 at 9:02 AM, Etsuro Fujita
 wrote:
> (2018/07/11 20:02), Ashutosh Bapat wrote:
>>
>> On Wed, Jul 11, 2018 at 1:23 PM, Etsuro Fujita
>>   wrote:
>>>
>>> Actually, even if we could create such an index on the child table and
>>> the
>>> targetlist had the ConvertRowtypeExpr, the planner would still not be
>>> able
>>> to use an index-only scan with that index; because check_index_only would
>>> not consider that an index-only scan is possible for that index because
>>> that
>>> index is an expression index and that function currently does not
>>> consider
>>> that index expressions are able to be returned back in an index-only
>>> scan.
>>> That behavior of the planner might be improved in future, though.
>
>
>> Right and when we do so, not having ConvertRowtypeExpr in the
>> targetlist will be a problem.
>
>
> Yeah, but I don't think that that's unsolvable; because in that case the CRE
> as an index expression could be converted back to the whole-row Var's
> rowtype by adding another CRE to the index expression for that conversion, I
> suspect that that special handling could allow us to support an index-only
> scan even when having the whole-row Var instead of the CRE in the
> targetlist.

I am not able to understand this. Can you please provide an example?

> (Having said that, I'm not 100% sure we need to solve that
> problem when we improve the planner, because there doesn't seem to me to be
> enough use-case to justify making the code complicated for that.)  Anyway, I
> think that that would be a matter of future versions of PG.

I am afraid that a fix which works only until we change the other
parts of the system is useful only until that time; at that point it
needs to be replaced with a different fix. If that time is long
enough, that's ok. In this case, I agree that if we haven't heard from
users so far that they need to create indexes on whole-row
expressions, there's a chance that we will never implement this feature.

>
 At places in planner we match equivalence members
 to the targetlist entries. This matching will fail unexpectedly when
 ConvertRowtypeExpr is removed from a child's targetlist. But again I
 couldn't reproduce a problem when such a mismatch arises.
>>>
>>>
>>>
>>> IIUC, I don't think the planner assumes that for an equivalence member
>>> there
>>> is a matching entry for that member in the targetlist; what I think the
>>> planner assumes is: an equivalence member is able to be computed from
>>> expressions in the targetlist.
>>
>>
>> This is true. However,
>>
>>>   So, I think it is safe to have whole-row
>>> Vars instead of ConvertRowtypeExprs in the targetlist.
>>
>>
>> when it's looking for an expression, it finds a whole-row expression
>> so it thinks it needs to add a ConvertRowtypeExpr on that. But when the
>> plan is created, there is a ConvertRowtypeExpr already, but there is no
>> way to know that a new ConvertRowtypeExpr is not needed anymore. So,
>> we may have two ConvertRowtypeExprs giving wrong results.
>
>
> Maybe I'm missing something, but I don't think that we need to worry about
> that, because in the approach I proposed, we only add CREs above whole-row
> Vars in the targetlists for subplans of an Append/MergeAppend for a
> partitioned relation at plan creation time.

There's a patch in an adjacent thread started by David Rowley to rip
out Append/MergeAppend when there is only one subplan. So, your
solution won't work there.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Negotiating the SCRAM channel binding type

2018-07-11 Thread Michael Paquier
On Wed, Jul 11, 2018 at 03:01:03PM +0300, Heikki Linnakangas wrote:
> That would be more complicated to parse. Yeah, we might need further options
> for some SASL mechanisms in the future, but we can cross that bridge when we
> get there. I don't see any need to complicate this case for that.

Okay, fine for me.

> I started digging into this more closely, and ran into a little problem. If
> channel binding is not used, the client sends a flag to the server to
> indicate if it's because the client doesn't support channel binding, or
> because it thought that the server doesn't support it. This is the
> gs2-cbind-flag. How should the flag be set, if the server supports channel
> binding type A, while client supports only type B? The purpose of the flag
> is to detect downgrade attacks, where a man-in-the-middle removes the PLUS
> variants from the list of supported mechanisms. If we treat incompatible
> channel binding types as "client doesn't support channel binding", then the
> downgrade attack becomes possible (the attacker can replace the legit PLUS
> variants with bogus channel binding types that the client surely doesn't
> support). If we treat it as "server doesn't support channel binding", then
> we won't downgrade to the non-channel-binding variant, in the legitimate
> case that the client and server both support channel binding, but with
> incompatible types.
> 
> What we'd really want, is to include the full list of server's supported
> mechanisms as part of the exchange, not just a boolean "y/n" flag. But
> that's not what the spec says :-(.

Let's not break the spec :)  I understand from it that the client is in
charge of the choice, so we are rather free to choose the reaction the
client should have.  In the second phase of the exchange, the client
communicates back to the server the channel binding it has decided to
choose, it is not up to the server to pick up one if the client thinks
that it can use multiple ones.

> I guess we should err on the side of caution, and fail the authentication in
> that case. That's unfortunate, but it's better than not negotiating at all.
> At least with the negotiation, the authentication will work if there is any
> mutually supported channel binding type.

I think that in this case the client should throw an error
unconditionally if it wants to use channel A but server supports only B.
It is always possible for the client to adjust the channel binding type
it wants to use by using the connection parameter scram_channel_binding,
or to recompile.  If there is an incompatibility between channel binding
types, it would actually mean that the client and the server are not
compiled with the same SSL implementation; would that actually work?  Say
a server has only tls-unique with gnu's library and the client uses JDBC,
which only has tls-server-end-point...

So, to keep things simple, it seems to me that we should just make the
server send the list, and then check at client-level if the list sent by
server includes what the client wants to use, complaining if the option
is not present.
--
Michael




Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Thu, Jul 12, 2018 at 09:33:21AM +0800, Craig Ringer wrote:
> On 12 July 2018 at 09:10, Tsunakawa, Takayuki <
> tsunakawa.ta...@jp.fujitsu.com> wrote:
> > From: Nico Williams [mailto:n...@cryptonector.com]
> > > On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote:
> > > > How can one make defensive use of his patent if he allows everyone to
> > > > use it royalty-free?  Can he use his patent for cross-licensing
> > > > negotiation if some commercial database vendor accuses his company
> > > > that PostgreSQL unintentionally infringes that vendor's patent?
> > >
> > > https://en.wikipedia.org/wiki/Defensive_termination
> >
> > Thank you, this looks reasonable to give everyone the grant.  Then, I
> > wonder if the community can accept patented code if the patent holder
> > grants this kind of license.
> 
> Personally, I'd be in favour, but the core team has spoken here. I'm not
> privy to the discussion but the outcome seems firm.

Has the issue been considered in enough detail?

A thorough treatment of the issue would probably mean that PG should
have a contributor agreement, or at the very least the license that PG
uses should include generous royalty-free patent grants so as to force
contributors to make them.  Of course, any time you add contributor
agreements you add friction to the contribution process, and at least
one-time legal costs for the core.

Also, what about patents that are placed in the public domain?  Surely
code implementing those would not be rejected for involving patents...

I'm not sure that the widest grant with no-lawsuit defensive clauses
should be accepted, but it's not that far from the public domain case.
Even here there's legal costs: at least a one-time cost of hiring an IP
lawyer to write up patent grant language that the PG community would
insist on.  It could be well worth it.

Nico
-- 



Re: Cannot dump foreign key constraints on partitioned table

2018-07-11 Thread Michael Paquier
On Wed, Jul 11, 2018 at 03:49:59PM +0530, amul sul wrote:
> On the master head, the getConstraints() function skips FK constraints for
> a partitioned table because tbinfo->hastriggers is false.
> 
> While creating FK constraints on the partitioned table, the FK triggers are
> only created on leaf partitions and skipped for the partitioned tables.

Oops.  Good catch.

> To fix this, either bypass the aforementioned condition of getConstraints() or
> set pg_class.relhastriggers to true for the partitioned table, which doesn't
> seem to be the right solution, IMO.  Thoughts?

Changing pg_class.relhastriggers is out of scope because as far as I
know partitioned tables have no triggers, so the current value is
correct, and that would be a catalog change at this stage which would
cause any existing deployments of v11 to complain about the
inconsistency.  I think that this should be fixed client-side as the
attached does.

I have just stolen this SQL set from Alvaro to check the consistency of
the dumps created:
create table prim (a int primary key);
create table partfk (a int references prim) partition by range (a);
create table partfk1 partition of partfk for values from (0) to (100);
create table partfk2 partition of partfk for values from (100) to (200);
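
As a quick check (not part of the patch), the catalogs show why getConstraints()
skipped the parent: the FK constraint is there, but relhastriggers stays false:

select conname, contype from pg_constraint
  where conrelid = 'partfk'::regclass and contype = 'f';
select relhastriggers from pg_class where relname = 'partfk';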

Thoughts?
--
Michael
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 463639208d..74a1270169 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -7131,7 +7131,12 @@ getConstraints(Archive *fout, TableInfo tblinfo[], int numTables)
 	{
		TableInfo  *tbinfo = &tblinfo[i];
 
-		if (!tbinfo->hastriggers ||
+		/*
+		 * For partitioned tables, foreign keys have no triggers so they
+		 * must be included anyway in case some foreign keys are defined.
+		 */
+		if ((!tbinfo->hastriggers &&
+			 tbinfo->relkind != RELKIND_PARTITIONED_TABLE) ||
 			!(tbinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 			continue;
 




Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-11 Thread Etsuro Fujita

(2018/07/11 20:02), Ashutosh Bapat wrote:

On Wed, Jul 11, 2018 at 1:23 PM, Etsuro Fujita
  wrote:

Actually, even if we could create such an index on the child table and the
targetlist had the ConvertRowtypeExpr, the planner would still not be able
to use an index-only scan with that index; because check_index_only would
not consider that an index-only scan is possible for that index because that
index is an expression index and that function currently does not consider
that index expressions are able to be returned back in an index-only scan.
That behavior of the planner might be improved in future, though.



Right and when we do so, not having ConvertRowtypeExpr in the
targetlist will be a problem.


Yeah, but I don't think that that's unsolvable; because in that case the 
CRE as an index expression could be converted back to the whole-row 
Var's rowtype by adding another CRE to the index expression for that 
conversion, I suspect that that special handling could allow us to 
support an index-only scan even when having the whole-row Var instead of 
the CRE in the targetlist.  (Having said that, I'm not 100% sure we need 
to solve that problem when we improve the planner, because there doesn't 
seem to me to be enough use-case to justify making the code complicated 
for that.)  Anyway, I think that that would be a matter of future 
versions of PG.



At places in planner we match equivalence members
to the targetlist entries. This matching will fail unexpectedly when
ConvertRowtypeExpr is removed from a child's targetlist. But again I
couldn't reproduce a problem when such a mismatch arises.



IIUC, I don't think the planner assumes that for an equivalence member there
is a matching entry for that member in the targetlist; what I think the
planner assumes is: an equivalence member is able to be computed from
expressions in the targetlist.


This is true. However,


  So, I think it is safe to have whole-row
Vars instead of ConvertRowtypeExprs in the targetlist.


when it's looking for an expression, it finds a whole-row expression
so it thinks it needs to add a ConvertRowtypeExpr on that. But when the
plan is created, there is a ConvertRowtypeExpr already, but there is no
way to know that a new ConvertRowtypeExpr is not needed anymore. So,
we may have two ConvertRowtypeExprs giving wrong results.


Maybe I'm missing something, but I don't think that we need to worry 
about that, because in the approach I proposed, we only add CREs above 
whole-row Vars in the targetlists for subplans of an Append/MergeAppend 
for a partitioned relation at plan creation time.


Best regards,
Etsuro Fujita



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Thu, Jul 12, 2018 at 01:10:33AM +, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote:
> > > How can one make defensive use of his patent if he allows everyone to
> > > use it royalty-free?  Can he use his patent for cross-licensing
> > > negotiation if some commercial database vendor accuses his company
> > > that PostgreSQL unintentionally infringes that vendor's patent?
> > 
> > https://en.wikipedia.org/wiki/Defensive_termination
> 
> Thank you, this looks reasonable to give everyone the grant.  Then, I
> wonder if the community can accept patented code if the patent holder
> grants this kind of license.

I'm not a core member, but personally I'd be inclined to accept a
royalty-free, non-exclusive, non-expiring, transferrable, ..., grant
where the only condition is to not sue the patent holder.

I don't object to patents in general, but I think patent lifetimes are
way too long for software (because they are not commensurate with the
costs of developing software), and I understand that often one must
obtain patents for *defensive* purposes because there are sharks out
there.  So I'm inclined to accept patents with defensive but otherwise
very wide grants.

Nico
-- 



Re: TRUNCATE tables referenced by FKs on partitioned tables

2018-07-11 Thread Michael Paquier
On Wed, Jul 11, 2018 at 06:59:16PM -0400, Alvaro Herrera wrote:
> Anyway, this patch seems to fix it, and adds what I think is appropriate
> test coverage.

This looks good to me.  I am noticing that the documentation of TRUNCATE
does not mention that, when running the command on a partitioned table,
it automatically recurses to the children even if CASCADE is not used
and the child partitions are not listed individually.
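
For instance (made-up table names, not taken from the patch):

create table truncparted (a int) partition by list (a);
create table truncparted1 partition of truncparted for values in (1);
insert into truncparted values (1);
truncate truncparted;                -- no CASCADE, partition not listed
select count(*) from truncparted1;   -- returns 0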

What is the filler column added in truncpart used for?
--
Michael




bgworkers should share more of BackendRun / PostgresMain

2018-07-11 Thread Craig Ringer
Hi folks

As I work on background workers almost exclusively at the moment, I keep on
running into things that are just not initialized, or not supported, in
bgworkers. Many are quite surprising.

I think bgworkers should just share more of the main Pg startup path.

My latest surprise is that we don't re-initialize the RNG for bgworkers,
because it's done in BackendRun before calling PostgresMain, and bgworker
code path diverges before we call BackendRun.

bgworkers also have to do all their own error handling, duplicating lots of
logic in PostgresMain, if they want to have any error recovery except "die
and restart".

bgworkers don't participate in the ProcSignal mechanism, get query cancels,
etc, unless they do extra setup.

There are all sorts of other small surprises, like the way we don't set up
the dbname in shmem, so logs for bgworkers show it as empty.

There's no patch here because I'm not yet sure what the best approach is.
Some things seem clear, like moving the RNG init work from BackendRun a bit
earlier so all backends hit it. Other parts much less so.

Look at the number of calls to InitProcess() from our profusion of custom
worker types, user backends, the bgworker infrastructure, etc.

I could just copy stuff from BackendRun / PostgresMain to
StartBackgroundWorker() but I'm sure that's not the right approach. I think
we need to decide what should be common and factor it out a bit.

So discussion/ideas appreciated.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: _isnan() on Windows

2018-07-11 Thread Michael Paquier
On Wed, Jul 11, 2018 at 09:13:40AM -0400, Alvaro Herrera wrote:
> I just pushed it before seeing your message.

Fine as well, thanks for picking this up.  The buildfarm shows no
failures about this patch.
--
Michael




Re: Failure assertion in GROUPS mode of window function in current HEAD

2018-07-11 Thread Masahiko Sawada
On Wed, Jul 11, 2018 at 4:21 AM, Tom Lane  wrote:
> Masahiko Sawada  writes:
>> BTW, I found another, related issue. We can also get an assertion
>> failure with the following query.
>
>> =# select sum(c) over (partition by c order by c groups between 1
>> preceding and current row) from test;
>
> Oh, good point, that's certainly legal per spec, so we'd need to do
> something about it.
>
> The proximate cause of the problem, I think, is that the planner throws
> away the "order by c" as being redundant because it already sorted by c
> for the partitioning requirement.  So we end up with ordNumCols = 0
> even though the query had ORDER BY.

Yeah, I think so too.

>
> This breaks not only GROUPS cases, but also RANGE OFFSET cases, because
> the executor expects to have an ordering column.  I thought for a bit
> about fixing that by forcing the planner to emit the ordering column for
> RANGE OFFSET even if it's redundant.  But it's hard to make the existing
> logic in get_column_info_for_window do that --- it can't tell which
> partitioning column the ordering column was redundant with, and even if it
> could, that column might've been of a different data type.  So eventually
> I just threw away that redundant-key-elimination logic altogether.
> I believe that we never thought it was really useful as an optimization,
> but way back when window functions were put in, we didn't have (or at
> least didn't think about) a way to identify the partitioning/ordering
> columns without reference to the input pathkeys.
>

Agreed.

> With this patch, WindowAggPath.winpathkeys is not used for anything
> anymore.  I'm inclined to get rid of it, though I didn't do so here.
> (If we keep it, we at least need to adjust the comment in relation.h
> that claims createplan.c needs it.)

+1 for removing.

>
> The other issue here is that nodeWindowAgg's setup logic is not quite
> consistent with update_frameheadpos and update_frametailpos about when
> to create tuplestore read pointers and slots.  We might've prevented
> all the inconsistent cases with the parser and planner fixes, but it
> seems best to make them really consistent anyway, so I changed that.
>
> Draft patch attached.
>

Thank you for committing!
I think we can also update the docs to mention that the GROUPS offset mode
requires an ORDER BY clause. Thoughts? The attached patch updates it.
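
To illustrate, using a table like the one in the quoted query (names assumed):

create table test (c int);
insert into test values (1), (1), (2);
-- rejected: the GROUPS offset frame needs an ORDER BY clause
select sum(c) over (partition by c groups between 1 preceding and current row)
  from test;
-- accepted once ORDER BY is present, even though it repeats the PARTITION BY column
select sum(c) over (partition by c order by c
                    groups between 1 preceding and current row)
  from test;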

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


update_window_syntax_doc.patch
Description: Binary data


Re: [HACKERS] Replication status in logical replication

2018-07-11 Thread Masahiko Sawada
On Thu, Jul 12, 2018 at 10:22 AM, Michael Paquier  wrote:
> On Tue, Jul 10, 2018 at 10:14:35AM +0900, Michael Paquier wrote:
>> Thanks.  If there are no objections, then I will try to wrap this stuff
>> on Thursday my time.
>
> And done down to 9.4.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



pread() and pwrite()

2018-07-11 Thread Thomas Munro
Hello hackers,

A couple of years ago, Oskari Saarenmaa proposed a patch[1] to adopt
$SUBJECT.  Last year, Tobias Oberstein argued again that we should do
that[2].  At the end of that thread there was a +1 from multiple
committers in support of getting it done for PostgreSQL 12.  Since
Oskari hasn't returned, I decided to dust off his patch and start a
new thread.

Here is a rebase and an initial review.  I plan to address the
problems myself, unless Oskari wants to do that in which case I'll
happily review and hopefully eventually commit it.  It's possible I
introduced new bugs while rebasing since basically everything moved
around, but it passes check-world here with and without HAVE_PREAD and
HAVE_PWRITE defined.

Let me summarise what's going on in the patch:

1.  FileRead() and FileWrite() are replaced with FileReadAt() and
FileWriteAt(), taking a new offset argument.  Now we can do one
syscall instead of two for random reads and writes.
2.  fd.c no longer tracks seek position, so we lose a bunch of cruft
from the vfd machinery.  FileSeek() now only supports SEEK_SET and
SEEK_END.

This is taking the position that we no longer want to be able to do
file IO with implicit positioning at all.  I think that seems
reasonable: it's nice to drop all that position tracking and caching
code, and the seek-based approach would be incompatible with any
multithreading plans.  I'm not even sure I'd bother adding "At" to the
function names.  If there are any extensions that want the old
interface they will fail to compile either way.  Note that the BufFile
interface remains the same, so hopefully that covers many use cases.

I guess the only remaining reason to use FileSeek() is to get the file
size?  So I wonder why SEEK_SET remains valid in the patch... if my
suspicion is correct that only SEEK_END still has a reason to exist,
perhaps we should just kill FileSeek() and add FileSize() or something
instead?

pgstat_report_wait_start(wait_event_info);
+#ifdef HAVE_PREAD
+   returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+   lseek(VfdCache[file].fd, offset, SEEK_SET);
returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();

This obviously lacks error handling for lseek().

I wonder if anyone would want separate wait_event tracking for the
lseek() (though this codepath would be used by almost nobody,
especially if someone adds Windows support, so it's probably not worth
bothering with).

I suppose we could assume that if you have pread() you must also have
pwrite() and save on ./configure cycles.

I will add this to the next Festival of Commits, though clearly it
needs a bit more work before the festivities begin.

[1] 
https://www.postgresql.org/message-id/flat/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
[2] 
https://www.postgresql.org/message-id/flat/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Use-pread-pwrite-instead-of-lseek-read-write-v1.patch
Description: Binary data


Re: Preferring index-only-scan when the cost is equal

2018-07-11 Thread Yugo Nagata
On Wed, 11 Jul 2018 14:37:46 +0200
Tomas Vondra  wrote:

> 
> On 07/11/2018 01:28 PM, Ashutosh Bapat wrote:

> > I don't think we should change add_path() for this. We will
> > unnecessarily check that condition even for the cases where we do not
> > create index paths. I think we should fix the caller of add_path()
> > instead to add index only path before any index paths. For that the
> > index list needs to be sorted by the possibility of using index only
> > scan.
> > 
> > But I think in your case, it might be better to first check whether
> > there is any costing error because of which index only scan's path has
> > the same cost as index scan path. Also I don't see any testcase which
> > will show why index only scan would be more efficient in your case.
> > May be provide output of EXPLAIN ANALYZE.
> > 
> 
> I suspect this only happens due to testing on empty tables. Not only is 
> testing of indexes on small tables rather pointless in general, but more 
> importantly there will be no statistics. So we fall back to some default 
> estimates, but we also don't have relallvisible etc which is crucial for 
> estimating index-only scans. I'd bet that's why the cost estimates for 
> index scans and index-only scans are the same here.

You are right. When the table has rows and has been vacuumed, the index-only
scan's cost is cheaper and it is chosen properly. Sorry, I jumped to the
conclusion before confirming this.
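
For example, something along these lines (hypothetical table, not the schema
from the original report) shows the expected behavior once statistics and
visibility information are available:

create table t (a int);
insert into t select generate_series(1, 100000);
create index on t (a);
vacuum analyze t;    -- sets relallvisible, which index-only-scan costing needs
explain (costs off) select a from t where a < 100;   -- expect an Index Only Scan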

Thanks,

-- 
Yugo Nagata 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Craig Ringer
On 12 July 2018 at 09:10, Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Nico Williams [mailto:n...@cryptonector.com]
> > On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote:
> > > How can one make defensive use of his patent if he allows everyone to
> > > use it royalty-free?  Can he use his patent for cross-licensing
> > > negotiation if some commercial database vendor accuses his company
> > > that PostgreSQL unintentionally infringes that vendor's patent?
> >
> > https://en.wikipedia.org/wiki/Defensive_termination
>
> Thank you, this looks reasonable to give everyone the grant.  Then, I
> wonder if the community can accept patented code if the patent holder
> grants this kind of license.


Personally, I'd be in favour, but the core team has spoken here. I'm not
privy to the discussion but the outcome seems firm.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)

2018-07-11 Thread David Rowley
On 12 July 2018 at 12:19, Haribabu Kommi  wrote:
> Yes, I agree that we may need a new counter that counts the buffers that
> are just allocated (no read or write). But currently, maybe the counter
> value is very small, so people are not interested.

The existing counters don't show up in EXPLAIN (ANALYZE, BUFFERS) when
they're zero. The new counter would surely work the same way, so the
users wouldn't be burdened by additional explain output when
they're not affected by it.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Replication status in logical replication

2018-07-11 Thread Michael Paquier
On Tue, Jul 10, 2018 at 10:14:35AM +0900, Michael Paquier wrote:
> Thanks.  If there are no objections, then I will try to wrap this stuff
> on Thursday my time.

And done down to 9.4.
--
Michael




Re: Allow to specify a index name as ANALYZE parameter

2018-07-11 Thread Yugo Nagata
On Wed, 11 Jul 2018 14:26:03 +0300
Alexander Korotkov  wrote:
 
> On Wed, Jul 11, 2018 at 12:04 PM Yugo Nagata  wrote:
> > When we specify column names for ANALYZE, only the statistics for those
> > columns are collected. Similarly, would it be useful to have an option to
> > specify an index for ANALYZE, to collect only the statistics for the
> > expressions in the specified index?
> >
> > A use case I suppose is when a new expression index is created and we need
> > only the statistics for that new index. Although ANALYZE of all the indexes
> > is usually fast because ANALYZE uses a random sample of the table rows,
> > ANALYZE on the specific index may still be useful if there are other indexes
> > whose "statistics target" is large and/or whose expressions take time to
> > compute, for example.
> >
> > Attached is a WIP patch to allow an index name to be specified as an ANALYZE
> > parameter. Documentation is not yet included.  I would appreciate any
> > feedback!
> 
> I think this makes sense.  Once we can collect statistics individually
> for regular columns, we should be able to do the same for indexed
> expressions.  Please register this patch on the next commitfest.

Thank you for your comment! I registered this to CF 2018-09.

> Regarding the current implementation, I found the message "ANALYZE option must
> be specified when a column list is provided" to be confusing.
> Perhaps you've missed some negation in this message, since in fact
> you disallow ANALYZE with a column list.
> 
> However, since a multicolumn index may contain multiple expressions, I
> think we should support specifying columns in the ANALYZE index clause.
> We could support specifying columns by their numbers in the same way
> we did for "ALTER INDEX index ALTER COLUMN colnum SET STATISTICS
> number".

Makes sense. I'll fix the patch to support specifying the columns of an index.
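
For reference, a sketch (hypothetical names) of the existing per-column syntax
that the ANALYZE extension would mirror:

create table t (a int, b text);
create index t_expr_idx on t (lower(b), (a + 1));
alter index t_expr_idx alter column 1 set statistics 1000;
analyze t;   -- currently recomputes statistics for all columns and index expressions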

Thanks,

-- 
Yugo Nagata 



RE: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Tsunakawa, Takayuki
From: Nico Williams [mailto:n...@cryptonector.com]
> On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote:
> > How can one make defensive use of his patent if he allows everyone to
> > use it royalty-free?  Can he use his patent for cross-licensing
> > negotiation if some commercial database vendor accuses his company
> > that PostgreSQL unintentionally infringes that vendor's patent?
> 
> https://en.wikipedia.org/wiki/Defensive_termination

Thank you, this looks reasonable to give everyone the grant.  Then, I wonder if 
the community can accept patented code if the patent holder grants this kind of 
license.


Regards
Takayuki Tsunakawa






Upper limit arguments of pg_logical_slot_xxx_changes functions accept invalid values

2018-07-11 Thread Masahiko Sawada
Hi,

While reading the replication slot code, I found a wrong assignment
in the pg_logical_slot_get_changes_guts() function, as follows.

    if (PG_ARGISNULL(2))
        upto_nchanges = InvalidXLogRecPtr;
    else
        upto_nchanges = PG_GETARG_INT32(2);

Since upto_nchanges is an integer value, we should set it to 0 (meaning
unlimited) instead of InvalidXLogRecPtr. Since InvalidXLogRecPtr is
actually 0, this function has worked fine so far. Also, I was surprised that
these functions accept negative values for upto_nchanges.  The
upto_lsn argument is similar; it also accepts an invalid lsn.
For safety, maybe it's better to reject non-NULL invalid values. That
way, the behavior of these functions is consistent with what the
documentation says:

  If upto_lsn is non-NULL, decoding will include only those
transactions which commit prior to the specified LSN. If upto_nchanges
is non-NULL, decoding will stop when the number of rows produced by
decoding exceeds the specified value.
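
To illustrate with a hypothetical slot name, the arguments in question are the
second and third parameters of these functions:

select * from pg_logical_slot_peek_changes('test_slot', NULL, NULL);  -- no upper limits
select * from pg_logical_slot_peek_changes('test_slot', NULL, 10);    -- stop after ~10 rows
select * from pg_logical_slot_peek_changes('test_slot', '0/0', -1);   -- would be rejected with the patch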

Attached patch fixes them. Feedback is very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_logical_slot_changes_funcs.patch
Description: Binary data


Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-11 Thread Nico Williams
Attached is an additional patch, as well as a new, rebased patch.

This includes changes responsive to Álvaro Herrera's commentary about
the SET CONSTRAINTS manual page.
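
As a rough sketch of what the new option looks like in DDL (invented table
names; the attribute goes in the same place as DEFERRABLE, per the grammar
changes below):

create table parent (id int primary key);
create table child (
    parent_id int references parent (id) always deferred
);
-- unlike DEFERRABLE INITIALLY DEFERRED, such a constraint cannot be switched
-- back to immediate with SET CONSTRAINTS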

Nico
-- 
From e7838b60dbf0a8cd7f35591db2f9aab78d8903cb Mon Sep 17 00:00:00 2001
From: Nicolas Williams 
Date: Wed, 11 Jul 2018 19:53:01 -0500
Subject: [PATCH] Add ALWAYS DEFERRED option for CONSTRAINTs (CR)

---
 doc/src/sgml/catalogs.sgml |  2 +-
 doc/src/sgml/ref/set_constraints.sgml  | 10 ++
 src/backend/catalog/index.c|  1 -
 src/backend/catalog/information_schema.sql |  8 +++-
 src/backend/commands/trigger.c |  1 -
 src/backend/parser/gram.y  | 18 ++
 6 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 291e6a9..4d42594 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2245,7 +2245,7 @@ SCRAM-SHA-256$iteration count:
   Constraint deferral option:
 a = always deferred,
 d = deferrable,
-d = deferrable initially deferred,
+i = deferrable initially deferred,
 n = not deferrable
   
  
diff --git a/doc/src/sgml/ref/set_constraints.sgml b/doc/src/sgml/ref/set_constraints.sgml
index 671332a..390015e 100644
--- a/doc/src/sgml/ref/set_constraints.sgml
+++ b/doc/src/sgml/ref/set_constraints.sgml
@@ -34,11 +34,13 @@ SET CONSTRAINTS { ALL | name [, ...
   
 
   
-   Upon creation, a constraint is given one of three
+   Upon creation, a constraint is given one of four
characteristics: DEFERRABLE INITIALLY DEFERRED,
-   DEFERRABLE INITIALLY IMMEDIATE, or
-   NOT DEFERRABLE. The third
-   class is always IMMEDIATE and is not affected by the
+   DEFERRABLE INITIALLY IMMEDIATE,
+   NOT DEFERRABLE, or ALWAYS DEFERRED.
+   The third
+   class is always IMMEDIATE, while the fourth class is
+   always DEFERRED, and neither affected by the
SET CONSTRAINTS command.  The first two classes start
every transaction in the indicated mode, but their behavior can be changed
within a transaction by SET CONSTRAINTS.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 795a7a9..45b52b4 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1070,7 +1070,6 @@ index_create(Relation heapRelation,
 
  recordDependencyOn(&myself, &referenced, deptype);
 			}
-			Assert((flags & INDEX_CREATE_ADD_CONSTRAINT) == 0);
 		}
 
 		/* Store dependency on parent index, if any */
diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index bde6199..dd4792a 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -894,11 +894,9 @@ CREATE VIEW domain_constraints AS
CAST(CASE WHEN condeferral = 'n' THEN 'NO' ELSE 'YES' END
  AS yes_or_no) AS is_deferrable,
CAST(CASE WHEN condeferral = 'i' OR condeferral = 'a' THEN 'YES' ELSE 'NO' END
- AS yes_or_no) AS initially_deferred
-	   /*
-	* XXX Can we add is_always_deferred here?  Are there
-	* standards considerations?
-	*/
+ AS yes_or_no) AS initially_deferred,
+   CAST(CASE WHEN condeferral = 'a' THEN 'YES' ELSE 'NO' END
+ AS yes_or_no) AS always_deferred
 FROM pg_namespace rs, pg_namespace n, pg_constraint con, pg_type t
 WHERE rs.oid = con.connamespace
   AND n.oid = t.typnamespace
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 41dc6a4..33b1095 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3627,7 +3627,6 @@ typedef struct AfterTriggerSharedData
 	TriggerEvent ats_event;		/* event type indicator, see trigger.h */
 	Oid			ats_tgoid;		/* the trigger's ID */
 	Oid			ats_relid;		/* the relation it's on */
-	bool			ats_alwaysdeferred;	/* whether this can be deferred */
 	CommandId	ats_firing_id;	/* ID for firing cycle */
 	struct AfterTriggersTableData *ats_table;	/* transition table access */
 } AfterTriggerSharedData;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dab721a..9aaa2af 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -185,8 +185,8 @@ static void SplitColQualList(List *qualList,
 			 List **constraintList, CollateClause **collClause,
 			 core_yyscan_t yyscanner);
 static void processCASbits(int cas_bits, int location, const char *constrType,
-			   char *deferral,
-			   bool *not_valid, bool *no_inherit, core_yyscan_t yyscanner);
+			   char *deferral, bool *not_valid, bool *no_inherit,
+			   core_yyscan_t yyscanner);
 static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %}
@@ -5579,13 +5579,13 @@ ConstraintAttributeSpec:
 		(newspec & (CAS_INITIALLY_IMMEDIATE)))
 		ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
- 

Re: Costing bug in hash join logic for semi joins

2018-07-11 Thread David Rowley
On 10 July 2018 at 22:21, David Rowley  wrote:
> I've done that in the attached.  Also on reading the comment above, it
> looks slightly incorrect. To me, it looks like it's applying a
> twentieth of the cost and not a tenth as the comment claims. I
> couldn't resist updating that too.

I've added this patch to the September commit fest:

https://commitfest.postgresql.org/19/1720/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-11 Thread David Pacheco
On Tue, Jul 10, 2018 at 10:32 PM, Thomas Munro <
thomas.mu...@enterprisedb.com> wrote:

> On Wed, Jul 11, 2018 at 8:25 AM, Joshua D. Drake 
> wrote:
> > On 07/10/2018 01:15 PM, Jerry Jelinek wrote:
> >>
> >> Thanks to everyone who took the time to look at the patch and send me
> >> feedback.  I'm happy to work on improving the documentation of this new
> >> tunable to clarify when it should be used and the implications. I'm
> trying
> >> to understand more specifically what else needs to be done next. To
> >> summarize, I think the following general concerns were brought up.
> >>
> >> For #6, there is no feasible way for us to recreate our workload on
> other
> >> operating systems or filesystems. Can anyone expand on what performance
> data
> >> is needed?
> >
> > I think a simple way to prove this would be to run BenchmarkSQL against
> > PostgreSQL in a default configuration with pg_xlog/pg_wal on a filesystem
> > that is COW (zfs) and then run another test where pg_xlog/pg_wal is
> patched
> > with your patch and new behavior and then run the test again.
> BenchmarkSQL
> > is a more thorough benchmarking tool than something like pgbench and is
> > very easy to set up.
>
> I have a lowly but trusty HP Microserver running FreeBSD 11.2 with ZFS
> on spinning rust.  It occurred to me that such an anaemic machine
> might show this effect easily because its cold reads are as slow as a
> Lada full of elephants going uphill.  Let's see...
>
> # os setup
> sysctl vfs.zfs.arc_min=134217728
> sysctl vfs.zfs.arc_max=134217728
> zfs create zroot/data/test
> zfs set mountpoint=/data/test zroot/data/test
> zfs set compression=off zroot/data/test
> zfs set recordsize=8192 zroot/data/test
>
> # initdb into /data/test/pgdata, then set postgresql.conf up like this:
> fsync=off
> max_wal_size = 600MB
> min_wal_size = 600MB
>
> # small scale test, we're only interested in producing WAL, not db size
> pgbench -i -s 100 postgres
>
> # do this a few times first, to make sure we have lots of WAL segments
> pgbench -M prepared -c 4 -j 4 -T 60 postgres
>
> # now test...
>
> With wal_recycle=on I reliably get around 1100TPS and vmstat -w 10
> shows numbers like this:
>
> procs     memory       page                    disks     faults          cpu
> r b w  avm   fre   flt  re  pi  po    fr   sr ad0 ad1    in     sy    cs us sy id
> 3 0 3 1.2G  3.1G  4496   0   0   0    52   76 144 138   607  84107 29713 55 17 28
> 4 0 3 1.2G  3.1G  2955   0   0   0    84   77 134 130   609  82942 34324 61 17 22
> 4 0 3 1.2G  3.1G  2327   0   0   0     0   77 114 125   454  83157 29638 68 15 18
> 5 0 3 1.2G  3.1G  1966   0   0   0    82   77  86  81   335  84480 25077 74 13 12
> 3 0 3 1.2G  3.1G  1793   0   0   0   533   74  72  68   310 127890 31370 77 16  7
> 4 0 3 1.2G  3.1G  1113   0   0   0   151   73  95  94   363 128302 29827 74 18  8
>
> With wal_recycle=off I reliably get around 1600TPS and vmstat -w 10
> shows numbers like this:
>
> procs     memory       page                    disks     faults          cpu
> r b w  avm   fre   flt  re  pi  po    fr   sr ad0 ad1    in     sy    cs us sy id
> 0 0 3 1.2G  3.1G   148   0   0   0   402   71  38  38   153  16668  5656 10  3 87
> 5 0 3 1.2G  3.1G  4527   0   0   0    50   73  28  27   123 123986 23373 68 15 17
> 5 0 3 1.2G  3.1G  3036   0   0   0   151   73  47  49   181 148014 29412 83 16  0
> 4 0 3 1.2G  3.1G  2063   0   0   0   233   73  56  54   200 143018 28699 81 17  2
> 4 0 3 1.2G  3.1G  1202   0   0   0    95   73  48  49   189 147276 29196 81 18  1
> 4 0 3 1.2G  3.1G   732   0   0   0     0   73  56  55   207 146805 29265 82 17  1
>
> I don't have time to investigate further for now and my knowledge of
> ZFS is superficial, but the patch seems to have a clear beneficial
> effect, reducing disk IOs and page faults on my little storage box.
> Obviously this isn't representative of a proper server environment, or
> some other OS, but it's a clue.  That surprised me... I was quietly
> hoping it was going to be 'oh, if you turn off
> compression and use 8kb it doesn't happen because the pages line up'.
> But nope.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
>
Hi Thomas,

Thanks for testing!  It's validating that you saw the same results.
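
As a usage sketch (assuming the setting keeps the name used in the benchmark
above, wal_recycle, and that it can be changed with a reload; both are
assumptions about the patch):

alter system set wal_recycle = off;
select pg_reload_conf();
show wal_recycle;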

-- Dave


Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > My advice is to write up a patent grant that allows all to use the
> > relevant patents royalty-free with a no-lawsuit covenant.  I.e., make
> > only defensive use of your patents.
> 
> How can one make defensive use of his patent if he allows everyone to
> use it royalty-free?  Can he use his patent for cross-licensing
> negotiation if some commercial database vendor accuses his company
> that PostgreSQL unintentionally infringes that vendor's patent?

https://en.wikipedia.org/wiki/Defensive_termination



RE: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Tsunakawa, Takayuki
From: Nico Williams [mailto:n...@cryptonector.com]
> You're proposing to include code that implements patented ideas with a
> suitable patent grant.  I would be free to not read the patent, but what
> if the code or documents mention the relevant patented algorithms?
> 
> If I come across something like this in the PG source code:
> 
> /* The following is covered by patents US#... and so on */

I got it.  Your concern makes sense.



> My advice is to write up a patent grant that allows all to use the
> relevant patents royalty-free with a no-lawsuit covenant.  I.e., make
> only defensive use of your patents.

How can one make defensive use of his patent if he allows everyone to use it 
royalty-free?  Can he use his patent for cross-licensing negotiation if some 
commercial database vendor accuses his company that PostgreSQL unintentionally 
infringes that vendor's patent?

Regards
Takayuki Tsunakawa






Re: patch to allow disable of WAL recycling

2018-07-11 Thread Andres Freund
Hi,

On 2018-07-10 14:15:30 -0600, Jerry Jelinek wrote:
>  Thanks to everyone who took the time to look at the patch and send me
> feedback.  I'm happy to work on improving the documentation of this new
> tunable to clarify when it should be used and the implications. I'm trying
> to understand more specifically what else needs to be done next. To
> summarize, I think the following general concerns were brought up.
> 
> 1) Disabling WAL recycling could have a negative performance impact on a
> COW filesystem if all WAL files could be kept in the filesystem cache.

> For #1, #2 and #3, I don't understand these concerns. It would be helpful
> if these could be more specific

We perform more writes (new files are zeroed, which needs to be
fsynced), and increase metadata traffic (creation of files), when not
recycling.

Regards,

Andres



Re: patch to allow disable of WAL recycling

2018-07-11 Thread David Pacheco
On Tue, Jul 10, 2018 at 1:34 PM, Alvaro Herrera 
wrote:

> On 2018-Jul-10, Jerry Jelinek wrote:
>
> > 2) Disabling WAL recycling reduces reliability, even on COW filesystems.
>
> I think the problem here is that WAL recycling in normal filesystems
> helps protect the case where filesystem gets full.  If you remove it,
> that protection goes out the window.  You can claim that people need to
> make sure to have available disk space, but this does become a problem
> in practice.  I think the thing to do is verify what happens with
> recycling off when the disk gets full; is it possible to recover
> afterwards?  Is there any corrupt data?  What happens if the disk gets
> full just as the new WAL file is being created -- is there a Postgres
> PANIC or something?  As I understand, with recycling on it is easy (?)
> to recover, there is no PANIC crash, and no data corruption results.
>


If the result of hitting ENOSPC when creating or writing to a WAL file was
that the database could become corrupted, then wouldn't that risk already
be present (a) on any system, for the whole period from database init until
the maximum number of WAL files was created, and (b) all the time on any
copy-on-write filesystem?

Thanks,
Dave


RE: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Tsunakawa, Takayuki
From: Nico Williams [mailto:n...@cryptonector.com]
> On Wed, Jul 11, 2018 at 01:03:44AM +, Tsunakawa, Takayuki wrote:
> > As a practical matter, when and where are you planning to post the
> > project policy?  How would you check and prevent patented code?
> 
> PG may need a contributor agreement.  All I can find on that is:
> 
> https://www.postgresql.org/message-id/flat/54337476.3040605%402ndquadr
> ant.com#9b1968ddc0fadfe225001adc1a821925

Yes, I thought we would probably need a CLA, but I was hesitant to mention
it...

This is somewhat off-topic, but a member of our IP department informed me that
the TPL only states the disclaimer for the University of California, not for the
PGDG or all copyright holders, unlike the Apache License.  Is this intended?


https://www.postgresql.org/about/licence/
--
IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR 
DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST 
PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF 
THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH 
DAMAGE. 

THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, 
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A 
PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND 
THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, 
SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. 
--


Regards
Takayuki Tsunakawa






Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)

2018-07-11 Thread Haribabu Kommi
On Thu, Jul 12, 2018 at 8:32 AM Thomas Munro 
wrote:

> On Thu, Jul 12, 2018 at 12:46 AM, Haribabu Kommi
>  wrote:
> >> > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote:
> >> >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show
> up
> >> >> as "reads" when in fact they are not reads at all:
> >> >>
> >> >> 1.  Relation extension, which in fact writes a zero-filled block.
> >> >> 2.  The RBM_ZERO_* modes, which provoke neither read nor write.
> >
> > I checked the patch and I agree with change 1).  Regarding change 2):
> > whether or not it zeroes the contents of the page, doesn't it still count
> > as a read?  Because if the page exists in the buffer pool, we count it as
> > a hit irrespective of the mode.  Am I missing something?
>
> Further down in the function you can see that there is no read()
> system call for the RBM_ZERO_* modes:
>
> if (mode == RBM_ZERO_AND_LOCK || mode ==
> RBM_ZERO_AND_CLEANUP_LOCK)
> MemSet((char *) bufBlock, 0, BLCKSZ);
> else
> {
> ...
> smgrread(smgr, forkNum, blockNum, (char *)
> bufBlock);
> ...
> }
>

Thanks for the details. I got your point. But we need to keep counting the
read operations in the RBM_ZERO_ON_ERROR case; excluding the others
is fine.
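
A minimal sketch of that accounting rule (an illustration only, not the actual
patch; the helper name counts_as_read is made up):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Should this buffer access be counted as a "read"?  RBM_ZERO_ON_ERROR
     * still calls smgrread() (the page is only zeroed if it fails
     * verification), so it stays in the read bucket; relation extension and
     * the RBM_ZERO_AND_* modes never issue a read at all.
     */
    static bool
    counts_as_read(bool isExtend, ReadBufferMode mode)
    {
        if (isExtend)
            return false;   /* extension writes a zero-filled block */
        if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
            return false;   /* page is zeroed in memory, no read() */
        return true;        /* RBM_NORMAL, RBM_NORMAL_NO_LOG, RBM_ZERO_ON_ERROR */
    }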


> I suppose someone might argue that even when it's not a hit and it's
> not a read, we might still want to count this buffer interaction in
> some other way.  Perhaps there should be a separate counter?  It may
> technically be a kind of cache miss, but it's nowhere near as
> expensive as a synchronous system call like read() so I didn't propose
> that.
>

Yes, I agree that we may need a new counter that counts the buffers that
are just allocated (no read and no write). But maybe that counter's value is
currently very small, so people are not interested.


> Some more on my motivation:  In our zheap prototype, when the system
> is working well and we have enough space, we constantly allocate
> zeroed buffer pages at the insert point (= head) of an undo log and
> drop pages at the discard point (= tail) in the background;
> effectively a few pages just go round and round via the freelist and
> no read() or write() syscalls happen.  That's something I'm very happy
> about and it's one of our claimed advantages over the traditional heap
> (which tends to read and dirty more pages), but EXPLAIN (BUFFERS)
> hides this virtuous behaviour when comparing with the traditional
> heap: it falsely and slanderously reports that zheap is reading undo
> pages when it is not.  Of course I don't intend to litigate zheap
> design in this thread, I just figured that since this accounting is
> wrong on principle and affects current PostgreSQL too (at least in
> theory) I would propose this little patch independently.  It's subtle
> enough that I wouldn't bother to back-patch it though.
>

OK. Maybe it is better to implement the buffer-allocation counter along with
zheap, to provide better buffer statistics?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Tips on committing

2018-07-11 Thread Peter Geoghegan
On Tue, Jul 10, 2018 at 9:14 PM, Noah Misch  wrote:
> My rule has been to add to my private checklist anytime I mail or push a patch
> containing a readily-checkable mistake.  I go through the checklist before
> mailing or pushing any patch.  It has things in common with your list, plus
> these:
>
> * Validate err*() calls against 
> https://www.postgresql.org/docs/devel/static/error-style-guide.html
> * Validate *printf calls for trailing newlines
>
> * Spellcheck the patch
>
> * Verify long lines are not better broken
> git diff origin/master | grep -E '^(\+|diff)' | sed 's/^+//' | expand -t4 | 
> awk "length > 78 || /^diff/"
>
> * Run pgindent on changed files; keep the changes I made necessary
>
> * Update version numbers, if needed
> CATALOG_VERSION_NO, PG_CONTROL_VERSION, XLOG_PAGE_MAGIC, PGSTAT_FILE_FORMAT_ID

I'm going to use some of these, myself.

>> I'll try to do that, but I'd still recommend personalizing it. A lot
>> of the stuff in there is specific to my own workflow and tool
>> preferences, and my own personal working style.
>
> I agree we won't all want the exact same checklist.  Still, it wouldn't hurt
> to have a wiki page of checklist entry ideas from which folks cherry-pick the
> entries they like.

You've convinced me that we should definitely have such a list. I've
put it on my TODO list.

-- 
Peter Geoghegan



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2018-07-11 Thread Peter Geoghegan
On Mon, Jul 24, 2017 at 11:12 AM, Peter Geoghegan  wrote:
> You can get about a 1/3 loss of space by inserting randomly, rather
> than inserting in sorted order, which is what REINDEX will more or
> less do for you. That's because random workloads almost entirely get
> 50:50 page splits, whereas sorted input will always split the
> rightmost page, and so will always get 90:10 splits. The space in the
> random case isn't exactly wasted; it's there for the taking, for key
> values that happen to fit on the page.

I looked into this again. I decided to see for myself what was up.
Actually, I used oltpbench [1], since that was the easiest TPC-C
benchmark available.

The order_line primary key may have been designed to be as unfriendly
as possible to implementations that don't have good suffix truncation.
This is also true of several other TPC-C indexes.

I initialized a scale 50 TPC-C database:

pg@bat:~/code/oltpbench$ ./oltpbenchmark -b tpcc -c
config/my_postgres_tpcc_config.xml --create=true --load=true
15:26:57,214 (DBWorkload.java:259) INFO  -
==

Benchmark: TPCC {com.oltpbenchmark.benchmarks.tpcc.TPCCBenchmark}
Configuration: config/sample_tpcc_config.xml
Type:  POSTGRES
Driver:org.postgresql.Driver
URL:   jdbc:postgresql:tpcc
Isolation: TRANSACTION_SERIALIZABLE
Scale Factor:  50.0

Note that there is no garbage for VACUUM to clean up just yet. There
haven't been any deletes or updates yet. And yet, order_line_pkey is
bloated. It's 783 MB, but if I run REINDEX it shrinks right down to
451 MB. This is on the master branch -- not with one of my patches.
You can also arrange to make the index much smaller if you force the
insertions to occur in a totally random order, rather than the order
that the benchmark actually inserts them.

The problem for us is that tuples are inserted in
clustered/monotonically increasing order for the last 3 columns, while
the first column (ol_w_id) is a low cardinality column whose values
appear in random order, more or less. This isn't actually unrealistic
- it makes sense that associated records would be inserted as
per-warehouse groups like this, while the order of the groups remains
unpredictable.

I am working on adding suffix truncation at the moment. Right now, my
patch doesn't help, because it isn't sophisticated enough about the
choice of split point (we need to care about suffix truncation, and
not just balancing space among the two halves of a split). It should
try to find a split point that maximizes the chances of a new pivot
tuple not having any low cardinality final column (ol_number).

We should be able to mostly avoid having the last column appear in tuples
that make it into the branch/internal nodes:

pg@tpcc[1452]=# select count(*) from order_line ;
   count

 15,001,784
(1 row)

pg@tpcc[1452]=# select count(distinct(ol_w_id, ol_d_id, ol_o_id)) from
order_line ;
   count
───
 1,500,000
(1 row)

As I said, the final ol_number column makes the keyspace unbalanced by
unnecessarily appearing in the internal/branch nodes:

pg@tpcc[1452]=# select count(distinct(ol_number)) from order_line ;
 count
───
15
(1 row)

Since there can only be 15 distinct values for any given (ol_w_id,
ol_d_id, ol_o_id, *), and since over 100 index tuples will fit on each
leaf page, we just have to pick a split point that's between (x, y, z,
15) and (x, y, z + 1, 1). This makes it legal to truncate away the
ol_number column, which allows us to balance the use of free space for
items that are inserted after the split, by later transactions.

[1] https://github.com/oltpbenchmark/oltpbench
-- 
Peter Geoghegan



Re: TRUNCATE tables referenced by FKs on partitioned tables

2018-07-11 Thread Alvaro Herrera
On 2018-Jul-11, Michael Paquier wrote:

> > alvherre=# truncate table prim, partfk;
> > ERROR:  cannot truncate a table referenced in a foreign key constraint
> > DETAIL:  Table "partfk" references "prim".
> > HINT:  Truncate table "partfk" at the same time, or use TRUNCATE ... 
> > CASCADE.
> 
> Your first and second queries are the same :)

Yeah, C failure :-(

Anyway, this patch seems to fix it, and adds what I think is appropriate
test coverage.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 07500a1d7d7e6e2a79514672a7b6441781baa1da Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 11 Jul 2018 18:54:00 -0400
Subject: [PATCH] fix truncate

---
 src/backend/catalog/heap.c |  7 +++-
 src/backend/commands/tablecmds.c   |  2 +-
 src/test/regress/expected/truncate.out | 75 ++
 src/test/regress/sql/truncate.sql  | 47 +
 4 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index d223ba8537..4cfc0c8911 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -3181,13 +3181,16 @@ heap_truncate_check_FKs(List *relations, bool 
tempTables)
 * Build a list of OIDs of the interesting relations.
 *
 * If a relation has no triggers, then it can neither have FKs nor be
-* referenced by a FK from another table, so we can ignore it.
+* referenced by a FK from another table, so we can ignore it.  For
+* partitioned tables, FKs have no triggers, so we must include them
+* anyway.
 */
foreach(cell, relations)
{
Relationrel = lfirst(cell);
 
-   if (rel->rd_rel->relhastriggers)
+   if (rel->rd_rel->relhastriggers ||
+   rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
oids = lappend_oid(oids, RelationGetRelid(rel));
}
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7c0cf0d7ee..22e81e712d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1421,7 +1421,7 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, 
List *relids_logged,
Oid*logrelids;
 
/*
-* Open, exclusive-lock, and check all the explicitly-specified 
relations
+* Check the explicitly-specified relations.
 *
 * In CASCADE mode, suck in all referencing relations as well.  This
 * requires multiple iterations to find indirectly-dependent relations. 
At
diff --git a/src/test/regress/expected/truncate.out 
b/src/test/regress/expected/truncate.out
index 735d0e862d..0c2ba173db 100644
--- a/src/test/regress/expected/truncate.out
+++ b/src/test/regress/expected/truncate.out
@@ -464,3 +464,78 @@ ERROR:  cannot truncate only a partitioned table
 HINT:  Do not specify the ONLY keyword, or use TRUNCATE ONLY on the partitions 
directly.
 TRUNCATE truncparted;
 DROP TABLE truncparted;
+-- foreign key on partitioned table: partition key is referencing column.
+-- Make sure truncate did execute on all tables
+CREATE FUNCTION tp_ins_data() RETURNS void LANGUAGE plpgsql AS $$
+  BEGIN
+   INSERT INTO truncprim VALUES (1), (100), (150);
+   INSERT INTO truncpart VALUES (1, 1), (100, 100), (150, 150);
+  END
+$$;
+CREATE FUNCTION tp_chk_data(OUT pktb regclass, OUT pkval int, OUT fktb 
regclass, OUT fkval int)
+  RETURNS SETOF record LANGUAGE plpgsql AS $$
+  BEGIN
+RETURN QUERY SELECT
+  pk.tableoid::regclass, pk.a, fk.tableoid::regclass, fk.a
+FROM truncprim pk FULL JOIN truncpart fk USING (a)
+ORDER BY 2, 4;
+  END
+$$;
+CREATE TABLE truncprim (a int PRIMARY KEY);
+CREATE TABLE truncpart (a int REFERENCES truncprim, b int, filler text)
+  PARTITION BY RANGE (a);
+CREATE TABLE truncpart_1 PARTITION OF truncpart FOR VALUES FROM (0) TO (100);
+CREATE TABLE truncpart_2 PARTITION OF truncpart FOR VALUES FROM (100) TO (200)
+  PARTITION BY RANGE (a);
+CREATE TABLE truncpart_2_1 PARTITION OF truncpart_2 FOR VALUES FROM (100) TO 
(150);
+CREATE TABLE truncpart_2_d PARTITION OF truncpart_2 DEFAULT;
+TRUNCATE TABLE truncprim;  -- should fail
+ERROR:  cannot truncate a table referenced in a foreign key constraint
+DETAIL:  Table "truncpart" references "truncprim".
+HINT:  Truncate table "truncpart" at the same time, or use TRUNCATE ... 
CASCADE.
+select tp_ins_data();
+ tp_ins_data 
+-
+ 
+(1 row)
+
+-- should truncate everything
+TRUNCATE TABLE truncprim, truncpart;
+select * from tp_chk_data();
+ pktb | pkval | fktb | fkval 
+--+---+--+---
+(0 rows)
+
+select tp_ins_data();
+ tp_ins_data 
+-
+ 
+(1 row)
+
+-- should truncate everything
+SET client_min_messages TO WARNING;-- suppress cascading notices
+TRUNCATE TABLE truncprim 

Segfault logical replication PG 10.4

2018-07-11 Thread Mai Peng
We discovered that our pg_wal partition was full a few days after setting up our 
first logical publication on a PG 10.4 instance.
After that, we could not synchronise our slave with the master; it triggered a 
segfault on the slave. We had to manually drop the subscription on the slave and 
the slot on the master.
Then, to find the cause of this bug, we stopped the connection between master 
and slave; after 30 minutes, the slave hit a segfault and could not 
synchronise.
Why can the slave not synchronise, without recreating the subscription from 
scratch, after the slot has been dropped?
How should we manage the replication, given that we use cloud VMs and run into 
network latency?

Here the details of conf and error logs:
Conf on master:
max_replication_slots = 10
max_sync_workers_per_subscription = 2
wal_receiver_timeout = 60s
wal_keep_segments = 1000
wal_receiver_status_interval = 10
wal_retrieve_retry_interval = 5s
max_logical_replication_workers = 4
Conf on slave:
same, except wal_keep_segments = 0

Error log on slave:
LOG: logical replication apply worker for subscription « " has started
DEBUG: connecting to publisher using connection string "postgresql://USER@IP"
LOG: worker process: logical replication worker for subscription 132253 (PID 
25359) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited abnormally 
and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat 
your command.
LOG: all server processes terminated; reinitializing
DEBUG: unregistering background worker "logical replication worker for 
subscription 132253"
LOG: database system was interrupted; last known up at 2018-07-11 21:50:56 UTC
DEBUG: checkpoint record is at 0/7DBFEF10
DEBUG: redo record is at 0/7DBFEF10; shutdown TRUE
DEBUG: next transaction ID: 0:93714; next OID: 140237
DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
DEBUG: oldest unfrozen transaction ID: 548, in database 1
DEBUG: oldest MultiXactId: 1, in database 1
DEBUG: commit timestamp Xid oldest/newest: 0/0
DEBUG: transaction ID wrap limit is 2147484195, limited by database with OID 1
DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
DEBUG: starting up replication slots
LOG: recovered replication state of node 2 to 0/0
LOG: recovered replication state of node 3 to 0/0
LOG: recovered replication state of node 4 to 0/0
LOG: recovered replication state of node 5 to 56A5/29ACA918
LOG: database system was not properly shut down; automatic recovery in progress



THANK YOU

Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?

2018-07-11 Thread Tom Lane
I wrote:
> I propose to run through the system operator classes, find any for which
> the comparison function isn't marked leakproof but the operators are,
> and fix them.  This is clearly appropriate for HEAD and maybe it's not
> too late to force an initdb for v11 --- thoughts?

I did that for the built-in btree opclasses.  I decided that it's probably
not worth forcing an initdb in v11 for, though.  In principle, losing
selectivity estimates because of non-leakproof functions should only
happen in queries that are going to fail at runtime anyway.  The real
problem that ought to be addressed and perhaps back-patched is this:

> Another question that could be raised is why we are refusing to use
> stats for a child table when the caller has select on the parent.
> It's completely trivial to extract data from a child table if you
> have select on the parent, so it seems like we are checking the
> wrong table's privileges.

regards, tom lane



Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)

2018-07-11 Thread Thomas Munro
On Thu, Jul 12, 2018 at 12:46 AM, Haribabu Kommi
 wrote:
>> > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote:
>> >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up
>> >> as "reads" when in fact they are not reads at all:
>> >>
>> >> 1.  Relation extension, which in fact writes a zero-filled block.
>> >> 2.  The RBM_ZERO_* modes, which provoke neither read nor write.
>
> I checked the patch and I agree with change 1).  Regarding change 2):
> whether or not it zeroes the contents of the page, doesn't it still count
> as a read?  Because if the page exists in the buffer pool, we count it as
> a hit irrespective of the mode.  Am I missing something?

Further down in the function you can see that there is no read()
system call for the RBM_ZERO_* modes:

if (mode == RBM_ZERO_AND_LOCK || mode ==
RBM_ZERO_AND_CLEANUP_LOCK)
MemSet((char *) bufBlock, 0, BLCKSZ);
else
{
...
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
...
}

I suppose someone might argue that even when it's not a hit and it's
not a read, we might still want to count this buffer interaction in
some other way.  Perhaps there should be a separate counter?  It may
technically be a kind of cache miss, but it's nowhere near as
expensive as a synchronous system call like read() so I didn't propose
that.

Some more on my motivation:  In our zheap prototype, when the system
is working well and we have enough space, we constantly allocate
zeroed buffer pages at the insert point (= head) of an undo log and
drop pages at the discard point (= tail) in the background;
effectively a few pages just go round and round via the freelist and
no read() or write() syscalls happen.  That's something I'm very happy
about and it's one of our claimed advantages over the traditional heap
(which tends to read and dirty more pages), but EXPLAIN (BUFFERS)
hides this virtuous behaviour when comparing with the traditional
heap: it falsely and slanderously reports that zheap is reading undo
pages when it is not.  Of course I don't intend to litigate zheap
design in this thread, I just figured that since this accounting is
wrong on principle and affects current PostgreSQL too (at least in
theory) I would propose this little patch independently.  It's subtle
enough that I wouldn't bother to back-patch it though.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-11 Thread Nico Williams
On Wed, Jul 11, 2018 at 03:13:30PM -0400, Alvaro Herrera wrote:
> On 2018-Jun-06, Nico Williams wrote:
> > I've finally gotten around to rebasing this patch and making the change
> > that was requested, which was: merge the now-would-be-three deferral-
> > related bool columns in various pg_catalog tables into one char column.
> > 
> > Instead of (deferrable, initdeferred, alwaysdeferred), now there is just
> > (deferral).
> 
> Nice stuff.
> 
> Please add #defines for the chars in pg_constraint.h instead of using
> the values directly in the source.  Also, catalogs.sgml has a typo in
> one of the values.
> 
> What happens if you do SET CONSTRAINTS ALL IMMEDIATE and you have one of
> these constraints?  I expect that it silently does not alter that
> constraint, but you didn't change that manpage.

Correct, that's the idea, that it should be possible to write deferred
constraints/triggers which users cannot make immediate.  For example, an
audit extension that logs changes via FOR EACH ROW ALWAYS DEFERRED, or
my PoC COMMIT TRIGGER implementation (which depends on deferred
triggers).

I missed that there is a page for SET CONSTRAINTS!  I'll update it.

> I suggest not to include psql/tab-complete changes with your main patch.
> If you want to update it, split it out to its own patch.  Committer can
> choose to include it in one commit or not (I'm mildly inclined not to,
> but I'm probably inconsistent about it), but for review IMO it's better
> not to mix things -- It's way too easy to get entangled in silly details
> while editing that code, and it's not critical anyway.

OK, sure, though, of course, the committer could always leave that out
themselves, no?

To me though it seems that the change should be complete.

> > Incidentally, I had to do commit-by-commit rebasing to make the rebase
> > easier.  I have a shell function I use for this, if anyone wants a copy
> > of it -- sometimes it's much easier to do this than to do one huge jump.
> 
> I've done this manually a few times.  Please share, I'm curious.

OK, attached is my latest version of that script, though this one is a
bit changed from the one I used.  This version tries to be faster / more
efficient by first doing 1 commit, then 2, then 3, and so on, and on
conflict aborts and halves N to try again.  The idea is to only have to
merge conflicts at each commit where conflicts arise, never resolving
conflicts across more than one commit -- this makes it much easier to
reason about conflicts!

Note that the script is actually a shell function, and that it keeps
state in shell variables.  A better implementation would do the sort of
thing that git(1) itself does to keep rebase state.

Nico
-- 
# Based on a shell function by Viktor Dukhovni
#
# slowrebase BRANCH_TO_REBASE ONTO
function slowrebase {
typeset b N

if (($# > 0)) && [[ $1 = -h || $1 = --help ]]; then
printf 'Usage: slowrebase BRANCH_TO_REBASE ONTO_HEAD\n'
printf '   slowrebase # to continue after resolving conflicts\n'
printf '\n\tslowrebase is a shell function that uses the following\n'
printf '\tglobal variables to keep state: $S $T $B ${C[@]}\n'
printf '\t$slowrebase_window_sz\n'
printf '\tDO NOT CHANGE THOSE VARIABLES.\n'
return 0
elif (($# > 0 && $# != 2)); then
printf 'Usage: slowrebase BRANCH_TO_REBASE ONTO_HEAD\n' 1>&2
printf '   slowrebase # to continue after resolving conflicts\n'
printf '\n\tslowrebase is a shell function that uses the following\n' 
1>&2
printf '\tglobal variables to keep state: $S $T $B ${C[@]}\n' 1>&2
printf '\t$slowrebase_window_sz\n' 1>&2
printf '\tDO NOT CHANGE THOSE VARIABLES.\n' 1>&2
return 1
fi

if (($# == 2)); then
slowrebase_window_sz=1
S=$1
T=$2
B=$(git merge-base "$S" "$T")
C=( $(git log --oneline "$B".."$2" | awk '{print $1}') )
set --

# Prep
git log -n1 "$S" > /dev/null || return 1
if [[ $(git log --oneline -n1 HEAD) != $(git log --oneline -n1 "$S") 
]]; then
if (($(git status -sb | wc -l) != 1)); then
printf 'Error: please clean your workspace\n'
return 1
fi
git checkout "$S"
elif (($(git status -sbuno | wc -l) != 1)); then
printf 'Error: please clean your workspace\n'
return 1
fi

# Fall through to get started
elif [[ $(git log --oneline -n1 HEAD) != $(git log --oneline -n1 "$S") ]] &&
   ! git rebase --continue; then
N=$(( ${#C[@]} - slowrebase_window_sz ))
printf '\nConflicts while rebasing $S (%s) slowly onto $T (%s)\n' "$S" 
"$T"
printf '${C[@]} has the commits left to process (%s left)\n' $N
printf '$B is the commit we are rebasing onto right now: %s\n' "$B"
printf '$b is the previous commit we had already 

Re: make installcheck-world in a clean environment

2018-07-11 Thread Tom Lane
Alexander Lakhin  writes:
> 06.07.2018 00:39, Peter Eisentraut wrote:
>> Exactly what order of steps are you executing that doesn't work?

> In Centos 7, using the master branch from git:
> ./configure --enable-tap-tests
> make install
> make install -C contrib
> chown -R postgres:postgres /usr/local/pgsql/
> /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
> /usr/local/pgsql/bin/pg_ctl start -l logfile -D /usr/local/pgsql/data
make clean
> # Also you can just install binary packages to get the same state.

> make installcheck-world
> # This check fails.

I do not think that should be expected to work.  It would require that
"make installcheck" first invoke "make all" (to rebuild the stuff you
threw away with "make clean"), which is rather antithetical to its
purpose.  Specifically, installcheck is supposed to be checking something
you already built; so having it do a fresh build seems to introduce
version-skew hazards that we don't need.

regards, tom lane



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-07-11 Thread Fabien COELHO





can we try something like this?

PGBENCH_ERROR_START(DEBUG_FAIL)
{
PGBENCH_ERROR("client %d repeats the failed transaction (try %d",


Argh, no? I was thinking of something much more trivial:

   pgbench_error(DEBUG, "message format %d %s...", 12, "hello world");

If you really need some complex dynamic buffer, and I would prefer 
that you avoid that, then the fallback is:


   if (level >= DEBUG)
   {
  initPQstuff();
  ...
  pgbench_error(DEBUG, "fixed message... %s\n", msg);
  freePQstuff();
   }

The point is to avoid building the message, with dynamic allocation and so
on, if in the end it is not used.
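
As a rough sketch of that guarded pattern (a standalone illustration; the
log_level variable and DEBUG value are assumptions, not pgbench code):

    #include <stdarg.h>
    #include <stdio.h>

    #define DEBUG 1
    static int  log_level = DEBUG;      /* as if --debug had been given */

    /* Print only when the level is enabled; otherwise do no formatting work. */
    static void
    pgbench_error(int level, const char *fmt, ...)
    {
        va_list ap;

        if (level > log_level)
            return;             /* message discarded before any allocation */
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }

    int
    main(void)
    {
        pgbench_error(DEBUG, "message format %d %s...\n", 12, "hello world");
        return 0;
    }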

--
Fabien.



Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-11 Thread Alvaro Herrera
On 2018-Jun-06, Nico Williams wrote:

> I've finally gotten around to rebasing this patch and making the change
> that was requested, which was: merge the now-would-be-three deferral-
> related bool columns in various pg_catalog tables into one char column.
> 
> Instead of (deferrable, initdeferred, alwaysdeferred), now there is just
> (deferral).

Nice stuff.

Please add #defines for the chars in pg_constraint.h instead of using
the values directly in the source.  Also, catalogs.sgml has a typo in
one of the values.

What happens if you do SET CONSTRAINTS ALL IMMEDIATE and you have one of
these constraints?  I expect that it silently does not alter that
constraint, but you didn't change that manpage.

I suggest not to include psql/tab-complete changes with your main patch.
If you want to update it, split it out to its own patch.  Committer can
choose to include it in one commit or not (I'm mildly inclined not to,
but I'm probably inconsistent about it), but for review IMO it's better
not to mix things -- It's way too easy to get entangled in silly details
while editing that code, and it's not critical anyway.

> Incidentally, I had to do commit-by-commit rebasing to make the rebase
> easier.  I have a shell function I use for this, if anyone wants a copy
> of it -- sometimes it's much easier to do this than to do one huge jump.

I've done this manually a few times.  Please share, I'm curious.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Shouldn't validateForeignKeyConstraint() reset memory context?

2018-07-11 Thread Andres Freund
Hi,

while looking at the pluggable storage patch I noticed that
validateForeignKeyConstraint() calls RI_FKey_check() for each row
without resetting a memory / expression context.  There's not too much
leakage in the called code, but there's some I think.

Greetings,

Andres Freund



Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints

2018-07-11 Thread Nico Williams
On Tue, Jun 26, 2018 at 04:54:13PM -0400, Robbie Harwood wrote:
> Nico Williams  writes:
> 
> > [Re-send; first attempt appears to have hit /dev/null somewhere.  My
> > apologies if you get two copies.]
> >
> > I've finally gotten around to rebasing this patch and making the change
> > that was requested, which was: merge the now-would-be-three deferral-
> > related bool columns in various pg_catalog tables into one char column.
> >
> > Instead of (deferrable, initdeferred, alwaysdeferred), now there is just
> > (deferral).
> 
> This design seems correct to me.  I have a couple questions inline, and
> some nits to go with them.

Thanks.  Replies below.

> > All tests (make check) pass.
> 
> Thank you for adding tests!

Well, yeah :)

> > diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> > index 3ed9021..e82e39b 100644
> > --- a/doc/src/sgml/catalogs.sgml
> > +++ b/doc/src/sgml/catalogs.sgml
> > @@ -2239,17 +2239,15 @@ SCRAM-SHA-256$iteration 
> > count:
> >   
> >  
> >   
> > -  condeferrable
> > -  bool
> > -  
> > -  Is the constraint deferrable?
> > - 
> > -
> > - 
> > -  condeferred
> > -  bool
> > +  condeferral
> > +  char
> >
> > -  Is the constraint deferred by default?
> > +  Constraint deferral option:
> > +a = always deferred,
> > +d = deferrable,
> > +d = deferrable initially deferred,
> 
> From the rest of the code, I think this is supposed to be 'i', not 'd'.

Oh yes, good catch.

> > +n = not deferrable
> > +  
> >   
> >  
> >   
> 
> > diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
> > index 8b276bc..795a7a9 100644
> > --- a/src/backend/catalog/index.c
> > +++ b/src/backend/catalog/index.c
> > @@ -1070,6 +1070,7 @@ index_create(Relation heapRelation,
> >  
> > recordDependencyOn(, , 
> > deptype);
> > }
> > +   Assert((flags & INDEX_CREATE_ADD_CONSTRAINT) == 0);
> 
> What does this ensure, and why is it in this part of the code?  We're in
> the `else` branch here - isn't this always the case (i.e., Assert can
> never fire), since `flags` isn't manipulated in this block?

Oy, nothing.  I think that's a cut-n-paste error.

I'll remove it.

> > }
> >  
> > /* Store dependency on parent index, if any */
> 
> > diff --git a/src/backend/catalog/information_schema.sql 
> > b/src/backend/catalog/information_schema.sql
> > index f4e69f4..bde6199 100644
> > --- a/src/backend/catalog/information_schema.sql
> > +++ b/src/backend/catalog/information_schema.sql
> > @@ -891,10 +891,14 @@ CREATE VIEW domain_constraints AS
> > CAST(current_database() AS sql_identifier) AS domain_catalog,
> > CAST(n.nspname AS sql_identifier) AS domain_schema,
> > CAST(t.typname AS sql_identifier) AS domain_name,
> > -   CAST(CASE WHEN condeferrable THEN 'YES' ELSE 'NO' END
> > +   CAST(CASE WHEN condeferral = 'n' THEN 'NO' ELSE 'YES' END
> >   AS yes_or_no) AS is_deferrable,
> > -   CAST(CASE WHEN condeferred THEN 'YES' ELSE 'NO' END
> > +   CAST(CASE WHEN condeferral = 'i' OR condeferral = 'a' THEN 
> > 'YES' ELSE 'NO' END
> >   AS yes_or_no) AS initially_deferred
> > +  /*
> > +   * XXX Can we add is_always_deferred here?  Are there
> > +   * standards considerations?
> > +   */
> 
> I'm not familiar enough to speak to this.  Hopefully someone else can.
> Absent other input, I think we should add is_always_deferred.  (And if
> we were building this today, probably just expose a single character
> instead of three booleans.)

I had found the answer ("yes") in the history of this file but forgot to
remove the comment while rebasing.  I'll remove the comment.

> >  FROM pg_namespace rs, pg_namespace n, pg_constraint con, pg_type t
> >  WHERE rs.oid = con.connamespace
> >AND n.oid = t.typnamespace
> 
> > diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
> > index 57519fe..41dc6a4 100644
> > --- a/src/backend/commands/trigger.c
> > +++ b/src/backend/commands/trigger.c
> > @@ -3632,6 +3627,7 @@ typedef struct AfterTriggerSharedData
> > TriggerEvent ats_event; /* event type indicator, see trigger.h 
> > */
> > Oid ats_tgoid;  /* the trigger's ID */
> > Oid ats_relid;  /* the relation it's on 
> > */
> > +   boolats_alwaysdeferred; /* whether this can be 
> > deferred */
> 
> "Whether this must be deferred"?  Also, I'm not sure what this is for -
> it doesn't seem to be used anywhere.

Ah, this became unused during the rebase.  I'll remove it.

> > CommandId   ats_firing_id;  /* ID for firing cycle */
> > struct AfterTriggersTableData *ats_table;   /* transition table 
> > access */
> >  } 

Re: Shared buffer access rule violations?

2018-07-11 Thread Asim R P
On Tue, Jul 10, 2018 at 8:33 PM, Tom Lane  wrote:
> Asim R P  writes:
>
>> One can find several PageInit() calls with no content lock held.  See,
>> for example:
>
>> fill_seq_with_data()
>
> That would be for a relation that no one else can even see yet, no?

Yes, when the sequence is being created.  No, when the sequence is
being reset, in ResetSequence().

>
>> vm_readbuf()
>> fsm_readbuf()
>
> In these cases I'd imagine that the I/O completion interlock is what
> is preventing other backends from accessing the buffer.
>

What is the I/O completion interlock?  I see no difference between
initializing a visimap/fsm page and initializing a standard heap page.  For
standard heap pages, the code currently acquires the buffer pin as
well as the content lock for initialization.
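
For comparison, the usual pattern for initializing a standard heap-style page
looks roughly like this (a simplified sketch; error handling, WAL-logging and
the critical section are omitted):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    static void
    init_new_page(Relation rel)
    {
        Buffer      buf = ReadBuffer(rel, P_NEW);   /* pins the buffer */
        Page        page;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);     /* content lock before init */
        page = BufferGetPage(buf);
        PageInit(page, BufferGetPageSize(buf), 0);
        MarkBufferDirty(buf);
        UnlockReleaseBuffer(buf);
    }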


>> Moreover, fsm_vacuum_page() performs
>> "PageGetContents(page))->fp_next_slot = 0;" without content lock.
>
> That field is just a hint, IIRC, and the possibility of a torn read
> is explicitly not worried about.

Yes, that's a hint.  And ignoring the possibility of a torn page doesn't
result in checksum failures, because fsm_readbuf() passes RBM_ZERO_ON_ERROR
to the buffer manager.  The page will be zeroed out in the event of a
checksum failure.

Asim



Re: [PATCH] btree_gist: fix union implementation for variable length columns

2018-07-11 Thread Pavel Raiskup
On Wednesday, July 11, 2018 7:26:40 PM CEST Tom Lane wrote:
> Pavel Raiskup  writes:
> > On Monday, July 9, 2018 7:41:59 PM CEST Tom Lane wrote:
> >> Hi Pavel!  For patches that purport to resolve bugs, we usually like to
> >> add a regression test case that demonstrates the bug in unpatched code.
> >> Can you provide a small test case that does so?  (The BZ you pointed to
> >> doesn't seem to address this...)
> 
> > Turns out the problem is only related to bit/bit varying type (so the
> > patch comments need to be reworded properly, at least) since those are the
> > only types which have implemented the f_l2n() callback.
> 
> What I'm failing to wrap my head around is why this code exists at all.
> AFAICS, gbt_bit_xfrm just forces the bitstring to be zero-padded out to
> an INTALIGN boundary, which it wouldn't necessarily be otherwise.
> But why bother?  That should have no effect on the behavior of bit_cmp().

The gbt_bit_xfrm() method actually also skips the varbit header (the number of
bits, stored in VarBit.bit_len); the magic is done by the VARBITS macro.  Once
the header is dropped, it's OK to just use the existing byteacmp() for the
bit-array comparison.
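
To illustrate the idea (an illustrative sketch only, not the btree_gist
source; the function name is made up):

    #include "postgres.h"
    #include "utils/varbit.h"

    /*
     * Copy only the bit payload of a VarBit into a plain bytea, dropping the
     * bit_len header, so that a byte-wise comparison of two such values looks
     * at the bit data itself rather than at the length header.
     */
    static bytea *
    bits_without_header(VarBit *vb)
    {
        Size        len = VARBITBYTES(vb);
        bytea      *out = (bytea *) palloc(VARHDRSZ + len);

        SET_VARSIZE(out, VARHDRSZ + len);
        memcpy(VARDATA(out), VARBITS(vb), len);
        return out;
    }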

Pavel






Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-07-11 Thread Alvaro Herrera
Just a quick skim while refreshing what were those error reporting API
changes about ...

On 2018-May-21, Marina Polyakova wrote:

> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
> - a patch for the RandomState structure (this is used to reset a client's
> random seed during the repeating of transactions after
> serialization/deadlock failures).

LGTM, though I'd rename the random_state struct members so that it
wouldn't look as confusing.  Maybe that's just me.

> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
> - a patch for the Variables structure (this is used to reset client
> variables during the repeating of transactions after serialization/deadlock
> failures).

Please don't allocate Variable structs one by one.  The first time, allocate
some decent number (say 8) and then enlarge by doubling the size.  That
way you save realloc overhead.  We use this technique everywhere else,
no reason to do differently here.  Other than that, LGTM.
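
A minimal sketch of that growth strategy (the struct and field names here are
assumptions for illustration, not the patch's actual definitions):

    #include "postgres_fe.h"
    #include "common/fe_memutils.h"

    typedef struct Variable     /* illustrative only */
    {
        char       *name;
        char       *svalue;
    } Variable;

    typedef struct Variables    /* illustrative only */
    {
        Variable   *vars;       /* array of variables */
        int         nvars;      /* number of slots in use */
        int         max_vars;   /* allocated size of vars[] */
    } Variables;

    /* Grow vars[] geometrically instead of reallocating one element at a time. */
    static void
    ensure_var_capacity(Variables *v, int needed)
    {
        if (needed <= v->max_vars)
            return;
        if (v->max_vars == 0)
            v->max_vars = 8;            /* start with a decent number */
        while (v->max_vars < needed)
            v->max_vars *= 2;           /* then double */
        v->vars = (Variable *) pg_realloc(v->vars,
                                          v->max_vars * sizeof(Variable));
    }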

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Add missing type conversion functions for PL/Python

2018-07-11 Thread Heikki Linnakangas

On 26/03/18 19:07, Nikita Glukhov wrote:

Attached is the fixed 3rd version of the patch:


Thanks, I'm reviewing this now. Nice speedup!

There is no test coverage for some of the added code. You can get a code 
coverage report with:


./configure --enable-coverage ...
make
make -C src/pl/plpython check
make coverage-html

That produces a code coverage report in coverage/index.html. Please look 
at the coverage of the new functions, and add tests to 
src/pl/plpython/sql/plpython_types.sql so that all the new code is covered.


In some places, where you've already checked the object type e.g. with 
PyFloat_Check(), I think you could use the less safe variants, like 
PyFloat_AS_DOUBLE() instead of PyFloat_AsDouble(). Since this patch is 
all about performance, seems worth doing.


Some of the conversions seem a bit questionable. For example:


/*
 * Convert a Python object to a PostgreSQL float8 datum directly.
 * If can not convert it directly, fallback to PLyObject_ToScalar
 * to convert it.
 */
static Datum
PLyObject_ToFloat8(PLyObToDatum *arg, PyObject *plrv,
   bool *isnull, bool inarray)
{
if (plrv == Py_None)
{
*isnull = true;
return (Datum) 0;
}

if (PyFloat_Check(plrv) || PyInt_Check(plrv) || PyLong_Check(plrv))
{
*isnull = false;
return Float8GetDatum((double)PyFloat_AsDouble(plrv));
}

return PLyObject_ToScalar(arg, plrv, isnull, inarray);
}


The conversion from Python int to C double is performed by 
PyFloat_AsDouble(). But there's no error checking. And wouldn't 
PyLong_AsDouble() be more appropriate in that case, since we already 
checked the python type?
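
For instance, a hedged sketch of an error-checked conversion (illustrative
only, not the patch; assumes Python 3 semantics):

    #include "postgres.h"
    #include "plpython.h"

    /*
     * Convert a Python number to a C double with error checking, using the
     * exact accessor when the type is already known.
     */
    static double
    number_to_double(PyObject *obj)
    {
        double      val;

        if (PyFloat_Check(obj))
            val = PyFloat_AS_DOUBLE(obj);   /* exact type, cannot fail */
        else
            val = PyLong_AsDouble(obj);     /* may overflow and set an exception */

        if (PyErr_Occurred())
        {
            PyErr_Clear();
            elog(ERROR, "could not convert Python number to float8");
        }
        return val;
    }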


- Heikki



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-07-11 Thread Alvaro Herrera
On 2018-Jul-11, Marina Polyakova wrote:

> can we try something like this?
> 
> PGBENCH_ERROR_START(DEBUG_FAIL)
> {
>   PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
> st->id, st->retries + 1);
>   if (max_tries)
>   PGBENCH_ERROR("/%d", max_tries);
>   if (latency_limit)
>   {
>   PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
> getLatencyUsed(st, ));
>   }
>   PGBENCH_ERROR(")\n");
> }
> PGBENCH_ERROR_END();

I didn't quite understand what these PGBENCH_ERROR() functions/macros
are supposed to do.  Care to explain?

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: In pageinspect, perform clean-up after testing gin-related functions

2018-07-11 Thread Tom Lane
Andres Freund  writes:
> On 2018-07-11 12:56:49 +0530, Amit Kapila wrote:
>> Yeah, it is good practice to drop the objects at the end.  It is
>> strange that original commit adfb81d9e1 has this at the end of the
>> test, but a later commit 367b99bbb1 by Tom has removed the Drop
>> statement.  AFAICS, this is just a silly mistake, but I might be
>> missing something.  Tom, do you remember any reason for doing so?  If
>> not, then I think we can revert back that change (aka commit Kuntal's
>> patch).

> We actually sometimes intentionally want to persist objects past the end
> of the test. Allows to test pg_dump / pg_upgrade. Don't know whether
> that's the case here, but it's worthwhile to note.

I don't think our pg_dump testbed makes any use of contrib regression
tests, so that's not the reason here.  I believe I took out the DROP
because it made it impossible to do additional manual tests after the end
of an installcheck run without laboriously re-creating the test table.

In other words, I disagree with Amit's opinion that it's good practice
to drop everything at the end of a test script.  There are often good
reasons to leave the objects available for later use.

regards, tom lane



Re: no partition pruning when partitioning using array type

2018-07-11 Thread Alvaro Herrera
On 2018-Jul-11, Amit Langote wrote:

> On 2018/07/11 13:12, Alvaro Herrera wrote:
> > On 2018-Jul-11, Amit Langote wrote:
> > 
> >> What's the solution here then?  Prevent domains as partition key?
> > 
> > Maybe if a domain is used in a partition key somewhere, prevent
> > constraints from being added?
> 
> Maybe, but I guess you mean only prevent adding such a constraint
> after-the-fact.

Yeah, any domain constraints added before won't be a problem.  Another
angle on this problem is to verify partition bounds against the domain
constraint being added; if they all pass, there's no reason to reject
the constraint.

But I worry that Tom was using domain constraints as just an example of
a more general problem that we can get into.  

> If a domain has a constraint before creating any partitions based on the
> domain, then partition creation command would check that the partition
> bound values satisfy those constraints.

Of course.

> > Another thing worth considering: are you prevented from dropping a
> > domain that's used in a partition key?  If not, you get an ugly message
> > when you later try to drop the table.
> 
> Yeah, I was about to write a message about that.  I think we need to teach
> RemoveAttributeById, which dependency.c calls when dropping the
> referencing domain with cascade option, to abort if the attribute passed
> to it belongs to the partition key of the input relation.

Actually, here's another problem: why are we letting a column of a
domain type become a partition key, if you'll never be able to create a
partition?  It seems that for pg11 the right reaction is to check
the partition key and reject this case.

I'm not sure of this implementation -- I couldn't find any case where we
reject the deletion on the function called from doDeletion (and we don't
have a preliminary phase on which to check for this, either).  Maybe we
need a pg_depend entry to prevent this, though it seems a little silly.
Maybe we should add a preliminary verification phase in
deleteObjectsInList(), where we invoke object-type-specific checks.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] btree_gist: fix union implementation for variable length columns

2018-07-11 Thread Tom Lane
Pavel Raiskup  writes:
> On Monday, July 9, 2018 7:41:59 PM CEST Tom Lane wrote:
>> Hi Pavel!  For patches that purport to resolve bugs, we usually like to
>> add a regression test case that demonstrates the bug in unpatched code.
>> Can you provide a small test case that does so?  (The BZ you pointed to
>> doesn't seem to address this...)

> Turns out the problem is only related to bit/bit varying type (so the
> patch comments need to be reworded properly, at least) since those are the
> only types which have implemented the f_l2n() callback.

What I'm failing to wrap my head around is why this code exists at all.
AFAICS, gbt_bit_xfrm just forces the bitstring to be zero-padded out to
an INTALIGN boundary, which it wouldn't necessarily be otherwise.  But
why bother?  That should have no effect on the behavior of bit_cmp().
So I'm speculating that the reason nobody has noticed a problem is that
there is no problem because this code is useless and could be ripped out.

It would be easier to figure this out if the btree_gist code weren't
so desperately undocumented.  Teodor, do you remember why it's like
this?

regards, tom lane



Costing bug in hash join logic for semi joins

2018-07-11 Thread RK Korlapati
There is a costing bug in the hash join logic that seems to have been introduced
by the patch for the inner_unique enhancements (commit
9c7f5229ad68d7e0e4dd149e3f80257893e404d4). Specifically, "hashjointuples",
which tracks the number of matches for the hash clauses, is computed wrongly in
the inner_unique scenario. This causes a lot of semi-joins that are turned into
inner joins with a unique inner side to be costed incorrectly. The function
final_cost_hashjoin() has special handling to cost semi/anti joins, accounting
for the early stop after the first match. The above commit extended this
handling to the inner_unique scenario as well. However, the "hashjointuples"
computation was not updated to handle this new scenario, so it incorrectly falls
into the anti-join logic and assumes unmatched rows. The fix is to handle the
inner_unique case when computing "hashjointuples".  Here is an outline of the
code that shows the bug.

void
final_cost_hashjoin(PlannerInfo *root, HashPath *path,
                    JoinCostWorkspace *workspace,
                    JoinPathExtraData *extra)
{
    ...
    /* CPU costs */

    if (path->jpath.jointype == JOIN_SEMI ||
        path->jpath.jointype == JOIN_ANTI ||
        extra->inner_unique)
    {
        ...
        /* Get # of tuples that will pass the basic join */
        if (path->jpath.jointype == JOIN_SEMI)
            hashjointuples = outer_matched_rows;
        else
            hashjointuples = outer_path_rows - outer_matched_rows;
    }
    else
    {
        ...
    }
}
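
A hedged sketch of the suggested fix (not the committed code): with a provably
unique inner side, each matched outer row yields exactly one join row, just as
in the JOIN_SEMI case, so only JOIN_ANTI should count the unmatched outer rows.

    #include "postgres.h"
    #include "nodes/nodes.h"

    static double
    sketch_hashjointuples(JoinType jointype,
                          double outer_path_rows, double outer_matched_rows)
    {
        if (jointype == JOIN_ANTI)
            return outer_path_rows - outer_matched_rows;

        /* JOIN_SEMI, or an inner_unique join costed through this branch */
        return outer_matched_rows;
    }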

Thanks, RK (Salesforce)


Re: cursors with prepared statements

2018-07-11 Thread Heikki Linnakangas

On 07/06/18 22:42, Peter Eisentraut wrote:

I have developed a patch that allows declaring cursors over prepared
statements:

 DECLARE cursor_name CURSOR FOR prepared_statement_name
[ USING param, param, ... ]

This is an SQL standard feature.  ECPG already supports it (with
different internals).

Internally, this just connects existing functionality in different ways,
so it doesn't really introduce anything new.

One point worth pondering is how to pass the parameters of the prepared
statements.  The actual SQL standard syntax would be

 DECLARE cursor_name CURSOR FOR prepared_statement_name;
 OPEN cursor_name USING param, param;

But since we don't have the OPEN statement in direct SQL, it made sense
to me to attach the USING clause directly to the DECLARE statement.


Hmm. I'm not excited about adding PostgreSQL-extensions to the SQL 
standard. It's confusing, and risks conflicting with future additions to 
the standard. ECPG supports the actual standard syntax, with OPEN, 
right? So this wouldn't be consistent with ECPG, either.



Curiously, the direct EXECUTE statement uses the non-standard syntax

 EXECUTE prep_stmt (param, param);

instead of the standard

 EXECUTE prep_stmt USING param, param;

I tried to consolidate this.  But using

 DECLARE c CURSOR FOR p (foo, bar)

leads to parsing conflicts (and looks confusing?),


How about

DECLARE c CURSOR FOR EXECUTE p (foo, bar)

? As a user, I'm already familiar with the "EXECUTE p (foo, bar)" 
syntax, so that's what I would intuitively try to use with DECLARE as 
well. In fact, I think I tried doing just that once, and was 
disappointed that it didn't work.



and instead allowing
EXECUTE + USING leads to a mess in the ECPG parser that exhausted me.
So I'm leaving it as is for now and might give supporting EXECUTE +
USING another try later on.


The attached patch seems to do the trick of allowing EXECUTE + USING. 
I'm not sure this is worth the trouble, though, since EXECUTE as a plain 
SQL command is a PostgreSQL extension anyway.


This also adds a test case for the existing "EXECUTE  ()" 
syntax in ECPG. The current ECPG parsing of that is actually a bit 
weird, it allows "EXECUTE stmt (:param1) USING :param2", which seems 
unintentional. This patch rejects that syntax.


- Heikki
From 3b6cad3f6ecb615442bd0d0f365fbdec91cf9317 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 11 Jul 2018 19:47:43 +0300
Subject: [PATCH 1/1] Add support for EXECUTE  USING  syntax.

This is the SQL-standard equivalent of "EXECUTE  ()".

TODO: docs.
---
 src/backend/parser/gram.y  |  1 +
 src/interfaces/ecpg/preproc/check_rules.pl |  2 +-
 src/interfaces/ecpg/preproc/ecpg.addons|  2 +-
 src/interfaces/ecpg/preproc/ecpg.trailer   |  9 
 src/interfaces/ecpg/preproc/parse.pl   |  2 +-
 src/interfaces/ecpg/test/expected/sql-execute.c| 51 ++
 .../ecpg/test/expected/sql-execute.stderr  | 24 +++---
 .../ecpg/test/expected/sql-execute.stdout  |  1 +
 src/interfaces/ecpg/test/sql/execute.pgc   | 14 ++
 src/test/regress/expected/prepare.out  |  7 +++
 src/test/regress/sql/prepare.sql   |  3 ++
 11 files changed, 100 insertions(+), 16 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 90dfac2cb1..851363fa4e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10779,6 +10779,7 @@ ExecuteStmt: EXECUTE name execute_param_clause
 		;
 
 execute_param_clause: '(' expr_list ')'{ $$ = $2; }
+	| USING expr_list		{ $$ = $2; }
 	| /* EMPTY */	{ $$ = NIL; }
 	;
 
diff --git a/src/interfaces/ecpg/preproc/check_rules.pl b/src/interfaces/ecpg/preproc/check_rules.pl
index 6c8b004854..ee67817be0 100644
--- a/src/interfaces/ecpg/preproc/check_rules.pl
+++ b/src/interfaces/ecpg/preproc/check_rules.pl
@@ -37,7 +37,7 @@ if ($verbose)
 
 my %replace_line = (
 	'ExecuteStmtEXECUTEnameexecute_param_clause' =>
-	  'EXECUTE prepared_name execute_param_clause execute_rest',
+	  'EXECUTE prepared_name execute_rest',
 
 	'ExecuteStmtCREATEOptTempTABLEcreate_as_targetASEXECUTEnameexecute_param_clause'
 	  => 'CREATE OptTemp TABLE create_as_target AS EXECUTE prepared_name execute_param_clause',
diff --git a/src/interfaces/ecpg/preproc/ecpg.addons b/src/interfaces/ecpg/preproc/ecpg.addons
index ca3efadc48..9606ad4783 100644
--- a/src/interfaces/ecpg/preproc/ecpg.addons
+++ b/src/interfaces/ecpg/preproc/ecpg.addons
@@ -283,7 +283,7 @@ ECPG: PrepareStmtPREPAREprepared_nameprep_type_clauseASPreparableStmt block
 		$$.type = NULL;
 		$$.stmt = $4;
 	}
-ECPG: ExecuteStmtEXECUTEprepared_nameexecute_param_clauseexecute_rest block
+ECPG: ExecuteStmtEXECUTEprepared_nameexecute_rest block
 	{ $$ = $2; }
 ECPG: DeclareCursorStmtDECLAREcursor_namecursor_optionsCURSORopt_holdFORSelectStmt block
 	{
diff --git 

Re: In pageinspect, perform clean-up after testing gin-related functions

2018-07-11 Thread Andres Freund
On 2018-07-11 12:56:49 +0530, Amit Kapila wrote:
> On Wed, Jul 11, 2018 at 12:37 PM, Kuntal Ghosh
>  wrote:
> > Hello all,
> >
> > In pageinspect/sql/gin.sql, we don't drop the table test1 at the end
> > of the test. IMHO, we should clean-up at the end of a test.
> >
> 
> Yeah, it is good practice to drop the objects at the end.  It is
> strange that original commit adfb81d9e1 has this at the end of the
> test, but a later commit 367b99bbb1 by Tom has removed the Drop
> statement.  AFAICS, this is just a silly mistake, but I might be
> missing something.  Tom, do you remember any reason for doing so?  If
> not, then I think we can revert back that change (aka commit Kuntal's
> patch).

We actually sometimes intentionally want to persist objects past the end
of the test. Allows to test pg_dump / pg_upgrade. Don't know whether
that's the case here, but it's worthwhile to note.

Greetings,

Andres Freund



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-07-11 Thread Marina Polyakova

On 11-07-2018 16:24, Fabien COELHO wrote:

Hello Marina,

* -d/--debug: I'm not in favor of requiring a mandatory text argument 
on this option.


As you wrote in [1], adding an additional option is also a bad idea:


Hey, I'm entitled to some internal contradictions:-)


... and discussions will continue forever %-)

I'm sceptical of the "--debug-fails" options. ISTM that --debug is 
already there and should just be reused.


I was thinking that you could just use the existing --debug, not
change its syntax. My point was that --debug exists, and you could
just print
the messages when under --debug.


Now I understand you better, thanks. I think it will be useful to 
receive only messages about failures, because they and progress reports 
can be lost in many other debug messages such as "client %d sending ..." 
/ "client %d executing ..." / "client %d receiving".


Maybe it's better to use an optional argument/arguments for 
compatibility (--debug[=fails] or --debug[=NUM])? But if we use numbers, 
right now I can see only 2 levels, and there's no guarantee that 
they will not change...


Optional arguments to options (!) are not really clean things, so I'd
like to avoid going down this path, esp. as I cannot see any other
instance in pgbench or elsewhere in postgres,


AFAICS they are used in pg_waldump (option --stats[=record]) and in psql 
(option --help[=topic]).



and I personally
consider these a bad idea.



So if absolutely necessary, a new option is still better than changing
--debug syntax. If not necessary, then it is better:-)


Ok!


* I'm reserved about the whole ereport thing, see comments in other
messages.


Thank you, I'll try to implement the error reporting in the way you 
suggested.


Dunno if it is a good idea either. The committer's word is the one that
counts in the end :-)


I agree with you that ereport has good reasons to be non-trivial in the 
backend, and that those reasons do not apply in pgbench...



* doCustom changes.




On CSTATE_FAILURE, the next command is possibly started. Although 
there is some consistency with the previous point, I think that it 
totally breaks the state automaton where now a command can start 
while the whole transaction is in failing state anyway. There was no 
point in starting it in the first place.


So, for me, the FAILURE state should record/count the failure, then 
skip

to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
This is much clearer that way.

Then RETRY should reinstate the global state and proceed to start the 
*first* command again.

<...>

It is unclear to me why backslash command errors are turned to 
FAILURE

instead of ABORTED: there is no way they are going to be retried, so
maybe they should/could skip directly to ABORTED?


So do you propose to execute the command "ROLLBACK" without 
calculating its latency etc. if we are in a failed transaction and 
clear the conditional stack after each failure?


Also just to be clear: do you want to have the state CSTATE_ABORTED 
for client abortion and another state for interrupting the current 
transaction?


I do not understand what "interrupting the current transaction" means.
A transaction is either committed or rollbacked, I do not know about
"interrupted".


I mean that IIUC the server usually only reports the error and you must 
manually send the command "END" or "ROLLBACK" to rollback a failed 
transaction.



When it is rolled back, probably some stats will be
collected in passing, I'm fine with that.

If there is an error in a pgbench script, the transaction is aborted,
which means for me that the script execution is stopped where it was,
and either it is restarted from the beginning (retry) or counted as
failure (not retry, just aborted, really).

If by interrupted you mean that one script begins a transaction and
another ends it, as I said in the review I think that this strange
case should be forbidden, so that all the code and documentation
trying to
manage that can be removed.


Ok!


The current RETRY state does memory allocations to generate a message
with buffer allocation and so on. This looks like a costly and 
useless
operation. If the user required "retries", then this is normal 
behavior,

the retries are counted and will be printed out in the final report,
and there is no point in printing out every single one of them.
Maybe you want that debugging, but then costly operations should be 
guarded.


I think we need these debugging messages because, for example,


Debugging messages should cost only when under debug. When not under
debug, there should be no debugging messages, and there should be no
cost for building and discarding such messages in the executed code
path, beyond testing whether the program is under debug.

if you use the option --latency-limit, we will never know in 
advance whether the serialization/deadlock failure will be retried or 
not.


ISTM that it will be shown in the final report. If I want debug, I ask for
--debug, otherwise I think that the 

Re: GiST VACUUM

2018-07-11 Thread Andrey Borodin
Hi, Heikki!

Thanks for looking into the patch!

> On 11 July 2018, at 0:07, Heikki Linnakangas wrote:
> 
> I'm now looking at the first patch in this series, to allow completely empty 
> GiST pages to be recycled. I've got some questions:
> 
>> --- a/src/backend/access/gist/gist.c
>> +++ b/src/backend/access/gist/gist.c
>> @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size 
>> freespace, GISTSTATE *giststate)
>>  GISTInsertStack *item;
>>  OffsetNumber downlinkoffnum;
>> +if(GistPageIsDeleted(stack->page))
>> +{
>> +UnlockReleaseBuffer(stack->buffer);
>> +xlocked = false;
>> +state.stack = stack = stack->parent;
>> +continue;
>> +}
>>  downlinkoffnum = gistchoose(state.r, stack->page, itup, 
>> giststate);
>>  iid = PageGetItemId(stack->page, downlinkoffnum);
>>  idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
> 
> This seems misplaced. This code deals with internal pages, and as far as I 
> can see, this patch never marks internal pages as deleted, only leaf pages. 
> However, we should have something like this in the leaf-page branch, to deal 
> with the case that an insertion lands on a page that was concurrently deleted.
That's a bug. Will fix this.

> Did you have any tests, where an insertion runs concurrently with vacuum, 
> that would exercise this?
Yes, I've tried to test this, but, obviously, not enough. I'll think more about 
how to deal with it.
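
For the archives, a minimal sketch of the kind of concurrent test being
discussed (all object names here are invented, and this is not the test from
the patch set): one session repeatedly vacuums while another keeps inserting,
so that an insertion can descend to a leaf page that was just emptied and
deleted.

create table gist_stress (p point);
create index gist_stress_idx on gist_stress using gist (p);
insert into gist_stress select point(i, i) from generate_series(1, 100000) i;
delete from gist_stress where p[0] > 1000;   -- leaves many empty leaf pages

-- session 1, in a loop:
vacuum gist_stress;

-- session 2, concurrently, in a loop:
insert into gist_stress select point(i, i) from generate_series(1, 1000) i;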

> 
> The code in gistbulkdelete() seems pretty expensive. In the first phase, it 
> records the parent of every empty leaf page it encounters. In the second 
> phase, it scans every leaf page of that parent, not only those leaves that 
> were seen as empty.
Yes, in the first patch there is the simplest gistbulkdelete(); the second
patch will remember the line pointers of downlinks to empty leaves.

> 
> I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted 
> page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but 
> here it's used for a critical check. This seems to be the same mechanism we 
> use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, 
> not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, 
> not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for 
> explanation. This patch is missing any comments to explain how this works in 
> GiST.
Will look into this. I remember it was OK half a year ago, but now it is clear
to me that I should have documented that part when I understood it.

> 
> If you crash in the middle of gistbulkdelete(), after it has removed the 
> downlink from the parent, but before it has marked the leaf page as deleted, 
> the leaf page is "leaked". I think that's acceptable, but a comment at least 
> would be good.
I was considering doing reverse-split (page merge) concurrency like in Lanin
and Shasha's paper, but it is just too complex for too little benefit. Will add
comments on possible orphan pages.

Many thanks! I hope to post updated patch series this week.


Best regards, Andrey Borodin.


Re: [PATCH] btree_gist: fix union implementation for variable length columns

2018-07-11 Thread Pavel Raiskup
Hi Tom,

On Monday, July 9, 2018 7:41:59 PM CEST Tom Lane wrote:
> Pavel Raiskup  writes:
> > while I tried to debug 'gcc -fstack-protector -O3' problems in [1], I 
> > noticed
> > that gbt_var_union() mistreats the first vector element.  Patch is attached.
> 
> Hi Pavel!  For patches that purport to resolve bugs, we usually like to
> add a regression test case that demonstrates the bug in unpatched code.
> Can you provide a small test case that does so?  (The BZ you pointed to
> doesn't seem to address this...)

I've been looking at the reproducer for a while now...  and I probably
need a hint, if a reproducer is really necessary.

Turns out the problem is only related to the bit/bit varying types (so the
patch comments need to be reworded properly, at least), since those are the
only types that implement the f_l2n() callback.

Considering the testcase 'bit.sql' (bit(33) type), it's pretty clear that on
little endian boxes the problematic strings would be '0011*' (the int32
value for '33', because of the varbit header).  Big endian boxes would
have problems with a char[] like {0, 0, 0, 33, ...}.  But I fail to
construct the right order of correctly picked elements for 'bit.data'.
The problem probably is (a) that picksplit reorders the elements very
often, and probably that (b) random() brings non-determinism into the
final tree shape (if the reproducer was to be based on many equal keys in
the index).  It's amazing to see how the index fixes itself :-) for this
bug.
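
For reference, the kind of setup under discussion looks roughly like the
sketch below (illustrative only, with invented names; it is not the
reproducer being sought, since the final tree shape still depends on
picksplit and random()):

create extension if not exists btree_gist;
create table bittab (b bit(33));
insert into bittab
  select ('0' || i::bit(32)::text)::bit(33) from generate_series(1, 10000) i;
create index bittab_idx on bittab using gist (b);   -- exercises gbt_var_union()
set enable_seqscan = off;
select count(*) from bittab where b = ('0' || 42::bit(32)::text)::bit(33);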

Can you help me with an idea for the reproducer, or alternatively, can we live
without the reproducer in this particular case?

Pavel






Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Wed, Jul 11, 2018 at 01:03:44AM +, Tsunakawa, Takayuki wrote:
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> > The core team has considered this matter, and has concluded that it's
> > time to establish a firm project policy that we will not accept any code
> > that is known to be patent-encumbered.  The long-term legal risks and
> > complications involved in doing that seem insurmountable, given the
> > community's amorphous legal nature and the existing Postgres license
> > wording (neither of which are open for negotiation here).  Furthermore,
> > Postgres has always been very friendly to creation of closed-source
> > derivatives, but it's hard to see how inclusion of patented code would
> > not cause serious problems for those.  The potential benefits of
> > accepting patented code just don't seem to justify trying to navigate
> > these hazards.
> 
> That decision is very unfortunate as a corporate employee on one hand,
> but the firm cleanness looks a bit bright to me as one developer.
> 
> As a practical matter, when and where are you planning to post the
> project policy?  How would you check and prevent patented code?

PG may need a contributor agreement.  All I can find on that is:

https://www.postgresql.org/message-id/flat/54337476.3040605%402ndquadrant.com#9b1968ddc0fadfe225001adc1a821925

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Tue, Jul 10, 2018 at 09:47:09AM -0400, Tom Lane wrote:
> The core team has considered this matter, and has concluded that it's
> time to establish a firm project policy that we will not accept any code
> that is known to be patent-encumbered.  The long-term legal risks and
> complications involved in doing that seem insurmountable, given the
> community's amorphous legal nature and the existing Postgres license
> wording (neither of which are open for negotiation here).  Furthermore,
> Postgres has always been very friendly to creation of closed-source
> derivatives, but it's hard to see how inclusion of patented code would
> not cause serious problems for those.  The potential benefits of
> accepting patented code just don't seem to justify trying to navigate
> these hazards.

+1.

You should probably consider accepting code that involves patents that
are in the public domain or expired by the time of release, though even
that involves some legal costs you might not want to incur.  E.g., what
if a US patent is in the public domain but a corresponding EU patent is
not?  A search could be done, naturally, but professional patent
searches are not cheap!

Nico
-- 



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Nico Williams
On Tue, Jul 10, 2018 at 08:20:53AM +, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > On Sat, Jul 07, 2018 at 10:20:35AM -0700, Andres Freund wrote:
> > > It's entirely possible to dual license contributions and everything. Why
> > > are you making such aggressive statements about a, so far, apparently
> > > good faith engagement?
> > 
> > One problem is that many contributors would not want to be tainted by
> > knowledge of the patents in question (since that leads to triple
> > damages).
> > 
> > How would you protect contributors and core developers from tainting?
> 
> IIUC, you are concerned about the possibility that PG developers would
> read the patent document (not the PG source code), and unconsciously
> use the patented algorithm for other software that's not related to
> PostgreSQL.  That would only be helped by not reading the patent
> document...  BTW, are you relieved the current PostgreSQL doesn't
> contain any patented code?  As far as I know, PostgreSQL development
> process doesn't have a step to check patent and copyright
> infringement.

You're proposing to include code that implements patented ideas with a
suitable patent grant.  I would be free to not read the patent, but what
if the code or documents mention the relevant patented algorithms?

If I come across something like this in the PG source code:

/* The following is covered by patents US#... and so on */

now what?

I could choose not to read it.  But what if I have to touch that code in
order to implement an unrelated feature due to some required refactoring
of internal interfaces used by your code?  Now I have to read it, and
now I'm tainted, or I must choose to abandon my project.

That is a heavy burden on the community.  The community may want to
extract a suitable patent grant to make that burden lighter.

> > One possible answer is that you wouldn't.  But that might reduce the
> > size of the community, or lead to a fork.
> 
> Yes, that's one unfortunate future, which I don't want to happen of
> course.  I believe PostgreSQL should accept patent for further
> evolution, because PostgreSQL is now a popular, influential software
> that many organizations want to join.

I don't speak for the PG community, nor the core developers.  Speaking
for myself only, I hope that you can get, and PG accepts only, the
widest possible royalty-free patent grant to the community, allowing
others to not be constrained in making derivatives of PG.

My advice is to write up a patent grant that allows all to use the
relevant patents royalty-free with a no-lawsuit covenant.  I.e., make
only defensive use of your patents.

Nico
-- 



Re: Negotiating the SCRAM channel binding type

2018-07-11 Thread Bruce Momjian
On Wed, Jul 11, 2018 at 04:00:47PM +0300, Heikki Linnakangas wrote:
> In a nutshell, to get the token for tls-server-end-point, you need to get
> the peer's certificate from the TLS library, in raw DER format, and
> calculate a hash over it. The hash algorithm depends on the
> signatureAlgorithm in the certificate, so you need to parse the certificate
> to extract that. We don't want to re-implement X509 parsing, so
> realistically we need the TLS library to have support functions for that.
> 
> Looking at the GnuTLS docs, I believe it has everything we need.
> gnutls_certificate_get_peers() and gnutls_certificate_get_ours() can be used
> to get the certificate, and gnutls_x509_crt_get_signature_algorithm() gets
> the signatureAlgorithm.
> 
> The macOS Secure Transport documentation is a bit harder to understand, but
> I think it has everything we need as well.
> SSLCopyPeerTrust()+SecTrustGetCertificateAtIndex()+SecCertificateCopyData()
> functions get you the certificate in DER format. You can get the signature
> algorithm with SecCertificateCopyValues(), with the right constants.
> 
> Am I missing something? I think we can support tls-server-end-point with all
> TLS implementations we might care about.

That seems right to me.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [HACKERS] GnuTLS support

2018-07-11 Thread Heikki Linnakangas

On 05/06/18 00:44, Peter Eisentraut wrote:

On 6/2/18 16:50, Heikki Linnakangas wrote:

On 08/03/18 14:13, Peter Eisentraut wrote:

There are two failures in the SSL tests that I cannot explain.  The
tests are for some rather obscure configurations, so the changed
behaviors are not obviously wrong, perhaps legitimate implementation
differences.  But someone wrote those tests with a purpose (probably),
so we should have some kind of explanation for the regressions.


I applied this over commit 4e0c743c18 (because this doesn't compile
against current master, needs rebasing), and ran "make check" in
src/test/ssl. All the tests passed. I'm using GnuTLS version 3.5.8. What
failures did you see?


The patch adjusts the expected test results so that the tests pass.


Ah, gotcha.


Look for the tests named

- "connect with server CA cert, without root CA"


So, in this test, the client puts the server's certificate in 
sslrootcert, but not the CA cert that the server's certificate was 
signed with. OpenSSL doesn't accept that, but apparently GnuTLS is OK 
with it.


I think the GnuTLS behavior is reasonable, I was actually surprised that 
OpenSSL is so strict about that. If the user explicitly lists a server's 
certificate as trusted, by putting it in sslrootcert, it seems 
reasonable to accept it even if the CA cert is missing.



- "CRL belonging to a different CA"


Hmm. So in OpenSSL, when we load the CRL, we call 
X509_STORE_set_flags(cvstore, X509_V_FLAG_CRL_CHECK | 
X509_V_FLAG_CRL_CHECK_ALL). With that option, if a CRL for the server CA 
cannot be found (in this case, because the CRL is for a different CA), 
OpenSSL throws an error. Apparently, GnuTLS is more lenient. At a quick 
glance, I don't see an option in GnuTLS to change that behavior. But I 
think we can live with it, it's not wrong per se, just different.


- Heikki



Re: [HACKERS] Two pass CheckDeadlock in contentent case

2018-07-11 Thread Ashutosh Bapat
The patch still applies and it's part of this commitfest.

On Tue, Oct 3, 2017 at 8:36 PM, Sokolov Yura  wrote:
> On 2017-10-03 17:30, Sokolov Yura wrote:
>>
>> Good day, hackers.
>>
>> Under hard workload a process sometimes reaches the deadlock timeout
>> even if no real deadlock occurred. It is easily reproducible with
>> pg_advisory_xact_lock on the same value + some time consuming
>> operation (update) and many clients.
>>
>> When a backend reaches the deadlock timeout, it calls CheckDeadlock,
>> which locks all partitions of the lock hash in exclusive mode to
>> walk through the graph and search for a deadlock.
>>
>> If hundreds of backends reach this timeout trying to acquire an
>> advisory lock on the same value, it leads to a hard stall for many
>> seconds, because they all traverse the same huge lock graph under
>> an exclusive lock.
>> During this stall it is not possible to do any meaningful
>> operation (no new transaction can begin).
>>
>> The attached patch makes CheckDeadlock do two passes:
>> - The first pass uses LW_SHARED on the partitions of the lock hash.
>>   DeadLockCheck is called with the flag "readonly", so it doesn't
>>   modify anything.
>> - If a possible "soft" or "hard" deadlock is detected,
>>   ie if there is a need to modify the lock graph, then the partitions
>>   are relocked with LW_EXCLUSIVE, and DeadLockCheck is called again.
>>
>> It fixes the hard stall, because backends walk the lock graph under a
>> shared lock and find that there is no real deadlock.
>>
>> Test on 4 socket xeon machine:
>> pgbench_zipf -s 10 -c 800  -j 100 -M prepared -T 450 -f
>> ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5
>>
>> ycsb_read_zipf.sql:
>> \set i random_zipfian(1, 10 * :scale, 1.01)
>> SELECT abalance FROM pgbench_accounts WHERE aid = :i;
>> ycsb_update_lock2_zipf.sql:
>> \set i random_zipfian(1, 10 * :scale, 1.01)
>> select lock_and_update( :i );
>>
>> CREATE OR REPLACE FUNCTION public.lock_and_update(i integer)
>>  RETURNS void
>>  LANGUAGE sql
>> AS $function$
>> SELECT pg_advisory_xact_lock( $1 );
>> UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1;
>> $function$
>>
>> Without attached patch:
>>
>> progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757
>> progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353
>> progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017
>> progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199
>> progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570
>> progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697
>> progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116
>> progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan
>> progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan
>> progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166
>> progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan
>> progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625
>> progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453
>> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741
>>
>> (autovacuum triggered the deadlock timeout,
>>  and therefore the stall)
>>
>> Patched:
>>
>> progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980
>> progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997
>> progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222
>> progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296
>> progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720
>> progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554
>> progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787
>> progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465
>> progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114
>> progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281
>> progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487
>> progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141
>> progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080
>> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482

I am confused about interpreting these numbers. Does it mean that
without the patch at the end of 70s, we have completed 40553.2 * 70
transactions and with the patch there are 70 * 38387.9 transactions
completed in 70 seconds?

I agree that with the patch there is smoother degradation when a lot
of backends enter deadlock detection phase. That's nice, but if my
interpretation of numbers is correct, that smoothness comes with a
cost of decreased TPS. So, it would make sense to have a GUC
controlling this behaviour.

Talking about GUCs, deadlock_timeout [1] value decides when the
deadlock check kicks in. In such cases probably increasing it would
suffice and we may not need the patch.
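
For example (the value below is purely illustrative), raising the setting
postpones the expensive check so that ordinary advisory-lock waits can
resolve before every backend starts walking the lock graph:

ALTER SYSTEM SET deadlock_timeout = '5s';   -- default is 1s
SELECT pg_reload_conf();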

Anybody else can comment about how desired is this feature?

> Excuse me, corrected version is in attach.
>

About the patch itself, I think releasing the shared lock and taking the
exclusive lock on lock partitions may lead to starvation. If there
are multiple backends that enter deadlock detection 

Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-07-11 Thread Alvaro Herrera
On 2018-Jul-11, David Rowley wrote:

> On 6 July 2018 at 21:25, Kato, Sho  wrote:
> > 2. 11beta2 + patch1 + patch2
> >
> > patch1: Allow direct lookups of AppendRelInfo by child relid
> > commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
> >
> >  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> > ----------+-------------+-------------+----------------+----------------+----------------
> >       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
> >       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
> >       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
> >       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
> >      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
> >      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
> >      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

> Just a note to say that the "Allow direct lookups of AppendRelInfo by
> child relid" patch is already in master. It's much more relevant to be
> testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2.  So we still
don't know how much of an improvement patch2 is by itself :-)
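
For reference, a sweep over partition counts like the one above can be
generated along these lines (a sketch only; the table and column names are
not from the original benchmark):

create table part_bench (id int, filler text) partition by range (id);
do $$
begin
  for i in 0..99 loop   -- vary the upper bound to get 100, 200, ... partitions
    execute format(
      'create table part_bench_%s partition of part_bench
          for values from (%s) to (%s)', i, i * 1000, (i + 1) * 1000);
  end loop;
end $$;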

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-07-11 Thread Fabien COELHO


Hello Marina,

* -d/--debug: I'm not in favor of requiring a mandatory text argument on
this option.


As you wrote in [1], adding an additional option is also a bad idea:


Hey, I'm entitled to some internal contradictions:-)

I'm sceptical of the "--debug-fails" options. ISTM that --debug is 
already there and should just be reused.


I was thinking that you could just use the existing --debug, not change
its syntax. My point was that --debug exists, and you could just print
the messages when under --debug.

Maybe it's better to use an optional argument/arguments for compatibility
(--debug[=fails] or --debug[=NUM])? But if we use numbers, for now I can see
only 2 levels, and there's no guarantee that they will not change...


Optional arguments to options (!) are not really clean things, so I'd like
to avoid going down this path, esp. as I cannot see any other instance in
pgbench or elsewhere in postgres, and I personally consider these a bad
idea.


So if absolutely necessary, a new option is still better than changing the
--debug syntax. If not necessary, then that is better still :-)



* I'm reserved about the whole ereport thing, see comments in other
messages.


Thank you, I'll try to implement the error reporting in the way you 
suggested.


Dunno if it is a good idea either. The committer's word is the one that
counts in the end :-)



Thank you, I'll fix this.
I'm sorry, I'll fix this.


You do not have to thank me or be sorry about every comment I make; once
for the former is enough, and there is no need for the latter.



* doCustom changes.




On CSTATE_FAILURE, the next command is possibly started. Although there 
is some consistency with the previous point, I think that it totally 
breaks the state automaton where now a command can start while the 
whole transaction is in failing state anyway. There was no point in 
starting it in the first place.


So, for me, the FAILURE state should record/count the failure, then skip
to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
This is much clearer that way.

Then RETRY should reinstate the global state and proceed to start the 
*first* command again.

<...>

It is unclear to me why backslash command errors are turned to FAILURE
instead of ABORTED: there is no way they are going to be retried, so
maybe they should/could skip directly to ABORTED?


So do you propose to execute the command "ROLLBACK" without calculating its 
latency etc. if we are in a failed transaction and clear the conditional 
stack after each failure?


Also just to be clear: do you want to have the state CSTATE_ABORTED for 
client abortion and another state for interrupting the current transaction?


I do not understand what "interrupting the current transaction" means. A
transaction is either committed or rolled back, I do not know about
"interrupted". When it is rolled back, probably some stats will be
collected in passing, I'm fine with that.


If there is an error in a pgbench script, the transaction is aborted, 
which means for me that the script execution is stopped where it was, and 
either it is restarted from the beginning (retry) or counted as failure 
(not retry, just aborted, really).
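
As an illustration, the kind of script whose whole-script retry is meant here
could look like the sketch below (the tables are the standard pgbench ones,
but the script itself is invented):

\set aid1 random(1, 100000 * :scale)
\set aid2 random(1, 100000 * :scale)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid1;
UPDATE pgbench_accounts SET abalance = abalance - 1 WHERE aid = :aid2;
COMMIT;

Two clients updating the same pair of rows in opposite orders can deadlock on
the second UPDATE; with retries enabled, the whole script would be restarted
from its first command.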


If by interrupted you mean that one script begins a transaction and
another ends it, as I said in the review I think that this strange case
should be forbidden, so that all the code and documentation trying to
manage that can be removed.


The current RETRY state does memory allocations to generate a message
with buffer allocation and so on. This looks like a costly and useless
operation. If the user required "retries", then this is normal behavior,
the retries are counted and will be printed out in the final report,
and there is no point in printing out every single one of them.
Maybe you want that debugging, but then costly operations should be 
guarded.


I think we need these debugging messages because, for example,


Debugging messages should cost only when under debug. When not under debug,
there should be no debugging messages, and there should be no cost for
building and discarding such messages in the executed code path beyond
testing whether the program is under debug.

if you use the option --latency-limit, we will never know in advance
whether the serialization/deadlock failure will be retried or not.


ISTM that it will be shown in the final report. If I want debug, I ask for
--debug, otherwise I think that the command should do what it was asked
for, i.e. run scripts, collect performance statistics and show them at the
end.


In particular, when running with retries enabled, the user is expecting
deadlock/serialization errors, so they are not "errors" as such for them.

They also help to understand which limit of retries was violated or how 
close we were to these limits during the execution of a specific 
transaction. But I agree with you that they are costly and can be 
skipped if the failure type is never retried. Maybe it is better to 
split them 

Re: Concurrency bug in UPDATE of partition-key

2018-07-11 Thread Alvaro Herrera
On 2018-Jul-11, Amit Kapila wrote:

> Attached, please find an updated patch based on comments by Alvaro.
> See, if this looks okay to you guys.

LGTM as far as my previous comments are concerned.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: _isnan() on Windows

2018-07-11 Thread Alvaro Herrera
On 2018-Jul-11, Michael Paquier wrote:

> On Tue, Jul 10, 2018 at 04:23:42PM -0400, Tom Lane wrote:
> > Alvaro Herrera  writes:
> >> On 2018-Jul-10, Tom Lane wrote:
> >>> I disagree --- including <float.h> in c.h, as this would have us do,
> >>> seems like a huge expansion of the visibility of that header.  Moreover,
> >>> doing that only on Windows seems certain to lead to missing-header
> >>> problems in the reverse direction, ie patches developed on Windows
> >>> will fail elsewhere.
> > 
> >> I don't think so, because that's only done for MSVC older than 2013.
> >> Nobody uses that for new development anymore.
> > 
> > Hm.  OK, maybe it's all right given that.  I'm still a bit worried
> > about downsides, but no doubt the buildfarm will tell us.
> 
> It seems to me that this patch is a good idea.  Any objections if I take
> care of it?  I have a Windows VM with only MSVC 2015 if I recall
> correctly though...

I just pushed it before seeing your message.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add function to release an allocated SQLDA

2018-07-11 Thread Thomas Munro
On Tue, Jun 19, 2018 at 9:11 PM, Kato, Sho  wrote:
>>This is not clear to me.  ECPGfreeSQLDA() releases a whole chain, but
>>free() only releases a single SQLDA, so they are obviously not
>>interchangeable.  When exactly should a user prefer one over the other?
>
> If an application uses FETCH ALL to get the whole result set at once, a user
> should use ECPGfreeSQLDA(), because the user does not know that the SQLDA
> holds the result set in a chain.
> If it is not released, it will be leaked.
>
> For users who know the structure of SQLDA, I described the releasing method
> using free() in the documentation.

I wonder how other ESQL/C implementations deal with this stuff.  The
DB2 SQLDA struct[1] doesn't have desc_next.  The Informix one[2] does,
but I'm not entirely sure what it's for since Informix apparently
doesn't support EXEC SQL FETCH ALL.  The Informix manual also lists a
function SqlFreeMem() that is needed on Windows for the same reason we
need a new function[3].  I don't think there is anyone left who cares
about compatibility with Informix so (unless there is some guidance
from the spec here) I think your proposed name ECPGfreeSQLDA() is OK.

test/sql/sqlda.pgc includes a bit where SQLDA objects are still freed
with free().  That's because the loop does extra processing with each
SQLDA:

while (outp_sqlda1)
{
    /* ... examine the fields of the current SQLDA ... */
    ptr = outp_sqlda1;                     /* remember it before advancing */
    outp_sqlda1 = outp_sqlda1->desc_next;  /* step to the next SQLDA in the chain */
    free(ptr);                             /* free just this one */
}

To allow that type of usage, we would need two new functions:
ECPGfreeSQLDA() to free just one, and ECPGfreeAllSQLDA() to free the
whole chain.  The minimum thing to back-patch would be a
ECPGfreeSQLDA() function to free just one, since that fixes a
potential bug for Windows users (the same reason we back-patched
4c8156d87108fa1f245bee13775e76819cd46a90).  Then ECPGfreeAllSQLDA()
(or better name) could be a separate patch to discuss for PG 12.  Does
that make sense?

sqlda.c:

+void
+ECPGfreeSQLDA_informix(struct sqlda_compat *sqlda_ptr)
+{

ecpglib.h:

+void ECPGfreeSQLDA_compat(struct sqlda_compat *);

I think these function names were supposed to match?

[1] 
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.7.0/com.ibm.db2.luw.apdv.api.doc/doc/r0001751.html
[2] 
https://www.ibm.com/support/knowledgecenter/en/SSGU8G_12.1.0/com.ibm.esqlc.doc/ids_esqlc_0656.htm
[3] 
https://www.ibm.com/support/knowledgecenter/en/SSGU8G_11.50.0/com.ibm.esqlc.doc/xbsql1007134.htm

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Negotiating the SCRAM channel binding type

2018-07-11 Thread Heikki Linnakangas

On 11/07/18 12:27, Heikki Linnakangas wrote:

Based on recent discussions, it looks like there's going to be
differences in this area [1]. OpenSSL can support both tls-unique and
tls-server-end-point. Java only supports tls-server-end-point, while
GnuTLS only supports tls-unique. And Mac OS Secure Transports supports
neither one. Furthermore, it's not clear how TLS v1.3 affects this.
tls-unique might no longer be available in TLS v1.3, but we might get
new channel binding types to replace it. So this is about to get really
messy, if there is no way to negotiate. (Yes, it's going to be messy
even with negotiation.)


I've been reading up on the discussions on GnuTLS and Secure Transport, 
as well as the specs for tls-server-end-point.


In a nutshell, to get the token for tls-server-end-point, you need to 
get the peer's certificate from the TLS library, in raw DER format, and 
calculate a hash over it. The hash algorithm depends on the 
signatureAlgorithm in the certificate, so you need to parse the 
certificate to extract that. We don't want to re-implement X509 parsing, 
so realistically we need the TLS library to have support functions for that.


Looking at the GnuTLS docs, I believe it has everything we need. 
gnutls_certificate_get_peers() and gnutls_certificate_get_ours() can be 
used to get the certificate, and 
gnutls_x509_crt_get_signature_algorithm() gets the signatureAlgorithm.


The macOS Secure Transport documentation is a bit harder to understand, 
but I think it has everything we need as well. 
SSLCopyPeerTrust()+SecTrustGetCertificateAtIndex()+SecCertificateCopyData() 
functions get you the certificate in DER format. You can get the 
signature algorithm with SecCertificateCopyValues(), with the right 
constants.


Am I missing something? I think we can support tls-server-end-point with 
all TLS implementations we might care about.


- Heikki



Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)

2018-07-11 Thread Haribabu Kommi
On Mon, Apr 30, 2018 at 1:38 PM Thomas Munro 
wrote:

> On Mon, Apr 30, 2018 at 3:13 PM, Andres Freund  wrote:
> > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote:
> >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up
> >> as "reads" when in fact they are not reads at all:
> >>
> >> 1.  Relation extension, which in fact writes a zero-filled block.
> >> 2.  The RBM_ZERO_* modes, which provoke neither read nor write.
> >
> > Just for understanding: 2) can never happen for buffers showing up in
> > EXPLAIN, right?
> >
> > I'm not saying you shouldn't fix the accounting...
>
> Maybe the hash AM can reach that in _hash_getinitbuf() while adding
> overflow pages to bucket chains?  Admittedly case 2 is obscure and
> rare if not unreachable and probably no one would care too much about
> that in practice (whereas case 1 can be seen by simply inserting stuff
> into a new empty table).  Other patches I'm working on for later
> proposal do it more often (think accessing known-empty pages in a kind
> of preallocated extent), and it occurred to me that it's clearly a bug
> on principle, hence this patch.
>
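
To make case 1 concrete, a minimal illustration (hypothetical session; the
exact numbers will differ): the blocks created by relation extension while
filling a new table currently show up under "read", even though nothing is
read from disk.

create table ext_demo (i int);
explain (analyze, buffers, costs off)
  insert into ext_demo select g from generate_series(1, 10000) g;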

I checked the patch and I agree with change 1). And regarding change 2),
whether it is zeroing the contents of the page or not, doesn't it still do a
read? Because if the page exists in the buffer pool, we count it as a hit
irrespective of the mode. Am I missing something?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread Dilip Kumar
On Wed, Jul 11, 2018 at 5:36 PM, amul sul  wrote:
> On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar  wrote:

>>
> I am not sure that I have understand the following comments
>  11 +* Generate one prune step for the information derived from IS NULL,
>  12 +* if any.  To prune hash partitions, we must have found IS NULL
>  13 +* clauses for all partition keys.
>  14  */
>
> I am not sure that I have understood this --  no such restriction
> required to prune the hash partitions, if I am not missing anything.

Maybe it's not very clear but this is the original comments I have
retained.  Just moved it out of the (!generate_opsteps) condition.

Just to explain this comment, consider the example below:

create table hp (a int, b text) partition by hash (a int, b text);
create table hp0 partition of hp for values with (modulus 4, remainder 0);
create table hp3 partition of hp for values with (modulus 4, remainder 3);
create table hp1 partition of hp for values with (modulus 4, remainder 1);
create table hp2 partition of hp for values with (modulus 4, remainder 2);

postgres=# insert into hp values (1, null);
INSERT 0 1
postgres=# insert into hp values (2, null);
INSERT 0 1
postgres=# select tableoid::regclass, * from hp;
 tableoid | a | b
--+---+---
 hp1  | 1 |
 hp2  | 2 |
(2 rows)

Now, if we query based on "b is null" then we can not decide which
partition should be pruned whereas in case
of other schemes, it will go to default partition so we can prune all
other partitions.

explain (costs off) select * from hp where b is null;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Preferring index-only-scan when the cost is equal

2018-07-11 Thread Tomas Vondra



On 07/11/2018 01:28 PM, Ashutosh Bapat wrote:

On Wed, Jul 11, 2018 at 11:03 AM, Yugo Nagata  wrote:

Hi,

I found that there is a situation where even when an index only scan can be
effective, the planner doesn't select it.  The planner makes index paths in
descending order of indexrelid, and the new path is discarded if its cost is
not less than the existing paths' cost. As a result, an IndexOnlyScan path can
be discarded even though it may be more effective than a normal IndexScan.

Here is a example;

=# create table test1 (i int, d int);
CREATE TABLE
=# create index on test1(i) include (d);
CREATE INDEX

=# explain select d from test1 where i = 0;
 QUERY PLAN
--
  Index Only Scan using test1_i_d_idx on test1  (cost=0.15..36.35 rows=11 
width=4)
Index Cond: (i = 0)
(2 rows)

=# create index on test1(i) ;
CREATE INDEX
=# explain select d from test1 where i = 0;
 QUERY PLAN
---
  Index Scan using test1_i_idx on test1  (cost=0.15..36.35 rows=11 width=4)
Index Cond: (i = 0)
(2 rows)


This is not new for the covered index feature. We can see the same thing when
using multi-column indexes.


=# create table test2 (i int, d int);
CREATE TABLE
=# create index on test2(i,d);
CREATE INDEX
=# explain select d from test2 where i = 0;
 QUERY PLAN
--
  Index Only Scan using test2_i_d_idx on test2  (cost=0.15..36.35 rows=11 
width=4)
Index Cond: (i = 0)
(2 rows)

=# create index on test2(i);
CREATE INDEX
=# explain select d from test2 where i = 0;
 QUERY PLAN
---
  Index Scan using test2_i_idx on test2  (cost=0.15..36.35 rows=11 width=4)
Index Cond: (i = 0)
(2 rows)


Attached is a patch to prefer an index-only scan when its cost is equal to
that of another index path.  Honestly, I'm not sure this is the best way. Any
comments and advice would be appreciated.



I don't think we should change add_path() for this. We will
unnecessarily check that condition even for the cases where we do not
create index paths. I think we should fix the caller of add_path()
instead to add index only path before any index paths. For that the
index list needs to be sorted by the possibility of using index only
scan.

But I think in your case, it might be better to first check whether
there is any costing error because of which index only scan's path has
the same cost as index scan path. Also I don't see any testcase which
will show why index only scan would be more efficient in your case.
May be provide output of EXPLAIN ANALYZE.



I suspect this only happens due to testing on empty tables. Not only is 
testing of indexes on small tables rather pointless in general, but more 
importantly there will be no statistics. So we fall back to some default 
estimates, but we also don't have relallvisible etc which is crucial for 
estimating index-only scans. I'd bet that's why the cost estimates for 
index scans and index-only scans are the same here.
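
A sketch of the kind of check being suggested, reusing the test1 table from
the example above: populate it, VACUUM ANALYZE so that the statistics and
relallvisible are real, and then compare the plans and their costs again.

insert into test1 select i, i from generate_series(1, 1000000) i;
vacuum analyze test1;
explain analyze select d from test1 where i = 0;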


Yugo-san, have you observed this behavior on larger tables?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-11 Thread Jerry Jelinek
Alvaro,

I'll perform several test runs with various combinations and post the
results.

Thanks,
Jerry


On Tue, Jul 10, 2018 at 2:34 PM, Alvaro Herrera 
wrote:

> On 2018-Jul-10, Jerry Jelinek wrote:
>
> > 2) Disabling WAL recycling reduces reliability, even on COW filesystems.
>
> I think the problem here is that WAL recycling on normal filesystems
> helps protect against the case where the filesystem gets full.  If you
> remove it, that protection goes out the window.  You can claim that people
> need to make sure to have disk space available, but this does become a problem
> in practice.  I think the thing to do is verify what happens with
> recycling off when the disk gets full; is it possible to recover
> afterwards?  Is there any corrupt data?  What happens if the disk gets
> full just as the new WAL file is being created -- is there a Postgres
> PANIC or something?  As I understand, with recycling on it is easy (?)
> to recover, there is no PANIC crash, and no data corruption results.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread amul sul
On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar  wrote:
>
> On Wed, Jul 11, 2018 at 4:20 PM, Dilip Kumar  wrote:
> > On Wed, Jul 11, 2018 at 3:06 PM, Ashutosh Bapat
> >  wrote:
> >> Hi,
> >> Consider following test case.
> >> create table prt (a int, b int, c int) partition by range(a, b);
> >> create table prt_p1 partition of prt for values (0, 0) to (100, 100);
> >> create table prt_p1 partition of prt for values from (0, 0) to (100, 100);
> >> create table prt_p2 partition of prt for values from (100, 100) to (200, 
> >> 200);
> >> create table prt_def partition of prt default;
> >>
>
> > --- a/src/backend/partitioning/partprune.c
> > +++ b/src/backend/partitioning/partprune.c
> > @@ -857,7 +857,7 @@
> > gen_partprune_steps_internal(GeneratePruningStepsContext *context,
> >  * If generate_opsteps is set to false it means no OpExprs were 
> > directly
> >  * present in the input list.
> >  */
> > -   if (!generate_opsteps)
> > +   if (nullkeys || !generate_opsteps)
> > {
> > /*
> >  * Generate one prune step for the information derived
> > from IS NULL,
> > @@ -865,8 +865,7 @@
> > gen_partprune_steps_internal(GeneratePruningStepsContext *context,
> >  * clauses for all partition keys.
> >  */
> > if (!bms_is_empty(nullkeys) &&
> > -   (part_scheme->strategy != PARTITION_STRATEGY_HASH ||
> > -bms_num_members(nullkeys) == 
> > part_scheme->partnatts))
> > +   (part_scheme->strategy != PARTITION_STRATEGY_HASH))
> > {
> > PartitionPruneStep *step;
> >
> > postgres=# explain verbose select * from prt where a is null and b = 100;
> >   QUERY PLAN
> > --
> >  Append  (cost=0.00..35.51 rows=1 width=12)
> >->  Seq Scan on public.prt_def  (cost=0.00..35.50 rows=1 width=12)
> >  Output: prt_def.a, prt_def.b, prt_def.c
> >  Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100))
> > (4 rows)
> >
> > Above fix is just to show the root cause of the issue, I haven't
> > investigated that what should be the exact fix for this issue.
> >
>
> I think the actual fix should be as attached.
>
I am not sure that I have understood the following comments:
 11 +* Generate one prune step for the information derived from IS NULL,
 12 +* if any.  To prune hash partitions, we must have found IS NULL
 13 +* clauses for all partition keys.
 14  */

I am not sure that I have understood this -- no such restriction is
required to prune the hash partitions, if I am not missing anything.

Regards,
Amul



Re: Negotiating the SCRAM channel binding type

2018-07-11 Thread Heikki Linnakangas

On 11/07/18 14:37, Michael Paquier wrote:

On Wed, Jul 11, 2018 at 12:27:33PM +0300, Heikki Linnakangas wrote:

as the supported mechanisms, we change that into:

SCRAM-SHA-256-PLUS tls-unique
SCRAM-SHA-256-PLUS tls-server-end-point
SCRAM-SHA-256


Can we say for sure that there won't be other options associated to a
given SASL mechanism?  It seems to me that something like the following
is better long-term, with channel binding available as a comma-separated
list of items:

SCRAM-SHA-256-PLUS channel_bindings=tls-unique,tls-server-end-point
SCRAM-SHA-256


That would be more complicated to parse. Yeah, we might need further 
options for some SASL mechanisms in the future, but we can cross that 
bridge when we get there. I don't see any need to complicate this case 
for that.


I started digging into this more closely, and ran into a little problem. 
If channel binding is not used, the client sends a flag to the server to 
indicate if it's because the client doesn't support channel binding, or 
because it thought that the server doesn't support it. This is the 
gs2-cbind-flag. How should the flag be set, if the server supports 
channel binding type A, while client supports only type B? The purpose 
of the flag is to detect downgrade attacks, where a man-in-the-middle 
removes the PLUS variants from the list of supported mechanisms. If we 
treat incompatible channel binding types as "client doesn't support 
channel binding", then the downgrade attack becomes possible (the 
attacker can replace the legit PLUS variants with bogus channel binding 
types that the client surely doesn't support). If we treat it as "server 
doesn't support channel binding", then we won't downgrade to the 
non-channel-binding variant, in the legitimate case that the client and 
server both support channel binding, but with incompatible types.


What we'd really want, is to include the full list of server's supported 
mechanisms as part of the exchange, not just a boolean "y/n" flag. But 
that's not what the spec says :-(.


I guess we should err on the side of caution, and fail the 
authentication in that case. That's unfortunate, but it's better than 
not negotiating at all. At least with the negotiation, the 
authentication will work if there is any mutually supported channel 
binding type.


- Heikki



Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread Dilip Kumar
On Wed, Jul 11, 2018 at 4:20 PM, Dilip Kumar  wrote:
> On Wed, Jul 11, 2018 at 3:06 PM, Ashutosh Bapat
>  wrote:
>> Hi,
>> Consider following test case.
>> create table prt (a int, b int, c int) partition by range(a, b);
>> create table prt_p1 partition of prt for values (0, 0) to (100, 100);
>> create table prt_p1 partition of prt for values from (0, 0) to (100, 100);
>> create table prt_p2 partition of prt for values from (100, 100) to (200, 
>> 200);
>> create table prt_def partition of prt default;
>>

> --- a/src/backend/partitioning/partprune.c
> +++ b/src/backend/partitioning/partprune.c
> @@ -857,7 +857,7 @@
> gen_partprune_steps_internal(GeneratePruningStepsContext *context,
>  * If generate_opsteps is set to false it means no OpExprs were 
> directly
>  * present in the input list.
>  */
> -   if (!generate_opsteps)
> +   if (nullkeys || !generate_opsteps)
> {
> /*
>  * Generate one prune step for the information derived
> from IS NULL,
> @@ -865,8 +865,7 @@
> gen_partprune_steps_internal(GeneratePruningStepsContext *context,
>  * clauses for all partition keys.
>  */
> if (!bms_is_empty(nullkeys) &&
> -   (part_scheme->strategy != PARTITION_STRATEGY_HASH ||
> -bms_num_members(nullkeys) == part_scheme->partnatts))
> +   (part_scheme->strategy != PARTITION_STRATEGY_HASH))
> {
> PartitionPruneStep *step;
>
> postgres=# explain verbose select * from prt where a is null and b = 100;
>   QUERY PLAN
> --
>  Append  (cost=0.00..35.51 rows=1 width=12)
>->  Seq Scan on public.prt_def  (cost=0.00..35.50 rows=1 width=12)
>  Output: prt_def.a, prt_def.b, prt_def.c
>  Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100))
> (4 rows)
>
> Above fix is just to show the root cause of the issue, I haven't
> investigated that what should be the exact fix for this issue.
>

I think the actual fix should be as attached.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


generate_prunestep_for_isnull.patch
Description: Binary data


Re: Negotiating the SCRAM channel binding type

2018-07-11 Thread Michael Paquier
On Wed, Jul 11, 2018 at 12:27:33PM +0300, Heikki Linnakangas wrote:
> as the supported mechanisms, we change that into:
> 
> SCRAM-SHA-256-PLUS tls-unique
> SCRAM-SHA-256-PLUS tls-server-end-point
> SCRAM-SHA-256

Can we say for sure that there won't be other options associated to a
given SASL mechanism?  It seems to me that something like the following
is better long-term, with channel binding available as a comma-separated
list of items:

SCRAM-SHA-256-PLUS channel_bindings=tls-unique,tls-server-end-point
SCRAM-SHA-256

> Thoughts? Unfortunately this breaks compatibility with current v11beta
> clients, but IMHO we must fix this. If we want to alleviate that, and save a
> few bytes of network traffic, we could decide that plain
> "SCRAM-SHA-256-PLUS" implies tls-unique, so the list would be:
> 
> SCRAM-SHA-256-PLUS
> SCRAM-SHA-256-PLUS tls-server-end-point
> SCRAM-SHA-256
> 
> That would be mostly compatible with current libpq clients.

I am not sure it is worth caring about compatibility at a beta2
stage, as v11 is not released yet.  Still, I agree that not specifying
anything should mean the default, which per the RFCs is currently
tls-unique.

Another piece of the puzzle is here as well:
https://www.postgresql.org/message-id/flat/20180122072936.gb1...@paquier.xyz
We need to allow a given SSL implementation to specify what the list of
channel bindings it supports is.
--
Michael




Re: Problem with tupdesc in jsonb_to_recordset

2018-07-11 Thread Michael Paquier
On Wed, Jul 11, 2018 at 11:38:46AM +0100, Andrew Gierth wrote:
> My first approach - assuming that update_cached_tupdesc has good reason
> to make a copy, which I'm not convinced is the case - would be to simply
> make a per-query-context copy of the tupdesc to assign to rsi.setDesc in
> order to conform to the call protocol.

I see what you are coming at here, thanks for the input.  I am not
really convinced that update_cached_tupdesc needs to do a new copy
either, but let's see what Tom has to say on the matter.  That's his
feature and his code.
--
Michael




Re: Preferring index-only-scan when the cost is equal

2018-07-11 Thread Ashutosh Bapat
On Wed, Jul 11, 2018 at 11:03 AM, Yugo Nagata  wrote:
> Hi,
>
> I found that there is a situation where even when an index only scan can be
> effective, the planner doesn't select it.  The planner makes index paths in
> descending order of indexrelid, and the new path is discarded if its cost is
> not less than the existing paths' cost. As a result, an IndexOnlyScan path
> can be discarded even though it may be more effective than a normal
> IndexScan.
>
> Here is a example;
>
> =# create table test1 (i int, d int);
> CREATE TABLE
> =# create index on test1(i) include (d);
> CREATE INDEX
>
> =# explain select d from test1 where i = 0;
> QUERY PLAN
> --
>  Index Only Scan using test1_i_d_idx on test1  (cost=0.15..36.35 rows=11 
> width=4)
>Index Cond: (i = 0)
> (2 rows)
>
> =# create index on test1(i) ;
> CREATE INDEX
> =# explain select d from test1 where i = 0;
> QUERY PLAN
> ---
>  Index Scan using test1_i_idx on test1  (cost=0.15..36.35 rows=11 width=4)
>Index Cond: (i = 0)
> (2 rows)
>
>
> This is not new for the covered index feature. We can see the same thing
> when using multi-column indexes.
>
>
> =# create table test2 (i int, d int);
> CREATE TABLE
> =# create index on test2(i,d);
> CREATE INDEX
> =# explain select d from test2 where i = 0;
> QUERY PLAN
> --
>  Index Only Scan using test2_i_d_idx on test2  (cost=0.15..36.35 rows=11 
> width=4)
>Index Cond: (i = 0)
> (2 rows)
>
> =# create index on test2(i);
> CREATE INDEX
> =# explain select d from test2 where i = 0;
> QUERY PLAN
> ---
>  Index Scan using test2_i_idx on test2  (cost=0.15..36.35 rows=11 width=4)
>Index Cond: (i = 0)
> (2 rows)
>
>
> Attached is a patch to prefer an index-only scan when its cost is equal to
> that of another index path.  Honestly, I'm not sure this is the best way.
> Any comments and advice would be appreciated.
>

I don't think we should change add_path() for this. We will
unnecessarily check that condition even for the cases where we do not
create index paths. I think we should fix the caller of add_path()
instead to add index only path before any index paths. For that the
index list needs to be sorted by the possibility of using index only
scan.

But I think in your case, it might be better to first check whether
there is any costing error because of which index only scan's path has
the same cost as index scan path. Also I don't see any testcase which
will show why index only scan would be more efficient in your case.
May be provide output of EXPLAIN ANALYZE.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Allow to specify a index name as ANALYZE parameter

2018-07-11 Thread Alexander Korotkov
Hi!

On Wed, Jul 11, 2018 at 12:04 PM Yugo Nagata  wrote:
> When we specify column names for ANALYZE, only the statistics for those
> columns are collected. Similarly, would it be useful to have an option to
> specify an index for ANALYZE, to collect only the statistics for the
> expressions in the specified index?
>
> A use case I suppose is when a new expression index is created and we need
> only the statistics for the new index. Although ANALYZE of all the indexes
> is usually fast because ANALYZE uses a random sample of the table rows,
> ANALYZE on a specific index may still be useful if there are other indexes
> whose "statistics target" is large and/or whose expressions take time to
> compute, for example.
>
> Attached is a WIP patch to allow specifying an index name as an ANALYZE
> parameter. Documentation is not yet included.  I would appreciate any
> feedback!

I think this makes sense.  Once we can collect statistics individually
for regular columns, we should be able to do the same for indexed
expressions.  Please register this patch in the next commitfest.

Regarding the current implementation, I found the message "ANALYZE option must
be specified when a column list is provided" confusing.
Perhaps you've missed a negation in this message, since in fact
you disallow ANALYZE with a column list.

However, since a multicolumn index may contain multiple expressions, I
think we should support specifying columns in the ANALYZE index clause.
We could support specifying columns by their numbers in the same way
we did for "ALTER INDEX index ALTER COLUMN colnum SET STATISTICS
number".

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-11 Thread Ashutosh Bapat
On Wed, Jul 11, 2018 at 1:23 PM, Etsuro Fujita
 wrote:
>
>
> Actually, even if we could create such an index on the child table and the
> targetlist had the ConvertRowtypeExpr, the planner would still not be able
> to use an index-only scan with that index; because check_index_only would
> not consider that an index-only scan is possible for that index because that
> index is an expression index and that function currently does not consider
> that index expressions are able to be returned back in an index-only scan.
> That behavior of the planner might be improved in future, though.
>
>> Pathkey points to an equivalence class, which contains equivalence
>> members. A parent equivalence class member containing a whole-row
>> reference gets translated into a child equivalence member containing a
>> ConvertRowtypeExpr.

Right and when we do so, not having ConvertRowtypeExpr in the
targetlist will be a problem.

>
>
> I think so too.
>
>> At places in planner we match equivalence members
>> to the targetlist entries. This matching will fail unexpectedly when
>> ConvertRowtypeExpr is removed from a child's targetlist. But again I
>> couldn't reproduce a problem when such a mismatch arises.
>
>
> IIUC, I don't think the planner assumes that for an equivalence member there
> is an matching entry for that member in the targetlist; what I think the
> planner assumes is: an equivalence member is able to be computed from
> expressions in the targetlist.

This is true. However,

>  So, I think it is safe to have whole-row
> Vars instead of ConvertRowtypeExprs in the targetlist.

when it's looking for an expression, it finds a whole-row expression, so it
thinks it needs to add a ConvertRowtypeExpr on top of it. But when the plan is
created, there is already a ConvertRowtypeExpr, and there is no way to know
that a new one is not needed anymore. So we may end up with two
ConvertRowtypeExprs, giving wrong results.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Jsonb transform for pl/python

2018-07-11 Thread Alexander Korotkov
On Wed, Jul 11, 2018 at 12:30 AM Tom Lane  wrote:
> Peter Eisentraut  writes:
> > On 6/23/18 01:44, Nikita Glukhov wrote:
> >> We are simply trying first to convert numeric to int64 if it does not have
> >> digits after the decimal point, and then construct Python Int instead of
> >> Decimal.  Standard Python json.loads() does the same for exact integers.
>
> > We just had a very similar but not identical discussion in the thread
> > about the PL/Perl transforms, where we rejected the proposal.  Other
> > comments on this one?
>
> I can sympathize with the speed concern, but the proposed patch produces
> a functional change (ie, you get a different kind of Python object, with
> different behaviors, in some cases).  That seems to me to be a good reason
> to reject it.

Nevertheless, as Nikita pointed out, json.loads() behaves the same way in
Python.  See an example for Python 2.7.10.

>>> import json
>>> type(json.loads('1'))
<type 'int'>
>>> type(json.loads('1.0'))
<type 'float'>

So, from the Postgres point of view this behavior is wrong.  But on the
other hand, why not let the Python transform behave like Python?
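
For illustration, one place where the difference is visible (a sketch; it
assumes the jsonb_plpython2u transform extension is installed, and the
function name is invented):

create extension if not exists jsonb_plpython2u cascade;

create function jsonb_pytype(j jsonb) returns text
  transform for type jsonb
  language plpython2u
as $$
return type(j).__name__
$$;

select jsonb_pytype('1'::jsonb), jsonb_pytype('1.5'::jsonb);
-- currently both report 'Decimal'; with the proposed change the first would
-- report 'int', matching json.loads().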

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread Dilip Kumar
On Wed, Jul 11, 2018 at 3:06 PM, Ashutosh Bapat
 wrote:
> Hi,
> Consider following test case.
> create table prt (a int, b int, c int) partition by range(a, b);
> create table prt_p1 partition of prt for values (0, 0) to (100, 100);
> create table prt_p1 partition of prt for values from (0, 0) to (100, 100);
> create table prt_p2 partition of prt for values from (100, 100) to (200, 200);
> create table prt_def partition of prt default;
>
> In a range partitioned table, a row with any partition key NULL goes
> to the default partition if it exists.
> insert into prt values (null, 1);
> insert into prt values (1, null);
> insert into prt values (null, null);
> select tableoid::regclass, * from prt;
>  tableoid | a | b | c
> --+---+---+---
>  prt_def  |   | 1 |
>  prt_def  | 1 |   |
>  prt_def  |   |   |
> (3 rows)
>
> There's a comment in get_partition_for_tuple(), which says so.
> /*
>  * No range includes NULL, so this will be accepted by the
>  * default partition if there is one, and otherwise rejected.
>  */
>
> But when there is an IS NULL clause on any of the partition keys along with
> some condition on another partition key, all the partitions are scanned. I
> expected pruning to prune all the partitions except the default one.
>
> explain verbose select * from prt where a is null and b = 100;
>   QUERY PLAN
> --
>  Append  (cost=0.00..106.52 rows=3 width=12)
>->  Seq Scan on public.prt_p1  (cost=0.00..35.50 rows=1 width=12)
>  Output: prt_p1.a, prt_p1.b, prt_p1.c
>  Filter: ((prt_p1.a IS NULL) AND (prt_p1.b = 100))
>->  Seq Scan on public.prt_p2  (cost=0.00..35.50 rows=1 width=12)
>  Output: prt_p2.a, prt_p2.b, prt_p2.c
>  Filter: ((prt_p2.a IS NULL) AND (prt_p2.b = 100))
>->  Seq Scan on public.prt_def  (cost=0.00..35.50 rows=1 width=12)
>  Output: prt_def.a, prt_def.b, prt_def.c
>  Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100))
> (10 rows)
>
> I thought that the following code in get_matching_range_bounds()
> /*
>  * If there are no datums to compare keys with, or if we got an IS NULL
>  * clause just return the default partition, if it exists.
>  */
> if (boundinfo->ndatums == 0 || !bms_is_empty(nullkeys))
> {
> result->scan_default = partition_bound_has_default(boundinfo);
> return result;
> }
>
> would do the trick but through the debugger I saw that nullkeys is
> NULL for this query.
>
> I didn't investigate further to see why nullkeys is NULL, but it looks
> like that's the problem and we are missing an optimization.


I think the problem is that gen_partprune_steps_internal expects all the
partition keys to be matched by IS NULL clauses; only in that case will
the "partition pruning step" store the nullkeys.

After a small change, it is able to prune the partitions.

--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -857,7 +857,7 @@
gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 * If generate_opsteps is set to false it means no OpExprs were directly
 * present in the input list.
 */
-   if (!generate_opsteps)
+   if (nullkeys || !generate_opsteps)
{
/*
 * Generate one prune step for the information derived
from IS NULL,
@@ -865,8 +865,7 @@
gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 * clauses for all partition keys.
 */
if (!bms_is_empty(nullkeys) &&
-   (part_scheme->strategy != PARTITION_STRATEGY_HASH ||
-bms_num_members(nullkeys) == part_scheme->partnatts))
+   (part_scheme->strategy != PARTITION_STRATEGY_HASH))
{
PartitionPruneStep *step;

postgres=# explain verbose select * from prt where a is null and b = 100;
  QUERY PLAN
--
 Append  (cost=0.00..35.51 rows=1 width=12)
   ->  Seq Scan on public.prt_def  (cost=0.00..35.50 rows=1 width=12)
 Output: prt_def.a, prt_def.b, prt_def.c
 Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100))
(4 rows)

The above fix is just to show the root cause of the issue; I haven't
investigated what the exact fix for it should be.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Problem with tupdesc in jsonb_to_recordset

2018-07-11 Thread Andrew Gierth
> "Michael" == Michael Paquier  writes:

 >> I don't think that your solution is correct. From my read of
 >> 37a795a6, the tuple descriptor is moved from the query-lifespan
 >> memory context (ecxt_per_query_memory) to a function-level context,
 >> which could cause the descriptor to become busted as your test is
 >> pointing out. Tom?

 Michael> Hacking my way out I am finishing with the attached, which
 Michael> fixes the problem and passes all regression tests. I am not
 Michael> completely sure that this the right approach though, I would
 Michael> need to think a bit more about it.

What's happening in the original case is this: the SRF call protocol
says that it's the executor's responsibility to free rsi.setDesc if it's
not refcounted, under the assumption that in such a case it's in
per-query memory (and not in either a shorter-lived or longer-lived
context).

The change to update_cached_tupdesc broke the protocol by causing
populate_recordset_worker to set rsi.setDesc to a non-refcounted tupdesc
allocated in a context that's not the per-query context.

What's not clear here is why populate_recordset_record thinks it needs
to update the tupdesc for every record in a single result set. The
entire result set will ultimately be treated as having the same tupdesc
regardless, so any per-record changes will break things later anyway.

Your latest proposed fix isn't OK because it puts the wrong context in
cache->fn_mcxt - data that's dangled off flinfo->fn_extra must NOT be in
any context that has a different lifetime than flinfo->fn_mcxt (which
might not be query lifetime), unless you're taking specific steps to
free or invalidate it at the correct point.

My first approach - assuming that update_cached_tupdesc has good reason
to make a copy, which I'm not convinced is the case - would be to simply
make a per-query-context copy of the tupdesc to assign to rsi.setDesc in
order to conform to the call protocol.

-- 
Andrew (irc:RhodiumToad)



Re: Negotiating the SCRAM channel binding type

2018-07-11 Thread Bruce Momjian
On Wed, Jul 11, 2018 at 12:27:33PM +0300, Heikki Linnakangas wrote:
> Currently, there is no negotiation of the channel binding type between
> client and server. The server advertises that it supports channel binding,
> or not, and the client decides what channel binding to use. If the server
> doesn't support the binding type that the client chose, authentication will
> fail.
> 
> Based on recent discussions, it looks like there's going to be differences
> in this area [1]. OpenSSL can support both tls-unique and
> tls-server-end-point. Java only supports tls-server-end-point, while GnuTLS
> only supports tls-unique. And Mac OS Secure Transports supports neither one.
> Furthermore, it's not clear how TLS v1.3 affects this. tls-unique might no
> longer be available in TLS v1.3, but we might get new channel binding types
> to replace it. So this is about to get really messy, if there is no way to
> negotiate. (Yes, it's going to be messy even with negotiation.)
> 
> I think we must fix that before we release v11, because this affects the
> protocol. There needs to be a way for the server to advertise what channel
> binding types it supports.

Yes, this does make sense.  From a security perspective, it doesn't
matter which channel binding method we use, assuming there are no
unfixed exploits.  The difference between the channel binding methods is
explained in our PG 11 docs:


https://www.postgresql.org/docs/11/static/sasl-authentication.html#SASL-SCRAM-SHA-256

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Cannot dump foreign key constraints on partitioned table

2018-07-11 Thread amul sul
Hi,

On the master head, the getConstraints() function skips FK constraints for
a partitioned table because tbinfo->hastriggers is false.

While creating FK constraints on the partitioned table, the FK triggers are only
created on leaf partitions and skipped for the partitioned tables.

To fix this, we could either bypass the aforementioned condition in
getConstraints() or set pg_class.relhastriggers to true for the partitioned
table; the latter doesn't seem to be the right solution, IMO.  Thoughts?
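
For illustration, a minimal script to observe what is described above (the
table and column names are made up; the pg_class query just shows the flag
that getConstraints() relies on):

create table pk_tbl (a int primary key);
create table fk_part (a int references pk_tbl) partition by range (a);
create table fk_part_1 partition of fk_part for values from (0) to (100);

-- per the report above, the FK triggers exist only on the leaf partition,
-- so relhastriggers stays false for the partitioned table itself
select relname, relhastriggers from pg_class
 where relname in ('fk_part', 'fk_part_1');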

Regards,
Amul



partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread Ashutosh Bapat
Hi,
Consider following test case.
create table prt (a int, b int, c int) partition by range(a, b);
create table prt_p1 partition of prt for values from (0, 0) to (100, 100);
create table prt_p2 partition of prt for values from (100, 100) to (200, 200);
create table prt_def partition of prt default;

In a range partitioned table, a row with any partition key NULL goes
to the default partition if it exists.
insert into prt values (null, 1);
insert into prt values (1, null);
insert into prt values (null, null);
select tableoid::regclass, * from prt;
 tableoid | a | b | c
--+---+---+---
 prt_def  |   | 1 |
 prt_def  | 1 |   |
 prt_def  |   |   |
(3 rows)

There's a comment in get_partition_for_tuple(), which says so.
/*
 * No range includes NULL, so this will be accepted by the
 * default partition if there is one, and otherwise rejected.
 */

But when there is an IS NULL clause on one of the partition keys along with
some condition on another partition key, all the partitions are scanned. I
expected pruning to prune all the partitions except the default one.

explain verbose select * from prt where a is null and b = 100;
  QUERY PLAN
--
 Append  (cost=0.00..106.52 rows=3 width=12)
   ->  Seq Scan on public.prt_p1  (cost=0.00..35.50 rows=1 width=12)
 Output: prt_p1.a, prt_p1.b, prt_p1.c
 Filter: ((prt_p1.a IS NULL) AND (prt_p1.b = 100))
   ->  Seq Scan on public.prt_p2  (cost=0.00..35.50 rows=1 width=12)
 Output: prt_p2.a, prt_p2.b, prt_p2.c
 Filter: ((prt_p2.a IS NULL) AND (prt_p2.b = 100))
   ->  Seq Scan on public.prt_def  (cost=0.00..35.50 rows=1 width=12)
 Output: prt_def.a, prt_def.b, prt_def.c
 Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100))
(10 rows)

I thought that the following code in get_matching_range_bounds()
/*
 * If there are no datums to compare keys with, or if we got an IS NULL
 * clause just return the default partition, if it exists.
 */
if (boundinfo->ndatums == 0 || !bms_is_empty(nullkeys))
{
result->scan_default = partition_bound_has_default(boundinfo);
return result;
}

would do the trick but through the debugger I saw that nullkeys is
NULL for this query.

I didn't investigate further to see why nullkeys is NULL, but it looks
like that's the problem and we are missing an optimization.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Problem with tupdesc in jsonb_to_recordset

2018-07-11 Thread Dmitry Dolgov
> On Wed, 11 Jul 2018 at 08:27, Michael Paquier  wrote:
>
> On Tue, Jul 10, 2018 at 10:39:28PM +0200, Dmitry Dolgov wrote:
> > I've found out that currently in some situations jsonb_to_recordset can 
> > lead to
> > a crash. Minimal example that I've managed to create looks like this:
>
> It would be better not to send the same email across multiple mailing
> lists.

Ok, sorry - next time I'll avoid this.

> On Wed, Jul 11, 2018 at 03:27:13PM +0900, Michael Paquier wrote:
> > I don't think that your solution is correct.  From my read of 37a795a6,
> > the tuple descriptor is moved from the query-lifespan memory context
> > (ecxt_per_query_memory) to a function-level context, which could cause
> > the descriptor to become busted as your test is pointing out.  Tom?
>
> Hacking my way out I am finishing with the attached, which fixes the
> problem and passes all regression tests.  I am not completely sure that
> this the right approach though, I would need to think a bit more about
> it.

So, you removed the FreeTupleDesc call - does that mean that if io->tupdesc is
not reset in some other situation, it's going to be a memory leak?



Negotiating the SCRAM channel binding type

2018-07-11 Thread Heikki Linnakangas
Currently, there is no negotiation of the channel binding type between 
client and server. The server advertises that it supports channel 
binding, or not, and the client decides what channel binding to use. If 
the server doesn't support the binding type that the client chose, 
authentication will fail.


Based on recent discussions, it looks like there's going to be 
differences in this area [1]. OpenSSL can support both tls-unique and 
tls-server-end-point. Java only supports tls-server-end-point, while 
GnuTLS only supports tls-unique. And Mac OS Secure Transports supports 
neither one. Furthermore, it's not clear how TLS v1.3 affects this. 
tls-unique might no longer be available in TLS v1.3, but we might get 
new channel binding types to replace it. So this is about to get really 
messy, if there is no way to negotiate. (Yes, it's going to be messy 
even with negotiation.)


I think we must fix that before we release v11, because this affects the 
protocol. There needs to be a way for the server to advertise what 
channel binding types it supports.


I propose that the server list each supported channel binding type as a 
separate SASL mechanism. For example, where currently the server lists:


SCRAM-SHA-256-PLUS
SCRAM-SHA-256

as the supported mechanisms, we change that into:

SCRAM-SHA-256-PLUS tls-unique
SCRAM-SHA-256-PLUS tls-server-end-point
SCRAM-SHA-256

Thoughts? Unfortunately this breaks compatibility with current v11beta 
clients, but IMHO we must fix this. If we want to alleviate that, and 
save a few bytes of network traffic, we could decide that plain 
"SCRAM-SHA-256-PLUS" implies tls-unique, so the list would be:


SCRAM-SHA-256-PLUS
SCRAM-SHA-256-PLUS tls-server-end-point
SCRAM-SHA-256

That would be mostly compatible with current libpq clients.

[1] 
https://www.postgresql.org/message-id/flat/20180122072936.GB1772%40paquier.xyz


- Heikki



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-07-11 Thread Marina Polyakova

On 09-07-2018 16:05, Fabien COELHO wrote:

Hello Marina,


Hello, Fabien!


Here is a review for the last part of your v9 version.


Thank you very much for this!


Patch does not "git apply" (may anymore):
  error: patch failed: doc/src/sgml/ref/pgbench.sgml:513
  error: doc/src/sgml/ref/pgbench.sgml: patch does not apply


Sorry, I'll send a new version soon.


However I could get it to apply with the "patch" command.

Then patch compiles, global & pgbench "make check" are ok.


:-)


Feature
===

The patch adds the ability to restart transactions (i.e. the full script)
on some errors, which is a good thing as it allows to exercise postgres
performance in more realistic scenarios.

* -d/--debug: I'm not in favor of requiring a mandatory text argument on this
option. It is not practical, the user has to remember it, and it is a change.
I'm sceptical of the overall debug handling changes. Maybe we could have
multiple -d which lead to a higher debug level, but I'm not sure that it can be
made to work for this case and still be compatible with the previous behavior.

Maybe you need a specific option for your purpose, eg "--debug-retry"?


As you wrote in [1], adding an additional option is also a bad idea:


I'm sceptical of the "--debug-fails" options. ISTM that --debug is
already there
and should just be reused.


Maybe it's better to use an optional argument/arguments for compatibility
(--debug[=fails] or --debug[=NUM])? But if we use the numbers, for now I can
see only 2 levels, and there's no guarantee that they will not change...



Code


* The implementation is less complex than the previous submission,
which is a good thing. I'm not sure that all the remaining complexity
is still fully needed.

* I'm reserved about the whole ereport thing, see comments in other
messages.


Thank you, I'll try to implement the error reporting in the way you 
suggested.



Levels ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me.
In particular, the "CLIENT" part is not very useful. If the
distinction makes sense, I would have kept "LOG" for the initial one and
added other ones for ABORT and PGBENCH, maybe.


Ok!


* There are no comments about "retries" in StatData, CState and Command
structures.

* Also, for StatData, I would like to understand the logic between cnt,
skipped, retries, retried, errors, ... so clear information about the
expected invariant, if any, would be welcome. One has to go into the code
to understand how these fields relate to one another.

<...>

* How "errors" differs from "ecnt" is unclear to me.


Thank you, I'll fix this.


* commandFailed: I think that it should be kept much simpler. In
particular, having errors on errors does not help much: on
ELEVEL_FATAL, it ignores the actual reported error and generates
another error of the same level, so that the initial issue is hidden.
Even if these are can't-happen cases, hiding the origin if it occurs
looks unhelpful. Just print it directly, and maybe abort if you think
that it is a can't happen case.


Oh, thanks, my mistake(

* copyRandomState: just use sizeof(RandomState) instead of making assumptions
about the contents of the struct. Also, this function looks pretty useless,
why not just do a plain assignment?

* copyVariables: lacks comments to explain that the destination is cleaned up
and so on. The cleanup phase could probably be in a distinct function, so that
the code would be clearer. Maybe the function variable names are too long.


Thank you, I'll fix this.


  if (current_source->svalue)

in the context of a guard for a strdup, maybe:

  if (current_source->svalue != NULL)


I'm sorry, I'll fix this.

* I do not understand the comments on the CState enum: "First, remember the
failure in CSTATE_FAILURE. Then process other commands of the failed
transaction if any."
Why would other commands be processed at all if the transaction is aborted?

For me any error must lead to the rollback and possible retry of the
transaction. This comment needs to be clarified. It should also say
that on FAILURE, it will go either to RETRY or ABORTED. See below my
comments about doCustom.

It is unclear to me why there could be several failures within a
transaction, as I would have expected that it would be aborted on the
first one.

* I do not understand the purpose of first_failure. The comment should explain
why it would need to be remembered. From my point of view, I'm not fully
convinced that it should.

<...>

* executeCondition: this hides client automaton state changes which were
clearly visible beforehand in the switch, and the different handling of
if & elif is also hidden.

I'm against this unnecessary restructuring and against hiding such information;
all state changes should be clearly seen in the state switch so that it is
easier to understand and follow.

I do not see why touching the conditional stack on internal errors
(evaluateExpr failure) brings anything, the whole transaction will be aborted
anyway.

* 

Allow to specify a index name as ANALYZE parameter

2018-07-11 Thread Yugo Nagata
Hi,

When we specify column names for ANALYZE, only the statistics for those columns
are collected. Similarly, would it be useful to have an option to specify an
index for ANALYZE, to collect only the statistics for the expressions in the
specified index?

A use case I have in mind is when a new expression index is created and we need
only the statistics for that new index. Although ANALYZE of all the indexes is
usually fast because ANALYZE uses a random sampling of the table rows, ANALYZE
on a specific index may still be useful if there are other indexes whose
statistics target is large and/or whose expressions take time to compute, for
example.

Attached is a WIP patch to allow an index name to be specified as an ANALYZE
parameter. Documentation is not yet included.  I would appreciate any feedback!
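
With the patch applied, the intended usage would look something like the
sketch below (table, column and index names are made up; I have not verified
the exact syntax the patch accepts):

create table t (a int, b text);
create index t_lower_b_idx on t (lower(b));

-- analyze only the expression statistics of the new index,
-- instead of re-analyzing every column and every index of t
analyze t_lower_b_idx;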

Regards,

-- 
Yugo Nagata 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 25194e8..058f242 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -84,7 +84,7 @@ static BufferAccessStrategy vac_strategy;
 static void do_analyze_rel(Relation onerel, int options,
 			   VacuumParams *params, List *va_cols,
 			   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
-			   bool inh, bool in_outer_xact, int elevel);
+			   bool inh, bool in_outer_xact, int elevel, Oid indexid);
 static void compute_index_stats(Relation onerel, double totalrows,
 	AnlIndexData *indexdata, int nindexes,
 	HeapTuple *rows, int numrows,
@@ -121,6 +121,7 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
 	AcquireSampleRowsFunc acquirefunc = NULL;
 	BlockNumber relpages = 0;
 	bool		rel_lock = true;
+	Oid			indexid = InvalidOid;
 
 	/* Select logging level */
 	if (options & VACOPT_VERBOSE)
@@ -253,6 +254,28 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
 		/* Also get regular table's size */
 		relpages = RelationGetNumberOfBlocks(onerel);
 	}
+	else if (onerel->rd_rel->relkind == RELKIND_INDEX)
+	{
+		Relation rel;
+
+		indexid = relid;
+		relid = IndexGetRelation(indexid, true);
+
+		if (va_cols != NIL)
+			ereport(ERROR,
+	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+	 errmsg("ANALYZE option must be specified when a column list is provided")));
+
+		rel = try_relation_open(relid, ShareUpdateExclusiveLock);
+		relation_close(onerel, ShareUpdateExclusiveLock);
+
+		onerel = rel;
+
+		/* Regular table, so we'll use the regular row acquisition function */
+		acquirefunc = acquire_sample_rows;
+		/* Also get regular table's size */
+		relpages = RelationGetNumberOfBlocks(onerel);
+	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/*
@@ -308,14 +331,14 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
 	 */
 	if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
 		do_analyze_rel(onerel, options, params, va_cols, acquirefunc,
-	   relpages, false, in_outer_xact, elevel);
+	   relpages, false, in_outer_xact, elevel, indexid);
 
 	/*
 	 * If there are child tables, do recursive ANALYZE.
 	 */
 	if (onerel->rd_rel->relhassubclass)
 		do_analyze_rel(onerel, options, params, va_cols, acquirefunc, relpages,
-	   true, in_outer_xact, elevel);
+	   true, in_outer_xact, elevel, InvalidOid);
 
 	/*
 	 * Close source relation now, but keep lock so that no one deletes it
@@ -345,7 +368,7 @@ static void
 do_analyze_rel(Relation onerel, int options, VacuumParams *params,
 			   List *va_cols, AcquireSampleRowsFunc acquirefunc,
 			   BlockNumber relpages, bool inh, bool in_outer_xact,
-			   int elevel)
+			   int elevel, Oid indexid)
 {
 	int			attr_cnt,
 tcnt,
@@ -445,7 +468,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
 		}
 		attr_cnt = tcnt;
 	}
-	else
+	else if (!OidIsValid(indexid))
 	{
 		attr_cnt = onerel->rd_att->natts;
 		vacattrstats = (VacAttrStats **)
@@ -459,6 +482,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
 		}
 		attr_cnt = tcnt;
 	}
+	else
+		attr_cnt = 0;
 
 	/*
 	 * Open all indexes of the relation, and see if there are any analyzable
@@ -468,7 +493,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
 	 * all.
 	 */
 	if (!inh)
-		vac_open_indexes(onerel, AccessShareLock, , );
+		vac_open_indexes(onerel, AccessShareLock, , , indexid);
 	else
 	{
 		Irel = NULL;
@@ -750,6 +775,7 @@ compute_index_stats(Relation onerel, double totalrows,
 	int			ind,
 i;
 
+
 	ind_context = AllocSetContextCreate(anl_context,
 		"Analyze Index",
 		ALLOCSET_DEFAULT_SIZES);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d90cb9a..b21811d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1606,7 +1606,7 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
  */
 void
 vac_open_indexes(Relation relation, LOCKMODE lockmode,
- int *nindexes, Relation **Irel)
+ int *nindexes, Relation **Irel, Oid idx)
 {
 	List	   

Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-07-11 Thread Heikki Linnakangas

On 11/07/18 04:16, Thomas Munro wrote:

On Tue, Jul 10, 2018 at 11:39 PM, Heikki Linnakangas  wrote:

I don't have a FreeBSD machine at hand, so I didn't try fixing that
patch.


I updated the FreeBSD version to use the header test approach you
showed, and pushed that too.  FWIW the build farm has some FreeBSD
animals with and without PROC_PDEATHSIG_CTL.


Thanks!


I suppose it's possible that we might want to reconsider the choice of
signal in the future (SIGINFO or SIGPWR).


We could reuse SIGUSR1 for this. If we set the flag in SIGUSR1 handler, 
then some PostmasterIsAlive() calls would take the slow path 
unnecessarily, but it would probably be OK. The slow path isn't that 
slow. But using SIGINFO/SIGPWR seems fine.


- Heikki



Re: Libpq support to connect to standby server as priority

2018-07-11 Thread Haribabu Kommi
On Wed, Jul 4, 2018 at 11:14 PM Laurenz Albe 
wrote:

> Haribabu Kommi wrote:
> > On Wed, Jan 24, 2018 at 9:01 AM Jing Wang  wrote:
> > > Hi All,
> > >
> > > Recently I put a proposal to support 'prefer-read' parameter in
> target_session_attrs in libpq. Now I updated the patch with adding content
> in the sgml and regression test case.
> > >
> > > Some people may have noticed there is already another patch (
> https://commitfest.postgresql.org/15/1148/ ) which looks similar with
> this. But I would say this patch is more complex than my proposal.
> > >
> > > It is better separate these 2 patches to consider.
> >
> > I also feel prefer-read and read-only options needs to take as two
> different options.
> > prefer-read is simple to support than read-only.
> >
> > Here I attached an updated patch that is rebased to the latest master
> and also
> > fixed some of the corner scenarios.
>

Thanks for the review.


> The patch applies, builds and passes "make check-world".
>
> I think the "prefer-read" functionality is desirable: It is exactly what
> you need
> if you want to use replication for load balancing, and your application
> supports
> different database connections for reading and writing queries.
>
> "read-only" does not have a clear use case in my opinion.
>
> With the patch, PostgreSQL behaves as expected if I have a primary and a
> standby and run:
>
>   psql "host=/tmp,/tmp port=5433,5434 target_session_attrs=prefer-read"
>
> But if I stop the standby (port 5434), libpq goes into an endless loop.
>

There was a problem in reusing the primary host index and it led to the loop.
The attached patch fixes the issue.


> Concerning the code:
>
> - The documentation needs some attention. Suggestion:
>
>If this parameter is set to prefer-read, connections
>where SHOW transaction_read_only returns on are preferred.
>If no such connection can be found, a connection that allows read-write
>transactions will be accepted.
>

Updated as per your comment.


> - I think the construction with "read_write_host_index" makes the code
> even more
>   complicated than it already is.
>
>   What about keeping the first successful connection open and storing it
> in a
>   variable if we are in "prefer-read" mode.
>   If we get the read-only connection we desire, close that cached
> connection,
>   otherwise use it.
>

Even if we add a variable to cache the connection, I don't think the logic of
checking the next host for a read-only connection will change, but the extra
connection request to the read-write host will be removed.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Allow-taget-session-attrs-to-accept-prefer-read-opti_v2.patch
Description: Binary data


Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-11 Thread Etsuro Fujita

(2018/07/10 19:58), Ashutosh Bapat wrote:

On Tue, Jul 10, 2018 at 9:08 AM, Etsuro Fujita
  wrote:

(2018/07/09 20:43), Ashutosh Bapat wrote:

At the cost of having targetlist being type inconsistent. I don't have
any testcase either to show that that's a problem in practice.



As I said before, I don't see any issue in having such a targetlist, but I
might be missing something, so I'd like to discuss a bit more about that.
Could you tell me the logic/place in the PG code where you think the problem
might occur.  (IIRC, you mentioned something about that before (pathkeys? or
index-only scans?), but sorry, I don't understand that.)


IIUC, an index-only scan will be used when all the required columns
are covered by an index. If there is an index on the whole-row
reference of the parent, it will be translated into a
ConvertRowtypeExpr of the child when an index on the child is created.


I think so too.


If the targetlist doesn't have ConvertRowtypeExpr, as your patch does,
the planner won't be able to use such an index on the child table. But
I couldn't create an index with a whole-row reference in it. So, I
think this isn't possible right now.


Actually, even if we could create such an index on the child table and
the targetlist had the ConvertRowtypeExpr, the planner would still not
be able to use an index-only scan with that index, because
check_index_only would not consider an index-only scan possible for it:
that index is an expression index, and the function currently does not
consider index expressions to be returnable in an index-only scan.  That
behavior of the planner might be improved in the future, though.



Pathkey points to an equivalence class, which contains equivalence
members. A parent equivalence class member containing a whole-row
reference gets translated into a child equivalence member containing a
ConvertRowtypeExpr.


I think so too.


At places in planner we match equivalence members
to the targetlist entries. This matching will fail unexpectedly when
ConvertRowtypeExpr is removed from a child's targetlist. But again I
couldn't reproduce a problem when such a mismatch arises.


IIUC, I don't think the planner assumes that for an equivalence member
there is a matching entry in the targetlist; what I think the planner
assumes is that an equivalence member can be computed from expressions in
the targetlist.  So, I think it is safe to have whole-row Vars instead of
ConvertRowtypeExprs in the targetlist.


Thanks for the explanation!

Best regards,
Etsuro Fujita



Re: How can we submit code patches that implement our (pending) patents?

2018-07-11 Thread Dave Page
On Wed, Jul 11, 2018 at 1:34 AM, Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Dave Page [mailto:dp...@pgadmin.org]
> > SFLC have acted as the projects counsel in the past, so I'm not surprised
> > they aren't talking to you; you won't be a known contact to them as a PG
> > contributor, and as a Fujitsu employee there would likely be a conflict
> > of interest for them to talk to you.
>
> I see.  Then I hope for some reply saying so so that I can give up my hope
> that I might get a good idea from them...
>
> Who is a known contact to SFLC in PostgreSQL community?  Can we expect
> response from SFLC if he/she contacts them?
>

I am - however, as you've seen from Tom, the core team has already discussed
this and come to the conclusion that the risks and downsides of accepting
code under patent far outweigh the benefits, so I won't be contacting them
about this.

We do thank you (and Fujitsu) for your efforts though.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Problem with tupdesc in jsonb_to_recordset

2018-07-11 Thread Michael Paquier
On Wed, Jul 11, 2018 at 03:27:13PM +0900, Michael Paquier wrote:
> I don't think that your solution is correct.  From my read of 37a795a6,
> the tuple descriptor is moved from the query-lifespan memory context
> (ecxt_per_query_memory) to a function-level context, which could cause
> the descriptor to become busted as your test is pointing out.  Tom?

Hacking my way out I am finishing with the attached, which fixes the
problem and passes all regression tests.  I am not completely sure that
this the right approach though, I would need to think a bit more about
it.
--
Michael
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index e358b5ad13..b82060528b 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -2721,24 +2721,22 @@ JsValueToJsObject(JsValue *jsv, JsObject *jso)
 static void
 update_cached_tupdesc(CompositeIOData *io, MemoryContext mcxt)
 {
-	if (!io->tupdesc ||
-		io->tupdesc->tdtypeid != io->base_typid ||
-		io->tupdesc->tdtypmod != io->base_typmod)
-	{
-		TupleDesc	tupdesc = lookup_rowtype_tupdesc(io->base_typid,
-	 io->base_typmod);
-		MemoryContext oldcxt;
+	TupleDesc	tupdesc;
+	MemoryContext oldcxt;
 
-		if (io->tupdesc)
-			FreeTupleDesc(io->tupdesc);
+	if (io->tupdesc != NULL &&
+		io->tupdesc->tdtypeid == io->base_typid &&
+		io->tupdesc->tdtypmod == io->base_typmod)
+		return;
 
-		/* copy tuple desc without constraints into cache memory context */
-		oldcxt = MemoryContextSwitchTo(mcxt);
-		io->tupdesc = CreateTupleDescCopy(tupdesc);
-		MemoryContextSwitchTo(oldcxt);
+	tupdesc = lookup_rowtype_tupdesc(io->base_typid, io->base_typmod);
 
-		ReleaseTupleDesc(tupdesc);
-	}
+	/* copy tuple desc without constraints into cache memory context */
+	oldcxt = MemoryContextSwitchTo(mcxt);
+	io->tupdesc = CreateTupleDescCopy(tupdesc);
+	MemoryContextSwitchTo(oldcxt);
+
+	ReleaseTupleDesc(tupdesc);
 }
 
 /* recursively populate a composite (row type) value from json/jsonb */
@@ -3577,8 +3575,8 @@ populate_recordset_worker(FunctionCallInfo fcinfo, const char *funcname,
 	if (!cache)
 	{
 		fcinfo->flinfo->fn_extra = cache =
-			MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt, sizeof(*cache));
-		cache->fn_mcxt = fcinfo->flinfo->fn_mcxt;
+			MemoryContextAllocZero(rsi->econtext->ecxt_per_query_memory, sizeof(*cache));
+		cache->fn_mcxt = rsi->econtext->ecxt_per_query_memory;
 
 		if (have_record_arg)
 		{


signature.asc
Description: PGP signature


Re: In pageinspect, perform clean-up after testing gin-related functions

2018-07-11 Thread Amit Kapila
On Wed, Jul 11, 2018 at 12:37 PM, Kuntal Ghosh
 wrote:
> Hello all,
>
> In pageinspect/sql/gin.sql, we don't drop the table test1 at the end
> of the test. IMHO, we should clean-up at the end of a test.
>

Yeah, it is good practice to drop the objects at the end.  It is
strange that original commit adfb81d9e1 has this at the end of the
test, but a later commit 367b99bbb1 by Tom has removed the Drop
statement.  AFAICS, this is just a silly mistake, but I might be
missing something.  Tom, do you remember any reason for doing so?  If
not, then I think we can revert back that change (aka commit Kuntal's
patch).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: TRUNCATE tables referenced by FKs on partitioned tables

2018-07-11 Thread Michael Paquier
On Tue, Jul 10, 2018 at 08:06:24PM -0400, Alvaro Herrera wrote:
> You can't truncate prim on its own.  This is expected.
> alvherre=# truncate table prim, partfk;
> ERROR:  cannot truncate a table referenced in a foreign key constraint
> DETALLE:  Table "partfk" references "prim".
> SUGERENCIA:  Truncate table "partfk" at the same time, or use TRUNCATE ... 
> CASCADE.

You mean that instead:
=# truncate table prim;
ERROR:  0A000: cannot truncate a table referenced in a foreign key
constraint
DETAIL:  Table "partfk" references "prim".
HINT:  Truncate table "partfk" at the same time, or use TRUNCATE
... CASCADE.
LOCATION:  heap_truncate_check_FKs, heap.c:3245

I agree that this should be an error.

> However, you can't do it even if you try to include partfk in the mix:
> 
> alvherre=# truncate table prim, partfk;
> ERROR:  cannot truncate a table referenced in a foreign key constraint
> DETALLE:  Table "partfk" references "prim".
> SUGERENCIA:  Truncate table "partfk" at the same time, or use TRUNCATE ... 
> CASCADE.

Your first and second queries are the same :)

And those ones work:
=# truncate table partfk;
TRUNCATE TABLE
=# truncate table partfk, partfk1;
TRUNCATE TABLE
=# truncate table partfk, partfk1, partfk2;
TRUNCATE TABLE
=# truncate table partfk, partfk2;
TRUNCATE TABLE

> Trying to list all the partitions individually is pointless:
> 
> alvherre=# truncate table prim, partfk, partfk1, partfk2;
> ERROR:  cannot truncate a table referenced in a foreign key constraint
> DETALLE:  Table "partfk" references "prim".
> SUGERENCIA:  Truncate table "partfk" at the same time, or use TRUNCATE ... 
> CASCADE.

Yes, I would expect this one to pass.

> CASCADE is also useless:
> 
> alvherre=# truncate table prim cascade;
> NOTICE:  truncate cascades to table "partfk"
> NOTICE:  truncate cascades to table "partfk1"
> NOTICE:  truncate cascades to table "partfk2"
> ERROR:  cannot truncate a table referenced in a foreign key constraint
> DETALLE:  Table "partfk" references "prim".
> SUGERENCIA:  Truncate table "partfk" at the same time, or use TRUNCATE ... 
> CASCADE.

And this one as well.
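
For reference, a minimal setup that seems to match the quoted output (the
table names come from the output above; the partition bounds are my own
guess):

create table prim (a int primary key);
create table partfk (a int references prim) partition by range (a);
create table partfk1 partition of partfk for values from (0) to (100);
create table partfk2 partition of partfk for values from (100) to (200);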
--
Michael


signature.asc
Description: PGP signature


In pageinspect, perform clean-up after testing gin-related functions

2018-07-11 Thread Kuntal Ghosh
Hello all,

In pageinspect/sql/gin.sql, we don't drop the table test1 at the end
of the test. IMHO, we should clean-up at the end of a test. I've
attached the patch to perform the same.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From 1e4be3749eed9ff9d59f775d2bd4ad2b409a6d3c Mon Sep 17 00:00:00 2001
From: Kuntal Ghosh 
Date: Wed, 11 Jul 2018 12:21:13 +0530
Subject: [PATCH] In pageinspect, drop table after testing gin-related
 functions

---
 contrib/pageinspect/expected/gin.out | 1 +
 contrib/pageinspect/sql/gin.sql  | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/contrib/pageinspect/expected/gin.out b/contrib/pageinspect/expected/gin.out
index 82f63b2..ef7570b 100644
--- a/contrib/pageinspect/expected/gin.out
+++ b/contrib/pageinspect/expected/gin.out
@@ -35,3 +35,4 @@ FROM gin_leafpage_items(get_raw_page('test1_y_idx',
 -[ RECORD 1 ]
 ?column? | t
 
+DROP TABLE test1;
diff --git a/contrib/pageinspect/sql/gin.sql b/contrib/pageinspect/sql/gin.sql
index d516ed3..423f5c5 100644
--- a/contrib/pageinspect/sql/gin.sql
+++ b/contrib/pageinspect/sql/gin.sql
@@ -17,3 +17,5 @@ SELECT COUNT(*) > 0
 FROM gin_leafpage_items(get_raw_page('test1_y_idx',
 (pg_relation_size('test1_y_idx') /
  current_setting('block_size')::bigint)::int - 1));
+
+DROP TABLE test1;
-- 
1.8.3.1



Re: automatic restore point

2018-07-11 Thread Michael Paquier
On Wed, Jul 11, 2018 at 06:11:01AM +, Yotsunaga, Naoki wrote:
> I want to hear about the following in addition to the previous
> comment.  What would you do if your customer dropped the table and asked
> you to restore it?

I can think of 4 reasons on top of my mind:
1) Don't do that.
2) Implement safe-guards using utility hooks or event triggers.
3) Have a delayed standby if you don't believe that your administrators
are skilled enough in case.
4) Have backups and a WAL archive.
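
As an illustration of point 2, a minimal event-trigger sketch that blocks
DROP TABLE (trigger and function names are made up):

create or replace function guard_drop_table() returns event_trigger
language plpgsql as $$
begin
    raise exception 'DROP TABLE is disabled on this cluster';
end;
$$;

create event trigger no_drop_table
    on ddl_command_start
    when tag in ('DROP TABLE')
    execute procedure guard_drop_table();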

> Everyone is thinking about what to do to avoid operational mistakes, but I’m
> thinking about what happens after the user’s mistake.
> What I mean is that not all users will set things up in advance.
> For example, if you make the settings described in the manual, you
> will not drop a table by an operational mistake. However, not all users
> do that setup.
> For such users, I think it is necessary to have a function to
> easily restore data after a failed operation without setting anything
> up in advance. So I proposed this function.

Well, if you put in place correct measures from the start you would not
have problems.  It seems to me that there is no point in implementing
something which is a solution for a very narrow case, where the user has
shot his own foot to begin with.  Having backups anyway is mandatory by
the way, standby replicas are not backups.
--
Michael


signature.asc
Description: PGP signature


  1   2   >