Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Peter Geoghegan
On Wed, Aug 8, 2018 at 10:03 PM, Tom Lane  wrote:
>> Typically not, but I don't think that we can rule it out right away.
>
> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
> a parallel build happened or not, so that we can check the buildfarm logs
> next time we see that failure?

I can do that tomorrow. Of course, it might be easier to just push the
pending fix for the mapped relation bug, and see if that makes the
issue go away. That would take longer, but it's also the only thing
that's likely to be definitive anyway.

If you want to go ahead with it yourself, I suggest modifying the
DEBUG1 elog()s within index_build() to be LOG elevel.
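
For illustration, the sort of one-line change being suggested -- a sketch
only; the actual DEBUG1 messages in index_build() may be worded differently:

    /* In index_build(): bump the existing debug message to LOG so that
     * parallel-build decisions show up in buildfarm logs. */
    ereport(LOG,        /* was DEBUG1 */
            (errmsg("building index \"%s\" on table \"%s\" with request for %d parallel workers",
                    RelationGetRelationName(indexRelation),
                    RelationGetRelationName(heapRelation),
                    indexInfo->ii_ParallelWorkers)));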

-- 
Peter Geoghegan



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Tom Lane
Peter Geoghegan  writes:
> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane  wrote:
>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>> index rebuild during that test, would it?

> Typically not, but I don't think that we can rule it out right away.

Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
a parallel build happened or not, so that we can check the buildfarm logs
next time we see that failure?

> I don't know all that much about the buildfarm client code, and it's
> late.

It doesn't really stick in any undocumented configuration changes,
AFAIK.  Possibly Andrew would have some more insight.

regards, tom lane



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Peter Geoghegan
On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane  wrote:
>> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
>> pg_class_oid_index when we happen to get a parallel build, since
>> pg_class is a mapped relation, and I've identified that as a problem
>> for parallel CREATE INDEX [2]. If that was the ultimate cause of the
>> issue, it would explain why only REL_11_STABLE and master are
>> involved.
>
> Oooh ... but pg_class wouldn't be big enough to get a parallel
> index rebuild during that test, would it?

Typically not, but I don't think that we can rule it out right away.
"min_parallel_table_scan_size = 0" will make it happen, and that
happens in quite a few regression tests. Though that doesn't include
vacuum.sql.

The animals mandrill, mantid and lapwing are all
"force_parallel_mode=regress", according to their notes. That actually
isn't enough to force parallel CREATE INDEX, though I tend to think it
should be. I don't see anything interesting in their "extra_config",
but perhaps "min_parallel_table_scan_size = 0" got in some other way.
I don't know all that much about the buildfarm client code, and it's
late.

-- 
Peter Geoghegan



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-08-08 Thread Amit Langote
On 2018/08/09 13:00, Tom Lane wrote:
> Amit Langote  writes:
>> One reason why we should adopt such a test case is that, in the future, we
>> may arrange for make_partitionedrel_pruneinfo(), whose code we just fixed,
>> not to be called if we know that run-time pruning is not needed.  That
>> seems to be true for the test added by the commit; that is, it doesn't
>> need run-time pruning.
> 
> Not following your argument here.  Isn't make_partition_pruneinfo
> precisely what is in charge of figuring out whether run-time pruning
> is possible?

With the current coding, yes, it is...

> (See my point in the other thread about Jaime's assertion crash,
> that no run-time pruning actually would be possible for that query.
> But we got to the assertion failure anyway.)

When I first saw that make_partition_pruneinfo() *always* gets called
from create_append_plan() if rel->baserestrictinfo is non-NIL, I
wondered whether we couldn't avoid doing it in the cases where we'll
end up throwing away all that work anyway.  But looking at the code now,
that may be a bit hard -- analyze_partkey_exprs(), which determines whether
we'll need any execution-time pruning, cannot be called any sooner.

So, okay, I have to admit that my quoted argument isn't that strong.

Thanks,
Amit




Re: Ideas for a relcache test mode about missing invalidations

2018-08-08 Thread Kyotaro HORIGUCHI
Hello.

At Fri, 3 Aug 2018 15:42:22 -0700, Peter Geoghegan  wrote in 

> On Fri, Aug 3, 2018 at 12:34 AM, Kyotaro HORIGUCHI wrote:
> >> I reread through the thread and IIUC, drop_index() is sending
> >> inval on the owning relation and invalidation happens (that is,
> >
> > I finally understand that I was totally inept. This came from
> > *the result* of the original -bug thread.
> 
> I certainly would not say that you were in any way inept. Perhaps
> there is an argument for *also* doing what you propose. I am not
> dismissive of your idea.

I jumped to the URL to see the thread, but it was not up to date.
That was the cause of my confusion: I thought the problem was the
error seen *after* the invalidation fix patch in my repo.

> It seems like a question that should be considered separately, on
> another thread. If you still want to pursue it.

Thanks. I might bring this up again later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: partition tree inspection functions

2018-08-08 Thread Amit Langote
Thanks, Thomas, for the notification.

On 2018/08/08 20:47, Thomas Munro wrote:
> On Wed, Aug 8, 2018 at 11:21 PM, Thomas Munro wrote:
>> On Tue, Aug 7, 2018 at 7:32 PM, Amit Langote wrote:
>>> Attached updated patch adds isleaf to pg_partition_children's output.
>>
>> Hmm, I wonder where this garbage is coming from:
> 
> partition.c:437:3: error: array index 3 is past the end of the array
> (which contains 3 elements) [-Werror,-Warray-bounds]

Oops, fixed.

> That'll do it.  Also:
> 
> partition.c:409:19: error: implicit declaration of function
> 'get_rel_relkind' is invalid in C99
> [-Werror,-Wimplicit-function-declaration]

Fixed too.

Attached is an updated patch.

Thanks,
Amit
From de9a597f8cfc1e8707c047ec7ec760ef01244a03 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 3 Aug 2018 17:06:05 +0900
Subject: [PATCH v11] Add pg_partition_children to report partitions

It returns a set of records, one for each partition, containing the
partition name, parent name, and level in the partition tree with the
given table as root.
---
 doc/src/sgml/func.sgml   | 37 
 src/backend/catalog/partition.c  | 90 
 src/include/catalog/pg_proc.dat  |  9 +++
 src/test/regress/expected/partition_info.out | 84 ++
 src/test/regress/parallel_schedule   |  2 +-
 src/test/regress/serial_schedule |  1 +
 src/test/regress/sql/partition_info.sql  | 35 +++
 7 files changed, 257 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/partition_info.out
 create mode 100644 src/test/regress/sql/partition_info.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..6b10aa3b3d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19995,6 +19995,43 @@ postgres=# SELECT * FROM 
pg_walfile_name_offset(pg_stop_backup());
 The function returns the number of new collation objects it created.

 
+   
+Partitioning Information Functions
+
+ 
+  Name Return Type 
Description
+ 
+
+  
+  
+   
pg_partition_children(regclass)
+   setof record
+   
+List name, parent name, level, and whether it's a leaf partition for
+each partition contained in the partition tree with given root table,
+including the root table itself
+   
+  
+ 
+
+   
+
+   
+To check the total size of the data contained in
+measurement table described in
+, one could use the
+following query:
+   
+
+
+select pg_size_pretty(sum(pg_relation_size(relid))) as total_size from pg_partition_children('measurement');
+ total_size 
+
+ 24 kB
+(1 row)
+
+
+
   
 
   
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 558022647c..6061bfc369 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -23,13 +23,17 @@
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_partitioned_table.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
 #include "nodes/makefuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/prep.h"
 #include "optimizer/var.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/partcache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -357,3 +361,89 @@ get_proposed_default_constraint(List *new_part_constraints)
 
return make_ands_implicit(defPartConstraint);
 }
+
+Datum
+pg_partition_children(PG_FUNCTION_ARGS)
+{
+   Oid rootrelid = PG_GETARG_OID(0);
+   FuncCallContext *funccxt;
+   ListCell **next;
+
+   if (SRF_IS_FIRSTCALL())
+   {
+   MemoryContext oldcxt;
+   TupleDesc   tupdesc;
+   List   *partitions;
+
+   funccxt = SRF_FIRSTCALL_INIT();
+   oldcxt = MemoryContextSwitchTo(funccxt->multi_call_memory_ctx);
+
+   partitions = find_all_inheritors(rootrelid, NoLock, NULL);
+
+   tupdesc = CreateTemplateTupleDesc(4, false);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 1, "relid",
+  REGCLASSOID, -1, 0);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 2, "parentid",
+  REGCLASSOID, -1, 0);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "level",
+  INT4OID, -1, 0);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 4, "isleaf",
+  BOOLOID, -1, 0);
+
+   next = (ListCell **) palloc(sizeof(ListCell *));
+   *next = list_head(partitions);
+
+   funccxt->attinmeta = TupleDescGetAttInMetadata(tupdesc);
+   funccxt->user_fctx = (void *) next;

Re: Documentaion fix.

2018-08-08 Thread Kyotaro HORIGUCHI
At Sat, 4 Aug 2018 05:58:52 +0900, Michael Paquier wrote in <20180803205852.gb20...@paquier.xyz>
> On Fri, Aug 03, 2018 at 04:37:05PM -0400, Alvaro Herrera wrote:
> > On 2018-Aug-03, Kyotaro HORIGUCHI wrote: 
> >> That said, I don't object to reduce the columns. Please find the
> >> attached.
> > 
> > Thanks, pushed.
> 
> Thanks Alvaro for keeping the three-column version, I was going to look
> at the proposed patch and push as you did, until I noticed that you
> showed up :)

Thanks.

# I noticed that the subject has a typo..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-08-08 Thread Tom Lane
Amit Langote  writes:
> One reason why we should adopt such a test case is that, in the future, we
> may arrange for make_partitionedrel_pruneinfo(), whose code we just fixed,
> not to be called if we know that run-time pruning is not needed.  That
> seems to be true for the test added by the commit; that is, it doesn't
> need run-time pruning.

Not following your argument here.  Isn't make_partition_pruneinfo
precisely what is in charge of figuring out whether run-time pruning
is possible?

(See my point in the other thread about Jaime's assertion crash,
that no run-time pruning actually would be possible for that query.
But we got to the assertion failure anyway.)

regards, tom lane



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-08-08 Thread Amit Langote
On 2018/08/09 0:48, Tom Lane wrote:
> David Rowley  writes:
>> On 8 August 2018 at 17:28, Amit Langote wrote:
>>> Attached is a patch which modifies the if test to compare relids instead
>>> of RelOptInfo pointers.
> 
>> Thanks for investigating and writing a patch. I agree with the fix.
> 
> I changed this to compare the relid sets not just rel->relid, since
> rel->relid is only reliable for baserels.  The partitioned rel could
> safely be assumed to be a baserel, but I'm less comfortable with
> supposing that the parentrel always will be.  Otherwise, added a
> test case based on Rushabh's example and pushed.  (I'm not quite
> sure if the plan will be stable enough to satisfy the buildfarm,
> but we'll soon find out ...)

Thank you for committing; agreed that comparing relid sets for equality
might be more future-proof.

About the test case: I'm wondering if we should, as David seemed to
suggest, add a test case that would actually use run-time pruning.  Maybe
it would be even better if the new test also had a partitioned parent under
a UNION ALL parent under ModifyTable.  Something like the attached?

One reason why we should adopt such a test case is that, in the future, we
may arrange for make_partitionedrel_pruneinfo(), whose code we just fixed,
not to be called if we know that run-time pruning is not needed.  That
seems to be true for the test added by the commit; that is, it doesn't
need run-time pruning.

Regards,
Amit
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 693c348185..61457862a9 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2478,6 +2478,8 @@ deallocate ab_q3;
 deallocate ab_q4;
 deallocate ab_q5;
 -- UPDATE on a partition subtree has been seen to have problems.
+set enable_hashjoin to off;
+set enable_mergejoin to off;
 insert into ab values (1,2);
 explain (analyze, costs off, summary off, timing off)
 update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a;
@@ -2556,6 +2558,69 @@ table ab;
  1 | 3
 (1 row)
 
+truncate ab;
+insert into ab values (1, 1), (1, 2), (1, 3);
+explain (analyze, costs off, summary off, timing off)
+update ab_a1 set b = 3 from (select * from (select * from ab_a2 union all select 1, 2) s where s.b = (select 2)) ss where ss.a = ab_a1.a;
+                                  QUERY PLAN
+------------------------------------------------------------------------------
+ Update on ab_a1 (actual rows=0 loops=1)
+   Update on ab_a1_b1
+   Update on ab_a1_b2
+   Update on ab_a1_b3
+   InitPlan 1 (returns $0)
+ ->  Result (actual rows=1 loops=1)
+   ->  Nested Loop (actual rows=1 loops=1)
+ ->  Append (actual rows=1 loops=1)
+   ->  Seq Scan on ab_a2_b1 (never executed)
+ Filter: (b = $0)
+   ->  Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
+ Filter: (b = $0)
+   ->  Seq Scan on ab_a2_b3 (never executed)
+ Filter: (b = $0)
+   ->  Subquery Scan on "*SELECT* 2" (actual rows=1 loops=1)
+ ->  Result (actual rows=1 loops=1)
+   One-Time Filter: (2 = $0)
+ ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=1 loops=1)
+   Index Cond: (a = ab_a2_b1.a)
+   ->  Nested Loop (actual rows=1 loops=1)
+ ->  Append (actual rows=1 loops=1)
+   ->  Seq Scan on ab_a2_b1 (never executed)
+ Filter: (b = $0)
+   ->  Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
+ Filter: (b = $0)
+   ->  Seq Scan on ab_a2_b3 (never executed)
+ Filter: (b = $0)
+   ->  Subquery Scan on "*SELECT* 2_1" (actual rows=1 loops=1)
+ ->  Result (actual rows=1 loops=1)
+   One-Time Filter: (2 = $0)
+ ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=1 loops=1)
+   Index Cond: (a = ab_a2_b1.a)
+   ->  Nested Loop (actual rows=1 loops=1)
+ ->  Append (actual rows=1 loops=1)
+   ->  Seq Scan on ab_a2_b1 (never executed)
+ Filter: (b = $0)
+   ->  Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
+ Filter: (b = $0)
+   ->  Seq Scan on ab_a2_b3 (never executed)
+ Filter: (b = $0)
+   ->  Subquery Scan on "*SELECT* 2_2" (actual rows=1 loops=1)
+ ->  Result (actual rows=1 loops=1)
+   One-Time Filter: (2 = $0)
+ ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=1 loops=1)
+   Index Cond: (a = ab_a2_b1.a)
+(45 rows)
+
+table ab;
+ a | b 
+---+---
+ 1 | 3
+ 1 | 3
+ 1 | 3
+(3 rows)
+
+reset enable_hashjoin;
+reset enable_mergejoin;
 drop table ab, lprt_a;
 -- Join
 create table tbl1(col1 int);

Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-08-08 Thread Andrey Lepikhov



09.08.2018 05:19, Peter Geoghegan пишет:

On Tue, Aug 7, 2018 at 12:19 AM, Andrey Lepikhov wrote:

I wrote a background worker (hcleaner) to demonstrate an application of
Retail IndexTuple deletion (see the attached patch).
Like autovacuum, it utilizes the concept of one launcher and many workers,
but one worker corresponds to one database.

Short description:
The backend collects dirty block numbers in a hash table at the point in
code immediately after the heap_page_prune() call. The backend sends a
batch of dirty block numbers (not one-by-one!) over a socket at the end of
the transaction, or when the hash table is full.
The launcher transfers block numbers to the corresponding workers.
A worker collects the dead tuples from a block, cleans the index relations,
then cleans the heap block. It uses conditional locking with a waiting-list
approach if the heap block is busy.

hcleaner is experimental, but make check-world passes


How does this affect ordinary opportunistic pruning?

As far as I understand, the background worker does not affect ordinary
opportunistic pruning. For a dirty block it does vacuum-like cleanup work
(if it can acquire a conditional cleanup lock).
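
For reference, a minimal sketch of that conditional-lock pattern against
the PG11 buffer-manager API; remember_for_retry() is a hypothetical
waiting-list helper:

    static void
    hcleaner_try_block(Relation rel, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(rel, blkno);

        if (ConditionalLockBufferForCleanup(buf))
        {
            TransactionId latestRemovedXid;

            /* got the cleanup lock without waiting: prune the page */
            heap_page_prune(rel, buf, RecentGlobalXmin, false,
                            &latestRemovedXid);
            UnlockReleaseBuffer(buf);
        }
        else
        {
            /* page is busy: drop the pin and retry the block later */
            ReleaseBuffer(buf);
            remember_for_retry(blkno);
        }
    }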


I tried two approaches:
1. In the executor: the TIDs of deleted/updated tuples are collected
directly in the executor and sent to the background worker, which calls
heap_page_prune() itself. This is not a good variant, because each backend
already performs opportunistic pruning, and that is sufficient work to
detect dead tuples. Moreover, in this case the worker competes with the
backends for pruning and is heavily loaded when many backends exist.
2. In heap_page_prune_opt(): the backend collects the ids of dirty blocks
immediately after pruning and sends them to the worker. In this case the
backends perform all the work of detecting dead tuples. This lets the
worker skip cleaning a block during periods of heavy load and clean it
later. Since the locks are conditional, we do not hamper the backends'
work during high-load periods, and we utilize idle time for relation
cleaning in a soft manner. (A sketch of the backend side follows.)
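
A minimal sketch of the backend side of variant 2, assuming a dynahash
keyed by (relation, block); flush_dirty_blocks() is a hypothetical helper
that ships the batch over the socket:

    typedef struct DirtyBlockKey
    {
        Oid         relid;
        BlockNumber blkno;
    } DirtyBlockKey;

    static HTAB *dirty_blocks = NULL;

    static void flush_dirty_blocks(HTAB *h);    /* hypothetical */

    static void
    remember_dirty_block(Relation rel, BlockNumber blkno)
    {
        DirtyBlockKey key;
        bool        found;

        if (dirty_blocks == NULL)
        {
            HASHCTL     ctl;

            MemSet(&ctl, 0, sizeof(ctl));
            ctl.keysize = sizeof(DirtyBlockKey);
            ctl.entrysize = sizeof(DirtyBlockKey);
            dirty_blocks = hash_create("dirty blocks", 256, &ctl,
                                       HASH_ELEM | HASH_BLOBS);
        }

        key.relid = RelationGetRelid(rel);
        key.blkno = blkno;
        (void) hash_search(dirty_blocks, &key, HASH_ENTER, &found);

        /* ship a batch (not one-by-one!) once the table fills up */
        if (hash_get_num_entries(dirty_blocks) >= 1024)
            flush_dirty_blocks(dirty_blocks);
    }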


Variant 2 looks better, and that is what I now use.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: FailedAssertion on partprune

2018-08-08 Thread Tom Lane
I wrote:
> Now that seems to me to be a rather weird plan: why doesn't it prefer
> to flatten everything into one parallel append?  Indeed, if you take
> out any of the remaining query parts such as the LIMIT, that's what
> it does.  I think that its willingness to do this is actually kind
> of a bug, because this query is going to be a total disaster in terms
> of the number of workers it will try to use --- way more than the
> user would expect given max_parallel_workers_per_gather = 2.

Oh ... never mind that last.  The parent Append will run its children
sequentially, so that the Gathers execute one at a time, and at no
point will more than 2 workers be active.

Nonetheless, it's a damfool query plan, because we'll be going through
parallel worker startup/shutdown overhead 4 times for no benefit.
We really should put in some kind of logic to force those sibling
Gathers to be aggregated into one, or else disallow them altogether.

regards, tom lane



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-08 Thread Andres Freund



On August 9, 2018 12:41:17 AM GMT+05:30, Michael Paquier wrote:
>Hi Andres,
>
>(Not my intention to miss your message, I have just noticed it.)
>
>On Wed, Aug 08, 2018 at 01:41:27AM -0700, Andres Freund wrote:
>> I can't parse this. "Even if this is an atomic operation, this can be
>> safely done lock-less" - that seems like a contradictory sentence. Is
>> there a "not" missing?
>
>Yes, a "not" has gone missing here.  I reworked the comment block as
>mentioned upthread.
>
>> Also, this seems like insufficient reasoning. What guarantees the
>> visibility of the flag? You're going to have to talk about externally
>> implied memory ordering here.  Or add explicit barriers - the latter is
>> probably preferable.
>
>Well, we use BackendIdGetProc() in this case, where we could finish with
>information out-of-date pretty quickly, and there is no special
>reasoning for backendId and databaseId for autovacuum but...  Perhaps
>you could explain more what you have in mind?  And it is not like this
>relies on the number of elements in PGPROC.

My point is that the documentation isn't sufficient. Not that there's an active 
problem.
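
(For context, the explicit-barrier pattern being asked for might look
roughly like this -- the field names are illustrative, not from any actual
patch:)

    /* writer: publish the data before the flag ... */
    proc->databaseId = dboid;
    pg_write_barrier();
    proc->tempNamespaceInUse = true;    /* illustrative field */

    /* reader: check the flag before trusting the data */
    if (proc->tempNamespaceInUse)
    {
        pg_read_barrier();              /* pairs with the write barrier */
        dboid = proc->databaseId;
    }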

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Tom Lane
Peter Geoghegan  writes:
> On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund  wrote:
>> I don't immediately see it being responsible, but I wonder if there's a
>> chance it actually is: Note that it happens in a parallel group that
>> includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
>> don't immediately see how it could apply.

> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
> pg_class_oid_index when we happen to get a parallel build, since
> pg_class is a mapped relation, and I've identified that as a problem
> for parallel CREATE INDEX [2]. If that was the ultimate cause of the
> issue, it would explain why only REL_11_STABLE and master are
> involved.

Oooh ... but pg_class wouldn't be big enough to get a parallel
index rebuild during that test, would it?

regards, tom lane



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-08 Thread Peter Geoghegan
On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund  wrote:
>> HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
>> but I don't immediately know where.
>
> Hm, there was:
> http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de
>
>
> I don't immediately see it being responsible, but I wonder if there's a
> chance it actually is: Note that it happens in a parallel group that
> includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
> don't immediately see how it could apply.

It's now pretty clear that it was not that particular bug, since I
pushed a fix, and yet the issue hasn't gone away on affected buildfarm
animals. There was a recurrence of the problem on lapwing, for example
[1].

Anyway, "VACUUM FULL pg_class" should be expected to corrupt
pg_class_oid_index when we happen to get a parallel build, since
pg_class is a mapped relation, and I've identified that as a problem
for parallel CREATE INDEX [2]. If that was the ultimate cause of the
issue, it would explain why only REL_11_STABLE and master are
involved.

My guess is that the metapage considers the root page to be at block 3
(block 3 is often the root page for small though not tiny B-Trees),
which for whatever reason is where we get a short read. I don't know
why there is a short read, but corrupting mapped catalog indexes at
random can be expected to cause all kinds of chaos, so that doesn't
mean much.
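
(For anyone poking at an affected cluster: the metapage's notion of the
root can be dumped with a few lines of C against the PG11 nbtree
internals -- a sketch, untested here:)

    Buffer          metabuf = _bt_getbuf(indexRel, BTREE_METAPAGE, BT_READ);
    Page            metapg = BufferGetPage(metabuf);
    BTMetaPageData *metad = BTPageGetMeta(metapg);

    /* log where the metapage thinks the root of the index is */
    elog(LOG, "\"%s\": metapage says root is block %u at level %u",
         RelationGetRelationName(indexRel),
         metad->btm_root, metad->btm_level);
    _bt_relbuf(indexRel, metabuf);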

In any case, I'll probably push a fix for this other bug on Friday,
barring any objections. It's possible that that will make the problem
go away.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2018-08-04%2004%3A20%3A01
[2] 
https://www.postgresql.org/message-id/CAH2-Wzn=j0i8rxCAo6E=tbo9xuyxb8hbusnw7j_stkon8dd...@mail.gmail.com
-- 
Peter Geoghegan



Re: Why do we expand tuples in execMain.c?

2018-08-08 Thread Andres Freund



On August 9, 2018 1:33:17 AM GMT+05:30, Andrew Dunstan wrote:
>
>
>On 08/08/2018 12:20 AM, Andres Freund wrote:
>> Hi,
>>
>> I noticed
>>  if (HeapTupleHeaderGetNatts(tuple.t_data) <
>>      RelationGetDescr(erm->relation)->natts)
>>  {
>>      copyTuple = heap_expand_tuple(&tuple,
>>                                    RelationGetDescr(erm->relation));
>>  }
>>  else
>>  {
>>      /* successful, copy tuple */
>>      copyTuple = heap_copytuple(&tuple);
>>  }
>>
>> in EvalPlanQualFetchRowMarks, and I'm somewhat confused why it's there?
>> If it's required here, why isn't it required in dozens of other places?
>>
>>
>
>
>
>Not dozens, I think, since you can't have short records for catalog 
>tables, which account for most of the calls to heap_copytuple().
>
>I will look at the remainder of cases (less than 10) and reply in a day
>or two.

But why is it needed at all? Isn't the deforming code at the site where we
access the columns sufficient?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Shared buffer access rule violations?

2018-08-08 Thread Asim R P
On Tue, Aug 7, 2018 at 7:00 PM, Peter Geoghegan  wrote:
>
> I wonder if it would be a better idea to enable Valgrind's memcheck to
> mark buffers as read-only or read-write. We've considered doing
> something like that for years, but for whatever reason nobody followed
> through.

Basic question: how do you mark buffers as read-only using the memcheck
tool?  Running installcheck with valgrind didn't uncover any errors:

valgrind --trace-children=yes pg_ctl -D datadir start
make installcheck-parallel
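
What Peter is presumably referring to are memcheck's client requests,
which let code change a region's addressability at runtime; a sketch of
how buffer pin/unpin could use them (an assumption about the approach,
not existing PostgreSQL code):

    #include <valgrind/memcheck.h>

    /* on losing the last pin: forbid all access to the page */
    VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(buf), BLCKSZ);

    /* on (re)pinning the buffer: make the page accessible again */
    VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);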

Asim & David



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-08-08 Thread Peter Geoghegan
On Tue, Aug 7, 2018 at 12:19 AM, Andrey Lepikhov wrote:
> I wrote a background worker (hcleaner) to demonstrate an application of
> Retail IndexTuple deletion (see the attached patch).
> Like autovacuum, it utilizes the concept of one launcher and many workers,
> but one worker corresponds to one database.
>
> Short description:
> The backend collects dirty block numbers in a hash table at the point in
> code immediately after the heap_page_prune() call. The backend sends a
> batch of dirty block numbers (not one-by-one!) over a socket at the end of
> the transaction, or when the hash table is full.
> The launcher transfers block numbers to the corresponding workers.
> A worker collects the dead tuples from a block, cleans the index
> relations, then cleans the heap block. It uses conditional locking with a
> waiting-list approach if the heap block is busy.
>
> hcleaner is experimental, but make check-world passes

How does this affect ordinary opportunistic pruning?

-- 
Peter Geoghegan



Re: Scariest patch tournament, PostgreSQL 11 edition

2018-08-08 Thread Fabien COELHO



Another way to look at it:

 Alvaro = 6*2 + 4*4 = 28
 Robert = 4*5 + 2 = 22
 Andres = 5 + 3 + 1 =  9
 Peter  = 5
 Andrew = 4
 Teodor = 3
 Simon = 3

A round of applause for our winner!

--
Fabien.



Re: FailedAssertion on partprune

2018-08-08 Thread Tom Lane
I spent some more time poking at Jaime's example.  I can reduce the
problem query to this and still get the Assert crash:

select random()
from radicado tablesample bernoulli (9.7)
where radi_texto = radi_inst_actu
limit 33;

None of the elements of this query can be removed without causing the
Assert to go away, which is weird because none of them have any apparent
connection to partition pruning.

With the Assert taken out, we find that this is the plan that's causing
the problem:

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..919.71 rows=33 width=8)
   ->  Append  (cost=0.00..12792.40 rows=459 width=8)
         ->  Sample Scan on radicado2012  (cost=0.00..2939.18 rows=93 width=8)
               Sampling: bernoulli ('9.7'::real)
               Filter: (radi_texto = radi_inst_actu)
         ->  Gather  (cost=1000.00..2485.72 rows=93 width=8)
               Workers Planned: 2
               ->  Parallel Append  (cost=0.00..1476.18 rows=39 width=0)
                     ->  Sample Scan on radicado2013_part00  (cost=0.00..1475.99 rows=47 width=0)
                           Sampling: bernoulli ('9.7'::real)
                           Filter: (radi_texto = radi_inst_actu)
                     ->  Sample Scan on radicado2013_part01  (cost=0.00..1461.88 rows=46 width=0)
                           Sampling: bernoulli ('9.7'::real)
                           Filter: (radi_texto = radi_inst_actu)
         ->  Gather  (cost=1000.00..2491.15 rows=93 width=8)
               Workers Planned: 2
               ->  Parallel Append  (cost=0.00..1481.62 rows=39 width=0)
                     ->  Sample Scan on radicado2014_part00  (cost=0.00..1481.42 rows=47 width=0)
                           Sampling: bernoulli ('9.7'::real)
                           Filter: (radi_texto = radi_inst_actu)
                     ->  Sample Scan on radicado2014_part01  (cost=0.00..1464.05 rows=46 width=0)
                           Sampling: bernoulli ('9.7'::real)
                           Filter: (radi_texto = radi_inst_actu)
         ->  Gather  (cost=1000.00..2482.47 rows=93 width=8)
               Workers Planned: 2
               ->  Parallel Append  (cost=0.00..1472.93 rows=39 width=0)
                     ->  Sample Scan on radicado2015_part00  (cost=0.00..1472.74 rows=47 width=0)
                           Sampling: bernoulli ('9.7'::real)
                           Filter: (radi_texto = radi_inst_actu)
                     ->  Sample Scan on radicado2015_part01  (cost=0.00..1454.28 rows=46 width=0)
                           Sampling: bernoulli ('9.7'::real)
                           Filter: (radi_texto = radi_inst_actu)
         ->  Gather  (cost=1000.00..2380.72 rows=86 width=8)
               Workers Planned: 2
               ->  Parallel Append  (cost=0.00..1371.90 rows=36 width=0)
                     ->  Sample Scan on radicado2016_part00  (cost=0.00..1371.72 rows=43 width=0)
                           Sampling: bernoulli ('9.7'::real)
                           Filter: (radi_texto = radi_inst_actu)
                     ->  Sample Scan on radicado2016_part01  (cost=0.00..1365.21 rows=43 width=0)
                           Sampling: bernoulli ('9.7'::real)
                           Filter: (radi_texto = radi_inst_actu)
         ->  Sample Scan on radicado2017  (cost=0.00..10.87 rows=1 width=8)
               Sampling: bernoulli ('9.7'::real)
               Filter: (radi_texto = radi_inst_actu)
(44 rows)

Now that seems to me to be a rather weird plan: why doesn't it prefer
to flatten everything into one parallel append?  Indeed, if you take
out any of the remaining query parts such as the LIMIT, that's what
it does.  I think that its willingness to do this is actually kind
of a bug, because this query is going to be a total disaster in terms
of the number of workers it will try to use --- way more than the
user would expect given max_parallel_workers_per_gather = 2.

In any case, I now understand David's concern about creating a
usable regression test case that produces a plan like this.
It seems to depend on a very corner-case-y and possibly buggy
set of costing behaviors.

So, while I'd be willing to go ahead and commit the Assert-removal, this
actually increases my concern about whether a correct set of pruning
instructions get generated in cases like this, because I now feel fairly
confident in saying that we haven't tested the logic for such cases *at
all*.  (Note that Jaime's query doesn't have any pruning steps after the
dust settles --- if make_partition_pruneinfo doesn't crash, it eventually
figures out that there are no quals usable for pruning.  So the fact that
it goes through without the Assert does nothing to assuage my fear.)

I think we'd be well advised to either change the logic to prohibit
multi-level pruning plans from being 

Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-08-08 Thread Nico Williams
On Wed, Aug 08, 2018 at 11:47:34AM -0500, Nico Williams wrote:
> Yes.  Would that snprintf() and vsnprintf() were async-signal-safe --
> they can be, and some implementations probably are, but they aren't
> required to be, then making ereport() safe would be easier.

So, I took a look at glibc's implementation for giggles.  It uses
malloc() (and free()) only in three cases: a) when printing floating
point numbers with very large/small exponents, b) when formatting long
wide-character strings into multi-byte strings, and c) when format
specifiers have width asterisks.

Assuming locales are not lazily loaded, I think that's probably all the
reasonable cases where vsnprintf() could call async-signal-unsafe
functions or do async-signal-unsafe things.

Considering that PostgreSQL already has async-signal-unsafe signal
handlers, might as well assume that vsnprintf() is async-signal-safe and
just format strings into alloca()'ed buffers or even into fixed-sized
automatic char arrays, and be done.  Perhaps when a global volatile
sig_atomic_t is set denoting we're handling a signal, then use
vsnprintf() with a fixed-sized automatic buffer, otherwise malloc() one.
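
A minimal sketch of that scheme (names hypothetical):

    #include <signal.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t in_signal_handler = 0;

    static void
    emit_message(const char *fmt, ...)
    {
        char        stackbuf[1024];     /* safe inside a signal handler */
        char       *buf = stackbuf;
        size_t      buflen = sizeof(stackbuf);
        va_list     ap;

        /* only the non-signal path is allowed to malloc() a bigger buffer */
        if (!in_signal_handler)
        {
            char       *heapbuf = malloc(8192);

            if (heapbuf != NULL)
            {
                buf = heapbuf;
                buflen = 8192;
            }
        }

        va_start(ap, fmt);
        vsnprintf(buf, buflen, fmt, ap);
        va_end(ap);

        (void) write(STDERR_FILENO, buf, strlen(buf));
        if (buf != stackbuf)
            free(buf);
    }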

Nico
-- 



Re: Why do we expand tuples in execMain.c?

2018-08-08 Thread Andrew Dunstan




On 08/08/2018 12:20 AM, Andres Freund wrote:

Hi,

I noticed
if (HeapTupleHeaderGetNatts(tuple.t_data) <
    RelationGetDescr(erm->relation)->natts)
{
    copyTuple = heap_expand_tuple(&tuple,
                                  RelationGetDescr(erm->relation));
}
else
{
    /* successful, copy tuple */
    copyTuple = heap_copytuple(&tuple);
}

in EvalPlanQualFetchRowMarks, and I'm somewhat confused why it's there?
If it's required here, why isn't it required in dozens of other places?






Not dozens, I think, since you can't have short records for catalog 
tables, which account for most of the calls to heap_copytuple().


I will look at the remainder of cases (less than 10) and reply in a day 
or two.


cheers

andrew


--

Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: memory leak when serializing TRUNCATE in reorderbuffer

2018-08-08 Thread Tomas Vondra




On 08/08/2018 09:19 PM, Peter Eisentraut wrote:

On 20/06/2018 21:42, Tomas Vondra wrote:

So I think we should fix and serialize/restore the OID array, just like
we do for tuples, snapshots etc. See the attached fix.


Yes please.



OK, will do.


Another thing we should probably reconsider is where the relids is
allocated - the pointer remains valid because we happen to allocate it
in TopMemoryContext. It's not that bad because we don't free the other
reorderbuffer contexts until the walsender exits anyway, but still.

So I propose to allocate it in rb->context just like the other bits of
data (snapshots, ...). Replacing the palloc() in DecodeTruncate() with
something like:

 MemoryContextAlloc(ctx->reorder->context,
xlrec->nrelids * sizeof(Oid));

should do the trick.


It's not clear from the code comments which context would be the
appropriate one.

More standard coding style would be to set the current memory context
somewhere, but I suppose the reorderbuffer.c code isn't written that way.



IMHO the cleanest way is to add a method like ReorderBufferGetChange, 
which does the allocation internally. That way the memory context choice 
is up to reorderbuffer, not decode.c. That's at least consistent with 
what the rest of decode.c does.
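
i.e., something along these lines (the function name is hypothetical,
mirroring ReorderBufferGetTupleBuf()):

    /* allocate the relid array in the reorderbuffer's own context */
    Oid *
    ReorderBufferGetRelids(ReorderBuffer *rb, int nrelids)
    {
        return (Oid *) MemoryContextAlloc(rb->context,
                                          nrelids * sizeof(Oid));
    }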


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: REINDEX and shared catalogs

2018-08-08 Thread Bossart, Nathan
On 8/8/18, 2:16 PM, "Michael Paquier"  wrote:
> By no-backpatch, I only implied patching v12, but that would not be a
> huge amount of efforts to get that into v11, so I can do that as well.
> Any objections to doing that?

+1 for back-patching to v11.

Nathan



unused/redundant foreign key code

2018-08-08 Thread Peter Eisentraut
I found some unused and some redundant code in ri_triggers.c that was
left around by some previous changes that aimed to optimize away certain
trigger invocations.  See attached patches.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 1fe8dac15a0013ef348dce1586db5af7bb02639a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 19 Jul 2018 08:20:59 +0200
Subject: [PATCH 1/2] Remove dead foreign key optimization code

The ri_KeysEqual() calls in the foreign-key trigger functions, meant to
optimize away some updates, are useless because, since commit
adfeef55cbcc5dc72a772777f88c1be05a70dfee, those triggers are not enqueued
at all.  (It's also not useful to keep these checks as some kind of
backstop, since it's also semantically correct to just run the full
check even with equal keys.)
---
 src/backend/utils/adt/ri_triggers.c | 51 -
 1 file changed, 51 deletions(-)

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index fc034ce601..b674011aa7 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -744,20 +744,6 @@ ri_restrict(TriggerData *trigdata, bool is_no_action)
break;
}
 
-   /*
-* In UPDATE, no need to do anything if old and new 
keys are equal
-*/
-   if (TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event))
-   {
-   HeapTuple   new_row = trigdata->tg_newtuple;
-
-   if (ri_KeysEqual(pk_rel, old_row, new_row, 
riinfo, true))
-   {
-   heap_close(fk_rel, RowShareLock);
-   return PointerGetDatum(NULL);
-   }
-   }
-
/*
 * If another PK row now exists providing the old key 
values, we
 * should not do anything.  However, this check should 
only be
@@ -1098,15 +1084,6 @@ RI_FKey_cascade_upd(PG_FUNCTION_ARGS)
break;
}
 
-   /*
-* No need to do anything if old and new keys are equal
-*/
-   if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, 
true))
-   {
-   heap_close(fk_rel, RowExclusiveLock);
-   return PointerGetDatum(NULL);
-   }
-
if (SPI_connect() != SPI_OK_CONNECT)
elog(ERROR, "SPI_connect failed");
 
@@ -1316,20 +1293,6 @@ ri_setnull(TriggerData *trigdata)
break;
}
 
-   /*
-* In UPDATE, no need to do anything if old and new 
keys are equal
-*/
-   if (TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event))
-   {
-   HeapTuple   new_row = trigdata->tg_newtuple;
-
-   if (ri_KeysEqual(pk_rel, old_row, new_row, 
riinfo, true))
-   {
-   heap_close(fk_rel, RowExclusiveLock);
-   return PointerGetDatum(NULL);
-   }
-   }
-
if (SPI_connect() != SPI_OK_CONNECT)
elog(ERROR, "SPI_connect failed");
 
@@ -1536,20 +1499,6 @@ ri_setdefault(TriggerData *trigdata)
break;
}
 
-   /*
-* In UPDATE, no need to do anything if old and new 
keys are equal
-*/
-   if (TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event))
-   {
-   HeapTuple   new_row = trigdata->tg_newtuple;
-
-   if (ri_KeysEqual(pk_rel, old_row, new_row, 
riinfo, true))
-   {
-   heap_close(fk_rel, RowExclusiveLock);
-   return PointerGetDatum(NULL);
-   }
-   }
-
if (SPI_connect() != SPI_OK_CONNECT)
elog(ERROR, "SPI_connect failed");
 
-- 
2.18.0

From d2bcff7d75bfbe07bcb27100fef8b3bedf279d2a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 19 Jul 2018 08:37:32 +0200
Subject: [PATCH 2/2] Apply RI trigger skipping tests also for DELETE

The tests added in 

Re: memory leak when serializing TRUNCATE in reorderbuffer

2018-08-08 Thread Peter Eisentraut
On 20/06/2018 21:42, Tomas Vondra wrote:
> So I think we should fix and serialize/restore the OID array, just like 
> we do for tuples, snapshots etc. See the attached fix.

Yes please.

> Another thing we should probably reconsider is where the relids is 
> allocated - the pointer remains valid because we happen to allocate it 
> in TopMemoryContext. It's not that bad because we don't free the other 
> reorderbuffer contexts until the walsender exits anyway, but still.
> 
> So I propose to allocate it in rb->context just like the other bits of 
> data (snapshots, ...). Replacing the palloc() in DecodeTruncate() with 
> something like:
> 
> MemoryContextAlloc(ctx->reorder->context,
>xlrec->nrelids * sizeof(Oid));
> 
> should do the trick.

It's not clear from the code comments which context would be the
appropriate one.

More standard coding style would be to set the current memory context
somewhere, but I suppose the reorderbuffer.c code isn't written that way.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fix hints on CREATE PROCEDURE errors

2018-08-08 Thread Tom Lane
Peter Eisentraut  writes:
> Yes, the hint should be changed.  But I also think the error message
> should be changed to be more appropriate to the procedure situation
> (where is the return type?).  Attached patch does both.  Unlike your
> patch, I kept the "DROP FUNCTION" message for the function case.  It
> might be too confusing otherwise.  Thoughts?

I'm not a translator, but if I were, stuff like "Use DROP %s %s first."
would probably confuse me.  IMO it's too close to assembling a message
out of parts, even if it's true that neither %s needs translation.
I think you'd be better off with

isprocedure ? errhint("Use DROP PROCEDURE %s first.", ...)
: errhint("Use DROP FUNCTION %s first.", ...)

Or if that seems too carpal-tunnel-inducing, maybe a workable compromise
is

dropcmd = (prokind == PROKIND_PROCEDURE ? "DROP PROCEDURE" : "DROP FUNCTION");

/* translator: first %s is DROP FUNCTION or DROP PROCEDURE */
errhint("Use %s %s first.", dropcmd, ...)

Looks reasonable other than that quibble.

regards, tom lane



Re: REINDEX and shared catalogs

2018-08-08 Thread Michael Paquier
On Wed, Aug 08, 2018 at 02:39:00PM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
>> I agree that it would be good to have it fixed in released versions, but
>> I also agree that such a change could cause trouble in production for
>> some.  Is the "no backpatch" idea that you will push this to both pg11
>> and master?  That would get my vote.
> 
> Same here.  I am not excited about putting such a change into stable
> branches, mainly because the existing behavior has been there for
> twenty years without any previous complaints.  So it's not *that* big
> a problem.  But it's not too late for v11, I think.

We are talking a lot about the definition of what a stable branch is,
aren't we? ;p

By no-backpatch, I only implied patching v12, but that would not be a
huge amount of effort to get that into v11, so I can do that as well.
Any objections to doing that?
--
Michael


signature.asc
Description: PGP signature


Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-08 Thread Michael Paquier
Hi Andres,

(It was not my intention to miss your message; I have only just noticed it.)

On Wed, Aug 08, 2018 at 01:41:27AM -0700, Andres Freund wrote:
> I can't parse this. "Even if this is an atomic operation, this can be
> safely done lock-less" - that seems like a contradictory sentence. Is
> there a "not" missing?

Yes, a "not" has gone missing here.  I reworked the comment block as
mentioned upthread.

> Also, this seems like insufficient reasoning. What guarantees the
> visibility of the flag? You're going to have to talk about externally
> implied memory ordering here.  Or add explicit barriers - the latter is
> probably preferable.

Well, we use BackendIdGetProc() in this case, where we could finish with
information out-of-date pretty quickly, and there is no special
reasoning for backendId and databaseId for autovacuum but...  Perhaps
you could explain more what you have in mind?  And it is not like this
relies on the number of elements in PGPROC.
--
Michael


signature.asc
Description: PGP signature


Re: Allow COPY's 'text' format to output a header

2018-08-08 Thread Simon Muller
On 6 August 2018 at 16:34, Stephen Frost  wrote:

> Greetings,
>
> * Cynthia Shang (cynthia.sh...@crunchydata.com) wrote:
> > I was able to apply the patch (after resolving a merge conflict which
> was expected given an update in master). All looks good.
>
> If there's a merge conflict against master, then it'd be good for an
> updated patch to be posted.
>
> Thanks!
>
> Stephen
>

Attached is an updated patch that should directly apply against current
master.

--
Simon Muller


text_header_v6.patch
Description: Binary data


Re: Fix hints on CREATE PROCEDURE errors

2018-08-08 Thread Peter Eisentraut
On 06/08/2018 20:32, Jeremy Evans wrote:
> The current code's hint is misleading for procedures:
> 
> CREATE OR REPLACE PROCEDURE a(in int)
> LANGUAGE SQL
> AS $$
> SELECT NULL;
> $$;
> # CREATE PROCEDURE
> 
> CREATE OR REPLACE PROCEDURE a(inout int)
> LANGUAGE SQL
> AS $$
> SELECT NULL;
> $$;
> # ERROR:  cannot change return type of existing function
> # HINT:  Use DROP FUNCTION a(integer) first.

Yes, the hint should be changed.  But I also think the error message
should be changed to be more appropriate to the procedure situation
(where is the return type?).  Attached patch does both.  Unlike your
patch, I kept the "DROP FUNCTION" message for the function case.  It
might be too confusing otherwise.  Thoughts?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 36f00184dcc7597f69466a6e9aa2db6eee1a6ef6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 8 Aug 2018 20:39:26 +0200
Subject: [PATCH] Improve error messages for CREATE OR REPLACE PROCEDURE

Change the hint to recommend DROP PROCEDURE instead of FUNCTION.  Also
make the error message when changing the return type more specific to
the case of procedures.

Reported-by: Jeremy Evans 
---
 src/backend/catalog/pg_proc.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 9b4015d0d4..654ee943be 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -375,6 +375,7 @@ ProcedureCreate(const char *procedureName,
Form_pg_proc oldproc = (Form_pg_proc) GETSTRUCT(oldtup);
Datum   proargnames;
boolisnull;
+   const char *prokind_keyword;
 
if (!replace)
ereport(ERROR,
@@ -400,16 +401,25 @@ ProcedureCreate(const char *procedureName,
  errdetail("\"%s\" is a window 
function.", procedureName) :
  0)));
 
+   prokind_keyword = (prokind == PROKIND_PROCEDURE ? "PROCEDURE" : 
"FUNCTION");
+
/*
 * Not okay to change the return type of the existing proc, 
since
 * existing rules, views, etc may depend on the return type.
+*
+* In case of a procedure, a changing return type means that 
whether
+* the procedure has output parameters was changed.  Since 
there is no
+* user visible return type, we produce a more specific error 
message.
 */
if (returnType != oldproc->prorettype ||
returnsSet != oldproc->proretset)
ereport(ERROR,

(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
-errmsg("cannot change return type of 
existing function"),
-errhint("Use DROP FUNCTION %s first.",
+prokind == PROKIND_PROCEDURE
+? errmsg("cannot change whether a 
procedure has output parameters")
+: errmsg("cannot change return type of 
existing function"),
+errhint("Use DROP %s %s first.",
+prokind_keyword,
 
format_procedure(HeapTupleGetOid(oldtup);
 
/*
@@ -434,7 +444,8 @@ ProcedureCreate(const char *procedureName,

(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
 errmsg("cannot change return 
type of existing function"),
 errdetail("Row type defined by 
OUT parameters is different."),
-errhint("Use DROP FUNCTION %s 
first.",
+errhint("Use DROP %s %s 
first.",
+
prokind_keyword,
 
format_procedure(HeapTupleGetOid(oldtup);
}
 
@@ -477,7 +488,8 @@ ProcedureCreate(const char *procedureName,

(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
 errmsg("cannot change 
name of input parameter \"%s\"",

old_arg_names[j]),
-errhint("Use DROP 
FUNCTION %s first.",
+errhint("Use DROP %s 
%s first.",
+  

Re: REINDEX and shared catalogs

2018-08-08 Thread Tom Lane
Alvaro Herrera  writes:
> On 2018-Aug-05, Michael Paquier wrote:
>> Attached is a set of patches I proposed on the original thread, which
>> skips shared catalogs if the user running REINDEX is not an owner of
>> it.  This is a behavior change, and as I have a hard time believing that
>> anybody can take advantage of the current situation, I would like also
>> to see this stuff back-patched, as anybody doing shared hosting of
>> Postgres is most likely fixing the hole one way or another.  However, I
>> am sure as well that many folks here would not like that.

> I agree that it would be good to have it fixed in released versions, but
> I also agree that such a change could cause trouble in production for
> some.  Is the "no backpatch" idea that you will push this to both pg11
> and master?  That would get my vote.

Same here.  I am not excited about putting such a change into stable
branches, mainly because the existing behavior has been there for
twenty years without any previous complaints.  So it's not *that* big
a problem.  But it's not too late for v11, I think.

regards, tom lane



Re: REINDEX and shared catalogs

2018-08-08 Thread Alvaro Herrera
On 2018-Aug-05, Michael Paquier wrote:

> Attached is a set of patches I proposed on the original thread, which
> skips shared catalogs if the user running REINDEX is not an owner of
> it.  This is a behavior change, and as I have a hard time believing that
> anybody can take advantage of the current situation, I would like also
> to see this stuff back-patched, as anybody doing shared hosting of
> Postgres is most likely fixing the hole one way or another.  However, I
> am sure as well that many folks here would not like that.

I agree that it would be good to have it fixed in released versions, but
I also agree that such a change could cause trouble in production for
some.  Is the "no backpatch" idea that you will push this to both pg11
and master?  That would get my vote.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-08-08 Thread Peter Geoghegan
On Wed, Aug 8, 2018 at 10:23 AM, Kyle Samson  wrote:
> We found the exact same issue on a production database at TripAdvisor. We 
> have no new information to add for debugging. Bumping this to raise the
> priority of getting a patch release out.

There is a batch of point releases that were just stamped + tagged.
They should be released shortly.

-- 
Peter Geoghegan



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-08-08 Thread Kyle Samson
Hello,

We found the exact same issue on a production database at TripAdvisor. We
have no new information to add for debugging. Bumping this to raise the
priority of getting a patch release out.

Thanks,
-Kyle Samson


POC for a function trust mechanism

2018-08-08 Thread Tom Lane
This is sort of a counter-proposal to Noah's discussion of search path
security checking in <20180805080441.gh1688...@rfd.leadboat.com>.
(There's no technical reason we couldn't do both things, but I think
this'd be more useful to most people.)

Some back story here is that the PG security team has been aware of the
issues in CVE-2018-1058 for an embarrassing number of years, and we'd
been vainly working to find a fix that was both non-invasive to users
and practical to back-patch.  Eventually our hands were forced by an
outside security researcher who discovered some of those problems, and
naturally wanted to publish on a fairly short time scale.  So we ended
up with the decidedly not non-invasive approach of locking down
search_path in especially critical places, and otherwise telling people
that they had to worry about this themselves.  Of the various ideas that
we'd kicked around and not been able to finish, the one that seemed most
promising to me was to invent a "function trust" mechanism.

The core idea here is to prevent security problems not by changing an
application's rules for operator/function name resolution, but by
detecting an attempted compromise and preventing the trojan-horse code
from being executed.  Essentially, a user or application is to declare
a list of roles that it trusts functions owned by, and the system will
then refuse to execute functions owned by other not-trusted roles.
So, if $badguy creates a trojan-horse operator and manages to capture
a call from your SQL code, he'll nonetheless not be able to execute
code as you.

To reduce the overhead of the mechanism and chance of unintentionally
breaking things, superuser-owned functions (particularly, all built-in
functions) are always trusted by everybody.  A superuser who wants to
become you can do so trivially, with no need for a trojan horse, so
this restriction isn't creating any new security hole.
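
To make the core idea concrete, the call-time check might look like this
(a sketch; role_is_trusted() stands in for a lookup against whatever trust
list is ultimately chosen):

    static bool role_is_trusted(Oid roleid);    /* hypothetical */

    static void
    check_function_trust(Oid funcid)
    {
        HeapTuple   tp;
        Oid         owner;

        tp = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
        if (!HeapTupleIsValid(tp))
            elog(ERROR, "cache lookup failed for function %u", funcid);
        owner = ((Form_pg_proc) GETSTRUCT(tp))->proowner;
        ReleaseSysCache(tp);

        /* superuser-owned functions (incl. all built-ins) are always trusted */
        if (superuser_arg(owner))
            return;

        if (!role_is_trusted(owner))
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                     errmsg("function with OID %u is owned by an untrusted role",
                            funcid)));
    }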

The things that we hadn't resolved, which is why this didn't get further
than POC stage, were

(1) What's the mechanism for declaring trust?  In this POC, it's just
a GUC that you can set to a list of role names, with $user for yourself
and "public" if you want to trust everybody.  It's not clear if that's
good enough, or if we want something a bit more locked-down.

(2) Is trust transitive?  Where and how would the list of trusted roles
change?  Arguably, if you call a SECURITY DEFINER function, then once
you've decided that you trust the function owner, actual execution of the
function should use the function owner's list of trusted roles not yours.
With the GUC approach, it'd be necessary for SECURITY DEFINER functions
to implement this with a "SET trusted_roles" clause, much as they now
have to do with search_path.  That's possible but it's again not very
non-invasive, so we'd been speculating about automating this more.
If we had, say, a catalog that provided the desired list of trusted roles
for every role, then we could imagine implementing that context change
automatically.  Likewise, stuff like autovacuum or REINDEX would want
to run with the table owner's list of trusted roles, but the GUC approach
doesn't really provide enough infrastructure to know what to do there.

So we'd kind of decided that the GUC solution wasn't good enough, but
it didn't seem like catalog additions would be feasible as a back-patched
security fix, which is why this didn't go anywhere.  But it could work
as a new feature.

Anyway, I had written a small POC that did use a GUC for this, and just
checked function calls without any attempts to switch the active
trusted_roles setting in places like autovacuum.  I've rebased it up to
HEAD and here it is.

regards, tom lane

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 9a754da..ea24bc3 100644
*** a/src/backend/commands/variable.c
--- b/src/backend/commands/variable.c
*** show_role(void)
*** 950,952 
--- 950,1075 
  	/* Otherwise we can just use the GUC string */
  	return role_string ? role_string : "none";
  }
+ 
+ 
+ /*
+  * TRUSTED_ROLES
+  */
+ 
+ /* XXX this needs to be replaced with some more complex cache structure */
+ static const char *trusted_roles = "";
+ 
+ /*
+  * check_trusted_roles: GUC check_hook for trusted_roles
+  */
+ bool
+ check_trusted_roles(char **newval, void **extra, GucSource source)
+ {
+ 	char	   *rawname;
+ 	List	   *namelist;
+ 
+ 	/* Need a modifiable copy of string */
+ 	rawname = pstrdup(*newval);
+ 
+ 	/* Parse string into list of identifiers */
+ 	if (!SplitIdentifierString(rawname, ',', &namelist))
+ 	{
+ 		/* syntax error in name list */
+ 		GUC_check_errdetail("List syntax is invalid.");
+ 		pfree(rawname);
+ 		list_free(namelist);
+ 		return false;
+ 	}
+ 
+ 	/*
+ 	 * No additional checks are possible now, because we might not be inside a
+ 	 * transaction; so we're done.
+ 	 */
+ 
+ 	pfree(rawname);
+ 	list_free(namelist);
+ 
+ 	return true;
+ }
+ 
+ /*
+  * assign_trusted_roles: GUC 

Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2018-08-08 Thread Nico Williams
On Wed, Aug 08, 2018 at 07:19:34PM +0300, Heikki Linnakangas wrote:
> On 20/07/18 18:03, Andres Freund wrote:
> >It's much less the exit() that's unsafe, than the callbacks themselves,
> >right?  Perhaps just restate that we wouldn't want to trigger atexit
> >processing due to signal safety?
> 
> Well, in principle exit() is unsafe too, although you're right that in
> practice it's more likely the callbacks that would cause trouble. I reworded
> the comment to put more emphasis on the callbacks.

It's unsafe because it will:

 - flush stdio buffers
 - call atexit handlers (which might, e.g., free())
 - who knows what else

It's easy enough to decide to exit() or _exit().  If you can't tell
which to call from lexical context, then use a global volatile
sig_atomic_t variable to indicate that we're exiting from a signal
handler and check that variable in the quickdie() functions.
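
i.e., something like this (a hypothetical sketch):

    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>

    static volatile sig_atomic_t exiting_from_signal = 0;

    /* shared exit path, callable from signal and non-signal contexts */
    static void
    do_exit(int code)
    {
        if (exiting_from_signal)
            _exit(code);    /* no stdio flushing, no atexit callbacks */
        else
            exit(code);
    }

    static void
    quickdie_handler(int signo)
    {
        exiting_from_signal = 1;
        do_exit(2);
    }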

> So, with this commit, the various SIGQUIT quickdie handlers are in a better
> shape. The regular backend's quickdie() handler still calls ereport(), which
> is not safe, but it's a step in the right direction. And we haven't
> addressed the original complaint, which was actually about startup_die(),
> not quickdie().

Yes.  Would that snprintf() and vsnprintf() were async-signal-safe --
they can be, and some implementations probably are, but they aren't
required to be, then making ereport() safe would be easier.

Nico
-- 



Re: REINDEX and shared catalogs

2018-08-08 Thread Michael Paquier
On Wed, Aug 08, 2018 at 12:25:03PM +0530, Robert Haas wrote:
> In my opinion, the behavior change is probably OK, but not
> back-patchable.

Thanks.  I see three votes in favor of not back-patching (you,
Horiguchi-san and Nathan), so that won't happen.

> I think that the documentation could be phrased more clearly.  If I
> understand the proposed semantics, something like this might be about
> right:
> 
>    Reindexing a single index or table requires being the owner of that
>    index or table.  Reindexing a schema or database requires being the
>    owner of that schema or database.  Note that it is therefore sometimes
>    possible for non-superusers to rebuild indexes of tables owned by other
>    users; however, as a special exception, when REINDEX DATABASE or
>    REINDEX SCHEMA is issued by a non-superuser, indexes on shared catalogs
>    will be skipped unless the user owns the catalog (which typically won't
>    be the case).  Of course, superusers can always reindex anything.

I quite like what you are proposing here.  I'll reuse that, I hope you
don't mind ;)
--
Michael


signature.asc
Description: PGP signature


Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-08 Thread Michael Paquier
On Wed, Aug 08, 2018 at 10:50:34AM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
>> I'm unsure about pg11 -- is it a backbranch already or not?  Since we've
>> released beta3 already, ISTM we should consider it so.
> 
> No.  See
> https://www.postgresql.org/message-id/12725.1533737052%40sss.pgh.pa.us
> 
> At this point I'd judge it's still more valuable to minimize differences
> between v11 and v12 than to avoid ABI breakages in v11.  It's unlikely
> this would be the last ABI break before 11.0 anyway.

Thanks!  Based on that I will get this patch into HEAD and 11 pretty
soon.  v10 is definitely out of scope, even if it would have been
interesting to make autovacuum more aggressive there as well, as that's
where the last improvements have been done.
--
Michael


signature.asc
Description: PGP signature


Re: Have an encrypted pgpass file

2018-08-08 Thread Nico Williams
On Tue, Jul 24, 2018 at 12:25:31PM +0200, Marco van Eck wrote:
> Indeed, having unencrypted passwords lying around (.pgpass or PGPASSWORD or
> -W) is making my auditors unhappy, and forcing me to enter the password
> over and over again. With a simple test it seems the password entered by
> the user also stays in memory, since it is able to reset a broken
> connection. Finding the password in memory is not trivial, but prevention
> is always preferred.

Sometimes the auditors are just wrong.  Say you're using Kerberos, so
you put service keys in... "keytab files" -- that's just like putting a
password in a file because they are equivalent.  Or server certificates
and their private keys... -- where are you going to put them if not in
some file?

Sure, you could put keys in a TPM (except they are orders of magnitude
too slow) or some other hardware token that doesn't suck (except those
are expensive).  But now you still need unattended access to the token,
and who cares what the keys _are_ when you can just _use_ them to
escalate privilege anyways??

Forcing attended operation of otherwise automatic systems is not a good
idea, and it is very expensive too.

A quick search turns up tools for finding cryptographic keys in memory.
Passwords can't be much harder.

> It might be an idea to wipe the password after the login, and decrypt/read
> it again if it needs to reconnect. Would this make the solution more
> secure? I had a quick look at the code and the patch would stay compact.
> Please let me know if doing this would make sense.

But you still wanted to automate things, no?

You can't protect from local hosts compromises.  You have to protect the
host.  Please, no security theater.  Do you think the day after a
massive data breach at your company you can tell the press "gee, I dunno
how they got the password, it was only loaded in memory!" and save it
from liability?  (If there's no liability either way, that's a different
problem.)

Nico
-- 



Re: Scariest patch tournament, PostgreSQL 11 edition

2018-08-08 Thread Alvaro Herrera
On 2018-Jun-25, Jeff Janes wrote:

> Is there a summary of the results of the previous rounds?  I didn't see any
> announcements of them.  I've been trying to find some crash recovery
> torture testing to do for v11 release, but can't find features to focus on
> which might be scariest from a WAL perspective.

I processed the answers today after many delays.  Here's [one way of
looking at] the results:

    commit    │                              title                              │    committer     │ points
 ─────────────┼─────────────────────────────────────────────────────────────────┼──────────────────┼────────
 499be013de6  │ Support partition pruning at execution time                    │ Alvaro Herrera   │      6
 8b08f7d4820  │ Local partitioned indexes                                      │ Alvaro Herrera   │      6
 cc415a56d09  │ Basic planner and executor integration for JIT.                │ Andres Freund    │      5
 8561e4840c8  │ Transaction control in PL procedures                           │ Peter Eisentraut │      5
 86f575948c7  │ Allow FOR EACH ROW triggers on partitioned tables              │ Alvaro Herrera   │      4
 e2f1eb0ee30  │ Implement partition-wise grouping/aggregation.                 │ Robert Haas      │      4
 3de241dba86  │ Foreign keys on partitioned tables                             │ Alvaro Herrera   │      4
 2f178441044  │ Allow UPDATE to move rows between partitions.                  │ Robert Haas      │      4
 f49842d1ee3  │ Basic partition-wise join functionality.                       │ Robert Haas      │      4
 16828d5c027  │ Fast ALTER TABLE ADD COLUMN with a non-NULL default            │ Andrew Dunstan   │      4
 9fdb675fc5d  │ Faster partition pruning                                       │ Alvaro Herrera   │      4
 1aba8e651ac  │ Add hash partitioning.                                         │ Robert Haas      │      4
 6f6b99d1335  │ Allow a partitioned table to have a default partition.         │ Robert Haas      │      4
 eb7ed3f3063  │ Allow UNIQUE indexes on partitioned tables                     │ Alvaro Herrera   │      4
 b96d550eb03  │ Support for optimizing and emitting code in LLVM JIT provider. │ Andres Freund    │      3
 8224de4f42c  │ Indexes with INCLUDE columns and their support in B-tree       │ Teodor Sigaev    │      3
 4b0d28de06b  │ Remove secondary checkpoint                                    │ Simon Riggs      │      3
 9da0cc35284  │ Support parallel btree index builds.                           │ Robert Haas      │      2
 1804284042e  │ Add parallel-aware hash joins.                                 │ Andres Freund    │      1

It's fair to say that anything in the list is scary regardless of score,
because:

1) Some respondents were vague and provided answers such as "Partition
improvements", which I attributed to several commits.  In three cases
(all involving JIT) I added fewer commits than I probably should have,
but then there are so many of those commits that I'm not sure which are
relevant.

2) The mechanism to award points to each commit is quite unfair, but it
didn't seem important enough.  We have too few answers anyhow :-(

Thanks to all respondents!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> I can see cases where it *might* be worth it, but several backup tools
> support prefetch and/or parallelism which should be able to keep
> Postgres fed with WAL unless there is very high latency to the repo.
> I'm not sure the small performance boost (if any) would be enough to
> justify the extra code paths and tests.

...

> That said, if there is anything we can do in core to make these tools
> simpler or more performant I'm all for it.  For instance, it might be
> good to have a return code (as suggested upstream) that indicates to
> Postgres that the segment in pg_wal is a good choice.   Some use cases
> might benefit enormously even if it doesn't make sense as a general case.

Well, if there are some use cases that would benefit greatly from it
then it may just be worth it.  That said, it doesn't sound like a common
enough ask to be something we'd prioritize for pgbackrest, at least not
without actual client demand/interest in it.  If someone else wants to
write the tests and the code and contribute that capability, we'd
certainly welcome it.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread David Steele
On 8/8/18 11:45 AM, Stephen Frost wrote:
> 
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 08/08/2018 04:08 PM, David Steele wrote:
>>> On 8/7/18 12:05 PM, Stephen Frost wrote:
> All I'm saying is that (assuming my understanding of RestoreArchivedFile 
> is
> correct) we can't just do that in the current restore_command. We do need 
> a
> way to ask the archive for some metadata/checksums, and restore_command is
> too late.

 Yeah, I see what you mean there.  An alternative might be to *not*
 unlink the file, in case the restore command wants to check if it's
 valid and matches what's in the archive, and instead restore the
 requested file somewhere else and then perform an unlink/mv after
 the restore command has been run.
>>> I don't see any cases (in master, at least) where RestoredArchivedFile()
>>> would unlink an existing WAL segment.  If you look at the call sites
>>> they all pass "RECOVERYHISTORY" or "RECOVERYXLOG"  to the recovername
>>> parameter.
>>>
>>> RestoreArchivedFile() does overwrite the existing file after we decide
>>> that we prefer the restored version over the one in pg_wal, but that
>>> happens after restore_command returns.
>>>
>>> So as far as I can see, it could be possible for the restore_command to
>>> do something clever here.
>>
>> Oh, I see - I really was reading RestoredArchivedFile() wrong ...
>>
>> Yes, it seems possible to do this in restore_command. Currently it'd require
>> constructing the local segment path (replacing RECOVERYXLOG with the actual
>> segment filename), but that's solvable by adding a new pattern similar to
>> %p.
> 
> Nice, but I don't know that there's any need to add another %-flag,
> really.  David, do you think we'd really need that?

%p is already available for restore_command, so the path from %p and the
segment name from %f would do the trick.
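As a sketch of what such a command could look like (illustration only;
"archive_tool" is a stand-in for whatever the backup software provides,
and as noted below a real tool would do this in C, Perl, or Python):

#!/bin/sh
# restore_command = 'restore_wal.sh %f %p'
WALNAME="$1"                # e.g. 000000010000000000000042
TARGET="$2"                 # %p, e.g. pg_wal/RECOVERYXLOG
LOCAL="pg_wal/$WALNAME"     # local path rebuilt from %f

ARCHIVE_SUM=$(archive_tool checksum "$WALNAME") || exit 1

if [ -f "$LOCAL" ] &&
   [ "$(sha256sum < "$LOCAL" | cut -d' ' -f1)" = "$ARCHIVE_SUM" ]; then
    cp "$LOCAL" "$TARGET"   # local segment verified against the archive
else
    archive_tool get "$WALNAME" "$TARGET"   # fall back to the repo
fi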

>> I'm not sure if packing all this into a single restore_command is a good
>> idea, because I'm pretty sure it'll lead to abhorrently monstrous bash
>> commands jammed into the restore_command. But then again, we kinda already
>> have that, and all archival systems already provide simple and correct
>> commands doing the right thing. And extending that would not be a big deal.
> 
> Running a bash snippet as restore_command is a bad idea already, imv.

Agreed.  Archiving is more complicated than backup in many ways and a
good implementation is far from trivial.

> The real PG backup tools already have much more complicated restore
> commands written in C, Perl, or Python, and could definitely implement
> this.  There's still a bit of a question as to if it'd really be worth
> it, but perhaps it would be in certain cases.

I can see cases where it *might* be worth it, but several backup tools
support prefetch and/or parallelism which should be able to keep
Postgres fed with WAL unless there is very high latency to the repo.
I'm not sure the small performance boost (if any) would be enough to
justify the extra code paths and tests.

That said, if there is anything we can do in core to make these tools
simpler or more performant I'm all for it.  For instance, it might be
good to have a return code (as suggested upstream) that indicates to
Postgres that the segment in pg_wal is a good choice.   Some use cases
might benefit enormously even if it doesn't make sense as a general case.

Regards,
-- 
-David
da...@pgmasters.net



Re: Have an encrypted pgpass file

2018-08-08 Thread Bruce Momjian
On Wed, Aug  1, 2018 at 05:33:39PM +0200, Marco van Eck wrote:
> After explaining the patch to a colleague, we identified potential execution
> as another user when it is defined as a command parameter. To protect
> against it, I've removed the possibility to pass the 'passcommand'. As a
> result, libpq only allows the PGPASSCOMMAND environment variable, which can
> only be defined by the executing user, and will be executed by the same
> user. It only reduces the need for unencrypted passwords in a file.
> 
> I think this solution is secure enough; shall we resolve this feature request?

[Toshi and Nico added as CC's]

I think we need to step back and understand where we are going with our
various encryption options.  We have this feature under consideration,
and we have a proposal for data-at-rest encryption, which encrypts
writes to the file system:


https://www.postgresql.org/message-id/CA%2BCSw_tb3bk5i7if6inZFc3yyf%2B9HEVNTy51QFBoeUk7UE_V%3Dw%40mail.gmail.com

So, until now, we have only supported encryption at the "optimal" level,
e.g. encrypted file system which encrypts the storage.  We have relied
on file permissions to protect access to the mounted file system, and
have a similar approach to securing .pgpass.  So, the question is
whether we continue to offer only encryption at the "optimal" level, or
whether we allow encryption at other levels.

There are two reasons to support non-optimal encryption levels.  First,
there is the issue that different threat models require different
encryption levels for protection.  For example, someone might have the
ability to _read_ a directory, but not the ability to modify it and
therefore attack it by grabbing passwords as they are typed.  An
encrypted .pgpass would be secure from that attack, but a .pgpass stored
on an encrypted file system would not.

Second, there should always be some kind of unlocking action, either
with a password, a hardware encryption device, or connection to a key
server.  Sometimes this unlocking action only makes sense at a certain
level, e.g. a personal Yubikey works for .pgpass but might be odd for
file system encryption.

I sympathize with concerns that the encryption key might be stored
unencrypted, or stored with the key that decrypts the encryption key. 
However, I don't think we can assume that there are no valid uses for
encryption at non-optimal levels, and that they would only be used in
insecure ways.

As a practical example, this presentation:

http://momjian.us/main/writings/crypto_hw_use.pdf

shows how you might use a Yubikey to encrypt .pgpass.  The private key
can't be copied out of the Yubikey, but the private key wouldn't be used
to encrypt .pgpass --- instead a key would be encrypted using the
Yubikey's private key.  An attacker wouldn't need to read the Yubikey
private key, only the decrypted .pgpass password.  That
could be done by modifying the gpg binary, the psql binary, or the
gpg-agent script, but those all require modification of something, with
proper permissions and possible detection.  I don't know how valuable
that would be to the average user, but I am sure it is useful for some
users.

I think with proper documentation and warnings, we could make these
non-optimal encryption levels useful for the users who have threat
models where they are useful, and hopefully minimize their misuse.
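For concreteness, a gpg-based flow along those lines might look like
this (commands illustrative; PGPASSCOMMAND is the environment variable
proposed in Marco's patch, not an existing libpq feature):

# one-time setup: encrypt the password file to a key held on the token
gpg --encrypt --recipient "my-yubikey-key" ~/.pgpass
shred -u ~/.pgpass

# with the patch, libpq runs this at connect time to obtain passwords
export PGPASSCOMMAND='gpg --quiet --decrypt ~/.pgpass.gpg'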

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-08-08 Thread Tom Lane
David Rowley  writes:
> On 8 August 2018 at 17:28, Amit Langote  wrote:
>> Attached is a patch which modifies the if test to compare relids instead
>> of RelOptInfo pointers.

> Thanks for investigating and writing a patch. I agree with the fix.

I changed this to compare the relid sets not just rel->relid, since
rel->relid is only reliable for baserels.  The partitioned rel could
safely be assumed to be a baserel, but I'm less comfortable with
supposing that the parentrel always will be.  Otherwise, added a
test case based on Rushabh's example and pushed.  (I'm not quite
sure if the plan will be stable enough to satisfy the buildfarm,
but we'll soon find out ...)

regards, tom lane



Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 08/08/2018 04:08 PM, David Steele wrote:
> >On 8/7/18 12:05 PM, Stephen Frost wrote:
> >>>All I'm saying is that (assuming my understanding of RestoreArchivedFile is
> >>>correct) we can't just do that in the current restore_command. We do need a
> >>>way to ask the archive for some metadata/checksums, and restore_command is
> >>>too late.
> >>
> >>Yeah, I see what you mean there.  An alternative might be to *not*
> >>unlink the file, in case the restore command wants to check if it's
> >>valid and matches what's in the archive, and instead restore the
> >>requested file somewhere else and then perform an unlink/mv after
> >>the restore command has been run.
> >I don't see any cases (in master, at least) where RestoredArchivedFile()
> >would unlink an existing WAL segment.  If you look at the call sites
> >they all pass "RECOVERYHISTORY" or "RECOVERYXLOG"  to the recovername
> >parameter.
> >
> >RestoreArchivedFile() does overwrite the existing file after we decide
> >that we prefer the restored version over the one in pg_wal, but that
> >happens after restore_command returns.
> >
> >So as far as I can see, it could be possible for the restore_command to
> >do something clever here.
> 
> Oh, I see - I really was reading RestoredArchivedFile() wrong ...
> 
> Yes, it seems possible to do this in restore_command. Currently it'd require
> constructing the local segment path (replacing RECOVERYXLOG with the actual
> segment filename), but that's solvable by adding a new pattern similar to
> %p.

Nice, but I don't know that there's any need to add another %-flag,
really.  David, do you think we'd really need that?

> I'm not sure if packing all this into a single restore_command is a good
> idea, because I'm pretty sure it'll lead to abhorrently monstrous bash
> commands jammed into the restore_command. But then again, we kinda already
> have that, and all archival systems already provide simple and correct
> commands doing the right thing. And extending that would not be a big deal.

Running a bash snippet as restore_command is a bad idea already, imv.
The real PG backup tools already have much more complicated restore
commands written in C, Perl, or Python, and could definitely implement
this.  There's still a bit of a question as to if it'd really be worth
it, but perhaps it would be in certain cases.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread Tomas Vondra




On 08/08/2018 04:08 PM, David Steele wrote:

On 8/7/18 12:05 PM, Stephen Frost wrote:


All I'm saying is that (assuming my understanding of RestoreArchivedFile is
correct) we can't just do that in the current restore_command. We do need a
way to ask the archive for some metadata/checksums, and restore_command is
too late.


Yeah, I see what you mean there.  An alternative might be to *not*
unlink the file, in case the restore command wants to check if it's
valid and matches what's in the archive, and instead restore the
requested file somewhere else and then perform an unlink/mv after
the restore command has been run.

I don't see any cases (in master, at least) where RestoredArchivedFile()
would unlink an existing WAL segment.  If you look at the call sites
they all pass "RECOVERYHISTORY" or "RECOVERYXLOG"  to the recovername
parameter.

RestoreArchivedFile() does overwrite the existing file after we decide
that we prefer the restored version over the one in pg_wal, but that
happens after restore_command returns.

So as far as I can see, it could be possible for the restore_command to
do something clever here.




Oh, I see - I really was reading RestoredArchivedFile() wrong ...

Yes, it seems possible to do this in restore_command. Currently it'd 
require constructing the local segment path (replacing RECOVERYXLOG with 
the actual segment filename), but that's solvable by adding a new 
pattern similar to %p.


I'm not sure if packing all this into a single restore_command is a good 
idea, because I'm pretty sure it'll lead to abhorrently monstrous bash 
commands jammed into the restore_command. But then again, we kinda 
already have that, and all archival systems already provide simple and 
correct commands doing the right thing. And extending that would not be 
a big deal.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve behavior of concurrent TRUNCATE

2018-08-08 Thread Bossart, Nathan
Hi,

On 8/6/18, 11:59 AM, "Michael Paquier"  wrote:
> Attached is a patch I have been working on which refactors the code of
> TRUNCATE in such a way that we check for privileges before trying to
> acquire a lock, without any user-facing impact (I have reworked a couple
> of comments compared to the last version).  This includes a set of tests
> showing the new behavior.

Here are some comments for the latest version of the patch.

-truncate_check_rel(Relation rel)
+truncate_check_rel(Oid relid, Form_pg_class reltuple)

Could we use HeapTupleGetOid(reltuple) instead of passing the OID
separately?

-   if (rel->rd_rel->relkind != RELKIND_RELATION &&
-   rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+   if (reltuple->relkind != RELKIND_RELATION &&
+
+   reltuple->relkind != RELKIND_PARTITIONED_TABLE)

There appears to be an extra empty line here.

+# Test for lock lookup conflicts with TRUNCATE when working on unowned
+# relations, particularly catalogs should trigger an ERROR for all the
+# scenarios present here.

If possible, it might be worth it to check that TRUNCATE still blocks
when a role has privileges, too.
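For reference, the ordering change under test looks roughly like this
from two sessions (a sketch, not the actual isolation spec):

-- session 1: hold a lock on a catalog
BEGIN;
SELECT count(*) FROM pg_authid;   -- takes AccessShareLock

-- session 2: non-superuser, with the patch applied
TRUNCATE pg_authid;
-- now fails with "permission denied" immediately, instead of first
-- queueing for AccessExclusiveLock behind session 1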

> Like cbe24a6, perhaps we would not want to back-patch it?  Based on the
> past history (and the consensus being reached for the REINDEX case would
> be to patch only HEAD), I would be actually incline to not back-patch
> this stuff and qualify that as an improvement.  That's also less work
> for me at commit :)

Since the only behavior this patch changes is the order of locking and
permission checks, my vote would be to back-patch.  However, since the
REINDEX piece won't be back-patched, I can see why we might not here,
either.

Nathan



Re: Facility for detecting insecure object naming

2018-08-08 Thread Tom Lane
Mark Dilger  writes:
> On Aug 8, 2018, at 7:47 AM, Tom Lane  wrote:
>> The advantage of a function trust mechanism is that it'd provide
>> security against calling functions you didn't intend to without
>> any visible changes in normal application behavior.  The security
>> team gave up on that approach because it seemed too complicated to
>> pursue as a secretly-developed security patch, but I still think
>> it's the right long-term answer.

> Do you have a WIP patch partially developed for this?  If it is no
> longer secret, perhaps the rest of us could take a look?

Yeah, I do have a POC prototype, let me blow the dust off it ...

regards, tom lane



Re: Typo in doc or wrong EXCLUDE implementation

2018-08-08 Thread Alvaro Herrera
On 2018-Aug-08, KES wrote:

> I do not know many internals and may be wrong.
> 
> But from my point of view, with my current knowledge:
> if such an exclusion constraint were marked as UNIQUE, we could use it for
> FKs while implementing temporal/bi-temporal tables.
> 
> And this would simplify the relationships while implementing them.

I think what you're looking for is "inclusion constraints" from Jeff
Davis:

https://postgr.es/m/1423354088.12308.117.camel@jeff-desktop

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Typo in doc or wrong EXCLUDE implementation

2018-08-08 Thread KES
I do not know many internals and may be wrong.

But from my point of view, with my current knowledge:
if such an exclusion constraint were marked as UNIQUE, we could use it for
FKs while implementing temporal/bi-temporal tables.

And this would simplify the relationships while implementing them.

07.08.2018, 20:37, "Bruce Momjian" :
> This email was sent to docs, but I think it is a hackers issue. The
> person is asking why exclusion constraints aren't marked as UNIQUE
> indexes that can be used for referential integrity. I think the reason
> is that non-equality exclusion constraints, like preventing overlap,
> don't uniquely identify a specific value, and I don't think we want to
> auto-UNIQUE just for equality exclusion constraints.
>
> ---
>
> On Tue, Jul 10, 2018 at 09:34:36AM +, PG Doc comments form wrote:
>>  The following documentation comment has been logged on the website:
>>
>>  Page: https://www.postgresql.org/docs/10/static/sql-createtable.html
>>  Description:
>>
>>  Hi.
>>
>>  
>> https://www.postgresql.org/docs/current/static/sql-createtable.html#sql-createtable-exclude
>>  If all of the specified operators test for equality, this is equivalent to a
>>  UNIQUE constraint
>>
>>  Exclusion constraints are implemented using an index
>>
>>  ALTER TABLE person
>>    add constraint person_udx_person_id2
>>    EXCLUDE USING gist (
>>  person_id WITH =
>>    )
>>  ;
>>
>>  tucha=> ALTER TABLE "person_x_person" ADD CONSTRAINT
>>  "person_x_person_fk_parent_person_id"
>>  tucha-> FOREIGN KEY ("parent_person_id")
>>  tucha-> REFERENCES "person" ("person_id")
>>  tucha-> ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE;
>>  ERROR: there is no unique constraint matching given keys for referenced
>>  table "person"
>>
>>  because gist does not support unique indexes, I try with 'btree'
>>
>>  ALTER TABLE person
>>    add constraint person_udx_person_id2
>>    EXCLUDE USING btree (
>>  person_id WITH =
>>    )
>>  ;
>>
>>  \d person
>>  ...
>>  "person_udx_person_id2" EXCLUDE USING btree (person_id WITH =)
>>
>>  tucha=> ALTER TABLE "person_x_person" ADD CONSTRAINT
>>  "person_x_person_fk_parent_person_id"
>>  tucha-> FOREIGN KEY ("parent_person_id")
>>  tucha-> REFERENCES "person" ("person_id")
>>  tucha-> ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE;
>>  ERROR: there is no unique constraint matching given keys for referenced
>>  table "person"
>>
>>  Why does postgres not add the unique flag, despite "this is equivalent to a
>>  UNIQUE constraint"?
>>  I thought it should be:
>>  "person_udx_person_id2" UNIQUE EXCLUDE USING btree (person_id WITH =)
>>
>>  PS.
>>  > For example, you can specify a constraint that no two rows in the table
>>  contain overlapping circles (see Section 8.8) by using the && operator.
>>
>>  Also I expect that this:
>>  ALTER TABLE person
>>    add constraint person_udx_person_id
>>    EXCLUDE USING gist (
>>  person_id WITH =,
>>  tstzrange(valid_from, valid_till, '[)' ) WITH &&
>>    )
>>
>>  should also set the UNIQUE flag for the exclusion constraint, so we can use it in an FK
>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I. As I am, so you will be. +
> + Ancient Roman grave inscription +



Re: Facility for detecting insecure object naming

2018-08-08 Thread Mark Dilger


> On Aug 8, 2018, at 7:47 AM, Tom Lane  wrote:
> 
> Isaac Morland  writes:
>> On 8 August 2018 at 09:58, Tom Lane  wrote:
>>> When the security team was discussing this issue before, we speculated
>>> about ideas like inventing a function trust mechanism, so that attacks
>>> based on search path manipulations would fail even if they managed to
>>> capture an operator reference.  I'd rather go down that path than
>>> encourage people to do more schema qualification.
> 
>> I must be missing something. Aren't search_path manipulation problems
>> avoided by using "SET search_path FROM CURRENT"?
> 
> No, not if you have any publicly-writable schemas anywhere in your
> search path.  The reason this business is so nasty is that our
> historical default ("$user, public", with pg_catalog implicitly
> searched before that) is actually insecure, even for references
> intended to go to pg_catalog.  There are too many cases where the
> ambiguous-operator resolution rules will allow a reference to be
> captured by a similarly-named operator later in the search path.
> 
> If you're using a secure search_path to start with, it's possible
> that decorating your functions with SET search_path FROM CURRENT
> would be helpful, but it's not like that's some kind of magic bullet.
> 
>> While I'm asking, does anybody know why this isn't the default, especially
>> for SECURITY DEFINER functions?
> 
> It might fix some subset of security issues, but I think that changing
> the default behavior like that would break a bunch of other use-cases.
> It'd be especially surprising for such a thing to apply only to
> SECURITY DEFINER functions.
> 
> The bigger picture here is that it's really hard to fix this class
> of problems by trying to dictate changes in the way people have their
> search paths set up.  You're much more likely to break working
> applications than help anybody.  I'm still of the opinion that we're
> going to be forced to provide loopholes in what we did to pg_dump
> et al in CVE-2018-1058.  We've been seeing an increasing number
> of complaints about that since the patches came out, and I'm pretty
> sure that most of the complainers are totally uninterested in
> defending against share-my-database-with-a-hostile-user scenarios.
> Why should they have to complicate their lives because of problems
> that other people might have?
> 
> The advantage of a function trust mechanism is that it'd provide
> security against calling functions you didn't intend to without
> any visible changes in normal application behavior.  The security
> team gave up on that approach because it seemed too complicated to
> pursue as a secretly-developed security patch, but I still think
> it's the right long-term answer.

Do you have a WIP patch partially developed for this?  If it is no
longer secret, perhaps the rest of us could take a look?

mark



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-08 Thread Tom Lane
Alvaro Herrera  writes:
> Unnoticed ABI breaks make my hair stand on end. 

Yeah.

> I'm unsure about pg11 -- is it a backbranch already or not?  Since we've
> released beta3 already, ISTM we should consider it so.

No.  See
https://www.postgresql.org/message-id/12725.1533737052%40sss.pgh.pa.us

At this point I'd judge it's still more valuable to minimize differences
between v11 and v12 than to avoid ABI breakages in v11.  It's unlikely
this would be the last ABI break before 11.0 anyway.

regards, tom lane



Re: Facility for detecting insecure object naming

2018-08-08 Thread Tom Lane
Isaac Morland  writes:
> On 8 August 2018 at 09:58, Tom Lane  wrote:
>> When the security team was discussing this issue before, we speculated
>> about ideas like inventing a function trust mechanism, so that attacks
>> based on search path manipulations would fail even if they managed to
>> capture an operator reference.  I'd rather go down that path than
>> encourage people to do more schema qualification.

> I must be missing something. Aren't search_path manipulation problems
> avoided by using "SET search_path FROM CURRENT"?

No, not if you have any publicly-writable schemas anywhere in your
search path.  The reason this business is so nasty is that our
historical default ("$user, public", with pg_catalog implicitly
searched before that) is actually insecure, even for references
intended to go to pg_catalog.  There are too many cases where the
ambiguous-operator resolution rules will allow a reference to be
captured by a similarly-named operator later in the search path.
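To make that concrete, a sketch of such a capture (hostile objects
hypothetical):

-- attacker, in a writable schema anywhere in the victim's search path:
CREATE FUNCTION public.not_really_eq(integer, numeric) RETURNS boolean
    LANGUAGE sql AS 'SELECT true';   -- a real payload would go here
CREATE OPERATOR public.= (
    LEFTARG = integer, RIGHTARG = numeric,
    PROCEDURE = public.not_really_eq);

-- victim: with intcol of type integer, "intcol = 1.5" now resolves to
-- public.= because the exact argument-type match beats pg_catalog's
-- operators, which would need an implicit cast:
SELECT * FROM t WHERE intcol = 1.5;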

If you're using a secure search_path to start with, it's possible
that decorating your functions with SET search_path FROM CURRENT
would be helpful, but it's not like that's some kind of magic bullet.

> While I'm asking, does anybody know why this isn't the default, especially
> for SECURITY DEFINER functions?

It might fix some subset of security issues, but I think that changing
the default behavior like that would break a bunch of other use-cases.
It'd be especially surprising for such a thing to apply only to
SECURITY DEFINER functions.

The bigger picture here is that it's really hard to fix this class
of problems by trying to dictate changes in the way people have their
search paths set up.  You're much more likely to break working
applications than help anybody.  I'm still of the opinion that we're
going to be forced to provide loopholes in what we did to pg_dump
et al in CVE-2018-1058.  We've been seeing an increasing number
of complaints about that since the patches came out, and I'm pretty
sure that most of the complainers are totally uninterested in
defending against share-my-database-with-a-hostile-user scenarios.
Why should they have to complicate their lives because of problems
that other people might have?

The advantage of a function trust mechanism is that it'd provide
security against calling functions you didn't intend to without
any visible changes in normal application behavior.  The security
team gave up on that approach because it seemed too complicated to
pursue as a secretly-developed security patch, but I still think
it's the right long-term answer.

regards, tom lane



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-08 Thread Alvaro Herrera
On 2018-Aug-08, Michael Paquier wrote:

> As this introduces a new
> field to PGPROC, back-patching the thing as-is would cause an ABI
> breakage.  Are folks here fine with the new field added to the bottom of
> the structure for the backpatched versions, including v11?  I have found
> about commit 13752743 which has also added a new field called
> isBackgroundWorker in the middle of PGPROC in a released branch, which
> looks to me like an ABI breakage...

Unnoticed ABI breaks make my hair stand on end. 

I suppose if we didn't know about 13752743 earlier, then not much
outside code relies on PGPROC, or at least its members after
isBackgroundWorker.  I wouldn't move it now (I suppose anyone who cared
has already adjusted for it), but please do put your new member at the
end in backbranches.
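The reasoning, in struct terms (illustration only, not the real PGPROC):

#include <stdbool.h>

typedef unsigned int Oid;       /* as in postgres_ext.h */

typedef struct ExampleProc
{
    int     pid;                /* offsets of existing members stay    */
    bool    isBackgroundWorker; /* fixed, so extensions compiled       */
    /* ... other existing fields ... */
    Oid     tempNamespaceId;    /* against the old header still work;  */
} ExampleProc;                  /* appending only grows sizeof()       */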

I'm unsure about pg11 -- is it a backbranch already or not?  Since we've
released beta3 already, ISTM we should consider it so.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Facility for detecting insecure object naming

2018-08-08 Thread Isaac Morland
On 8 August 2018 at 09:58, Tom Lane  wrote:

When the security team was discussing this issue before, we speculated
> about ideas like inventing a function trust mechanism, so that attacks
> based on search path manipulations would fail even if they managed to
> capture an operator reference.  I'd rather go down that path than
> encourage people to do more schema qualification.
>
>
I must be missing something. Aren't search_path manipulation problems
avoided by using "SET search_path FROM CURRENT"?

While I'm asking, does anybody know why this isn't the default, especially
for SECURITY DEFINER functions? It seems like in addition to being a more
secure default, it would be better for JIT compilation - right now it seems
you need to re-compile whenever the function is called with a different
search_path. The ability for a function's meaning to change dramatically
depending on the caller's search_path seems like an occasionally-useful
extra, not what one would expect as the default.
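For readers following along, the clause in question looks like this
(objects hypothetical):

-- pin the function to the search_path in effect at CREATE time,
-- rather than whatever the caller happens to have set:
SET search_path = admin, pg_catalog;

CREATE FUNCTION admin.get_balance(acct integer) RETURNS numeric
    LANGUAGE sql SECURITY DEFINER
    SET search_path FROM CURRENT       -- stored with the function
    AS 'SELECT balance FROM accounts WHERE id = acct';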


Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread David Steele
On 8/7/18 12:05 PM, Stephen Frost wrote:
>>
>> All I'm saying is that (assuming my understanding of RestoreArchivedFile is
>> correct) we can't just do that in the current restore_command. We do need a
>> way to ask the archive for some metadata/checksums, and restore_command is
>> too late.
> 
> Yeah, I see what you mean there.  An alternative might be to *not*
> unlink the file, in case the restore command wants to check if it's
> valid and matches what's in the archive, and instead restore the
> requested file somewhere else and then perform an unlink/mv after
> the restore command has been run.
I don't see any cases (in master, at least) where RestoredArchivedFile()
would unlink an existing WAL segment.  If you look at the call sites
they all pass "RECOVERYHISTORY" or "RECOVERYXLOG"  to the recovername
parameter.

RestoreArchivedFile() does overwrite the existing file after we decide
that we prefer the restored version over the one in pg_wal, but that
happens after restore_command returns.

So as far as I can see, it could be possible for the restore_command to
do something clever here.

Regards,
-- 
-David
da...@pgmasters.net



Re: Standby trying "restore_command" before local WAL

2018-08-08 Thread David Steele
On 8/7/18 11:42 AM, Stephen Frost wrote:
> 
>>> CRC's are per WAL record, and possibly some WAL records might not be ok
>>> to replay, or at least we need to make sure that we replay the right set
>>> of WAL in the right order even when there are partial WAL files being
>>> given to PG (that aren't named that way...).  The more I think about
>>> this, I think we really need to avoid partial WAL files entirely- what
>>> are we going to do when we get to the end of one?  We'd need to request
>>> the full one from the restore command anyway, so seems like we should
>>> just go ahead and get it from the archive, the question is if there's an
>>> easy/cheap way to detect partial WAL files in pg_wal.
>>
>> As explained above, I don't think this is actually a problem. The checksums
>> do cover the whole file thanks to chaining, and there are ways to detect
>> partial segments. IMHO it's fine if we replay a segment and then find out it
>> was partial and that we need to fetch it from archive anyway and re-apply it
>> - it should not be very common case, except when the user does something
>> silly.
> 
> As long as we *do* go off and try to fetch that WAL file and replay it,
> and don't assume that the end of that partial WAL file means the end of
> WAL replay, then I think you may be right and that it'd be fine, but it
> does seem a bit risky to me.

This assumes that the local partial is a subset of the archived full WAL
segment, which should be true in most cases but I don't think we can
discount the possibility that it isn't.  Split-brain is certainly a way
to get to differing partials, though in that case things are already
pretty bad.

I've seen some pretty messed up situations and usually it is best to
treat the WAL archive as the ground truth.  If the archive_command is
smart enough not to overwrite WAL segments that already exist with
different versions then it should be a reliable record that all servers
can be replayed from (split-brains aside).  I think it's best to treat
the local WAL with some suspicion unless it is known to be good, i.e.
just restored from archive.

I do agree that most inconsistencies could be detected and throw an
error, but only if the WAL in the repository is examined, which means
making a round-trip there anyway.

At the very least, it seems that simply enabling "read from pg_wal
first" is not a good idea without making other changes to ensure it is
done correctly.

Regards,
-- 
-David
da...@pgmasters.net



Re: pgsql: Fix run-time partition pruning for appends with multiple source

2018-08-08 Thread Tom Lane
Christoph Berg  writes:
> Since this commit added a new node type, all following node types got
> renumbered. This means all extension modules using the information
> compiled against 11beta2 need to be recompiled against 11beta3.

That's standard procedure for beta releases.  We don't normally start
to worry about keeping binary ABI compatibility until the .0 release.

> I believe this should be mentioned in the beta3 release notes.

Too late ...

regards, tom lane



Re: Facility for detecting insecure object naming

2018-08-08 Thread Tom Lane
Robert Haas  writes:
> +1.  Better still would be to invent a way to remove the need for such
> onerous qualification, but I don't have a good idea.

Yeah.

> So I guess what we need is a UI that lets you say either:

> - Don't tell me about anything, or
> - Tell me about all unqualified references, or
> - Tell me about all unqualified references except those that latch
> onto an object in 

> Personally, I'm not entirely sold on having that third capability.  I
> guess it's valuable, but the second one seems like the much more
> valuable thing.

I'm not sold on #2 either.  That path leads to, for example,
s/=/OPERATOR(pg_catalog.=)/g everywhere, which is utterly catastrophic
to both readability and portability of your SQL code.  We *must* find
a way to do better, not tell people that's what to do.

When the security team was discussing this issue before, we speculated
about ideas like inventing a function trust mechanism, so that attacks
based on search path manipulations would fail even if they managed to
capture an operator reference.  I'd rather go down that path than
encourage people to do more schema qualification.

regards, tom lane



Re: Typo in doc or wrong EXCLUDE implementation

2018-08-08 Thread Bruce Momjian
On Wed, Aug  8, 2018 at 01:55:53PM +0300, KES wrote:
> I do not know many internals and may be wrong.
> 
> But from my point of view, with my current knowledge:
> if such an exclusion constraint were marked as UNIQUE, we could use it for
> FKs while implementing temporal/bi-temporal tables.
> 
> And this would simplify the relationships while implementing them.

Yes, it would work, but doing that only for equality would be surprising
to many people because exclusion constraints are more general than
equality comparisons.
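For the equality-only case the poster shows, the workaround available
today is simply to declare the unique constraint explicitly (a sketch,
reusing the poster's table names):

ALTER TABLE person
    ADD CONSTRAINT person_uq_person_id UNIQUE (person_id);

ALTER TABLE person_x_person
    ADD CONSTRAINT person_x_person_fk_parent_person_id
    FOREIGN KEY (parent_person_id) REFERENCES person (person_id)
    ON DELETE CASCADE DEFERRABLE;

-- a gist exclusion constraint with && can still be kept alongside
-- for the temporal (range-overlap) part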

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [PATCH] Improve geometric types

2018-08-08 Thread Tomas Vondra




On 08/03/2018 02:39 PM, Tomas Vondra wrote:

On 08/03/2018 06:40 AM, Kyotaro HORIGUCHI wrote:

...

I'm not confident about partially replacing double with float8 in the
gist code. After the 0002 patch is applied, I still see mostly
problematic usage of double or bare arithmetic on dimensional values in
gistproc.c.


static inline float
non_negative(float val)
{
if (val >= 0.0f)
    return val;
else
    return 0.0f;
}


It is used as "non_negative(overlap)", where overlap is float4,
which is calculated using float8_mi.  Float4 makes sense only if
we need to store a large number of it to somewhere but they are
just working varialbles. Couldn't we eliminate float4 that
doesn't have a requirement to do so?



I'm not sure I follow. The patch does not modify non_negative() at all, 
and we still call it like this:


     if (non_negative(overlap) < non_negative(context->overlap) ||
     (range > context->range &&
  non_negative(overlap) <= non_negative(context->overlap)))
     selectthis = true;

where all the "overlap" values are still float4. The only thing that 
changed here is that instead of doing the arithmetic operations directly 
we call float8_mi/float8_div to benefit from the float8 handling.
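For context, what routing the arithmetic through the checked helpers
buys is roughly this (loosely modeled on the patch's float8_mi(); not
the actual core code):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static inline double
float8_mi_sketch(double val1, double val2)
{
    double  result = val1 - val2;

    /* fail loudly on overflow instead of silently returning Inf */
    if (isinf(result) && !isinf(val1) && !isinf(val2))
    {
        fprintf(stderr, "value out of range: overflow\n");
        exit(EXIT_FAILURE);
    }
    return result;
}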


So I'm not sure how the patch breaks this. And what do you mean by
'eliminate float4'?




Kyotaro-san, can you explain what your concerns regarding this bit are? 
I'd like to get 0002 committed, but I'm not sure I understand your point 
about the changes in gist code, so I can't address them. And I certainly 
don't want to just ignore them ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: partition tree inspection functions

2018-08-08 Thread Thomas Munro
On Wed, Aug 8, 2018 at 11:21 PM, Thomas Munro
 wrote:
> On Tue, Aug 7, 2018 at 7:32 PM, Amit Langote
>  wrote:
>> Attached updated patch adds isleaf to pg_partition_children's output.
>
> Hmm, I wonder where this garbage is coming from:

partition.c:437:3: error: array index 3 is past the end of the array
(which contains 3 elements) [-Werror,-Warray-bounds]

That'll do it.  Also:

partition.c:409:19: error: implicit declaration of function
'get_rel_relkind' is invalid in C99
[-Werror,-Wimplicit-function-declaration]

-- 
Thomas Munro
http://www.enterprisedb.com



Re: partition tree inspection functions

2018-08-08 Thread Thomas Munro
On Tue, Aug 7, 2018 at 7:32 PM, Amit Langote
 wrote:
> Attached updated patch adds isleaf to pg_partition_children's output.

Hi Amit,

Hmm, I wonder where this garbage is coming from:

6201   select *, pg_relation_size(relid) as size from
pg_partition_children('ptif_test');
6202 ! ERROR:  invalid byte sequence for encoding "UTF8": 0x8b

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.8126

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pgbench's expression parsing & negative numbers

2018-08-08 Thread Ibrar Ahmed
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

Patch does not apply cleanly on the master branch, anyways I managed that.  
Patch work according to specs, and no issue found. 
The only minor nit is that you can keep the full comments of function strtoint64

/*
 * If not errorOK, an error message is printed out.
 * If errorOK is true, just return "false" for bad input.
 */

Re: pgsql: Fix run-time partition pruning for appends with multiple source

2018-08-08 Thread Christoph Berg
Re: Tom Lane 2018-08-02 
> Fix run-time partition pruning for appends with multiple source rels.
> 
> The previous coding here supposed that if run-time partitioning applied to
> a particular Append/MergeAppend plan, then all child plans of that node
> must be members of a single partitioning hierarchy.  This is totally wrong,
> since an Append could be formed from a UNION ALL: we could have multiple
> hierarchies sharing the same Append, or child plans that aren't part of any
> hierarchy.
> 
> To fix, restructure the related plan-time and execution-time data
> structures so that we can have a separate list or array for each
> partitioning hierarchy.  Also track subplans that are not part of any
> hierarchy, and make sure they don't get pruned.
> 
> Per reports from Phil Florent and others.  Back-patch to v11, since
> the bug originated there.

Since this commit added a new node type, all following node types got
renumbered. This means all extension modules using the information
compiled against 11beta2 need to be recompiled against 11beta3.

For apt.postgresql.org, the list of affected modules (so far, haven't
yet checked all) is pgsql-ogr-fdw, plr, wal2json.

I believe this should be mentioned in the beta3 release notes.

Christoph

(Thanks to Andrew Gierth for helping me pinpoint the issue.)



Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-08-08 Thread David Rowley
On 8 August 2018 at 17:28, Amit Langote  wrote:
> Attached is a patch which modifies the if test to compare relids instead
> of RelOptInfo pointers.

Thanks for investigating and writing a patch. I agree with the fix.

It's probably worth writing a test that performs run-time pruning from
an inheritance planner plan.  Do you want to include that in your
patch? If not, I can.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-08 Thread Andres Freund
Hi,

On 2018-07-30 16:59:16 +0900, Michael Paquier wrote:
> + /*
> +  * Mark MyProc as owning this namespace which other processes can use to
> +  * decide if a temporary namespace is in use or not. Even if this is an
> +  * atomic operation, this can be safely done lock-less as no temporary
> +  * relations associated to it can be seen yet if scanning pg_class.
> +  */
> + MyProc->tempNamespaceId = namespaceId;

I can't parse this. "Even if this is an atomic operation, this can be
safely done lock-less" - that seems like a contradictory sentence. Is
there a "not" missing?

Also, this seems like insufficient reasoning. What guarantees the
visibility of the flag? You're going to have to talk about externally
implied memory ordering here.  Or add explicit barriers - the latter is
probably preferable.

Greetings,

Andres Freund



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-08 Thread Michael Paquier
On Tue, Jul 31, 2018 at 01:39:07PM +0900, Kyotaro HORIGUCHI wrote:
> "int backendID" is left in autovacuum.c as an unused variable.

Fixed.

> "Even if this is *not* an atomic operation" ?

Yeah, I am reworking this comment a bit more to map with the other
PGPROC fields.

> +  * Mark the proc entry as owning this namespace which autovacuum uses to
> +  * decide if a temporary file entry can be marked as orphaned or not. Even
> +  * if this is an atomic operation, this can be safely done lock-less as no
> +  * temporary relations associated to it can be seen yet by autovacuum when
> +  * scanning pg_class.
> +  */
> + MyProc->tempNamespaceId = namespaceId;
> 
> The comment looks wrong. After a crash with temp tables,
> pg_class has orphaned temp relations, and the namespace can belong to
> reconnected backends, especially for low-numbered ones.

I don't quite understand your comment.  InitProcess() initializes the
field, so if a backend reconnects and uses the same proc slot as a
backend which had a temporary namespace, then you would discard orphaned
tables.  Anyway, I have reworked it as such:
+   /*
+* Mark MyProc as owning this namespace which other processes can use to
+* decide if a temporary namespace is in use or not.  We assume that
+* assignment of namespaceId is an atomic operation.  Even if it is not,
+* there are no temporary relations associated with it and the temporary
+* namespace creation is not committed yet, so none of its contents can
+* be seen yet if scanning pg_class or pg_namespace.
+*/

As new minor versions have been tagged, I am planning to commit this
patch soon with some tweaks for the comments.  As this introduces a new
field to PGPROC, back-patching the thing as-is would cause an ABI
breakage.  Are folks here fine with the new field added to the bottom of
the structure for the backpatched versions, including v11?  I have found
about commit 13752743 which has also added a new field called
isBackgroundWorker in the middle of PGPROC in a released branch, which
looks to me like an ABI breakage...
--
Michael


signature.asc
Description: PGP signature


Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2018-08-08 Thread Kyotaro HORIGUCHI
Hello. Please find the attached.

At Fri, 3 Aug 2018 11:48:38 +0530, Ashutosh Bapat 
 wrote in 

> The purpose of non-Var node is to avoid adding the attribute to
> relation descriptor. Idea is to create a new node, which will act as a
> place holder for table oid or row id (whatever) to be fetched from the
> foreign server. I don't understand why do you think we need it to be
> added to the relation descriptor.

I chose to expand the tuple descriptor for the junk column added to
foreign relations. It might be better to have a new member in
ForeignScan, but I didn't do that so that we can backpatch it.

What the patch does are:

- This abuses ForeignScan->fdw_scan_tlist to store the additional
  junk columns for a foreign simple relation scan (that is, not a
  join).

  Several places seem to assume that fdw_scan_tlist may be used for
  a foreign scan on a simple relation, but I didn't find that this
  actually happens. This lets us avoid adding new data members to a
  core data structure. A separate member would be preferable in a
  new version.

- The remote OID request is added to the targetlist as a non-system
  junk column. get_relation_info expands per-column storage when
  creating the RelOptInfo so that the additional junk columns can be
  handled.

- ExecInitForeignScan is changed so that it expands the created tuple
  descriptor if it finds the junk columns stored in
  fdw_scan_tlist, so that make_tuple_from_result_row can store
  them. (ExecEvalWholeRowVar needed modification so that it ignores
  the expanded portion of the tuple descriptor.)

I'm not sure whether the following points are valid.

- If fdw_scan_tlist is used for simple relation scans, this would
  break the case. (ExecInitForeignScan,  set_foreignscan_references)

- I'm using the name "tableoid" for the junk column, but that name can
  also be used in a user query. The two point to different targets,
  so it doesn't matter at all, except that it makes the EXPLAIN
  output a bit confusing.

- The EXPLAIN code doesn't have a clue about the name of the added
  junk column. It is shown as "" in the EXPLAIN output.

| Update on public.fp
|   Remote SQL: UPDATE public.p SET b = $3 WHERE tableoid = $1 AND ctid = $2
|   ->  Foreign Scan on public.fp
| Output: a, (b + 1), "", ctid
| Filter: (random() <= '1'::double precision)
| Remote SQL: SELECT a, b, tableoid AS __remote_tableoid, ctid
| FROM public.p WHERE ((a = 0)) FOR UPDATE

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From fe660a5ab953d68a671861479fce0b3e60a57cd8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 8 Aug 2018 12:14:58 +0900
Subject: [PATCH 2/2] Regression test for update/delete on foreign partitioned
 table

Add test for foreign update on remote partitioned tables.
---
 contrib/postgres_fdw/expected/postgres_fdw.out | 221 -
 contrib/postgres_fdw/sql/postgres_fdw.sql  |  27 +++
 2 files changed, 167 insertions(+), 81 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index f5498c62bd..9ae329ab4f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5497,15 +5497,15 @@ INSERT INTO ft2 (c1,c2,c3)
   SELECT id, id % 10, to_char(id, 'FM0') FROM generate_series(2001, 2010) id;
 EXPLAIN (verbose, costs off)
 UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;-- can't be pushed down
-QUERY PLAN
---
+ QUERY PLAN 
+
  Update on public.ft2
Output: c1, c2, c3, c4, c5, c6, c7, c8
-   Remote SQL: UPDATE "S 1"."T 1" SET c3 = $2 WHERE ctid = $1 RETURNING "C 1", c2, c3, c4, c5, c6, c7, c8
+   Remote SQL: UPDATE "S 1"."T 1" SET c3 = $3 WHERE tableoid = $1 AND ctid = $2 RETURNING "C 1", c2, c3, c4, c5, c6, c7, c8
->  Foreign Scan on public.ft2
- Output: c1, c2, NULL::integer, 'bar'::text, c4, c5, c6, c7, c8, ctid
+ Output: c1, c2, NULL::integer, 'bar'::text, c4, c5, c6, c7, c8, "", ctid
  Filter: (postgres_fdw_abs(ft2.c1) > 2000)
- Remote SQL: SELECT "C 1", c2, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" FOR UPDATE
+ Remote SQL: SELECT "C 1", c2, c4, c5, c6, c7, c8, tableoid, ctid FROM "S 1"."T 1" FOR UPDATE
 (7 rows)
 
 UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
@@ -5532,13 +5532,13 @@ UPDATE ft2 SET c3 = 'baz'
 

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-08-08 Thread Marina Polyakova

On 07-08-2018 19:21, Fabien COELHO wrote:

Hello Marina,


Hello, Fabien!


v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
- a patch for the RandomState structure (this is used to reset a 
client's random seed during the repeating of transactions after 
serialization/deadlock failures).


About this v10 part 1:

Patch applies cleanly, compile, global & local make check both ok.

The random state is cleanly separated so that it will be easy to reset
it on client error handling ISTM that the pgbench side is
deterministic with
the separation of the seeds for different uses.

Code is clean, comments are clear.


:-)


I'm wondering what is the rational for the "xseed" field name? In
particular, what does the "x" stands for?


I called it "...seed" instead of "data" because perhaps the "data" is 
too general a name for use here (but I'm not entirely sure what Alvaro 
Herrera meant in [1], see my answer in [2]). I called it "xseed" to 
match the arguments of the functions _dorand48 / pg_erand48 /
pg_jrand48 in the file erand48.c. IIUC they use a linear congruential
generator, and perhaps "xseed" means the sequence named X of 48-bit
pseudorandom values (X_0, X_1, ... X_n), where X_0 is the
seed / the start value.
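For readers without the patch at hand, the structure under discussion is
roughly this (as I read the patch; only the field name is in question):

/* Per-client PRNG state: the 48-bit X_n value that pg_erand48() and
 * pg_jrand48() advance via X_{n+1} = (a * X_n + c) mod 2^48. */
typedef struct RandomState
{
    unsigned short xseed[3];
} RandomState;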


[1] 
https://www.postgresql.org/message-id/20180711180417.3ytmmwmonsr5lra7@alvherre.pgsql



LGTM, though I'd rename the random_state struct members so that it
wouldn't look as confusing.  Maybe that's just me.


[2] 
https://www.postgresql.org/message-id/cb2cde10e4e7a10a38b48e9cae8fbd28%40postgrespro.ru


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Facility for detecting insecure object naming

2018-08-08 Thread Noah Misch
On Wed, Aug 08, 2018 at 12:00:45PM +0530, Robert Haas wrote:
> On Sun, Aug 5, 2018 at 1:34 PM, Noah Misch  wrote:
> > If hackers and non-core extension authors are to write such code, let's make
> > it easier to check the work.
> 
> +1.  Better still would be to invent a way to remove the need for such
> onerous qualification, but I don't have a good idea.

Agreed.  Thanks for thinking about it.

> > a. SQL intended to run under secure search_path.  No class-specific rules.
> >src/bin code is an example of this class, and this is the only secure 
> > class
> >for end-user applications.
> >
> > b. SQL intended for a particular search_path, possibly untrusted.  
> > Unqualified
> >names need an exact match.  Use a qualified name for any object whose
> >schema appears in search_path later than some untrusted schema.  Examples
> >of this class include extension scripts, pre-CVE-2018-1058 pg_dump, some
> >functions with "SET search_path", and many casual end-user applications.
> >
> > c. SQL intended to work the same under every search_path setting.  Do not 
> > use
> >any unqualified name.  Most pg_catalog and contrib functions, but not 
> > those
> >having a "SET search_path" annotation, are examples of this class.
> 
> If I understand correctly, the only difference between (b) and (c) is
> that references to an object in the very first schema in the search
> path could be unqualified; in most cases, that would be pg_catalog.

In (b), there's no bound on the number of schemas for which qualification is
optional.  Consider search_path=pg_temp,pg_catalog,admin,"$user",public with
you trusting each of those schemas except "public".  The rules of (b) then do
not require any qualified names, though unqualified names shall arrange an
exact match of argument types.  On the other hand, with
search_path=public,equally_public, one shall qualify all names of objects
located in equally_public.

> So I guess what we need is a UI that lets you say either:
> 
> - Don't tell me about anything, or
> - Tell me about all unqualified references, or
> - Tell me about all unqualified references except those that latch
> onto an object in 

"SELECT exp(1.0)" merits a warning under search_path=public,pg_catalog but not
under search_path=pg_catalog,public.  Thus, it's more subtle than your last
bullet.  That's why I envisioned the user declaring which schemas to trust.
The system uses that and the actual search_path to choose warnings.

> Personally, I'm not entirely sold on having that third capability.  I
> guess it's valuable, but the second one seems like the much more
> valuable thing.

That's fair.  Even without pursuing that, it's valuable to know the list of
trusted schemas so we can relax about exact argument type matches when the
function is in a trusted schema.  For example, pg_catalog.exp(1) is safe
because no hostile user can create pg_catalog.exp(int4).
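
To make that concrete, here is a hypothetical demonstration of the
capture hazard (role and schema names invented; this is just the
pattern under discussion, not code from any patch):

  -- a user with CREATE on schema "public" plants a stand-in:
  CREATE FUNCTION public.exp(numeric) RETURNS numeric
      LANGUAGE sql AS $$ SELECT 0::numeric $$;

  -- a victim under search_path = public, pg_catalog reaches it via an
  -- unqualified call with an exact argument-type match:
  SET search_path = public, pg_catalog;
  SELECT exp(1.0);           -- resolves to public.exp(numeric)

  -- schema qualification stays safe even without an exact match:
  SELECT pg_catalog.exp(1);  -- no one can create pg_catalog.exp(int4)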



Re: REINDEX and shared catalogs

2018-08-08 Thread Robert Haas
On Mon, Aug 6, 2018 at 2:40 AM, Michael Paquier  wrote:
> In the case of REINDEX, we *allow* shared catalogs to be reindexed.
> Hence, if a user is a database owner, he would also be able to reindex
> critical indexes on shared catalogs, where blocking authentication is
> possible just with sessions connected to the database reindexed.  For a
> schema, the situation is basically worse since 9.5 as a schema owner can
> do the same with lighter permissions.  One can just run "SELECT * FROM
> pg_stat_activity" in a transaction block in session 1, run REINDEX in
> session 2, and cause the system to refuse new connections.  This is
> documented as well.
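
For reference, the quoted two-session recipe, spelled out (database
name arbitrary):

  -- session 1 (any connected user):
  BEGIN;
  SELECT * FROM pg_stat_activity;
  -- ... leave the transaction open ...

  -- session 2 (schema or database owner, pre-change behavior):
  REINDEX DATABASE postgres;
  -- per the report, this queues exclusive locks on critical shared
  -- catalog indexes behind session 1's transaction, so new connection
  -- attempts are blocked until session 1 commits or aborts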

In my opinion, the behavior change is probably OK, but not back-patchable.

I think that the documentation could be phrased more clearly.  If I
understand the proposed semantics, something like this might be about
right:

   Reindexing a single index or table requires being the owner of that
   index or table.  Reindexing a schema or database requires being the
   owner of that schema or database.  Note that it is therefore sometimes
   possible for non-superusers to rebuild indexes of tables owned by
   other users; however, as a special exception, when REINDEX DATABASE
   or REINDEX SCHEMA is issued by a non-superuser, indexes on shared
   catalogs will be skipped unless the user owns the catalog (which
   typically won't be the case).  Of course, superusers can always
   reindex anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow postgres_fdw passwordless non-superuser conns with prior superuser permission

2018-08-08 Thread Robert Haas
On Mon, Aug 6, 2018 at 8:52 AM, Craig Ringer  wrote:
> Currently postgres_fdw cannot be used with 'cert' authentication, i.e.
> client-certificate validation and cert cn => postgres username mapping. You
> also can't use things like Kerberos, SSPI, etc with a superuser-created FDW
> and username map.
>
> To permit this, I'd like to allow postgres_fdw user mappings to be created
> with a new 'permit_passwordless' option. Only the superuser is allowed to
> create such a mapping. If it's set to true, we bypass the
> check_conn_params(...) connection-string password check and the
> connect_pg_server(...) check for the conn using a password when a
> non-superuser establishes a connection.

Note that ab3f008a2dc364cf7fb75de0a691fb0c61586c8e provides some
relief -- if the superuser creates a view, then as of 11, the checks
won't be applied when unprivileged users select from the view.
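
For concreteness, a sketch of how the proposed mapping might be
created (permit_passwordless is only the option proposed above; it is
not part of any released postgres_fdw):

  -- as superuser:
  CREATE SERVER remote_srv FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'remote.example.com', dbname 'app',
               sslmode 'verify-full');
  CREATE USER MAPPING FOR app_user SERVER remote_srv
      OPTIONS (user 'app_user', permit_passwordless 'true');
  -- the outgoing connection would then rely on certificate, Kerberos,
  -- SSPI, etc. authentication instead of a password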

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Facility for detecting insecure object naming

2018-08-08 Thread Robert Haas
On Sun, Aug 5, 2018 at 1:34 PM, Noah Misch  wrote:
> If hackers and non-core extension authors are to write such code, let's make
> it easier to check the work.

+1.  Better still would be to invent a way to remove the need for such
onerous qualification, but I don't have a good idea.

> a. SQL intended to run under secure search_path.  No class-specific rules.
>src/bin code is an example of this class, and this is the only secure class
>for end-user applications.
>
> b. SQL intended for a particular search_path, possibly untrusted.  Unqualified
>names need an exact match.  Use a qualified name for any object whose
>schema appears in search_path later than some untrusted schema.  Examples
>of this class include extension scripts, pre-CVE-2018-1058 pg_dump, some
>functions with "SET search_path", and many casual end-user applications.
>
> c. SQL intended to work the same under every search_path setting.  Do not use
>any unqualified name.  Most pg_catalog and contrib functions, but not those
>having a "SET search_path" annotation, are examples of this class.

If I understand correctly, the only difference between (b) and (c) is
that references to an object in the very first schema in the search
path could be unqualified; in most cases, that would be pg_catalog.
So I guess what we need is a UI that lets you say either:

- Don't tell me about anything, or
- Tell me about all unqualified references, or
- Tell me about all unqualified references except those that latch
onto an object in 
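
As a sketch of what the second capability might look like in use (the
setting name and message are entirely hypothetical; nothing like this
exists yet):

  SET warn_unqualified_names = on;  -- hypothetical GUC
  SELECT exp(1.0);
  -- WARNING:  unqualified reference "exp" resolved via search_path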

Personally, I'm not entirely sold on having that third capability.  I
guess it's valuable, but the second one seems like the much more
valuable thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Negotiating the SCRAM channel binding type

2018-08-08 Thread Heikki Linnakangas

On 07/08/18 17:34, Stephen Frost wrote:

* Michael Paquier (mich...@paquier.xyz) wrote:

On Tue, Aug 07, 2018 at 02:32:27PM +0530, Robert Haas wrote:

On Sun, Aug 5, 2018 at 4:30 PM, Heikki Linnakangas  wrote:

Well, it'd be useless for users; there is no reason to switch off channel
binding if both the client and server support it. It might not add any
security you care about, but it won't do any harm either. The
non-channel-binding codepath is still exercised with non-SSL connections.


Is that true?  What if it makes a connection fail that you wanted to
succeed?  Suppose we discover a bug that makes connections using
channel binding fail on Thursdays.


Well, as things stand today on HEAD, if channel binding has a bug, this
makes SCRAM authentication unable to work over SSL, hence you need to
either drop SSL or SCRAM, or patch libpq so that the client tells the
server that it does not want to use channel binding.  None of those are
appealing.  Before 7729113, the client still had the choice of enforcing
that channel binding not be used over SSL, which I think is a win for
bypassing any potential future bugs.  On top of that, we can
automatically test, for *any* future SSL implementation, that (SSL +
SCRAM + no channel binding) actually works properly, which is, it seems
at least to me, a good thing for gaining confidence when a new SSL
implementation is added.


Uh, really?  We can come up with all kinds of potential bugs that might
exist in the world but I don't think we should be writing in options for
everything that might fail due to some bug existing that we don't know
about.


Yeah, if there's a bug, we'll fix it and move on, like with any other 
feature.



Now- if we thought that maybe there was some connection pooling solution
that could be made to work with SSL+SCRAM if channel binding is turned
off, then that's a use-case that maybe we should try and support, but
this notion that we need to be able to turn it off because there might
be a bug is hogwash, imv.  Now, I haven't seen a pooling solution
actually figure out a way to do SSL+SCRAM even without channel binding,
and there might not even be a way, so I'm currently a -1 on adding an
option to disable it, but if someone turned up tomorrow with a credible
approach to doing that, then I'd +1 adding the option.


Now that's a much more compelling argument for having an option. 
Essentially, you might have a legitimate proxy or connection pooler that 
acts like a man-in-the-middle.


The removed "channel_binding" libpq option wasn't very user-friendly, 
and wasn't very good for dealing with that scenario anyway; wouldn't you 
want to disable channel binding in the server rather than the client in 
that scenario? So I have no regrets in removing it. But going forward, 
we do need to put some thought into configuring this. We've talked a lot 
about a libpq option to require channel binding, but we should also have 
a server-side option to disable it.
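
For illustration, the client-side knob would presumably end up as a
libpq connection parameter, say channel_binding=require in a
connection string, paired with a server-side setting that turns
channel binding off; both names are hypothetical at this point in the
discussion.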


- Heikki