Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Masahiko Sawada
On Thu, Jun 21, 2018 at 2:53 PM, Nico Williams  wrote:
> On Thu, Jun 21, 2018 at 10:05:41AM +0900, Masahiko Sawada wrote:
>> On Thu, Jun 21, 2018 at 6:57 AM, Nico Williams  wrote:
>> > On Wed, Jun 20, 2018 at 05:16:46PM -0400, Bruce Momjian wrote:
>> >> On Mon, Jun 18, 2018 at 12:29:57PM -0500, Nico Williams wrote:
>> >> > Note that unless the pg_catalog is protected against manipulation by
>> >> > remote storage, then TDE for user tables might be possible to
>> >> > compromise.  Like so: the attacker manipulates the pg_catalog to
>> >> > escalate privelege in order to obtain the TDE keys.  This argues for
>> >> > full database encryption, not just specific tables or columns.  But
>> >> > again, this is for the threat model where the storage is the threat.
>> >>
>> >> Yes, one big problem with per-column encryption is that administrators
>> >> can silently delete data, though they can't add or modify it.
>> >
>> > They can also re-add ("replay") deleted values; this can only be
>> > defeated by also binding TX IDs or alike in the ciphertext.  And if you
>> > don't bind the encrypted values to the PKs then they can add any value
>> > they've seen to different rows.
>>
>> I think we could avoid it by implementations. If we implement
>> per-column encryption by putting all encrypted columns out to another
>> table like TOAST table and encrypting whole that external table then
>> we can do per-column encryption without such concerns. Also, that way
>> we can encrypt data when disk I/O even if we use per-column
>> encryption. It would get a better performance. A downside of this idea
>> is extra overhead to access encrypted column but it would be
>> predictable since we have TOAST.
>
> The case we were discussing was one where the threat model is that the
> DBAs are the threat.  It is only in that case that the replay,
> cut-n-paste, and silent deletion attacks are relevant.  Encrypting a
> table, or the whole DB, on the server side, does nothing to protect
> against that threat.
>
> Never lose track of the threat model.
>

Understood.

>> On Thu, Jun 21, 2018 at 6:57 AM, Nico Williams  wrote:
>> So on the whole I think that crypto is a poor fit for the DBAs-are-the-
>> threat threat model.  It's better to reduce the number of DBAs/sysadmins
>> and audit all privileged (and, for good measure, unprivileged) access.

I agree with this. In-database data encryption mainly defends against
two threats: storage theft and memory dump attacks. I'm fairly sure this
design was proposed for the former purpose. If we want to defend against
the latter, we must keep data encrypted even in database memory. To be
honest, I'm not sure there is much practical demand for defending
against memory dumps; what users usually need is to defend against
storage theft with minimal performance overhead. It's known that
client-side encryption, or keeping data encrypted in database memory,
adds extra performance overhead. So it would be better to have several
ways to defend against different threats, as Joe mentioned.

As long as we encrypt data transparently inside the database, both the
encryption of the network traffic between database server and client and
the encryption of logical backups (e.g. pg_dump) remain open problems.
For network encryption we can use SSL for now, but logical backups need
to be addressed in some other way.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Push down Aggregates below joins

2018-06-21 Thread Heikki Linnakangas

On 21/06/18 09:11, Antonin Houska wrote:

Tomas Vondra  wrote:


On 06/20/2018 10:12 PM, Heikki Linnakangas wrote:

Currently, the planner always first decides the scan/join order, and
adds Group/Agg nodes on top of the joins. Sometimes it would be legal,
and beneficial, to perform the aggregation below a join. I've been
hacking on a patch to allow that.


There was a patch [1] from Antonin Houska aiming to achieve something
similar. IIRC it aimed to push the aggregate down in more cases,
leveraging the partial aggregation stuff.


Yes, I put the work on hold when it became clear that it wouldn't find
its way into v11. I'm about to get back to it next week, and will at
least rebase it so it can be applied to the current master branch.


Ah, cool! I missed that thread earlier. Yes, seems like we've been 
hacking on the same feature. Let's compare!


I've been using this paper as a guide:

"Including Group-By in Query Optimization", by Surajit Chaudhuri and 
Kyuseok Shim:

https://pdfs.semanticscholar.org/3079/5447cec18753254edbbd7839f0afa58b2a39.pdf

Using the terms from that paper, my patch does only the "Invariant
Grouping transformation", while yours can do "Simple Coalescing
Grouping", which is more general. In layman's terms, my patch can push
the Aggregate below a join, while your patch can also split an Aggregate
so that you do a partial aggregate below the join, and a final stage
above it. My thinking was to start with the simpler Invariant Grouping
transformation first, and do the more advanced splitting into partial
aggregates later, as a separate patch.


Doing partial aggregation actually made your patch simpler in some ways, 
though. I had to make some changes to grouping_planner(), because in my 
patch, it is no longer responsible for aggregation. In your patch, it's 
still responsible for doing the final aggregation.


There's some difference in the way we're dealing with the grouped 
RelOptInfos. You're storing them completely separately, in PlannerInfo's 
new 'simple_grouped_rel_array' array and 'join_grouped_rel_list/hash'. 
I'm attaching each grouped RelOptInfo to its parent.


I got away with much less code churn in allpaths.c, indxpath.c, 
joinpath.c. You're adding new 'do_aggregate' flags to many functions. 
I'm not sure if you needed that because you do partial aggregation and I 
don't, but it would be nice to avoid it.


You're introducing a new GroupedVar expression to represent Aggrefs
between the partial and final aggregates, while I'm just using Aggref
directly, above the aggregate node. I'm not thrilled about introducing a
new Var-like expression. We already have PlaceHolderVars, and I'm always
confused about how those work. But maybe that's necessary to support
partial aggregation?


In the other thread, Robert Haas wrote:

Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
p.i GROUP BY p.i" you assume that it's OK to do a Partial
HashAggregate over c1.parent rather than p.i.  This will be false if,
say, c1.parent is of type citext and p.i is of type text; this will
get grouped together that shouldn't.  It will also be false if the
grouping expression is something like GROUP BY length(p.i::text),
because one value could be '0'::numeric and the other '0.00'::numeric.
I can't think of a reason why it would be false if the grouping
expressions are both simple Vars of the same underlying data type, but
I'm a little nervous that I might be wrong even about that case.
Maybe you've handled all of this somehow, but it's not obvious to me
that it has been considered.


Ah, I made the same mistake. I did consider the "GROUP BY 
length(o.i::text)", but not the cross-datatype case. I think we should 
punt on that for now, and only do the substitution for simple Vars of 
the same datatype. That seems safe to me.


Overall, your patch is in a more polished state than my prototype. For 
easier review, though, I think we should try to get something smaller 
committed first, and build on that. Let's punt on the Var substitution. 
And I'd suggest adopting my approach of attaching the grouped 
RelOptInfos to the parent RelOptInfo, that seems simpler. And if we punt 
on the partial aggregation, and only allow pushing down the whole 
Aggregate, how much smaller would your patch get?



(I'll be on vacation for the next two weeks, but I'll be following this 
thread. After that, I plan to focus on this feature, as time from 
reviewing patches in the commitfest permits.)


- Heikki



"wal receiver" process hang in syslog() while exiting after receiving SIGTERM while the postgres has been promoted.

2018-06-21 Thread Chen, Yan-Jack (NSB - CN/Hangzhou)
Hi Hackers,
  We have encountered a problem while trying to promote a standby
postgres (version 9.6.9) server to active.  From the trace (we triggered
the process abort) we can see that the process hangs in syslog() while
handling SIGTERM.  According to the article below, it looks risky to
write to syslog from a signal handler.  Any idea how to avoid it?

https://cwe.mitre.org/data/definitions/479.html
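
For reference, the usual way around CWE-479 is to keep the handler
async-signal-safe: it only sets a flag, and the syslog()/ereport() work
happens later at a safe point in the normal control flow. A minimal
sketch of that pattern (illustrative only, not a proposed patch; all
names are made up):

#include <signal.h>
#include <syslog.h>
#include <unistd.h>

/*
 * The handler only sets a flag (sig_atomic_t writes are safe); the
 * non-reentrant syslog() call happens later, at a well-defined point
 * in the normal control flow, never inside the handler itself.
 */
static volatile sig_atomic_t shutdown_requested = 0;

static void
handle_sigterm(int signo)
{
    shutdown_requested = 1;     /* nothing else: no syslog(), no malloc() */
}

static void
check_for_shutdown(void)
{
    if (shutdown_requested)
    {
        syslog(LOG_ERR, "terminating walreceiver process due to administrator command");
        _exit(1);
    }
}

int
main(void)
{
    signal(SIGTERM, handle_sigterm);
    for (;;)
    {
        check_for_shutdown();   /* safe place to emit the message */
        pause();                /* stand-in for the real work loop */
    }
}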


[New LWP 7741]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: wal receiver process '.
Program terminated with signal SIGABRT, Aborted.
#0  0x7ff25122d8db in __lll_lock_wait_private () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install 
postgresql-server-9.6.9-3.wf28.x86_64
(gdb) bt full

#0  0x7ff25122d8db in __lll_lock_wait_private () from /lib64/libc.so.6
No symbol table info available.
#1  0x7ff25121aa7b in __vsyslog_chk () from /lib64/libc.so.6
No symbol table info available.
#2  0x7ff25121ae5c in syslog () from /lib64/libc.so.6
No symbol table info available.
#3  0x00705e88 in write_syslog (level=3, line=0x2b19cd8 "FATAL:  
terminating walreceiver process due to administrator command\n") at elog.c:2033
buf = "FATAL:  terminating walreceiver process due to administrator 
command\000\200\310\a\000\000\000\000\000U\211*Q\362\177\000\000\000\000\000\000\000\000\000\000P\317\033\373\377\177",
 '\000' , "\020\272\261\002\000\000\000\000\020", '\000' 
, "\001", '\000' , "\240\311\033"...
buflen = 68
i = 
chunk_nr = 
seq = 4
len = 69
nlpos = 0x2b19d1c "\n"
#4  0x00707105 in send_message_to_server_log (edata=0xb4ded8 
) at elog.c:3003
syslog_level = 
buf = {data = 0x2b19cd8 "FATAL:  terminating walreceiver process due to 
administrator command\n", len = 69, maxlen = 1024, cursor = 0}
buf = 
syslog_level = 
#5  EmitErrorReport () at elog.c:1479
edata = 0xb4ded8 
oldcontext = 
#6  0x00707d8f in errfinish (dummy=) at elog.c:483
edata = 0xb4ded8 
elevel = 21
oldcontext = 
econtext = 0x0
__func__ = "errfinish"
#7  0x00622e58 in ProcessWalRcvInterrupts () at walreceiver.c:171
No locals.
#8  0x00622e9f in WalRcvShutdownHandler (postgres_signal_arg=) at walreceiver.c:821
save_errno = 0
#9  
No symbol table info available.
#10 0x7ff2512218fb in send () from /lib64/libc.so.6
No symbol table info available.
#11 0x7ff25121ab72 in __vsyslog_chk () from /lib64/libc.so.6
No symbol table info available.
#12 0x7ff25121ae5c in syslog () from /lib64/libc.so.6
No symbol table info available.
#13 0x00705e88 in write_syslog (level=3, line=0x2b19929 "\t\tIs the 
server running on host \"192.168.1.8\" and accepting\n\t\tTCP/IP connections on 
port 5432?\n\t\n") at elog.c:2033
buf = "\t\tIs the server running on host \"192.168.1.8\" and 
accepting\000onnect to server: Connection 
refused\000\377\177\000\000x\376\263\002", '\000' , 
"\003\000\000\000\000\000\000\000\000\300\rR\362\177\000\000\377\377\377\377\000\000\000\000\343\030bm\225k\330\001\270\267\337E\362\177\000\000\360\372\263\002\000\000\000\000@\324\033\373\377\177\000\000\000\267\261\030\275\036y=`\324\033\373\377\177\000\000`\324\033\373\377\177\000\000\220\325\033"...
buflen = 59
i = 
chunk_nr = 
seq = 4
len = 97
nlpos = 0x2b19964 "\n\t\tTCP/IP connections on port 5432?\n\t\n"
#14 0x00707105 in send_message_to_server_log (edata=0xb4de20 
) at elog.c:3003
syslog_level = 
buf = {data = 0x2b198c8 "terminating walreceiver process due to 
administrator command", len = 194, maxlen = 1024, cursor = 0}
buf = 
syslog_level = 
#15 EmitErrorReport () at elog.c:1479
edata = 0xb4de20 
oldcontext = 
#16 0x00707d8f in errfinish (dummy=) at elog.c:483
edata = 0xb4de20 
elevel = 21
oldcontext = 
econtext = 0x0
__func__ = "errfinish"
#17 0x7ff245dfcc90 in libpqrcv_connect () from /usr/lib/libpqwalreceiver.so
No symbol table info available.
#18 0x0062352b in WalReceiverMain () at walreceiver.c:314
conninfo = "host=192.168.1.8 port=5432 
user=_nokfssystestpostgres\000ION\000\354Q\362\177\000\000\001", '\000' 
, 
"[\fR\362\177\000\000\250o\023Q\362\177\000\000\060\335\033\373\377\177\000\000:\274\354Q\362\177\000\000x",
 '\000' , 
"\020NQ\362\177\000\000x\000\000\000\000\000\000\000\222\335\033\373\377\177\000\000=\245*Q\362\177\000\000\222\335\033\373\377\177\000\000:\274\354Q\362\177\000\000=\036",
 '\000' , "\001\000\000\000\000\000\000\000"...
tmp_conninfo = 
slotname = "\000\000\000\000\000\000\000\000?\002", '\000' , " 

Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-06-21 Thread Arseny Sher
Hello,

Thank you for your time.

Alvaro Herrera  writes:

> On 2018-Jun-11, Antonin Houska wrote:
>
>>
>> One comment about the coding: since you introduce a new transaction list and
>> it's sorted by LSN, I think you should make the function AssertTXNLsnOrder()
>> more generic and use it to check the by_base_snapshot_lsn list too. For
>> example call it after the list insertion and deletion in
>> ReorderBufferPassBaseSnapshot().

Ok, probably this makes sense. Updated patch is attached.

>> Also I think it makes no harm if you split the diff into two, although both
>> fixes use the by_base_snapshot_lsn field.
>
> Please don't.

Yeah, I don't think we should bother with that.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 61b4b0a89c95f8912729c54c013b64033f88fccf Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Thu, 21 Jun 2018 10:29:07 +0300
Subject: [PATCH] Fix slot's xmin advancement and subxact's lost snapshots in
 decoding

There are two in some way related bugs here. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of slot was set to
oldest running xid in the record, which is wrong: actually snapshots which will
be used for not-yet-replayed at this moment xacts might consider older xacts as
running too. The problem wasn't noticed earlier because DDL which allows to
delete tuple (set xmax) while some another not yet committed xact looks at it
is pretty rare, if not unique: e.g. all forms of ALTER TABLE which change schema
acquire ACCESS EXCLUSIVE lock conflicting with any inserts. Included test
case uses ALTER of a composite type, which doesn't have such interlocking.

To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. Proposed
fix adds yet another list of ReorderBufferTXNs to it, where xacts are sorted by
LSN of the base snapshot, because
 * Plain usage of current ReorderBufferGetOldestTXN (to get base snapshot's
   xmin) is wrong, because order by LSN of the first record is *not* the same as
   order by LSN of record on which base snapshot was assigned: many record types
   don't need the snapshot, e.g. xl_xact_assignment. We could remember
   snapbuilder's xmin during processing of the first record, but that's too
   conservative and we would keep more tuples from vacuuming than needed.
 * On the other hand, we can't fully replace existing order by LSN of the first
   change with LSN of the base snapshot, because if we do that, WAL records
   which didn't forced receival of base snap might be recycled before we
   replay the xact, as ReorderBufferGetOldestTXN determines which LSN still must
   be kept.

Second issue concerns subxacts. Currently SnapBuildDistributeNewCatalogSnapshot
never spares a snapshot to known subxact. It means that if toplevel doesn't have
base snapshot (never did writes), there was an assignment record (subxact is
known) and subxact is writing, no new snapshots are being queued. Probably the
simplest fix would be to employ freshly baked by_base_snapshot_lsn list and just
distribute snapshot to all xacts with base snapshot whether they are subxacts or
not. However, in this case we would queue unneccessary snapshot to all already
known subxacts. To avoid this, in this patch base snapshot of known subxact
is immediately passed to the toplevel as soon as we learn the kinship / base
snap assigned -- we have to do that anyway in ReorderBufferCommitChild; new
snaps are distributed only to top-level xacts (or unknown subxacts) as before.

Also, minor memory leak is fixed: counter of toplevel's old base
snapshot was not decremented when snap has been passed from child.
---
 contrib/test_decoding/Makefile |   3 +-
 contrib/test_decoding/expected/oldest_xmin.out |  27 +++
 .../test_decoding/expected/subxact_snap_pass.out   |  49 +
 contrib/test_decoding/specs/oldest_xmin.spec   |  37 
 contrib/test_decoding/specs/subxact_snap_pass.spec |  42 +
 src/backend/replication/logical/reorderbuffer.c| 208 +++--
 src/backend/replication/logical/snapbuild.c|  15 +-
 src/include/replication/reorderbuffer.h|  15 +-
 8 files changed, 330 insertions(+), 66 deletions(-)
 create mode 100644 contrib/test_decoding/expected/oldest_xmin.out
 create mode 100644 contrib/test_decoding/expected/subxact_snap_pass.out
 create mode 100644 contrib/test_decoding/specs/oldest_xmin.spec
 create mode 100644 contrib/test_decoding/specs/subxact_snap_pass.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 1d601d8144..05d7766cd5 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -50,7 +50,8 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	$(pg_regress_installcheck) \
 	$(REGRESSCHECKS)
 
-ISOLATIONCHECKS=mxact delayed_startup ondisk_startup 

Re: libpq compression

2018-06-21 Thread Konstantin Knizhnik




On 20.06.2018 23:34, Robbie Harwood wrote:

Konstantin Knizhnik  writes:


My idea was the following: client want to use compression. But server
may reject this attempt (for any reasons: it doesn't support it, has
no proper compression library, do not want to spend CPU for
decompression,...) Right now compression algorithm is hardcoded. But
in future client and server may negotiate to choose proper compression
protocol.  This is why I prefer to perform negotiation between client
and server to enable compression.
Well, for negotiation you could put the name of the algorithm you want
in the startup.  It doesn't have to be a boolean for compression, and
then you don't need an additional round-trip.


Sorry, I can only repeat the arguments I already mentioned:
- in the future it may be possible to specify the compression algorithm;
- even with a boolean compression option, the server may have reasons to
reject the client's request to use compression.

Extra flexibility is always a good thing if it doesn't cost too much,
and an extra round of negotiation when enabling compression doesn't seem
to me too high a price for it.




Well, that's a design decision you've made.  You could put lengths on
chunks that are sent out - then you'd know exactly how much is needed.
(For instance, 4 bytes of network-order length followed by a complete
payload.)  Then you'd absolutely know whether you have enough to
decompress or not.
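
For reference, the length-prefixed framing described above could look
like the following minimal sketch (illustrative only, not part of either
patch; frame_chunk is a made-up helper):

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Prefix each compressed chunk with a 4-byte network-order length so
 * the receiver knows exactly how many bytes to accumulate before
 * handing them to the decompressor.
 */
static size_t
frame_chunk(const char *payload, uint32_t len, char *out)
{
    uint32_t    nlen = htonl(len);

    memcpy(out, &nlen, sizeof(nlen));           /* length header */
    memcpy(out + sizeof(nlen), payload, len);   /* compressed payload */
    return sizeof(nlen) + len;                  /* bytes to put on the wire */
}

The cost is four bytes per chunk, which is what the reply below objects
to for message-sized chunks.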


Do you really suggest sending an extra header for each chunk of data?
Please notice that a chunk can be as small as one message (a dozen
bytes), because libpq is used for client-server communication with a
request-reply pattern.


Frankly speaking, I do not completely understand the source of your
concern. My primary goal was to preserve the behavior of the libpq
functions as much as possible, so that there is no need to rewrite all
the places in the Postgres code where they are used. It seems to me that
I succeeded in reaching this goal: when compression is enabled, the
zpq_stream functions (zpq_read/write) are used instead of
(pq)secure_read/write, and in turn use them to fetch/send more data. I
do not see any flaws, encapsulation violations or other problems in this
solution.


So before discussing alternative ways of embedding compression in libpq,
I want to understand what's wrong with this approach.




So loop should be something like this:

   decompress(ptr, n)
  ZSTD_inBuffer in = {0}
  ZSTD_outBuffer out = {0}

  in.src = ptr
  in.size = n

  while true
  ret = ZSTD_decompressStream(out, in)
  if ZSTD_isError(ret):
  give_up()
if out.pos != 0
// if we decompressed something
return out.pos;
read_input(in);

The last line is what I'm taking issue with.  The interface we have
already in postgres's network code has a notion of partial reads, or
that reads might not return data.  (Same with writing and partial
writes.)  So you'd buffer what compressed data you have and return -
this is the preferred idiom so that we don't block or spin on a
nonblocking socket.
If socket is in non-blocking mode and there is available data, then 
secure_read  function will also immediately return 0.
The pseudocode above is not quite correct. Let me show the real 
implementation of zpq_read:


ssize_t
zpq_read(ZpqStream *zs, void *buf, size_t size, size_t *processed)
{
    ssize_t rc;
    ZSTD_outBuffer out;
    out.dst = buf;
    out.pos = 0;
    out.size = size;

    while (1)
    {
        rc = ZSTD_decompressStream(zs->rx_stream, &out, &zs->rx);
        if (ZSTD_isError(rc))
        {
            zs->rx_error = ZSTD_getErrorName(rc);
            return ZPQ_DECOMPRESS_ERROR;
        }
        /* Return result if we fill requested amount of bytes or read operation was performed */
        if (out.pos != 0)
        {
            zs->rx_total_raw += out.pos;
            return out.pos;
        }
        if (zs->rx.pos == zs->rx.size)
        {
            zs->rx.pos = zs->rx.size = 0; /* Reset rx buffer */
        }
        rc = zs->rx_func(zs->arg, (char*)zs->rx.src + zs->rx.size, ZPQ_BUFFER_SIZE - zs->rx.size);
        if (rc > 0) /* read fetches some data */
        {
            zs->rx.size += rc;
            zs->rx_total += rc;
        }
        else /* read failed */
        {
            *processed = out.pos;
            zs->rx_total_raw += out.pos;
            return rc;
        }
    }
}

Sorry, but I have spent quite a lot of time trying to make
zpq_read/write behave the same as secure_read/write in both blocking and
non-blocking mode, and I hope that behavior is now preserved. Frankly
speaking, I do not see much difference between this approach and the way
TLS is supported. The current implementation allows compression to be
combined with TLS, and in some cases that may be really useful.




This is how the TLS code works already.  Look at, for instance,
pgtls_read().  If we get back 

Re: bug with expression index on partition

2018-06-21 Thread Amit Langote
On 2018/06/21 15:35, Amit Langote wrote:
> So, CompareIndexInfo and generateClonedIndexStmt are both doing the right
> thing, but DefineIndex is not.  Attached is a patch to fix DefineIndex so
> that it converts indexParams before recursing to create the index on a
> partition.

I noticed that while CompareIndexInfo and generateClonedIndexStmt reject
the case where index expressions contain a whole-row Var, my patch
didn't teach DefineIndex to do the same, causing asymmetric behavior:
ATTACH PARTITION would error out when trying to clone a parent's index
that contains a whole-row Var, but recursively creating the index on a
partition would not.

I updated the patch so that DefineIndex also checks whether any
whole-row Vars were encountered during conversion and errors out if so.

Thanks,
Amit
From f28b5e4ab8aa2c41486ef9f64b2f52220cfbfcbf Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 21 Jun 2018 14:50:20 +0900
Subject: [PATCH v2] Convert indexParams to partition's attnos before recursing

---
 src/backend/commands/indexcmds.c   | 21 +
 src/test/regress/expected/indexing.out | 54 ++
 src/test/regress/sql/indexing.sql  | 34 +
 3 files changed, 109 insertions(+)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 3a3223bffb..7bbdeb9e66 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -993,7 +993,28 @@ DefineIndex(Oid relationId,
 			{
 				IndexStmt  *childStmt = copyObject(stmt);
 				bool		found_whole_row;
+				ListCell   *lc;
 
+				/* Adjust Vars to match new table's column numbering */
+				foreach(lc, childStmt->indexParams)
+				{
+					IndexElem  *ielem = lfirst(lc);
+
+					/*
+					 * If the index parameter is an expression, we must
+					 * translate it to contain child Vars.
+					 */
+					if (ielem->expr)
+					{
+						ielem->expr =
+							map_variable_attnos((Node *) ielem->expr,
+												1, 0, attmap, maplen,
+												InvalidOid,
+												&found_whole_row);
+						if (found_whole_row)
+							elog(ERROR, "cannot convert whole-row table reference");
+					}
+				}
 				childStmt->whereClause =
 					map_variable_attnos(stmt->whereClause, 1, 0,
 										attmap, maplen,
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 2c2bf44aa8..02c5a8e5ed 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1337,3 +1337,57 @@ insert into covidxpart values (4, 1);
 insert into covidxpart values (4, 1);
 ERROR:  duplicate key value violates unique constraint "covidxpart4_a_b_idx"
 DETAIL:  Key (a)=(4) already exists.
+-- tests covering expression indexes in a partition tree with differing attributes
+create table idx_part (a int) partition by hash (a);
+create table idx_part1 partition of idx_part for values with (modulus 2, remainder 0);
+create table idx_part2 partition of idx_part for values with (modulus 2, remainder 1);
+alter table idx_part detach partition idx_part2;
+alter table idx_part2 drop a, add a int;
+alter table idx_part attach partition idx_part2 for values with (modulus 2, remainder 1);
+create index idx_part_expr on idx_part ((a + 1));
+\d idx_part2
+              Table "public.idx_part2"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ a      | integer |           |          | 
+Partition of: idx_part FOR VALUES WITH (modulus 2, remainder 1)
+Indexes:
+    "idx_part2_expr_idx" btree ((a + 1))
+
+drop index idx_part_expr;
+alter table idx_part detach partition idx_part2;
+create index idx_part2_expr 

Buildfarm failure in rolenames.sql

2018-06-21 Thread Michael Paquier
Hi all,

Some of your may have noticed, but the buildfarm has thrown up the
following failure today:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=gull=HEAD

The failure is the following one:
-- ALTER TABLE OWNER TO
\c -
! \connect: FATAL:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

I would not have bothered mentioning this, as it could perfectly well be
a spurious failure, but I actually saw the *same* failure for the *same*
block number in the *same* test on my laptop just two days ago, so I
suspect that we are onto something here.  I saw the failure on my laptop
with gcc (7.3) and Debian sid, and gull is using Debian 9.4, which seems
too much for a simple coincidence.  The thing is hard to reproduce with
only the main regression test suite.

Thoughts?
--
Michael


signature.asc
Description: PGP signature


partition table and stddev() /variance() behaviour

2018-06-21 Thread Rajkumar Raghuwanshi
Hi,

I am getting different output for stddev/variance functions with partition
tables.


CREATE TABLE part (c1 INT,c2 INT) PARTITION BY RANGE (c1);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (1) TO (3);
CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (3) TO (5);

INSERT INTO part VALUES (1,5),(2,15),(3,3),(4,17);

postgres=# SET parallel_setup_cost=0;
SET
postgres=# EXPLAIN SELECT COUNT(c1),STDDEV(c2),VARIANCE(c2) FROM part;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=70.36..70.37 rows=1 width=72)
   ->  Gather  (cost=70.12..70.33 rows=2 width=72)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=70.12..70.13 rows=1 width=72)
               ->  Parallel Append  (cost=0.00..56.00 rows=1882 width=8)
                     ->  Parallel Seq Scan on part_p1  (cost=0.00..23.29 rows=1329 width=8)
                     ->  Parallel Seq Scan on part_p2  (cost=0.00..23.29 rows=1329 width=8)
(7 rows)

postgres=# SELECT COUNT(c1),STDDEV(c2),VARIANCE(c2) FROM part;
 count | stddev | variance 
-------+--------+----------
     4 |      0 |        0
(1 row)

postgres=# RESET parallel_setup_cost;
RESET
postgres=# EXPLAIN SELECT COUNT(c1),STDDEV(c2),VARIANCE(c2) FROM part;
  QUERY PLAN
---
 Aggregate  (cost=121.71..121.72 rows=1 width=72)
   ->  Append  (cost=0.00..87.80 rows=4520 width=8)
 ->  Seq Scan on part_p1  (cost=0.00..32.60 rows=2260 width=8)
 ->  Seq Scan on part_p2  (cost=0.00..32.60 rows=2260 width=8)
(4 rows)

postgres=# SELECT COUNT(c1),STDDEV(c2),VARIANCE(c2) FROM part;
 count |       stddev       |      variance      
-------+--------------------+---------------------
     4 | 7.0237691685684926 | 49.3333333333333333
(1 row)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Bruce Momjian
On Wed, Jun 20, 2018 at 04:57:18PM -0500, Nico Williams wrote:
> On Wed, Jun 20, 2018 at 05:16:46PM -0400, Bruce Momjian wrote:
> > On Mon, Jun 18, 2018 at 12:29:57PM -0500, Nico Williams wrote:
> > > Note that unless the pg_catalog is protected against manipulation by
> > > remote storage, then TDE for user tables might be possible to
> > > compromise.  Like so: the attacker manipulates the pg_catalog to
> > > escalate privelege in order to obtain the TDE keys.  This argues for
> > > full database encryption, not just specific tables or columns.  But
> > > again, this is for the threat model where the storage is the threat.
> > 
> > Yes, one big problem with per-column encryption is that administrators
> > can silently delete data, though they can't add or modify it.
> 
> They can also re-add ("replay") deleted values; this can only be
> defeated by also binding TX IDs or alike in the ciphertext.  And if you

Yes, and if you bind TX IDs so you can detect loss, you effectively have
to serialize every transaction, which is going to kill performance.

> don't bind the encrypted values to the PKs then they can add any value
> they've seen to different rows.

Yep, you kind of have to add the primary key into the encrypted value.
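
One common way to get that binding without changing the stored value
format much is an AEAD cipher mode that takes the primary key (and, if
desired, a transaction identifier) as associated data: a ciphertext
replayed onto a different row then fails tag verification.  A minimal
OpenSSL sketch, purely illustrative and not something proposed in this
thread:

#include <openssl/evp.h>

/*
 * Encrypt one column value with AES-256-GCM, binding the row's primary
 * key bytes as additional authenticated data (AAD).  Decryption with a
 * different PK as AAD fails tag verification, which defeats
 * cut-and-paste of ciphertexts between rows.  "ct" must have room for
 * vallen bytes; "iv" is 12 bytes and must never repeat per key.
 */
static int
encrypt_bound_to_pk(const unsigned char *key, const unsigned char *iv,
                    const unsigned char *pk, int pklen,
                    const unsigned char *val, int vallen,
                    unsigned char *ct, unsigned char tag[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int         len, ctlen = -1;

    if (ctx != NULL &&
        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv) == 1 &&
        EVP_EncryptUpdate(ctx, NULL, &len, pk, pklen) == 1 &&   /* AAD: the PK */
        EVP_EncryptUpdate(ctx, ct, &len, val, vallen) == 1)
    {
        ctlen = len;
        if (EVP_EncryptFinal_ex(ctx, ct + ctlen, &len) != 1 ||
            EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag) != 1)
            ctlen = -1;
        else
            ctlen += len;
    }
    EVP_CIPHER_CTX_free(ctx);
    return ctlen;               /* -1 on failure, ciphertext length on success */
}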

> One can protect to some degree agains replay and reuse attacks, but
> protecting against silent deletion is much harder.  Protecting against
> the rows (or the entire DB) being restored at a past point in time is
> even harder -- you quickly end up wanting Merkle hash/MAC trees and key
> rotation, but this complicates everything and is performance killing.

Yep.

> > > I think any threat model where DBAs are not the threat is just not that
> > > interesting to address with crypto within postgres itself...
> > 
> > Yes, but in my analysis the only solution there is client-side
> > encryption:
> 
> For which threat model?
> 
> For threat models where the DBAs are not the threat there's no need for
> client-side encryption: just encrypt the storage at the postgres
> instance (with encrypting device drivers or -preferably- filesystems).

Agreed.

> For threat models where the DBAs are the threat then yes, client-side
> encryption works (or server-side encryption to public keys), but you
> must still bind the encrypted values to the primary keys, and you must
> provide integrity protection for as much data as possible -- see above.

Yep.

> Client-side crypto is hard to do well and still get decent performance.
> So on the whole I think that crypto is a poor fit for the DBAs-are-the-
> threat threat model.  It's better to reduce the number of DBAs/sysadmins
> and audit all privileged (and, for good measure, unprivileged) access.

Yeah, kind of.  There is the value of preventing accidental viewing of
the data by the DBA, and of course WAL and backup encryption are nice.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: partition table and stddev() /variance() behaviour

2018-06-21 Thread David Rowley
On 22 June 2018 at 00:18, Rajkumar Raghuwanshi
 wrote:
> CREATE TABLE part (c1 INT,c2 INT) PARTITION BY RANGE (c1);
> CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (1) TO (3);
> CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (3) TO (5);
>
> INSERT INTO part VALUES (1,5),(2,15),(3,3),(4,17);
>
> postgres=# SET parallel_setup_cost=0;
> postgres=# SELECT COUNT(c1),STDDEV(c2),VARIANCE(c2) FROM part;
>  count | stddev | variance
> ---++--
>  4 |  0 |0
> (1 row)

Well, that's quite surprising. It appears to be a bug in
numeric_poly_combine for machines without a working int128 type. The
parameters in accum_sum_copy are in the incorrect order.

The very minimal fix is attached, but I'll need to go look at where
the tests for this have gone.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


fix_numeric_poly_combine.patch
Description: Binary data


Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-21 Thread Amit Kapila
On Thu, Jun 21, 2018 at 11:51 AM, Amit Khandekar  wrote:
> On 20 June 2018 at 14:28, Amit Khandekar  wrote:

>>>
>>> Did you get a chance to look at it?
>>
>> Not yet, but I have planned to do this by tomorrow.
>
>
> After list_concat, the subpaths no longer has only non-partial paths,
> which it is supposed to have. So it anyways should not be used in the
> subsequent code in that function. So I think the following change
> should be good.
> -   foreach(l, subpaths)
> +   foreach(l, pathnode->subpaths)
>
> Attached patch contains the above change.
>

Thanks, Tom.  Would you like to go ahead and commit this change, since
it was suggested by you, or would you like me to take care of it, or do
you want to wait for Robert's input?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-21 Thread Don Seiler
On Wed, Jun 20, 2018 at 2:45 PM, Stephen Frost  wrote:

>
> diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
> index 7698cd1f88..088ef346a8 100644
> --- a/src/include/libpq/libpq-be.h
> +++ b/src/include/libpq/libpq-be.h
> @@ -135,6 +135,7 @@ typedef struct Port
>  */
> char   *database_name;
> char   *user_name;
> +   char   *application_name;
> char   *cmdline_options;
> List   *guc_options;
>
> We should have some comments here about how this is the "startup packet
> application_name" and that it's specifically used for the "connection
> authorized" message and that it shouldn't really be used
> post-connection-startup (the GUC should be used instead, as applications
> can change it post-startup).
>

Makes sense. I've now moved that variable declaration underneath the others
with a small comment block above it.



>
> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/
> postmaster.c
> index a4b53b33cd..8e75c80728 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -2094,6 +2094,10 @@ retry1:
>
>   pstrdup(nameptr));
> port->guc_options =
> lappend(port->guc_options,
>
>   pstrdup(valptr));
> +
> +   /* Copy application_name to port as well */
> +   if (strcmp(nameptr, "application_name") ==
> 0)
> +   port->application_name =
> pstrdup(valptr);
> }
> offset = valoffset + strlen(valptr) + 1;
> }
>
> That comment feels a bit lacking- this is a pretty special case which
> should be explained.
>

I've added a longer comment explaining why we are doing this here.



>
> diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/
> postinit.c
> index 09e0df290d..86db7630d5 100644
> --- a/src/backend/utils/init/postinit.c
> +++ b/src/backend/utils/init/postinit.c
> @@ -266,8 +266,8 @@ PerformAuthentication(Port *port)
>  #ifdef USE_SSL
> if (port->ssl_in_use)
> ereport(LOG,
> -   (errmsg("connection
> authorized: user=%s database=%s SSL enabled (protocol=%s, cipher=%s,
> bits=%d, compression=%s)",
> -
>  port->user_name, port->database_name,
> +   (errmsg("connection
> authorized: user=%s database=%s application=%s SSL enabled (protocol=%s,
> cipher=%s, bits=%d, compression=%s)",
> +
>  port->user_name, port->database_name, port->application_name
>
> be_tls_get_version(port),
>
> be_tls_get_cipher(port),
>
> be_tls_get_cipher_bits(port),
> @@ -275,8 +275,8 @@ PerformAuthentication(Port *port)
> else
>  #endif
> ereport(LOG,
> -   (errmsg("connection
> authorized: user=%s database=%s",
> -
>  port->user_name, port->database_name)));
> +   (errmsg("connection
> authorized: user=%s database=%s application=%s",
> +
>  port->user_name, port->database_name, port->application_name)));
> }
> }
>
> You definitely need to be handling the case where application_name is
> *not* passed in more cleanly.  I don't think we should output anything
> different from what we do today in those cases.
>

I've added some conditional logic similar to the ternary-operator usage
in src/backend/catalog/pg_collation.c:92. If application_name is set, we
include "application=%s" in the log message; otherwise we make no
mention of it (the output is identical to what we currently produce).
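
Purely for illustration (this is not the patch itself, and the helper
name is made up), the conditional amounts to something like:

#include <stdio.h>

/*
 * Mention application only when the client actually sent
 * application_name in the startup packet, so the existing log format
 * is unchanged otherwise.
 */
static void
log_connection_authorized(const char *user, const char *db, const char *appname)
{
    if (appname != NULL && appname[0] != '\0')
        printf("connection authorized: user=%s database=%s application=%s\n",
               user, db, appname);
    else
        printf("connection authorized: user=%s database=%s\n", user, db);
}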



> Also, these don't need to be / shouldn't be 3 seperate patches/commits,
> and there should be a sensible commit message which explains what the
> change is entirely.
>

After much head scratching/banging on both our parts yesterday, I've
finally figured this out. Thanks again for your patience and time.


> If you could update the patch accordingly and re-send, that'd be
> awesome. :)


See attached.



-- 
Don Seiler
www.seiler.us


0001-Changes-to-add-application_name-to-Port-struct-so-we.patch
Description: Binary data


Re: Sample values for pg_stat_statements

2018-06-21 Thread Daniel Gustafsson
> On 17 Apr 2018, at 17:58, Vik Fearing  wrote:

> Thanks!  Attached is a patch addressing your concerns.

I happened to notice that this patch was moved from Returned with
Feedback to Needs Review after the CF closed, which means it's now
sitting open in a closed CF.  The intended flow after RWF is that the
patch is resubmitted rather than its status changed, but since it's
there now, I guess we might as well move it to the current CF to make
sure it's not overlooked (which it will be in its current state)?

cheers ./daniel


Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Andrew Dunstan



On 06/20/2018 09:04 PM, Andres Freund wrote:


Probably couldn't hurt to run the changed files through
pgindent and fix the damage...





Looks reasonable to me, but I've not tested it.



Thanks

Patch after pgindent attached. This will involve a catversion bump since 
we're adding a new function.


cheers

andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
index 0c54b02..7a2f785 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -11,13 +11,20 @@
 
 #include "postgres.h"
 
+#include "access/heapam.h"
+#include "access/htup_details.h"
 #include "catalog/binary_upgrade.h"
+#include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
 #include "commands/extension.h"
 #include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
 
 
 #define CHECK_IS_BINARY_UPGRADE	\
@@ -192,3 +199,63 @@ binary_upgrade_set_record_init_privs(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+Datum
+binary_upgrade_set_missing_value(PG_FUNCTION_ARGS)
+{
+	Oid			table_id = PG_GETARG_OID(0);
+	text	   *attname = PG_GETARG_TEXT_P(1);
+	text	   *value = PG_GETARG_TEXT_P(2);
+	Datum		valuesAtt[Natts_pg_attribute];
+	bool		nullsAtt[Natts_pg_attribute];
+	bool		replacesAtt[Natts_pg_attribute];
+	Datum		missingval;
+	Form_pg_attribute attStruct;
+	Relation	attrrel,
+tablerel;
+	HeapTuple	atttup,
+newtup;
+	char	   *cattname = text_to_cstring(attname);
+	char	   *cvalue = text_to_cstring(value);
+
+	CHECK_IS_BINARY_UPGRADE;
+
+	/* lock the table the attribute belongs to */
+	tablerel = heap_open(table_id, AccessExclusiveLock);
+
+	/* Lock the attribute row and get the data */
+	attrrel = heap_open(AttributeRelationId, RowExclusiveLock);
+	atttup = SearchSysCacheAttName(table_id, cattname);
+	if (!HeapTupleIsValid(atttup))
+		elog(ERROR, "cache lookup failed for attribute %s of relation %u",
+			 cattname, table_id);
+	attStruct = (Form_pg_attribute) GETSTRUCT(atttup);
+
+	/* get an array value from the value string */
+	missingval = OidFunctionCall3(
+  F_ARRAY_IN,
+  CStringGetDatum(cvalue),
+  ObjectIdGetDatum(attStruct->atttypid),
+  Int32GetDatum(attStruct->atttypmod));
+
+	/* update the tuple - set atthasmissing and attmissingval */
+	MemSet(valuesAtt, 0, sizeof(valuesAtt));
+	MemSet(nullsAtt, false, sizeof(nullsAtt));
+	MemSet(replacesAtt, false, sizeof(replacesAtt));
+
+	valuesAtt[Anum_pg_attribute_atthasmissing - 1] = BoolGetDatum(true);
+	replacesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
+	valuesAtt[Anum_pg_attribute_attmissingval - 1] = missingval;
+	replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
+
+	newtup = heap_modify_tuple(atttup, RelationGetDescr(attrrel),
+			   valuesAtt, nullsAtt, replacesAtt);
+	CatalogTupleUpdate(attrrel, &newtup->t_self, newtup);
+
+	/* clean up */
+	ReleaseSysCache(atttup);
+	heap_close(attrrel, RowExclusiveLock);
+	heap_close(tablerel, AccessExclusiveLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ea2f022..e06ed60 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8095,6 +8095,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_typstorage;
 	int			i_attnotnull;
 	int			i_atthasdef;
+	int			i_atthasmissing;
 	int			i_attidentity;
 	int			i_attisdropped;
 	int			i_attlen;
@@ -8103,6 +8104,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_attoptions;
 	int			i_attcollation;
 	int			i_attfdwoptions;
+	int			i_attmissingval;
 	PGresult   *res;
 	int			ntups;
 	bool		hasdefaults;
@@ -8132,7 +8134,33 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 
 		resetPQExpBuffer(q);
 
-		if (fout->remoteVersion >= 10)
+		if (fout->remoteVersion >= 11)
+		{
+			/* atthasmissing and attmissingval are new in 11 */
+			appendPQExpBuffer(q, "SELECT a.attnum, a.attname, a.atttypmod, "
+			  "a.attstattarget, a.attstorage, t.typstorage, "
+			  "a.attnotnull, a.atthasdef, a.attisdropped, "
+			  "a.attlen, a.attalign, a.attislocal, "
+			  "pg_catalog.format_type(t.oid,a.atttypmod) AS atttypname, "
+			  "array_to_string(a.attoptions, ', ') AS attoptions, "
+			  "CASE WHEN a.attcollation <> t.typcollation "
+			  "THEN a.attcollation ELSE 0 END AS attcollation, "
+			  "a.atthasmissing, a.attidentity, "
+			  "pg_catalog.array_to_string(ARRAY("
+			  "SELECT pg_catalog.quote_ident(option_name) || "
+			  "' ' || pg_catalog.quote_literal(option_value) "
+			  "FROM pg_catalog.pg_options_to_table(attfdwoptions) "
+			  "ORDER BY option_name"
+			  

Re: Possible bug in logical replication.

2018-06-21 Thread Petr Jelinek
On 20/06/18 09:59, Arseny Sher wrote:
> 
> Michael Paquier  writes:
> 
>> On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote:
>>> On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
>>> It seems to me that we still want to have the slot forwarding finish in
>>> this case even if this is interrupted.  Petr, isn't that the intention
>>> here?
>>
>> I have been chewing a bit more on the proposed patch, finishing with the
>> attached to close the loop.  Thoughts?
> 
> Sorry for being pedantic, but it seems to me worthwhile to mention *why*
> we need decoding machinery at all -- like I wrote:
> 
> + * While we could just do LogicalConfirmReceivedLocation updating
> + * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
> + * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog 
> tuples).
> 

+1

> Also,
> 
>>   * The slot's restart_lsn is used as start point for reading records,
> 
> This is clearly seen from the code, I propose to remove this.

Given that there was just a bug fix done for this, I guess the extra
clarity there does not hurt.

> 
>>   * The LSN position to move to is checked by doing a per-record scan and
>>   * logical decoding which makes sure that confirmed_lsn is updated to a
>>   * LSN which allows the future slot consumer to get consistent logical
>> - * changes.
>> + * changes.  As decoding is done with fast_forward mode, no changes are
>> + * actually generated.
> 
> confirmed_lsn is always updated to `moveto` unless we run out of WAL
> earlier (and unless we try to move slot backwards, which is obviously
> forbidden) -- consistent changes are practically irrelevant
> here. Moreover, we can directly set confirmed_lsn and still have
> consistent changes further as restart_lsn and xmin of the slot are not
> touched. What we actually do here is trying to advance *restart_lsn and
> xmin* as far as we can but up to the point which ensures that decoding
> can assemble a consistent snapshot allowing to fully decode all COMMITs
> since updated `confirmed_flush_lsn`. All this happens in
> SnapBuildProcessRunningXacts.
> 

Those are two different things. Here it's the consistent snapshot for
logical decoding, without which we can't decode at all, and that's
handled by restart_lsn and xmin. But the consistent stream of data for
the consumer is handled by confirmed_lsn (and the comment says that).
You have to take into account that the next use of the slot may consume
data (i.e. it may be done via one of the other interfaces and not by a
move). So I think what Michael has here is correct.

>> It seems to me that we still want to have the slot forwarding finish in
>> this case even if this is interrupted.  Petr, isn't that the intention
>> here?
> 

Well, it seems wasteful to just exit there if we have already finished
all the requested work; it also gives some consistency with the coding
of get/peek_changes. Not very important for the functionality either
way.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: PATCH: backtraces for error messages

2018-06-21 Thread Kyotaro HORIGUCHI
Hello, I basically like this.

At Thu, 21 Jun 2018 12:35:10 +0800, Craig Ringer  wrote 
in 
> This is what the stacks look like btw
> 
> 
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] DEBUG:  0:
> find_in_dynamic_libpath: trying
> "/home/craig/pg/10/lib/postgresql/pglogical.so"
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] LOCATION:
> find_in_dynamic_libpath, dfmgr.c:639
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] STACK:
> FRAME0: find_in_dynamic_libpath +628
> FRAME1: expand_dynamic_library_name +205
> FRAME2: load_external_function +38
> FRAME3: LookupBackgroundWorkerFunction +197
> FRAME4: StartBackgroundWorker +549
> FRAME5: do_start_bgworker +466
> FRAME6: maybe_start_bgworkers +382
> FRAME7: reaper +895
> FRAME8: funlockfile +80
> FRAME9: __select +23
> FRAME   10: ServerLoop +394
> FRAME   11: PostmasterMain +4499
> 
> I wrote it because I got sick of Assert(false) debugging, and I was chasing
> down some "ERROR:  08P01: insufficient data left in message" errors. Then I
> got kind of caught up in it... you know how it is.
> 
> It also goes to show there are plenty of places you might want to get a
> stack where you don't have an internal errcode or panic. I don't think that
> idea will fly.
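
For readers following along, frames like the ones quoted above can be
produced with the glibc backtrace facility; a minimal sketch of that
primitive (not Craig's actual implementation, which may differ):

#include <execinfo.h>
#include <unistd.h>

/*
 * Capture up to 32 return addresses from the current stack and write
 * symbolized frames to stderr.  Resolution quality depends on symbols
 * being available (e.g. building with -rdynamic).
 */
static void
dump_stack(void)
{
    void   *frames[32];
    int     nframes = backtrace(frames, 32);

    backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
}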

I think this is fine for assertion failures, but I'm not sure about the
other cases. For known cases we could set a proper context description
or other additional information in the error message before just dumping
a trace.

Since PG 9.5 the RPMs are compiled with --enable-dtrace, so we could use
SystemTap to insert the stack-trace stuff. An additional built-in probe
in the error reporting system or in assertions would make this
pluggable.

However, SystemTap doesn't work for me and I'm not sure how it is
broken. (The stap version is 3.2/0.170 and uname -r shows 3.10.0-862.)

$ sudo stap -e  'probe process(".../bin/postgres").mark("transaction__start"){}'
...
LOG:  autovacuum launcher process (PID 10592) was terminated by signal 4: 
Illegal instruction


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PATCH: backtraces for error messages

2018-06-21 Thread Pavan Deolasee
On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier 
wrote:

> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> > I wrote it because I got sick of Assert(false) debugging, and I was
> chasing
> > down some "ERROR:  08P01: insufficient data left in message" errors.
> Then I
> > got kind of caught up in it... you know how it is.
>
> Yes, I know that feeling!  I have been using as well the Assert(false)
> and the upgrade-to-PANIC tricks a couple of times, so being able to get
> more easily backtraces would be really nice.
>
>
Some time back I suggested an idea to be able to dynamically manage log
levels for elog messages [1]. That did not draw much response, but I
still believe such a facility would be very useful for debugging
production systems. I concede that the POC patch I sent is horribly
written and has a bad UI, but those can be improved if there is
interest. Given the lack of response, though, I suspect there isn't
enough interest in the feature.

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/CABOikdMvx_Kr_b4ELhJEoeGcLTZKrDma%2BfPZpoZVdpL7Zc0bVw%40mail.gmail.com

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WAL prefetch

2018-06-21 Thread Konstantin Knizhnik

I am continuing my experiments with WAL prefetch.
I have embedded prefetch in Postgres: a walprefetcher process is now
started together with the startup process and can help it speed up
recovery.

The patch is attached.
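
For readers without the patch: the core of such a prefetcher is hinting
the kernel about relation blocks that upcoming WAL records will touch,
typically via posix_fadvise(); a minimal sketch (the helper name and
block handling are illustrative, not taken from the patch):

#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to start reading one relation block referenced by a
 * WAL record, without blocking the caller.  A real prefetcher has to
 * locate the right file descriptor and block size itself.
 */
static void
prefetch_block(int fd, off_t blkno, off_t blcksz)
{
    (void) posix_fadvise(fd, blkno * blcksz, blcksz, POSIX_FADV_WILLNEED);
}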

Unfortunately the result is negative (at least on my desktop: SSD, 16Gb
RAM). Recovery with prefetch is 3 times slower than without it.

What I am doing:

Configuration:
    max_wal_size = min_wal_size = 10Gb,
    shared_buffers = 1Gb
Database:
    pgbench -i -s 1000
Test:
    pgbench -c 10 -M prepared -N -T 100 -P 1
    pkill postgres
    echo 3 > /proc/sys/vm/drop_caches
    time pg_ctl -t 1000 -D pgsql -l logfile start

Without prefetch recovery takes 19 seconds (about 4Gb of WAL recovered);
with prefetch it takes about one minute. About 400k blocks are
prefetched.

CPU usage is small (<20%); both processes are in "Ds" state.

vmstat without prefetch shows the following output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  2 2667964 11465832   7892 2515588    0    0 344272      2 6129 22290  8  4 84  5  0
 3  1 2667960 10013900   9516 3963056    6    0 355606   8772 7412 25228 12  6 74  8  0
 1  0 2667960 8526228  11036 5440192    0    0 366910    242 6123 19476  8  5 83  3  0
 1  1 2667960 7824816  11060 6141920    0    0 166860 171638 9581 24746  4  4 79 13  0
 0  4 2667960 7822824  11072 6143788    0    0    264 376836 19292 49973  1  3 69 27  0
 1  0 2667960 7033140  11220 6932400    0    0 188810 168070 14610 41390  5  4 72 19  0
 1  1 2667960 5739616  11384 8226148    0    0 254492  57884 6733 19263  8  5 84  4  0
 0  3 2667960 5024380  11400 8941532    0    0      8 398198 18164 45782  2  5 70 23  0
 0  0 2667960 5020152  11428 8946000    0    0    168  69128 3918 10370  2  1 91  6  0


with prefetch:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  2 2651816 12340648  11148 1564420    0    0 178980     96 4411 14237  5  2 90  3  0
 2  0 2651816 11771612  11712 2132180    0    0 169572      0 6388 18244  5  3 72 20  0
 2  0 2651816 11199248  12008 2701960    0    0 168966    162 6677 18816  5  3 72 20  0
 1  3 2651816 10660512  12028 3241604    0    0 162666     16 7065 21668  6  5 69 20  0
 0  2 2651816 10247180  12052 3653888    0    0 131564  18112 7376 22023  6  3 69 23  0
 0  2 2651816 9850424  12096 4064980    0    0 133158    238 6398 17557  4  2 71 22  0
 2  0 2651816 9456616  12108 4459456    0    0 134702     44 6219 16665  3  2 73 22  0
 0  2 2651816 9161336  12160 4753868    0    0     68  74408 8038 20440  3  3 69 25  0
 3  0 2651816 8810336  12172 5106068    0    0 134694      0 6251 16978  4  2 73 22  0
 0  2 2651816 8451924  12192 5463692    0    0 137546     80 6264 16930  3  2 73 22  0
 1  1 2651816 8108000  12596 5805856    0    0 135212     10 6218 16827  4  2 72 22  0
 1  3 2651816 7793992  12612 6120376    0    0 135072      0 6233 16736  3  2 73 22  0
 0  2 2651816 7507644  12632 6406512    0    0 134830     90 6267 16910  3  2 73 22  0
 0  2 2651816 7246696  12776 6667804    0    0 122656  51820 7419 19384  3  3 71 23  0
 1  2 2651816 6990080  12784 6924352    0    0 121248  55284 7527 19794  3  3 71 23  0
 0  3 2651816 6913648  12804 7000376    0    0  36078 295140 14852 37925  2  3 67 29  0
 0  2 2651816 6873112  12804 7040852    0    0  19180 291330 16167 41711  1  3 68 28  0
 5  1 2651816 6641848  12812 7271736    0    0 107696     68 5760 15301  3  2 73 22  0
 3  1 2651816 6426356  12820 7490636    0    0 103412      0 5942 15994  3  2 72 22  0
 0  2 2651816 6195288  12824 7720720    0    0 104446      0 5605 14757  3  2 73 22  0
 0  2 2651816 5946876  12980 7970912    0    0 113340     74 5980 15678  3  2 71 24  0
 1  2 2651816 5655768  12984 8262880    0    0 137290      0 6235 16412  3  2 73 21  0
 2  0 2651816 5359548  13120 8557072    0    0 137608     86 6309 16658  3  2 73 21  0
 2  0 2651816 5068268  13124 8849136    0    0 137386      0 6225 16589  3  2 73 21  0
 2  0 2651816 4816812  13124 9100600    0    0 120116  53284 7273 18776  3  2 72 23  0
 0  2 2651816 4563152  13132 9353232    0    0 117972  54352 7423 19375  3  2 73 22  0
 1  2 2651816 4367108  13144 9549712    0    0  51994 239498 10846 25987  3  5 73 19  0
 0  0 2651816 4366356  13164 9549892    0    0    168 294196 14981 39432  1  3 79 17  0



So as you can see, the read speed with prefetch is lower: < 130Mb/sec,
versus up to 366Mb/sec without prefetch.
My hypothesis is that prefetch pushes dirty pages out of the cache and,
as a result, more data has to be written and backends are more
frequently blocked in write.

In any case, a very upsetting result.
Any comments are welcome.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/xlogreader.c 

Re: partition table and stddev() /variance() behaviour

2018-06-21 Thread Tom Lane
David Rowley  writes:
> Well, that's quite surprising. It appears to be a bug in
> numeric_poly_combine for machines without a working int128 type. The
> parameters in accum_sum_copy are in the incorrect order.

Ouch.

> The very minimal fix is attached, but I'll need to go look at where
> the tests for this have gone.

coverage.postgresql.org shows that numeric_poly_serialize/combine()
aren't exercised at all by the regression tests.  Which is embarrassing
for this case, but I'm a bit leery of trying to insist on 100% coverage.

It might be a plan to insist on buildfarm coverage for anything with
platform-varying code in it, in which case there's at least one
other undertested bit of HAVE_INT128 code in numeric.c.

regards, tom lane



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-06-21 Thread Ashutosh Sharma
On Wed, Jun 20, 2018 at 1:00 PM, Haribabu Kommi
 wrote:
> The pg_stat_statements view contains statistics of queries that are
> cumulative. I find that any optimizations done to improve the
> performance of a query are not clearly visible until the stats are
> reset. Currently there is a function to reset all the statistics; I
> think it would be useful to have a function that resets the stats of
> a single query instead of resetting all the queries.
>
> Attached is a simple patch with the implementation. Comments?

The idea looks interesting to me. But, as Euler Taveira mentioned, can't
we extend the existing function pg_stat_statements_reset() instead of
adding a new one, and update its documentation? Also, in the test case
it would be good to display the output of pg_stat_statements before and
after deleting the query. Thanks.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: Postgres 11 release notes

2018-06-21 Thread Alexander Korotkov
On Tue, Jun 19, 2018 at 1:40 PM Alexander Korotkov
 wrote:
> On Tue, Jun 19, 2018 at 12:15 PM Alexander Korotkov
>  wrote:
> > On Sat, Jun 16, 2018 at 3:57 PM Darafei "Komяpa" Praliaskouski
> >  wrote:
> > >>
> > >> > I'm not sure it is useful in release notes since it is more about the
> > >> > API, and not a user-facing change. Just in case:
> > >> > GiST opclasses now can omit compress and decompress functions. If the
> > >> > compress function is omitted, IndexOnlyScan is enabled for the opclass
> > >> > without any extra change.
> > >> > https://github.com/postgres/postgres/commit/d3a4f89d8a3e500bd7c0b7a8a8a5ce1b47859128
> > >>
> > >> Uh, we do have this for SP-GiST:
> > >>
> > >> Allow SP-GiST indexes to optionally use compression (Teodor Sigaev,
> > >> Heikki Linnakangas, Alexander Korotkov, Nikita Glukhov)
> > >>
> > >> I am unclear how far down the API stack I should go in documenting
> > >> changes like this.
> > >
> > >
> > > It is also a bit misleading - the idea in that change is that the index
> > > representation can now be a lossy version of the actual data type (a box
> > > instead of a polygon, for reference), so a changelog entry could say "Allow
> > > SP-GiST index creation for the polygon datatype."  There is no
> > > "decompression" for such a thing, and "compression" sounds like gzip to me
> > > in a user-facing context.
> >
> > +1, the current wording looks confusing.  But I think we need to
> > highlight that this is a general SP-GiST improvement, not just support
> > for a particular datatype.  So I propose the following wording: "Allow
> > SP-GiST to use a lossy representation of leaf keys, and add SP-GiST
> > support for the polygon type using that".
>
> Oh, I missed that we have a separate release notes entry for polygon
> indexing.  Then the second part of the sentence isn't needed; it should be
> just "Allow SP-GiST to use a lossy representation of leaf keys".  If there
> are no objections, I'm going to commit that together with the fixes by
> Liudmila.

The wording improvement is pushed.  The typo fixes were already pushed by Magnus.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-21 Thread Tom Lane
Amit Kapila  writes:
> On Thu, Jun 21, 2018 at 11:51 AM, Amit Khandekar  
> wrote:
>> After list_concat, the subpaths list no longer has only non-partial paths,
>> which it is supposed to have. So it should not be used anyway in the
>> subsequent code in that function. So I think the following change
>> should be good.
>> -   foreach(l, subpaths)
>> +   foreach(l, pathnode->subpaths)

Patch LGTM.

> Thanks, Tom. Would you like to go ahead and commit this change, since it
> was suggested by you, or would you like me to take care of it, or do
> you want to wait for Robert's input?

Please push --- I'm busy getting ready to leave for vacation ...

regards, tom lane



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Bruce Momjian
On Wed, Jun 20, 2018 at 05:28:43PM -0500, Nico Williams wrote:
> On Wed, Jun 20, 2018 at 06:19:40PM -0400, Joe Conway wrote:
> > On 06/20/2018 05:12 PM, Bruce Momjian wrote:
> > > On Mon, Jun 18, 2018 at 11:06:20AM -0400, Joe Conway wrote:
> > > Even if they are encrypted with the same key, they use different
> > > initialization vectors that are stored inside the encrypted payload, so
> > > you really can't identify much except the length, as Robert stated.
> 
> Definitely use different IVs, and don't reuse them (or use cipher modes
> where IV reuse is not fatal).
> 
> > The more you encrypt with a single key, the more fuel you give to the
> > person trying to solve for the key with cryptanalysis.
> 
> With modern 128-bit block ciphers in modern cipher modes you'd have to
> encrypt enough data to make this not a problem.  On the other hand,
> you'll still have other reasons to do key rotation.  Key rotation
> ultimately means re-encrypting everything.  Getting all of this right is
> very difficult.
> 
> So again, what's the threat model?  Because if it's sysadmins/DBAs
> you're afraid of, there are better things to do.

Agreed.  Databases just don't map well to the typical cryptographic
solutions and threat models.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-21 Thread Konstantin Knizhnik

Hi hackers,

I hope that somebody understands postgres_fdw cost calculation magic 
better than I do ;)
The following very simple test reproduces the problem with wrong cost 
estimation:



create table t1(x integer primary key, y integer);
create index on t1(y);
insert into t1 values (generate_series(1,1000000), generate_series(1000001,2000000));

create table t2(x integer primary key);
insert into t2 values (generate_series(1,1000000));
create server pg_fdw FOREIGN DATA WRAPPER postgres_fdw options(host 'localhost', dbname 'postgres');
create foreign table t1_fdw(x integer, y integer) server pg_fdw options (table_name 't1', use_remote_estimate 'false');
create foreign table t2_fdw(x integer) server pg_fdw options (table_name 't2', use_remote_estimate 'false');

analyze t1;
analyze t2;
analyze t2_fdw;
analyze t1_fdw;
explain select * from t1_fdw join t2_fdw on t1_fdw.x=t2_fdw.x where y in (1234567,1234577,1234667,1235567,1244567,1334567);


                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=22125.20..60300.26 rows=6 width=12) (actual time=439.187..1849.459 rows=6 loops=1)
   Hash Cond: (t2_fdw.x = t1_fdw.x)
   ->  Foreign Scan on t2_fdw  (cost=100.00..34525.00 rows=1000000 width=4) (actual time=0.526..1711.671 rows=1000000 loops=1)
   ->  Hash  (cost=22025.12..22025.12 rows=6 width=8) (actual time=0.511..0.511 rows=6 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Foreign Scan on t1_fdw  (cost=100.00..22025.12 rows=6 width=8) (actual time=0.506..0.507 rows=6 loops=1)
 Planning Time: 0.173 ms
 Execution Time: 1849.871 ms
(8 rows)

So instead of pushing the join down to the remote server, the optimizer 
decides that it is more efficient to perform the join locally.

If the IN list contains fewer alternatives (<= 2), then the correct plan is used:

postgres=# explain select * from t1_fdw join t2_fdw on t1_fdw.x=t2_fdw.x where y in (1234567,1234577);

                       QUERY PLAN
--------------------------------------------------------
 Foreign Scan  (cost=100.00..41450.04 rows=2 width=12)
   Relations: (public.t1_fdw) INNER JOIN (public.t2_fdw)
(2 rows)


It is possible to force Postgres to use the correct plan by setting 
"fdw_startup_cost" to some very large value (10000, for example).


The correct plan is also used when use_remote_estimate is true. But in this 
case query optimization time becomes too large (not in this dummy example, 
but on a real database, a query joining many large tables takes about 
10 seconds to perform remote estimation of all joined tables).


Please notice that the optimizer correctly estimates the number of retrieved 
rows: 6.

But it overestimates the cost of the remote join.
Looks like it is because of the following code in estimate_path_cost_size:

            /*
             * Run time cost includes:
             *
             * 1. Run time cost (total_cost - startup_cost) of relations being
             * joined
             *
             * 2. Run time cost of applying join clauses on the cross product
             * of the joining relations.
             *
             * 3. Run time cost of applying pushed down other clauses on the
             * result of join
             *
             * 4. Run time cost of applying nonpushable other clauses locally
             * on the result fetched from the foreign server.
             */
            run_cost = fpinfo_i->rel_total_cost - fpinfo_i->rel_startup_cost;
            run_cost += fpinfo_o->rel_total_cost - fpinfo_o->rel_startup_cost;

            run_cost += nrows * join_cost.per_tuple;
            nrows = clamp_row_est(nrows * fpinfo->joinclause_sel);
            run_cost += nrows * remote_conds_cost.per_tuple;
            run_cost += fpinfo->local_conds_cost.per_tuple * retrieved_rows;


2815            run_cost = fpinfo_i->rel_total_cost - fpinfo_i->rel_startup_cost;
(gdb) p fpinfo_i->rel_total_cost
$23 = 14425
2816            run_cost += fpinfo_o->rel_total_cost - fpinfo_o->rel_startup_cost;
(gdb) p fpinfo_o->rel_total_cost
$25 = 21925
2817            run_cost += nrows * join_cost.per_tuple;
(gdb) p run_cost
$26 = 36350

I wonder if it is possible to make the estimation of the foreign join cost 
more precise.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Andrew Dunstan




On 06/21/2018 01:18 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 06/21/2018 12:39 PM, Tom Lane wrote:

Andres Freund  writes:

On June 21, 2018 9:04:28 AM PDT, Tom Lane  wrote:
This isn't really ready to go.  One clear problem is that you broke
pg_dump'ing from any pre-v11 version, because you did not add suitable
null outputs to the pre-v11 query variants in getTableAttrs.
I followed the pattern used for attidentity. But why will it spit out
warnings? It doesn't ask for the missing stuff from an older version.

Oh, I see.  Well, the short answer is that that's not the style we use
in pg_dump, and the attidentity code is inappropriate/wrong too IMO.
When something has been done one way a hundred times before, thinking
you're too smart for that and you'll do it some other way does not lend
itself to code clarity or reviewability.

I might be OK with a patch that converts *all* of pg_dump's cross-version
difference handling code to depend on PQfnumber silently returning -1
rather than failing, but I don't want to see it done like that in just
one or two places.





I don't mind changing it. But please note that I wouldn't have done it 
that way unless there were a precedent. I fully expected to add dummy 
values to all the previous queries, but when I couldn't find attidentity 
in them to put them next to I followed that example.


I don't think it's at all a bad idea, TBH. It means you only have to add 
one query per version rather than having to adjust all of them.


cheers

andrew

--
Andrew Dunstan                 https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Bruce Momjian
On Wed, Jun 20, 2018 at 06:19:40PM -0400, Joe Conway wrote:
> On 06/20/2018 05:12 PM, Bruce Momjian wrote:
> > On Mon, Jun 18, 2018 at 11:06:20AM -0400, Joe Conway wrote:
> >>> At the same time, having to have a bunch of independently-decipherable
> >>> short field values is not real secure either, especially if they're known
> >>> to all be encrypted with the same key.  But what you know or can guess
> >>> about the plaintext in such cases would be target-specific, rather than
> >>> an attack that could be built once and used against any PG database.
> >>
> >> Again is dependent on the specific solution for encryption. In some
> >> cases you might do something like generate a single use random key,
> >> encrypt the payload with that, encrypt the single use key with the
> >> "global" key, append the two results and store.
> > 
> > Even if they are encrypted with the same key, they use different
> > initialization vectors that are stored inside the encrypted payload, so
> > you really can't identify much except the length, as Robert stated.
> 
> The more you encrypt with a single key, the more fuel you give to the
> person trying to solve for the key with cryptanalysis.
> 
> By encrypting only essentially random data (the single use keys,
> generated with cryptographically strong random number generator) with
> the "master key", and then encrypting the actual payloads (which are
> presumably more predictable than the strong random single use keys), you
> minimize the probability of someone cracking your master key and you
> also minimize the damage caused by someone cracking one of the single
> use keys.

Yeah, I have a slide about that too, and the previous and next slide:

http://momjian.us/main/writings/crypto_hw_use.pdf#page=90

The more different keys you use to encrypt data, the more places you
have to store them.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
Andres Freund  writes:
> On June 21, 2018 9:04:28 AM PDT, Tom Lane  wrote:
>> This isn't really ready to go.  One clear problem is that you broke
>> pg_dump'ing from any pre-v11 version, because you did not add suitable
>> null outputs to the pre-v11 query variants in getTableAttrs.

> Thought the same for a bit - think it's OK though, because there's a check for 
> PQfnumber being bigger than 0...

It won't actually crash, but it will spit out lots of ugly warnings.

regards, tom lane



Re: Wrong cost estimation for foreign tables join with use_remote_estimate disabled

2018-06-21 Thread Tom Lane
Konstantin Knizhnik  writes:
> The following very simple test reduce the problem with wrong cost 
> estimation:
> create foreign table t1_fdw(x integer, y integer) server pg_fdw options 
> (table_name 't1', use_remote_estimate 'false');
> create foreign table t2_fdw(x integer) server pg_fdw options (table_name 
> 't2', use_remote_estimate 'false');

> It is possible to force Postgres to use correct plan by setting 
> "fdw_startup_cost" to some very large value (1 for example).
> ...
> Also correct plan is used when use_remote_estimate is true.

If you are unhappy about the results with use_remote_estimate off, don't
run it that way.  The optimizer does not have a crystal ball.

regards, tom lane



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Andres Freund
On 2018-06-21 13:44:19 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 06/21/2018 01:18 PM, Tom Lane wrote:
> >> I might be OK with a patch that converts *all* of pg_dump's cross-version
> >> difference handling code to depend on PQfnumber silently returning -1
> >> rather than failing, but I don't want to see it done like that in just
> >> one or two places.
> 
> > I don't mind changing it. But please note that I wouldn't have done it 
> > that way unless there were a precedent. I fully expected to add dummy 
> > values to all the previous queries, but when I couldn't find attidentity 
> > in them to put them next to I followed that example.
> 
> Actually, now that I think about it, there is a concrete reason for the
> historical pattern: it provides a cross-check that you did not fat-finger
> the query, ie misspell the column alias vs the PQfnumber parameter.  This
> gets more valuable the more per-version variants of the query there are.
> With the way the attidentity code does it, it would just silently act as
> though the column has its default value, which you might or might not
> notice in cursory testing.  Getting visible bleats about column number -1
> is much more likely to get your attention.

To me that benefit is far smaller than the increased likelihood of
screwing something up, because you'd have to touch pg_dump support for
postgres versions one likely doesn't readily have available for testing.

Greetings,

Andres Freund



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
Andres Freund  writes:
> On 2018-06-21 13:44:19 -0400, Tom Lane wrote:
>> Actually, now that I think about it, there is a concrete reason for the
>> historical pattern: it provides a cross-check that you did not fat-finger
>> the query, ie misspell the column alias vs the PQfnumber parameter.  This
>> gets more valuable the more per-version variants of the query there are.
>> With the way the attidentity code does it, it would just silently act as
>> though the column has its default value, which you might or might not
>> notice in cursory testing.  Getting visible bleats about column number -1
>> is much more likely to get your attention.

> To me that benefit is far smaller than the increased likelihood of
> screwing something up, because you'd have to touch pg_dump support for
> postgres versions one likely doesn't readily have available for testing.

I beg to differ: code that works "by omission" is just as likely to be
wrong.  Anyway, our solution for unmaintainable back branches has been
to drop support, cf where we got rid of the pre-8.0 queries awhile back.
One (admittedly small) problem with the "i_attidentity >= 0 ? ..." coding
is that it's not obvious when you can get rid of it because you dropped
the last old branch that needed it.  After that, it's just a hazard for
hiding mistakes.
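
To make the two idioms concrete, here is a small standalone illustration
(plain libpq, not the actual getTableAttrs code; the empty connection string
and the one-row query are placeholders):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");
        PGresult   *res;
        int         i_attmissingval;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /*
         * Historical pg_dump style: every per-version query provides the
         * column, as a dummy NULL on servers that lack it, so a misspelled
         * alias surfaces as PQfnumber() returning -1 and loud complaints.
         */
        res = PQexec(conn,
                     PQserverVersion(conn) >= 110000
                     ? "SELECT attname, attmissingval FROM pg_attribute LIMIT 1"
                     : "SELECT attname, NULL AS attmissingval FROM pg_attribute LIMIT 1");
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
            PQclear(res);
            PQfinish(conn);
            return 1;
        }

        i_attmissingval = PQfnumber(res, "attmissingval");

        /*
         * attidentity-style alternative under discussion: tolerate a missing
         * column and silently fall back to a default value.
         */
        if (i_attmissingval >= 0)
            printf("attmissingval = %s\n", PQgetvalue(res, 0, i_attmissingval));
        else
            printf("column not selected, assuming default\n");

        PQclear(res);
        PQfinish(conn);
        return 0;
    }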

regards, tom lane



Re: partition table and stddev() /variance() behaviour

2018-06-21 Thread Tom Lane
David Rowley  writes:
> On 22 June 2018 at 02:01, Tom Lane  wrote:
>> coverage.postgresql.org shows that numeric_poly_serialize/combine()
>> aren't exercised at all by the regression tests.  Which is embarrassing
>> for this case, but I'm a bit leery of trying to insist on 100% coverage.
>> It might be a plan to insist on buildfarm coverage for anything with
>> platform-varying code in it, in which case there's at least one
>> other undertested bit of HAVE_INT128 code in numeric.c.

> I think some coverage of the numerical aggregates is a good idea, so
> I've added some in the attached. I managed to get a parallel plan
> going with a query to onek, which is pretty cheap to execute. I didn't
> touch the bool aggregates. Maybe I should have done that too..?

This sort of blunderbuss testing was exactly what I *didn't* want to do.
Not only is this adding about 20x as many cycles as we need (at least for
this specific numeric_poly_combine issue), but I'm quite afraid that the
float4 and/or float8 cases will show low-order-digit irreproducibility
in the buildfarm.  I think we should look at the coverage report for the
files in question and add targeted tests to cover gaps where there's
platform-varying code, such that the buildfarm might expose problems
that were missed by the code's author.

regards, tom lane



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
Andrew Dunstan  writes:
> Patch after pgindent attached. This will involve a catversion bump since 
> we're adding a new function.

This isn't really ready to go.  One clear problem is that you broke
pg_dump'ing from any pre-v11 version, because you did not add suitable
null outputs to the pre-v11 query variants in getTableAttrs.

Personally, I'd have made the pg_dump changes simpler and cheaper by
storing only one new value per column not two.  You could have just
stored attmissingval, or if you want to be paranoid you could do

case when atthasmissing then attmissingval else null end

We could do without the unrelated whitespace changes in dumpDefaultACL,
too (or did pgindent introduce those?  Seems surprising ... oh, looks
like "type" has somehow snuck into typedefs.list.  Time for another
blacklist entry.)

In dumpTableSchema, I'd suggest emitting the numeric table OID, rather
than hacking around with regclass.

regards, tom lane



Re: libpq compression

2018-06-21 Thread Konstantin Knizhnik




On 21.06.2018 17:56, Robbie Harwood wrote:

Konstantin Knizhnik  writes:


On 20.06.2018 23:34, Robbie Harwood wrote:

Konstantin Knizhnik  writes:


My idea was the following: client want to use compression. But server
may reject this attempt (for any reasons: it doesn't support it, has
no proper compression library, do not want to spend CPU for
decompression,...) Right now compression algorithm is hardcoded. But
in future client and server may negotiate to choose proper compression
protocol.  This is why I prefer to perform negotiation between client
and server to enable compression.
Well, for negotiation you could put the name of the algorithm you want
in the startup.  It doesn't have to be a boolean for compression, and
then you don't need an additional round-trip.

Sorry, I can only repeat arguments I already mentioned:
- in future it may be possible to specify compression algorithm
- even with boolean compression option server may have some reasons to
reject client's request to use compression

Extra flexibility is always good thing if it doesn't cost too much. And
extra round of negotiation in case of enabling compression seems to me
not to be a high price for it.

You already have this flexibility even without negotiation.  I don't
want you to lose your flexibility.  Protocol looks like this:

- Client sends connection option "compression" with list of algorithms
   it wants to use (comma-separated, or something).

- For the first packet it sends back, the server compresses with one of those
   algorithms (or none, if it doesn't want to turn on compression).

No additional round-trips needed.


This is exactly how it works now...
The client includes the compression option in the connection string and the 
server replies with a special message ('Z') if it accepts the request to 
compress traffic between this client and server.
I do not know whether sending such a message can be considered a 
"round-trip" or not, but I do not want to lose the possibility of making the 
decision at the server about whether to use compression or not.





Well, that's a design decision you've made.  You could put lengths on
chunks that are sent out - then you'd know exactly how much is needed.
(For instance, 4 bytes of network-order length followed by a complete
payload.)  Then you'd absolutely know whether you have enough to
decompress or not.

Do you really suggest to send extra header for each chunk of data?
Please notice that chunk can be as small as one message: dozen of bytes
because libpq is used for client-server communication with request-reply
pattern.

I want you to think critically about your design.  I *really* don't want
to design it for you - I have enough stuff to be doing.  But again, the
design I gave you doesn't necessarily need that - you just need to
properly buffer incomplete data.


Right now secure_read may return any number of available bytes. But in 
the case of streaming compression, it can happen that the available number 
of bytes is not enough to perform decompression. This is why we may need 
to try to fetch an additional portion of data. This is how zpq_stream 
works now.


I do not understand how it is possible to implement this in a different way, 
and what is wrong with the current implementation.



Frankly speaking I do not completely understand the source of your
concern.  My primary idea was to preserve the behavior of the libpq functions as
much as possible, so there is no need to rewrite all the places in the
Postgres code where they are used.  It seems to me that I succeeded in
reaching this goal. In case of enabled compression, the zpq_stream functions
(zpq_read/write) are used instead of (pq)secure_read/write and in turn
use them to fetch/send more data. I do not see any bad flaws,
encapsulation violations or other problems in such a solution.

So before discussing some alternative ways of embedding compression in
libpq, I want to understand what's wrong with this approach.

You're destroying the existing model for no reason.


Why? Sorry, I really do not understand why adding compression in this 
way breaks the existing model.

Can you please explain it to me once again?


   If you needed to, I
could understand some argument for the way you've done it, but what I've
tried to tell you is that you don't need to do so.  It's longer this
way, and it *significantly* complicates the (already difficult to reason
about) connection state machine.

I get that rewriting code can be obnoxious, and it feels like a waste of
time when we have to do so.  (I've been there; I'm on version 19 of my
postgres patchset.)


I am not against rewriting code many times, but first I should 
understand the problem which needs to be solved.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/21/2018 12:39 PM, Tom Lane wrote:
>> Andres Freund  writes:
> On June 21, 2018 9:04:28 AM PDT, Tom Lane  wrote:
> This isn't really ready to go.  One clear problem is that you broke
> pg_dump'ing from any pre-v11 version, because you did not add suitable
> null outputs to the pre-v11 query variants in getTableAttrs.

> I followed the pattern used for attidentity. But why will it spit out 
> warnings? It doesn't ask for the missing stuff from an older version.

Oh, I see.  Well, the short answer is that that's not the style we use
in pg_dump, and the attidentity code is inappropriate/wrong too IMO.
When something has been done one way a hundred times before, thinking
you're too smart for that and you'll do it some other way does not lend
itself to code clarity or reviewability.

I might be OK with a patch that converts *all* of pg_dump's cross-version
difference handling code to depend on PQfnumber silently returning -1
rather than failing, but I don't want to see it done like that in just
one or two places.

regards, tom lane



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/21/2018 01:18 PM, Tom Lane wrote:
>> I might be OK with a patch that converts *all* of pg_dump's cross-version
>> difference handling code to depend on PQfnumber silently returning -1
>> rather than failing, but I don't want to see it done like that in just
>> one or two places.

> I don't mind changing it. But please note that I wouldn't have done it 
> that way unless there were a precedent. I fully expected to add dummy 
> values to all the previous queries, but when I couldn't find attidentity 
> in them to put them next to I followed that example.

Actually, now that I think about it, there is a concrete reason for the
historical pattern: it provides a cross-check that you did not fat-finger
the query, ie misspell the column alias vs the PQfnumber parameter.  This
gets more valuable the more per-version variants of the query there are.
With the way the attidentity code does it, it would just silently act as
though the column has its default value, which you might or might not
notice in cursory testing.  Getting visible bleats about column number -1
is much more likely to get your attention.

So I'm thinking that the attidentity code is just wrong, and you should
change that too while you're at it.

regards, tom lane



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Andrew Dunstan




On 06/21/2018 01:44 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 06/21/2018 01:18 PM, Tom Lane wrote:

I might be OK with a patch that converts *all* of pg_dump's cross-version
difference handling code to depend on PQfnumber silently returning -1
rather than failing, but I don't want to see it done like that in just
one or two places.

I don't mind changing it. But please note that I wouldn't have done it
that way unless there were a precedent. I fully expected to add dummy
values to all the previous queries, but when I couldn't find attidentity
in them to put them next to I followed that example.

Actually, now that I think about it, there is a concrete reason for the
historical pattern: it provides a cross-check that you did not fat-finger
the query, ie misspell the column alias vs the PQfnumber parameter.  This
gets more valuable the more per-version variants of the query there are.
With the way the attidentity code does it, it would just silently act as
though the column has its default value, which you might or might not
notice in cursory testing.  Getting visible bleats about column number -1
is much more likely to get your attention.

So I'm thinking that the attidentity code is just wrong, and you should
change that too while you're at it.






That should be backpatched if changed, no? I don't think we'd want this 
to get out of sync between the branches. It would make later 
backpatching more difficult for one thing.


cheers

andrew

--
Andrew Dunstan                 https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: partition table and stddev() /variance() behaviour

2018-06-21 Thread David Rowley
On 22 June 2018 at 03:11, David Rowley  wrote:
> I think some coverage of the numerical aggregates is a good idea, so
> I've added some in the attached. I managed to get a parallel plan
> going with a query to onek, which is pretty cheap to execute. I didn't
> touch the bool aggregates. Maybe I should have done that too..?

I should have mentioned; I tested this on a Windows VS2012 machine and
a Linux x86-64 machine and those results were fine with each. I think
the float4 and float8 aggregate results should be fairly stable since
the numbers being aggregated are all very far from the extreme range
of those types. If we do get some problems with some buildfarm members
producing slightly different results then we can probably cast the
result of the aggregate to numeric and round() them a bit.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
I wrote:
> We could do without the unrelated whitespace changes in dumpDefaultACL,
> too (or did pgindent introduce those?  Seems surprising ... oh, looks
> like "type" has somehow snuck into typedefs.list.  Time for another
> blacklist entry.)

Hmm .. actually, "type" already is in pgindent's blacklist.  Are you
using an up-to-date pgindent'ing setup?

regards, tom lane



Re: WAL prefetch

2018-06-21 Thread Tomas Vondra




On 06/21/2018 04:01 PM, Konstantin Knizhnik wrote:

I continue my experiments with WAL prefetch.
I have embedded prefetch in Postgres: now the walprefetcher is started 
together with the startup process and is able to help it speed up recovery.

The patch is attached.

Unfortunately result is negative (at least at my desktop: SSD, 16Gb 
RAM). Recovery with prefetch is 3 times slower than without it.

What I am doing:

Configuration:
     max_wal_size=min_wal_size=10Gb,
     shared_buffers = 1Gb
Database:
  pgbench -i -s 1000
Test:
  pgbench -c 10 -M prepared -N -T 100 -P 1
  pkill postgres
  echo 3 > /proc/sys/vm/drop_caches
  time pg_ctl -t 1000 -D pgsql -l logfile start

Without prefetch it is 19 seconds (recovered about 4Gb of WAL), with 
prefetch it is about one minute. About 400k blocks are prefetched.

CPU usage is small (<20%), both processes as in "Ds" state.



Based on a quick test, my guess is that the patch is broken in several 
ways. Firstly, with the patch attached (and wal_prefetch_enabled=on, 
which I think is needed to enable the prefetch) I can't even restart the 
server, because pg_ctl restart just hangs (the walprefetcher process 
gets stuck in WaitForWAL, IIRC).


I have added an elog(LOG,...) to walprefetcher.c, right before the 
FilePrefetch call, and (a) I don't see any actual prefetch calls during 
recovery but (b) I do see the prefetch happening during the pgbench. 
That seems a bit ... wrong?


Furthermore, you've added an extra

signal_child(BgWriterPID, SIGHUP);

to SIGHUP_handler, which seems like a bug too. I don't have time to 
investigate/debug this further.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Considering signal handling in plpython again

2018-06-21 Thread Heikki Linnakangas

Hi Hubert,

Are you working on this, or should I pick this up? Would be nice to get 
this done as soon as v12 development begins.


- Heikki



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Andrew Dunstan




On 06/21/2018 12:08 PM, Tom Lane wrote:

I wrote:

We could do without the unrelated whitespace changes in dumpDefaultACL,
too (or did pgindent introduce those?  Seems surprising ... oh, looks
like "type" has somehow snuck into typedefs.list.  Time for another
blacklist entry.)

Hmm .. actually, "type" already is in pgindent's blacklist.  Are you
using an up-to-date pgindent'ing setup?






I am now. This was a holdover from before I updated.

cheers

andrew

--
Andrew Dunstan                 https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Andrew Dunstan




On 06/21/2018 12:39 PM, Tom Lane wrote:

Andres Freund  writes:

On June 21, 2018 9:04:28 AM PDT, Tom Lane  wrote:

This isn't really ready to go.  One clear problem is that you broke
pg_dump'ing from any pre-v11 version, because you did not add suitable
null outputs to the pre-v11 query variants in getTableAttrs.

Thought the same for a bit - think it's OK though, because there's a check for 
PQfnumber being bigger than 0...

It won't actually crash, but it will spit out lots of ugly warnings.





I followed the pattern used for attidentity. But why will it spit out 
warnings? It doesn't ask for the missing stuff from an older version.


Incidentally, I just checked this by applying the patch and then running 
through the buildfarm's cross version upgrade check. All was kosher. 
Upgrade from all live branches to the patched master went without a hitch.


cheers

andrew

--
Andrew Dunstan                 https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Nico Williams
On Thu, Jun 21, 2018 at 10:14:54AM -0400, Bruce Momjian wrote:
> On Wed, Jun 20, 2018 at 04:57:18PM -0500, Nico Williams wrote:
> > Client-side crypto is hard to do well and still get decent performance.
> > So on the whole I think that crypto is a poor fit for the DBAs-are-the-
> > threat threat model.  It's better to reduce the number of DBAs/sysadmins
> > and audit all privileged (and, for good measure, unprivileged) access.
> 
> Yeah, kind of.  There is the value of preventing accidental viewing of
> the data by the DBA, and of course WAL and backup encryption are nice.

One generally does not use crypto to prevent "accidental" viewing of
plaintext, but to provide real security relative to specific threats.

If you stop at encrypting values with no integrity protection for the
PKs, and no binding to TX IDs and such, you will indeed protect against
accidental viewing of the plaintext, but not against a determined
malicious insider.

Is that worthwhile?  Remember: you'll have to reduce and audit sysadmin
& DBA access anyways.

There is also the risk that users won't understand the limitations of
this sort of encryption feature and might get a false sense of security
from [mis]using it.

I'd want documentation to make it absolutely clear that such a feature
is only meant to reduce the risk of accidental viewing of plaintext by
DBAs and not a real security feature.

Nico
-- 



Re: libpq compression

2018-06-21 Thread Robbie Harwood
Konstantin Knizhnik  writes:

> On 21.06.2018 17:56, Robbie Harwood wrote:
>> Konstantin Knizhnik  writes:
>>> On 20.06.2018 23:34, Robbie Harwood wrote:
 Konstantin Knizhnik  writes:

 Well, that's a design decision you've made.  You could put lengths
 on chunks that are sent out - then you'd know exactly how much is
 needed.  (For instance, 4 bytes of network-order length followed by
 a complete payload.)  Then you'd absolutely know whether you have
 enough to decompress or not.
>>>
>>> Do you really suggest to send extra header for each chunk of data?
>>> Please notice that chunk can be as small as one message: dozen of
>>> bytes because libpq is used for client-server communication with
>>> request-reply pattern.
>>
>> I want you to think critically about your design.  I *really* don't
>> want to design it for you - I have enough stuff to be doing.  But
>> again, the design I gave you doesn't necessarily need that - you just
>> need to properly buffer incomplete data.
>
> Right now secure_read may return any number of available bytes. But in
> case of using streaming compression, it can happen that available
> number of bytes is not enough to perform decompression. This is why we
> may need to try to fetch additional portion of data. This is how
> zpq_stream is working now.

No, you need to buffer and wait until you're called again.  Which is to
say: decompress() shouldn't call secure_read().  secure_read() should
call decompress().

> I do not understand how it is possible to implement in different way
> and what is wrong with current implementation.

The most succinct thing I can say is: absolutely don't modify
pq_recvbuf().  I gave you pseudocode for how to do that.  All of your
logic should be *below* the secure_read/secure_write functions.

I cannot approve code that modifies pq_recvbuf() in the manner you
currently do.
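
To illustrate the shape being suggested - a minimal standalone sketch, not
libpq code; raw_read() here is an assumed stand-in for the existing transport
read and may return 0 on EOF or -1 with EWOULDBLOCK, and error handling and
buffer growth are omitted:

    #include <zstd.h>
    #include <string.h>
    #include <sys/types.h>

    typedef struct DecompState
    {
        ZSTD_DStream *ds;           /* created elsewhere with ZSTD_createDStream() */
        char        rawbuf[8192];   /* compressed bytes not yet consumed */
        size_t      rawlen;
    } DecompState;

    extern ssize_t raw_read(void *dst, size_t len);     /* assumed transport read */

    /*
     * Sits *below* the read call: decompress what is already buffered, pull
     * more raw bytes only when nothing could be produced, and propagate the
     * transport's EOF / would-block status instead of spinning on the socket.
     */
    static ssize_t
    compressed_read(DecompState *st, void *dst, size_t len)
    {
        ZSTD_outBuffer out = { dst, len, 0 };

        for (;;)
        {
            if (st->rawlen > 0)
            {
                ZSTD_inBuffer in = { st->rawbuf, st->rawlen, 0 };
                size_t  rc = ZSTD_decompressStream(st->ds, &out, &in);

                if (ZSTD_isError(rc))
                    return -1;
                /* keep any unconsumed compressed bytes for the next call */
                memmove(st->rawbuf, st->rawbuf + in.pos, st->rawlen - in.pos);
                st->rawlen -= in.pos;
                if (out.pos > 0)
                    return (ssize_t) out.pos;   /* return what we produced */
            }

            /* nothing decompressible buffered: read more raw bytes, once */
            ssize_t n = raw_read(st->rawbuf + st->rawlen,
                                 sizeof(st->rawbuf) - st->rawlen);
            if (n <= 0)
                return n;       /* propagate EOF / would-block to the caller */
            st->rawlen += (size_t) n;
        }
    }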

 My idea was the following: client want to use compression. But
 server may reject this attempt (for any reasons: it doesn't support
 it, has no proper compression library, do not want to spend CPU for
 decompression,...) Right now compression algorithm is
 hardcoded. But in future client and server may negotiate to choose
 proper compression protocol.  This is why I prefer to perform
 negotiation between client and server to enable compression.  Well,
 for negotiation you could put the name of the algorithm you want in
 the startup.  It doesn't have to be a boolean for compression, and
 then you don't need an additional round-trip.
>>>
>>> Sorry, I can only repeat arguments I already mentioned:
>>> - in future it may be possible to specify compression algorithm
>>> - even with boolean compression option server may have some reasons to
>>> reject client's request to use compression
>>>
>>> Extra flexibility is always good thing if it doesn't cost too
>>> much. And extra round of negotiation in case of enabling compression
>>> seems to me not to be a high price for it.
>>
>> You already have this flexibility even without negotiation.  I don't
>> want you to lose your flexibility.  Protocol looks like this:
>>
>> - Client sends connection option "compression" with list of
>>   algorithms it wants to use (comma-separated, or something).
>>
>> - For the first packet it sends back, the server compresses with one of
>>   those algorithms (or none, if it doesn't want to turn on compression).
>>
>> No additional round-trips needed.
>
> This is exactly how it works now...  Client includes compression
> option in connection string and server replies with special message
> ('Z') if it accepts request to compress traffic between this client
> and server.

No, it's not.  You don't need this message.  If the server receives a
compression request, it should just turn compression on (or not), and
then have the client figure out whether it got compression back.  This
is of course made harder by your refusal to use packet framing, but
still shouldn't be particularly difficult.
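
As a concrete illustration of that handshake shape - a hypothetical,
self-contained sketch; the option name, algorithm list and helper are made
up for the example and are not the patch's code:

    #include <stdio.h>
    #include <string.h>

    static const char *server_algos[] = { "zstd", "zlib", NULL };

    /*
     * Pick the first mutually supported algorithm from the client's
     * comma-separated "compression" startup option, or NULL to leave
     * compression off.  The client then detects whether the first packet
     * it receives back is compressed (and with what).
     */
    static const char *
    choose_compression(const char *client_list)
    {
        char    buf[256];
        char   *save = NULL;

        snprintf(buf, sizeof(buf), "%s", client_list);
        for (char *tok = strtok_r(buf, ",", &save); tok != NULL;
             tok = strtok_r(NULL, ",", &save))
        {
            for (int i = 0; server_algos[i] != NULL; i++)
                if (strcmp(tok, server_algos[i]) == 0)
                    return server_algos[i];
        }
        return NULL;
    }

    int
    main(void)
    {
        const char *chosen = choose_compression("lz4,zstd");

        printf("server picks: %s\n", chosen ? chosen : "no compression");
        return 0;
    }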

Thanks,
--Robbie


signature.asc
Description: PGP signature


Re: location reporting in TAP test failures

2018-06-21 Thread Heikki Linnakangas

On 05/06/18 18:28, Peter Eisentraut wrote:

Right now, when a TAP test reports a failure, it looks something like this:

  #   Failed test 'creating a replication slot'
  #   at
/../postgresql/src/bin/pg_basebackup/../../../src/test/perl/TestLib.pm
line 371.

That file location is where we call out to the test function provided by
Test::More.

What we'd really want is

  #   Failed test 'creating a replication slot'
  #   at t/020_pg_receivewal.pl line 36.

because that's where the code that's doing the testing is.

To achieve that, we need to have our test library functions tell that
they are support functions and not the actual tests.  The attached patch
does that.  The mechanism is (somewhat) explained in the Test::Builder
man page.


Looks good. I wish there was some way to remind us to do this for any 
future test helper functions too, but I don't have any ideas off the 
top of my head.


Since this is purely a test change, IMO this could go to master now, no 
need to wait for until v12 development starts.


- Heikki



Re: Speedup of relation deletes during recovery

2018-06-21 Thread Andres Freund
On 2018-06-21 14:40:58 +0900, Michael Paquier wrote:
> On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote:
> > On 2018-06-18 11:13:47 -0700, Andres Freund wrote:
> >> We could do that - but add_to_unowned_list() is actually a bottleneck in
> >> other places during recovery too. We pretty much never (outside of
> >> dropping relations / databases) close opened relations during recovery -
> >> which is obviously problematic since nearly all of the entries are
> >> unowned.  I've come to the conclusion that we should have a global
> >> variable that just disables adding anything to the global lists.
> > 
> > On second thought: I think we should use your approach in the back branches,
> > and something like what I'm suggesting in master once it's open.
> 
> +1.  Let's also make sure that the removal of smgrdounlink and
> smgrdounlinkfork happens only on master.

FWIW, I'd just skip removing them. They seem useful functionality and
don't hurt. I don't see the point in mucking around with them.

Greetings,

Andres Freund



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/21/2018 01:44 PM, Tom Lane wrote:
>> So I'm thinking that the attidentity code is just wrong, and you should
>> change that too while you're at it.

> That should be backpatched if changed, no? I don't think we'd want this 
> to get out of sync between the branches. It would make later 
> backpatching more difficult for one thing.

If you feel like it.  But if there's attmissingval code right next to it
as of v11, then backpatches wouldn't apply cleanly anyway due to lack of
context match, so I doubt there's really much gain to be had.

regards, tom lane



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Bruce Momjian
On Thu, Jun 21, 2018 at 04:49:34PM +0900, Masahiko Sawada wrote:
> >> On Thu, Jun 21, 2018 at 6:57 AM, Nico Williams  
> >> wrote:
> >> So on the whole I think that crypto is a poor fit for the DBAs-are-the-
> >> threat threat model.  It's better to reduce the number of DBAs/sysadmins
> >> and audit all privileged (and, for good measure, unprivileged) access.
> 
> I agree with this. The in-database data encryption can defend mainly
> the threat of storage theft and the threat of memory dump attack. I'm
> sure this design had been proposed for the former purpose. If we want
> to defend the latter we must encrypt data even on database memory. To
> be honest, I'm not sure that there is needs in practice that is user
> want to defend the memory dump attack. What user often needs is to
> defend the threat of storage theft with minimum performance overhead.
> It's known that client-side encryption or encryption on database
> memory increase additional performance overheads. So it would be
> better to have several ways to defend different threats as Joe
> mentioned.

If you can view memory you can't really trust the server and have to do
encryption client-side.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Andres Freund



On June 21, 2018 9:04:28 AM PDT, Tom Lane  wrote:
>Andrew Dunstan  writes:
>> Patch after pgindent attached. This will involve a catversion bump
>since 
>> we're adding a new function.
>
>This isn't really ready to go.  One clear problem is that you broke
>pg_dump'ing from any pre-v11 version, because you did not add suitable
>null outputs to the pre-v11 query variants in getTableAttrs.

Thought the same for a bit - think it's OK though, because there's a check for 
PQfnumber being bigger than 0...

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: phraseto_tsquery design

2018-06-21 Thread Arthur Zakirov
Hello,

On Thu, Jun 21, 2018 at 11:02:32AM -0400, Sagiv Some wrote:
> 2. However, it seems impossible to bypass the performance problem of phrase
> searching. I conduct quite a bit of phrase searching, and although
> postgres' "phraseto_tsquery" performs great on phrases with uncommon words,
> it slows to a screeching halt on phrases with common words such as "law
> firm" or, for example, "bank of america". This is a huge problem, because
> "plainto_tsquery" performs just fine on these but as I understand it,
> phrase searching is built to do a scan after finding each word using
> "plainto"?
> 
> There are already positions and the "plainto" function is quite fast; is
> there a way to modify the "phraseto" query to perform a useful and fast
> search that looks for the distance between found words appropriately?

If I understood you correctly, you use a GIN index for text search.

Unfortunately it isn't a phraseto_tsquery() issue; it is a GIN index
characteristic.

A tsvector consists of lexemes and their positions retrieved from the text
document. GIN stores only the lexemes, which is fine for regular search (using
the plainto_tsquery() function). But phrase search (via phraseto_tsquery())
requires lexeme positions. During phrase search, additional work is done:
- first, get all items from the index which satisfy the query (as in regular
  search)
- then read the entire tsvector from the heap
- recheck all fetched items and exclude those which don't satisfy
  the phrase query

The last two points are the additional work.

We have our own index as an extension. It is a modified GIN index that can
store lexemes and their positions, and therefore phrase queries are faster.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Spilling hashed SetOps and aggregates to disk

2018-06-21 Thread David Gershuni


> On Jun 19, 2018, at 10:36 PM, Jeff Davis  wrote:
> 
> But I am worried that I am missing something, because it appears that
> for AGG_MIXED, we wait until the end of the last phase to emit the
> contents of the hash tables. Wouldn't they be complete after the first
> phase?

You're right. They're complete after the first phase, but I believe
they're not emitted until the end in order to minimize the complexity of
the code. I don't think the possibility of exceeding work_mem was taken
into account when this decision was made.


> I was imagining that (in the simplest approach) each hash table would
> have its own set of spilled tuples. If grouping set A can be advanced
> (because it's present in the hash table) but B cannot (because it's not
> in the hash table and we are at the memory limit), then it goes into
> the spill file for B but not A. That means that A is still finished
> after the first pass.
...
> I realize that this simple approach could be inefficient if multiple
> hash tables are spilling because we'd duplicate the spilled tuples. But
> let me know if it won't work at all.

This approach seems functionally correct, but I don't like the idea of
transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
(in the worst case). As of yesterday, I think I have a better approach
that will work even for non-sortable grouping keys. (See below).


> I am having a hard time trying to satisfy all of the constraints that
> have been raised so far:
> 
> * handle data types with hash ops but no btree ops
> * handle many different group size distributions and input orders
> efficiently
> * avoid regressions or inefficiencies for grouping sets
> * handle variable-size (particularly O(n)-space) transition states
> 
> without taking on a lot of complexity. I like to work toward the right
> design, but I think simplicity also factors in to the right design.
> NodeAgg.c is already one of the larger and more complicated files that
> we have.

Yes, there are a lot of moving pieces here! And it will be non-trivial to
satisfy them all with one approach. I'm working on a new aggstrategy
that could be used in conjunction with AGG_SORT. This approach could
replace the current AGG_HASH and would satisfy all of the requirements
you mentioned for aggregates that have serial/deserial functions. 

My approach is based on the HashSort algorithm that I mentioned before
but with new modifications to a) handle multiple grouping sets and b) not
require btree ops on the grouping keys.

a) is accomplished by hashing all grouping sets simultaneously and putting
each set in its own memory context. Whenever work_mem is exceeded, we
sort and spill the hash table with the largest memory context.

b) is accomplished by not sorting groups by their actual grouping keys, but
instead sorting by their hash values. This works because we don't need a 
true sort. We just need to process identical groups in a consistent order
during the merge step. As long as we're careful about collisions during the
merge, this should work.

> A lot of the complexity we already have is that grouping sets are
> performed in one giant executor node rather than exposing the
> individual phases as separate executor nodes. I suppose the reason
> (correct me if I'm wrong) is that there are two outputs from a phase:
> the aggregated groups, and the original input tuples in a new sort
> order. That seems solvable -- the executor passes bitmaps around until
> BitmapHeapScan turns them into tuples.

That is an interesting idea, but it's somewhat orthogonal to spill 
implementation.
Still, it could be a nice way to decompose aggregate nodes that require multiple
phases/strategies. I'm curious to hear what others think.

Best,
David


Re: libpq compression

2018-06-21 Thread Robbie Harwood
Konstantin Knizhnik  writes:

> On 20.06.2018 23:34, Robbie Harwood wrote:
>> Konstantin Knizhnik  writes:
>>
>>
>> My idea was the following: client want to use compression. But server
>> may reject this attempt (for any reasons: it doesn't support it, has
>> no proper compression library, do not want to spend CPU for
>> decompression,...) Right now compression algorithm is hardcoded. But
>> in future client and server may negotiate to choose proper compression
>> protocol.  This is why I prefer to perform negotiation between client
>> and server to enable compression.
>> Well, for negotiation you could put the name of the algorithm you want
>> in the startup.  It doesn't have to be a boolean for compression, and
>> then you don't need an additional round-trip.
>
> Sorry, I can only repeat arguments I already mentioned:
> - in future it may be possible to specify compression algorithm
> - even with boolean compression option server may have some reasons to 
> reject client's request to use compression
>
> Extra flexibility is always good thing if it doesn't cost too much. And 
> extra round of negotiation in case of enabling compression seems to me 
> not to be a high price for it.

You already have this flexibility even without negotiation.  I don't
want you to lose your flexibility.  Protocol looks like this:

- Client sends connection option "compression" with list of algorithms
  it wants to use (comma-separated, or something).

- For the first packet it sends back, the server compresses with one of those
  algorithms (or none, if it doesn't want to turn on compression).

No additional round-trips needed.

>> Well, that's a design decision you've made.  You could put lengths on
>> chunks that are sent out - then you'd know exactly how much is needed.
>> (For instance, 4 bytes of network-order length followed by a complete
>> payload.)  Then you'd absolutely know whether you have enough to
>> decompress or not.
>
> Do you really suggest to send extra header for each chunk of data?
> Please notice that chunk can be as small as one message: dozen of bytes 
> because libpq is used for client-server communication with request-reply 
> pattern.

I want you to think critically about your design.  I *really* don't want
to design it for you - I have enough stuff to be doing.  But again, the
design I gave you doesn't necessarily need that - you just need to
properly buffer incomplete data.

> Frankly speaking I do not completely understand the source of your
> concern.  My primary idea was to preseve behavior of libpq function as
> much as possible, so there is no need to rewrite all places in
> Postgres code when them are used.  It seems to me that I succeed in
> reaching this goal. Incase of enabled compression zpq_stream functions
> (zpq-read/write) are used instead of (pq)secure_read/write and in turn
> are using them to fetch/send more data. I do not see any bad flaws,
> encapsulation violation or some other problems in such solution.
>
> So before discussing some alternative ways of embedding compression in 
> libpq, I will want to understand what's wrong with this approach.

You're destroying the existing model for no reason.  If you needed to, I
could understand some argument for the way you've done it, but what I've
tried to tell you is that you don't need to do so.  It's longer this
way, and it *significantly* complicates the (already difficult to reason
about) connection state machine.

I get that rewriting code can be obnoxious, and it feels like a waste of
time when we have to do so.  (I've been there; I'm on version 19 of my
postgres patchset.)

>>> So loop should be something like this:
>>>
>>>decompress(ptr, n)
>>>   ZSTD_inBuffer in = {0}
>>>   ZSTD_outBuffer out = {0}
>>>
>>>   in.src = ptr
>>>   in.size = n
>>>
>>>   while true
>>>   ret = ZSTD_decompressStream(out, in)
>>>   if ZSTD_isError(ret):
>>>   give_up()
>>> if out.pos != 0
>>> // if we decompressed something
>>> return out.pos;
>>> read_input(in);
>>
>> The last line is what I'm taking issue with.  The interface we have
>> already in postgres's network code has a notion of partial reads, or
>> that reads might not return data.  (Same with writing and partial
>> writes.)  So you'd buffer what compressed data you have and return -
>> this is the preferred idiom so that we don't block or spin on a
>> nonblocking socket.
>
> If the socket is in non-blocking mode and there is no available data, then 
> the secure_read function will also immediately return 0.
> The pseudocode above is not quite correct. Let me show the real 
> implementation of zpq_read:
>
> ssize_t
> zpq_read(ZpqStream *zs, void *buf, size_t size, size_t *processed)
> {
>      ssize_t rc;
>      ZSTD_outBuffer out;
>      out.dst = buf;
>      out.pos = 0;
>      out.size = size;
>
>      while (1)
>      {
>          rc = ZSTD_decompressStream(zs->rx_stream, &out, &zs->rx);
>     

Re: partition table and stddev() /variance() behaviour

2018-06-21 Thread David Rowley
On 22 June 2018 at 02:01, Tom Lane  wrote:
> David Rowley  writes:
>> Well, that's quite surprising. It appears to be a bug in
>> numeric_poly_combine for machines without a working int128 type. The
>> parameters in accum_sum_copy are in the incorrect order.
>
> Ouch.

Yeah. Looks like this function was correct when it went in. It was
9cca11c915e (v10) that caused the issue.

>> The very minimal fix is attached, but I'll need to go look at where
>> the tests for this have gone.
>
> coverage.postgresql.org shows that numeric_poly_serialize/combine()
> aren't exercised at all by the regression tests.  Which is embarrassing
> for this case, but I'm a bit leery of trying to insist on 100% coverage.
>
> It might be a plan to insist on buildfarm coverage for anything with
> platform-varying code in it, in which case there's at least one
> other undertested bit of HAVE_INT128 code in numeric.c.

I'm not familiar with what the coverage tool can do, so maybe there's
an easier way than running a script to check for 0 coverage between
#ifdef / #if and #endif inside all .c files.

This sounds like a longer-term project, though; let's get this one fixed first.

I think some coverage of the numerical aggregates is a good idea, so
I've added some in the attached. I managed to get a parallel plan
going with a query to onek, which is pretty cheap to execute. I didn't
touch the bool aggregates. Maybe I should have done that too..?

I also did a quick validation of the other accum_sum_copy usages. All
others look correct.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


fix_numeric_poly_combine_v2.patch
Description: Binary data


Re: Spilling hashed SetOps and aggregates to disk

2018-06-21 Thread David Gershuni
On Jun 21, 2018, at 1:04 PM, Jeff Davis  wrote:
> On Thu, 2018-06-21 at 11:04 -0700, David Gershuni wrote:
>> This approach seems functionally correct, but I don't like the idea
>> of
>> transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
>> (in the worst case).
> 
> It's the same amount of I/O as the idea you suggested as putting the
> hash tables into separate phases, but it may consume more disk space at
> once in the worst case.
> 
> We can mitigate that if it becomes a problem later.

That’s true. Both fall-back approaches would suffer from high I/O.
However, the HashSort-based approach would offer much better performance,
assuming the sorting issue is overcome.


>> b) is accomplished by not sorting groups by their actual grouping
>> keys, but
>> instead sorting by their hash values. This works because we don't
>> need a 
>> true sort. We just need to process identical groups in a consistent
>> order
>> during the merge step. As long as we're careful about collisions
>> during the
>> merge, this should work.
> 
> Can you expand on how we should handle collisions? If all we can do is
> hash and compare equality, then it seems complex to draw group
> boundaries.

In the HashSort algorithm, the sort order is only used as a tool in the
merge step. In the merge step, we will process one grouping set
at a time.

By sorting before merging, we ensure that we’ll encounter
all partial aggregations of a group at once by looking at the head of each
spill file. This allows us to maintain a ‘current group’ and combine all
instances of it from each spill file. Then when no more instances of that
group are present, we emit the group as output. So the order can be
anything, as long as it’s consistent.

To handle hash collisions, we can do the following:

1) We track the current hash code we’re processing, in ascending
order.

2) Instead of combining one group at a time, we’ll maintain a list of
all groups we’ve seen that match the current hash code.

3) When we peek at a spill file, if its next group’s hash code matches
our current hash code, we’ll check to see if that group matches any of the
groups in our list. If so, we’ll combine it with the matching group. If not,
we’ll add this group to our list.

4) Once the head of each spill file exceeds the current hash code, we’ll
emit our list as output, empty it, and advance to the next hash code.
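
For concreteness, here is a minimal self-contained sketch of that merge step
(every type, name, and hash value below is invented for illustration; a real
implementation would read spill files and combine aggregate transition states
rather than in-memory arrays and running sums):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct PartialGroup
{
    uint32_t hash;          /* hash of the grouping key */
    char     key[16];       /* grouping key; distinct keys may collide */
    long     partial_sum;   /* partial aggregate state */
} PartialGroup;

/* One "spill file": partial groups already sorted by hash code. */
typedef struct Spill
{
    PartialGroup *groups;
    int           ngroups;
    int           pos;
} Spill;

static void
merge_spills(Spill *spills, int nspills)
{
    PartialGroup current[8];    /* groups sharing the current hash code
                                 * (bounded here only for brevity) */
    int          ncurrent;

    for (;;)
    {
        uint32_t minhash = UINT32_MAX;
        int      i;

        /* 1) find the smallest hash code at the head of any spill file */
        for (i = 0; i < nspills; i++)
            if (spills[i].pos < spills[i].ngroups &&
                spills[i].groups[spills[i].pos].hash < minhash)
                minhash = spills[i].groups[spills[i].pos].hash;

        if (minhash == UINT32_MAX)
            break;              /* all inputs exhausted */

        /* 2) and 3) drain every head matching that hash code, resolving
         * collisions by comparing the actual grouping keys */
        ncurrent = 0;
        for (i = 0; i < nspills; i++)
        {
            while (spills[i].pos < spills[i].ngroups &&
                   spills[i].groups[spills[i].pos].hash == minhash)
            {
                PartialGroup *g = &spills[i].groups[spills[i].pos++];
                int           j;

                for (j = 0; j < ncurrent; j++)
                    if (strcmp(current[j].key, g->key) == 0)
                    {
                        current[j].partial_sum += g->partial_sum;
                        break;
                    }
                if (j == ncurrent)
                    current[ncurrent++] = *g;   /* new (colliding) group */
            }
        }

        /* 4) every head is now past this hash code: emit buffered groups */
        for (i = 0; i < ncurrent; i++)
            printf("%s => %ld\n", current[i].key, current[i].partial_sum);
    }
}

int
main(void)
{
    /* "apple" and "pear" get the same made-up hash code on purpose */
    PartialGroup a[] = {{42, "apple", 10}, {42, "pear", 1}, {99, "plum", 5}};
    PartialGroup b[] = {{42, "apple", 7}, {99, "plum", 2}};
    Spill        spills[] = {{a, 3, 0}, {b, 2, 0}};

    merge_spills(spills, 2);    /* prints apple => 17, pear => 1, plum => 7 */
    return 0;
}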

Does that seem reasonable?

Best,
David
Salesforce

Re: libpq compression

2018-06-21 Thread Nico Williams
On Thu, Jun 21, 2018 at 10:12:17AM +0300, Konstantin Knizhnik wrote:
> On 20.06.2018 23:34, Robbie Harwood wrote:
> >Konstantin Knizhnik  writes:
> >Well, that's a design decision you've made.  You could put lengths on
> >chunks that are sent out - then you'd know exactly how much is needed.
> >(For instance, 4 bytes of network-order length followed by a complete
> >payload.)  Then you'd absolutely know whether you have enough to
> >decompress or not.
> 
> Do you really suggest sending an extra header for each chunk of data?
> Please notice that a chunk can be as small as one message (a dozen bytes),
> because libpq is used for client-server communication with a request-reply
> pattern.

You must have lengths, yes, otherwise you're saying that the chosen
compression mechanism must itself provide framing.

I'm not that familiar with compression APIs and formats, but looking at
RFC1950 (zlib) for example I see no framing.

So I think you just have to have lengths.

Now, this being about compression, I understand that you might not want
to have 4-byte lengths, especially given that most messages will be
under 8KB.  So use a varint encoding for the lengths.
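
For what it's worth, such a prefix is cheap.  A minimal sketch of a
LEB128-style varint length (purely illustrative, not code from the patch):
7 payload bits per byte, with the high bit marking continuation.

#include <stdint.h>
#include <stddef.h>

/* Encode len into buf (at most 5 bytes for a uint32); returns bytes written. */
size_t
varint_encode(uint32_t len, uint8_t *buf)
{
    size_t n = 0;

    do
    {
        uint8_t byte = len & 0x7F;

        len >>= 7;
        buf[n++] = byte | (len ? 0x80 : 0);
    } while (len);

    return n;
}

/* Decode a length from buf; returns bytes consumed, or 0 if incomplete. */
size_t
varint_decode(const uint8_t *buf, size_t avail, uint32_t *len)
{
    uint32_t val = 0;
    size_t   n;

    for (n = 0; n < avail && n < 5; n++)
    {
        val |= (uint32_t) (buf[n] & 0x7F) << (7 * n);
        if ((buf[n] & 0x80) == 0)
        {
            *len = val;
            return n + 1;
        }
    }
    return 0;                   /* need more input (or malformed) */
}

With that, any message shorter than 128 bytes pays one byte of framing and
anything under 16 kB pays two.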

Nico
-- 



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Alvaro Herrera
In terms of pgindent, I'm surprised about these lines:

+missingval = OidFunctionCall3(
+  F_ARRAY_IN,

Why did you put a newline there?  In ancient times there was a reason
for that in some cases, because pgindent would move the argument to the
left of the open parens, but it doesn't do that anymore and IMO it's
just ugly.  We have quite a few leftovers from this ancient practice,
I've been thinking about removing these ...

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Nico Williams
On Fri, May 25, 2018 at 08:41:46PM +0900, Moon, Insung wrote:
> Issues on data encryption of PostgreSQL
> ==
> Currently, in PostgreSQL, data encryption can be done using the pgcrypto tool.
> However, it is inconvenient to use pgcrypto to encrypt data in some cases.
> 
> There are two significant inconveniences.
> 
> First, if we use pgcrypto to encrypt/decrypt data, we must call pgcrypto 
> functions everywhere we encrypt/decrypt.

Not so.  VIEWs with INSTEAD OF triggers allow you to avoid this.

> Second, we must modify application program code significantly if we want to
> migrate to PostgreSQL from other databases that are using
> TDE.

Not so.  See above.

However, I have at times been told that I should use SQL Server or
whatever because it has column encryption.  My answer is always that it
doesn't help (see my other posts on this thread), but from a business
perspective I understand the problem: the competition has a shiny (if
useless) feature XYZ, therefore we must have it also.

I'm not opposed to PG providing encryption features similar to the
competition's provided the documentation makes their false-sense-of-
security dangers clear.

Incidentally, PG w/ pgcrypto and FDW does provide everything one needs
to be able to implement client-side crypto:

 - use PG w/ FDW as a client-side proxy for the real DB
 - use pgcrypto in VIEWs with INSTEAD OF triggers in the proxy
 - access the DB via the proxy

Presto: transparent client-side crypto that protects against DBAs.  See
other posts about properly binding ciphertext and plaintext.
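
As a rough sketch of that proxy-side setup (every object name here is
invented; postgres_fdw with a pre-created server called real_server is
assumed; the key is hand-waved into a session GUC; and none of the
ciphertext-binding caveats mentioned above are addressed):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- ciphertext-only table on the untrusted real server, reached via postgres_fdw
CREATE FOREIGN TABLE secrets_remote (
    id      integer,
    payload bytea
) SERVER real_server OPTIONS (table_name 'secrets');

-- what clients see: plaintext, decrypted in the proxy
CREATE VIEW secrets AS
    SELECT id,
           pgp_sym_decrypt(payload, current_setting('app.key')) AS payload
      FROM secrets_remote;

-- writes go through an INSTEAD OF trigger that encrypts before shipping
CREATE FUNCTION secrets_insert() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO secrets_remote (id, payload)
    VALUES (NEW.id,
            pgp_sym_encrypt(NEW.payload, current_setting('app.key')));
    RETURN NEW;
END $$;

CREATE TRIGGER secrets_insert INSTEAD OF INSERT ON secrets
    FOR EACH ROW EXECUTE PROCEDURE secrets_insert();

Clients connect to the proxy and simply read and write the secrets view; only
ciphertext ever reaches the real server.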

Protection against malicious DBAs is ultimately a very difficult thing
to get right -- if you really have DBAs as a threat and take that threat
seriously then you'll end up implementing a Merkle tree and performance
will go out the window.

> In these discussions, there were requirements necessary to support TDE in 
> PostgreSQL.
> 
> 1) The performance overhead of encrypting and decrypting database data must
> be minimized
> 2) Need to support WAL encryption.
> 3) Need to support Key Management Service.

(2) and full database encryption could be done by the filesystem /
device drivers.  I think this is a much better answer than including
encryption in the DB just because it means not adding all that
complexity to PG, though it's not as portable as doing it in the DB (and
this may well be a winning argument).

What (3) looks like depends utterly on the threat model.  We must
discuss threat models first.

The threat models will drive the design, and (1) will drive some
trade-offs.

> Therefore, I'd like to propose the new design of TDE that deals with
> both above requirements.  Since this feature will become very large,
> I'd like to hear opinions from community before starting making the
> patch.

Any discussion of cryptographic applications should start with a
discussion of threat models.  This is not a big hurdle.

Nico
-- 



PSA: --enable-coverage interferes with parallel query scheduling

2018-06-21 Thread Tom Lane
I've been poking at why coverage.postgresql.org shows that the second
stanza in int8_avg_combine isn't exercised, when it clearly should be.
I can reproduce the problem here, so it's fairly robust.  Eventually
it occurred to me to try a straight EXPLAIN ANALYZE VERBOSE, and what
I find is that with --enable-coverage, you reproducibly get behavior
like this:

explain analyze verbose
  SELECT variance(unique1::int4), sum(unique1::int8) FROM tenk1;

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=401.70..401.71 rows=1 width=64) (actual time=116.843..116.843 rows=1 loops=1)
   Output: variance(unique1), sum((unique1)::bigint)
   ->  Gather  (cost=401.65..401.66 rows=4 width=64) (actual time=55.079..116.789 rows=4 loops=1)
         Output: (PARTIAL variance(unique1)), (PARTIAL sum((unique1)::bigint))
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial Aggregate  (cost=401.65..401.66 rows=1 width=64) (actual time=5.148..5.148 rows=1 loops=4)
               Output: PARTIAL variance(unique1), PARTIAL sum((unique1)::bigint)
               Worker 0: actual time=0.026..0.026 rows=1 loops=1
               Worker 1: actual time=0.028..0.028 rows=1 loops=1
               Worker 2: actual time=0.023..0.023 rows=1 loops=1
               Worker 3: actual time=20.514..20.515 rows=1 loops=1
               ->  Parallel Seq Scan on public.tenk1  (cost=0.00..382.94 rows=2494 width=4) (actual time=0.009..0.918 rows=2500 loops=4)
                     Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
                     Worker 0: actual time=0.007..0.007 rows=0 loops=1
                     Worker 1: actual time=0.007..0.007 rows=0 loops=1
                     Worker 2: actual time=0.005..0.005 rows=0 loops=1
                     Worker 3: actual time=0.018..3.653 rows=1 loops=1
 Planning Time: 0.116 ms
 Execution Time: 144.237 ms
(20 rows)

So the reason for the apparent lack of coverage in the combine step
is that only one worker ever sends back a non-null partial result.
We do have coverage, in that normal runs do exercise the code in question,
but you wouldn't know it by looking at the coverage report.

We could probably fix it by using a significantly larger test case,
but that's not very attractive to put into the regression tests.
Anybody have a better idea about how to improve this?  Or even a
clear explanation for what's causing it?  (I'd expect coverage
instrumentation to impose costs at process exit, not startup.)

regards, tom lane



Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

2018-06-21 Thread Alvaro Herrera
I have pushed the patch now (in your original form rather than my later
formulation) -- let's see how the buildfarm likes it.  There are (at
least) three issues remaining, as per below; Pavel, do you have any
insight on these?

First one is about array indexes not working sanely (I couldn't get this
to work in Oracle)

> > Also, array indexes behave funny.  First let's add more XML comments
> > inside that number, and query for the subscripts:
> > 
> > update xmldata set data = regexp_replace(data::text, '791', '791')::xml;
> > 
> > SELECT  xmltable.*
> >   FROM (SELECT data FROM xmldata) x,
> >LATERAL XMLTABLE('/ROWS/ROW'
> > PASSING data
> > COLUMNS
> >  country_name text PATH 
> > 'COUNTRY_NAME/text()' NOT NULL,
> >  size_text float PATH 'SIZE/text()',
> >  size_text_1 float PATH 'SIZE/text()[1]',
> >  size_text_2 float PATH 'SIZE/text()[2]',
> >  "SIZE" float, size_xml xml PATH 'SIZE')
> > where size_text is not null;
> > 
> > country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │ size_xml
> > ──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼──────────
> > Singapore    │       791 │         791 │          91 │           1 │  791 │ 791
> > (1 fila)

The second one is about (lack of!) processing instructions and comments:

> Also, node() matching comments or processing instructions
> seems to be broken too:
> 
> SELECT *
>  FROM (VALUES (''::xml)
> , (''::xml)
>   ) d(x)
>  CROSS JOIN LATERAL
>XMLTABLE('/xml'
> PASSING x
> COLUMNS "node()" TEXT PATH 'node()'
>) t
> 
>  x | node()
> ---+
>   |
>   |
> (2 rows)
> 
> I can look into this, but it may take a while.

Compare the empty second columns with oracle behavior, which returns the
contents of the PI and the comment.  As a script for
http://rextester.com/l/oracle_online_compiler

create table xmltb (data xmltype) \\
insert into xmltb values ('') \\
insert into xmltb values ('') 
\\
SELECT *  FROM xmltb, XMLTABLE('/xml' PASSING data COLUMNS node varchar2(100) 
PATH 'node()') t \\
drop table xmltb \\


The third issue is the way we output comments when they're in a column
of type XML:

> > Note what happens if I change the type from text to xml in that
> > column:
> > 
> > SELECT *
> >  FROM (VALUES ('text'::xml)
> > , (''::xml)
> >   ) d(x)
> >  CROSS JOIN LATERAL
> >XMLTABLE('/xml'
> > PASSING x
> > COLUMNS "node()" xml PATH 'node()'
> >) t;
> > 
> > x │ 
> > node() 
> > ───┼
> > text  │ te ahoy xt
> >  │ some !-- 
> > really -- weird stuff
> > (2 filas)
> 
> The comment seems to be wrong.
> 
> I guess it’s fine if the CDATA gets transformed into an equivalent
> string using the XML entities. Yet, it might be better to avoid it.


-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Continue work on changes to recovery.conf API

2018-06-21 Thread Sergei Kornilov
Hello all
I would like to continue work on the new recovery API proposed in thread [1]. We
have some form of consensus, but the thread has been inactive for a long time and I
hope I can help.

I started from the last published patch [2] and made some changes:
- updated it to the current HEAD
- made the patch pass a make check-world run
- removed the recovery.auto.conf lookup and changed pg_basebackup to append
recovery parameters to postgresql.auto.conf in both plain and tar formats
- reverted the trigger_file logic, but renamed it to promote_signal_file. I think
there is no reason to make the GUC only a directory path with a hardcoded file name,
as was discussed in the original thread.

Any feedback is strongly appreciated. Thank you

A brief summary of proposed changes:
- if recovery.conf is present during startup, we report a FATAL error about the
old-style config
- recovery mode starts if the file "recovery.signal" exists in the data directory;
all old recovery_target_* options are replaced by the recovery_target_type and
recovery_target_value GUCs
- standby mode starts if the file "standby.signal" exists
- if both are present, standby wins
- pg_basebackup -R will append primary_conninfo and primary_slot_name to the
postgresql.auto.conf config
- all new GUCs are still PGC_POSTMASTER

PS: I will add an entry to the 2018-09 CommitFest when it becomes available.

regards, Sergei

[1] 
https://www.postgresql.org/message-id/flat/CANP8%2BjLO5fmfudbB1b1iw3pTdOK1HBM%3DxMTaRfOa5zpDVcqzew%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CANP8+jJo-LO4xtS7G=in7pg5o60wewdkean+x+gnf+rnay5...@mail.gmail.com



Re: Spilling hashed SetOps and aggregates to disk

2018-06-21 Thread Jeff Davis
On Thu, 2018-06-21 at 11:04 -0700, David Gershuni wrote:
> This approach seems functionally correct, but I don't like the idea
> of
> transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
> (in the worst case).

It's the same amount of I/O as the idea you suggested of putting the
hash tables into separate phases, but it may consume more disk space at
once in the worst case.

We can mitigate that if it becomes a problem later.

> b) is accomplished by not sorting groups by their actual grouping
> keys, but
> instead sorting by their hash values. This works because we don't
> need a 
> true sort. We just need to process identical groups in a consistent
> order
> during the merge step. As long as we're careful about collisions
> during the
> merge, this should work.

Can you expand on how we should handle collisions? If all we can do is
hash and compare equality, then it seems complex to draw group
boundaries.

Regards,
Jeff Davis




Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
Andrew Dunstan  writes:
> I left that for a separate exercise. I think this does things the way 
> you want. I must admit to being a bit surprised, I was expecting you to 
> have more to say about the upgrade function than the pg_dump changes :-)

Well, the upgrade function is certainly a bit ugly (I'm generally allergic
to patches that result in a large increase in the number of #includes in
a file --- it suggests that the code was put in an inappropriate place).
But I don't see a good way to make it better, at least not without heavy
refactoring of StoreAttrDefault so that some code could be shared.

I think this is probably OK now, except for one thing: you're likely
to have issues if a dropped column has an attmissingval, because often
the value won't agree with the bogus "integer" type we use for those;
worse, the array might refer to a type that no longer exists.
One way to improve that is to change

  "CASE WHEN a.atthasmissing THEN a.attmissingval "
  "ELSE null END AS attmissingval "

to

  "CASE WHEN a.atthasmissing AND NOT a.attisdropped 
THEN a.attmissingval "
  "ELSE null END AS attmissingval "

However, this might be another reason to reconsider the decision to use
anyarray: it could easily lead to situations where harmless-looking
queries like "select * from pg_attribute" fail.  Or maybe we should tweak
DROP COLUMN so that it forcibly resets atthasmissing/attmissingval along
with setting attisdropped, so that the case can't arise.

Like Andres, I haven't actually tested, just eyeballed it.

regards, tom lane



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Tom Lane
Alvaro Herrera  writes:
> In terms of pgindent, I'm surprised about these lines:
> +missingval = OidFunctionCall3(
> +  F_ARRAY_IN,

> Why did you put a newline there?  In ancient times there was a reason
> for that in some cases, because pgindent would move the argument to the
> left of the open parens, but it doesn't do that anymore and IMO it's
> just ugly.  We have quite a few leftovers from this ancient practice,
> I've been thinking about removing these ...

I think some people feel this is good style, but I agree with you
about not liking it.  A related practice I could do without is eating
an extra line for an argument-closing paren, as in this example in
tsquery_op.c:

Datum
tsq_mcontained(PG_FUNCTION_ARGS)
{
PG_RETURN_DATUM(
DirectFunctionCall2(
tsq_mcontains,
PG_GETARG_DATUM(1),
PG_GETARG_DATUM(0)
)
);
}

Aside from the waste of vertical space, it's never very clear to me
(nor, evidently, to pgindent) how such a paren ought to be indented.
So to my eye this could be four lines shorter and look better.
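
For comparison, one way to write the same function without burning those
lines (just an illustration of the style being argued for, not a proposed
patch):

Datum
tsq_mcontained(PG_FUNCTION_ARGS)
{
    PG_RETURN_DATUM(DirectFunctionCall2(tsq_mcontains,
                                        PG_GETARG_DATUM(1),
                                        PG_GETARG_DATUM(0)));
}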

regards, tom lane



Re: Fast default stuff versus pg_upgrade

2018-06-21 Thread Andrew Dunstan



On 06/21/2018 01:53 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 06/21/2018 01:44 PM, Tom Lane wrote:
>>> So I'm thinking that the attidentity code is just wrong, and you should
>>> change that too while you're at it.
>>
>> That should be backpatched if changed, no? I don't think we'd want this
>> to get out of sync between the branches. It would make later
>> backpatching more difficult for one thing.
>
> If you feel like it.  But if there's attmissingval code right next to it
> as of v11, then backpatches wouldn't apply cleanly anyway due to lack of
> context match, so I doubt there's really much gain to be had.






I left that for a separate exercise. I think this does things the way 
you want. I must admit to being a bit surprised, I was expecting you to 
have more to say about the upgrade function than the pg_dump changes :-)


cheers

andrew

--
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
index 0c54b02..7a2f785 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -11,13 +11,20 @@
 
 #include "postgres.h"
 
+#include "access/heapam.h"
+#include "access/htup_details.h"
 #include "catalog/binary_upgrade.h"
+#include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
 #include "commands/extension.h"
 #include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
 
 
 #define CHECK_IS_BINARY_UPGRADE	\
@@ -192,3 +199,63 @@ binary_upgrade_set_record_init_privs(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+Datum
+binary_upgrade_set_missing_value(PG_FUNCTION_ARGS)
+{
+	Oid			table_id = PG_GETARG_OID(0);
+	text	   *attname = PG_GETARG_TEXT_P(1);
+	text	   *value = PG_GETARG_TEXT_P(2);
+	Datum		valuesAtt[Natts_pg_attribute];
+	bool		nullsAtt[Natts_pg_attribute];
+	bool		replacesAtt[Natts_pg_attribute];
+	Datum		missingval;
+	Form_pg_attribute attStruct;
+	Relation	attrrel,
+tablerel;
+	HeapTuple	atttup,
+newtup;
+	char	   *cattname = text_to_cstring(attname);
+	char	   *cvalue = text_to_cstring(value);
+
+	CHECK_IS_BINARY_UPGRADE;
+
+	/* lock the table the attribute belongs to */
+	tablerel = heap_open(table_id, AccessExclusiveLock);
+
+	/* Lock the attribute row and get the data */
+	attrrel = heap_open(AttributeRelationId, RowExclusiveLock);
+	atttup = SearchSysCacheAttName(table_id, cattname);
+	if (!HeapTupleIsValid(atttup))
+		elog(ERROR, "cache lookup failed for attribute %s of relation %u",
+			 cattname, table_id);
+	attStruct = (Form_pg_attribute) GETSTRUCT(atttup);
+
+	/* get an array value from the value string */
+	missingval = OidFunctionCall3(
+  F_ARRAY_IN,
+  CStringGetDatum(cvalue),
+  ObjectIdGetDatum(attStruct->atttypid),
+  Int32GetDatum(attStruct->atttypmod));
+
+	/* update the tuple - set atthasmissing and attmissingval */
+	MemSet(valuesAtt, 0, sizeof(valuesAtt));
+	MemSet(nullsAtt, false, sizeof(nullsAtt));
+	MemSet(replacesAtt, false, sizeof(replacesAtt));
+
+	valuesAtt[Anum_pg_attribute_atthasmissing - 1] = BoolGetDatum(true);
+	replacesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
+	valuesAtt[Anum_pg_attribute_attmissingval - 1] = missingval;
+	replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
+
+	newtup = heap_modify_tuple(atttup, RelationGetDescr(attrrel),
+			   valuesAtt, nullsAtt, replacesAtt);
+	CatalogTupleUpdate(attrrel, &newtup->t_self, newtup);
+
+	/* clean up */
+	ReleaseSysCache(atttup);
+	heap_close(attrrel, RowExclusiveLock);
+	heap_close(tablerel, AccessExclusiveLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ea2f022..40d1eb0 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8103,6 +8103,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_attoptions;
 	int			i_attcollation;
 	int			i_attfdwoptions;
+	int			i_attmissingval;
 	PGresult   *res;
 	int			ntups;
 	bool		hasdefaults;
@@ -8132,7 +8133,34 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 
 		resetPQExpBuffer(q);
 
-		if (fout->remoteVersion >= 10)
+		if (fout->remoteVersion >= 11)
+		{
+			/* atthasmissing and attmissingval are new in 11 */
+			appendPQExpBuffer(q, "SELECT a.attnum, a.attname, a.atttypmod, "
+			  "a.attstattarget, a.attstorage, t.typstorage, "
+			  "a.attnotnull, a.atthasdef, a.attisdropped, "
+			  "a.attlen, a.attalign, a.attislocal, "
+			  "pg_catalog.format_type(t.oid,a.atttypmod) AS atttypname, "
+			  "array_to_string(a.attoptions, ', ') AS attoptions, "
+			  "CASE WHEN a.attcollation <> t.typcollation "
+			  "THEN a.attcollation ELSE 0 END AS attcollation, "
+			  

Re: Sample values for pg_stat_statements

2018-06-21 Thread Michael Paquier
On Thu, Jun 21, 2018 at 12:28:01PM +0200, Daniel Gustafsson wrote:
> I happened to notice that this patch was moved from Returned with Feedback to
> Needs Review after the CF closed, which means it’s now sitting open in a 
> closed
> CF.  The intended flow after RWF is that the patch is resubmitted, rather than
> status changed, but since it’s there now I guess we might as well move it to
> the current CF to make sure it’s not overlooked (which it will be in the
> current state)?

Indeed, done.
--
Michael


signature.asc
Description: PGP signature


Incorrect comments in commit_ts.c

2018-06-21 Thread shao bret
Hi,

Here is the comment for CommitTsPagePrecedes in commit_ts.c
/*
* Decide which of two CLOG page numbers is "older" for truncation purposes.
…
*/
I think there is a mistake in this comment.  It should be ‘commitTS page’, 
not ‘CLOG page’.

Thanks,
Br.
Bret

Sent from Mail for Windows 10



001-incorrect-comments.patch
Description: 001-incorrect-comments.patch


Re: Spilling hashed SetOps and aggregates to disk

2018-06-21 Thread Jeff Davis
On Thu, 2018-06-21 at 13:44 -0700, David Gershuni wrote:
> To handle hash collisions, we can do the following:
> 
> 1) We track the current hash code we’re processing, in ascending
> order.
> 
> 2) Instead of combining one group at a time, we’ll maintain a list
> of
> all groups we’ve seen that match the current hash code.
> 
> 3) When we peek at a spill file, if its next group’s hash code
> matches
> our current hash code, we’ll check to see if that group matches any
> of the
> groups in our list. If so, we’ll combine it with the matching group.
> If not,
> we’ll add this group to our list.
> 
> 4) Once the head of each spill file exceeds the current hash code,
> we’ll
> emit our list as output, empty it, and advance to the next hash code.
> 
> Does that seem reasonable?

It seems algorithmically reasonable but awfully complex. I don't think
that's a good path to take.

I still only see two viable approaches: 

(1) Spill tuples into hash partitions: works and is a reasonably simple
extension of our existing code. This is basically what the last patch I
posted does (posted before grouping sets, so needs to be updated).

(2) Spill tuples into a sort: I like this approach because it can be
done simply (potentially less code than #1), and could be further
improved without adding a ton of complexity. It may even eliminate the
need for the planner to choose between hashagg and sort. The problem
here is data types that have a hash opclass but no btree opclass. I
might try to sidestep this problem by saying that data types with no
btree opclass will not obey work_mem.

Additionally, there are two other ideas that we might want to do
regardless of which approach we take:

* support evicting transition states from the hash table, so that we
can handle clustered input better

* refactor grouping sets so that phases are separate executor nodes so
that we can undo some of the complexity in nodeAgg.c

Regards,
Jeff Davis




Re: PATCH: backtraces for error messages

2018-06-21 Thread Craig Ringer
On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, 
wrote:

>
>
> On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier 
> wrote:
>
>> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
>> > I wrote it because I got sick of Assert(false) debugging, and I was
>> chasing
>> > down some "ERROR:  08P01: insufficient data left in message" errors.
>> Then I
>> > got kind of caught up in it... you know how it is.
>>
>> Yes, I know that feeling!  I have been using as well the Assert(false)
>> and the upgrade-to-PANIC tricks a couple of times, so being able to get
>> more easily backtraces would be really nice.
>>
>>
> Sometime back I'd suggested an idea to be able to dynamically manage log
> levels for elog messages [1].
>


Huge +1 from me on being able to selectively manage logging on a
module/subsystem, file, or line level.

I don't think I saw the post.

Such a thing would obviously make built in backtrace support much more
useful.

>


Re: partition table and stddev() /variance() behaviour

2018-06-21 Thread David Rowley
On 22 June 2018 at 03:30, Tom Lane  wrote:
>> I think some coverage of the numerical aggregates is a good idea, so
>> I've added some in the attached. I managed to get a parallel plan
>> going with a query to onek, which is pretty cheap to execute. I didn't
>> touch the bool aggregates. Maybe I should have done that too..?
>
> This sort of blunderbuss testing was exactly what I *didn't* want to do.
> Not only is this adding about 20x as many cycles as we need (at least for
> this specific numeric_poly_combine issue), but I'm quite afraid that the
> float4 and/or float8 cases will show low-order-digit irreproducibility
> in the buildfarm.

okay. My sniper rifle was locked away for the evening. I decided it
was best to sleep before any careful aiming was required.

I see you've done the deed already. Thanks.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-21 Thread Bruce Momjian
On Thu, Jun 21, 2018 at 12:12:40PM -0500, Nico Williams wrote:
> On Thu, Jun 21, 2018 at 10:14:54AM -0400, Bruce Momjian wrote:
> > On Wed, Jun 20, 2018 at 04:57:18PM -0500, Nico Williams wrote:
> > > Client-side crypto is hard to do well and still get decent performance.
> > > So on the whole I think that crypto is a poor fit for the DBAs-are-the-
> > > threat threat model.  It's better to reduce the number of DBAs/sysadmins
> > > and audit all privileged (and, for good measure, unprivileged) access.
> > 
> > Yeah, kind of.  There is the value of preventing accidental viewing of
> > the data by the DBA, and of course WAL and backup encryption are nice.
> 
> One generally does not use crypto to prevent "accidental" viewing of
> plaintext, but to provide real security relative to specific threats.
> 
> If you stop at encrypting values with no integrity protection for the
> PKs, and no binding to TX IDs and such, you will indeed protect against
> accidental viewing of the plaintext, but not against a determined
> malicious insider.
> 
> Is that worthwhile?  Remember: you'll have to reduce and audit sysadmin
> & DBA access anyways.
> 
> There is also the risk that users won't understand the limitations of
> this sort of encryption feature and might get a false sense of security
> from [mis]using it.
> 
> I'd want documentation to make it absolutely clear that such a feature
> is only meant to reduce the risk of accidental viewing of plaintext by
> DBAs and not a real security feature.

Agreed.  I can see from this discussion that we have a long way to go
before we can produce something clearly useful, but it will be worth it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: PATCH: backtraces for error messages

2018-06-21 Thread Andres Freund
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, 
> wrote:
> 
> >
> >
> > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier 
> > wrote:
> >
> >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> >> > I wrote it because I got sick of Assert(false) debugging, and I was
> >> chasing
> >> > down some "ERROR:  08P01: insufficient data left in message" errors.
> >> Then I
> >> > got kind of caught up in it... you know how it is.
> >>
> >> Yes, I know that feeling!  I have been using as well the Assert(false)
> >> and the upgrade-to-PANIC tricks a couple of times, so being able to get
> >> more easily backtraces would be really nice.
> >>
> >>
> > Sometime back I'd suggested an idea to be able to dynamically manage log
> > levels for elog messages [1].
> >
> 
> 
> Huge +1 from me on being able to selectively manage logging on a
> module/subsystem, file, or line level.
> 
> I don't think I saw the post.
> 
> Such a thing would obviously make built in backtrace support much more
> useful.

I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.

- Andres



Re: Considering signal handling in plpython again

2018-06-21 Thread Hubert Zhang
Hi Heikki,
Not working on it now, you can go ahead.

On Fri, Jun 22, 2018 at 12:56 AM, Heikki Linnakangas 
wrote:

> Hi Hubert,
>
> Are you working on this, or should I pick this up? Would be nice to get
> this done as soon as v12 development begins.
>
> - Heikki
>



-- 
Thanks

Hubert Zhang


Re: PANIC during crash recovery of a recently promoted standby

2018-06-21 Thread Michael Paquier
On Fri, Jun 22, 2018 at 10:08:24AM +0530, Pavan Deolasee wrote:
> On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier 
> wrote:
>> So an extra pair of eyes from another committer would be
>> welcome.  I am letting that cool down for a couple of days now.
> 
> I am not a committer, so don't know if my pair of eyes count, but FWIW the
> patch looks good to me except couple of minor points.

Thanks for grabbing some time, Pavan.  Any help is welcome!

> +/*
> + * Local copies of equivalent fields in the control file.  When running
> + * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
> + * expect to replay all the WAL available, and updateMinRecoveryPoint is
> + * switched to false to prevent any updates while replaying records.
> + * Those values are kept consistent as long as crash recovery runs.
> + */
>  static XLogRecPtr minRecoveryPoint; /* local copy of
>   * ControlFile->minRecoveryPoint */
> 
> The inline comment looks unnecessary now that we have comment at the
> top.

I'll fix that.

> @@ -4266,6 +4276,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr
> RecPtr, int emode,
>   minRecoveryPoint = ControlFile->minRecoveryPoint;
>   minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
> 
> + /*
> + * The startup process can update its local copy of
> + * minRecoveryPoint from that point.
> + */
> 
> s/that/this

This one too.
--
Michael


signature.asc
Description: PGP signature


Re: PANIC during crash recovery of a recently promoted standby

2018-06-21 Thread Michael Paquier
On Thu, Jun 07, 2018 at 07:58:29PM +0900, Kyotaro HORIGUCHI wrote:
> (I believe that) By definition recovery doesn't end until the
> end-of-recovery checkpoint ends, so from that viewpoint I think it
> is wrong to clear ControlFile->minRecoveryPoint before the end.
> 
> Invalid-page checking during crash recovery is harmful rather than
> useless. It is done by CheckRecoveryConsistency even in crash
> recovery, against its expectation, because there's a case where
> minRecoveryPoint is valid but InArchiveRecovery is false. The two
> variables there seem to be in contradiction with each other.
> 
> The immediate cause of the contradiction is that StartupXLOG wrongly
> initializes the local minRecoveryPoint from the control file's value even
> under crash recovery. updateMinRecoveryPoint also should be
> turned off during crash recovery. The case of a crash after the last
> checkpoint's end has been treated in UpdateMinRecoveryPoint, but it
> forgot to consider this case, where crash recovery is started
> while the control file is still holding a valid
> minRecoveryPoint.

I have been digesting your proposal and I reviewed it, and I think that
what you are proposing here is on the right track.  However, the updates
around minRecoveryPoint and minRecoveryPointTLI ought to be more
consistent because that could cause future issues.  I have spotted two
bugs where I think the problem is not fixed: when replaying a WAL record
XLOG_PARAMETER_CHANGE, minRecoveryPoint and minRecoveryPointTLI would
still get updated from the control file values which can still lead to
failures as CheckRecoveryConsistency could still happily trigger a
PANIC, so I think that we had better maintain those values consistent as
long as crash recovery runs.  And XLogNeedsFlush() also has a similar
problem.

Note that there is as well the case where the startup process switches
from crash recovery to archive recovery, in which case the update of the
local copies has to happen once the switch is done.  Your patch covers
that by just updating updateMinRecoveryPoint each time crash recovery
happens, but that's not completely consistent, and it seems that it also
missed the fact that updateMinRecoveryPoint needs to be switched back to
true as the startup process can update the control file.  Actually,
updateMinRecoveryPoint needs to be switched back to true in that case or
the startup process would not be able to update minRecoveryPoint when it
calls XLogFlush for example.

There is the point of trying to get rid of updateMinRecoveryPoint, which
has crossed my mind, but that's not wise as its default value allows
the checkpointer to update minRecoveryPoint when started, which also has to happen
once the startup process gets out of recovery and tells the postmaster
to start the checkpointer.  For backends as well that's a sane default.

There is also no point in taking ControlFileLock when checking if the
local copy of minRecoveryPoint is valid or not, so this can be
bypassed.

The assertion in CheckRecoveryConsistency is definitely a good idea as
this should only be called by the startup process, so we can keep it.

In the attached, I have fixed the issues I found and added the test case
which should be included in the final commit.  Its concept is pretty
simple, the idea is to keep the local copy of minRecoveryPoint to
InvalidXLogRecPtr as long as crash recovery runs, and is switched back
to normal if there is a switch to archive recovery after a crash
recovery.

This is not really a complicated patch, and it took a lot of energy from
me the last couple of days per the nature of the many scenarios to think
about...  So an extra pair of eyes from another committer would be
welcome.  I am letting that cool down for a couple of days now.
--
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adbd6a2126..ee837843f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -821,6 +821,13 @@ static XLogSource XLogReceiptSource = 0;	/* XLOG_FROM_* code */
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
 
+/*
+ * Local copies of equivalent fields in the control file.  When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
 static XLogRecPtr minRecoveryPoint; /* local copy of
 	 * ControlFile->minRecoveryPoint */
 static TimeLineID minRecoveryPointTLI;
@@ -2711,20 +2718,26 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
 		return;
 
+	/*
+	 * An invalid minRecoveryPoint means that we need to recover all the WAL,
+	 * i.e., we're doing crash recovery.  We never modify the control file's
+	 * value in that case, so 

Re: Incorrect comments in commit_ts.c

2018-06-21 Thread Michael Paquier
On Fri, Jun 22, 2018 at 03:45:01AM +, shao bret wrote:
> I think there is some mistake for this comment.  It should be
> ‘commitTS page’, not ‘CLOG page’. 

You are right, so pushed.
--
Michael


signature.asc
Description: PGP signature


Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-21 Thread Jeevan Chalke
On Wed, Jun 20, 2018 at 7:11 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Tue, Jun 19, 2018 at 2:13 PM, Jeevan Chalke
>  wrote:
> >
> >
> > In the reported testcase, parallel_workers is set to 0 for all partition
> > (child) relations. Which means partial parallel paths are not possible
> for
> > child rels. However, the parent can have partial parallel paths. Thus,
> when
> > we have a full partitionwise aggregate possible (as the group by clause
> > matches with the partition key), we end-up in a situation where we do
> create
> > a partially_grouped_rel for the parent but there won't be any
> > partially_grouped_live_children.
> >
> > The current code was calling add_paths_to_append_rel() without making
> sure
> > of any live children present or not (sorry, it was my fault). This
> causes an
> > Assertion failure in add_paths_to_append_rel() as this function assumes
> that
> > it will have non-NIL live_childrels list.
> >
>
> Thanks for the detailed explanation.
>
> > I have fixed partitionwise aggregate code which is calling
> > add_paths_to_append_rel() by checking the live children list correctly.
> And
> > for robustness, I have also added an Assert() in
> add_paths_to_append_rel().
> >
> > I have verified the callers of add_paths_to_append_rel() and except one,
> all
> > are calling it by making sure that they have a non-NIL live children
> list.
>
> I actually thought about moving if(live_childrel != NIL) test inside
> this function, but then every caller is doing something different when
> that condition occurs. So doesn't help much.
>
> > The one which is calling add_paths_to_append_rel() directly is
> > set_append_rel_pathlist(). And I think, at that place, we will never
> have an
> > empty live children list,
>
> Yes, I think so too. That's because set_append_rel_size() should have
> marked such a parent as dummy and subsequent set_rel_pathlist() would
> not create any paths for it.
>
> Here are some review comments.
>
> +/* We should end-up here only when we have few live child rels. */
>
> I think the right wording is " ... we have at least one child." I was
> actually
> thinking if we should just return from here when live_children == NIL. But
> then
> we will lose an opportunity to catch a bug earlier than set_cheapest().
>

Done.


>
> + * However, if there are no live children, then we cannot create any
> append
> + * path.
>
> I think "no live children" is kind of misleading here. We usually use that
> term
> to mean at least one non-dummy child. But in this case there is no child at
> all, so we might want to better describe this situation. May be also
> explain
> when this condition can happen.
>

Done. Tried re-phrasing it. Please check.


> +if (patype == PARTITIONWISE_AGGREGATE_FULL && grouped_live_children
> != NIL)
>
> I think for this to happen, the parent grouped relation and the underlying
> scan
> itself should be dummy. So, would an Assert be better? I think we have
> discussed this earlier, but I can not spot the mail.
>

Yep, Assert() will be better. Done.


>
> +-- Test when partition tables has parallel_workers = 0 but not the parent
>
> Better be worded as "Test when parent can produce parallel paths but not
> any of
> its children". parallel_worker = 0 is just a means to test that. Although
> the
> EXPLAIN output below doesn't really reflect that parent can have parallel
> plans. Is it possible to create a scenario to show that.
>

Added test.


>
> I see that you have posted some newer versions of this patch, but I
> think those still need to address these comments.
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>


Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From dd9951b3fd92a61d1b9fb02019a3af5b4d8e3b93 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Wed, 20 Jun 2018 18:47:12 +0530
Subject: [PATCH] Make sure that we have live children before we append them.

In create_partitionwise_grouping_paths(), we were calling
add_paths_to_append_rel() on empty live children which causing
an Assertion failure inside it. Thus, prevent calling
add_paths_to_append_rel() when there are no live children.

In passing, add an Assert() in add_paths_to_append_rel() to
check that it receives a valid live children list.

Also, it is very well possible that we could not have a
partially_grouped_rel for some of the children for which aggregation
in partial is not feasible. In such cases, we will end up having
fewer children in the partially_grouped_live_children list. Don't
create append rel in that case.

Jeevan Chalke
---
 src/backend/optimizer/path/allpaths.c |   3 +
 src/backend/optimizer/plan/planner.c  |  25 +-
 src/test/regress/expected/partition_aggregate.out | 105 ++
 src/test/regress/sql/partition_aggregate.sql  |  24 +
 4 files changed, 155 insertions(+), 

Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-21 Thread Jeevan Chalke
Hi,

Off-list, Ashutosh Bapat has suggested using a flag instead of counting the
number of dummy rels and then manipulating it. That will be simpler and
smoother.

I agree with his suggestion and updated my patch accordingly.

Thanks

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 41c885a31eb61538dc762c74f4d9782da1db9bc8 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Fri, 22 Jun 2018 10:59:24 +0530
Subject: [PATCH] Make sure that we have live children before we append them.

Since it is very well possible that we could not have a
partially_grouped_rel for some of (or all) the children for which
aggregation in partial is not feasible.  In that case, we should not
try to create any append path by calling add_paths_to_append_rel().
Calling it when no child present to append will result in a server
crash. And appending only a few children will result into data loss.
Thus, don't try to create an append path at first place itself.

In passing, add an Assert() in add_paths_to_append_rel() to check
that it receives a valid live children list.

Jeevan Chalke, reviewed by Ashutosh Bapat.
---
 src/backend/optimizer/path/allpaths.c |   3 +
 src/backend/optimizer/plan/planner.c  |  24 +++--
 src/test/regress/expected/partition_aggregate.out | 105 ++
 src/test/regress/sql/partition_aggregate.sql  |  24 +
 4 files changed, 148 insertions(+), 8 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3ada379..9f3d725 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1391,6 +1391,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	bool		build_partitioned_rels = false;
 	double		partial_rows = -1;
 
+	/* We should end-up here only when we have at least one child. */
+	Assert(live_childrels != NIL);
+
 	/* If appropriate, consider parallel append */
 	pa_subpaths_valid = enable_parallel_append && rel->consider_parallel;
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 67a2c7a..46128b4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7012,12 +7012,14 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
 	PathTarget *target = grouped_rel->reltarget;
+	bool		found_partially_grouped_child;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
 		   partially_grouped_rel != NULL);
 
 	/* Add paths for partitionwise aggregation/grouping. */
+	found_partially_grouped_child = true;
 	for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
 	{
 		RelOptInfo *child_input_rel = input_rel->part_rels[cnt_parts];
@@ -7091,6 +7093,8 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 lappend(partially_grouped_live_children,
 		child_partially_grouped_rel);
 		}
+		else if (found_partially_grouped_child)
+			found_partially_grouped_child = false;
 
 		if (patype == PARTITIONWISE_AGGREGATE_FULL)
 		{
@@ -7103,20 +7107,20 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	}
 
 	/*
-	 * All children can't be dummy at this point. If they are, then the parent
-	 * too marked as dummy.
-	 */
-	Assert(grouped_live_children != NIL ||
-		   partially_grouped_live_children != NIL);
-
-	/*
 	 * Try to create append paths for partially grouped children. For full
 	 * partitionwise aggregation, we might have paths in the partial_pathlist
 	 * if parallel aggregation is possible.  For partial partitionwise
 	 * aggregation, we may have paths in both pathlist and partial_pathlist.
+	 *
+	 * However, it is very well possible that we could not have a
+	 * partially_grouped_rel for some of (or all) the children for which
+	 * aggregation in partial is not feasible.  In that case, we cannot create
+	 * any append path.
 	 */
-	if (partially_grouped_rel)
+	if (partially_grouped_rel && found_partially_grouped_child)
 	{
+		Assert(partially_grouped_live_children != NIL);
+
 		add_paths_to_append_rel(root, partially_grouped_rel,
 partially_grouped_live_children);
 
@@ -7130,7 +7134,11 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 
 	/* If possible, create append paths for fully grouped children. */
 	if (patype == PARTITIONWISE_AGGREGATE_FULL)
+	{
+		Assert(grouped_live_children != NIL);
+
 		add_paths_to_append_rel(root, grouped_rel, grouped_live_children);
+	}
 }
 
 /*
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209..d286050 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -1394,3 +1394,108 @@ SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) <
  11 | 16500 

Re: PATCH: backtraces for error messages

2018-06-21 Thread Pavan Deolasee
On Fri, Jun 22, 2018 at 6:18 AM, Andres Freund  wrote:

> On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> >
> >
> > Huge +1 from me on being able to selectively manage logging on a
> > module/subsystem, file, or line level.
> >
> > I don't think I saw the post.
> >
> > Such a thing would obviously make built in backtrace support much more
> > useful.
>
> I strongly suggest keeping these as separate as possible. Either is
> useful without the other, and both are nontrivial. Tackling them
> together imo makes it much more likely to get nowhere.
>
>
Sorry, I did not mean to mix up two patches. I brought it up just in case
it provides another idea about when and how to log the backtrace. So yeah,
let's discuss that patch separately.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key) Management Service (KMS)

2018-06-21 Thread Nico Williams
On Thu, Jun 21, 2018 at 07:46:35PM -0400, Bruce Momjian wrote:
> Agreed.  I can see from this discussion that we have a long way to go
> before we can produce something clearly useful, but it will be worth it.

Let's start with a set of threat models then.  I'll go first:

1) storage devices as the threat
   a) theft of storage devices
   b) malicious storage device operators

2) malicious backup operators as the threat

3) malicious servers as the threat
   a) compromised servers
   b) insider threat -- rogue admins

4) malicious clients as the threat
   a) compromised clients
   b) insider threat

5) passive adversaries on the network as the threat

6) active adversaries on the network as the threat

7) adversaries on the same host as the server or client


Am I missing any?


For example, modern version control systems that use a Merkle hash tree
have malicious servers as part of their threat model.  Git clients, for
example, can detect non-fast-forward history changes upstream.

For another example, DNSSEC also provides protection against malicious
servers by authenticating not the servers but the _data_.  DNSSEC is a
useful case in point because it's effectively a key/value database that
stores somewhat relational data...


Clearly PG currently covers threat models 4 through 7:

 - passive adversaries on the network (addressed via TLS)
 - active adversaries on the network (addressed via TLS)
 - local adversaries (addressed by standard OS user process isolation)
 - malicious clients (addressed via authentication and authorization)

(1) and (2) can be covered externally:

 - protection against malicious storage or backup operators is trivial
   to provide: just use encrypting filesystems or device drivers, and
   encrypt backups using standard technologies.

One shortcoming of relying on OS functionality for protection against
malicious storage is that not all OSes may provide such functionality.
This could be an argument for implementing full, transparent encryption
for an entire DB in the postgres server.  Not a very compelling
argument, but that's just my opinion -- reasonable people could differ
on this.


PG also authenticates servers, but does nothing to authenticate the data
or functions of the server.  So while PG protects against illegitimate
server impersonators as well as TLS/GSS/SCRAM/... will afford, it does
not protect against rogue server admins nor against compromised servers.


That leaves (3) as the only threat model not covered.  It's also the
most challenging threat model to deal with.

Now, if you're going to protect against malicious servers (insiders)...

 - you can't let the server see any sensitive plaintext (must encrypt it)
 - which includes private/secret keys (the server can't have them, only
   the clients can)
 - you have to not only encrypt but provide integrity protection for
   ciphertext as well as unencrypted plaintext
 - decryption and integrity protection validation can only be done on
   the client (because only they have the necessary secrets!)

There are a lot of choices to make here that will greatly affect any
analysis of the security of the result.

A full analysis will inexorably lead to one conclusion: it's better to
just not have malicious servers (insiders), because if you really have
to defend against them then the only usable models of how to apply
cryptography to the problem are a) Git-like VCS, b) DNSSEC, and both are
rather heavy-duty for a general-purpose RDBMS.

So I think for (3) the best answer is to just not have that problem:
just reduce and audit admin access.

Still, if anyone wants to cover (3), I argue that PG gives you
everything you need right now: FDW and pgcrypto.  Just build a
solution where you have a PG server proxy that acts as a smart
client to untrusted servers:

              +----------------+       +----------------+
  +--------+  |                |       |                |
  |        |  |   postgres     |       |   postgres     |
  |        |  |   (proxy)      |       |  (real server) |
  | Client |->|                |------>|                |
  |        |  |                |       |                |
  |        |  |  (keys here)   |       | (no keys here) |
  +--------+  |                |       |                |
              +----------------+       +----------------+

In the proxy use FDW (to talk to the real server) and VIEWs with INSTEAD
OF triggers to do all crypto transparently to the client.

Presto.  Transparent crypto right in your queries and DMLs.

But, you do have to get a number of choices right as to the crypto, and
chances are you won't provide integrity protection for the entire DB
(see above).

Nico
-- 



Re: [HACKERS] Pluggable storage

2018-06-21 Thread Haribabu Kommi
On Thu, Jun 14, 2018 at 12:25 PM Amit Kapila 
wrote:

> On Thu, Jun 14, 2018 at 1:50 AM, Haribabu Kommi
>  wrote:
> >
> > On Fri, Apr 20, 2018 at 4:44 PM Haribabu Kommi  >
> > wrote:
> >
> > VACUUM:
> > Not much changes are done in this apart moving the Vacuum visibility
> > functions as part of the
> > storage. But idea for the Vacuum was with each access method can define
> how
> > it should perform.
> >
>
> We are planning to have a somewhat different mechanism for vacuum (for
> non-delete marked indexes), so if you can provide some details or
> discuss what you have in mind before implementation of same, that
> would be great.
>

OK. Thanks for your input. We will discuss the changes before proceeding to
code.

Apart from this, the pluggable storage API patch contains some refactoring
changes along with the API; some of the refactoring changes are:

1. Change the snapshot satisfies type from a function to an enum
2. Try to always return a palloc'ed tuple instead of a pointer into the buffer
   (this change may have a performance impact, so it can be done later).
3. Perform the tuple visibility check in the heap itself for the page-mode
scenario also
4. New function ExecSlotCompare to compare two slots or tuples by storing them
in a temp slot.
5. heap_fetch and heap_lock_tuple return a palloc'ed tuple, not a pointer to
the buffer
6. The index insertion logic decision is moved into the heap itself (insert,
update), not the executor.
7. Split HeapScanDesc into two and remove usage of the second split outside
the heap
8. Move the tuple traversal and providing of the updated record to the heap.

Is it fine to create these changes as separate patches, so they can go in if
the changes are fine and don't have any impact?

Any comments or additions or deletions to the above list?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: PANIC during crash recovery of a recently promoted standby

2018-06-21 Thread Pavan Deolasee
On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier 
wrote:

>
>
> This is not really a complicated patch, and it took a lot of energy from
> me the last couple of days per the nature of the many scenarios to think
> about...


Thanks for the efforts. It wasn't an easy bug to chase to begin with. So I
am not surprised there were additional problems that I missed.


> So an extra pair of eyes from another committer would be
> welcome.  I am letting that cool down for a couple of days now.
>

I am not a committer, so don't know if my pair of eyes count, but FWIW the
patch looks good to me except couple of minor points.

+/*
+ * Local copies of equivalent fields in the control file.  When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
 static XLogRecPtr minRecoveryPoint; /* local copy of
  * ControlFile->minRecoveryPoint */

The inline comment looks unnecessary now that we have the comment at the top.


@@ -4266,6 +4276,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr
RecPtr, int emode,
  minRecoveryPoint = ControlFile->minRecoveryPoint;
  minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;

+ /*
+ * The startup process can update its local copy of
+ * minRecoveryPoint from that point.
+ */

s/that/this

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PANIC during crash recovery of a recently promoted standby

2018-06-21 Thread Kyotaro HORIGUCHI
Hello, sorry for the absence; I have looked at the second patch.

At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier  wrote 
in <20180622044521.gc5...@paquier.xyz>
> On Fri, Jun 22, 2018 at 10:08:24AM +0530, Pavan Deolasee wrote:
> > On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier 
> > wrote:
> >> So an extra pair of eyes from another committer would be
> >> welcome.  I am letting that cool down for a couple of days now.
> > 
> > I am not a committer, so don't know if my pair of eyes count, but FWIW the
> > patch looks good to me except couple of minor points.
> 
> Thanks for grabbing some time, Pavan.  Any help is welcome!

In a previous mail:
> I have spotted two
> bug where I think the problem is not fixed: when replaying a WAL record
> XLOG_PARAMETER_CHANGE, minRecoveryPoint and minRecoveryPointTLI would
> still get updated from the control file values which can still lead to
> failures as CheckRecoveryConsistency could still happily trigger a
> PANIC, so I think that we had better maintain those values consistent as

The fixes in StartupXLOG, CheckRecoveryConsistency, ReadRecord and
xlog_redo look fine (functionally; but see below).

> long as crash recovery runs.  And XLogNeedsFlush() also has a similar
> problem.

Here, on the other hand, the patch turns off updateMinRecoveryPoint if
minRecoveryPoint is invalid while RecoveryInProgress() == true. However,
RecoveryInProgress() == true means archive recovery has already started,
and minRecoveryPoint *should* be updated under that condition. Indeed,
minRecoveryPoint is updated just below. If this is really the right thing
to do, I think some explanation of the reason is needed here.

In xlog_redo there is still a "minRecoveryPoint != 0", which ought to be
written as "!XLogRecPtrIsInvalid(minRecoveryPoint)". (It seems to be the
only one. The double negation is a bit awkward, but there are many
instances of this kind of coding.)
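
Concretely, the test in question would read roughly like this (a sketch
only; the surrounding condition follows the shape of the existing code):

	/* InvalidXLogRecPtr is 0, so this is equivalent to "minRecoveryPoint != 0" */
	if (!XLogRecPtrIsInvalid(minRecoveryPoint) && minRecoveryPoint < lsn)
	{
		ControlFile->minRecoveryPoint = lsn;
		ControlFile->minRecoveryPointTLI = ThisTimeLineID;
	}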

# I'll go all-out next week.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS))

2018-06-21 Thread Tsunakawa, Takayuki
From: Nico Williams [mailto:n...@cryptonector.com]
> Let's start with a set of threat models then.  I'll go first:

Thank you so much for summarizing the current situation.  I'd appreciate it if 
you could write this up on the PostgreSQL wiki once the discussion has settled 
somewhat.


>  - local adversaries (addressed by standard OS user process isolation)

Does this also mean that we don't have to worry about the following?

* unencrypted data in the server process memory and core files
* passwords in .pgpass and recovery.conf (someone familiar with PCI DSS audit 
said this is a problem)
* user data in server logs


> One shortcoming of relying on OS functionality for protection against
> malicious storage is that not all OSes may provide such functionality.
> This could be an argument for implementing full, transparent encryption
> for an entire DB in the postgres server.  Not a very compelling
> argument, but that's just my opinion -- reasonable people could differ
> on this.

Yes, this is one reason I developed TDE in our product.  And in-database 
encryption allows optimization by encrypting only user data.


> So I think for (3) the best answer is to just not have that problem:
> just reduce and audit admin access.
> 
> Still, if anyone wants to cover (3), I argue that PG gives you
> everything you need right now: FDW and pgcrypto.  Just build a
> solution where you have a PG server proxy that acts as a smart
> client to untrusted servers:

Does sepgsql help?


Should a malfunctioning or buggy application be considered a threat?  
That's what the sql_firewall extension addresses.

Regards
Takayuki Tsunakawa






Re: Push down Aggregates below joins

2018-06-21 Thread Antonin Houska
Tomas Vondra  wrote:

> On 06/20/2018 10:12 PM, Heikki Linnakangas wrote:
> > Currently, the planner always first decides the scan/join order, and
> > adds Group/Agg nodes on top of the joins. Sometimes it would be legal,
> > and beneficial, to perform the aggregation below a join. I've been
> > hacking on a patch to allow that.
> > 
> 
> There was a patch [1] from Antonin Houska aiming to achieve something
> similar. IIRC it aimed to push the aggregate down in more cases,
> leveraging the partial aggregation stuff.

Yes, I interrupted the work when it became clear that it wouldn't find its way
into v11. I'm about to get back to it next week, and at least rebase it so it
can be applied to the current master branch.
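
For reference, this is the kind of query shape (illustrative schema) where
pushing a partial aggregate below the join can pay off:

SELECT c.name, sum(o.amount)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;

-- Today the planner joins first and aggregates on top.  If orders is much
-- larger than customers, it can be cheaper to partially aggregate orders
-- by customer_id first, join the (much smaller) result to customers, and
-- finalize the aggregate above the join.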

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-21 Thread Amit Khandekar
On 20 June 2018 at 14:28, Amit Khandekar  wrote:
> On 20 June 2018 at 14:24, Amit Kapila  wrote:
>> On Mon, Jun 18, 2018 at 3:11 PM, Amit Khandekar  
>> wrote:
>>> On 16 June 2018 at 10:44, Amit Kapila  wrote:
 On Thu, Jun 14, 2018 at 10:05 PM, Tom Lane  wrote:
>
> It looks to me like traversal of the partial subpaths is the right
> thing here, in which case we should do
>
> -   foreach(l, subpaths)
> +   foreach(l, pathnode->subpaths)
>
> or perhaps better
>
> -   pathnode->subpaths = list_concat(subpaths, partial_subpaths);
> +   pathnode->subpaths = subpaths = list_concat(subpaths, 
> partial_subpaths);
>
> to make the behavior clear and consistent.
>

 I agree with your analysis and proposed change.  However, I think in
 practice, it might not lead to any bug as in the loop, we are
 computing parallel_safety and partial_subpaths should be
 parallel_safe.
>>>
>>> Will have a look at this soon.
>>>
>>
>> Did you get a chance to look at it?
>
> Not yet, but I have planned to do this by tomorrow.


After list_concat, subpaths no longer contains only the non-partial paths it
is supposed to hold, so it should not be used in the subsequent code of that
function anyway. So I think the following change should be good:
-   foreach(l, subpaths)
+   foreach(l, pathnode->subpaths)

The attached patch contains the above change.

You are right that the partial paths do not need to be used for
evaluating the pathnode->path.parallel_safe field. But since we also
have the parameterization-related assertion in the same loop, I think
it is OK to do both things in one loop, covering all paths.
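
With the change applied, the loop would look roughly like this (a simplified
fragment following the shape of the existing create_append_path code;
declarations omitted):

	pathnode->subpaths = list_concat(subpaths, partial_subpaths);

	foreach(l, pathnode->subpaths)	/* iterate the combined list */
	{
		Path	   *subpath = (Path *) lfirst(l);

		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
			subpath->parallel_safe;

		/* All child paths must have the same parameterization */
		Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
	}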

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


fix_subpaths_issue.patch
Description: Binary data


Re: Fix some error handling for read() and errno

2018-06-21 Thread Michael Paquier
On Wed, Jun 20, 2018 at 02:52:23PM +0200, Magnus Hagander wrote:
> On Mon, Jun 18, 2018 at 6:42 AM, Michael Paquier 
> wrote:
>> Okay, so this makes two votes in favor of keeping the messages simple
>> without context, shaped like "could not read file %s", with Robert and
>> Alvaro, and two votes for keeping some context with Andres and I.
>> Anybody else has an opinion to offer?
>>
> 
> Count me in with Robert and Alvaro with a +0.5 :)

Thanks for your opinion.

>> Please note that I think that some messages should keep some context
>> anyway, for example the WAL-related information is useful.  An example
>> is this one where the offset is good to know about:
>> +   ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
>> +   (errmsg("could not read from log segment %s, offset %u: read
>> %d bytes, expected %d",
>> +   fname, readOff, r, XLOG_BLCKSZ)));
> 
> Yeah, I think you'd need to make a call for the individual message to see
> how much it helps. In this one the context definitely does, in some others
> it doesn't.

Sure.  I have looked at that, and the updated version is attached.

I removed the portion about SLRU from the code, but I would like to add at
least a mention of it in the commit message.  The simplifications are
also applied to the control-file messages you changed in cfb758b.  Also,
in ReadControlFile() we use "could not open control file", so for
consistency this should be replaced with a simpler message.  I noticed as
well that the 2PC file handling loses errno a couple of times, and that we
had better keep those messages consistent too if we do the simplification
move...  relmapper.c also gains simplifications.

All the incorrect errno handling I found should be back-patched, in my
opinion, as we did in the past with 2a11b188 for example.  I am not
really willing to bother with the error-message improvements on
anything other than HEAD (not v11, but v12), but...  Opinions on all
that?
--
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 65194db70e..f72c149d46 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1219,6 +1219,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	uint32		crc_offset;
 	pg_crc32c	calc_crc,
 file_crc;
+	int			r;
 
 	TwoPhaseFilePath(path, xid);
 
@@ -1228,8 +1229,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 		if (give_warnings)
 			ereport(WARNING,
 	(errcode_for_file_access(),
-	 errmsg("could not open two-phase state file \"%s\": %m",
-			path)));
+	 errmsg("could not open file \"%s\": %m", path)));
 		return NULL;
 	}
 
@@ -1245,8 +1245,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 		if (give_warnings)
 			ereport(WARNING,
 	(errcode_for_file_access(),
-	 errmsg("could not stat two-phase state file \"%s\": %m",
-			path)));
+	 errmsg("could not stat file \"%s\": %m", path)));
 		return NULL;
 	}
 
@@ -1272,15 +1271,27 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	buf = (char *) palloc(stat.st_size);
 
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_READ);
-	if (read(fd, buf, stat.st_size) != stat.st_size)
+	r = read(fd, buf, stat.st_size);
+	if (r != stat.st_size)
 	{
+		int			save_errno = errno;
+
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
 		if (give_warnings)
-			ereport(WARNING,
-	(errcode_for_file_access(),
-	 errmsg("could not read two-phase state file \"%s\": %m",
-			path)));
+		{
+			if (r < 0)
+			{
+errno = save_errno;
+ereport(WARNING,
+		(errcode_for_file_access(),
+		 errmsg("could not read file \"%s\": %m", path)));
+			}
+			else
+ereport(WARNING,
+		(errmsg("could not read file \"%s\": read %d bytes, expected %zu",
+path, r, stat.st_size)));
+		}
 		pfree(buf);
 		return NULL;
 	}
@@ -1627,8 +1638,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
 	(errcode_for_file_access(),
-	 errmsg("could not remove two-phase state file \"%s\": %m",
-			path)));
+	 errmsg("could not remove file \"%s\": %m", path)));
 }
 
 /*
@@ -1656,26 +1666,31 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	if (fd < 0)
 		ereport(ERROR,
 (errcode_for_file_access(),
- errmsg("could not recreate two-phase state file \"%s\": %m",
-		path)));
+ errmsg("could not recreate file \"%s\": %m", path)));
 
 	/* Write content and CRC */
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_WRITE);
 	if (write(fd, content, len) != len)
 	{
+		int			save_errno = errno;
+
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
+		errno = save_errno;
 		ereport(ERROR,
 (errcode_for_file_access(),
- errmsg("could not write two-phase state file: %m")));
+ errmsg("could not write file \"%s\": %m", path)));
 	}
 	if (write(fd, _crc, 

bug with expression index on partition

2018-06-21 Thread Amit Langote
Hi.

I noticed a bug with creating an expression index on a partitioned table.
The fact that a partition may have differing attribute numbers is not
accounted for when recursively *creating* the index on a partition.  The
partition index gets created but is unusable.

create table p (a int) partition by hash (a);
create table p1 partition of p for values with (modulus 3, remainder 0);
create table p2 partition of p for values with (modulus 3, remainder 1);
create table p3 partition of p for values with (modulus 3, remainder 2);

-- make p3 have different attribute number for column a
alter table p detach partition p3;
alter table p3 drop a, add a int;
alter table p attach partition p3 for values with (modulus 3, remainder 2);

create index on p ((a+1));
\d p3
ERROR:  invalid attnum 1 for relation "p3"

That's because the IndexStmt that's passed for creating the index on p3
contains expressions whose Vars are exact copies of the parent's Vars,
which in this case is wrong.  We should have converted them to p3's
attribute numbers before calling DefineIndex on p3.

Note that if the same index already exists in a partition before the
parent's index is created, then that index already contains the right Vars,
so no conversion is needed in that case.

alter table p detach partition p3;
alter table p3 drop a, add a int;
create index on p3 ((a+1));
alter table p attach partition p3 for values with (modulus 3, remainder 2);
\d p3
                Table "public.p3"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: p FOR VALUES WITH (modulus 3, remainder 2)
Indexes:
    "p3_expr_idx" btree ((a + 1))

create index on p ((a+1));
\d p3
                Table "public.p3"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: p FOR VALUES WITH (modulus 3, remainder 2)
Indexes:
    "p3_expr_idx" btree ((a + 1))

In other words, the code that checks whether an index with a matching
definition already exists in the partition correctly converts the Vars
before doing the actual matching, so it is able to conclude that, despite
the different Vars in p3's index, it is the same index as the one being
created on the parent.  So the index is not created again; if it were, it
would be hit by the very same bug I'm reporting.

The case where the partition being attached doesn't have the index but the
parent does is also handled correctly: the parent's index is correctly
*cloned* to the attached partition.

alter table p detach partition p3;
alter table p3 drop a, add a int;
create index on p ((a+1));
alter table p attach partition p3 for values with (modulus 3, remainder 2);
\d p3
                Table "public.p3"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: p FOR VALUES WITH (modulus 3, remainder 2)
Indexes:
    "p3_expr_idx" btree ((a + 1))


So, CompareIndexInfo and generateClonedIndexStmt are both doing the right
thing, but DefineIndex is not.  Attached is a patch to fix DefineIndex so
that it converts indexParams before recursing to create the index on a
partition.
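
Roughly, the conversion can reuse the existing helpers; this is a sketch
only (childrel, parentrel, and childStmt are placeholder names, and
declarations are omitted -- see the attached patch for the real thing):

	/*
	 * Build a parent-attno -> partition-attno map and remap the Vars in
	 * the index expressions before recursing into DefineIndex.
	 */
	attmap = convert_tuples_by_name_map(RelationGetDescr(childrel),
										RelationGetDescr(parentrel),
										gettext_noop("could not convert row type"));

	foreach(lc, childStmt->indexParams)
	{
		IndexElem  *ielem = lfirst(lc);

		if (ielem->expr != NULL)
			ielem->expr = map_variable_attnos(ielem->expr, 1, 0,
											  attmap,
											  RelationGetDescr(parentrel)->natts,
											  InvalidOid, &found_whole_row);
	}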

By the way, do we want to do anything about the fact that we don't check
whether a matching inherited index already exists when creating an index
directly on a partition?  I wonder about this because we do check in the
other direction.

\d p3
                Table "public.p3"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: p FOR VALUES WITH (modulus 3, remainder 2)
Indexes:
    "p3_expr_idx" btree ((a + 1))

create index on p3 ((a+1));
\d p3
                Table "public.p3"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: p FOR VALUES WITH (modulus 3, remainder 2)
Indexes:
    "p3_expr_idx" btree ((a + 1))
    "p3_expr_idx1" btree ((a + 1))

Thanks,
Amit
From d7c090f1c14a5d5ebec04b5a58d626cbfe057e2c Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 21 Jun 2018 14:50:20 +0900
Subject: [PATCH v1] Convert indexParams to partition's attnos before recursing

---
 src/backend/commands/indexcmds.c   | 18 +-
 src/test/regress/expected/indexing.out | 45 ++
 src/test/regress/sql/indexing.sql  | 26 
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 3a3223bffb..74ffd2c392 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -922,7 +922,6 @@ DefineIndex(Oid relationId,

   gettext_noop("could not convert row type"));