Re: Warning messages appearing twice

2019-07-25 Thread vignesh C
On Fri, Jul 26, 2019 at 11:23 AM Dilip Kumar  wrote:

> On Fri, Jul 26, 2019 at 11:04 AM vignesh C  wrote:
> >
> > Hi,
> >
> > I have noticed in some cases the warning messages appear twice, one such
> instance is given below:
> > postgres=# begin;
> > BEGIN
> > postgres=# prepare transaction 't1';
> > PREPARE TRANSACTION
> > postgres=# rollback;
> > WARNING:  there is no transaction in progress
> > WARNING:  there is no transaction in progress
> > ROLLBACK
> >
> > However if logging is enabled, the warning message appears only once.
>
> Seems like you are seeing one message from the client and the other
> one from the server log; as you have not enabled the logging collector,
> the WARNING is printed on your console.
>
>
Thanks for the clarification, Dilip.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Warning messages appearing twice

2019-07-25 Thread Dilip Kumar
On Fri, Jul 26, 2019 at 11:04 AM vignesh C  wrote:
>
> Hi,
>
> I have noticed in some cases the warning messages appear twice, one such 
> instance is given below:
> postgres=# begin;
> BEGIN
> postgres=# prepare transaction 't1';
> PREPARE TRANSACTION
> postgres=# rollback;
> WARNING:  there is no transaction in progress
> WARNING:  there is no transaction in progress
> ROLLBACK
>
> However if logging is enabled, the warning message appears only once.

Seems like you are seeing one message from the client and the other
one from the server log; as you have not enabled the logging collector,
the WARNING is printed on your console.

>
> I'm not sure if this is already known.
> I'm not sure if this is a widely used scenario or not.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: block-level incremental backup

2019-07-25 Thread Jeevan Ladhe
Hi Vignesh,

Please find my comments inline below:

1) If a relation file has changed due to truncate or vacuum.
> During incremental backup the new files will be copied.
> There are chances that both the old file and the new file
> will be present. I'm not sure if cleaning up of the
> old file is handled.
>

When an incremental backup is taken, it either copies a file in its
entirety if the file has changed by more than 90%, or writes a .partial
file containing the changed-block bitmap and the actual data.  For files
that are unchanged it writes 0 bytes, but still creates a .partial file.
This means there is a .partial file for every file that is to be looked
up in the full backup.
While composing a synthetic backup from an incremental backup, the
pg_combinebackup tool will only look for those relation files in the
full (parent) backup which have .partial files in the incremental
backup.  So, if a vacuum/truncate happened between the full and the
incremental backup, then the incremental backup image will not have a
0-length .partial file for that relation, and so the synthetic backup
that is restored using pg_combinebackup will not have that file either.
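
To make that file-selection rule concrete, here is a minimal standalone C
sketch of my own (not the actual pg_combinebackup code; the file names and
list handling are purely illustrative):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Does the incremental backup contain a .partial entry for this relation file? */
static bool
has_partial(const char *relfile, const char *incr_files[], int nfiles)
{
    char partial[1024];

    snprintf(partial, sizeof(partial), "%s.partial", relfile);
    for (int i = 0; i < nfiles; i++)
        if (strcmp(incr_files[i], partial) == 0)
            return true;
    return false;
}

int
main(void)
{
    /* hypothetical contents of the full and incremental backups */
    const char *full_backup[] = {"16384", "16385"};
    const char *incr_backup[] = {"16384.partial"};  /* 16385 was truncated away */

    for (int i = 0; i < 2; i++)
    {
        if (has_partial(full_backup[i], incr_backup, 1))
            printf("combine %s from full backup + .partial blocks\n", full_backup[i]);
        else
            printf("skip %s: no .partial, so it is absent from the synthetic backup\n",
                   full_backup[i]);
    }
    return 0;
}

The point is simply that the absence of a .partial entry is what tells the
combine step that the relation file no longer exists.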


> 2) Just a small thought on building the bitmap:
> can the bitmap be built and maintained as
> and when the changes are happening in the system?
> If we are building the bitmap while doing the incremental backup,
> scanning through each file might take more time.
> This can be a configurable parameter; the system can run
> without capturing this information by default, but users
> who take incremental backups frequently can enable this
> configuration, which should track the modified blocks.


IIUC, this will need changes in the backend.  Honestly, I think backup
is a maintenance task, and hampering the backend for this does not look
like a good idea.  But, having said that, even if we have to provide
this as a switch for some of the users, it will need a different
infrastructure than what we are building here for constructing the
bitmap, where we scan all the files one by one.  Maybe for the initial
version we can go with the current proposal that Robert has suggested,
and add this switch at a later point as an enhancement.
- My thoughts.

Regards,
Jeevan Ladhe


Re: POC: Cleaning up orphaned files using undo logs

2019-07-25 Thread Dilip Kumar
On Thu, Jul 25, 2019 at 11:25 AM Dilip Kumar  wrote:
>
> Hi Thomas,
>
> I have started reviewing 0003-Add-undo-log-manager.  I haven't yet
> reviewed it completely, but in some places I noticed that instead of
> UndoRecPtr you are directly using UndoLogOffset, which seems like a bug
> to me.
>
> 1.
> +UndoRecPtr
> +UndoLogAllocateInRecovery(UndoLogAllocContext *context,
> +   TransactionId xid,
> +   uint16 size,
> +   bool *need_xact_header,
> +   UndoRecPtr *last_xact_start,
> 
> + *need_xact_header =
> + context->try_location == InvalidUndoRecPtr &&
> + slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
> + *last_xact_start = slot->meta.unlogged.last_xact_start;
>
> the output parameter last_xact_start is of type UndoRecPtr whereas
> slot->meta.unlogged.last_xact_start is of type UndoLogOffset
> shouldn't we use MakeUndoRecPtr(logno, offset) here?
>
> 2.
> + slot = find_undo_log_slot(logno, false);
> + if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
> + {
> + *need_xact_header = false;
> + return try_offset;
> + }
>
> Here also you are returning try_offset directly instead of an UndoRecPtr.
>

+UndoLogRegister(UndoLogAllocContext *context, uint8 block_id,
UndoLogNumber logno)
+{
+ int i;
+
+ for (i = 0; i < context->num_meta_data_images; ++i)
+ {
+ if (context->meta_data_images[i].logno == logno)
+ {
+ XLogRegisterBufData(block_id,
+ (char *) >meta_data_images[i].data,
+ sizeof(context->meta_data_images[i].data));
+ return;
+ }
+ }
+}

I have observed one more thing: you are registering
"meta_data_images" with each buffer of that log.  Suppose one undo
record is spread across 2 undo blocks; then both blocks will
include a duplicate copy of this metadata image if this is the first
change after the checkpoint.  It will not cause any issue, but IMHO we
can avoid including 2 copies of the same meta_data_image.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Warning messages appearing twice

2019-07-25 Thread vignesh C
Hi,

I have noticed in some cases the warning messages appear twice, one such
instance is given below:
postgres=# begin;
BEGIN
postgres=# prepare transaction 't1';
PREPARE TRANSACTION
postgres=# rollback;

WARNING:  there is no transaction in progress
WARNING:  there is no transaction in progress
ROLLBACK

However if logging is enabled, the warning message appears only once.

I'm not sure if this is already known.
I'm not sure if this is a widely used scenario or not.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


pqParseInput3 overruns

2019-07-25 Thread Kyotaro Horiguchi
Hello.

While looking at [1], I noticed that NOTICE messages of the next
command are processed before PQgetResult returns. Clients can
receive such spurious NOTICE messages.

Looking at pqParseInput3, its main loop seems intended to exit
after a complete message is processed (as I read it).

> * Loop to parse successive complete messages available in the buffer.

But actually, the 'C' message doesn't work that way. I think we
should do as the comment suggests. Clients can still process
async messages or (somehow issued) NOTICE messages in later
calls.


[1]: 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1904132231510.8961@lancre

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 6dbc907142d7aaba7f02e6585e3fa407511725bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 26 Jul 2019 12:51:56 +0900
Subject: [PATCH] Don't parse further messages after COMPLETE message

There is a case where the notice processor is called with spurious messages before PQgetResult returns a result set, if NOTICE messages from the next command have already arrived before the result set is retrieved. Prevent that by letting pqParseInput3 return just after processing the 'C' message.
---
 src/interfaces/libpq/fe-protocol3.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index bbba48bc8b..0ea6a918ab 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -437,6 +437,10 @@ pqParseInput3(PGconn *conn)
 			/* trust the specified message length as what to skip */
 			conn->inStart += 5 + msgLength;
 		}
+
+		/* As we get ready now, give caller a chance to process the result */
+		if (conn->asyncStatus == PGASYNC_READY)
+			return;
 	}
 }
 
-- 
2.16.3



Re: psql - add SHOW_ALL_RESULTS option

2019-07-25 Thread Kyotaro Horiguchi
Hello.

At Thu, 25 Jul 2019 21:42:11 + (GMT), Fabien COELHO  
wrote in 
> 
> Bonsoir Daniel,
> 
> > FYI you forgot to remove that bit:
> >
> > + "SINGLELINE|SINGLESTEP|SHOW_ALL_RESULTS"))
> 
> Indeed. I found another such instance in "help.c".
> 
> > Also copydml does not  seem to be exercised with combined
> > queries,  so do we need this chunk:
> 
> > --- a/src/test/regress/sql/copydml.sql
> 
> Yep, because I reorganized the notice code significantly, and I wanted
> to be sure that the right notices are displayed in the right order,
> which does not show if the trigger just says "NOTICE: UPDATE 8".
> 
> Attached a v2 for the always-show-all-results variant. Thanks for the
> debug!

I have some comments on this patch.

I'm +1 for always outputting all results without having knobs.

Documentation (psql-ref.sgml) has another place that needs the
same amendment.

Looking at the output for -t, -0, -A and the like, we might need
to introduce a result-set separator.

# -eH looks broken for me, but that would be another issue.

A valid setting of FETCH_COUNT disables this feature. I think that
is unwanted behavior.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SegFault on 9.6.14

2019-07-25 Thread Amit Kapila
On Tue, Jul 23, 2019 at 5:28 PM Amit Kapila  wrote:
>
> On Tue, Jul 23, 2019 at 9:11 AM Thomas Munro  wrote:
> >
>
> > Another idea from the band-aid-solutions-that-are-easy-to-back-patch
> > department: in ExecutePlan() where we call ExecShutdownNode(), we
> > could write EXEC_FLAG_DONE into estate->es_top_eflags, and then have
> > ExecGatherShutdown() only run ExecParallelCleanup() if it sees that
> > flag.  That's not beautiful, but it's less churn that the 'bool final'
> > argument we discussed before, and could be removed in master when we
> > have a better way.
> >
>
> Right, that will be less code churn and it can also work.  However,
> one thing that needs some thought is that till now es_top_eflags is only
> set in ExecutorStart (and the same is mentioned in the comments where it
> is declared), and it seems we are going to change that with this idea.  How
> about having a separate function ExecBlahShutdown which will clean up
> resources such as the parallel context and can be called only from
> ExecutePlan, where we are calling ExecShutdownNode?  I think both this and
> the other solution we have discussed are on similar lines, and another idea
> could be to relax the assert, which again is not a superb idea.
>

It seems we don't have a clear preference for any particular solution
among these, nor does there appear to be any better idea.  I guess
we can wait for a few days to see if Robert has any views on this;
otherwise, we can pick one of the above and move ahead.

Robert, please let us know if you have any preference or a better idea
for fixing this problem.
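
(For readers following along, here is a tiny standalone mock-up of the
EXEC_FLAG_DONE idea quoted above.  EXEC_FLAG_DONE is only proposed in this
thread, and MockEState is a stand-in, not the real executor structure:)

#include <stdio.h>

#define EXEC_FLAG_DONE 0x0100          /* proposed flag, not in current sources */

typedef struct MockEState              /* stand-in for the executor's EState */
{
    int es_top_eflags;
} MockEState;

static void
mock_gather_shutdown(MockEState *estate)
{
    /* only release parallel resources once the whole plan is done */
    if (estate->es_top_eflags & EXEC_FLAG_DONE)
        printf("plan is done: run ExecParallelCleanup()\n");
    else
        printf("plan not done: keep the parallel context around\n");
}

int
main(void)
{
    MockEState estate = {0};

    mock_gather_shutdown(&estate);            /* shutdown before the plan is finished */
    estate.es_top_eflags |= EXEC_FLAG_DONE;   /* what ExecutePlan() would set */
    mock_gather_shutdown(&estate);
    return 0;
}

The point of the proposal is just that the Gather shutdown path could tell
"the plan is truly finished" apart from an intermediate shutdown.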

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-25 Thread Amit Kapila
On Thu, Jul 25, 2019 at 11:22 AM Thomas Munro  wrote:
>
> On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila  wrote:
> > I have done some more review of undolog patch series and here are my 
> > comments:
>
> Hi Amit,
>
> Thanks!  There are a number of actionable changes in your review.  I'll be
> posting a new patch set soon that will address most of your complaints
> individually.  In this message I want to respond to one topic area,
> because the answer is long enough already:
>
> > 2.
> > allocate_empty_undo_segment()
> > {
> > ..
> > ..
> > /* Flush the contents of the file to disk before the next checkpoint. */
> > + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> > ..
> > }
> >
> > +void
> > +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid 
> > tablespace)
> > +{
> > + char path[MAXPGPATH];
> > + FileTag tag;
> > +
> > + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> > +
> > + /* Try to send to the checkpointer, but if out of space, do it here. */
> > + if (!RegisterSyncRequest(, SYNC_REQUEST, false))
> >
> >
> > The comment in allocate_empty_undo_segment indicates that the code
> > wants to flush before checkpoint, but the actual function tries to
> > register the request with checkpointer.  Shouldn't this be similar to
> > XLogFileInit where we use pg_fsync to flush the contents immediately?
> > I guess that will avoid what you have written in comments in the same
> > function (we just want to make sure that the filesystem has allocated
> > physical blocks for it so that non-COW filesystems will report ENOSPC
> > now rather than later when space is needed).  OTOH, I think it is
> > performance-wise better to postpone the work to checkpointer.  If we
> > want to push this work to checkpointer, then we might need to change
> > comments or alternatively, we might want to use bigger segment sizes
> > to mitigate the performance effect.
>
> In an early version I was doing the fsync() immediately.  While
> testing zheap, Mithun CY reported that whenever segments couldn't be
> recycled in the background, such as during a somewhat long-running
> transaction, he could measure ~6% of the time spent waiting for
> fsync(), and throughput increased with bigger segments (and thus fewer
> files to fsync()).  Passing the work off to the checkpointer is better
> not only because it's done in the background but also because there is
> a chance that the work can be consolidated with other sync requests,
> and perhaps even avoided completely if the file is discarded and
> unlinked before the next checkpoint.
>
> I'll update the comment to make it clearer.
>

Okay, that makes sense.

> > If my above understanding is correct and the reason to fsync
> > immediately is to reserve space now, then we also need to think
> > whether we are always safe in postponing the work?  Basically, if this
> > means that it can fail when we are actually trying to write undo, then
> > it could be risky because we could be in the critical section at that
> > time.  I am not sure about this point, rather it is just to discuss if
> > there are any impacts of postponing the fsync work.
>
> Here is my theory for why this arrangement is safe, and why it differs
> from what we're doing with WAL segments and regular relation files.
> First, let's review why those things work the way they do (as I
> understand it):
>
> 1.  WAL's use of fdatasync():
>

I was referring to the function XLogFileInit, which doesn't appear to
use fdatasync directly.

>
> 3.  Separate size tracking:  Another reason that regular relations
> write out zeroes at relation-extension time is that that's the only
..
>
> To summarise, we write zeroes so we can report ENOSPC errors as early
> as possible, but we defer and consolidate fsync() calls because the
> files' contents and names don't actually have to survive power loss
> until a checkpoint says they existed at that point in the WAL stream.
>
> Does this make sense?
>

Yes, this makes sense.  However, I wonder if we need to do some
special handling for ENOSPC while writing to the file in this function
(allocate_empty_undo_segment): basically, unlink/remove the file if we
fail to write because the disk is full, something similar to what we do
in XLogFileInit.
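
For illustration only, a standalone sketch of that kind of ENOSPC handling
(my own mock-up, not code from the undo patch or from XLogFileInit): if
zero-filling the new segment fails, remove the partially created file
before reporting the error.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Zero-fill a new segment file; on write failure (e.g. ENOSPC), unlink it. */
static int
allocate_zeroed_segment(const char *path, size_t segsize)
{
    char zeroes[8192] = {0};
    int  fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;

    for (size_t written = 0; written < segsize; written += sizeof(zeroes))
    {
        errno = 0;
        if (write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes))
        {
            int save_errno = errno ? errno : ENOSPC;    /* short write: assume disk full */

            close(fd);
            unlink(path);          /* don't leave a half-written segment behind */
            errno = save_errno;
            return -1;
        }
    }
    close(fd);
    return 0;
}

int
main(void)
{
    if (allocate_zeroed_segment("undo_segment.tmp", 1024 * 1024) != 0)
        perror("allocate_zeroed_segment");
    else
    {
        puts("segment allocated");
        unlink("undo_segment.tmp");    /* clean up the demo file */
    }
    return 0;
}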

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Alvaro Herrera
On 2019-Jul-25, Alvaro Herrera wrote:

> > Uh, there are no known attacks on AES with known plain-text, e.g., SSL
> > uses AES, so I think we are good with encrypting everything after the
> > first 16 bytes.
> 
> Well, maybe there aren't any attacks *now*, but I don't know what will
> happen in the future.  I'm not clear what's the intended win by
> encrypting the all-zeroes page hole anyway.  If you leave it
> unencrypted, the attacker knows the size of the hole, as well as the
> size of the tuple data area and the size of the LP array.  Is that a
> side-channel that leaks much?

This answer https://crypto.stackexchange.com/a/31090 is interesting for
three reasons:

1. it says we don't really have to worry about cleartext attacks, at
least not in the immediate future, so encrypting the hole should be OK;

2. it seems to reinforce a point I tried to make earlier, which is that
reusing the IV a small number of times is *not that bad*:

> On the other hand if the Key and IV are reused between messages then
> the same plaintext will lead to the same ciphertext, so you can
> potentially decrypt a message using a sufficiently large corpus of known
> matching plaintext/ciphertext pairs, even without ever recovering the
> key.

Actually the attack being described presumes that you know *both the
unencrypted data and the encrypted data* for a certain key/IV pair,
and only then you can decrypt some other data.  It doesn't follow that
you can decrypt data just because somebody reused the IV for a second
page ... I haven't seen any literature referenced that explains what
this attack is.

3. It seems clear that AES is sufficiently complicated that explaining
it to non-cryptographers is a lost cause.
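
On point 2, the basic property the quoted passage is built on -- same key,
same IV, same plaintext gives the same ciphertext -- is easy to see with a
standalone OpenSSL sketch (my own illustration, using AES-256-CTR with an
all-zero key and IV purely for demonstration; build with -lcrypto):

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

static void
encrypt_ctr(const unsigned char *key, const unsigned char *iv,
            const unsigned char *in, int inlen, unsigned char *out)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len;

    EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv);
    EVP_EncryptUpdate(ctx, out, &len, in, inlen);
    EVP_EncryptFinal_ex(ctx, out + len, &len);
    EVP_CIPHER_CTX_free(ctx);
}

int
main(void)
{
    unsigned char key[32] = {0};    /* demo key: all zeroes */
    unsigned char iv[16] = {0};     /* demo IV: deliberately reused */
    unsigned char pt[16] = "all-zeroes hole";
    unsigned char c1[16], c2[16];

    encrypt_ctr(key, iv, pt, sizeof(pt), c1);
    encrypt_ctr(key, iv, pt, sizeof(pt), c2);   /* same key and IV reused */

    printf("ciphertexts %s\n",
           memcmp(c1, c2, sizeof(c1)) == 0
           ? "are identical, so IV reuse is visible to an observer"
           : "differ");
    return 0;
}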

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-25 Thread Michael Paquier
On Thu, Jul 25, 2019 at 01:00:34PM +0200, Julien Rouhaud wrote:
> The problem is that a user doing something like:
> 
> reindexdb -j48 -i some_index -S s1 -S s2 -S s3
> 
> will probably be disappointed to learn that he has to run a specific
> command for the index(es) that should be reindexed.  Maybe we can
> issue a warning that parallelism isn't used when an index list is
> processed and user asked for multiple jobs?

Arguments go in both directions, as some other users may be surprised
by the performance of index processing when serialization is enforced.

> I don't send a new patch since the --index wanted behavior is not
> clear yet.

So I am sending one patch (actually two) after a closer review, having
spent time shaping them into a committable state.  And for this part I
have another suggestion, which is to use a switch/case without a default,
so that any newly-added object type would force somebody to think about
those code paths, as this would generate compiler warnings.
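
As a standalone illustration of that suggestion (my own sketch; the
ReindexType name appears in the patch below, but the enum values here are
assumptions for illustration), with -Wall/-Wswitch the compiler flags any
switch that forgets a newly added enum value:

#include <stdio.h>

/* Enum values are assumptions for illustration only. */
typedef enum ReindexType
{
    REINDEX_DATABASE,
    REINDEX_INDEX,
    REINDEX_SCHEMA,
    REINDEX_SYSTEM,
    REINDEX_TABLE
} ReindexType;

static const char *
reindex_type_name(ReindexType type)
{
    switch (type)           /* no default: a new enum value triggers -Wswitch */
    {
        case REINDEX_DATABASE:
            return "DATABASE";
        case REINDEX_INDEX:
            return "INDEX";
        case REINDEX_SCHEMA:
            return "SCHEMA";
        case REINDEX_SYSTEM:
            return "SYSTEM";
        case REINDEX_TABLE:
            return "TABLE";
    }
    return "???";           /* unreachable; keeps the return-path check quiet */
}

int
main(void)
{
    printf("%s\n", reindex_type_name(REINDEX_INDEX));
    return 0;
}

Adding a default would silence exactly the warning we want to keep.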

While reviewing I have found an extra bug in the logic: when using a
list of tables, the number of parallel slots is the minimum of
concurrentCons and tbl_count, but this does not get applied after
building a list of tables for a schema or database reindex, so we had
better recompute the number of items in reindex_one_database() before
allocating the parallel slots.  There was also a small gem in the TAP
tests where one of the commands used "-j2" in one of its command
arguments.

So here we go:
- 0001 is your original thing, with --jobs enforced to 1 for the index
part.
- 0002 is my addition to forbid --index with --jobs.

I am fine with being outvoted regarding 0002, and that is the case based
on the state of this thread with 2:1.  We could always revisit this
decision in this development cycle anyway.
--
Michael
From 77f24d9e3edb40b8ac2674230e10345988b8831c Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 26 Jul 2019 12:12:46 +0900
Subject: [PATCH 1/2] Base patch for support of --jobs in reindexdb

---
 doc/src/sgml/ref/reindexdb.sgml|  25 ++
 src/bin/scripts/Makefile   |   2 +-
 src/bin/scripts/reindexdb.c| 407 +
 src/bin/scripts/t/090_reindexdb.pl |  32 ++-
 4 files changed, 414 insertions(+), 52 deletions(-)

diff --git a/doc/src/sgml/ref/reindexdb.sgml b/doc/src/sgml/ref/reindexdb.sgml
index 25b5a72770..477b77c6e9 100644
--- a/doc/src/sgml/ref/reindexdb.sgml
+++ b/doc/src/sgml/ref/reindexdb.sgml
@@ -166,6 +166,31 @@ PostgreSQL documentation
   
  
 
+ 
+  -j njobs
+  --jobs=njobs
+  
+   
+Execute the reindex commands in parallel by running
+njobs
+commands simultaneously.  This option reduces the time of the
+processing but it also increases the load on the database server.
+   
+   
+reindexdb will open
+njobs connections to the
+database, so make sure your 
+setting is high enough to accommodate all connections.
+   
+   
+Note that this option is ignored with the --index
+option to prevent deadlocks when processing multiple indexes from
+the same relation, and incompatible with the --system
+option.
+   
+  
+ 
+
  
   -q
   --quiet
diff --git a/src/bin/scripts/Makefile b/src/bin/scripts/Makefile
index 3cd793b134..ede665090f 100644
--- a/src/bin/scripts/Makefile
+++ b/src/bin/scripts/Makefile
@@ -29,7 +29,7 @@ dropdb: dropdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-
 dropuser: dropuser.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 clusterdb: clusterdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 vacuumdb: vacuumdb.o common.o scripts_parallel.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
-reindexdb: reindexdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
+reindexdb: reindexdb.o common.o scripts_parallel.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 pg_isready: pg_isready.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 
 install: all installdirs
diff --git a/src/bin/scripts/reindexdb.c b/src/bin/scripts/reindexdb.c
index 219a9a9211..a877689eec 100644
--- a/src/bin/scripts/reindexdb.c
+++ b/src/bin/scripts/reindexdb.c
@@ -10,10 +10,14 @@
  */
 
 #include "postgres_fe.h"
+
+#include "catalog/pg_class_d.h"
 #include "common.h"
 #include "common/logging.h"
+#include "fe_utils/connect.h"
 #include "fe_utils/simple_list.h"
 #include "fe_utils/string_utils.h"
+#include "scripts_parallel.h"
 
 typedef enum ReindexType
 {
@@ -25,16 +29,26 @@ typedef enum ReindexType
 } ReindexType;
 
 
-static void reindex_one_database(const char *name, const char *dbname,
- ReindexType type, const char *host,
+static SimpleStringList *get_parallel_object_list(PGconn *conn,
+  ReindexType type,
+  

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Alvaro Herrera
On 2019-Jul-25, Bruce Momjian wrote:

> On Thu, Jul 25, 2019 at 03:43:34PM -0400, Alvaro Herrera wrote:

> > Why are we encrypting the page header in the first place?  It seems to
> > me that the encrypted area should cover only the line pointers and the
> > tuple data area; the page header needs to be unencrypted so that it can
> > be used at all: firstly because you need to obtain the LSN from it in
> 
> Yes, the plan was to not encrypt the first 16 bytes so the LSN was visible.

I don't see the value of encrypting the rest of the page header
(which includes the page checksum).

> > order to compute the IV, and secondly because the checksum must be
> > validated *before* decrypting (per Moxie Marlinspike's "cryptographic
> > doom" principle mentioned in a comment in the SE question).
> 
> Uh, I think we are still on the fence about writing the checksum _after_
> encryption,

I don't see what's the reason for doing that.  The "cryptographic doom
principle" page talks about this kind of scenario, and ISTM that the
ultimate suggestion is that the page checksum ought to be verifyable
prior to doing any decryption.

Are you worried about an attacker forging the page checksum by
installing another encrypted page that gives the same checksum?  I'm not
sure how that attack works ... I mean why can the attacker install
arbitrary pages?

> The only way offline tools can verify the CRC without access to the keys
> is via #2, but #2 gives us _no_ detection of tampering.  I realize the
> CRC tampering detection of #1 and #3 is not great, but it certainly has
> some value.

It seems to me that you're trying to invent a cryptographic signature
scheme on your own.  That seems very likely to backfire.

> > I am not totally clear on whether the special space and the "page hole"
> > need to be encrypted.  I tend to think that they should *not* be
> > encrypted; in particular, encrypting a large area containing zeroes seem
> > a plentiful source of known cleartext, which seems a bad thing.  Special
> > space also seems to contain known cleartext; maybe not as much as the
> > page hole, but still seems better avoided.
> 
> Uh, there are no known attacks on AES with known plain-text, e.g., SSL
> uses AES, so I think we are good with encrypting everything after the
> first 16 bytes.

Well, maybe there aren't any attacks *now*, but I don't know what will
happen in the future.  I'm not clear what's the intended win by
encrypting the all-zeroes page hole anyway.  If you leave it
unencrypted, the attacker knows the size of the hole, as well as the
size of the tuple data area and the size of the LP array.  Is that a
side-channel that leaks much?

> > The checksum we currently have is not cryptographically secure -- it's
> > not a crypto-strong signature.  If we want that, we need some further
> > protection.  Maybe for encrypted tables we replace our current checksum
> > with a cryptographically secure signature ...?  Pretty sure 16 bits are
> > insufficient for that, but I suppose we would just use a different page
> > header with room for a proper sig.
> 
> Yes, checksum is more for best-effort than fully secure, but replay of
> pages makes a fully secure solution hard anyway.

What do you mean with "replay of pages"?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Patch for SortSupport implementation on inet/cdir

2019-07-25 Thread Peter Geoghegan
On Fri, Feb 8, 2019 at 11:13 PM Edmund Horner  wrote:
> I have some comments on the comments:

Seems reasonable to me.

Where are we on this? I'd like to get the patch committed soon.

--
Peter Geoghegan




Re: Implementing Incremental View Maintenance

2019-07-25 Thread Yugo Nagata
Hi,

I've updated the wiki page of Incremental View Maintenance.

https://wiki.postgresql.org/wiki/Incremental_View_Maintenance

On Thu, 11 Jul 2019 13:28:04 +0900
Yugo Nagata  wrote:

> Hi Thomas,
> 
> Thank you for your review and discussion on this patch!
> 
> > > 2019年7月8日(月) 15:32 Thomas Munro :
> > > 
> > > > On Fri, Jun 28, 2019 at 10:56 PM Yugo Nagata  
> > > > wrote:
> > > > > Attached is a WIP patch of IVM which supports some aggregate 
> > > > > functions.
> > > >
> > > > Hi Nagata-san and Hoshiai-san,
> > > >
> > > > Thank you for working on this.  I enjoyed your talk at PGCon.  I've
> > > > added Kevin Grittner just in case he missed this thread; he has talked
> > > > often about implementing the counting algorithm, and he wrote the
> > > > "trigger transition tables" feature to support exactly this.  While
> > > > integrating trigger transition tables with the new partition features,
> > > > we had to make a number of decisions about how that should work, and
> > > > we tried to come up with answers that would work for IMV, and I hope
> > > > we made the right choices!
> 
> Transition tables are a great feature. I am now using them in my
> implementation of IVM, but the first time I used this feature was when I
> implemented a PoC for extending the view updatability of PostgreSQL [1].
> At that time, I didn't know that this feature was originally made with
> the aim of supporting IVM.
> 
> [1] https://www.pgcon.org/2017/schedule/events/1074.en.html
> 
> I think transition tables are a good choice for implementing statement-level
> immediate view maintenance, where materialized views are refreshed in a
> statement-level after trigger. However, when implementing transaction-level
> immediate view maintenance, where views are refreshed per transaction, or
> deferred view maintenance, we can't update views in an after trigger, and we
> will need an infrastructure to manage change logs of base tables. Transition
> tables can be used to collect these logs, but using logical decoding of WAL
> is another candidate. In any case, if these logs can be collected in a
> tuplestore, we might be able to convert this to an "ephemeral named relation
> (ENR)" and use it to calculate diff sets for views.
> 
> > > >
> > > > I am quite interested to learn how IVM interacts with SERIALIZABLE.
> > > >
> > > 
> > >  Nagata-san has been studying this. Nagata-san, any comment?
> 
> At the SERIALIZABLE or REPEATABLE READ level, table changes that occurred in
> other transactions are not visible, so views cannot be maintained correctly
> in AFTER triggers. If a view is defined on two tables and each table is
> modified in different concurrent transactions, the result of view maintenance
> done in trigger functions can be incorrect due to the race condition. This is
> the reason why such transactions are aborted immediately in that case in my
> current implementation.
> 
> One idea to resolve this is performing view maintenance in two phases.
> Firstly, views are updated using only the table changes visible in this
> transaction. Then, just after this transaction is committed, views have to be
> updated additionally using the changes that happened in other transactions to
> keep consistency. This is just an idea, but to implement it, I think we will
> need to keep and maintain change logs.
> 
> > > > A couple of superficial review comments:
> 
> 
>  
> > > > +const char *aggname = get_func_name(aggref->aggfnoid);
> > > > ...
> > > > +else if (!strcmp(aggname, "sum"))
> > > >
> > > > I guess you need a more robust way to detect the supported aggregates
> > > > than their name, or I guess some way for aggregates themselves to
> > > > specify that they support this and somehow supply the extra logic.
> > > > Perhaps I just said what Greg Stark already said, except not as well.
> 
> Yes. Using names is not robust because users can create aggregates with the
> same name, like sum(text) (although I am not sure this makes sense). We can
> use OIDs instead of names, but it would be nice to extend pg_aggregate and
> add new attributes indicating that an aggregate supports IVM and providing
> functions for the IVM logic.
> 
> > > > As for the question of how
> > > > to reserve a namespace for system columns that won't clash with user
> > > > columns, according to our manual the SQL standard doesn't allow $ in
> > > > identifier names, and according to my copy SQL92 "intermediate SQL"
> > > > doesn't allow identifiers that end in an underscore.  I don't know
> > > > what the best answer is but we should probably decide on a something
> > > > based the standard.
> 
> Ok, so we should use "__ivm_count__" since this ends in "_" at least.
> 
> Another idea is that users specify the name of this special column when
> defining materialized views with IVM support. This way we can avoid the
> conflict because users will specify a name which does not appear in the
> target list.
> 
> As for aggregates supports, it may be 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Jonathan S. Katz
Hi,

Before my reply, I wanted to say that I've been lurking on this thread
for a bit as I've tried to better inform myself on encryption at rest
and how it will apply to what we want to build. I actually built a
(poor) prototype in Python of the key management system that Joe &
Masahiko both laid out, in addition to performing some "buffer
encrpytion" with it. It's not worth sharing at this point.

With the disclaimer that I'm not as familiar with a lot of concepts as I
would like to be:

On 7/25/19 1:54 PM, Masahiko Sawada wrote:
> On Fri, Jul 26, 2019 at 2:18 AM Bruce Momjian  wrote:
>>
>> On Thu, Jul 18, 2019 at 12:04:25PM +0900, Masahiko Sawada wrote:
>>> I've re-considered the design of the TDE feature based on the discussion
>>> so far. One of the main open questions is the granularity of
>>> encryption objects: cluster encryption or more-granular-than-cluster
>>> encryption. The following describes the new TDE design when we
>>> choose table-level encryption or something-new-group-level encryption.
>>>
>>> General
>>> 
>>> We will use AES and support both AES-128 and AES-256. User can specify
>>> the new initdb option something like --aes-128 or --aes-256 to enable
>>> encryption and must specify --encryption-key-passphrase-command along
>>> with. (I guess we also require openssl library.) If these options are
>>> specified, we write the key length to the control file and derive the
>>> KEK and generate MDEK during initdb. wal_log_hints will be enabled
>>> automatically in encryption mode, like we do for checksum mode,
>>
>> Agreed.  pg_control will store the none/AES128/AES256 indicator.
>>
>>> Key Management
>>> ==
>>> We will use 3-tier key architecture as Joe proposed.
>>>
>>>   1. A master key encryption key (KEK): this is the key supplied by the
>>>  database admin using something akin to ssl_passphrase_command
>>>
>>>   2. A master data encryption key (MDEK): this is a generated key using a
>>>  cryptographically secure pseudo-random number generator. It is
>>>  encrypted using the KEK, probably with Key Wrap (KW):
>>>  or maybe better Key Wrap with Padding (KWP):
>>>
>>>   3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate
>>>   table specific keys.
>>
>> What is the value of a per-table encryption key?  How is HKDF derived?
> 
> The per-table encryption key is derived from the MDEK with a salt and its
> OID as info. I think we can store the salts for each encryption key in a
> separate file so that off-line tools can also read them.

+1 with using the info/salt for the HKDF as described above. The other
decision will be the hashing algorithm to use. SHA-256?
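
As a standalone sketch of that derivation (my own illustration, assuming
OpenSSL 1.1.1+, SHA-256 as the HKDF hash, and the relation OID as the
"info" input as described above; build with -lcrypto):

#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/kdf.h>

/* Derive a per-table key (TDEK) from the MDEK via HKDF-SHA256. */
static int
derive_tdek(unsigned char *mdek, size_t mdek_len,
            unsigned char *salt, size_t salt_len,
            unsigned int table_oid,
            unsigned char *tdek, size_t tdek_len)
{
    EVP_PKEY_CTX *pctx = EVP_PKEY_CTX_new_id(EVP_PKEY_HKDF, NULL);
    int ok;

    ok = pctx != NULL
        && EVP_PKEY_derive_init(pctx) > 0
        && EVP_PKEY_CTX_set_hkdf_md(pctx, EVP_sha256()) > 0
        && EVP_PKEY_CTX_set1_hkdf_salt(pctx, salt, salt_len) > 0
        && EVP_PKEY_CTX_set1_hkdf_key(pctx, mdek, mdek_len) > 0
        && EVP_PKEY_CTX_add1_hkdf_info(pctx, (unsigned char *) &table_oid,
                                       sizeof(table_oid)) > 0
        && EVP_PKEY_derive(pctx, tdek, &tdek_len) > 0;

    EVP_PKEY_CTX_free(pctx);
    return ok;
}

int
main(void)
{
    unsigned char mdek[32] = {0};   /* placeholder MDEK */
    unsigned char salt[16] = {0};   /* placeholder per-key salt */
    unsigned char tdek[32];

    if (derive_tdek(mdek, sizeof(mdek), salt, sizeof(salt), 16384,
                    tdek, sizeof(tdek)))
        printf("derived a %zu-byte TDEK for relation 16384\n", sizeof(tdek));
    return 0;
}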


>>>   3b. WAL data encryption keys (WDEK):  Similarly use MDEK and a HKDF to
>>>   generate new keys when needed for WAL.
>>>
>>> We store the MDEK in a plain file (say global/pgkey) after encrypting it
>>> with the KEK. I might want to store a hash of the passphrase of the KEK
>>> in order to verify the correctness of the given passphrase. However we
>>> don't need to store TDEK and WDEK as we can derive them as needed. The
>>> key file can be read by both backend processes and front-end tools.
>>
>> Yes, we need to verify the pass phrase.

Just to clarify, this would be a hash of the KEK?

From my experiments, the MDEK key unwrapping fails if you do not have
the correct KEK (as it should). If it's a matter of storing a hash of
the KEK, I'm not sure if there is much added benefit to have it, but I
would not necessarily oppose it either.

>>> When postmaster startup, it reads the key file and decrypts MDEK and
>>> derive WDEK using key id for WDEK.

I don't know if this is getting too far ahead, but what happens if the
supplied KEK fails to decrypt the MDEK? Will the postmaster refuse to start up?

>>> WDEK is loaded to the key hash map
>>> (keyid -> key) on the shared memory. Also we derive TDEK as needed
>>> when reading tables or indexes and add it to the key hash map as well
>>> if not exists.

+1 to this approach.

>>>
>>> Buffer Encryption
>>> ==
>>> We will use AES-CBC for buffer encryption. We will add key id (4byte)
>>
>> I think we might want to use CTR for this, and will post after this.

Not sure if I missed this post or not (as several people mentioned, it
is easy to get lost in this thread).

I think what will help drive this decision is whether or not we consider
the data we are storing on disk as a "file system" in itself. Trying to
make myself literate in disk encryption theory[1], it seems a big
weakness in using CTR mode for encryption is that we need to be able to
guarantee a fresh counter for every page we encrypt[2], so if we can
guarantee the uniqueness of IV per TDEK, this is on the table.

XTS mode, on the other hand, appears to be more durable to reusing an IV
as the "tweak" was designed to represent a disk sector, though there are
still problems. However, I presume this is one of many reasons why
fscrypt uses XTS[3].

For data malleability, CTR is described 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Jonathan S. Katz

 Buffer Encryption
 ==
 We will use AES-CBC for buffer encryption. We will add key id (4byte)
>>>
>>> I think we might want to use CTR for this, and will post after this.
> 
> Not sure if I missed this post or not (as several people mentioned, it
> is easy to get lost in this thread).
> 
> I think what will help drive this decision is whether or not we consider
> the data we are storing on disk as a "file system" in itself. Trying to
> make myself literate in disk encryption theory[1], it seems a big
> weakness in using CTR mode for encryption is that we need to be able to
> guarantee a fresh counter for every page we encrypt[2], so if we can
> guarantee the uniqueness of IV per TDEK, this is on the table.
> 
> XTS mode, on the other hand, appears to be more durable to reusing an IV
> as the "tweak" was designed to represent a disk sector, though there are
> still problems. However, I presume this is one of many reasons why
> fscrypt uses XTS[3].

Much like Joe earlier, I forgot my citations:

[1] https://en.wikipedia.org/wiki/Disk_encryption_theory
[2]
https://crypto.stackexchange.com/questions/14628/why-do-we-use-xts-over-ctr-for-disk-encryption
[3] https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: PG 12 draft release notes

2019-07-25 Thread Peter Geoghegan
On Thu, Jul 25, 2019 at 6:37 PM Bruce Momjian  wrote:
> Attached patch applied, thanks.

Thanks Bruce,

-- 
Peter Geoghegan




Re: PG 12 draft release notes

2019-07-25 Thread Bruce Momjian
On Mon, Jul 15, 2019 at 06:21:59PM -0700, Peter Geoghegan wrote:
> On Wed, Jun 12, 2019 at 7:48 PM Bruce Momjian  wrote:
> > Great, patch applied.
> 
> I think that it would make sense to have a v12 release note item for
> amcheck's new "rootdescend" verification option:
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c1afd175b5b2e5c44f6da34988342e00ecdfb518
> 
> It is a user facing feature, which increments the amcheck extension
> version number.

Attached patch applied, thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/doc/src/sgml/release-12.sgml b/doc/src/sgml/release-12.sgml
new file mode 100644
index 4c55656..6b54a9e
*** a/doc/src/sgml/release-12.sgml
--- b/doc/src/sgml/release-12.sgml
*** Author: Tom Lane 
*** 3228,3233 
--- 3228,3245 
  
   
  
+ 
+   
+Add a parameter to a  function to check
+each index tuple from the root of the tree.
+   
+  
+ 
+  
+ 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 09:11:18PM -0400, Sehrope Sarkuni wrote:
> On Thu, Jul 25, 2019 at 8:50 PM Bruce Momjian  wrote:
> 
> On Thu, Jul 25, 2019 at 08:44:40PM -0400, Sehrope Sarkuni wrote:
> > You can still use CTR mode and include those to make the key + IV unique
> by
> > adding them to the derived key rather than the IV.
> >
> > The IV per-page would still be LSN + page-number (with the block number
> added
> > as it's evaluated across the page) and the relfilenode, heap/index,
> database,
> > and anything else to make it unique can be included in the HKDF to 
> create
> the
> > per-file derived key.
> 
> I thought if we didn't have to hash the stuff together we would be less
> likely to get collisions with the IV.
> 
>  
> IV creation does not use any hashing and would never have collisions with
> the same key as it's LSN + page + block (concatenation).
> 
> The derived keys would also not have collisions as the HKDF prevents that.
> Deriving two matching keys with different inputs has the same chance as
> randomly generating matching HMACs (effectively nil with something like
> HMAC-SHA-256).
> 
> So there wouldn't be any reuse of the same key + IV. Even if two different
> files are encrypted with the same LSN + page the total operation (key + IV)
> would be different as they'd be using different derived keys.

Oh, mix the value into the derived key, not into the IV --- got it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Sehrope Sarkuni
On Thu, Jul 25, 2019 at 8:50 PM Bruce Momjian  wrote:

> On Thu, Jul 25, 2019 at 08:44:40PM -0400, Sehrope Sarkuni wrote:
> > You can still use CTR mode and include those to make the key + IV unique
> by
> > adding them to the derived key rather than the IV.
> >
> > The IV per-page would still be LSN + page-number (with the block number
> added
> > as it's evaluated across the page) and the relfilenode, heap/index,
> database,
> > and anything else to make it unique can be included in the HKDF to
> create the
> > per-file derived key.
>
> I thought if we didn't have to hash the stuff together we would be less
> likely to get collisions with the IV.
>

IV creation does not use any hashing and would never have collisions with the
same key as it's LSN + page + block (concatenation).

The derived keys would also not have collisions as the HKDF prevents that.
Deriving two matching keys with different inputs has the same chance as
randomly generating matching HMACs (effectively nil with something like
HMAC-SHA-256).

So there wouldn't be any reuse of the same key + IV. Even if two different
files are encrypted with the same LSN + page the total operation (key + IV)
would be different as they'd be using different derived keys.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: PG 12 draft release notes

2019-07-25 Thread Bruce Momjian
On Tue, Jul 16, 2019 at 03:26:31PM +0900, Michael Paquier wrote:
> On Mon, Jul 15, 2019 at 08:51:34PM +0200, Laurenz Albe wrote:
> > I wonder if commits 0ba06e0bf and 40cfe8606 are worth mentioning
> > in the release notes.  They make "pg_test_fsync" work correctly
> > on Windows for the first time.
> 
> I don't know about this point specifically.  Improving support for
> pg_test_fsync on Windows is just a side effect of the first commit
> which benefits all frontend tools (the second commit is an
> embarrassing bug fix for the first one).  And at the same time we
> don't really add in the release notes low-level improvements like
> these ones.

Well, if we were reporting incorrect results before, that seems like a
fix, with updated wording, of course, to mention just the fix.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: PG 12 draft release notes

2019-07-25 Thread Bruce Momjian
On Mon, Jul 15, 2019 at 08:51:34PM +0200, Laurenz Albe wrote:
> On Sat, 2019-05-11 at 16:33 -0400, Bruce Momjian wrote:
> > I have posted a draft copy of the PG 12 release notes here:
> 
> I wonder if commits 0ba06e0bf and 40cfe8606 are worth mentioning
> in the release notes.  They make "pg_test_fsync" work correctly
> on Windows for the first time.

Oh, I missed adding that.  I applied the attached patch, with updated
wording, and moved it to the Server Applications section.  Thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/doc/src/sgml/release-12.sgml b/doc/src/sgml/release-12.sgml
new file mode 100644
index 57365c5..4c55656
*** a/doc/src/sgml/release-12.sgml
--- b/doc/src/sgml/release-12.sgml
*** Author: Andrew Dunstan 

  
+  
+ 
+ 
+   
+Fix  to report accurate
+open_datasync durations on
+Windows (Laurenz Albe)
+   
+  
+ 
   
  
  


Re: Seek failure at end of FSM file during WAL replay (in 11)

2019-07-25 Thread Michael Paquier
On Wed, Jul 24, 2019 at 01:30:42PM -0400, Tom Lane wrote:
> Hm.  AFAICS the immediate issuer of the error must have been
> _mdnblocks(); there are other matches to that error string but
> they are in places where we can tell which file the seek must
> have been applied to, and it wasn't a FSM file.

Yep, that matches my guesses.  _mdnblocks() is the only match among
the three places with this error string.

> lseek() per se presumably would never return ENOENT.  A more likely
> theory is that the file wasn't actually open but only had a leftover
> VFD entry, and when FileSize() -> FileAccess() tried to open it,
> the open failed with ENOENT --- but _mdnblocks() would still call it
> a seek failure.
> 
> So I'd opine that this is a pretty high-level failure --- what are
> we doing trying to replay WAL against a table that's been dropped?
> Or if it wasn't dropped, why was the FSM removed?

Interesting theory.  In this particular workload, all DDLs are run
when the product runs firstboot and the schema does not change
afterwards, so the stuff does not drop any tables.  Actually I have
been able to extract more information from the log bundles I have, as
this stuff does a lookup of all the files of the data folder at the
moment a log bundle is taken.  For this relation the main fork exists
on the primary and there is no FSM or VM.  On the standby, the main
fork also exists but there is a FSM, which I guess has been rebuilt
during the follow-up recovery.  So could it be possible that an FSM has
been removed on the primary, with its removal replayed on the standby,
without the corresponding VFD entry being cleaned up?
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 08:44:40PM -0400, Sehrope Sarkuni wrote:
> On Thu, Jul 25, 2019 at 7:51 PM Bruce Momjian  wrote:
> 
> Looking at the bits we have, the IV for AES is 16 bytes.  Since we know
> we have to use LSN (to change the IV for each page write), and the page
> number (so WAL updates that change multiple pages with the same LSN use
> different IVs), that uses 12 bytes:
> 
>         LSN         8 bytes
>         page-number 4 bytes
> 
> That leaves 4 bytes unused.  If we use CTR, we need 11 bits for the
> counter to support 32k page sizes (per Sehrope Sarkuni), and we can use
> the remaining 5 bits as constants to indicate heap, index, or WAL.
> (Technically, since we are not encrypting the first 16 bytes, we could
> use one less bit for the counter.)  If we also use relfilenode, that is
> 4 bytes, so we have no bits for the heap/index/WAL constant, and no
> space for the CTR counter, meaning we would have to use CBC mode.
> 
> 
> You can still use CTR mode and include those to make the key + IV unique by
> adding them to the derived key rather than the IV.
>
> The IV per-page would still be LSN + page-number (with the block number added
> as it's evaluated across the page) and the relfilenode, heap/index, database,
> and anything else to make it unique can be included in the HKDF to create the
> per-file derived key.

I thought if we didn't have to hash the stuff together we would be less
likely to get collisions with the IV.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 08:07:28PM -0400, Stephen Frost wrote:
> > Yes, we need to see how we are going to do that for checksums and
> > encryption and come up with a plan.
> 
> This is already being done though- Andres has a patch posted already and
> my recollection from a quick look at that is that it should work just
> fine for enabling checksums as well as potentially moving a table to be
> encrypted online- the main common bit being that we need a way to say
> "OK, everything has been done but we need to flip this flag and make
> sure that everyone knows that this is now all checksum'd or all
> encrypted".  The only thing there that I'm not sure about is that when
> it comes to checksums, I believe the idea is that it's cluster-wide,
> while with encryption that would only be true if we were trying to do
> something like move the entire cluster from unencrypted to encrypted in
> an online fashion (including WAL, CLOG, et al...) and if that's the case
> then there's a bunch of other complicated bits, I believe, that we'd
> have to work out, and I don't really think it's necessary or sensible to
> worry about that right now.  Those are problems we don't currently have
> with checksums either- the WAL already has them and I don't think
> anyone's trying to address the fact that other rather core pieces of
> the system don't currently.

OK,

> > > > > There seems to be a strong thrust on this thread to assume that a
> > > > > database MUST go from ALL DECRYPTED to ALL ENCRYPTED in one shot (and
> > > > > therefore we have to shut down the server to do it), but I don't get 
> > > > > why
> > > > > that's the case, particularly if we support any kind of mixed setup
> > > > > where there's some data that's encrypted and some that isn't, and 
> > > > > since
> > > > > we're talking about using different keys for different things, it 
> > > > > seems
> > > > > to me that we almost certainly should be able to easily say "well,
> > > > > there's no key for this, so just don't go through the decrypt/encrypt
> > > > > routines".
> > > > 
> > > > No, we can't easily do different keys for different things since all the
> > > > keys have to be online for crash recovery, so there isn't much value to
> > > > having different keys since they always have to be online.
> > > 
> > > Wasn't this already discussed?  We should have a master key and then
> > > additional keys for different tables, et al, which are encrypted with
> > > the master key.  Joe, I believe, covered all this quite well.
> > 
> > Yes, I am disagreeing with that.  I posted a 5-option email that went
> > over the issue and explored the value in different keys.  I am still
> > unclear of the benefit since it doesn't fix the 68GB limit unless we do
> > it per 1GB file, and we don't even know if that limit is per key or per
> > key/IV combo.  We can't move ahead until we decide that.
> 
> I understand the 68G limit that you're referring to to be per key/IV combo,
> which means that a key per relation should be just fine.

Yes, that is what I thought too.

> Even if it was per key, and it means having a key per 1GB file,
> that wouldn't change the point that I was making, so I'm not entirely
> sure why it's being mentioned in this context.

Because I thought we would use a single key for the entire cluster
(heap/index/WAL), and only use another key for encryption key rotation.

> I disagree with any approach that lacks a master key with additional
> sub-keys, if that helps clarify things.

We all know we need a passphrase that unlocks an encryption key.  Are
you saying we need per-object/table keys?  Why, other than for key
rotation?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Sehrope Sarkuni
On Thu, Jul 25, 2019 at 7:51 PM Bruce Momjian  wrote:

> Looking at the bits we have, the IV for AES is 16 bytes.  Since we know
> we have to use LSN (to change the IV for each page write), and the page
> number (so WAL updates that change multiple pages with the same LSN use
> different IVs), that uses 12 bytes:
>
> LSN 8 bytes
> page-number 4 bytes
>
> That leaves 4 bytes unused.  If we use CTR, we need 11 bits for the
> counter to support 32k page sizes (per Sehrope Sarkuni), and we can use
> the remaining 5 bits as constants to indicate heap, index, or WAL.
> (Technically, since we are not encrypting the first 16 bytes, we could
> use one less bit for the counter.)  If we also use relfilenode, that is
> 4 bytes, so we have no bits for the heap/index/WAL constant, and no
> space for the CTR counter, meaning we would have to use CBC mode.
>

You can still use CTR mode and include those to make the key + IV unique by
adding them to the derived key rather than the IV.

The IV per-page would still be LSN + page-number (with the block number
added as it's evaluated across the page) and the relfilenode, heap/index,
database, and anything else to make it unique can be included in the HKDF
to create the per-file derived key.
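
A standalone sketch of that IV layout (my own illustration, not code from
any posted patch): an 8-byte LSN, a 4-byte page number, and a 4-byte
in-page block counter concatenated into the 16-byte AES-CTR IV, while
relfilenode/database/etc. go into the HKDF-derived per-file key instead.
(An 8 kB page holds 8192/16 = 512 AES blocks, and even a 32 kB page only
needs 2048 = 2^11 counter values, so 4 bytes of counter is plenty.)

#include <stdint.h>
#include <stdio.h>

/* Build a 16-byte IV as LSN (8) || page number (4) || block counter (4), big-endian. */
static void
build_iv(unsigned char iv[16], uint64_t lsn, uint32_t pageno, uint32_t counter)
{
    for (int i = 0; i < 8; i++)
        iv[i] = (unsigned char) (lsn >> (56 - 8 * i));
    for (int i = 0; i < 4; i++)
        iv[8 + i] = (unsigned char) (pageno >> (24 - 8 * i));
    for (int i = 0; i < 4; i++)
        iv[12 + i] = (unsigned char) (counter >> (24 - 8 * i));
}

int
main(void)
{
    unsigned char iv[16];

    build_iv(iv, UINT64_C(0x16B3A9800), 42, 0);     /* hypothetical LSN and page */
    for (int i = 0; i < 16; i++)
        printf("%02X", iv[i]);
    printf("\n");
    return 0;
}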

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Jul 25, 2019 at 07:41:14PM -0400, Stephen Frost wrote:
> > > You are right that we can allow it online, but we haven't been
> > > discussing these cases since it is easy to do this because we have
> > > access to the keys.  I do think whatever code we use for checksum online
> > > changes will be used for encryption online changes.  We would need a
> > > per-page bit to indicate encryption, hopefully in the first 16 bytes.
> > 
> > Arranging to have an individual table move from being plain to
> > encrypted is something that would be nice to support in an online and
> > non-blocking manner, but *that* is a bunch of additional work that we
> > don't need to do today as opposed to being something that's just part of
> > the initial design.  Sure, it might use functions/capabilities that
> > pg_checksums also use, but I don't know that we need to think about the
> > code sharing there being much more than that, just that those
> > capabilities should be built out in a way that they can be used for
> > multiple things (and based on what I saw, that looks like it's exactly
> > how that code was being written already).
> 
> Yes, we need to see how we are going to do that for checksums and
> encryption and come up with a plan.

This is already being done though- Andres has a patch posted already and
my recollection from a quick look at that is that it should work just
fine for enabling checksums as well as potentially moving a table to be
encrypted online- the main common bit being that we need a way to say
"OK, everything has been done but we need to flip this flag and make
sure that everyone knows that this is now all checksum'd or all
encrypted".  The only thing there that I'm not sure about is that when
it comes to checksums, I believe the idea is that it's cluster-wide,
while with encryption that would only be true if we were trying to do
something like move the entire cluster from unencrypted to encrypted in
an online fashion (including WAL, CLOG, et al...) and if that's the case
then there's a bunch of other complicated bits, I believe, that we'd
have to work out, and I don't really think it's necessary or sensible to
worry about that right now.  Those are problems we don't currently have
with checksums either- the WAL already has them and I don't think
anyone's trying to address the fact that other rather core pieces of
the system don't currently.

> > > > There seems to be a strong thrust on this thread to assume that a
> > > > database MUST go from ALL DECRYPTED to ALL ENCRYPTED in one shot (and
> > > > therefore we have to shut down the server to do it), but I don't get why
> > > > that's the case, particularly if we support any kind of mixed setup
> > > > where there's some data that's encrypted and some that isn't, and since
> > > > we're talking about using different keys for different things, it seems
> > > > to me that we almost certainly should be able to easily say "well,
> > > > there's no key for this, so just don't go through the decrypt/encrypt
> > > > routines".
> > > 
> > > No, we can't easily do different keys for different things since all the
> > > keys have to be online for crash recovery, so there isn't much value to
> > > having different keys since they always have to be online.
> > 
> > Wasn't this already discussed?  We should have a master key and then
> > additional keys for different tables, et al, which are encrypted with
> > the master key.  Joe, I believe, covered all this quite well.
> 
> Yes, I am disagreeing with that.  I posted a 5-option email that went
> over the issue and explored the value in different keys.  I am still
> unclear of the benefit since it doesn't fix the 68GB limit unless we do
> it per 1GB file, and we don't even know if that limit is per key or per
> key/IV combo.  We can't move ahead until we decide that.

I understand the 68G limit that you're referring to to be key/IV combo,
which means that a key per relation should be just fine.

Even if it was per key, and it means having a key per 1GB file,
that wouldn't change the point that I was making, so I'm not entirely
sure why it's being mentioned in this context.

I disagree with any approach that lacks a master key with additional
sub-keys, if that helps clarify things.
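
For illustration only, a minimal sketch of what wrapping a per-table key under a master key could look like with OpenSSL AES-256-GCM, so the stored blob is both confidential and authenticated; none of the names or parameter choices come from an actual patch, and real code would need proper error handling and key storage:

/*
 * Illustrative only: wrap a per-table data key under a master key with
 * AES-256-GCM via OpenSSL.  Error handling is minimal.
 *
 * Build with:  cc wrap_demo.c -lcrypto
 */
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/rand.h>

#define KEY_LEN 32
#define IV_LEN  12
#define TAG_LEN 16

/* Wrap data_key under master; returns 1 on success. */
static int
wrap_key(const unsigned char master[KEY_LEN],
         const unsigned char data_key[KEY_LEN],
         unsigned char iv[IV_LEN],          /* out: fresh random IV */
         unsigned char wrapped[KEY_LEN],    /* out: encrypted key */
         unsigned char tag[TAG_LEN])        /* out: GCM auth tag */
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int     len = 0,
            ok = 0;

    if (ctx == NULL)
        return 0;

    if (RAND_bytes(iv, IV_LEN) == 1
        && EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, master, iv) == 1
        && EVP_EncryptUpdate(ctx, wrapped, &len, data_key, KEY_LEN) == 1
        && EVP_EncryptFinal_ex(ctx, wrapped + len, &len) == 1
        && EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, TAG_LEN, tag) == 1)
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}

int
main(void)
{
    unsigned char master[KEY_LEN],
                data_key[KEY_LEN],
                iv[IV_LEN],
                wrapped[KEY_LEN],
                tag[TAG_LEN];

    RAND_bytes(master, KEY_LEN);    /* stand-in for the real master key */
    RAND_bytes(data_key, KEY_LEN);  /* per-table key, stored only wrapped */

    if (!wrap_key(master, data_key, iv, wrapped, tag))
        return 1;
    printf("wrapped a %d-byte key with a %d-byte IV and %d-byte tag\n",
           KEY_LEN, IV_LEN, TAG_LEN);
    return 0;
}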

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: ON CONFLICT (and manual row locks) cause xmax of updated tuple to unnecessarily be set

2019-07-25 Thread Peter Geoghegan
On Thu, Jul 25, 2019 at 3:10 PM Andres Freund  wrote:
> > I agree that this is unfortunate. Are you planning on working on it?
>
> Not at the moment, no. Are you planning / hoping to take a stab at it?

The current behavior ought to be fixed, and it seems like it falls to
me to do that. OTOH, anything that's MultiXact adjacent makes my eyes
water. I don't consider myself to be particularly well qualified.

I'm sure that I could quickly find a way of making the behavior you
have pointed out match what is expected without causing any regression
tests to fail, but that's the easy part.

--
Peter Geoghegan




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 07:41:14PM -0400, Stephen Frost wrote:
> > You are right that we can allow it online, but we haven't been
> > discussing these cases since it is easy to do this because we have
> > access to the keys.  I do think whatever code we use for checksum online
> > changes will be used for encryption online changes.  We would need a
> > per-page bit to indicate encryption, hopefully in the first 16 bytes.
> 
> Arranging to have an individual table move from being plain to
> encrypted is something that would be nice to support in an online and
> non-blocking manner, but *that* is a bunch of additional work that we
> don't need to do today as opposed to being something that's just part of
> the initial design.  Sure, it might use functions/capabilities that
> pg_checksums also use, but I don't know that we need to think about the
> code sharing there being much more than that, just that those
> capabilities should be built out in a way that they can be used for
> multiple things (and based on what I saw, that looks like it's exactly
> how that code was being written already).

Yes, we need to see how we are going to do that for checksums and
encryption and come up with a plan.

> > > There seems to be a strong thrust on this thread to assume that a
> > > database MUST go from ALL DECRYPTED to ALL ENCRYPTED in one shot (and
> > > therefore we have to shut down the server to do it), but I don't get why
> > > that's the case, particularly if we support any kind of mixed setup
> > > where there's some data that's encrypted and some that isn't, and since
> > > we're talking about using different keys for different things, it seems
> > > to me that we almost certainly should be able to easily say "well,
> > > there's no key for this, so just don't go through the decrypt/encrypt
> > > routines".
> > 
> > No, we can't easily do different keys for different things since all the
> > keys have to be online for crash recovery, so there isn't much value to
> > having different keys since they always have to be online.
> 
> Wasn't this already discussed?  We should have a master key and then
> additional keys for different tables, et al, which are encrypted with
> the master key.  Joe, I believe, covered all this quite well.

Yes, I am disagreeing with that.  I posted a 5-option email that went
over the issue and explored the value in different keys.  I am still
unclear of the benefit since it doesn't fix the 68GB limit unless we do
it per 1GB file, and we don't even know if that limit is per key or per
key/IV combo.  We can't move ahead until we decide that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Compile from source using latest Microsoft Windows SDK

2019-07-25 Thread Michael Paquier
On Thu, Jul 25, 2019 at 09:02:14AM +0900, Michael Paquier wrote:
> Interesting.  I am not actually sure in which version of VS this has
> been introduced.  But it would be fine enough to do nothing if the
> variable is not defined and rely on the default.  Except for the
> formatting and indentation, the patch looks right.  Andrew, perhaps
> you would prefer doing the final touch on it and commit it?

Andrew has just applied the patch as of 20e99cd (+ cb9bb15 for an
extra fix by Tom).
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 02:05:05PM -0400, Bruce Momjian wrote:
> Masahiko Sawada copied this section as a desired direction, so I want to
> drill down into it.  I think we have five possible approaches for level
> 3 listed above.
> 
> The simplest approach would be to say that the LSN/page-number and WAL
> segment-number used as IV will not collide and we can just use them
> directly.

Looking at the bits we have, the IV for AES is 16 bytes.  Since we know
we have to use LSN (to change the IV for each page write), and the page
number (so WAL updates that change multiple pages with the same LSN use
different IVs), that uses 12 bytes:

LSN 8 bytes
page-number 4 bytes

That leaves 4 bytes unused.  If we use CTR, we need 11 bits for the
counter to support 32k page sizes (per Sehrope Sarkuni), and we can use
the remaining 5 bits as constants to indicate heap, index, or WAL. 
(Technically, since we are not encrypting the first 16 bytes, we could
use one less bit for the counter.)  If we also use relfilenode, that is
4 bytes, so we have no bits for the heap/index/WAL constant, and no
space for the CTR counter, meaning we would have to use CBC mode.
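
For illustration, one possible packing of those fields into a 16-byte CTR initial counter block (an assumption, not from any patch); the bit budget here differs slightly from the 5-bit/11-bit split above, keeping the AES-block counter in the low-order bytes so CTR mode can increment it naturally:

/*
 * Illustrative only: bytes 0-7 hold the LSN, bytes 8-11 the block number,
 * byte 12 a small heap/index/WAL tag, and the low-order bytes hold the
 * AES-block counter, which CTR mode advances as it walks the (up to 512)
 * 16-byte blocks of an 8K page.
 */
#include <stdint.h>
#include <stdio.h>

#define IV_LEN 16

static void
build_page_iv(uint8_t iv[IV_LEN], uint64_t lsn, uint32_t blkno, uint8_t kind)
{
    int     i;

    for (i = 0; i < 8; i++)         /* bytes 0-7: LSN, big-endian */
        iv[i] = (uint8_t) (lsn >> (8 * (7 - i)));
    for (i = 0; i < 4; i++)         /* bytes 8-11: block number */
        iv[8 + i] = (uint8_t) (blkno >> (8 * (3 - i)));
    iv[12] = kind;                  /* 0 = heap, 1 = index, 2 = WAL */
    iv[13] = 0;                     /* reserved */
    iv[14] = 0;                     /* bytes 14-15: AES-block counter, */
    iv[15] = 0;                     /* starts at 0, advanced by CTR mode */
}

int
main(void)
{
    uint8_t iv[IV_LEN];
    int     i;

    build_page_iv(iv, UINT64_C(0x000000016B3A9F58), 42, 0);
    for (i = 0; i < IV_LEN; i++)
        printf("%02x", iv[i]);
    printf("\n");
    return 0;
}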

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Jul 25, 2019 at 05:50:57PM -0400, Stephen Frost wrote:
> > > > > pg_upgrade seems immune to most of this, and that is by design. 
> > > > > However, I am hesitant to change the heap/index page format for
> > > > > encryption because if we add fields, old pages might not fit as
> > > > > encrypted pages, and then you have to move rows around, and things
> > > > > become _much_ more complicated.
> > > > 
> > > > Yeah, I'm afraid we are going to have a hard time making this work
> > > > without changing the page format for encrypted..  I don't know if that's
> > > > actually a *huge* issue like we've considered it to be in the past or
> > > > not, as making someone rewrite just the few sensitive tables in their
> > > > environment might not be that bad, and there's also logical replication
> > > > today..
> > > 
> > > It is hard to do that while the server is offline.
> > 
> > I don't see any reason to assume we must only support encrypting
> > individual tables when the server is offline, or that we have to support
> > any option which involves the server being offline when it comes to
> > doing encryption.
> > 
> > I'm not against supporting a "shut down the server and then encrypt
> > everything and then start it up" option, but I don't see any
> > particularly good reason to explicitly design the system with that
> > use-case in mind.
> 
> You are right that we can allow it online, but we haven't been
> discussing these cases since it is easy to do this because we have
> access to the keys.  I do think whatever code we use for checksum online
> changes will be used for encryption online changes.  We would need a
> per-page bit to indicate encryption, hopefully in the first 16 bytes.

Arranging to have an individual table move from being plain to
encrypted is something that would be nice to support in an online and
non-blocking manner, but *that* is a bunch of additional work that we
don't need to do today as opposed to being something that's just part of
the initial design.  Sure, it might use functions/capabilities that
pg_checksums also use, but I don't know that we need to think about the
code sharing there being much more than that, just that those
capabilities should be built out in a way that they can be used for
multiple things (and based on what I saw, that looks like it's exactly
how that code was being written already).

> > There seems to be a strong thrust on this thread to assume that a
> > database MUST go from ALL DECRYPTED to ALL ENCRYPTED in one shot (and
> > therefore we have to shut down the server to do it), but I don't get why
> > that's the case, particularly if we support any kind of mixed setup
> > where there's some data that's encrypted and some that isn't, and since
> > we're talking about using different keys for different things, it seems
> > to me that we almost certainly should be able to easily say "well,
> > there's no key for this, so just don't go through the decrypt/encrypt
> > routines".
> 
> No, we can't easily do different keys for different things since all the
> keys have to be online for crash recovery, so there isn't much value to
> having different keys since they always have to be online.

Wasn't this already discussed?  We should have a master key and then
additional keys for different tables, et al, which are encrypted with
the master key.  Joe, I believe, covered all this quite well.

Either way though, I don't think it really goes against the point that I
was trying to make- we should be able to figure out if a table is
encrypted or not based on some information that we arrange to have
available during crash recovery and online processing, and the absence
of such should allow us to skip the encryption/decryption routines and
work with the table as we do today.  We should be thinking about
migrating from a completely unencrypted database to a database which has
all the 'core' bits encrypted (possibly as part of pg_upgrade or through
an offline tool of some kind) but the user data not encrypted, and then
online allow users to create new tables which are encrypted (maybe by
putting them in a particular tablespace or as a single table) and then
operate with those tables just like they would any other table in the
system, and let them manage how they move their sensitive data from some
other table into the encrypted table (or from another system into the
encrypted table).

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 05:50:57PM -0400, Stephen Frost wrote:
> > > > pg_upgrade seems immune to most of this, and that is by design. 
> > > > However, I am hesitant to change the heap/index page format for
> > > > encryption because if we add fields, old pages might not fit as
> > > > encrypted pages, and then you have to move rows around, and things
> > > > become _much_ more complicated.
> > > 
> > > Yeah, I'm afraid we are going to have a hard time making this work
> > > without changing the page format for encrypted..  I don't know if that's
> > > actually a *huge* issue like we've considered it to be in the past or
> > > not, as making someone rewrite just the few sensitive tables in their
> > > environment might not be that bad, and there's also logical replication
> > > today..
> > 
> > It is hard to do that while the server is offline.
> 
> I don't see any reason to assume we must only support encrypting
> individual tables when the server is offline, or that we have to support
> any option which involves the server being offline when it comes to
> doing encryption.
> 
> I'm not against supporting a "shut down the server and then encrypt
> everything and then start it up" option, but I don't see any
> particularly good reason to explicitly design the system with that
> use-case in mind.

You are right that we can allow it online, but we haven't been
discussing these cases since it is easy to do this because we have
access to the keys.  I do think whatever code we use for checksum online
changes will be used for encryption online changes.  We would need a
per-page bit to indicate encryption, hopefully in the first 16 bytes.

> There seems to be a strong thrust on this thread to assume that a
> database MUST go from ALL DECRYPTED to ALL ENCRYPTED in one shot (and
> therefore we have to shut down the server to do it), but I don't get why
> that's the case, particularly if we support any kind of mixed setup
> where there's some data that's encrypted and some that isn't, and since
> we're talking about using different keys for different things, it seems
> to me that we almost certainly should be able to easily say "well,
> there's no key for this, so just don't go through the decrypt/encrypt
> routines".

No, we can't easily do different keys for different things since all the
keys have to be online for crash recovery, so there isn't much value to
having different keys since they always have to be online.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: On the stability of TAP tests for LDAP

2019-07-25 Thread Michael Paquier
On Fri, Jul 26, 2019 at 10:18:48AM +1200, Thomas Munro wrote:
> Pushed, thanks.

Thanks for fixing!  I'll update this thread if there are still some
problems.
--
Michael


signature.asc
Description: PGP signature


Re: On the stability of TAP tests for LDAP

2019-07-25 Thread Thomas Munro
On Thu, Jul 25, 2019 at 12:51 PM Michael Paquier  wrote:
> Thanks for the updated patch, this looks good.  I have done a series
> of tests keeping my laptop busy and I haven't seen a failure where I
> usually see problems 10%~20% of the time.  So that seems to help,
> thanks!

Pushed, thanks.

-- 
Thomas Munro
https://enterprisedb.com




Re: ON CONFLICT (and manual row locks) cause xmax of updated tuple to unnecessarily be set

2019-07-25 Thread Andres Freund
Hi,

On 2019-07-24 17:14:39 -0700, Peter Geoghegan wrote:
> On Wed, Jul 24, 2019 at 4:24 PM Andres Freund  wrote:
> > but we really don't need to do any of that in this case - the only
> > locker is the current backend, after all.
> >
> > I think this isn't great, because it'll later cause unnecessary
> > hint bit writes (although ones somewhat likely to be combined with setting
> > XMIN_COMMITTED), and even extra work for freezing.
> >
> > Based on a quick look this wasn't the case before the finer grained
> > tuple locking - which makes sense, there were no cases where locks would
> > need to be carried forward.
> 
> I agree that this is unfortunate. Are you planning on working on it?

Not at the moment, no. Are you planning / hoping to take a stab at it?

Greetings,

Andres Freund




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Jul 25, 2019 at 03:55:01PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > > On Thu, Jul 25, 2019 at 03:41:05PM -0400, Stephen Frost wrote:
> > > > * Bruce Momjian (br...@momjian.us) wrote:
> > > > > After talking to Joe Conway, I just want to mention that if we decide
> > > > > that the LSN is unique among heap and index, or among heap or index, 
> > > > > we
> > > > > will need to make sure future WAL records retain this uniqueness.
> > > > 
> > > > One thing comes to mind regarding this and I'll admit that I don't quite
> > > > remember exactly off-hand but I also don't want to not mention it now
> > > > and forget to later.
> > > > 
> > > > What about pg_upgrade?
> > > 
> > > So, we don't carry WAL from the old cluster to the new cluster, so if
> > > the WAL is changed and had duplicates, it would only be new WAL records.
> > 
> > Right, we don't carry it forward- but what I couldn't remember is if
> > start from more-or-less LSN 0 or if pg_upgrade will arrange it such that
> > the new major version will start from LSN-of-old+1 (or whatever).  Seems
> > like it'd *have* to be the latter, but just thought of it and wanted to
> > make sure.
> 
> pg_upgrade uses pg_resetwal -l to set the next WAL segment file based on
> the value in the old cluster:
> 
> /* now reset the wal archives in the new cluster */
> prep_status("Resetting WAL archives");
> exec_prog(UTILITY_LOG_FILE, NULL, true, true,
> /* use timeline 1 to match controldata and no WAL history file */
> -->   "\"%s/pg_resetwal\" -l 0001%s \"%s\"", new_cluster.bindir,
>   old_cluster.controldata.nextxlogfile + 8,
>   new_cluster.pgdata);

Ah, right, ok, we reset the timeline but not the LSN, ok.

> > > pg_upgrade seems immune to most of this, and that is by design. 
> > > However, I am hesitant to change the heap/index page format for
> > > encryption because if we add fields, old pages might not fit as
> > > encrypted pages, and then you have to move rows around, and things
> > > become _much_ more complicated.
> > 
> > Yeah, I'm afraid we are going to have a hard time making this work
> > without changing the page format for encrypted..  I don't know if that's
> > actually a *huge* issue like we've considered it to be in the past or
> > not, as making someone rewrite just the few sensitive tables in their
> > environment might not be that bad, and there's also logical replication
> > today..
> 
> It is hard to do that while the server is offline.

I don't see any reason to assume we must only support encrypting
individual tables when the server is offline, or that we have to support
any option which involves the server being offline when it comes to
doing encryption.

I'm not against supporting a "shut down the server and then encrypt
everything and then start it up" option, but I don't see any
particularly good reason to explicitly design the system with that
use-case in mind.

There seems to be a strong thrust on this thread to assume that a
database MUST go from ALL DECRYPTED to ALL ENCRYPTED in one shot (and
therefore we have to shut down the server to do it), but I don't get why
that's the case, particularly if we support any kind of mixed setup
where there's some data that's encrypted and some that isn't, and since
we're talking about using different keys for different things, it seems
to me that we almost certainly should be able to easily say "well,
there's no key for this, so just don't go through the decrypt/encrypt
routines".

I can see an argument for why we might need to go through a restart and
possibly use some off-line tool when a user decides they want, say, the
WAL, or CLOG, or the other control files, to be encrypted, or basic
encryption capability to be set up for the cluster (like generating the
master key and storing it or making some changes to the control file to
indicate that some things in this cluster has encrypted bits), but
saying we must have the server offline to support encrypted tables that
have a different page format is a bit like saying we need to take the
server offline to add a new tableam (like zheap) and that we have to use
some off-line utility while the server is down to convert a given table
from heapam to zheap, isn't it?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: psql - add SHOW_ALL_RESULTS option

2019-07-25 Thread Fabien COELHO


Bonsoir Daniel,


FYI you forgot to remove that bit:

+ "SINGLELINE|SINGLESTEP|SHOW_ALL_RESULTS"))


Indeed. I found another such instance in "help.c".


Also copydml does not  seem to be exercised with combined
queries,  so do we need this chunk:



--- a/src/test/regress/sql/copydml.sql


Yep, because I reorganized the notice code significantly, and I wanted to 
be sure that the right notices are displayed in the right order, which 
does not show if the trigger just says "NOTICE:  UPDATE 8".


Attached a v2 for the always-show-all-results variant. Thanks for the 
debug!


--
Fabien.
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 7789fc6177..5e4f653f88 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3355,10 +3355,8 @@ select 1\; select 2\; select 3;
 commands included in the string to divide it into multiple
 transactions.  (See 
 for more details about how the server handles multi-query strings.)
-psql prints only the last query result
-it receives for each request; in this example, although all
-three SELECTs are indeed executed, psql
-only prints the 3.
+psql prints all results it receives, one
+after the other.
 
 
   
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 44a782478d..4534c45854 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -472,6 +472,16 @@ ResetCancelConn(void)
 #endif
 }
 
+static void
+ShowErrorMessage(const PGresult *result)
+{
+	const char *error = PQerrorMessage(pset.db);
+
+	if (strlen(error))
+		pg_log_info("%s", error);
+
+	CheckConnection();
+}
 
 /*
  * AcceptResult
@@ -482,7 +492,7 @@ ResetCancelConn(void)
  * Returns true for valid result, false for error state.
  */
 static bool
-AcceptResult(const PGresult *result)
+AcceptResult(const PGresult *result, bool show_error)
 {
 	bool		OK;
 
@@ -513,15 +523,8 @@ AcceptResult(const PGresult *result)
 break;
 		}
 
-	if (!OK)
-	{
-		const char *error = PQerrorMessage(pset.db);
-
-		if (strlen(error))
-			pg_log_info("%s", error);
-
-		CheckConnection();
-	}
+	if (!OK && show_error)
+		ShowErrorMessage(result);
 
 	return OK;
 }
@@ -701,7 +704,7 @@ PSQLexec(const char *query)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		res = NULL;
@@ -743,7 +746,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		return 0;
@@ -999,199 +1002,113 @@ loop_exit:
 	return success;
 }
 
-
 /*
- * ProcessResult: utility function for use by SendQuery() only
- *
- * When our command string contained a COPY FROM STDIN or COPY TO STDOUT,
- * PQexec() has stopped at the PGresult associated with the first such
- * command.  In that event, we'll marshal data for the COPY and then cycle
- * through any subsequent PGresult objects.
- *
- * When the command string contained no such COPY command, this function
- * degenerates to an AcceptResult() call.
- *
- * Changes its argument to point to the last PGresult of the command string,
- * or NULL if that result was for a COPY TO STDOUT.  (Returning NULL prevents
- * the command status from being printed, which we want in that case so that
- * the status line doesn't get taken as part of the COPY data.)
- *
- * Returns true on complete success, false otherwise.  Possible failure modes
- * include purely client-side problems; check the transaction status for the
- * server-side opinion.
+ * Marshal the COPY data.  Either subroutine will get the
+ * connection out of its COPY state, then call PQresultStatus()
+ * once and report any error.
+ *
+ * For COPY OUT, direct the output to pset.copyStream if it's set,
+ * otherwise to pset.gfname if it's set, otherwise to queryFout.
+ * For COPY IN, use pset.copyStream as data source if it's set,
+ * otherwise cur_cmd_source.
+ *
+ * Update result if further processing is necessary.  (Returning NULL
+ * prevents the command status from being printed, which we want in that
+ * case so that the status line doesn't get taken as part of the COPY data.)
  */
 static bool
-ProcessResult(PGresult **results)
+HandleCopyResult(PGresult **result)
 {
 	bool		success = true;
-	bool		first_cycle = true;
+	FILE	   *copystream;
+	PGresult   *copy_result;
+	ExecStatusType result_status = PQresultStatus(*result);
 
-	for (;;)
+	Assert(result_status == PGRES_COPY_OUT ||
+		   result_status == PGRES_COPY_IN);
+
+	SetCancelConn();
+	if (result_status == PGRES_COPY_OUT)
 	{
-		ExecStatusType result_status;
-		bool		is_copy;
-		PGresult   *next_result;
+		bool		need_close = false;
+		bool		is_pipe = false;
 
-		if (!AcceptResult(*results))
+		if (pset.copyStream)
 		{
-			/*
-			 * Failure at this point is always a server-side failure or a
-			 * failure to submit the command string.  Either way, 

Re: Server crash due to assertion failure in CheckOpSlotCompatibility()

2019-07-25 Thread Andres Freund
Hi,

On 2019-06-10 22:42:52 -0400, Alvaro Herrera wrote:
> On 2019-May-30, Andres Freund wrote:
> > On 2019-05-30 16:31:39 +0530, Ashutosh Sharma wrote:
> > > Here are some more details on the crash reported in my previous e-mail for
> > > better clarity:
> > 
> > I'll look into this once pgcon is over... Thanks for finding!
> 
> Ping?

:( I've now finally pushed the fix.  I was kinda exhausted for a
while...

Ashutosh, thanks for the report!

Greetings,

Andres Freund




Re: Server crash due to assertion failure in CheckOpSlotCompatibility()

2019-07-25 Thread Andres Freund
Hi,

On 2019-05-30 16:31:39 +0530, Ashutosh Sharma wrote:
> > *Analysis:*
> > I did some quick investigation on this and found that when the aggregate
> > is performed on the first group i.e. group by 'a', all the input tuples are
> > fetched from the outer plan and stored into the tuplesort object and for
> > the subsequent groups i.e. from the second group onwards, the tuples stored
> > in the tuplesort object during the 1st phase are used. But, then, the tuples stored
> > in the tuplesort object are actually the minimal tuples whereas it is
> > expected to be a heap tuple, which actually results in the assertion
> > failure.
> >
> > I might be wrong, but it seems to me like the slot fetched from tuplesort
> > object needs to be converted to the heap tuple. Actually the following
> > lines of code in agg_retrieve_direct() get executed only when we have
> > crossed a group boundary. I think, at least the function call to
> > ExecCopySlotHeapTuple(outerslot); followed by ExecForceStoreHeapTuple();
> > should always happen irrespective of whether the group boundary is crossed or
> > not... Sorry if I'm saying something ...

I think that's mostly the right diagnosis, but I think it's not the
right fix.  We can just flag here that the slot type isn't fixed - we're
not using any slot type specific functions, we just are promising that
the slot type doesn't change (mostly for the benefit of JIT compiled
deforming, which needs to generate different code for different slot
types).

I've pushed a fix for that.  As the commit explains:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=af3deff3f2ac79585481181cb198b04c67486c09

we probably could quite easily optimize this case further by setting the
slot type separately for each "phase" of grouping set processing. As we
already generate separate expressions for each phase, that should be
quite doable.  But that's something for another day, and not for v12.

Greetings,

Andres Freund




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Mon, Jul 15, 2019 at 06:08:28PM -0400, Sehrope Sarkuni wrote:
> Hi,
> 
> Some more thoughts on CBC vs CTR modes. There are a number of
> advantages to using CTR mode for page encryption.
> 
> CTR encryption modes can be fully parallelized, whereas CBC can only be
> parallelized for decryption. While both can use AES specific hardware
> such as AES-NI, CTR modes can go a step further and use vectorized
> instructions.
> 
> On an i7-8559U (with AES-NI) I get a 4x speed improvement for
> CTR-based modes vs CBC when run on 8K of data:
> 
> # openssl speed -evp ${cipher}
> type          16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes   16384 bytes
> aes-128-cbc   1024361.51k  1521249.60k  1562033.41k  1571663.87k  1574537.90k  1575512.75k
> aes-128-ctr    696866.85k  2214441.86k  4364903.85k  5896221.35k  6559735.81k  6619594.75k
> aes-128-gcm    642758.92k  1638619.09k  3212068.27k  5085193.22k  6366035.97k  6474006.53k
> aes-256-cbc    940906.25k  1114628.44k  1131255.13k  1138385.92k  1140258.13k  1143592.28k
> aes-256-ctr    582161.82k  1896409.32k  3216926.12k  4249708.20k  4680299.86k  4706375.00k
> aes-256-gcm    553513.89k  1532556.16k  2705510.57k  3931744.94k  4615812.44k  4673093.63k

I am back to this email now.  I think there is a strong case that we
should use CTR mode for both WAL and heap/index files because CTR mode
is faster.  CBC mode has the advantage of being more immune to IV
duplication, but I think the fact that the page format is similar enough
among pages means we don't gain a lot from that, and therefore IV
uniqueness must be closely honored anyway.

> For relation data where the encryption is going to be per page,
> there's flexibility in how the CTR nonce (IV + counter) is generated.
> With an 8K page, the counter need only go up to 512 for each page
> (8192-bytes per page / 16-bytes per AES-block). That would require
> 9-bits for the counter. Rounding that up to 16-bits allows for wider
> pages and it still uses only two bytes of the counter while ensuring
> that it'd be unique per AES-block. The remaining 14-bytes would be
> populated with some other data that is guaranteed unique per
> page-write to allow encryption via the same per-relation-file derived
> key. From what I gather, the LSN is a candidate though it'd have to be
> stored in plaintext for decryption.

Yes, LSN is 8 bytes, and the page number is 4 bytes.  That leaves four
bytes of the counter.
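
To make the CTR option concrete, here is a minimal sketch (assuming OpenSSL's EVP interface; not code from any patch) of encrypting a single 8K page with AES-128-CTR while leaving the first 16 bytes, where the LSN lives, in the clear; the key and IV are placeholders for the per-file key and LSN/block-number IV discussed above:

/*
 * Illustrative only.  Build with:  cc ctr_page_demo.c -lcrypto
 */
#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/rand.h>

#define PAGE_SIZE   8192
#define PLAIN_HDR   16      /* bytes left unencrypted at the page start */

/* Encrypt 'in' into 'out'; both buffers are PAGE_SIZE bytes.  Returns 1 on success. */
static int
encrypt_page(const unsigned char *in, unsigned char *out,
             const unsigned char key[16], const unsigned char iv[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int     len = 0,
            ok = 0;

    if (ctx == NULL)
        return 0;

    memcpy(out, in, PLAIN_HDR);     /* header (LSN etc.) stays in the clear */

    /* CTR is a stream mode: ciphertext length equals plaintext length */
    if (EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv) == 1
        && EVP_EncryptUpdate(ctx, out + PLAIN_HDR, &len,
                             in + PLAIN_HDR, PAGE_SIZE - PLAIN_HDR) == 1
        && EVP_EncryptFinal_ex(ctx, out + PLAIN_HDR + len, &len) == 1)
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}

int
main(void)
{
    static unsigned char page[PAGE_SIZE];   /* pretend this is a heap page */
    static unsigned char enc[PAGE_SIZE];
    unsigned char key[16],
                iv[16];

    RAND_bytes(key, sizeof(key));   /* stand-in for a per-file derived key */
    memset(iv, 0, sizeof(iv));      /* stand-in for an LSN/block-number IV */

    if (!encrypt_page(page, enc, key, iv))
        return 1;
    printf("first encrypted byte: %02x\n", enc[PLAIN_HDR]);
    return 0;
}

Because CTR produces ciphertext of the same length as the plaintext, the page layout itself does not change.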

> What's important is that writing the two pages (either different
> locations or the same page back again) never reuses the same nonce
> with the same key. Using the same nonce with a different key is fine.
> 
> With any of these schemes the same inputs will generate the same
> outputs. With CTR mode for WAL this would be an issue if the same key
> and deterministic nonce (ex: LSN + offset) is reused in multiple
> places. That does not have to be the same cluster either. For example
> if two replicas are promoted from the same backup with the same master
> key, they would generate the same WAL CTR stream, reusing the
> key/nonce pair. Ditto for starting off with a master key and deriving
> per-relation keys in a cloned installation off some deterministic
> attribute such as oid.

I think we need to document that sharing keys among clusters (except
for identical replicas) is insecure.

We can add the "Database system identifier" into the IV, which would
avoid the problem of two clusters using the same key, but it wouldn't
avoid the problem with promoting two replicas to primaries because they
would have the same "Database system identifier" so I think it is just
simpler to say "don't do that".

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: psql - add SHOW_ALL_RESULTS option

2019-07-25 Thread Daniel Verite
Fabien COELHO wrote:

> Attached a "do it always version", which does the necessary refactoring. 
> There is little new code; it is mostly moved around, and some functions are 
> created for clarity.

Thanks for the update!
FYI you forgot to remove that bit:

--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3737,7 +3737,7 @@ psql_completion(const char *text, int start, int end)
else if (TailMatchesCS("\\set", MatchAny))
{
if (TailMatchesCS("AUTOCOMMIT|ON_ERROR_STOP|QUIET|"
- "SINGLELINE|SINGLESTEP"))
+
"SINGLELINE|SINGLESTEP|SHOW_ALL_RESULTS"))

Also copydml does not  seem to be exercised with combined
queries,  so do we need this chunk:

--- a/src/test/regress/sql/copydml.sql
+++ b/src/test/regress/sql/copydml.sql
@@ -70,10 +70,10 @@ drop rule qqq on copydml_test;
 create function qqq_trig() returns trigger as $$
 begin
 if tg_op in ('INSERT', 'UPDATE') then
-raise notice '% %', tg_op, new.id;
+raise notice '% % %', tg_when, tg_op, new.id;
 return new;
 else
-raise notice '% %', tg_op, old.id;
+raise notice '% % %', tg_when, tg_op, old.id;
 return old;
 end if;


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: Question about MemoryContexts / possible memory leak in CachedPlanSource usage

2019-07-25 Thread Daniel Migowski
Ah, you are right, I looked in fe_memutils.c. Makes sense now, thanks!!


-Original Message-
From: Andres Freund  
Sent: Thursday, July 25, 2019 22:31
To: Daniel Migowski 
Cc: pgsql-hack...@postgresql.org
Subject: Re: Question about MemoryContexts / possible memory leak in 
CachedPlanSource usage

Hi,

On 2019-07-25 20:21:06 +, Daniel Migowski wrote:
> When CachedPlanSource instances are created the field query_string is 
> filled with pstrdup(query_string) in CreateCachedPlan, 
> plancache.c:182, which is just a wrapper for strdup. According to the 
> docs the returned pointer should be freed with “free” sometime later.

Note pstrdup is *not* just a wrapper for strdup:

return MemoryContextStrdup(CurrentMemoryContext, in);

i.e. it explicitly allocates memory in the current memory context.

Perhaps you looked at the version of pstrdup() in src/common/fe_memutils.c? 
That's just for "frontend" code (we call code that doesn't run in the server 
frontend, for reasons). There we don't have the whole memory context 
infrastructure... It's only there so we can reuse code that uses pstrdup() 
between frontend and server.


> I believe in DropCachedPlan the free should take place but I don’t see 
> it. Is it just missing or is memory allocated by strdup and friends 
> automatically created in the current MemoryContext?  If so, why do I 
> need to use palloc() instead of malloc()?

We don't intercept malloc itself (doing so would have a *lot* of issues, 
starting from palloc internally using malloc, and ending with having lots of 
problems with libraries because their malloc would suddenly behave differently).


Greetings,

Andres Freund


Re: "localtime" value in TimeZone

2019-07-25 Thread Tom Lane
Shay Rojansky  writes:
> I followed the conversations you linked to, and disagreements seem to be
> mostly about other aspects of timezone selection. Does it make sense to
> have a limited, restricted conversation specifically about avoiding
> "localtime"?

I've tried to kick-start the other thread...

regards, tom lane




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-07-25 Thread Tom Lane
Thomas Munro  writes:
> FWIW this is now fixed for FreeBSD 13-CURRENT, with a good chance of
> back-patch.  I don't know if there are any other operating systems
> that are shipping zoneinfo but failing to install zone1970.tab, but if
> there are it's a mistake IMHO and they'll probably fix that if someone
> complains, considering that zone.tab literally tells you to go and use
> the newer version, and Paul Eggert has implied that zone1970.tab is
> the "full" and "canonical" list[1].

I'm not sure we're any closer to a meeting of the minds on whether
consulting zone[1970].tab is a good thing to do, but we got an actual
user complaint[1] about how "localtime" should not be a preferred
spelling.  So I want to go ahead and insert the discussed anti-preference
against "localtime" and "posixrules", as per 0001 below.  If we do do
something with zone[1970].tab, we'd still need these special rules,
so I don't think this is blocking anything.

Also, I poked into the question of the "Factory" zone a bit more,
and was disappointed to find that not only does FreeBSD still install
the "Factory" zone, but they are apparently hacking the data so that
it emits the two-changes-back abbreviation "Local time zone must be
set--use tzsetup".  This bypasses the filter in pg_timezone_names that
is expressly trying to prevent showing such silly "abbreviations".
So I now feel that not only can we not remove initdb's discrimination
against "Factory", but we indeed need to make the pg_timezone_names
filter more aggressive.  Hence, I now propose 0002 below to tweak
what we're doing with "Factory".  I did remove our special cases for
it in zic.c, as we don't need them anymore with modern tzdb data, and
there's no reason to support running "zic -P" with hacked-up data.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/cadt4rqccnj6fklisvt8ttpftp4azphhdfjqdf1jfbboh5w4...@mail.gmail.com

diff --git a/src/bin/initdb/findtimezone.c b/src/bin/initdb/findtimezone.c
index a5c9c9e..786e787 100644
--- a/src/bin/initdb/findtimezone.c
+++ b/src/bin/initdb/findtimezone.c
@@ -608,22 +608,28 @@ check_system_link_file(const char *linkname, struct tztry *tt,
 /*
  * Given a timezone name, determine whether it should be preferred over other
  * names which are equally good matches. The output is arbitrary but we will
- * use 0 for "neutral" default preference.
- *
- * Ideally we'd prefer the zone.tab/zone1970.tab names, since in general those
- * are the ones offered to the user to select from. But for the moment, to
- * minimize changes in behaviour, simply prefer UTC over alternative spellings
- * such as UCT that otherwise cause confusion. The existing "shortest first"
- * rule would prefer "UTC" over "Etc/UTC" so keep that the same way (while
- * still preferring Etc/UTC over Etc/UCT).
+ * use 0 for "neutral" default preference; larger values are more preferred.
  */
 static int
 zone_name_pref(const char *zonename)
 {
+	/*
+	 * Prefer UTC over alternatives such as UCT.  Also prefer Etc/UTC over
+	 * Etc/UCT; but UTC is preferred to Etc/UTC.
+	 */
 	if (strcmp(zonename, "UTC") == 0)
 		return 50;
 	if (strcmp(zonename, "Etc/UTC") == 0)
 		return 40;
+
+	/*
+	 * We don't want to pick "localtime" or "posixrules", unless we can find
+	 * no other name for the prevailing zone.  Those aren't real zone names.
+	 */
+	if (strcmp(zonename, "localtime") == 0 ||
+		strcmp(zonename, "posixrules") == 0)
+		return -50;
+
 	return 0;
 }
 
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 4d8db1a..972fcd2 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -4826,16 +4826,15 @@ pg_timezone_names(PG_FUNCTION_ARGS)
 			continue;			/* ignore if conversion fails */
 
 		/*
-		 * Ignore zic's rather silly "Factory" time zone.  The long string
-		 * about "see zic manual page" is used in tzdata versions before
-		 * 2016g; we can drop it someday when we're pretty sure no such data
-		 * exists in the wild on platforms using --with-system-tzdata.  In
-		 * 2016g and later, the time zone abbreviation "-00" is used for
-		 * "Factory" as well as some invalid cases, all of which we can
-		 * reasonably omit from the pg_timezone_names view.
+		 * IANA's rather silly "Factory" time zone used to emit ridiculously
+		 * long "abbreviations" such as "Local time zone must be set--see zic
+		 * manual page" or "Local time zone must be set--use tzsetup".  While
+		 * modern versions of tzdb emit the much saner "-00", it seems some
+		 * benighted packagers are hacking the IANA data so that it continues
+		 * to produce these strings.  To prevent producing a weirdly wide
+		 * abbrev column, reject ridiculously long abbreviations.
 		 */
-		if (tzn && (strcmp(tzn, "-00") == 0 ||
-	strcmp(tzn, "Local time zone must be set--see zic manual page") == 0))
+		if (tzn && strlen(tzn) > 31)
 			continue;
 
 		/* Found a displayable zone */
diff --git a/src/timezone/zic.c 

Re: Question about MemoryContexts / possible memory leak in CachedPlanSource usage

2019-07-25 Thread Andres Freund
Hi,

On 2019-07-25 20:21:06 +, Daniel Migowski wrote:
> When CachedPlanSource instances are created the field query_string is
> filled with pstrdup(query_string) in CreateCachedPlan,
> plancache.c:182, which is just a wrapper for strdup. According to the
> docs the returned pointer should be freed with “free” sometime later.

Note pstrdup is *not* just a wrapper for strdup:

return MemoryContextStrdup(CurrentMemoryContext, in);

i.e. it explicitly allocates memory in the current memory context.

Perhaps you looked at the version of pstrdup() in
src/common/fe_memutils.c? That's just for "frontend" code (we call code
that doesn't run in the server frontend, for reasons). There we don't
have the whole memory context infrastructure... It's only there so we
can reuse code that uses pstrdup() between frontend and server.


> I believe in DropCachedPlan the free should take place but I don’t see
> it. Is it just missing or is memory allocated by strdup and friends
> automatically created in the current MemoryContext?  If so, why do I
> need to use palloc() instead of malloc()?

We don't intercept malloc itself (doing so would have a *lot* of issues,
starting from palloc internally using malloc, and ending with having
lots of problems with libraries because their malloc would suddenly
behave differently).
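
For what it's worth, a simplified illustration of the pattern (not the actual plancache.c code): anything pstrdup'd or palloc'd while a dedicated context is current is released in one shot when that context is deleted, which is presumably what DropCachedPlan relies on rather than an explicit free.

/*
 * Simplified backend-style sketch; compiles against the server headers.
 */
#include "postgres.h"
#include "utils/memutils.h"

typedef struct DemoPlanSource
{
    MemoryContext context;      /* all of this object's storage lives here */
    const char *query_string;   /* pstrdup'd copy, owned by 'context' */
} DemoPlanSource;

static DemoPlanSource *
demo_create(const char *query_string)
{
    MemoryContext source_context;
    MemoryContext oldcxt;
    DemoPlanSource *src;

    source_context = AllocSetContextCreate(CurrentMemoryContext,
                                           "DemoPlanSource",
                                           ALLOCSET_SMALL_SIZES);
    oldcxt = MemoryContextSwitchTo(source_context);

    src = palloc0(sizeof(DemoPlanSource));
    src->context = source_context;
    src->query_string = pstrdup(query_string);  /* copied into source_context */

    MemoryContextSwitchTo(oldcxt);
    return src;
}

static void
demo_drop(DemoPlanSource *src)
{
    /* Frees the struct, the query string, and everything else in the context */
    MemoryContextDelete(src->context);
}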


Greetings,

Andres Freund




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 03:55:01PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Thu, Jul 25, 2019 at 03:41:05PM -0400, Stephen Frost wrote:
> > > Greetings,
> > > 
> > > * Bruce Momjian (br...@momjian.us) wrote:
> > > > After talking to Joe Conway, I just want to mention that if we decide
> > > > that the LSN is unique among heap and index, or among heap or index, we
> > > > will need to make sure future WAL records retain this uniqueness.
> > > 
> > > One thing comes to mind regarding this and I'll admit that I don't quite
> > > remember exactly off-hand but I also don't want to not mention it now
> > > and forget to later.
> > > 
> > > What about pg_upgrade?
> > 
> > So, we don't carry WAL from the old cluster to the new cluster, so if
> > the WAL is changed and had duplicates, it would only be new WAL records.
> 
> Right, we don't carry it forward- but what I couldn't remember is if
> start from more-or-less LSN 0 or if pg_upgrade will arrange it such that
> the new major version will start from LSN-of-old+1 (or whatever).  Seems
> like it'd *have* to be the latter, but just thought of it and wanted to
> make sure.

pg_upgrade uses pg_resetwal -l to set the next WAL segment file based on
the value in the old cluster:

/* now reset the wal archives in the new cluster */
prep_status("Resetting WAL archives");
exec_prog(UTILITY_LOG_FILE, NULL, true, true,
/* use timeline 1 to match controldata and no WAL history file */
-->   "\"%s/pg_resetwal\" -l 0001%s \"%s\"", new_cluster.bindir,
  old_cluster.controldata.nextxlogfile + 8,
  new_cluster.pgdata);

> > pg_upgrade seems immune to most of this, and that is by design. 
> > However, I am hesitant to change the heap/index page format for
> > encryption because if we add fields, old pages might not fit as
> > encrypted pages, and then you have to move rows around, and things
> > become _much_ more complicated.
> 
> Yeah, I'm afraid we are going to have a hard time making this work
> without changing the page format for encrypted..  I don't know if that's
> actually a *huge* issue like we've considered it to be in the past or
> not, as making someone rewrite just the few sensitive tables in their
> environment might not be that bad, and there's also logical replication
> today..

It is hard to do that while the server is offline.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Question about MemoryContexts / possible memory leak in CachedPlanSource usage

2019-07-25 Thread Daniel Migowski
Hello,

I am just starting to get my feet wet with PostgreSQL development and am 
starting to understand the source, so please be kind . I am working on the 
REL_11_4 tag.

When CachedPlanSource instances are created the field query_string is filled 
with pstrdup(query_string) in CreateCachedPlan, plancache.c:182, which is just 
a wrapper for strdup. According to the docs the returned pointer should be 
freed with “free” sometime later.

I believe in DropCachedPlan the free should take place but I don’t see it. Is 
it just missing or is memory allocated by strdup and friends automatically 
created in the current MemoryContext? If so, why do I need to use palloc() 
instead of malloc()?

Kind regards,
Daniel Migowski


[DOC] Document auto vacuum interruption

2019-07-25 Thread James Coleman
We've discussed this internally many times, but today finally decided
to write up a doc patch.

Autovacuum holds a SHARE UPDATE EXCLUSIVE lock, but other processes
can cancel autovacuum if blocked by that lock unless the autovacuum is
to prevent wraparound. This can result in very surprising behavior:
imagine a system that needs to run ANALYZE manually before batch jobs
to ensure reasonable query plans. That ANALYZE will interrupt attempts
to run autovacuum, and pretty soon the table is far more bloated than
expected, and query plans (ironically) degrade further.

Attached is a patch to document that behavior (as opposed to just in
the code at src/backend/storage/lmgr/proc.c:1320-1321).

James Coleman


autovacuum-interruption-v1.patch
Description: Binary data


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 03:43:34PM -0400, Alvaro Herrera wrote:
> On 2019-Jul-15, Bruce Momjian wrote:
> 
> > Uh, if someone modifies a few bytes of the page, we will decrypt it, but
> > the checksum (per-page or WAL) will not match our decrypted output.  How
> > would they make it match the checksum without already knowing the key. 
> > I read [1] but could not see that explained.
> > 
> > This post discussed it:
> > 
> > 
> > https://crypto.stackexchange.com/questions/202/should-we-mac-then-encrypt-or-encrypt-then-mac
> 
> I find all the discussion downthread from this post pretty confusing.

Agreed.

> Why are we encrypting the page header in the first place?  It seems to
> me that the encrypted area should cover only the line pointers and the
> tuple data area; the page header needs to be unencrypted so that it can
> be used at all: firstly because you need to obtain the LSN from it in

Yes, the plan was to not encrypt the first 16 bytes so the LSN was visible.

> order to compute the IV, and secondly because the checksum must be
> validated *before* decrypting (per Moxie Marlinspike's "cryptographic
> doom" principle mentioned in a comment in the SE question).

Uh, I think we are still on the fence about writing the checksum _after_
encryption, but I think we are leaning against that, meaning online or
offline encryption must be able to decrypt the page.  Since we will
already need an offline tool to enable/remove encryption anyway, it
seems we can just reuse that code for pg_checksums.

I think we have three options with for CRC:

1.  compute CRC and then encrypt everything

2.  encrypt and then CRC, and store the CRC unchanged

3.  encrypt and then CRC, and store the CRC encrypted

The only way offline tools can verify the CRC without access to the keys
is via #2, but #2 gives us _no_ detection of tampering.  I realize the
CRC tampering detection of #1 and #3 is not great, but it certainly has
some value.

> I am not totally clear on whether the special space and the "page hole"
> need to be encrypted.  I tend to think that they should *not* be
> encrypted; in particular, encrypting a large area containing zeroes seems
> a plentiful source of known cleartext, which seems a bad thing.  Special
> space also seems to contain known cleartext; maybe not as much as the
> page hole, but still seems better avoided.

Uh, there are no known attacks on AES with known plain-text, e.g., SSL
uses AES, so I think we are good with encrypting everything after the
first 16 bytes.

> Given this, it seems to me that we should first encrypt those two data
> areas, and *then* compute the CRC on the complete page just like we do
> today ... and the result is stored in an unencrypted area (the page
> header) and so it doesn't affect the encryption.

Yes, that is a possibility.

> The checksum we currently have is not cryptographically secure -- it's
> not a crypto-strong signature.  If we want that, we need some further
> protection.  Maybe for encrypted tables we replace our current checksum
> with a cryptographically secure signature ...?  Pretty sure 16 bits are
> insufficient for that, but I suppose we would just use a different page
> header with room for a proper sig.

Yes, checksums are more best-effort than fully secure, but replay of
pages makes a fully secure solution hard anyway.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Jul 25, 2019 at 03:41:05PM -0400, Stephen Frost wrote:
> > Greetings,
> > 
> > * Bruce Momjian (br...@momjian.us) wrote:
> > > After talking to Joe Conway, I just want to mention that if we decide
> > > that the LSN is unique among heap and index, or among heap or index, we
> > > will need to make sure future WAL records retain this uniqueness.
> > 
> > One thing comes to mind regarding this and I'll admit that I don't quite
> > remember exactly off-hand but I also don't want to not mention it now
> > and forget to later.
> > 
> > What about pg_upgrade?
> 
> So, we don't carry WAL from the old cluster to the new cluster, so if
> the WAL is changed and had duplicates, it would only be new WAL records.

Right, we don't carry it forward- but what I couldn't remember is if
start from more-or-less LSN 0 or if pg_upgrade will arrange it such that
the new major version will start from LSN-of-old+1 (or whatever).  Seems
like it'd *have* to be the latter, but just thought of it and wanted to
make sure.

> pg_upgrade seems immune to most of this, and that is by design. 
> However, I am hesitant to change the heap/index page format for
> encryption because if we add fields, old pages might not fit as
> encrypted pages, and then you have to move rows around, and things
> become _much_ more complicated.

Yeah, I'm afraid we are going to have a hard time making this work
without changing the page format for encrypted..  I don't know if that's
actually a *huge* issue like we've considered it to be in the past or
not, as making someone rewrite just the few sensitive tables in their
environment might not be that bad, and there's also logical replication
today..

> I don't see any other pg_upgrade issues, unless someone else does.  Oh,
> we will have to check pg_control for a matching encryption format.

Yes, certainly it'd need to be updated for at least that, when upgading
an encrypted cluster.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 03:41:05PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > After talking to Joe Conway, I just want to mention that if we decide
> > that the LSN is unique among heap and index, or among heap or index, we
> > will need to make sure future WAL records retain this uniqueness.
> 
> One thing comes to mind regarding this and I'll admit that I don't quite
> remember exactly off-hand but I also don't want to not mention it now
> and forget to later.
> 
> What about pg_upgrade?

So, we don't carry WAL from the old cluster to the new cluster, so if
the WAL is changed and had duplicates, it would only be new WAL records.
pg_upgrade seems immune to most of this, and that is by design. 
However, I am hesitant to change the heap/index page format for
encryption because if we add fields, old pages might not fit as
encrypted pages, and then you have to move rows around, and things
become _much_ more complicated.

I don't see any other pg_upgrade issues, unless someone else does.  Oh,
we will have to check pg_control for a matching encryption format.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Alvaro Herrera
On 2019-Jul-15, Bruce Momjian wrote:

> Uh, if someone modifies a few bytes of the page, we will decrypt it, but
> the checksum (per-page or WAL) will not match our decrypted output.  How
> would they make it match the checksum without already knowing the key. 
> I read [1] but could not see that explained.
> 
> This post discussed it:
> 
>   
> https://crypto.stackexchange.com/questions/202/should-we-mac-then-encrypt-or-encrypt-then-mac

I find all the discussion downthread from this post pretty confusing.
Why are we encrypting the page header in the first place?  It seems to
me that the encrypted area should cover only the line pointers and the
tuple data area; the page header needs to be unencrypted so that it can
be used at all: firstly because you need to obtain the LSN from it in
order to compute the IV, and secondly because the checksum must be
validated *before* decrypting (per Moxie Marlinspike's "cryptographic
doom" principle mentioned in a comment in the SE question).

I am not totally clear on whether the special space and the "page hole"
need to be encrypted.  I tend to think that they should *not* be
encrypted; in particular, encrypting a large area containing zeroes seems
a plentiful source of known cleartext, which seems a bad thing.  Special
space also seems to contain known cleartext; maybe not as much as the
page hole, but still seems better avoided.

Given this, it seems to me that we should first encrypt those two data
areas, and *then* compute the CRC on the complete page just like we do
today ... and the result is stored in an unencrypted area (the page
header) and so it doesn't affect the encryption.

The checksum we currently have is not cryptographically secure -- it's
not a crypto-strong signature.  If we want that, we need some further
protection.  Maybe for encrypted tables we replace our current checksum
with a cryptographically secure signature ...?  Pretty sure 16 bits are
insufficient for that, but I suppose we would just use a different page
header with room for a proper sig.

Am I misunderstanding something?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> After talking to Joe Conway, I just want to mention that if we decide
> that the LSN is unique among heap and index, or among heap or index, we
> will need to make sure future WAL records retain this uniqueness.

One thing comes to mind regarding this and I'll admit that I don't quite
remember exactly off-hand but I also don't want to not mention it now
and forget to later.

What about pg_upgrade?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 02:05:05PM -0400, Bruce Momjian wrote:
> The second approach is to say they will collide and that we need to mix
> a constant into the IV for tables/indexes and a different one for WAL. 
> In a way I would like to mix the pg_controldata Database system
> Identifier into there too, but I am unclear on the value and complexity
> involved.
> 
> A third approach would be to say that we will have duplicate LSNs
> between a table and its index?  Maybe we need three constants, one for
> heap, one for indexes, and one for WAL.
> 
> A fourth approach would be to say we will have duplicate LSNs on
> different heap files, or index files.  We would then modify pg_upgrade to
> preserve relfilenode and use that.  (I don't think pg_class.oid is
> visible during recovery, and it certainly isn't visible in offline
> mode.)
> 
> However, we need to be clear that adding relfilenode is only helping to
> avoid tables/indexes with the same LSN pages.  It doesn't address the
> 68GB limit since our tables can be larger than that.
> 
> A fifth approach would be to decide that 68GB is the limit for a single
> key (not single key/IV combo).  If that is the case we need a different
> key for each 68GB of a file, and because we break files into 1GB chunks,
> we would just use a different key for each chunk, and I guess store the
> keys in the file system, encrypted with the master key.
> 
> My big point is that we need to decide where the IV collisions will
> happen, and what our encryption limit per key (not per key/IV
> combination) is.

After talking to Joe Conway, I just want to mention that if we decide
that the LSN is unique among heap and index, or among heap or index, we
will need to make sure future WAL records retain this uniqueness.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Fri, Jul 19, 2019 at 01:59:41PM +0200, Tomas Vondra wrote:
> On Fri, Jul 19, 2019 at 12:04:36PM +0200, Antonin Houska wrote:
> > We can guarantee integrity and authenticity of backup, but that's a separate
> > feature: someone may need this although it's o.k. for him to run the cluster
> > unencrypted.

> Yes, I do agree with that. I think attempts to guarantee data authenticity
> and/or integrity at the page level is mostly futile (replay attacks are an
> example of why). IMHO we should consider that to be outside the threat
> model TDE is expected to address.

Yes, I think we can say that checksums _help_ detect unauthorized
database changes, and usually detect database corruption, but they aren't
a fully secure solution.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Sat, Jul 20, 2019 at 07:30:30PM +0200, Tomas Vondra wrote:
> Forbid checksums? I don't see how that could be acceptable. We either have
> to accept the limitations of the current design (having to decrypt
> everything before checking the checksums) or change the design.

Yes, checksums certainly have to work.

> I personally think we should do the latter - not just because of this
> "decrypt-then-verify" issue, but consider how much work we've done to
> allow enabling checksums on-line (it's still not there, but it's likely
> doable in PG13). How are we going to do that with encryption? ISTM we
> should design it so that we can enable encryption on-line too - maybe not
> in v1, but it should be possible. So how are we going to do that? With
> checksums it's (fairly) easy because we can "not verify" the page before
> we know all pages have a checksum, but with encryption that's not
> possible. And if the whole page is encrypted, then what?

Well, I assumed we would start with a command-line offline tool to
add/remove encryption, and I assumed the command-line tool pg_checksums
would use the same code to decrypt the page to add/remove checksums and
rewrite it.  I don't think we will ever allow add/remove encryption in
online mode.

Does that help?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Fri, Jul 26, 2019 at 02:54:04AM +0900, Masahiko Sawada wrote:
> On Fri, Jul 26, 2019 at 2:18 AM Bruce Momjian  wrote:
> >
> > On Thu, Jul 18, 2019 at 12:04:25PM +0900, Masahiko Sawada wrote:
> > > I've re-considered the design of TDE feature based on the discussion
> > > so far. The one of the main open question is the granular of
> > > encryption objects: cluster encryption or more-granular-than-cluster
> > > encryption. The followings describe about the new TDE design when we
> > > choose table-level encryption or something-new-group-level encryption.
> > >
> > > General
> > > 
> > > We will use AES and support both AES-128 and AES-256. User can specify
> > > the new initdb option something like --aes-128 or --aes-256 to enable
> > > encryption and must specify --encryption-key-passphrase-command along
> > > with. (I guess we also require openssl library.) If these options are
> > > specified, we write the key length to the control file and derive the
> > > KEK and generate MDEK during initdb. wal_log_hints will be enabled
> > > automatically in encryption mode, like we do for checksum mode,
> >
> > Agreed.  pg_control will store the none/AES128/AES256 indicator.
> >
> > > Key Management
> > > ==
> > > We will use 3-tier key architecture as Joe proposed.
> > >
> > >   1. A master key encryption key (KEK): this is the key supplied by the
> > >  database admin using something akin to ssl_passphrase_command
> > >
> > >   2. A master data encryption key (MDEK): this is a generated key using a
> > >  cryptographically secure pseudo-random number generator. It is
> > >  encrypted using the KEK, probably with Key Wrap (KW):
> > >  or maybe better Key Wrap with Padding (KWP):
> > >
> > >   3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate
> > >   table specific keys.
> >
> > What is the value of a per-table encryption key?  How is HKDF derived?
> 
> Per-table encryption key is derived from MDEK with salt and its OID as
> info. I think we can store salts for each encryption keys into the
> separate file so that off-line tool also can read it.

Thanks. I just sent an email with five possible options for this.  I
think relfilenode will be fine --- I am not sure what salt adds to this.

> > Are we still unclear if the 68GB limit is per encryption key or per
> > encryption key/IV combination?
> 
> I think that 68GB refers to key+IV but I'll research that.

Yes, please.  I think we need a definite answer on that question, which
you will see in my later email.

> > I don't remember anyone suggesting different keys for different tables.
> > How would this even be managed by the user?
> 
> I think it's still unclear whether we implement one key for whole
> database cluster or different keys for different table as the first
> version. I'm evaluating the performance overhead of the latter that
> you concerned and will share it.

I am not worried about the performance of this --- if it is not secure with
a single key, it is useless, so we have to do it.  If the single key is
secure, I don't think multiple keys helps us.  I think we already
decided that the keys always have to be online for crash recovery, so we
can't allow users to control their keys anyway.

> I prefer tablespace-level or something-new-group-level than
> table-level but if we choose the latter we can create a new group of
> tables that are encrypted with the same key. That is user create a
> group and then associate tables to that group. Tablespace-level is
> implemented in the patch I submitted before. Or it's just idea but
> another idea could be to allow users to create encryption key object
> first and then specify which tables are encrypted with which
> encryption key in DDL. For example, user creates an encryption keys
> with name by SQL function and creates an encrypted table by CREATE
> TABLE ... WITH (encryption_key = 'mykey');.

That seems very complex so I think we need agreement to go in that
direction, and see what I said above about multiple keys.

> > That handles changing the passphrase, but what about rotating the
> > encryption key?  Don't we want to support that, at least in offline
> > mode?
> 
> Yeah, supporting rotating the encryption key is a good idea. Agreed.
> 
> After more thoughts, it's a just idea but I wonder if the first
> implementation step of TDE for v13 could be key management module.
> That is, (in 3-tier case) PostgreSQL gets KEK by passphrase command or
> directly, and creates MDEK. User can create an encryption key with
> name using by SQL function, and the key manager derives DEK and store
> its salt to the disk. Also we have an internal interface to get an
> encryption key.
> 
> The good point is not only to develop incrementally but also that if
> PostgreSQL is able to manage (symmetric) encryption keys inside
> database cluster and has interfaces to get and add keys, pgcrypt also
> will be able to use it. That way, we will provide column-level TDE
> 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Sun, Jul 14, 2019 at 12:13:45PM -0400, Joe Conway wrote:
> In my email I linked the wrong page for [2]. The correct one is here:
> [2] https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html
> 
> Following that, I think we could end up with three tiers:
> 
> 1. A master key encryption key (KEK): this is the key supplied by the
>database admin using something akin to ssl_passphrase_command
> 
> 2. A master data encryption key (MDEK): this is a generated key using a
>cryptographically secure pseudo-random number generator. It is
>encrypted using the KEK, probably with Key Wrap (KW):
>or maybe better Key Wrap with Padding (KWP):
> 
> 3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate
> table specific keys.
> 
> 3b. WAL data encryption keys (WDEK):  Similarly use MDEK and a HKDF to
> generate new keys when needed for WAL (based on the other info we
> need to change WAL keys every 68 GB unless I read that wrong).
> 
> I believe that would allows us to have multiple keys but they are
> derived securely from the one DEK using available info similar to the
> way we intend to use LSN to derive the IVs -- perhaps table.oid for
> tables and something else for WAL.
> 
> We also need to figure out how/when to generate new WDEK. Maybe every
> checkpoint, also meaning we would have to force a checkpoint every 68GB?

Masahiko Sawada copied this section as a desired direction, so I want to
drill down into it.  I think we have five possible approaches for level
3 listed above.

The simplest approach would be to say that the LSN/page-number and WAL
segment-number used as IV will not collide and we can just use them
directly.

The second approach is to say they will collide and that we need to mix
a constant into the IV for tables/indexes and a different one for WAL. 
In a way I would like to mix the pg_controldata Database system
Identifier into there too, but I am unclear on the value and complexity
involved.

A third approach would be to say that we will have duplicate LSNs
between a table and its index?  Maybe we need three constants, one for
heap, one for indexes, and one for WAL.

A fourth approach would be to say we will have duplicate LSNs on
different heap files, or index files.  We would then modify pg_upgrade to
preserve relfilenode and use that.  (I don't think pg_class.oid is
visible during recovery, and it certainly isn't visible in offline
mode.)

However, we need to be clear that adding relfilenode is only helping to
avoid tables/indexes with the same LSN pages.  It doesn't address the
68GB limit since our tables can be larger than that.

A fifth approach would be to decide that 68GB is the limit for a single
key (not single key/IV combo).  If that is the case we need a different
key for each 68GB of a file, and because we break files into 1GB chunks,
we would just use a different key for each chunk, and I guess store the
keys in the file system, encrypted with the master key.

My big point is that we need to decide where the IV collisions will
happen, and what our encryption limit per key (not per key/IV
combination) is.
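
Just to illustrate what the second approach could look like, here is a rough
sketch; the domain tags and byte layout below are placeholders I made up for
illustration, not proposed values:

#include "postgres.h"
#include <string.h>

/* illustrative domain-separation constants, one per IV namespace */
#define IV_DOMAIN_HEAP	0x01
#define IV_DOMAIN_INDEX	0x02
#define IV_DOMAIN_WAL	0x03

/*
 * Build a 16-byte AES IV from a domain tag plus either (LSN, block number)
 * for relation pages or the WAL segment number for WAL.  The last bytes are
 * left zero so CTR mode has counter space within an 8kB page.
 */
static void
build_iv(uint8 *iv, uint8 domain, uint64 lsn_or_segno, uint32 blkno)
{
	memset(iv, 0, 16);
	iv[0] = domain;
	memcpy(iv + 1, &lsn_or_segno, sizeof(uint64));
	memcpy(iv + 9, &blkno, sizeof(uint32));
	/* iv[13..15] stay zero: room for the per-block counter */
}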

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Masahiko Sawada
On Fri, Jul 26, 2019 at 2:18 AM Bruce Momjian  wrote:
>
> On Thu, Jul 18, 2019 at 12:04:25PM +0900, Masahiko Sawada wrote:
> > I've re-considered the design of TDE feature based on the discussion
> > so far. The one of the main open question is the granular of
> > encryption objects: cluster encryption or more-granular-than-cluster
> > encryption. The followings describe about the new TDE design when we
> > choose table-level encryption or something-new-group-level encryption.
> >
> > General
> > 
> > We will use AES and support both AES-128 and AES-256. User can specify
> > the new initdb option something like --aes-128 or --aes-256 to enable
> > encryption and must specify --encryption-key-passphrase-command along
> > with. (I guess we also require openssl library.) If these options are
> > specified, we write the key length to the control file and derive the
> > KEK and generate MDEK during initdb. wal_log_hints will be enabled
> > automatically in encryption mode, like we do for checksum mode,
>
> Agreed.  pg_control will store the none/AES128/AES256 indicator.
>
> > Key Management
> > ==
> > We will use 3-tier key architecture as Joe proposed.
> >
> >   1. A master key encryption key (KEK): this is the key supplied by the
> >  database admin using something akin to ssl_passphrase_command
> >
> >   2. A master data encryption key (MDEK): this is a generated key using a
> >  cryptographically secure pseudo-random number generator. It is
> >  encrypted using the KEK, probably with Key Wrap (KW):
> >  or maybe better Key Wrap with Padding (KWP):
> >
> >   3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate
> >   table specific keys.
>
> What is the value of a per-table encryption key?  How is HKDF derived?

The per-table encryption key is derived from the MDEK with a salt and its OID
as info. I think we can store the salts for each encryption key in a separate
file so that off-line tools can also read them.
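
If it helps, a rough sketch of that derivation using OpenSSL's HKDF support
(EVP_PKEY_HKDF, available since OpenSSL 1.1.0) could look like the following;
the exact salt and info layout here is an assumption, not a settled design:

#include <openssl/evp.h>
#include <openssl/kdf.h>

/*
 * Derive a per-relation key (TDEK) from the master data encryption key
 * (MDEK) using HKDF-SHA256, with a stored salt and the relation OID as
 * the "info" input.  Returns 1 on success, 0 on failure.
 */
static int
derive_tdek(unsigned char *mdek, int mdek_len,
			unsigned char *salt, int salt_len,
			unsigned int reloid,
			unsigned char *tdek, size_t tdek_len)
{
	EVP_PKEY_CTX *pctx = EVP_PKEY_CTX_new_id(EVP_PKEY_HKDF, NULL);
	int			ok = 0;

	if (pctx == NULL)
		return 0;

	if (EVP_PKEY_derive_init(pctx) > 0 &&
		EVP_PKEY_CTX_set_hkdf_md(pctx, EVP_sha256()) > 0 &&
		EVP_PKEY_CTX_set1_hkdf_salt(pctx, salt, salt_len) > 0 &&
		EVP_PKEY_CTX_set1_hkdf_key(pctx, mdek, mdek_len) > 0 &&
		EVP_PKEY_CTX_add1_hkdf_info(pctx, (unsigned char *) &reloid,
									sizeof(reloid)) > 0 &&
		EVP_PKEY_derive(pctx, tdek, &tdek_len) > 0)
		ok = 1;

	EVP_PKEY_CTX_free(pctx);
	return ok;
}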

> Are we still unclear if the 68GB limit is per encryption key or per
> encryption key/IV combination?

I think that 68GB refers to key+IV but I'll research that.

>
> >   3b. WAL data encryption keys (WDEK):  Similarly use MDEK and a HKDF to
> >   generate new keys when needed for WAL.
> >
> > We store MDEK to the plain file (say global/pgkey) after encrypted
> > with the KEK. I might want to store the hash of passphrase of the KEK
> > in order to verify the correctness of the given passphrase. However we
> > don't need to store TDEK and WDEK as we can derive them as needed. The
> > key file can be read by both backend processes and front-end tools.
>
> Yes, we need to verify the pass phrase.
>
> > When postmaster startup, it reads the key file and decrypts MDEK and
> > derive WDEK using key id for WDEK. WDEK is loaded to the key hash map
> > (keyid -> key) on the shared memory. Also we derive TDEK as needed
> > when reading tables or indexes and add it to the key hash map as well
> > if not exists.
> >
> > Buffer Encryption
> > ==
> > We will use AES-CBC for buffer encryption. We will add key id (4byte)
>
> I think we might want to use CTR for this, and will post after this.
>
> > to after the pd_lsn(8byte) in PageHeaderData and we will not encrypt
> > first 16 byte of each pages so the LSN and key id can be used. We can
> > store an invalid key id to tell us that the table is not encrypted.
> > There two benefits of storing key id to the page header: offline tools
> > can get key id (and know the table is encrypted or not) and it's
> > helpful for online rekey in the future.
>
> I don't remember anyone suggesting different keys for different tables.
> How would this even be managed by the user?

I think it's still unclear whether we implement one key for the whole
database cluster or different keys for different tables as the first
version. I'm evaluating the performance overhead of the latter, which
you were concerned about, and will share it.

I prefer tablespace-level or something-new-group-level over
table-level, but if we choose the latter we can create a new group of
tables that are encrypted with the same key. That is, the user creates a
group and then associates tables with that group. Tablespace-level is
implemented in the patch I submitted before. Another idea, though it is
just an idea, could be to allow users to create an encryption key object
first and then specify which tables are encrypted with which
encryption key in DDL. For example, the user creates a named encryption
key via an SQL function and creates an encrypted table with CREATE
TABLE ... WITH (encryption_key = 'mykey');.

>
> > I've considered to store IV and key id to a new fork but I felt that
> > it is complex because we will always need to have the fork on the
> > shared buffer when any pages of its main fork is written to the disk.
> > If almost buffers of the shared buffers are dirtied and theirs new
> > forks are not  loaded to the shared buffer, we might need to load the
> > new fork and write 

Re: Fetching timeline during recovery

2019-07-25 Thread Jehan-Guillaume de Rorthais
Hello,

On Wed, 24 Jul 2019 14:33:27 +0200
Jehan-Guillaume de Rorthais  wrote:

> On Wed, 24 Jul 2019 09:49:05 +0900
> Michael Paquier  wrote:
> 
> > On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
> > wrote:  
[...]
> > I think that there are arguments for being more flexible with it, and
> > perhaps have a system-level view to be able to look at some of its fields.  
> 
> Great idea. I'll give it a try to keep the discussion on.

After some thinking, I did not find enough data to expose to justify the
creation of a system-level view. As I just need the current timeline, I
wrote "pg_current_timeline()". Please find the patch attached.

The current behavior is quite simple: 
* if the cluster is in production, return ThisTimeLineID
* else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)

This is a really naive implementation. We should probably add some code around
the startup process to gather and share general recovery stats. This would
allow fetching e.g. the current recovery method, the latest xlog file name
restored from archives or streaming, its timeline, etc.

Any thoughts?

Regards,
>From 5b06d83e000132eca5a3173e96651ddf4531cff6 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Thu, 25 Jul 2019 19:36:40 +0200
Subject: [PATCH] Add function pg_current_timeline

---
 src/backend/access/transam/xlog.c  | 26 ++
 src/backend/access/transam/xlogfuncs.c | 17 +
 src/include/access/xlog.h  |  1 +
 src/include/catalog/pg_proc.dat|  6 ++
 4 files changed, 50 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index da3d250986..9da876c0ac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12243,3 +12243,29 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Returns current active timeline.
+ * During production, returns ThisTimeLineID.
+ * During standby, returns the timeline of the latest record flushed to XLOG.
+ */
+TimeLineID
+GetCurrentTimeLine(void)
+{
+	TimeLineID	localTimeLineID;
+	bool		localRecoveryInProgress;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+
+	localTimeLineID = XLogCtl->ThisTimeLineID;
+	localRecoveryInProgress = XLogCtl->SharedRecoveryInProgress;
+
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (localRecoveryInProgress) {
+		 GetWalRcvWriteRecPtr(NULL, &localTimeLineID);
+		 return localTimeLineID;
+	}
+
+	return localTimeLineID;
+}
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..c1cb9e8819 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -776,3 +776,20 @@ pg_promote(PG_FUNCTION_ARGS)
 			(errmsg("server did not promote within %d seconds", wait_seconds)));
 	PG_RETURN_BOOL(false);
 }
+
+/*
+ * Returns the current timeline
+ */
+Datum
+pg_current_timeline(PG_FUNCTION_ARGS)
+{
+	TimeLineID currentTL = GetCurrentTimeLine();
+
+	/*
+	 * we have no information about the timeline if the walreceiver
+	 * is disabled or hasn't streamed anything yet,
+	 */
+	if (!currentTL) PG_RETURN_NULL();
+
+	PG_RETURN_INT32(currentTL);
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252aad..f0502c0b41 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -313,6 +313,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
+extern TimeLineID GetCurrentTimeLine(void);
 
 extern bool CheckPromoteSignal(void);
 extern void WakeupRecovery(void);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0902dce5f1..42cd7c3486 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6006,6 +6006,12 @@
 { oid => '2851', descr => 'wal filename, given a wal location',
   proname => 'pg_walfile_name', prorettype => 'text', proargtypes => 'pg_lsn',
   prosrc => 'pg_walfile_name' },
+{ oid => '3434',
+  descr => 'return the current timeline',
+  proname => 'pg_current_timeline', prorettype => 'int4',
+  proargtypes => '', proallargtypes => '{int4}',
+  proargmodes => '{o}', proargnames => '{timeline}',
+  prosrc => 'pg_current_timeline' },
 
 { oid => '3165', descr => 'difference in bytes, given two wal locations',
   proname => 'pg_wal_lsn_diff', prorettype => 'numeric',
-- 
2.20.1



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 01:18:44PM -0400, Bruce Momjian wrote:
> > Key Management
> > ==
> > We will use 3-tier key architecture as Joe proposed.
> > 
> >   1. A master key encryption key (KEK): this is the key supplied by the
> >  database admin using something akin to ssl_passphrase_command
> > 
> >   2. A master data encryption key (MDEK): this is a generated key using a
> >  cryptographically secure pseudo-random number generator. It is
> >  encrypted using the KEK, probably with Key Wrap (KW):
> >  or maybe better Key Wrap with Padding (KWP):
> > 
> >   3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate
> >   table specific keys.
> 
> What is the value of a per-table encryption key?  How is HKDF derived?
> Are we still unclear if the 68GB limit is per encryption key or per
> encryption key/IV combination?

Oh, I see you got this from Joe Conway's email.  Let me reply to that
now.  (I am obviously having problems keeping this thread in my head as
well.)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Thu, Jul 18, 2019 at 12:04:25PM +0900, Masahiko Sawada wrote:
> I've re-considered the design of TDE feature based on the discussion
> so far. The one of the main open question is the granular of
> encryption objects: cluster encryption or more-granular-than-cluster
> encryption. The followings describe about the new TDE design when we
> choose table-level encryption or something-new-group-level encryption.
> 
> General
> 
> We will use AES and support both AES-128 and AES-256. User can specify
> the new initdb option something like --aes-128 or --aes-256 to enable
> encryption and must specify --encryption-key-passphrase-command along
> with. (I guess we also require openssl library.) If these options are
> specified, we write the key length to the control file and derive the
> KEK and generate MDEK during initdb. wal_log_hints will be enabled
> automatically in encryption mode, like we do for checksum mode,

Agreed.  pg_control will store the none/AES128/AES256 indicator.

> Key Management
> ==
> We will use 3-tier key architecture as Joe proposed.
> 
>   1. A master key encryption key (KEK): this is the key supplied by the
>  database admin using something akin to ssl_passphrase_command
> 
>   2. A master data encryption key (MDEK): this is a generated key using a
>  cryptographically secure pseudo-random number generator. It is
>  encrypted using the KEK, probably with Key Wrap (KW):
>  or maybe better Key Wrap with Padding (KWP):
> 
>   3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate
>   table specific keys.

What is the value of a per-table encryption key?  How is HKDF derived?
Are we still unclear if the 68GB limit is per encryption key or per
encryption key/IV combination?
 
>   3b. WAL data encryption keys (WDEK):  Similarly use MDEK and a HKDF to
>   generate new keys when needed for WAL.
> 
> We store MDEK to the plain file (say global/pgkey) after encrypted
> with the KEK. I might want to store the hash of passphrase of the KEK
> in order to verify the correctness of the given passphrase. However we
> don't need to store TDEK and WDEK as we can derive them as needed. The
> key file can be read by both backend processes and front-end tools.

Yes, we need to verify the pass phrase.

> When postmaster startup, it reads the key file and decrypts MDEK and
> derive WDEK using key id for WDEK. WDEK is loaded to the key hash map
> (keyid -> key) on the shared memory. Also we derive TDEK as needed
> when reading tables or indexes and add it to the key hash map as well
> if not exists.
> 
> Buffer Encryption
> ==
> We will use AES-CBC for buffer encryption. We will add key id (4byte)

I think we might want to use CTR for this, and will post after this.

> to after the pd_lsn(8byte) in PageHeaderData and we will not encrypt
> first 16 byte of each pages so the LSN and key id can be used. We can
> store an invalid key id to tell us that the table is not encrypted.
> There two benefits of storing key id to the page header: offline tools
> can get key id (and know the table is encrypted or not) and it's
> helpful for online rekey in the future.

I don't remember anyone suggesting different keys for different tables. 
How would this even be managed by the user?

> I've considered to store IV and key id to a new fork but I felt that
> it is complex because we will always need to have the fork on the
> shared buffer when any pages of its main fork is written to the disk.
> If almost buffers of the shared buffers are dirtied and theirs new
> forks are not  loaded to the shared buffer, we might need to load the
> new fork and write the page to the disk and then evict some pages,
> over and over.
> 
> We will use (page lsn, page number) to create a nonce. IVs are created
> by encrypting the nonce with its TDEK.

Agreed.

> WAL Encryption
> =
> We will use AES-CTR for WAL encryption and encrypt each WAL pages with WDEK.
> 
> We will use WAL segment number to create a nonce. Similar to buffer
> encryption, IVs are created using by the nonce and WDEK.

Yes.  If there is concern about collision of table/index and WAL IVs, we
can add a constant to the two uses, as Joe Conway mentioned.

> If we want to support enabling or disabling encryption after initdb we
> might want to have key id in the WAL page header.
> 
> Front-end Tool Support
> ==
> We will add --encryption-key-passphrase-command option to the
> front-end tools that read database files or WAL segment files directly.
> They can get KEK via --encryption-key-passphrase-command and get MDEK
> by reading the key file. Also they can know the key length by checking
> the control file. Since they can derive TDEK using by key id stored in
> the page header they can decrypt database files. Similarly, they also
> can decrypt WAL as they can know the key id of WDEK.
>
> Master Key Rotation
> 
> We will support new command-line tool that 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Tue, Jul 16, 2019 at 01:24:54PM +0900, Masahiko Sawada wrote:
> On Sat, Jul 13, 2019 at 12:33 AM Bruce Momjian  wrote:
> > then each row change gets its own LSN.  You are asking if an update that
> > just expires one row and adds it to a new page gets the same LSN.  I
> > don't know.
> 
> The following scripts can reproduce that different two pages have the same 
> LSN.
> 
> =# create table test (a int);
> CREATE TABLE
> =# insert into test select generate_series(1, 226);
> INSERT 0 226
> =# update test set a = a where a = 1;
> UPDATE 1
> =# select lsn from page_header(get_raw_page('test', 0));
> lsn
> ---
>  0/1690488
> (1 row)
> 
> =# select lsn from page_header(get_raw_page('test', 1));
> lsn
> ---
>  0/1690488
> (1 row)
> 
> So I think it's better to use LSN and page number to create IV. If we
> modify different tables by single WAL we also would need OID or
> relfilenode but I don't think currently we have such operations.

OK, good to know, thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Sat, Jul 20, 2019 at 03:39:25PM -0400, Sehrope Sarkuni wrote:
> How about storing the CRC of the encrypted pages? It would not leak
> any additional information and serves the same purpose as a
> non-encrypted one, namely I/O corruption detection. I took a look at
> pg_checksum and besides checking for empty pages, the checksum
> validation path does not interpret any other fields to calculate the
> checksum. I think even the offline checksum enabling path looks like
> it may work out of the box. Checksums of encrypted data are not a
> replacement for a MAC and this would allow that validation to run
> without any knowledge of keys.
> 
> Related, I think CTR mode should be considered for pages. It has
> performance advantages at 8K data sizes, but even better, allows for
> arbitrary bytes of the cipher text to be replaced. For example, after
> encrypting a block you can replace the two checksum bytes with the CRC
> of the cipher text v.s. CBC mode where that would cause corruption to
> cascade forward. Same could be used for leaving things like
> pd_pagesize_version in plaintext at its current offset. For anything
> deemed non-sensitive, having it readable without having to decrypt the
> page is useful.

Yes, I did cover that here:


https://www.postgresql.org/message-id/20190716002519.yyvgl7qi4ewl6...@momjian.us

Yes, it would only work if the checksum was the last part of the page,
or if we used CTR mode, where changing the source bits doesn't affect
the later bits.  I am thinking crazy here, I know, but it seemed worth
mentioning in case someone liked it.


https://www.postgresql.org/message-id/20190715194239.iqq5jdj54ru32...@momjian.us

If we want to go crazy, we could encrypt, assume zeros for the CRC,
compute the MAC and put it in the place where the CRC is, but then tools
that read CRC would see that as an error, so we don't want to go there.
Yes, crazy.

I know this thread is long so you might have missed it.

I do think CTR mode is the way to go for the heap/index pages and the
WAL, and will reply to another email on that topic now.
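
For reference, a minimal sketch of CTR-mode page encryption with OpenSSL's
EVP interface, not taken from any existing patch and with key/IV management
left out, would be roughly:

#include <openssl/evp.h>

/*
 * Encrypt an 8kB page in place with AES-256-CTR.  Because CTR is a stream
 * cipher, individual ciphertext bytes (e.g. the two checksum bytes) can be
 * overwritten afterwards without corrupting the rest of the page.
 * Returns 1 on success, 0 on failure.
 */
static int
encrypt_page_ctr(const unsigned char *key, const unsigned char *iv,
				 unsigned char *page, int page_len)
{
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
	int			outlen = 0;
	int			finallen = 0;
	int			ok = 0;

	if (ctx == NULL)
		return 0;

	if (EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) == 1 &&
		EVP_EncryptUpdate(ctx, page, &outlen, page, page_len) == 1 &&
		EVP_EncryptFinal_ex(ctx, page + outlen, &finallen) == 1)
		ok = 1;					/* CTR produces no padding, finallen is 0 */

	EVP_CIPHER_CTX_free(ctx);
	return ok;
}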

> It does not have to be full bytes either. CTR mode operates as a
> stream of bits so its possible to replace nibbles or even individual
> bits. It can be something as small one bit for an "is_encrypted" flag
> or a handful of bits used to infer a derived key. For example, with
> 2-bits you could have 00 represent unencrypted, 01/10 represent
> old/new key, and 11 be future use. Something like that could
> facilitate online key rotation.

Yes, if we do all-cluster encryption, we can just consult pg_control,
but if we do per-table/index, that might be needed.  There is another
email suggesting a symlink to a key file could indicate an encrypted
table/index.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Initdb failure

2019-07-25 Thread vignesh C
On Thu, Jul 25, 2019 at 8:39 PM Rafia Sabih  wrote:
>
> On Thu, 25 Jul 2019 at 16:44, Tom Lane  wrote:
> >
> > Rafia Sabih  writes:
> > > On Thu, 25 Jul 2019 at 13:50, vignesh C  wrote:
> > >>> Initdb fails when following path is provided as input:
> > >>> datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/
> >
> > > Now that you say this, it does make sense to atleast inform about the
> > > correct error and that too earlier. Something like the attached patch
> > > would make sense.
> >
> > I am not terribly excited about putting effort into this at all, because
> > I don't think that any actual user anywhere will ever get any benefit.
> > The proposed test case is just silly.
>
> That I totally agree upon!
> But on the other hand emitting the right error message atleast would
> be good for the sake of correctness if nothing else. But yes that
> definitely should be weighed against what is the effort required for
> this.
>
Thanks Tom for your opinion.
Thanks Rafia for your thoughts and effort in making the patch.

I'm not sure if we are planning to fix this.
If we are, one suggestion from my side: we could choose a safe length
that leaves room for the subdirectories and file paths appended under
the data directory. I think one of these will be the longest:
base/database_oid/tables
pg_wal/archive_status/
pg_wal/archive_file

The fix can then be something like:
MAXPGPATH - (LONGEST_PATH_FROM_ABOVE)
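
A rough sketch of what such an early check could look like in initdb; the
headroom constant below is invented, and the real value would come from the
longest relative path above:

#include "postgres_fe.h"
#include "common/logging.h"

/* invented placeholder for the longest relative path initdb will append */
#define PGDATA_PATH_HEADROOM	64

static void
check_data_directory_length(const char *pg_data)
{
	if (strlen(pg_data) > MAXPGPATH - PGDATA_PATH_HEADROOM)
	{
		pg_log_error("data directory path is too long (at most %d bytes): \"%s\"",
					 MAXPGPATH - PGDATA_PATH_HEADROOM, pg_data);
		exit(1);
	}
}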

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: add_path() for Path without InitPlan: cost comparison vs. Paths that require one

2019-07-25 Thread Dent John
Hi Tom,

> On 25 Jul 2019, at 14:25, Tom Lane  wrote:
> 
> Please explain yourself.  InitPlans will, as a rule, get stuck into the
> same place in the plan tree regardless of which paths are chosen; that's
> why we need not consider them in path cost comparisons.

Ah that’s true. I didn’t realise that at the time I wrote. 

But I think my problem is still real...

>  Moreover, once
> the initplan's own subplan is determined, it's going to be the same
> regardless of the shape of the outer query ---

Yes that’s true too. 

> so if we did factor it
> in, it'd contribute the same cost to every outer path, and thus it
> still wouldn't change any decisions.  

I think I’m exposed to the problem because I’m changing how certain queries are 
fulfilled. 

And in the case of a RECURSIVE CTE, the plan ends up being an InitPlan that 
materializes the CTE, and then a scan of that materialized result.

The problem is that I can fulfil the entire query with a scan against an MV 
table. Point is it’s an alternative that achieves both the InitPlan (because 
it’s unnecessary) and the final scan.

But the cost comparison during add_path() is only taking into account the cost 
of the final scan, which is so cheap that it is preferable even to a simple 
scan of an MV. 

> So I don't follow what you're
> on about here.

Hmm. Having written the above, I realise I’m not clear on why my extension 
isn’t offered the opportunity to materialise the work table for the InitPlan.

Sorry. I should have thought about that question first. It might just be an 
error in my code. I’ll follow up with an answer.

> 
>regards, tom lane





Re: "localtime" value in TimeZone

2019-07-25 Thread Shay Rojansky
> Yeah, this is something that some tzdb packagers do --- they put a
> "localtime" file into /usr/share/zoneinfo that is a symlink or hard link
> to the active zone file, and then initdb tends to seize on that as being
> the shortest available spelling of the active zone.

I see, I wasn't aware that this was a distribution-level mechanism - I
understand the situation better now.

I followed the conversations you linked to, and disagreements seem to be
mostly about other aspects of timezone selection. Does it make sense to
have a limited, restricted conversation specifically about avoiding
"localtime"?


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-25 Thread Bruce Momjian
On Mon, Jul 15, 2019 at 07:39:20PM -0400, Alvaro Herrera wrote:
> On 2019-Jul-15, Bruce Momjian wrote:
> 
> > My point is that doing encryption of only some data might actually make
> > the system slower due to the lookups, so I think we need to implement
> > all-cluster encryption and then see what the overhead is, and if there
> > are use-cases for not encrypting only some data.
> 
> We can keep the keys in the relcache.  It doesn't have to be slow.  It
> is certainly slower to have to encrypt *all* data, which can be
> massively larger than the sensitive portion of the database.
> 
> If we need the keys for offline operation (where relcache is not
> reachable), we can keep pointers to the key files in the filesystem --
> for example for an encrypted table we would keep a new file, say
> .key, which could be a symlink to the encrypted key file.
> The tool already has access to the key data, but the symlink lets it
> know *which* key to use; random onlookers cannot get the key data
> because the file is encrypted with the master key.
> 
> Any table without the key file is assumed to be unencrypted.

The relcache and symlinks idea is an interesting one.  Are we still
encrypting all of WAL?  If so, the savings is only on heap/index file
writes, and I just don't know how much of a benefit skipping encryption
will be --- we can test it later.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [GSoC] artbufmgr

2019-07-25 Thread pantilimonov michael
Greetings all,

> So, this is a plan, that I would like to stick with subsequently:
>
> 1) Work on synchronization.
> 2) Examine bulk operations on buffers, replace them with tree (prefix) 
> iterator.
> 3) A reasonable way to size memory for tree maintenance.
>

Previously I described the plan that I wanted to follow; the achieved results
are summarized below.
I should note beforehand that the third item wasn't touched.

>
> Dynamic data structure like tree brings problems that are not relevant for 
> the current hashtable implementation.
> The number of LWLocks in the hashtable is fixed and equal to the number of 
> partitions(128),
> the memory footprint for buckets and hash-entries can be easily precomputed.
>
> Currently, I am thinking about the idea of tree decomposition, so that it is 
> really a
> tree of trees, where RelFileNode[12bytes] (maybe + ForkNum) can be used for 
> the top tree and
> the rest for the trees of blocks. Hence, each bottom tree can contain a 
> single private LWLock, that will be
> used to protect its content. Won't be that an overkill? if we take a 
> hypothetical example with
> 2 tablespaces(hdd + sdd), each with 500+ objects, so the number of LWLocks 
> should scale accordingly, or
> we will be forced to reuse them, by unloading block trees, that should be 
> done in some intelligent way.
>

I started with a single-lock art tree, using the same locking strategy as the
existing hashtable.
Both structs were surrounded by "defines", so I could run them independently or
simultaneously.
The latter option was mainly used as a validation check that the tree works in
the same way as the hashtable. I have tested the single-lock tree version using
pgbench and 'installcheck', in order to exercise it with multiple parallel
processes involved and fresh relation tags arriving/leaving due to the
create/drop of tables.

It was obvious that a single-lock tree, by definition, can't keep up with the
128-lock hashtable (each lock protects a specific region/partition) in all
concurrent cases, so the idea was to split the current 'MainTree' into a tree
of trees. Such separation has additional benefits besides the throughput
improvement:

a) Most of the time (relatively) 'MainTree' is not modified; as time goes on
it gets saturated by the 'most used' relation keys. After that point it is
mostly used as a read-only structure.
b) The 'path' to a specific subtree of 'MainTree' can be cached in
SMgrRelation by saving pointers to the corresponding ForkNums (just an array
of pointers[max_forknum + 1]).
In the current implementation, this optimization can reduce the key length
from 20 to 4 bytes.

typedef struct buftag
{
	Oid			spcNode;		/* tablespace */
	Oid			dbNode;			/* database */
	Oid			relNode;		/* relation */
	ForkNumber	forkNum;
	BlockNumber blockNum;		/* blknum relative to begin of reln */
} BufferTag;

To split 'MainTree' I have injected an LWLock into the art_tree structure and
created an additional separate freelist of these subtree nodes, which is then
used for dynamic allocation and deallocation.
The key length in 'MainTree' is 16 bytes (spcNode, dbNode, relNode, forkNum)
and can likely be reduced to 13 bytes by shrinking the forkNum value, which
ranges only from -1 to 3 but occupies 4 bytes.
The key length of each subtree is only 4 bytes - the BlockNumber.
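
A minimal sketch of the two key layouts described above (the field packing is
just for illustration, not the exact layout used in my code):

#include "postgres.h"
#include "storage/block.h"

/* 16-byte key into the top-level tree: identifies a relation fork */
typedef struct MainTreeKey
{
	Oid			spcNode;		/* tablespace */
	Oid			dbNode;			/* database */
	Oid			relNode;		/* relation */
	int32		forkNum;		/* only needs values -1..3, hence the
								 * possible squeeze down to 13 bytes */
} MainTreeKey;

/* 4-byte key inside a per-fork subtree: just the block number */
static inline uint32
subtree_key(BlockNumber blkno)
{
	return (uint32) blkno;
}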

Below are the results of tests performed on a database initialized with
pgbench -i -s 50 with shared_buffers=128MB
and pgbench -i -s 100 with shared_buffers=1GB.
It should be noted that such a workload does not really represent 'real life'
results, as all contention goes into certain partitions (in the case of the
hashtable) and subtrees (in the case of the tree).
The next goal is to compare the data structures on some kind of realistic
benchmark, or just to create multiple databases inside the cluster and run a
corresponding number of pgbench instances.

Tested on a PC with i7, SSD.
Each test was performed 5 times; readonly runs were done subsequently, full
runs with a fresh start (otherwise some problems remain to be fixed in the
tree); the best result is taken.

readonly test: pgbench --random-seed=2 -t 10 -S -c 6 -j 6 -n
full test: pgbench --random-seed=2 -t 1 -c 6 -j 6
drop test:
   create table test_drop2 as select a, md5(a::text) from 
generate_series(1, 2) as a; ~ 167 blocks
   drop table test_drop2;

128MB:
   readonly HASH: latency average = 0.102 ms tps = 58934.229170 
(excluding connections establishing)
   full  HASH: latency average = 1.233 ms tps = 4868.332127 
(excluding connections establishing)

   readonly TREE: latency average = 0.106 ms tps = 56818.893538 
(excluding connections establishing)
   full  TREE: latency average = 1.227 ms tps = 4890.546989 
(excluding connections establishing)

1GB:
   readonly HASH: latency average = 0.100 ms 

Re: [proposal] de-TOAST'ing using a iterator

2019-07-25 Thread Binguo Bao
Hi John!
Sorry for the late reply. It took me some time to fix a random bug.

In the case where we don't know the slice size, how about the other
> aspect of my question above: Might it be simpler and less overhead to
> decompress entire chunks at a time? If so, I think it would be
> enlightening to compare performance.


Good idea. I've tested your proposal with the scripts and patch v5 in the
attachment:

                 master     patch v4    patch v5
comp. beg.       4364ms     1505ms      1529ms
comp. end       28321ms    31202ms     26916ms
uncomp. beg.     3474ms     1513ms      1523ms
uncomp. end     27416ms    30260ms     25888ms

The proposal improves suffix query performance greatly
with fewer calls to the decompression function.

Besides, do you have any other suggestions for the structure of
DetoastIterator or ToastBuffer?
Maybe they can be designed to be more reasonable.

Thanks again for the proposal.
-- 
Best regards,
Binguo Bao


init-test.sh
Description: application/shellscript


iterator-test.sh
Description: application/shellscript
From 19deb6d7609a156cb14571d83c27d72f3ecb5aa8 Mon Sep 17 00:00:00 2001
From: BBG 
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 498 +++
 src/backend/utils/adt/varlena.c  |  49 ++--
 src/include/access/tuptoaster.h  |  92 +++
 3 files changed, 623 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..fb52bfb 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+	 DetoastIterator iter, int32 length);
 
 
 /* --
@@ -347,6 +354,147 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* --
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * --
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed value inside of the main tuple
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = NULL;
+		iterator->source = palloc0(sizeof(ToastBuffer));
+		iterator->source->buf = (const char*) attr;
+		iterator->source->position = TOAST_COMPRESS_RAWDATA(attr);
+		iterator->source->limit = (char *)attr + VARSIZE(attr);
+		iterator->source->capacity = iterator->source->limit;
+
+		iterator->buf = palloc0(sizeof(ToastBuffer));
+		init_toast_buffer(iterator->buf, TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ, false);
+
+		iterator->ctrlc = 0;
+		iterator->compressed = true;
+		iterator->done = false;
+	}
+
+	return iterator;
+}
+
+
+/* --
+ * free_detoast_iterator -
+ *
+ * Free the memory space occupied by the de-Toast iterator.
+ * --
+ */
+bool 

Re: Initdb failure

2019-07-25 Thread Rafia Sabih
On Thu, 25 Jul 2019 at 16:44, Tom Lane  wrote:
>
> Rafia Sabih  writes:
> > On Thu, 25 Jul 2019 at 13:50, vignesh C  wrote:
> >>> Initdb fails when following path is provided as input:
> >>> datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/
>
> > Now that you say this, it does make sense to atleast inform about the
> > correct error and that too earlier. Something like the attached patch
> > would make sense.
>
> I am not terribly excited about putting effort into this at all, because
> I don't think that any actual user anywhere will ever get any benefit.
> The proposed test case is just silly.

That I totally agree upon!
But on the other hand, emitting the right error message at least would
be good for the sake of correctness if nothing else. But yes, that
definitely should be weighed against the effort required for this.

-- 
Regards,
Rafia Sabih




Re: sepgsql seems rather thoroughly broken on Fedora 30

2019-07-25 Thread Tom Lane
Mike Palmiotto  writes:
> On Fri, Jul 19, 2019 at 4:29 PM Tom Lane  wrote:
>> I can confirm that the 0001 patch fixes things on my Fedora 30 box.
>> So that's good, though I don't know enough to evaluate it for style
>> or anything like that.

> I think the policy is in need of review/rewriting anyway. The proper
> thing to do would be to create a common template for all of the
> SELinux regtest user domains and create more of a hierarchical policy
> to reduce redundancy. If you want to wait for more formal policy
> updates, I can do that in my spare time. Otherwise, the patch I posted
> should work with the general style of this policy module.

Hearing no further comments, I went ahead and pushed 0001 (after
checking that it works on F28, which is the oldest Fedora version
I have at hand right now).  Stylistic improvements to the script
are fine, but let's get the bug fixed for now.

BTW, I noticed that the documentation about how to run the tests
is a bit stale as well --- for instance, it says to use

$ sudo semodule -u sepgsql-regtest.pp

but that slaps your wrist:

The --upgrade option is deprecated. Use --install instead.

So if anyone does feel like polishing things in this area, some doc
review seems indicated.

regards, tom lane




Re: Initdb failure

2019-07-25 Thread Tom Lane
Rafia Sabih  writes:
> On Thu, 25 Jul 2019 at 13:50, vignesh C  wrote:
>>> Initdb fails when following path is provided as input:
>>> datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/

> Now that you say this, it does make sense to atleast inform about the
> correct error and that too earlier. Something like the attached patch
> would make sense.

I am not terribly excited about putting effort into this at all, because
I don't think that any actual user anywhere will ever get any benefit.
The proposed test case is just silly.

However, if we are going to put effort into it, it needs to be more than
this.  First off, what is the actual failure point?  (It's surely less
than MAXPGPATH, because we tend to append subdirectory/file names onto
whatever is given.)  Second, what of absolute versus relative paths?
If converting the given path to absolute makes it exceed MAXPGPATH,
is that a problem?

regards, tom lane




Re: pg_upgrade version checking questions

2019-07-25 Thread Daniel Gustafsson
> On 24 Jul 2019, at 22:32, Peter Eisentraut  
> wrote:
> 
> On 2019-07-23 17:30, Daniel Gustafsson wrote:
>> The reason for moving is that we print default values in usage(), and that
>> requires the value to be computed before calling usage().  We already do this
>> for resolving environment values in parseCommandLine().  If we do it in 
>> setup,
>> then we’d have to split out resolving the new_cluster.bindir into it’s own
>> function exposed to option.c, or do you have any other suggestions there?
> 
> I think doing nontrivial work in order to print default values in
> usage() is bad practice, because in unfortunate cases it would even
> prevent you from calling --help.  Also, in this case, it would probably
> very often exceed the typical line length of --help output and create
> some general ugliness.  Writing something like "(default: same as this
> pg_upgrade)" would probably achieve just about the same.

Fair enough, those are both excellent points.  I’ve shuffled the code around to
move back the check for exec_path to setup (albeit earlier than before due to
where we perform directory checking).  This does mean that the directory
checking in the options parsing must learn to cope with missing directories,
which is a bit unfortunate since it’s already doing a few too many things IMHO.
To ensure dogfooding, I also removed the use of -B in ‘make check’ for
pg_upgrade, which should bump the coverage.

Also spotted a typo in a pg_upgrade file header in a file touched by this, so
included it in this thread too as a 0004.

Thanks again for reviewing, much appreciated!

cheers ./daniel



0004-Fix-typo-in-pg_upgrade-file-header-v4.patch
Description: Binary data


0003-Default-new-bindir-to-exec_path-v4.patch
Description: Binary data


0002-Check-all-used-executables-v4.patch
Description: Binary data


0001-Only-allow-upgrades-by-the-same-exact-version-new-v4.patch
Description: Binary data


Re: Initdb failure

2019-07-25 Thread Rafia Sabih
On Thu, 25 Jul 2019 at 13:50, vignesh C  wrote:
>
> On Thu, Jul 25, 2019 at 4:52 PM Rafia Sabih  wrote:
> >
> > On Thu, 25 Jul 2019 at 07:39, vignesh C  wrote:
> > >
> > > Hi,
> > >
> > > Initdb fails when following path is provided as input:
> > > datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/
> > >
> > > Also the cleanup also tends to fail in the cleanup path.
> > >
> > > Could be something to do with path handling.
> > This is because the value of MAXPGPATH is 1024 and the path you are
> > providing is more than that. Hence, when it is trying to read
> > PG_VERSION in ValidatePgVersion it is going to a wrong path with just
> > 1024 characters.
> >
>
> The error occurs at a very later point after performing the initial
> work like creating directory.  I'm thinking we should check this in
> the beginning and throw the error message at the beginning and exit
> cleanly.
>
Now that you say this, it does make sense to at least inform about the
correct error and that too earlier. Something like the attached patch
would make sense.

-- 
Regards,
Rafia Sabih
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 04d77ad700..3af11b365b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2726,6 +2726,12 @@ create_data_directory(void)
 {
 	int			ret;
 
+	if (strlen(pg_data) >= MAXPGPATH)
+	{
+		pg_log_error("data directory path is too long: \"%s\"", pg_data);
+		exit(1);
+	}
+
 	switch ((ret = pg_check_dir(pg_data)))
 	{
 		case 0:


Re: add_path() for Path without InitPlan: cost comparison vs. Paths that require one

2019-07-25 Thread Tom Lane
Dent John  writes:
> However, if I factor back in the cost of the InitPlan, things net out much 
> more in favour of a scan against the MV. Of course, the add_path() comparison 
> logic doesn’t include the InitPlan cost, so the point is moot. 

Please explain yourself.  InitPlans will, as a rule, get stuck into the
same place in the plan tree regardless of which paths are chosen; that's
why we need not consider them in path cost comparisons.  Moreover, once
the initplan's own subplan is determined, it's going to be the same
regardless of the shape of the outer query --- so if we did factor it
in, it'd contribute the same cost to every outer path, and thus it
still wouldn't change any decisions.  So I don't follow what you're
on about here.

regards, tom lane




Re: Built-in connection pooler

2019-07-25 Thread Konstantin Knizhnik

Hello Ryan,

Thank you very much for review and benchmarking.
My answers are inside.

On 25.07.2019 0:58, Ryan Lambert wrote:


Applying the patch [1] has improved from v9, still getting these:


Fixed.

I used a DigitalOcean droplet with 2 CPU and 2 GB RAM and SSD for this 
testing, Ubuntu 18.04.  I chose the smaller server size based on the 
availability of similar and recent results around connection pooling 
[2] that used AWS EC2 m4.large instance (2 cores, 8 GB RAM) and 
pgbouncer.  Your prior pgbench tests [3] also focused on larger 
servers so I wanted to see how this works on smaller hardware.


Considering this from connpool.sgml:
"connection_proxies specifies number of connection 
proxy processes which will be spawned. Default value is zero, so 
connection pooling is disabled by default."


That hints to me that connection_proxies is the main configuration to 
start with so that was the only configuration I changed from the 
default for this feature.  I adjusted shared_buffers to 500MB (25% of 
total) and max_connections to 1000.  Only having one proxy gives 
subpar performance across the board, so did setting this value to 10.  
My hunch is this value should roughly follow the # of cpus available, 
but that's just a hunch.




I do not think that the number of proxies should depend on the number of CPUs. 
A proxy process is not performing any computations, it is just redirecting 
data from client to backend and vice versa.
Certainly, starting from some number of connections it becomes a 
bottleneck. The same is true for pgbouncer: you need to start several 
pgbouncer instances to be able to utilize all resources and provide the best 
performance on a computer with a large number of cores. The optimal value 
greatly depends on the workload; it is difficult to suggest a formula which 
allows calculating the optimal number of proxies for each configuration.

I don't understand yet how max_sessions ties in.
Also, having both session_pool_size and connection_proxies seemed 
confusing at first.  I still haven't figured out exactly how they 
relate together in the overall operation and their impact on performance.


"max_sessions" is mostly technical parameter. To listen client 
connections I need to initialize WaitEvent set specdify maximal number 
of events.
It should not somehow affect performance. So just  specifying large 
enough value should work in most cases.
But I do not want to hardcode some constants and that it why I add GUC 
variable.
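
In other words, max_sessions only bounds how many sockets one proxy can
watch at once, roughly like this (just a sketch, not the patch's actual
code; listen_sock is a placeholder):

/* max_sessions caps the number of events one proxy's WaitEventSet tracks */
WaitEventSet *wes = CreateWaitEventSet(TopMemoryContext, max_sessions);
AddWaitEventToSet(wes, WL_SOCKET_READABLE, listen_sock, NULL, NULL);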


"connections_proxies" is used mostly to toggle connection pooling.
Using more than 1 proxy is be needed only for huge workloads (hundreds 
connections).


And "session_pool_size" is core parameter  which determine efficiency of 
pooling.
The main trouble with it now, is that it is per database/user 
combination. Each such combination will have its own connection pool.
Choosing optimal value of pooler backends is non-trivial task. It 
certainly depends on number of available CPU cores.
But if backends and mostly disk-bounded, then optimal number of pooler 
worker can be large than number of cores.

Presence of several pools make this choice even more complicated.

The new view helped, I get the concept of **what** it is doing 
(connection_proxies = more rows, session_pool_size = n_backends for 
each row), it's more a lack of understanding the **why** regarding how 
it will operate.



postgres=# select * from pg_pooler_state();
 pid  | n_clients | n_ssl_clients | n_pools | n_backends | n_dedicated_backends | tx_bytes  | rx_bytes  | n_transactions
------+-----------+---------------+---------+------------+----------------------+-----------+-----------+----------------
 1682 |        75 |             0 |       1 |         10 |                    0 | 366810458 | 353181393 |        5557109
 1683 |        75 |             0 |       1 |         10 |                    0 | 368464689 | 354778709 |        5582174
(2 rows)



I am not sure how I feel about this:
"Non-tainted backends are not terminated even if there are no more 
connected sessions."
The PgPRO EE version of the connection pooler has an "idle_pool_worker_timeout" 
parameter which allows terminating idle workers.
It is possible to implement it for the vanilla version of the pooler as well, 
but the primary intention of this patch was to minimize changes in the 
Postgres core.




Would it be possible (eventually) to monitor connection rates and free 
up non-tainted backends after a time?  The way I'd like to think of 
that working would be:


If 50% of backends are unused for more than 1 hour, release 10% of 
established backends.




The two percentages and time frame would ideally be configurable, but 
setup in a way that it doesn't let go of connections too quickly, 
causing unnecessary expense of re-establishing those connections.  My 
thought is if there's one big surge of connections followed by a long 
period of lower connections, does it make sense to keep those extra 
backends established?





I think that idle timeout is enough but more 

Re: Initdb failure

2019-07-25 Thread vignesh C
On Thu, Jul 25, 2019 at 4:52 PM Rafia Sabih  wrote:
>
> On Thu, 25 Jul 2019 at 07:39, vignesh C  wrote:
> >
> > Hi,
> >
> > Initdb fails when following path is provided as input:
> > datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/
> >
> > Also the cleanup also tends to fail in the cleanup path.
> >
> > Could be something to do with path handling.
> This is because the value of MAXPGPATH is 1024 and the path you are
> providing is more than that. Hence, when it is trying to read
> PG_VERSION in ValidatePgVersion it is going to a wrong path with just
> 1024 characters.
>

The error occurs at a much later point, after the initial work like
creating the directory has already been done.  I'm thinking we should
check this at the very beginning, report the error there, and exit
cleanly.

>
> > I'm not sure if this is already known.
> I am also not sure if it is known or intentional. On the other hand I
> also don't know if it is practical to give such long names for
> database directory anyway.
>

Usually nobody will be using such a long path, but this is one of the
odd scenarios.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Index Skip Scan

2019-07-25 Thread Kyotaro Horiguchi
Sorry, there's a too-hard-to-read part.

At Thu, 25 Jul 2019 20:17:37 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190725.201737.192223037.horikyota@gmail.com>
> Hello.
> 
> At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthali...@gmail.com> 
> wrote in 
> > > On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen 
> > >  wrote:
> > >
> > > On 7/22/19 1:44 AM, David Rowley wrote:
> > > > Here are the comments I noted down during the review:
> > > >
> > > > cost_index:
> > > >
> > > > I know you've not finished here, but I think it'll need to adjust
> > > > tuples_fetched somehow to account for estimate_num_groups() on the
> > > > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > > > need to be part of the estimate there as there can only be at most one
> > > > value for these.
> > > >
> > > > For example, in a query such as:
> > > >
> > > > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> > > >
> > > > you only need to perform estimate_num_groups() on "y".
> > > >
> > > > I'm really not quite sure on what exactly will be required from
> > > > amcostestimate() here.  The cost of the skip scan is not the same as
> > > > the normal scan. So other that API needs adjusted to allow the caller
> > > > to mention that we want skip scans estimated, or there needs to be
> > > > another callback.
> > > >
> > >
> > > I think this part will become more clear once the index skip scan patch
> > > is rebased, as we got the uniquekeys field on the Path, and the
> > > indexskipprefixy info on the IndexPath node.
> > 
> > Here is what I came up with to address the problems, mentioned above in this
> > thread. It passes tests, but I haven't tested it yet more thoughtful (e.g. 
> > it
> > occurred to me, that `_bt_read_closest` probably wouldn't work, if a next 
> > key,
> > that passes an index condition is few pages away - I'll try to tackle it 
> > soon).
> > Just another small step forward, but I hope it's enough to rebase on top of 
> > it
> > planner changes.
> > 
> > Also I've added few tags, mostly to mention reviewers contribution.
> 
> I have some comments.
> 
> +   * The order of columns in the index should be the same, as for
> +   * unique distincs pathkeys, otherwise we cannot use _bt_search
> +   * in the skip implementation - this can lead to a missing
> +   * records.
> 
> It seems that it is enough that distinct pathkeys is contained in
> index pathkeys. If it's right, that is almost checked in existing
> code:
> 
> > if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
> 
> It is perfect when needed_pathkeys is distinct_pathkeys.  So
> additional check is required if and only if it is not the case.

So I think the following change will work.

- +/* Consider index skip scan as well */
- +if (enable_indexskipscan &&
- +  IsA(path, IndexPath) &&
- +  ((IndexPath *) path)->indexinfo->amcanskip &&
- +  (path->pathtype == T_IndexOnlyScan ||
- +   path->pathtype == T_IndexScan) &&
- +  root->distinct_pathkeys != NIL)

+ +if (enable_indexskipscan &&
+ +  IsA(path, IndexPath) &&
+ +  ((IndexPath *) path)->indexskipprefix >= 0 &&
+ +  (needed_pathkeys == root->distinct_pathkeys ||
+ +   pathkeys_contained_in(root->distinct_pathkeys,
+ +   path->pathkeys)))

Additional comments on the condition above are:

> path->pathtype is always one of T_IndexOnlyScan or T_IndexScan so
> the check against them isn't needed. If you have concern on that,
> we can add that as Assert().
> 
> I feel uncomfortable to look into indexinfo there. Couldn't we
> use indexskipprefix == -1 to signal !amcanskip from
> create_index_path?
> 
> 
> +  /*
> +   * XXX: In case of index scan quals evaluation happens after
> +   * ExecScanFetch, which means skip results could be fitered out
> +   */
> 
> Why can't we use skipscan path if having filter condition?  If
> something bad happens, the reason must be written here instead of
> what we do.
> 
> 
> +  if (path->pathtype == T_IndexScan &&
> +parse->jointree != NULL &&
> +parse->jointree->quals != NULL &&
> +((List *)parse->jointree->quals)->length != 0)
> 
> It's better to use list_length instead of peeping inside. It
> handles the NULL case as well. (The structure has recently
> changed but .length is not, though.)
> 
> 
> + * If advancing direction is different from index direction, we must
> + * skip right away, but _bt_skip requires a starting point.
> 
> It doesn't seem needed to me. Could you elaborate on the reason
> for that?
> 
> 
> + * If advancing direction is different from index direction, we must
> + * skip right away, but _bt_skip requires a starting point.
> + */
> +if (direction * indexonlyscan->indexorderdir < 0 &&
> +  !node->ioss_FirstTupleEmitted)
> 
> I'm confused by this. 

Re: POC: Cleaning up orphaned files using undo logs

2019-07-25 Thread vignesh C
Hi Thomas,

Few review comments on 0003-Add-undo-log-manager.patch:
1) Upgrade may fail
+/*
+ * Compute the new redo, and move the pg_undo file to match if necessary.
+ * Rather than renaming it, we'll create a new copy, so that a failure that
+ * occurs before the controlfile is rewritten won't be fatal.
+ */
+static void
+AdjustRedoLocation(const char *DataDir)
+{
+ uint64 old_redo = ControlFile.checkPointCopy.redo;
+ char old_pg_undo_path[MAXPGPATH];
+ char new_pg_undo_path[MAXPGPATH];
+ int old_fd;
+ int new_fd;
+ ssize_t nread;
+ ssize_t nwritten;
+ char buffer[1024];
+
+ /*
+ * Adjust fields as needed to force an empty XLOG starting at
+ * newXlogSegNo.
+ */

During the upgrade we delete the undo files present in the new cluster
and copy the undo files from the old cluster to the new cluster.
Then we try to readjust the redo location using pg_resetwal.
While trying to readjust, we read the control file details of the current
(new) cluster and try to open the undo file they point to.  As the undo
files of the current cluster have been removed and replaced with the old
cluster's contents, that file open will fail.

Attached a patch to solve this problem.

2) Drop tablespace failure in a corner case.

+ else
+ {
+ /*
+ * There is data we need in this undo log.  We can't force it to
+ * be detached.
+ */
+ ok = false;
+ }
+ LWLockRelease(&slot->mutex);

+ /* If we failed, then give up now and report failure. */
+ if (!ok)
+ return false;

One thought: can we discard the current tablespace entries
and try not to fail?

3) There will be a problem if deletion of some files succeeds and deletion
of other files fails: the meta contents having the end details also need
to be applied, or we need to handle the case where further undo is created
after the rollback

+ while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+ {
+ char segment_path[MAXPGPATH];
+
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+ snprintf(segment_path, sizeof(segment_path), "%s/%s",
+ undo_path, de->d_name);
+ if (unlink(segment_path) < 0)
+ elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+ }

4) In the error case, the "unlinked undo segment" message will still be logged:
+ while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+ {
+ char segment_path[MAXPGPATH];
+
+ if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+ continue;
+ snprintf(segment_path, sizeof(segment_path), "%s/%s",
+ undo_path, de->d_name);
+ elog(DEBUG1, "unlinked undo segment \"%s\"", segment_path);
+ if (unlink(segment_path) < 0)
+ elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+ }
+ FreeDir(dir);

Here the DEBUG1 "unlinked undo segment" message is emitted before unlink(),
so the success message gets logged even when unlink() fails.
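
A minimal way to address this, keeping the loop quoted above, is to emit the
DEBUG1 message only after unlink() has succeeded (sketch):

+ while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+ {
+ char segment_path[MAXPGPATH];
+
+ if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+ continue;
+ snprintf(segment_path, sizeof(segment_path), "%s/%s",
+ undo_path, de->d_name);
+ if (unlink(segment_path) < 0)
+ elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+ else
+ elog(DEBUG1, "unlinked undo segment \"%s\"", segment_path);
+ }
+ FreeDir(dir);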

5) UndoRecPtrIsValid() can be used instead of comparing against InvalidUndoRecPtr:
+ /*
+ * 'size' is expressed in usable non-header bytes.  Figure out how far we
+ * have to move insert to create space for 'size' usable bytes, stepping
+ * over any intervening headers.
+ */
+ Assert(slot->meta.unlogged.insert % BLCKSZ >= UndoLogBlockHeaderSize);
+ if (context->try_location != InvalidUndoRecPtr)

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 25, 2019 at 9:30 AM vignesh C  wrote:
>
> On Thu, Jul 25, 2019 at 7:48 AM Amit Kapila  wrote:
> >
> > On Wed, Jul 24, 2019 at 11:04 PM vignesh C  wrote:
> > >
> > > Hi,
> > >
> > > I have done some review of undolog patch series
> > > and here are my comments:
> > > 0003-Add-undo-log-manager.patch
> > >
> > > 1) As undo log is being created in tablespace,
> > >  if the tablespace is dropped later, will it have any impact?
>
> Thanks Amit, that clarifies the problem I was thinking about.
> I have another question regarding drop tablespace failure, but I
> don't have a better solution for that problem.
> Let me think more about it and discuss.
> >
> > Yes, it drops the undo logs present in tablespace being dropped.  See
> > DropUndoLogsInTablespace() in the same patch.
> >
> > >
> > > 4) Should we add a readme file for undolog as it does a fair amount of 
> > > work
> > > and is core part of the undo system?
> > >
> Thanks Amit, I could get the details of readme.
> >
> > The Readme is already present in the patch series posted by Thomas.
> > See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in
> > email [1].
> >
> > [1] - 
> > https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com
> >
> > --
> > With Regards,
> > Amit Kapila.
> > EnterpriseDB: http://www.enterprisedb.com
>
> --
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
From 66578336861e02b9dc5544b7d2fed46261681998 Mon Sep 17 00:00:00 2001
From: Vigneshwaran c 
Date: Mon, 8 Jul 2019 22:49:18 +0530
Subject: [PATCH] pg_upgrade failure fix.

For adjusting the undo logs, the old cluster's redo location needs to be passed.

Patch by Vigneshwaran C, reviewed by Dilip Kumar.
---
 src/bin/pg_resetwal/pg_resetwal.c | 34 --
 

Re: Initdb failure

2019-07-25 Thread Rafia Sabih
On Thu, 25 Jul 2019 at 07:39, vignesh C  wrote:
>
> Hi,
>
> Initdb fails when following path is provided as input:
> datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/
>
> Also the cleanup also tends to fail in the cleanup path.
>
> Could be something to do with path handling.
This is because the value of MAXPGPATH is 1024 and the path you are
providing is longer than that. Hence, when it is trying to read
PG_VERSION in ValidatePgVersion, it ends up looking at a wrong, truncated
path of just 1024 characters.
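
To make the failure mode concrete, here is a small standalone illustration
(not the actual initdb/ValidatePgVersion code) of how a fixed MAXPGPATH-sized
buffer silently truncates an over-long data directory path, so later accesses
look at a path that does not exist:

#include <stdio.h>
#include <string.h>

#define MAXPGPATH 1024	/* same limit as pg_config_manual.h */

int
main(void)
{
	char		long_dir[4096];
	char		version_file[MAXPGPATH];

	/* pretend this is the over-long data directory argument */
	memset(long_dir, 'x', sizeof(long_dir) - 1);
	long_dir[sizeof(long_dir) - 1] = '\0';

	/* snprintf() truncates silently instead of failing */
	snprintf(version_file, sizeof(version_file), "%s/%s", long_dir, "PG_VERSION");

	printf("wanted %zu characters, kept %zu\n",
		   strlen(long_dir) + strlen("/PG_VERSION"), strlen(version_file));
	return 0;
}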

> I'm not sure if this is already known.
I am also not sure if it is known or intentional. On the other hand, I
also don't know if it is practical to give such a long name to the
database directory anyway.

-- 
Regards,
Rafia Sabih




Re: Index Skip Scan

2019-07-25 Thread Kyotaro Horiguchi
Hello.

At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthali...@gmail.com> wrote 
in 
> > On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen 
> >  wrote:
> >
> > On 7/22/19 1:44 AM, David Rowley wrote:
> > > Here are the comments I noted down during the review:
> > >
> > > cost_index:
> > >
> > > I know you've not finished here, but I think it'll need to adjust
> > > tuples_fetched somehow to account for estimate_num_groups() on the
> > > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > > need to be part of the estimate there as there can only be at most one
> > > value for these.
> > >
> > > For example, in a query such as:
> > >
> > > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> > >
> > > you only need to perform estimate_num_groups() on "y".
> > >
> > > I'm really not quite sure on what exactly will be required from
> > > amcostestimate() here.  The cost of the skip scan is not the same as
> > > the normal scan. So either that API needs adjusting to allow the caller
> > > to mention that we want skip scans estimated, or there needs to be
> > > another callback.
> > >
> >
> > I think this part will become more clear once the index skip scan patch
> > is rebased, as we got the uniquekeys field on the Path, and the
> > indexskipprefixy info on the IndexPath node.
> 
> Here is what I came up with to address the problems, mentioned above in this
> thread. It passes tests, but I haven't tested it yet more thoughtful (e.g. it
> occurred to me, that `_bt_read_closest` probably wouldn't work, if a next key,
> that passes an index condition is few pages away - I'll try to tackle it 
> soon).
> Just another small step forward, but I hope it's enough to rebase on top of it
> planner changes.
> 
> Also I've added few tags, mostly to mention reviewers contribution.

I have some comments.

+   * The order of columns in the index should be the same, as for
+   * unique distincs pathkeys, otherwise we cannot use _bt_search
+   * in the skip implementation - this can lead to a missing
+   * records.

It seems that it is enough that the distinct pathkeys are contained in
the index pathkeys. If that's right, this is almost checked in the
existing code:

> if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))

That is perfect when needed_pathkeys is distinct_pathkeys, so an
additional check is required if and only if that is not the case.

>if (enable_indexskipscan &&
>  IsA(path, IndexPath) &&
>  ((IndexPath *) path)->indexinfo->amcanskip &&
>  (path->pathtype == T_IndexOnlyScan ||
>   path->pathtype == T_IndexScan) &&
>  (needed_pathkeys == root->distinct_pathkeys ||
>   pathkeys_contained_in(root->distinct_pathkeys,
>   path->pathkeys)))

path->pathtype is always one of T_IndexOnlyScan or T_IndexScan, so
the check against them isn't needed. If you have concerns about that,
we can add it as an Assert().

I feel uncomfortable looking into indexinfo there. Couldn't we
use indexskipprefix == -1 to signal !amcanskip from
create_index_path?


+  /*
+   * XXX: In case of index scan quals evaluation happens after
+   * ExecScanFetch, which means skip results could be fitered out
+   */

Why can't we use a skip scan path when there is a filter condition?  If
something bad happens, the reason should be written here instead of
what we do.


+  if (path->pathtype == T_IndexScan &&
+parse->jointree != NULL &&
+parse->jointree->quals != NULL &&
+((List *)parse->jointree->quals)->length != 0)

It's better to use list_length() instead of peeping inside; it
handles the NULL case as well. (The List structure has recently
changed, but .length has not.)


+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.

It doesn't seem needed to me. Could you elaborate on the reason
for that?


+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+if (direction * indexonlyscan->indexorderdir < 0 &&
+  !node->ioss_FirstTupleEmitted)

I'm confused by this. "direction" there is the physical scan
direction (fwd/bwd) of index scan, which is already compensated
by indexorderdir. Thus the condition means we do that when
logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
"index direction" exactly means...)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-25 Thread Julien Rouhaud
On Thu, Jul 25, 2019 at 12:03 PM Michael Paquier  wrote:
>
> On Thu, Jul 25, 2019 at 12:12:40PM +0300, Sergei Kornilov wrote:
> > Thank you, v9 code and behavior looks good for me. Builds cleanly,
> > works with older releases. I'll mark Ready to Commiter.
>
> The restriction with --jobs and --system is not documented, and that's
> good to have from the start.

That's documented in the patch:
+Note that this option is ignored with the --index
+option to prevent deadlocks when processing multiple indexes from
+the same relation, and incompatible with the --system
+option.

The restriction with --jobs and --concurrently is indeed not
specifically documented in reindexdb.sgml, there's only a mention in
reindex.sgml:

REINDEX SYSTEM does not support
CONCURRENTLY since system catalogs cannot be reindexed

The behavior doesn't really change with this patch, though we could
enhance the documentation.

>  This actually makes the parallel job
> handling with --index inconsistent as we mask parallelism in this case
> by enforcing the use of one connection.  I think that we should
> revisit the interactions with --index and --jobs actually, because,
> assuming that users never read the documentation, they may actually be
> surprised to see that something like --index idx1 .. --index idxN
> --jobs=N does not lead to any improvements at all, until they find out
> the reason why.

The problem is that a user doing something like:

reindexdb -j48 -i some_index -S s1 -S s2 -S s3

will probably be disappointed to learn that they have to run a separate
command for the index(es) that should be reindexed.  Maybe we can
issue a warning that parallelism isn't used when an index list is
processed and the user asked for multiple jobs?
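
Something along these lines in the option handling could do it (only a
sketch; the variable names are assumptions based on this thread, not
necessarily the ones used in the patch):

/* warn and fall back to a single connection when indexes are listed */
if (indexes.head != NULL && concurrentCons > 1)
{
	pg_log_warning("ignoring --jobs: indexes are processed one by one through a single connection");
	concurrentCons = 1;
}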

> Thinking deeper about it, there is also a point of gathering first all
> the relations if one associates --schemas and --tables in the same
> call of reindexdb and then pass down a list of decomposed relations
> which are processed in parallel.  The code as currently presented is
> rather straight-forward, and I don't think that this is worth the
> extra complication, but this was not mentioned until now on this
> thread :)

+1


> For the non-parallel case in reindex_one_database(), I would add an
> Assert on REINDEX_DATABASE and REINDEX_SYSTEM with a comment to
> mention that a NULL list of objects can just be passed down only in
> those two cases when the single-object list with the database name is
> built.

Something like that?

if (!parallel)
{
-   if (user_list == NULL)
+   /*
+* Database-wide and system catalogs processing should omit the list
+* of objects to process.
+*/
+   if (process_type == REINDEX_DATABASE || process_type == REINDEX_SYSTEM)
{
+   Assert(user_list == NULL);
+
process_list = pg_malloc0(sizeof(SimpleStringList));
simple_string_list_append(process_list, PQdb(conn));
}

There's another assert after the else-parallel branch that checks that a
list is present, so there's no need to also check it here.

I'm not sending a new patch since the wanted behavior for --index is not clear yet.




Re: POC: Cleaning up orphaned files using undo logs

2019-07-25 Thread Amit Kapila
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila  wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila  wrote:
> >
> > > Yep, that was completely wrong.  Here's a new version.
> > >
> >
> > One comment/question related to
> > 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch.
> >
>
> I have done some more review of undolog patch series and here are my comments:
> 0003-Add-undo-log-manager.patch
>

Some more review of the same patch:
1.
+typedef struct UndoLogSharedData
+{
+ UndoLogNumber free_lists[UndoLogCategories];
+ UndoLogNumber low_logno;


What is the use of low_logno?  I don't see anywhere in the code this
being assigned any value.  Is it for some future use?

2.
+void
+CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
{
..
+ /* Compute header checksum. */
+ INIT_CRC32C(crc);
+ COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
+ COMP_CRC32C(crc, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno));
+ COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
+ FIN_CRC32C(crc);
+
+ /* Write out the number of active logs + crc. */
+ if ((write(fd, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno)) ||
+ (write(fd, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno)) != sizeof(UndoLogShared->next_logno)) ||

Is it safe to read UndoLogShared without UndoLogLock?  All other
places accessing UndoLogShared use UndoLogLock, so if this usage is
safe, maybe it is better to add a comment.

3.
UndoLogAllocateInRecovery()
{
..
/*
+ * Otherwise we need to do our own transaction tracking
+ * whenever we see a new xid, to match the logic in
+ * UndoLogAllocate().
+ */
+ if (xid != slot->meta.unlogged.xid)
+ {
+ slot->meta.unlogged.xid = xid;
+ if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
+ slot->meta.unlogged.last_xact_start =
+ slot->meta.unlogged.this_xact_start;
+ slot->meta.unlogged.this_xact_start =
+ slot->meta.unlogged.insert;

The code doesn't follow the comment.  In UndoLogAllocate(), both
last_xact_start and this_xact_start are assigned inside the if block, so
the same should be the case here.
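
A sketch of the adjustment implied here, based on the hunk quoted above
(moving the second assignment inside the inner if, to match UndoLogAllocate()):

+ if (xid != slot->meta.unlogged.xid)
+ {
+ slot->meta.unlogged.xid = xid;
+ if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
+ {
+ slot->meta.unlogged.last_xact_start =
+ slot->meta.unlogged.this_xact_start;
+ slot->meta.unlogged.this_xact_start =
+ slot->meta.unlogged.insert;
+ }
+ }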

4.
UndoLogAllocateInRecovery()
{
..
+ /*
+ * Just as in UndoLogAllocate(), the caller may be extending an existing
+ * allocation before committing with UndoLogAdvance().
+ */
+ if (context->try_location != InvalidUndoRecPtr)
+ {
..
}

I am not sure how this will work because, unlike UndoLogAllocate(), this
function doesn't set try_location initially.  It will be set later by
UndoLogAdvance(), which can easily go wrong because that doesn't include
UndoLogBlockHeaderSize.

5.
+UndoLogAdvance(UndoLogAllocContext *context, size_t size)
+{
+ context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location,
+ size);
+}

Here, you are using an UndoRecPtr whereas UndoLogOffsetPlusUsableBytes()
expects an offset.

6.
UndoLogAllocateInRecovery()
{
..
+ /*
+ * At this stage we should have an undo log that can handle this
+ * allocation.  If we don't, something is screwed up.
+ */
+ if (UndoLogOffsetPlusUsableBytes(slot->meta.unlogged.insert, size) >
slot->meta.end)
+ elog(ERROR,
+ "cannot allocate %d bytes in undo log %d",
+ (int) size, slot->logno);
..
}

Similar to point-5, here you are using a pointer instead of offset.

7.
UndoLogAllocateInRecovery()
{
..
+ /* We found a reference to a different (or first) undo log. */
+ slot = find_undo_log_slot(logno, false);
..
+ /* TODO: check locking against undo log slot recycling? */
..
}

I think it is better to have an Assert here that slot can't be NULL.
AFAICS, slot can't be NULL unless there is some bug.  I don't
understand this 'TODO' comment.

8.
+ {
+ {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the tablespace(s) to use for undo logs."),
+ NULL,
+ GUC_LIST_INPUT | GUC_LIST_QUOTE
+ },
+
+ &undo_tablespaces,
+ "",
+ check_undo_tablespaces, assign_undo_tablespaces, NULL
+ },

It seems you need to update variable_is_guc_list_quote for this variable.

9.
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+ xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+ XLogFlush(ptr);
+ }
..
}

Do we need it for the temporary/unlogged persistence levels?  Similarly,
there is WAL logging in attach_undo_log, and I can't understand why
it would be required for temporary/unlogged persistence levels.

10.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ oid = get_tablespace_oid(name, true);
+ if (oid == InvalidOid)
..
}

Do we need to check permissions to see if the current user is allowed
to create in this tablespace?

11.
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char   *rawname;
+ List   *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of string. */
+ rawname = pstrdup(undo_tablespaces);

I don't see the usage of 

Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-25 Thread Sergei Kornilov
Hi

>>>  I don't have a strong opinion here. If we change for >=, it'd be
>>>  better to also adapt vacuumdb for consistency. I didn't change it for
>>>  now, to stay consistent with vacuumdb.
>>
>>  Yep, no strong opinion from me too.
>
> My opinion tends towards consistency. Consistency sounds good.

Which consistency do you prefer? Currently we have just one >= FD_SETSIZE check in 
pgbench and one > FD_SETSIZE - 1 check in vacuumdb. That's all.

regards, Sergei




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-25 Thread Michael Paquier
On Thu, Jul 25, 2019 at 12:12:40PM +0300, Sergei Kornilov wrote:
> Thank you, v9 code and behavior looks good for me. Builds cleanly,
> works with older releases. I'll mark Ready to Commiter.

The restriction with --jobs and --system is not documented, and that's
something good to have from the start.  This actually makes the parallel job
handling with --index inconsistent, as we mask parallelism in this case
by enforcing the use of one connection.  I think that we should
revisit the interactions between --index and --jobs, because,
assuming that users never read the documentation, they may actually be
surprised to see that something like --index idx1 .. --index idxN
--jobs=N does not lead to any improvement at all, until they find out
the reason why.  It is also much easier to have an error as a starting
point because it can be lifted later on.  There is an argument that
we may actually not need this restriction at all on --index, as long as
the user knows what they are doing and does not list indexes from the
same relation; still, I would keep an error.

Thinking deeper about it, there is also a point of gathering first all
the relations if one associates --schemas and --tables in the same
call of reindexdb and then pass down a list of decomposed relations
which are processed in parallel.  The code as currently presented is
rather straight-forward, and I don't think that this is worth the
extra complication, but this was not mentioned until now on this
thread :)

For the non-parallel case in reindex_one_database(), I would add an
Assert on REINDEX_DATABASE and REINDEX_SYSTEM with a comment to
mention that a NULL list of objects can just be passed down only in
those two cases when the single-object list with the database name is
built.

>> I don't have a strong opinion here. If we change for >=, it'd be
>> better to also adapt vacuumdb for consistency. I didn't change it for
>> now, to stay consistent with vacuumdb.
> 
> Yep, no strong opinion from me too.

My opinion tends towards consistency.  Consistency sounds good.
--
Michael


signature.asc
Description: PGP signature


Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-25 Thread Sergei Kornilov
Hi

Thank you, the v9 code and behavior look good to me. Builds cleanly, works with 
older releases. I'll mark it Ready for Committer.

> I don't have a strong opinion here. If we change for >=, it'd be
> better to also adapt vacuumdb for consistency. I didn't change it for
> now, to stay consistent with vacuumdb.

Yep, no strong opinion from me too.

regards, Sergei




Re: dropdb --force

2019-07-25 Thread Pavel Stehule
čt 25. 7. 2019 v 5:11 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > [ drop-database-force-20190708.patch ]
>
> I took a brief look at this, but I don't think it's really close to
> being committable.
>
> * The documentation claims FORCE will fail if you don't have privileges
> to terminate the other session(s) in the target DB.  This is a lie; the
> code issues kill() without any regard for such niceties.  You could
> perhaps make that better by going through pg_signal_backend().
>
> * You've hacked CountOtherDBBackends to do something that's completely
> outside the charter one would expect from its name, and not even
> bothered to update its header comment.  This requires more attention
> to not confusing future hackers; I'd say you can't even use that
> function name anymore.
>
> * You've also ignored the existing code in CountOtherDBBackends
> that is careful *not* to issue a kill() while holding the critical
> ProcArrayLock.  That problem would get enormously worse if you
> tried to sub in pg_signal_backend() there, since that function
> may do catalog accesses --- it's pretty likely that you could
> get actual deadlocks, never mind just trashing performance.
>
> * I really dislike the addition of more hard-wired delays and
> timeouts to dropdb().  It's bad enough that CountOtherDBBackends
> has got that.  Two layers of timeout are a rickety mess, and
> who's to say that a 1-minute timeout is appropriate for anything?
>
> * I'm concerned that the proposed syntax is not future-proof.
> FORCE is not a reserved word, and we surely don't want to make
> it one; but just appending it to the end of the command without
> any decoration seems like a recipe for problems if anybody wants
> to add other options later.  (Possible examples: RESTRICT/CASCADE,
> or a user-defined timeout.)  Maybe some parentheses would help?
> Or possibly I'm being overly paranoid, but ...
>
>
Would something like

DROP DATABASE '(' options ... ')' [IF EXISTS] name

be ok?


>
> I hadn't been paying any attention to this thread before now,
> but I'd assumed from the thread title that the idea was to implement
> any attempted kills in the dropdb app, not on the backend side.
> (As indeed it looks like the first version did.)  Maybe it would be
> better to go back to that, instead of putting dubious behaviors
> into the core server.
>

I don't think so - the server-side implementation is quite helpful; there are
a lot of situations where a DDL command is much more practical.


> regards, tom lane
>


Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-25 Thread Julien Rouhaud
Thanks for the review!

On Thu, Jul 25, 2019 at 10:17 AM Sergei Kornilov  wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, failed
> Spec compliant:   not tested
> Documentation:tested, passed
>
> Hi
>
> I did some review and have few notes about behavior.
>
> reindex database does not work with concurrently option:
>
> > ./inst/bin/reindexdb --echo -d postgres -j 8 --concurrently
> > SELECT pg_catalog.set_config('search_path', '', false);
> > REINDEX SYSTEM CONCURRENTLY postgres;
> > reindexdb: error: reindexing of system catalogs on database "postgres" 
> > failed: ERROR:  cannot reindex system catalogs concurrently
>
> I think we need print message and skip system catalogs for concurrently 
> reindex.
> Or we can disallow concurrently database reindex with multiple jobs. I prefer 
> first option.

Good point.  I agree with 1st option, as that's already what would
happen without the --jobs switch:

$ reindexdb -d postgres --concurrently
WARNING:  cannot reindex system catalogs concurrently, skipping all

(although this is emitted by the backend)
I modified the client code to behave the same and added a regression test.

> > + reindex_one_database(dbname, REINDEX_SCHEMA, 
> > , host,
> > +  port, 
> > username, prompt_password, progname,
> > +  echo, 
> > verbose, concurrently,
> > +  
> > Min(concurrentCons, nsp_count));
>
> Should be just concurrentCons instead of Min(concurrentCons, nsp_count)

Indeed, that changed with v8 and I forgot to update it, fixed.

> reindex_one_database for REINDEX_SCHEMA will build tables list and then 
> processing by available workers. So:
> -j 8 -S public -S public -S public -S poblic -S public -S public - will work 
> with 6 jobs (and without multiple processing for same table)
> -j 8 -S public - will have only one worker regardless tables count
>
> > if (concurrentCons > FD_SETSIZE - 1)
>
> "if (concurrentCons >= FD_SETSIZE)" would not cleaner? Well, pgbench uses >= 
> condition, vacuumdb uses > FD_SETSIZE - 1. No more FD_SETSIZE in conditions =)

I don't have a strong opinion here.  If we change to >=, it'd be
better to also adapt vacuumdb for consistency.  I didn't change it for
now, to stay consistent with vacuumdb.


reindex_parallel_v9.diff
Description: Binary data


Re: pg_receivewal documentation

2019-07-25 Thread Jehan-Guillaume de Rorthais
On Thu, 25 Jul 2019 16:58:17 +0900
Michael Paquier  wrote:

> On Wed, Jul 24, 2019 at 03:03:04PM +0200, Jehan-Guillaume de Rorthais wrote:
> > Unless I am missing something, another solution might be to use a dedicated
> > role to pg_receive{xlog|wal} with synchronous_commit lower than
> > remote_apply.  
> 
> Aren't you confused by the same thing as I was upthread [1]?
> [1]: https://www.postgresql.org/message-id/20190710080423.gg1...@paquier.xyz
> 
> remote_apply affects all sessions.  So even if you use a replication
> role with synchronous_commit = on and have pg_receivewal use that with
> remote_apply set in postgresql.conf, then remote_apply is effective
> for all the other sessions so these will still be stuck at commit
> waiting for pg_receivewal to apply WAL if it is a synchronous
> standby.

Argh!

(Sorry for the noise)




add_path() for Path without InitPlan: cost comparison vs. Paths that require one

2019-07-25 Thread Dent John
Hi folks,

I’ve run into a planning conundrum with my query rewriting extension for MVs 
when attempting to rewrite a RECURSIVE CTE.

RECURSIVE CTEs are expensive — and presumably tricky to optimise — and so a 
good use case for query rewrite against an MV; all the more so if Yugo’s 
Incremental View Maintenance concept gets traction.

I want to add an alternative Path for the UPPERREL_FINAL of the CTE root, but 
my new MV scan path (which is actually a thin CustomScan atop a scan of the MV) 
is rejected in favour of the existing paths. 

This seems to be because my Path is more expensive than the Rel’s existing 
Paths when considered alone. (The CTE’s final scan is actually a scan Path over 
a worktable, so it really is much lighter.)

However, if I factor back in the cost of the InitPlan, things net out much more 
in favour of a scan against the MV. Of course, the add_path() comparison logic 
doesn’t include the InitPlan cost, so the point is moot. 

I’m wondering how I should approach this problem. First pass, I can’t see how 
to achieve an amicable solution with existing infrastructure.

I have a few possible solutions. Do any of the following make sense?

1. Override the add_path() logic to force my Path to win? This was initially my 
least favourite approach, but perhaps it’s actually the most pragmatic. 
Advantage is I think I could do this entirely in my EXTENSION. 

2. Make a new version of add_path() which is more aware of dependencies.

Seems #2 could have utility in PG generally. If I’m not wrong, my guess is that 
one of the reasons for the policy of materialising a CTE at two or more 
references and inlining it at one is that we don't have the planner logic to 
trade off materialisation versus inlining.

Also, I am wondering if my MV rewrite logic misses cases where the planner 
decides to materialise an intermediate result as an InitPlan for later 
processing. 

3. I considered creating a new root PlannerInfo structure, and burying the 
existing one another level down, alongside my MV scan, in a Gather-like 
arrangement. That converts the costing conundrum to a choice between roots. 
Obviously that will include the InitPlan costs. I figured I could eliminate one 
sub-root much as Path elimination works. But on reflection, I’m not sure PG has 
enough flexibility in the Path concept to support this route forward.

I’d welcome any view, ideas or advice. 

d.




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-25 Thread Sergei Kornilov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, failed
Spec compliant:   not tested
Documentation:tested, passed

Hi

I did some review and have few notes about behavior.

reindex database does not work with concurrently option:

> ./inst/bin/reindexdb --echo -d postgres -j 8 --concurrently
> SELECT pg_catalog.set_config('search_path', '', false);
> REINDEX SYSTEM CONCURRENTLY postgres;
> reindexdb: error: reindexing of system catalogs on database "postgres" 
> failed: ERROR:  cannot reindex system catalogs concurrently

I think we need to print a message and skip system catalogs for concurrent reindex.
Or we can disallow concurrent database-wide reindex with multiple jobs. I prefer 
the first option.

> + reindex_one_database(dbname, REINDEX_SCHEMA, , 
> host,
> +  port, 
> username, prompt_password, progname,
> +  echo, verbose, 
> concurrently,
> +  
> Min(concurrentCons, nsp_count));

Should be just concurrentCons instead of Min(concurrentCons, nsp_count)
reindex_one_database for REINDEX_SCHEMA will build the table list and then 
process it with the available workers. So:
-j 8 -S public -S public -S public -S poblic -S public -S public - will work 
with 6 jobs (and without processing the same table more than once)
-j 8 -S public - will have only one worker regardless of the table count

> if (concurrentCons > FD_SETSIZE - 1)

"if (concurrentCons >= FD_SETSIZE)" would not cleaner? Well, pgbench uses >= 
condition, vacuumdb uses > FD_SETSIZE - 1. No more FD_SETSIZE in conditions =)

regards, Sergei

Re: pg_receivewal documentation

2019-07-25 Thread Michael Paquier
On Wed, Jul 24, 2019 at 03:03:04PM +0200, Jehan-Guillaume de Rorthais wrote:
> Unless I am missing something, another solution might be to use a dedicated
> role to pg_receive{xlog|wal} with synchronous_commit lower than
> remote_apply.

Aren't you confused by the same thing as I was upthread [1]?
[1]: https://www.postgresql.org/message-id/20190710080423.gg1...@paquier.xyz

remote_apply affects all sessions.  So even if you use a replication
role with synchronous_commit = on and have pg_receivewal use that with
remote_apply set in postgresql.conf, then remote_apply is effective
for all the other sessions so these will still be stuck at commit
waiting for pg_receivewal to apply WAL if it is a synchronous
standby.
--
Michael


signature.asc
Description: PGP signature


Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-25 Thread Julien Rouhaud
Sorry for the late answer,

On Tue, Jul 23, 2019 at 9:38 AM Michael Paquier  wrote:
>
> On Mon, Jul 22, 2019 at 02:40:19PM +0200, Julien Rouhaud wrote:
> > Right, so I kept the long option.  Also this comment was outdated, as
> > the --jobs is now just ignored with a list of indexes, so I fixed that
> > too.
>
> +   if (!parallel)
> +   {
> +   if (user_list == NULL)
> +   {
> +   /*
> +* Create a dummy list with an empty string, as user requires an
> +* element.
> +*/
> +   process_list = pg_malloc0(sizeof(SimpleStringList));
> +   simple_string_list_append(process_list, "");
> +   }
> +   }
> This part looks a bit hacky and this is needed because we don't have a
> list of objects when doing a non-parallel system or database reindex.
> The deal is that we just want a list with one element: the database of
> the connection.  Wouldn't it be more natural to assign the database
> name here using PQdb(conn)?  Then add an assertion at the top of
> run_reindex_command() checking for a non-NULL name?

Good point, fixed it that way.
>
> > I considered this, but it would require to adapt all code that declare
> > SimpleStringList stack variable (vacuumdb.c, clusterdb.c,
> > createuser.c, pg_dumpall.c and pg_dump.c), so it looked like too much
> > trouble to avoid 2 local variables here and 1 in vacuumdb.c.  I don't
> > have a strong opinion here, so I can go for it if you prefer.
>
> Okay.  This is a tad wider than the original patch proposal, and this
> adds two lines.  So let's discard that for now and keep it simple.

Ok!

> >> +$node->issues_sql_like([qw(reindexdb -j2)],
> >> +   qr/statement: REINDEX TABLE public.test1/,
> >> +   'Global and parallel reindex will issue per-table REINDEX');
> >> Would it make sense to have some tests for schemas here?
> >>
> >> One of my comments in [1] has not been answered.  What about
> >> the decomposition of a list of schemas into a list of tables when
> >> using the parallel mode?
> >
> > I did that in attached v6, and added suitable regression tests.
>
> The two tests for "reindexdb -j2" can be combined into a single call,
> checking for both commands to be generated in the same output.  The
> second command triggering a reindex on two schemas cannot be used to
> check for the generation of both s1.t1 and s2.t2 as the ordering may
> not be guaranteed.  The commands arrays also looked inconsistent with
> the rest.  Attached is an updated patch with some format modifications
> and the fixes for the tests.

Attached v8 based on your v7 + the fix for the dummy part.


reindex_parallel_v8.diff
Description: Binary data


Re: Issue in to_timestamp/to_date while handling the quoted literal string

2019-07-25 Thread Suraj Kharage
Thanks for the clarification Brendan, that really helps.

On Wed, Jul 24, 2019 at 6:24 PM Brendan Jurd  wrote:

> Hi Suraj,
>
> I think the documentation is reasonably clear about this behaviour, quote:
>
> " In to_date, to_number, and to_timestamp, literal text and double-quoted
> strings result in skipping the number of characters contained in the
> string; for example "XX" skips two input characters (whether or not they
> are XX)."
>
> I can appreciate that this isn't the behaviour you intuitively expected
> from to_timestamp, and I don't think you'd be the first or the last.  The
> purpose of these functions was never to validate that your input string
> precisely matches the non-coding parts of your format pattern.  For that, I
> think you'd be better served by using regular expressions.
>
> Just as an aside, in the example you gave, the string
> '2019-05-24T23:12:45' will cast directly to timestamp just fine, so it
> isn't the kind of situation to_timestamp was really intended for.  It's
> more for when your input string is in an obscure (or ambiguous) format that
> is known to you in advance.
>
> I hope that helps.
>
> Cheers
> Brendan
>
> On Wed, 24 Jul 2019 at 21:38, Suraj Kharage <
> suraj.khar...@enterprisedb.com> wrote:
>
>> Hi,
>>
>> I noticed the issue in to_timestamp()/to_date() while handling the double
>> quote literal string. If any double quote literal characters found in
>> format, we generate the NODE_TYPE_CHAR in parse format and store that
>> actual character in FormatNode->character. In DCH_from_char, we just
>> increment the input string by length of character for NODE_TYPE_CHAR.
>> We are actually not matching these characters in input string and because
>> of this, date values get changed if quoted literal string is not identical
>> in input and format string.
>>
>> e.g:
>>
>> postgres@78619=#select to_timestamp('2019-05-24T23:12:45',
>> '-mm-dd"TT"hh24:mi:ss');
>>to_timestamp
>> ---
>>  2019-05-24 03:12:45+05:30
>> (1 row)
>>
>>
>> In above example, the quoted string is 'TT', so it just increment the
>> input string by 2 while handling these characters and returned the wrong
>> hour value.
>>
>> My suggestion is to match the exact characters from quoted literal string
>> in input string and if doesn't match then throw an error.
>>
>> Attached is the POC patch which almost works for all scenarios except for
>> whitespace - as a quote character.
>>
>> Suggestions?
>> --
>> --
>>
>> Thanks & Regards,
>> Suraj kharage,
>> EnterpriseDB Corporation,
>> The Postgres Database Company.
>>
>

-- 
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.


RE: seems like a bug in pgbench -R

2019-07-25 Thread Fabien COELHO



Hello Yoshikazu,


[...] So I'll mark this ready for committer.


Ok, thanks for the review.

--
Fabien.