Re: Why is infinite_recurse test suddenly failing?

2019-08-15 Thread Thomas Munro
On Thu, Aug 15, 2019 at 5:49 PM Tom Lane  wrote:
> So that leads to the thought that "the infinite_recurse test is fine
> if it runs by itself, but it tends to fall over if there are
> concurrently-running backends".  I have absolutely no idea how that
> would happen on anything that passes for a platform built in this
> century.  Still, it's a place to start, which we hadn't before.

Hmm.  mereswin's recent failure on REL_11_STABLE was running the
serial schedule.

I read about 3 ways to get SEGV from stack-related faults: you can
exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV
(per man pages), you can access a page that is inside the mapping but
is beyond the stack pointer (with some tolerance, exact details vary
by arch), and you can fail to allocate a page due to low memory.

The first kind of failure doesn't seem right -- we carefully set
max_stack_size based on RLIMIT_STACK minus some slop, so that theory
would require child processes to have different stack limits than the
postmaster as you said (perhaps OpenStack, Docker, related tooling or
concurrent activity on the host system is capable of changing it?), or
a bug in our slop logic.  The second kind of failure would imply that
we have a bug -- we're accessing something below the stack pointer --
but that doesn't seem right either -- I think various address
sanitising tools would have told us about that, and it's hard to
believe there is a bug in the powerpc and arm implementation of the
stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c).  That
leaves the third explanation, except then I'd expect to see other
kinds of problems like OOM etc before you get to that stage, and why
just here?  Confused.
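(For reference, the slop logic mentioned above boils down to something like the
following sketch; this is illustrative only, not the actual backend code, and the
function name and 512kB slop value are made up for the example.)

#include <stdio.h>
#include <sys/resource.h>

/*
 * Illustrative sketch only: derive a usable stack limit from RLIMIT_STACK
 * minus some slop, so that the backend's own "stack depth limit exceeded"
 * error fires before the kernel delivers SIGSEGV.  If a child process ended
 * up with a smaller RLIMIT_STACK than the value this was computed from, the
 * slop would be wrong and the kernel's limit could be hit first.
 */
static long
usable_stack_bytes(long slop_bytes)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0 || rlim.rlim_cur == RLIM_INFINITY)
        return -1;              /* unknown; caller should use a conservative default */

    return (long) rlim.rlim_cur - slop_bytes;
}

int
main(void)
{
    /* 512kB of slop is an arbitrary value for illustration */
    printf("usable stack: %ld bytes\n", usable_stack_bytes(512 * 1024));
    return 0;
}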

> Also notable is that we now have a couple of hits on ARM, not
> only ppc64.  Don't know what to make of that.

Yeah, that is indeed interesting.

-- 
Thomas Munro
https://enterprisedb.com




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-15 Thread Masahiko Sawada
On Thu, Aug 15, 2019 at 10:19 AM Bruce Momjian  wrote:
>
> On Wed, Aug 14, 2019 at 04:36:35PM +0200, Antonin Houska wrote:
> > I can work on it right away but don't know where to start.
>
> I think the big open question is whether there will be acceptance of an
> all-cluster encyption feature.  I guess if no one objects, we can move
> forward.
>

I still feel that we need per-table/tablespace keys, although that might
not be in the first implementation. I think the safety of table/tablespace-level
and cluster-level encryption would be almost the same, but the former would
have an advantage in terms of operation and performance.

> > First, I think we should use a code repository to integrate [1] and [2]
> > instead of sending diffs back and forth. That would force us to resolve
> > conflicts soon and help to avoid duplicate work. The diffs would be created
> > only whe we need to post the next patch version to pgsql-hackers for review,
> > otherwise the discussions of details can take place elsewhere.
>
> Well, we can do that, or just follow the TODO list and apply items as we
> complete them.  We have found that doing everything in one big patch is
> just too hard to review and get accepted.
>
> > The most difficult problem I see now regarding the collaboration is 
> > agreement
> > on the key management user interface. The Full-Cluster Encryption feature 
> > [1]
> > should not add configuration variables or even tools that the next, more
> > sophisticated version [2] deprecates immediately. Part of the problem is 
> > that
>
> Yes, the all-cluster encryption feature has _no_ SQL-level API to
> control it, just a GUC variable that you can use SHOW to see the
> encryption mode.
>
> > [2] puts all (key management related) interaction of postgres with the
> > environment into an external library. As I pointed out in my response to 
> > [2],
> > this will not work for frontend applications (e.g. pg_waldump). I think the
> > key management UI for [2] needs to be designed first even if PG 13 should
> > adopt only [1].
>
> I think there are several directions we can go after all-cluster
> encryption, and it does matter because we would want minimal API
> breakage.  The options are:
>
> 1)  Allow per-table encyption control to limit encryption overhead,
> though all of WAL still needs to be encrypted;  we could add a
> per-record encyption flag to WAL records to avoid that.
>
> 2)  Allow user-controlled keys, which are always unlocked, and encrypt
> WAL with one key
>
> 3)  Encrypt only the user-data portion of pages with user-controlled
> keys.  FREEZE and crash recovery work since only the user data is
> encrypted.  WAL is not encrypted, except for the user-data portion
>
> I think once we implement all-cluster encryption, there will be little
> value to #1 unless we find that page encryption is a big performance
> hit, which I think is unlikely based on performance tests so far.
>
> I don't think #2 has much value since the keys have to always be
> unlocked to allow freeze and crash recovery.
>
> I don't think #3 is viable since there is too much information leakage,
> particularly for indexes because the tid is visible.
>
> Now, if someone says they still want 2 & 3, which has happened many
> times, explain how these issues can be reasonable addressed.
>
> I frankly think we will implement all-cluster encryption, and nothing
> else.  I think the next big encryption feature after that will be
> client-side encryption support, which can be done now but is complex;
> it needs to be easier.
>
> > At least it should be clear how [2] will retrieve the master key because [1]
> > should not do it in a differnt way. (The GUC cluster_passphrase_command
> > mentioned in [3] seems viable, although I think [1] uses approach which is
> > more convenient if the passphrase should be read from console.)

I think that we can also provide a way to pass the encryption key directly
to the postmaster rather than using a passphrase. Since it is common for
users to store keys in a KMS, it would be useful if we could do that.

> > Rotation of
> > the master key is another thing that both versions of the feature should do 
> > in
> > the same way. And of course, the fronend applications need consistent 
> > approach
> > too.
>
> I don't see the value of an external library for key storage.

I think the big benefit is that PostgreSQL can seamlessly work with
external services such as a KMS. For instance, during key rotation
PostgreSQL can register a new key with the KMS and use it, and it can
remove keys when they are no longer necessary. That is, it enables
PostgreSQL not only to get keys from the KMS but also to register and
remove them. We could also decrypt the MDEK inside the KMS instead of
doing it in PostgreSQL, which is safer. In addition, once someone creates
a plugin library for an external service, individual projects don't need
to create their own.


BTW I've created PoC patch for cluster encryption feature. Attached
patch set has done some items of TODO list and so

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-15 Thread Antonin Houska
Bruce Momjian  wrote:

> On Wed, Aug 14, 2019 at 04:36:35PM +0200, Antonin Houska wrote:
> > I can work on it right away but don't know where to start.
> 
> I think the big open question is whether there will be acceptance of an
> all-cluster encyption feature.  I guess if no one objects, we can move
> forward.
> 
> > First, I think we should use a code repository to integrate [1] and [2]
> > instead of sending diffs back and forth. That would force us to resolve
> > conflicts soon and help to avoid duplicate work. The diffs would be created
> > only whe we need to post the next patch version to pgsql-hackers for review,
> > otherwise the discussions of details can take place elsewhere.
> 
> Well, we can do that, or just follow the TODO list and apply items as we
> complete them.  We have found that doing everything in one big patch is
> just too hard to review and get accepted.
> 
> > The most difficult problem I see now regarding the collaboration is 
> > agreement
> > on the key management user interface. The Full-Cluster Encryption feature 
> > [1]
> > should not add configuration variables or even tools that the next, more
> > sophisticated version [2] deprecates immediately. Part of the problem is 
> > that
> 
> Yes, the all-cluster encryption feature has _no_ SQL-level API to
> control it, just a GUC variable that you can use SHOW to see the
> encryption mode.
> 
> > [2] puts all (key management related) interaction of postgres with the
> > environment into an external library. As I pointed out in my response to 
> > [2],
> > this will not work for frontend applications (e.g. pg_waldump). I think the
> > key management UI for [2] needs to be designed first even if PG 13 should
> > adopt only [1].
> 
> I think there are several directions we can go after all-cluster
> encryption,

I think I misunderstood. What you summarize in

https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption

does include

https://www.postgresql.org/message-id/CAD21AoBjrbxvaMpTApX1cEsO=8N=nc2xVZPB0d9e-VjJ=ya...@mail.gmail.com

i.e. per-tablespace keys, right? Then the collaboration should be easier than
I thought.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Zedstore - compressed in-core columnar storage

2019-08-15 Thread Heikki Linnakangas

On 14/08/2019 20:32, Ashwin Agrawal wrote:

On Wed, Aug 14, 2019 at 2:51 AM Ashutosh Sharma wrote:

2) Is there a chance that IndexOnlyScan would ever be required for
   zedstore tables considering the design approach taken for it?


We have not given much thought to IndexOnlyScans so far, but I think an
IndexOnlyScan definitely would be beneficial for zedstore as well. Even for
normal index scans, fetching as many columns as possible from the index
itself and only getting the rest of the required columns from the table
would be good for zedstore; it would help to further cut down I/O. Ideally,
for visibility checking only the TidTree needs to be scanned and checked,
so the cost of checking is much lower compared to heap (if the VM can't be
consulted), but it is still a cost. Also, with vacuum, once the UNDO log
gets trimmed, the visibility checks are pretty cheap. Still, given all
that, having a VM-type structure to optimize this further would help.


Hmm, yeah. An index-only scan on a zedstore table could perform the "VM 
checks" by checking the TID tree in the zedstore. It's not as compact as 
the 2 bits per TID in the heapam's visibility map, but it's pretty good.



Further, I tried creating a zedstore table with a btree index on one of
its columns and loaded around 5 million (50 lakh) records into the table.
When the indexed column was scanned (with enable_seqscan set to off), it
used an IndexOnlyScan, and that took around 15-20 times longer than an
IndexOnlyScan on a heap table, just because an IndexOnlyScan on zedstore
always visits the table as the visibility check fails.


Currently, an index-only scan on zedstore should be pretty much the same 
speed as a regular index scan. All the visibility checks will fail, and 
you end up fetching every row from the table, just like a regular index 
scan. So I think what you're seeing is that index fetches on a
zedstore table are much slower than on heap.


Ideally, on a column store the index fetches would only fetch the needed 
columns, but I don't think that's been implemented yet, so all the 
columns are fetched. That can make a big difference, if you have a wide 
table with lots of columns, but only actually need a few of them. Was 
your test case something like that?


We haven't spent much effort on optimizing index fetches yet, so I expect
there are many other little tweaks we can do there to make it faster.


- Heikki




Re: Built-in connection pooler

2019-08-15 Thread Konstantin Knizhnik



On 30.07.2019 16:12, Tomas Vondra wrote:

On Tue, Jul 30, 2019 at 01:01:48PM +0300, Konstantin Knizhnik wrote:



On 30.07.2019 4:02, Tomas Vondra wrote:


My idea (sorry if it wasn't too clear) was that we might handle some
cases more gracefully.

For example, if we only switch between transactions, we don't quite care
about 'SET LOCAL' (but the current patch does set the tainted flag). The
same thing applies to GUCs set for a function.

For prepared statements, we might count the number of statements we
prepared and deallocated, and treat it as 'not tainted' when there are no
statements. Maybe there's some risk I can't think of.

The same thing applies to temporary tables - if you create and drop a
temporary table, is there a reason to still treat the session as tainted?

I have implemented one more trick to reduce the number of tainted backends:
it is now possible to use session variables in pooled backends.

How does it work?
The proxy detects "SET var=..." statements and converts them to
"SET LOCAL var=...". All such assignments are also concatenated and stored
in the session context at the proxy. The proxy then injects this statement
into each transaction block or prepends it to standalone statements.

This mechanism works only for GUCs set outside a transaction.
By default it is switched off; to enable it, turn on the "proxying_gucs"
parameter.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/spi/refint.c b/contrib/spi/refint.c
index adf0490..5c2095f 100644
--- a/contrib/spi/refint.c
+++ b/contrib/spi/refint.c
@@ -11,6 +11,7 @@
 
 #include "commands/trigger.h"
 #include "executor/spi.h"
+#include "storage/proc.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -93,6 +94,8 @@ check_primary_key(PG_FUNCTION_ARGS)
 	else
 		tuple = trigdata->tg_newtuple;
 
+	MyProc->is_tainted = true;
+
 	trigger = trigdata->tg_trigger;
 	nargs = trigger->tgnargs;
 	args = trigger->tgargs;
@@ -284,6 +287,8 @@ check_foreign_key(PG_FUNCTION_ARGS)
 		/* internal error */
 		elog(ERROR, "check_foreign_key: cannot process INSERT events");
 
+	MyProc->is_tainted = true;
+
 	/* Have to check tg_trigtuple - tuple being deleted */
 	trigtuple = trigdata->tg_trigtuple;
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c91e3e1..7aaddfe 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -719,6 +719,153 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_sessions (integer)
+  
+   max_sessions configuration parameter
+  
+  
+  
+
+  The maximum number of client sessions that can be handled by
+  one connection proxy when session pooling is enabled.
+  This parameter does not add any memory or CPU overhead, so
+  specifying a large max_sessions value
+  does not affect performance.
+  If the max_sessions limit is reached new connections are not accepted.
+
+
+  The default value is 1000. This parameter can only be set at server start.
+   
+  
+ 
+
+ 
+  session_pool_size (integer)
+  
+   session_pool_size configuration parameter
+  
+  
+  
+
+  Enables session pooling and defines the maximum number of
+  backends that can be used by client sessions for each database/user combination.
+  Launched non-tainted backends are never terminated even if there are no active sessions.
+  Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements.
+  Tainted backend can server only one client.
+
+
+  The default value is 10, so up to 10 backends will serve each database,
+
+  
+ 
+
+ 
+  proxy_port (integer)
+  
+   proxy_port configuration parameter
+  
+  
+  
+
+  Sets the TCP port for the connection pooler.
+  Clients connected to main "port" will be assigned dedicated backends,
+  while client connected to proxy port will be connected to backends through proxy which
+  performs transaction level scheduling. 
+   
+
+  The default value is 6543.
+
+  
+ 
+
+ 
+  connection_proxies (integer)
+  
+   connection_proxies configuration parameter
+  
+  
+  
+
+  Sets number of connection proxies.
+  Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing).
+  Each proxy launches its own subset of backends.
+  So maximal number of non-tainted backends is  session_pool_size*connection_proxies*databases*roles.
+   
+
+  The default value is 0, so session pooling is disabled.
+
+  
+ 
+
+ 
+  session_schedule (enum)
+  

Useless bms_free() calls in build_child_join_rel()

2019-08-15 Thread Etsuro Fujita
Hi,

While working on the PWJ patch [1], I noticed $SUBJECT.  They seem to
be leftovers from the original partitionwise-join patch, perhaps.
Attached is a patch for removing them.

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/CAPmGK16wDqJiUof8+e4HuGmrAqqoFzb=iQX4V+xicsJ5_BvJ=g...@mail.gmail.com


remove-useless-bms_free-calls.patch
Description: Binary data


Allow cluster owner to bypass authentication

2019-08-15 Thread Peter Eisentraut
This is an implementation of the idea I mentioned in [0].

The naming and description perhaps isn't ideal yet but it works in
principle.

The idea is that if you connect over a Unix-domain socket and the local
(effective) user is the same as the server's (effective) user, then
access should be granted immediately without any checking of
pg_hba.conf.  Because it's "your own" server and you can do anything you
want with it anyway.
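In rough C terms, the check amounts to the following; this is only a sketch
assuming getpeereid() is available (and the function name is made up), while
the attached patch does the real work inside ClientAuthentication() and also
handles SO_PEERCRED-style platforms:

#include <stdbool.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Sketch only: report whether the peer on a Unix-domain socket has the same
 * effective uid as this server process.
 */
static bool
peer_is_cluster_owner(int sock)
{
    uid_t       peer_uid;
    gid_t       peer_gid;

    if (getpeereid(sock, &peer_uid, &peer_gid) != 0)
        return false;           /* unknown peer: fall through to pg_hba.conf */

    return peer_uid == geteuid();
}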

I included an option to turn this off because (a) people are going to
complain, (b) you need this for the test suites to be able to test
pg_hba.conf, and (c) conceivably, someone might want to have all access
to go through pg_hba.conf for some auditing reasons (perhaps via PAM).

This addresses the shortcomings of using peer as the default mechanism
in initdb.  In a subsequent step, my idea would be to make the default
initdb authentication setup to use md5 (or scram, tbd.) for both local
and host.


[0]:
https://www.postgresql.org/message-id/29164e47-8dfb-4737-2a61-e67a18f847f3%402ndquadrant.com

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From d121f818cb0364e1ad006de4ae92c7472bc21878 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 6 Aug 2019 20:41:19 +0200
Subject: [PATCH v1] Allow cluster owner to bypass authentication

---
 doc/src/sgml/client-auth.sgml | 29 +++
 doc/src/sgml/config.sgml  | 21 ++
 src/backend/libpq/auth.c  | 22 +-
 src/backend/utils/misc/guc.c  |  9 ++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/libpq/auth.h  |  1 +
 src/test/perl/PostgresNode.pm |  1 +
 7 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index fada7289d4..c84565a115 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -54,6 +54,35 @@ Client Authentication
   database user names and OS user names.
  
 
+ 
+  Cluster Owner Authentication
+
+  
+   When connecting over the Unix-domain socket, if the client user is the same
+   as the user that runs the database server (which is also the same as the
+   owner of the data directory), then access is immediately granted without
+   further checking.  This allows a database cluster owner to connect to
+   their own database server without being subject to the rest of client
+   authentication (described in the rest of this chapter).
+  
+
+  
+   This mechanism is only available on operating systems providing the
+   getpeereid() function, the
+   SO_PEERCRED socket parameter, or similar mechanisms.
+   Currently that includes Linux, most
+   flavors of BSD including
+   macOS, and Solaris.  If it is not available, then cluster
+   owner connections are subject to the normal client authentication.
+  
+
+  
+   This mechanism can be disabled using the configuration parameter .
+  
+ 
+
  
   The pg_hba.conf File
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cdc30fa5e3..df2a08bb16 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1099,6 +1099,27 @@ Authentication

   
  
+
+ 
+  cluster_owner_bypass_auth (boolean)
+  
+   cluster_owner_bypass_auth configuration 
parameter
+  
+  
+  
+   
+If enabled, when connecting over the Unix-domain socket, if the client
+user is the same as the user that runs the database server, then
+access is immediately granted without further checking.  See .  This is enabled by default and
+usually very useful.  A possible reason to turn it off might be if all
+authentication is through PAM with auditing and one wants even cluster
+owner access to go through auditing that way.  Another reason to turn
+this off is to be able to test a pg_hba.conf
+configuration more easily.
+   
+  
+ 
  
  
 
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 0e0a6d8752..af75170606 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -160,11 +160,12 @@ static intCheckCertAuth(Port *port);
 
 
 /*
- * Kerberos and GSSAPI GUCs
+ * GUCs
  *
  */
 char  *pg_krb_server_keyfile;
 bool   pg_krb_caseins_users;
+bool   cluster_owner_bypass_auth;
 
 
 /*
@@ -345,6 +346,25 @@ ClientAuthentication(Port *port)
int status = STATUS_ERROR;
char   *logdetail = NULL;
 
+   /*
+* If connecting over Unix-domain socket and peer uid matches current
+* process uid, then allow connection immediately.
+

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-15 Thread Bruce Momjian
On Wed, Aug 14, 2019 at 09:19:44PM -0400, Bruce Momjian wrote:
> I think there are several directions we can go after all-cluster
> encryption, and it does matter because we would want minimal API
> breakage.  The options are:
> 
> 1)  Allow per-table encyption control to limit encryption overhead,
> though all of WAL still needs to be encrypted;  we could add a
> per-record encyption flag to WAL records to avoid that.
> 
> 2)  Allow user-controlled keys, which are always unlocked, and encrypt
> WAL with one key
> 
> 3)  Encrypt only the user-data portion of pages with user-controlled
> keys.  FREEZE and crash recovery work since only the user data is
> encrypted.  WAL is not encrypted, except for the user-data portion
> 
...
> I don't think #3 is viable since there is too much information leakage,
> particularly for indexes because the tid is visible.

Thinking some more, it might be possible to encrypt the index tid while
still letting crash recovery and the freeze part of vacuum work, which
might be sufficient to allow the user keys to remain locked.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Don't like getObjectDescription results for pg_amop/pg_amproc

2019-08-15 Thread Tom Lane
Alexander Korotkov  writes:
> On Thu, Aug 15, 2019 at 2:08 AM Tom Lane  wrote:
>> Or maybe we're just being too ambitious here and we should discard some of
>> this information.  I'm not really sure that the format_operator result
>> can be included without complete loss of intelligibility.

> Could we discard one pair of types from output?

Yeah, it would help to stop using format_operator and just print the
bare name of the operator.  (format_operator can actually make things
a whole lot worse than depicted in my example, because it may insist
on schema-qualification and double-quoting.)  In principle that could
be ambiguous ... but the pg_amop entry has already been identified fully,
and I don't think it needs to be part of the charter of this printout
to *also* identify the underlying operator with complete precision.

I'm still not sure how to cram the operator name into the output
without using a colon, though.

regards, tom lane




Re: Do not check unlogged indexes on standby

2019-08-15 Thread Andrey Borodin


> 13 Aug 2019, at 20:30, Peter Geoghegan  wrote:
> 
> That's one possibility. When I first designed amcheck it was important
> to be conservative, so I invented a general rule about never acquiring
> multiple buffer locks at once. I still think that that was the correct
> decision for the bt_downlink_check() check (the main extra
> bt_index_parent_check() check), but I think that you're right about
> retrying to verify the sibling links when bt_index_check() is called
> from SQL.
> 
> nbtree will often "couple" buffer locks on the leaf level; it will
> acquire a lock on a leaf page, and not release that lock until it has
> also acquired a lock on the right sibling page (I'm mostly thinking of
> _bt_stepright()). I am in favor of a patch that makes amcheck perform
> sibling link verification within bt_index_check(), by retrying while
> pessimistically coupling buffer locks. (Though I think that that
> should just happen on the leaf level. We should not try to be too
> clever about ignorable/half-dead/deleted pages, to be conservative.)

PFA V1 of this check retry.

Best regards, Andrey Borodin.


0001-In-amcheck-nbtree-do-rightlink-verification-with-loc.patch
Description: Binary data


Re: Change ereport level for QueuePartitionConstraintValidation

2019-08-15 Thread Sergei Kornilov
Hi

> Sergei, can we enlist you to submit a patch for this? Namely reduce the
> log level to DEBUG1 and add a TAP test in src/test/modules/alter_table/
> that verifies that the message is or isn't emitted, as appropriate.

Yes, will do. Probably in a few days.

regards, Sergei




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-15 Thread Bruce Momjian
On Thu, Aug 15, 2019 at 06:10:24PM +0900, Masahiko Sawada wrote:
> On Thu, Aug 15, 2019 at 10:19 AM Bruce Momjian  wrote:
> >
> > On Wed, Aug 14, 2019 at 04:36:35PM +0200, Antonin Houska wrote:
> > > I can work on it right away but don't know where to start.
> >
> > I think the big open question is whether there will be acceptance of an
> > all-cluster encyption feature.  I guess if no one objects, we can move
> > forward.
> >
> 
> I still feel that we need to have per table/tablespace keys although
> it might not be the first implementation. I think the safeness of both
> table/tablespace level and cluster level would be almost the same but
> the former would have an advantage in terms of operation and
> performance.

I assume you are talking about my option #1.  I can see that if you only
need a few tables encrypted, e.g., credit card numbers, it can be excessive
to encrypt the entire cluster.  (I think you would need to encrypt
pg_statistic too.)

The tricky part will be WAL --- if we encrypt all of WAL, the per-table
overhead might be minimal compared to the WAL encryption overhead.  The
better solution would be to add a flag to WAL records to indicate
encrypted entries, but you would then leak when an encryption change
happens, as well as the WAL record length.  (FYI, numeric values have different
lengths, as do character strings.)  I assume we would still use a single
key for all tables/indexes, and one for WAL, plus key rotation
requirements.

I personally would like to see full cluster implemented first to find
out exactly what the overhead is.  As I stated earlier, the overhead of
determining which things to encrypt, both in code complexity, user
interface, and processing overhead, might not be worth it.

I can see why you would think that encrypting less would be easier than
encrypting more, but security boundaries are hard to construct, and
anything that requires a user API, even more so.

> > > At least it should be clear how [2] will retrieve the master key because 
> > > [1]
> > > should not do it in a differnt way. (The GUC cluster_passphrase_command
> > > mentioned in [3] seems viable, although I think [1] uses approach which is
> > > more convenient if the passphrase should be read from console.)
> 
> I think that we can also provide a way to pass encryption key directly
> to postmaster rather than using passphrase. Since it's common that
> user stores keys in KMS it's useful if we can do that.

Why would it not be simpler to have the cluster_passphrase_command run
whatever command-line program it wants?  If you don't want to use a
shell command, create an executable and call that.
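To illustrate, the server side of such a GUC only needs to run the configured
command and read what it prints; a rough sketch follows (the function name and
details are illustrative only, not existing code):

#include <stdio.h>
#include <string.h>

/*
 * Illustrative sketch only: run the configured cluster_passphrase_command
 * and capture the passphrase it prints on stdout.  The command can be a
 * shell pipeline, a wrapper binary that talks to a KMS, or anything else;
 * the server does not need to know which.
 */
static int
run_passphrase_command(const char *command, char *buf, size_t size)
{
    FILE   *fh;
    size_t  len;

    fh = popen(command, "r");
    if (fh == NULL)
        return -1;

    len = fread(buf, 1, size - 1, fh);
    buf[len] = '\0';

    /* strip a trailing newline, if any */
    if (len > 0 && buf[len - 1] == '\n')
        buf[len - 1] = '\0';

    return pclose(fh);
}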

> > > Rotation of
> > > the master key is another thing that both versions of the feature should 
> > > do in
> > > the same way. And of course, the fronend applications need consistent 
> > > approach
> > > too.
> >
> > I don't see the value of an external library for key storage.
> 
> I think that big benefit is that PostgreSQL can seamlessly work with
> external services such as KMS. For instance, when key rotation,
> PostgreSQL can register new key to KMS and use it, and it can remove
> keys when it no longer necessary. That is, it can enable PostgreSQL to
> not only not only getting key from KMS but also registering and
> removing keys. And we also can decrypt MDEK in KMS instead of doing in
> PostgreSQL which is more safety. In addition, once someone create the
> plugin library of an external services individual projects don't need
> to create that.

I think the big win for an external library is when you don't want the
overhead of calling an external program.  For example, we certainly
would not want to call an external program while processing a query.  Do
we have any such requirements for encryption, especially since we only
are going to allow offline mode for encryption mode changes and key
rotation in the first version?

> BTW I've created PoC patch for cluster encryption feature. Attached
> patch set has done some items of TODO list and some of them can be
> used even for finer granularity encryption. Anyway, the implemented
> components are followings:

Nice, thanks.

> * Initialization stuff (initdb support). initdb has new command line
> options: --enc-cipher and --cluster-passphrase-command. --enc-cipher
> option accepts either aes-128 or aes-256 values while
> --cluster-passphrase-command accepts an arbitrary command. ControlFile
> has an integer indicating cluster encryption support, 'off', 'aes-128'
> or 'aes-256'.

Nice.  If we get agreement we want to do this for PG 13, we can start
applying these patches.

> * 3-tier encryption keys. During initdb we create KEK and MDEK and
> write the meta data file(global/pg_kmgr file). When postmaster startup
> it reads the kmgr file, verifies the passphrase using HMAC, unwraps
> MDEK and derives TDEK and WDEK from MDEK. Currently MDEK, TDEK and
> WDEK are stored into shared memory as this is still PoC but we also
> can have them in process local memory.
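To make the quoted 3-tier scheme concrete, here is a rough sketch of how the
wrapping and derivation could fit together; the KDF choices, labels, iteration
count and function names are assumptions for illustration, not necessarily what
the PoC patch does:

#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

#define KEYLEN 32               /* 256-bit keys, matching the aes-256 option */

/*
 * Illustrative sketch of a 3-tier scheme: a KEK derived from the passphrase
 * unwraps the MDEK, and per-purpose keys (TDEK for tables, WDEK for WAL) are
 * derived from the MDEK.
 */
static void
derive_kek(const char *passphrase, const unsigned char *salt, int saltlen,
           unsigned char *kek)
{
    PKCS5_PBKDF2_HMAC(passphrase, strlen(passphrase),
                      salt, saltlen, 100000, EVP_sha256(),
                      KEYLEN, kek);
}

static void
derive_subkey(const unsigned char *mdek, const char *label,
              unsigned char *out)
{
    unsigned int outlen;

    /* e.g. TDEK = HMAC-SHA256(MDEK, "tde:tables"), WDEK = HMAC-SHA256(MDEK, "tde:wal") */
    HMAC(EVP_sha256(), mdek, KEYLEN,
         (const unsigned char *) label, strlen(label),
         out, &outlen);
}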

Uh, I thought 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-15 Thread Bruce Momjian
On Thu, Aug 15, 2019 at 11:24:46AM +0200, Antonin Houska wrote:
> > I think there are several directions we can go after all-cluster
> > encryption,
> 
> I think I misunderstood. What you summarize in
> 
> https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption
> 
> does include
> 
> https://www.postgresql.org/message-id/CAD21AoBjrbxvaMpTApX1cEsO=8N=nc2xVZPB0d9e-VjJ=ya...@mail.gmail.com
> 
> i.e. per-tablespace keys, right? Then the collaboration should be easier than
> I thought.

No, there is a single tables/indexes key and a WAL key, plus keys for
rotation.  I explained why per-tablespace keys don't add much value.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-08-15 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> In hopes of moving this along, I've pushed Ian's last code change,
>> as there seems to be no real argument about that anymore.
>> 
>> As for the doc changes, how about the attached revision of what
>> I wrote previously?  It gives some passing mention to what ALTER
>> SYSTEM will do, without belaboring it or going into things that
>> are really implementation details.

> It's certainly better than what we have now.

Hearing no other comments, I've pushed that and marked this issue closed.

regards, tom lane




Re: pgbench - add \aset to store results of a combined query

2019-08-15 Thread Ibrar Ahmed
On Wed, Jul 10, 2019 at 11:33 AM Fabien COELHO  wrote:

>
> Hello Ibrar,
>
> >>  SELECT 1 AS one \;
> >>  SELECT 2 AS two UNION SELECT 2 \;
> >>  SELECT 3 AS three \aset
> >>
> >> will set both "one" and "three", while "two" is not set because there
> were
> >> two rows. It is a kind of more permissive \gset.
> >
> > Are you sure two is not set :)?
> >
> > SELECT 2 AS two UNION SELECT 2;   -- only returns one row.
> > but
> > SELECT 2 AS two UNION SELECT 10;  -- returns the two rows.
>
> Indeed, my intention was to show an example like the second.
>
> > Is this the expected behavior with \aset?
>
> > In my opinion throwing a valid error like "client 0 script 0 command 0
> > query 0: expected one row, got 2" make more sense.
>
> Hmmm. My intention with \aset is really NOT to throw an error. With
> pgbench, the existence of the variable can be tested later to know whether
> it was assigned or not, eg:
>
>SELECT 1 AS x \;
>-- 2 rows, no assignment
>SELECT 'calvin' AS firstname UNION SELECT 'hobbes' \;
>SELECT 2 AS z \aset
>-- next test is false
>\if :{?firstname}
>  ...
>\endif
>
> The rationale is that one may want to benefit from combined queries (\;)
> which result in less communication thus has lower latency, but still be
> interested in extracting some results.
>
> The question is what to do if the query returns 0 or >1 rows. If an error
> is raised, the construct cannot be used for testing whether there is one
> result or not, eg for a query returning 0 or 1 row, you could not write:
>
>\set id random(1, :number_of_users)
>SELECT firstname AS fn FROM user WHERE id = :id \aset
>\if :{?fn}
>  -- the user exists, proceed with further queries
>\else
>  -- no user, maybe it was removed, it is not an error
>\endif
>
> Another option would be to just assign the value so that
>   - on 0 row no assignment is made, and it can be tested afterwards.
>   - on >1 rows the last (first?) value is kept. I took last so to
> ensure that all results are received.
>
> I think that having some permissive behavior allows to write some more
> interesting test scripts that use combined queries and extract values.
>
> What do you think?
>
> Yes, I think that makes more sense.

> > - With \gset
> >
> > SELECT 2 AS two UNION SELECT 10 \gset
> > INSERT INTO test VALUES(:two,0,0);
> >
> > client 0 script 0 command 0 query 0: expected one row, got 2
> > Run was aborted; the above results are incomplete.
>
> Yes, that is the intented behavior.
>
> > - With \aset
> >
> > SELECT 2 AS two UNION SELECT 10 \aset
> > INSERT INTO test VALUES(:two,0,0);
> > [...]
> > client 0 script 0 aborted in command 1 query 0: ERROR:  syntax error at
> or near ":"
>
> Indeed, the user should test whether the variable was assigned before
> using it if the result is not warranted to return one row.
>
> > The new status of this patch is: Waiting on Author
>
> The attached patch implements the altered behavior described above.
>
> --
> Fabien.



-- 
Ibrar Ahmed


Re: pgbench - add \aset to store results of a combined query

2019-08-15 Thread Ibrar Ahmed
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

The patch passed my review, I have not reviewed the documentation changes.

The new status of this patch is: Ready for Committer


UNION ALL

2019-08-15 Thread Mark Pasterkamp
Dear all,

I was wondering if someone could help me understand what a UNION ALL
actually does.

For my thesis I am using Apache Calcite to rewrite queries into using
materialized views which I then give to a Postgres database.
For some queries, this means that they will be rewritten in a UNION ALL
style query between an expression and a table scan of a materialized view.
However, contrary to what I expected, the UNION ALL query is actually a lot
slower.

As an example, say I have 2 tables: actor and movie. Furthermore, there is
also a foreign key index on movie to actor.
I also have a materialized view with the join of these 2 tables for all
movies <= 2015 called A.
Now, if I want to query all entries in the join between actor and movie, I
would assume that a UNION ALL between the join of actor and movie for
movies >2015 and A is faster than executing the original query.
If I look at the EXPLAIN ANALYZE output, I can certainly see a reduction in
cost up until the UNION ALL part, but that part carries a cost which more
than negates the reduction, to the point where I might as well not use the
existing materialized view.

I have some trouble understanding this phenomenon.
One thought which came to my mind was that perhaps UNION ALL might create a
temporary table containing both result sets, and then do a table scan and
return that result.
This would greatly increase the I/O cost, which could contribute to the problem.
However, I am really not sure what UNION ALL actually does to append both
result sets so I was wondering if someone would be able to help me out with
this.


Mark


Re: UNION ALL

2019-08-15 Thread Tom Lane
Mark Pasterkamp  writes:
> I was wondering if someone could help me understands what a union all
> actually does.

Generally speaking, it runs the first query and then the second query.
You'd really need to provide a lot more detail for anyone to say more
than that.

https://wiki.postgresql.org/wiki/Slow_Query_Questions

regards, tom lane




Re: [PATCH] Implement INSERT SET syntax

2019-08-15 Thread Ibrar Ahmed
Patch conflict with this assertion 
Assert(pstate->p_expr_kind == EXPR_KIND_UPDATE_SOURCE); 

src/backend/parser/parse_expr.c line 1570

The new status of this patch is: Waiting on Author


Re: UNION ALL

2019-08-15 Thread 066ce286
Generally speaking, when executing UNION, a DISTINCT is run afterward on the
result set.

So, if you're sure that each part of the UNION cannot return a row already
returned by another part, you can use UNION ALL and save the cost of the final
implicit DISTINCT.


- Original Message -
From: "Mark Pasterkamp" 
To: pgsql-hackers@lists.postgresql.org
Sent: Thursday, 15 August 2019 20:37:06
Subject: UNION ALL


Dear all, 


I was wondering if someone could help me understands what a union all actually 
does. 


For my thesis I am using Apache Calcite to rewrite queries into using 
materialized views which I then give to a Postgres database. 
For some queries, this means that they will be rewritten in a UNION ALL style 
query between an expression and a table scan of a materialized view. 
However, contrary to what I expected, the UNION ALL query is actually a lot 
slower. 


As an example, say I have 2 tables: actor and movie. Furthermore, there is also 
a foreign key index on movie to actor. 
I also have a materialized view with the join of these 2 tables for all movies 
<= 2015 called A. 
Now, if I want to query all entries in the join between actor and movie, I 
would assume that a UNION ALL between the join of actor and movie for movies 
>2015 and A is faster than executing the original query.. 
If I look at the explain analyze part, I can certainly see a reduction in cost 
up until the UNION ALL part, which carries a respective cost more than negating 
the cost reduction up to a point where I might as well not use the existing 
materialized view. 


I have some trouble understanding this phenomenon. 
One thought which came to my mind was that perhaps UNION ALL might create a 
temporary table containing both result sets, and then do a table scan and 
return that result. 

this would greatly increase IO cost which could attribute to the problem. 
However, I am really not sure what UNION ALL actually does to append both 
result sets so I was wondering if someone would be able to help me out with 
this. 




Mark




Re: UNION ALL

2019-08-15 Thread Ibrar Ahmed
On Fri, Aug 16, 2019 at 12:16 AM <066ce...@free.fr> wrote:

> Generally speaking, when executing UNION ; a DISTINCT is run afterward on
> the resultset.
>
> So, if you're sure that each part of UNION cannot return a line returned
> by another one, you may use UNION ALL, you'll cut the cost of the final
> implicit DISTINCT.
>
>
> - Mail original -
> De: "Mark Pasterkamp" 
> À: pgsql-hackers@lists.postgresql.org
> Envoyé: Jeudi 15 Août 2019 20:37:06
> Objet: UNION ALL
>
>
> Dear all,
>
>
> I was wondering if someone could help me understands what a union all
> actually does.
>
>
> For my thesis I am using Apache Calcite to rewrite queries into using
> materialized views which I then give to a Postgres database.
> For some queries, this means that they will be rewritten in a UNION ALL
> style query between an expression and a table scan of a materialized view.
> However, contrary to what I expected, the UNION ALL query is actually a
> lot slower.
>
>
> As an example, say I have 2 tables: actor and movie. Furthermore, there is
> also a foreign key index on movie to actor.
> I also have a materialized view with the join of these 2 tables for all
> movies <= 2015 called A.
> Now, if I want to query all entries in the join between actor and movie, I
> would assume that a UNION ALL between the join of actor and movie for
> movies >2015 and A is faster than executing the original query..
> If I look at the explain analyze part, I can certainly see a reduction in
> cost up until the UNION ALL part, which carries a respective cost more than
> negating the cost reduction up to a point where I might as well not use the
> existing materialized view.
>
>
> I have some trouble understanding this phenomenon.
> One thought which came to my mind was that perhaps UNION ALL might create
> a temporary table containing both result sets, and then do a table scan and
> return that result.
>
> this would greatly increase IO cost which could attribute to the problem.
> However, I am really not sure what UNION ALL actually does to append both
> result sets so I was wondering if someone would be able to help me out with
> this.
>
>
>
>
> Mark
>
>
066ce...@free.fr: Please avoid top-posting. It makes it harder to follow the
discussion.

-- 
Ibrar Ahmed


Re: Extension development

2019-08-15 Thread Tomas Vondra

On Thu, Aug 15, 2019 at 06:58:07AM +, Yonatan Misgan wrote:

Hello, I am trying to develop a calendar extension for PostgreSQL, but I am
having difficulty working out how to get the day, month and year from the
PostgreSQL source code: it uses DateADT as its data type, and a DateADT holds
only a total number of days. So how can I get just the day, month or year?
For example, the code below is the PostgreSQL source code that returns the
current date.
/*
 * GetSQLCurrentDate -- implements CURRENT_DATE
 */
DateADT
GetSQLCurrentDate(void)
{
    TimestampTz ts;
    struct pg_tm tt,
               *tm = &tt;
    fsec_t      fsec;
    int         tz;

    ts = GetCurrentTransactionStartTimestamp();

    if (timestamp2tm(ts, &tz, tm, &fsec, NULL, NULL) != 0)
        ereport(ERROR,
                (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
                 errmsg("timestamp out of range")));

    return date2j(tm->tm_year, tm->tm_mon, tm->tm_mday) - POSTGRES_EPOCH_JDATE;
}
From this source code, how can I get only the year, so I can convert it to my
own calendar's year? I need this because the Ethiopian calendar is totally
different from the Gregorian calendar in terms of day, month and year.



I think you might want to look at the timestamptz_part() function in
timestamp.c. That's what's behind the date_part() SQL function, which seems
to do the sort of thing you need.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-08-15 Thread Tomas Vondra

On Wed, Aug 14, 2019 at 07:25:10PM +1200, David Rowley wrote:

On Thu, 25 Jul 2019 at 05:49, Tom Lane  wrote:

On the whole, I don't especially like this approach, because of the
confusion between peak lock count and end-of-xact lock count.  That
seems way too likely to cause problems.


Thanks for having a look at this.  I've not addressed the points
you've mentioned due to what you mention above.  The only way I can
think of so far to resolve that would be to add something to track
peak lock usage.  The best I can think of to do that, short of adding
something to dynahash.c is to check how many locks are held each time
we obtain a lock, then if that count is higher than the previous time
we checked, then update the maximum locks held, (probably a global
variable).   That seems pretty horrible to me and adds overhead each
time we obtain a lock, which is a pretty performance-critical path.



Would it really be a measurable overhead? I mean, we only really need
one int counter, and you don't need to do the check on every lock
acquisition - you just need to recheck on the first lock release. But
maybe I'm underestimating how expensive it is ...

Talking about dynahash - doesn't it already track this information?
Maybe not directly but surely it has to track the number of entries in
the hash table, in order to compute fill factor. Can't we piggy-back on
that and track the highest fill-factor for a particular period of time?
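To illustrate the scale of what is being discussed, here is a sketch of the
single-counter idea; all names are illustrative, not existing PostgreSQL
symbols:

#include <stdbool.h>

/*
 * Sketch of the "one int counter" idea.  The counter piggy-backs on lock
 * acquisition/release, and the peak is consulted once at transaction end.
 */
static int  held_locks = 0;         /* current number of local lock entries */
static int  peak_held_locks = 0;    /* high-water mark for this transaction */

static void
track_lock_acquired(void)
{
    held_locks++;
    if (held_locks > peak_held_locks)
        peak_held_locks = held_locks;   /* one compare + one store per acquire */
}

static void
track_lock_released(void)
{
    held_locks--;
}

/* At transaction end: decide whether to rebuild/shrink the local lock table. */
static bool
should_shrink_lock_table(int threshold)
{
    bool    shrink = (peak_held_locks > threshold);

    peak_held_locks = held_locks;       /* reset for the next transaction */
    return shrink;
}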


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: BF failure: could not open relation with OID XXXX while querying pg_views

2019-08-15 Thread Tomas Vondra

On Wed, Aug 14, 2019 at 05:24:26PM +1200, Thomas Munro wrote:

On Wed, Aug 14, 2019 at 5:06 PM Tom Lane  wrote:

Oh, hmm --- yeah, that should mean it's safe.  Maybe somebody incautiously
changed one of the other tests that run concurrently with "rules"?


Looks like stats_ext.sql could be the problem.  It creates and drops
priv_test_view, not in a schema.  Adding Dean, author of commit
d7f8d26d.



Yeah, that seems like it might be the cause. I'll take a look at fixing
this, probably by creating the view in a different schema.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Extension development

2019-08-15 Thread Chapman Flack
On 08/15/19 02:58, Yonatan Misgan wrote:

> From this source code how can I get only the year to convert my own
> calendar year.  I need this because Ethiopian calendar is totally differ
> from GC in terms of day, month and year.

I find myself wondering whether getting only the year is sufficient to
do the conversion. There is already an Ethiopic calendar available for
Java (not included, but in org.threeten.extra[1]), and it seems to say
the years do not align precisely with Gregorian years (as I would not
have expected anyway):

"Dates are aligned such that 0001-01-01 (Ethiopic) is 0008-08-27 (ISO)."

So it seems more likely that you would need a calculation involving the
year, month, and day ... or even that the Julian day number already
stored in PostgreSQL could be the most convenient starting point for
the arithmetic you need.
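As a minimal illustration of that starting point, an extension function can
recover the Julian day number and the Gregorian components from a DateADT
using the existing date2j()/j2date() and POSTGRES_EPOCH_JDATE facilities; the
function name below and everything after the j2date() call are only a sketch:

#include "postgres.h"
#include "fmgr.h"
#include "datatype/timestamp.h"     /* POSTGRES_EPOCH_JDATE */
#include "utils/date.h"             /* DateADT, PG_GETARG_DATEADT */
#include "utils/datetime.h"         /* j2date() */

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(gregorian_year_of);

/*
 * Example only: recover the Julian day number and the Gregorian
 * year/month/day from a date value.  An Ethiopic conversion would replace
 * the last step with its own arithmetic on jd (or on y/m/d).
 */
Datum
gregorian_year_of(PG_FUNCTION_ARGS)
{
    DateADT     date = PG_GETARG_DATEADT(0);
    int         jd = date + POSTGRES_EPOCH_JDATE;   /* Julian day number */
    int         year,
                month,
                mday;

    j2date(jd, &year, &month, &mday);

    PG_RETURN_INT32(year);
}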

It's possible you might want to crib some of the algorithm from the
threeten-extra Ethiopic date sources [2]. It would need adjustment for
the PostgreSQL epoch being Gregorian year 2000 rather than Java's 1970
(a simple constant offset), and for PostgreSQL using a Julian day number
rather than java.time's proleptic Gregorian (a difference changing by three
days every 400 years).

Another option would be to take advantage of PL/Java and directly use
the threeten-extra Ethiopic calendar.

Regards,
-Chap


[1]
https://www.threeten.org/threeten-extra/apidocs/org.threeten.extra/org/threeten/extra/chrono/EthiopicDate.html

[2]
https://github.com/ThreeTen/threeten-extra/blob/master/src/main/java/org/threeten/extra/chrono/EthiopicDate.java




Re: Extension development

2019-08-15 Thread Thomas Munro
On Fri, Aug 16, 2019 at 10:55 AM Chapman Flack  wrote:
> On 08/15/19 02:58, Yonatan Misgan wrote:
> > From this source code how can I get only the year to convert my own
> > calendar year.  I need this because Ethiopian calendar is totally differ
> > from GC in terms of day, month and year.
>
> I find myself wondering whether getting only the year is sufficient to
> do the conversion. There is already an Ethiopic calendar available for
> Java (not included, but in org.threeten.extra[1]), and it seems to say
> the years do not align precisely with Gregorian years (as I would not
> have expected anyway):

I can't think of a single reason not to use ICU for this.  It will
handle every kind of conversion you could need here, it's rock solid
and debugged and established, it'll handle 10+ different calendars
(not just Ethiopic), and it's already linked into PostgreSQL in most
deployments.

Obviously if Yonatan wants to write his own calendar logic that's
cool, but if we're talking about something that might eventually be
part of PostgreSQL core, or a contrib module shipped with PostgreSQL,
or even a widely used popular extension shipped separately, I would
bet on ICU rather than new hand rolled algorithms.

> "Dates are aligned such that 0001-01-01 (Ethiopic) is 0008-08-27 (ISO)."
>
> So it seems more likely that you would need a calculation involving the
> year, month, and day ... or even that the Julian day number already
> stored in PostgreSQL could be the most convenient starting point for
> the arithmetic you need.

Indeed.  I think you should convert between our internal day number
and date components (year, month, day) + various string formats
derived from them, and likewise for the internal microsecond number
that we use for timestamps.  I think it's probably a mistake to start
from the Gregorian y, m, d components and convert to Ethiopic y, m, d
(conversion algorithms that start from those components are more
suitable for humans; PostgreSQL already has days numbered sequentially
along one line, not y, m, d).  Or just let ICU do it.

-- 
Thomas Munro
https://enterprisedb.com




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-15 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Aug 15, 2019 at 11:24:46AM +0200, Antonin Houska wrote:
> > > I think there are several directions we can go after all-cluster
> > > encryption,
> > 
> > I think I misunderstood. What you summarize in
> > 
> > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption
> > 
> > does include
> > 
> > https://www.postgresql.org/message-id/CAD21AoBjrbxvaMpTApX1cEsO=8N=nc2xVZPB0d9e-VjJ=ya...@mail.gmail.com
> > 
> > i.e. per-tablespace keys, right? Then the collaboration should be easier 
> > than
> > I thought.
> 
> No, there is a single tables/indexes key and a WAL key, plus keys for
> rotation.  I explained why per-tablespace keys don't add much value.

Nothing in the discussion that I've seen, at least, has changed my
opinion that tablespace-based keys *would* add significant value,
particularly if it'd be difficult to support per-table keys.  Of course,
if we can get per-table keys without too much difficulty then that would
be better.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-15 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Aug 15, 2019 at 06:10:24PM +0900, Masahiko Sawada wrote:
> > On Thu, Aug 15, 2019 at 10:19 AM Bruce Momjian  wrote:
> > >
> > > On Wed, Aug 14, 2019 at 04:36:35PM +0200, Antonin Houska wrote:
> > > > I can work on it right away but don't know where to start.
> > >
> > > I think the big open question is whether there will be acceptance of an
> > > all-cluster encyption feature.  I guess if no one objects, we can move
> > > forward.
> > 
> > I still feel that we need to have per table/tablespace keys although
> > it might not be the first implementation. I think the safeness of both
> > table/tablespace level and cluster level would be almost the same but
> > the former would have an advantage in terms of operation and
> > performance.
> 
> I assume you are talking about my option #1.  I can see if you only need
> a few tables encrypted, e.g., credit card numbers, it can be excessive
> to encrypt the entire cluster.  (I think you would need to encrypt
> pg_statistic too.)

Or we would need a separate encrypted pg_statistic, or a way to encrypt
certain entries inside pg_statistic.

> The tricky part will be WAL --- if we encrypt all of WAL, the per-table
> overhead might be minimal compared to the WAL encryption overhead.  The
> better solution would be to add a flag to WAL records to indicate
> encrypted entries, but you would then leak when an encryption change
> happens and WAL record length.  (FYI, numeric values have different
> lengths, as do character strings.)  I assume we would still use a single
> key for all tables/indexes, and one for WAL, plus key rotation
> requirements.

I don't think the fact that a change was done to an encrypted blob is an
actual 'leak'- anyone can tell that by looking at the encrypted
data before and after.  Further, the actual change would be encrypted,
right?  Length of data is necessary to include in the vast majority of
cases that the data is being dealt with and so I'm not sure that it
makes sense for us to be worrying about that as a leak, unless you have
a specific recommendation from a well known source discussing that
concern..?

> I personally would like to see full cluster implemented first to find
> out exactly what the overhead is.  As I stated earlier, the overhead of
> determining which things to encrypt, both in code complexity, user
> interface, and processing overhead, might not be worth it.

I disagree with this and feel that the overhead that's being discussed
here (user interface, figuring out if we should encrypt it or not,
processing overhead for those determinations) is along the lines of
UNLOGGED tables, yet there wasn't any question about if that was a valid
or useful feature to implement.  The biggest challenge here is really
around key management and I agree that's difficult but it's also really
important and something that we need to be thinking about- and thinking
about how to work with multiple keys and not just one.  Building in an
assumption that we will only ever work with one key would make this
capability nothing more than DBA-managed filesystem-level encryption
(though even there different tablespaces could have different keys...)
and I worry would make later work to support multiple keys more
difficult and less likely to actually happen.  It's also not clear to me
why we aren't building in *some* mechanism to work with multiple keys
from the start as part of the initial design.

> I can see why you would think that encrypting less would be easier than
> encrypting more, but security boundaries are hard to construct, and
> anything that requires a user API, even more so.

I'm not sure I'm following here- I'm pretty sure everyone understands
that selective encryption will require more work to implement, in part
because an API needs to be put in place and we need to deal with
multiple keys, etc.  I don't think anyone thinks that'll be "easier".

> > > > At least it should be clear how [2] will retrieve the master key 
> > > > because [1]
> > > > should not do it in a differnt way. (The GUC cluster_passphrase_command
> > > > mentioned in [3] seems viable, although I think [1] uses approach which 
> > > > is
> > > > more convenient if the passphrase should be read from console.)
> > 
> > I think that we can also provide a way to pass encryption key directly
> > to postmaster rather than using passphrase. Since it's common that
> > user stores keys in KMS it's useful if we can do that.
> 
> Why would it not be simpler to have the cluster_passphrase_command run
> whatever command-line program it wants?  If you don't want to use a
> shell command, create an executable and call that.

Having direct integration with a KMS would certainly be valuable, and I
don't see a reason to deny users that option if someone would like to
spend time implementing it- in addition to a simpler mechanism such as a
passphrase command, which I believe is what was being suggested here.

> > > > Rot

Re: [PATCH] Implement INSERT SET syntax

2019-08-15 Thread Gareth Palmer
Hi Ibrar,

> On 16/08/2019, at 7:14 AM, Ibrar Ahmed  wrote:
> 
> The patch conflicts with this assertion 
> Assert(pstate->p_expr_kind == EXPR_KIND_UPDATE_SOURCE); 
> 
> src/backend/parser/parse_expr.c line 1570
> 
> The new status of this patch is: Waiting on Author

Thank-you for reviewing the patch.

Attached is version 2 of the patch that fixes the above by allowing
p_expr_kind to be EXPR_KIND_VALUES_SINGLE as well.


Gareth



insert-set-v2.patch
Description: Binary data


Re: Add "password_protocol" connection parameter to libpq

2019-08-15 Thread Stephen Frost
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Wed, 2019-08-14 at 11:38 +0900, Michael Paquier wrote:
> > What I got in mind was a comma-separated list of authorized protocols
> > which can be specified as a connection parameter, which extends to
> > all
> > the types of AUTH_REQ requests that libpq can understand, plus an
> > extra for channel binding.  I also liked the idea mentioned upthread
> > of "any" to be an alias to authorize everything, which should be the
> > default.  So you basically get at that:
> > auth_protocol = {any,password,md5,scram-sha-256,scram-sha-256-plus,krb5,gss,sspi}
> 
> What about something corresponding to the auth methods "trust" and
> "cert", where no authentication request is sent? That's a funny case,
> because the server trusts the client; but that doesn't imply that the
> client trusts the server.

I agree that "trust" is odd.  If you want my 2c, we should have to
explicitly *enable* that for psql, otherwise there's the potential that
a MITM could perform a downgrade attack to "trust" and while that might
not expose a user's password, it could expose other information that the
client ends up sending (INSERTs, UPDATEs, etc).

When it comes to "cert"- there is certainly an authentication that
happens and we already have options in psql/libpq to require validation
of the server.  If users want that, they should enable it (I wish we
could make it the default too but that's a different discussion...).
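For illustration, a client that insists on both ends up looking roughly
like this (a sketch only- the channel_binding parameter is the proposed
one being discussed here, not something libpq accepts today, and the
host/user/dbname values are made up):

/*
 * Sketch of a libpq client that requires server verification and,
 * assuming the proposed parameter is adopted, channel binding.
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	const char *conninfo =
		"host=db.example.com dbname=postgres user=app "
		"sslmode=verify-full "		/* existing: verify server cert and name */
		"channel_binding=require";	/* proposed parameter under discussion */
	PGconn	   *conn = PQconnectdb(conninfo);

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}
	printf("connected, SSL in use: %s\n", PQsslInUse(conn) ? "yes" : "no");
	PQfinish(conn);
	return 0;
}

With sslmode=verify-full a fake server already fails certificate
validation, and requiring channel binding would close the remaining
SCRAM downgrade gap.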

> This is another reason I don't really like the list. It's impossible to
> make it cleanly map to the auth methods, and there are a few ways it
> could be confusing to the users.

I agree with these concerns, just to be clear.

> Given that we all pretty much agree on the need for the separate
> channel_binding param, the question is whether we want to (a) address
> additional use cases with specific parameters that also justify
> themselves; or (b) have a generic list that is supposed to solve many
> future use cases.
> 
> I vote (a). With (b), the generic list is likely to cause more
> confusion, ugliness, and clients that break needlessly in the future.

Admittedly, one doesn't preclude the other, and so we could move forward
with the channel binding param, and that's fine- but I seriously hope
that someone finds time to work on further improving the ability for
clients to control what happens during authentication as this, imv
anyway, is an area that we are weak in and it'd be great to improve on
it.

Thanks,

Stephen




Re: [PATCH] Implement INSERT SET syntax

2019-08-15 Thread Amit Kapila
On Wed, Jul 17, 2019 at 10:00 AM Gareth Palmer  wrote:
>
> Hello,
>
> Attached is a patch that adds the option of using SET clause to specify
> the columns and values in an INSERT statement in the same manner as that
> of an UPDATE statement.
>
> A simple example that uses SET instead of a VALUES() clause:
>
> INSERT INTO t SET c1 = 'foo', c2 = 'bar', c3 = 'baz';
>
> Values may also be sourced from a CTE using a FROM clause:
>
> WITH x AS (
>   SELECT 'foo' AS c1, 'bar' AS c2, 'baz' AS c3
> )
> INSERT INTO t SET c1 = x.c1, c2 = x.c2, c3 = x.c3 FROM x;
>
> The advantage of using the SET clause style is that the column and value
> are kept together, which can make changing or removing a column or value from
> a large list easier.
>
> Internally the grammar parser converts INSERT SET without a FROM clause into
> the equivalent INSERT with a VALUES clause. When using a FROM clause it 
> becomes
> the equivalent of INSERT with a SELECT statement.
>
> There was a brief discussion regarding INSERT SET on pgsql-hackers in late
> August 2009 [1].
>
> INSERT SET is not part of any SQL standard (that I am aware of), however this
> syntax is also implemented by MySQL [2]. Their implementation does not support
> specifying a FROM clause.
>

I think this can be a handy feature in some cases, as you pointed out,
but do we really want it for PostgreSQL?  In the last round of
discussions that you pointed to, there didn't seem to be a consensus
that we want this feature.  I guess before spending too much time on
reviewing this feature, we should first build a consensus on whether
we need it.

Along with users, I request some senior hackers/committers to also
weigh in about the desirability of this feature.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> > > - I think there's two fairly fundamental, and related, problems with
> > >   the sequence outlined above:
> > >
> > >   - We can't search for target buffers to store undo data, while holding
> > > the "table" page content locked.
> > >
> > > The easy way to solve would be to require sites performing UNDO
> > > logging to acquire victim pages before even acquiring other content
> > > locks. Perhaps the better approach could be for the undo layer to
> > > hold onto a number of clean buffers, and to keep the last page in an
> > > already written to undo log pinned.
> > >
> > >   - We can't search for victim buffers for further undo data while
> > > already holding other undo pages content locked. Doing so means that
> > > while we're e.g. doing IO to clean out the new page, old undo data
> > > on the previous page can't be read.
> > >
> > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
> > > pinning and locking buffers, it'd need to be split into two
> > > operations. One that acquires buffers and pins them, and one that
> > > locks them.  I think it's quite possible that the locking operation
> > > could just be delayed until InsertPreparedUndo().  But if we solve
> > > the above problem, most of this might already be solved.
> >
> > Basically, that means
> > - the caller should call PreparedUndoInsert before acquiring table
> > page content lock right? because the PreparedUndoInsert just compute
> > the size, allocate the space and pin+lock the buffers and for pinning
> > the buffers we must compute the size and allocate the space using undo
> > storage layer.
>
> I don't think we can normally pin the undo buffers properly at that
> stage. Without knowing the correct contents of the table page - which we
> can't know without holding some form of lock preventing modifications -
> we can't know how big our undo records are going to be. And we can't
> just have buffers that don't exist on disk in shared memory, and we
> don't want to allocate undo that we then don't need. So I think what
> we'd have to do at that stage, is to "pre-allocate" buffers for the
> maximum amount of UNDO needed, but mark the associated bufferdesc as not
> yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> would not be set.
>
> So at the start of a function that will need to insert undo we'd need to
> pre-reserve the maximum number of buffers we could potentially
> need. That reservation stage would

Maybe we can provide an interface where the caller will input the max
prepared undo (maybe BeginUndoRecordInsert) and based on that we can
compute the max number of buffers we could potentially need for this
particular operation.  Most operations (insert/update/delete) will
need 1 or 2 undo records, so we can avoid pinning a very high number of
buffers in most cases.  Currently, only the multi-insert
implementation of zheap might need multiple undo records (1 undo
record per range of records).
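Roughly along these lines, as a toy sketch (all names and constants
here are made up; it is only meant to show the shape of the interface,
not the actual undo-layer code):

/*
 * Toy model: the caller declares up front how many undo records it will
 * prepare, and we derive a worst-case number of buffers to pre-reserve.
 */
#include <stdio.h>

#define BLCKSZ					8192
#define MAX_UNDO_RECORD_SIZE	1024	/* illustrative worst case per record */

typedef struct UndoInsertContext
{
	int			max_prepared_undo;	/* declared by the caller */
	int			max_buffers;		/* derived worst-case buffer count */
	int			nprepared;
} UndoInsertContext;

static void
BeginUndoRecordInsert(UndoInsertContext *ctx, int max_prepared_undo)
{
	int			max_bytes = max_prepared_undo * MAX_UNDO_RECORD_SIZE;

	ctx->max_prepared_undo = max_prepared_undo;
	/* each record can at worst straddle one extra page boundary */
	ctx->max_buffers = max_bytes / BLCKSZ + max_prepared_undo + 1;
	ctx->nprepared = 0;

	/* the real code would pin (but not lock) ctx->max_buffers buffers here */
}

int
main(void)
{
	UndoInsertContext ctx;

	/* a plain insert/update/delete would typically declare 1 or 2 */
	BeginUndoRecordInsert(&ctx, 2);
	printf("would pre-reserve at most %d undo buffers\n", ctx.max_buffers);
	return 0;
}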

> a) pin the page with the current end of the undo
> b) if needed pin the page of older undo that we need to update (e.g. to
>update the next pointer)
> c) perform clock sweep etc to acquire (find or create) enough clean buffers to
>hold the maximum amount of undo needed. These buffers would be marked
>as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
>
> I assume that we'd make a) cheap by keeping it pinned for undo logs that
> a backend is actively attached to. b) should only be needed once in a
> transaction, so it's not too bad. c) we'd probably need to amortize
> across multiple undo insertions, by keeping the unused buffers pinned
> until the end of the transaction.
>
> I assume that having the infrastructure c) might also make some code
> for already in postgres easier. There's obviously some issues around
> guaranteeing that the maximum number of such buffers isn't high.
>
>
> > - So basically, if we delay the lock till InsertPreparedUndo and call
> > PrepareUndoInsert before acquiring table page content lock this
> > problem is solved?
> >
> > Although I haven't yet analyzed the AM specific part that whether it's
> > always possible to call the PrepareUndoInsert(basically getting all
> > the undo record ready) before the page content lock.  But, I am sure
> > that won't be much difficult part.
>
> I think that is somewhere between not possible, and so expensive in a
> lot of cases that we'd not want to do it anyway. You'd at least have to
> first acquire a content lock on the page, mark the target tuple as
> locked, then unlock the page, reserve undo, lock the table page,
> actually update it.
>
>
> > >   - When reading an undo record, the whole stage of UnpackUndoData()
> > > reading data into the UndoPackContext is omitted, reading directly
> > > into t

Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
>
> On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:

> >   I think that batch reading should just copy the underlying data into a
> >   char* buffer. Only the records that currently are being used by
> >   higher layers should get exploded into an unpacked record. That will
> >   reduce memory usage quite noticably (and I suspect it also drastically
> >   reduce the overhead due to a large context with a lot of small
> >   allocations that then get individually freed).
>
> Ok, I got your idea.  I will analyze it further and work on this if
> there is no problem.

I think there is one problem: currently, while unpacking the undo
record, if the record is compressed (i.e. some of the fields do not
exist in the record) then we read those fields from the first record
on the page.  But, if we just memcpy the undo pages to the buffers and
delay the unpacking until it's needed, it seems that we would need to
know the page boundary and also the offset of the first complete
record on the page from where we can get that information (which is
currently in the undo page header).

Even if we leave this issue aside for now, I am not very clear what
benefit you are seeing in the approach you are describing compared to
the way I am doing it now.

a) Is it the multiple palloc? If so then we can allocate memory at
once and flatten the undo records in that.  Earlier, I was doing that
but we need to align each unpacked undo record so that we can access
them directly and based on Robert's suggestion I have modified it to
multiple palloc.
b) Is it the memory size problem that the unpacked undo record will take
more memory compared to the packed record?
c) Do you think that we will not need to unpack all the records?  But,
I think eventually, at the higher level we will have to unpack all the
undo records ( I understand that it will be one at a time)

Or am I completely missing something here?

>
> >   That will make the
> >   sorting of undo a bit more CPU inefficient, because individual records
> >   will need to be partially unpacked for comparison, but I think that's
> >   going to be a far smaller loss than the win.
> Right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-15 Thread Masahiko Sawada
On Fri, Aug 16, 2019 at 10:01 AM Stephen Frost  wrote:
>
> Greetings,
>
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Thu, Aug 15, 2019 at 06:10:24PM +0900, Masahiko Sawada wrote:
> > > On Thu, Aug 15, 2019 at 10:19 AM Bruce Momjian  wrote:
> > > >
> > > > On Wed, Aug 14, 2019 at 04:36:35PM +0200, Antonin Houska wrote:
> > > > > I can work on it right away but don't know where to start.
> > > >
> > > > I think the big open question is whether there will be acceptance of an
> > > > all-cluster encyption feature.  I guess if no one objects, we can move
> > > > forward.
> > >
> > > I still feel that we need to have per table/tablespace keys although
> > > it might not be the first implementation. I think the safeness of both
> > > table/tablespace level and cluster level would be almost the same but
> > > the former would have an advantage in terms of operation and
> > > performance.
> >
> > I assume you are talking about my option #1.  I can see if you only need
> > a few tables encrypted, e.g., credit card numbers, it can be excessive
> > to encrypt the entire cluster.  (I think you would need to encrypt
> > pg_statistic too.)
>
> Or we would need a seperate encrypted pg_statistic, or a way to encrypt
> certain entries inside pg_statistic.

I think we also need to encrypt other system catalogs. For instance
pg_proc might also have sensitive data in the prosrc column. So I think
it's better to encrypt all system catalogs rather than picking out some
catalogs, since it would not be a big overhead. Since system catalogs
are created during CREATE DATABASE by copying files, tablespace-level
or database-level encryption would fit well with system catalog
encryption.

>
> > The tricky part will be WAL --- if we encrypt all of WAL, the per-table
> > overhead might be minimal compared to the WAL encryption overhead.  The
> > better solution would be to add a flag to WAL records to indicate
> > encrypted entries, but you would then leak when an encryption change
> > happens and WAL record length.  (FYI, numeric values have different
> > lengths, as do character strings.)  I assume we would still use a single
> > key for all tables/indexes, and one for WAL, plus key rotation
> > requirements.
>
> I don't think the fact that a change was done to an encrypted blob is an
> actual 'leak'- anyone can tell that by looking at the encrypted
> data before and after.  Further, the actual change would be encrypted,
> right?  Length of data is necessary to include in the vast majority of
> cases that the data is being dealt with and so I'm not sure that it
> makes sense for us to be worrying about that as a leak, unless you have
> a specific recommendation from a well known source discussing that
> concern..?
>
> > I personally would like to see full cluster implemented first to find
> > out exactly what the overhead is.  As I stated earlier, the overhead of
> > determining which things to encrypt, both in code complexity, user
> > interface, and processing overhead, might not be worth it.
>
> I disagree with this and feel that the overhead that's being discussed
> here (user interface, figuring out if we should encrypt it or not,
> processing overhead for those determinations) is along the lines of
> UNLOGGED tables, yet there wasn't any question about if that was a valid
> or useful feature to implement.  The biggest challenge here is really
> around key management and I agree that's difficult but it's also really
> important and something that we need to be thinking about- and thinking
> about how to work with multiple keys and not just one.  Building in an
> assumption that we will only ever work with one key would make this
> capability nothing more than DBA-managed filesystem-level encryption
> (though even there different tablespaces could have different keys...)
> and I worry would make later work to support multiple keys more
> difficult and less likely to actually happen.  It's also not clear to me
> why we aren't building in *some* mechanism to work with multiple keys
> from the start as part of the initial design.
>
> > I can see why you would think that encrypting less would be easier than
> > encrypting more, but security boundaries are hard to construct, and
> > anything that requires a user API, even more so.
>
> I'm not sure I'm following here- I'm pretty sure everyone understands
> that selective encryption will require more work to implement, in part
> because an API needs to be put in place and we need to deal with
> multiple keys, etc.  I don't think anyone thinks that'll be "easier".
>
> > > > > At least it should be clear how [2] will retrieve the master key 
> > > > > because [1]
> > > > > should not do it in a different way. (The GUC 
> > > > > cluster_passphrase_command
> > > > > mentioned in [3] seems viable, although I think [1] uses approach 
> > > > > which is
> > > > > more convenient if the passphrase should be read from console.)
> > >
> > > I think that we can also provide a way to pass

Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Thomas Munro
Hi Amit,

I've combined three of your messages into one below, and responded
inline.  New patch set to follow shortly, with the fixes listed below
(and others from other reviewers).

On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila  wrote:
> 0003-Add-undo-log-manager.patch
> 1.
> allocate_empty_undo_segment()
...
> + /* create two parents up if not exist */
> + parentdir = pstrdup(undo_path);
> + get_parent_directory(parentdir);
> + get_parent_directory(parentdir);
> + /* Can't create parent and it doesn't already exist? */
> + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
>
>
> All of this code is almost same as we have code in
> TablespaceCreateDbspace still we have small differences like here you
> are using mkdir instead of MakePGDirectory which as far as I can see
> use similar permissions for creating directory.  Also, it checks
> whether the directory exists before trying to create it.  Is there a
> reason why we need to do a few things differently here?  If not, can
> both places use one common function?

Right, MakePGDirectory() arrived in commit da9b580d8990, and should
probably be used everywhere we create directories under pgdata.
Fixed.
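For readers following along, the pattern in question boils down to this
standalone sketch (plain mkdir() here purely for illustration- the
in-tree code now goes through MakePGDirectory(), and the path is a
placeholder):

/*
 * Standalone sketch of "create the parent directories if they do not
 * already exist", matching the EEXIST handling in the quoted code.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

static int
mkdir_if_missing(const char *path)
{
	if (mkdir(path, S_IRWXU) < 0 && errno != EEXIST)
	{
		fprintf(stderr, "could not create directory \"%s\": %s\n",
				path, strerror(errno));
		return -1;
	}
	return 0;
}

int
main(void)
{
	/* create two parents up if they don't exist, as in the quoted code */
	if (mkdir_if_missing("base") == 0 &&
		mkdir_if_missing("base/undo") == 0)
		printf("parent directories are in place\n");
	return 0;
}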

Yeah, I think we could just use TablespaceCreateDbspace() for this, if
we are OK with teaching GetDatabasePath() and GetRelationPath() about
how to make special undo paths, OR we are OK with just using
"standard" paths, where undo files just live under database 9 (instead
of the special "undo" directory).  I stopped using a "9" directory in
earlier versions because undo moved to a separate namespace when we
agreed to use an extra discriminator in buffer tags and so forth; now
that we're back to using database number 9, the question of whether to
reflect that on the filesystem is back.  I have had some trouble
deciding which parts of the system should treat undo logs as some kind
of 'relation' (and the SLRU project will have to tackle the same
questions).  I'll think about that some more before making the change.

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }

> The comment in allocate_empty_undo_segment indicates that the code
> wants to flush before checkpoint, but the actual function tries to
> register the request with checkpointer.  Shouldn't this be similar to
> XLogFileInit where we use pg_fsync to flush the contents immediately?

I responded to the general question about when we sync files in an
earlier email.  I've updated the comments to make it clearer that it's
handing the work off, not doing it now.

> Another thing is that recently in commit 475861b261 (commit by you),
> we have introduced a mechanism to not fill the files with zeros for
> certain filesystems like ZFS.  Do we want similar behavior for undo
> files?

Good point.  I will create a separate thread to discuss the creation of
a central file allocation routine (possibly with a GUC), and see if we
can come up with something reusable for this, but independently
committable.
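As a starting point for that thread, the trade-off is roughly the
following (standalone sketch only; the names, segment size and
init_zero flag are placeholders, not a proposal for the actual
interface):

/*
 * Sketch of a central "allocate an empty segment file" helper with an
 * option to skip zero-filling (useful on copy-on-write filesystems such
 * as ZFS, where pre-writing zeroes buys nothing).
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SEGMENT_SIZE (1024 * 1024)	/* placeholder segment size */

static int
allocate_empty_segment(const char *path, int init_zero)
{
	int			fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
	{
		perror("open");
		return -1;
	}

	if (init_zero)
	{
		/* force the filesystem to allocate every block now */
		char		zeroes[8192] = {0};
		size_t		written = 0;

		while (written < SEGMENT_SIZE)
		{
			if (write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes))
			{
				perror("write");
				close(fd);
				return -1;
			}
			written += sizeof(zeroes);
		}
	}
	else
	{
		/* just set the size; blocks are allocated lazily on first write */
		if (ftruncate(fd, SEGMENT_SIZE) < 0)
		{
			perror("ftruncate");
			close(fd);
			return -1;
		}
	}
	return fd;
}

int
main(void)
{
	int			fd = allocate_empty_segment("undo_segment.tmp", 1);

	if (fd >= 0)
	{
		close(fd);
		unlink("undo_segment.tmp");
	}
	return 0;
}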

> 3.
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
> +{
> + UndoLogSlot *slot;
> + size_t end;
> +
> + slot = find_undo_log_slot(logno, false);
> +
> + /* TODO review interlocking */
> +
> + Assert(slot != NULL);
> + Assert(slot->meta.end % UndoLogSegmentSize == 0);
> + Assert(new_end % UndoLogSegmentSize == 0);
> + Assert(InRecovery ||
> +CurrentSession->attached_undo_slots[slot->meta.category] == slot);
>
> Can you write some comments explaining the above Asserts?  Also, can
> you explain what interlocking issues are you worried about here?

I added comments about the assertions.  I will come back to the
interlocking in another message, which I've now addressed (alluded to
below as well).

> 4.
> while (end < new_end)
> + {
> + allocate_empty_undo_segment(logno, slot->meta.tablespace, end);
> + end += UndoLogSegmentSize;
> + }
> +
> + /* Flush the directory entries before next checkpoint. */
> + undofile_request_sync_dir(slot->meta.tablespace);
>
> I see that at two places after allocating empty undo segment, the
> patch performs undofile_request_sync_dir whereas it doesn't perform
> the same in UndoLogNewSegment? Is there a reason for the same or is it
> missed from one of the places?

You're right.  Done.

> 5.
> +static void
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
> {
> ..
> /*
> + * We didn't need to acquire the mutex to read 'end' above because only
> + * we write to it.  But we need the mutex to update it, because the
> + * checkpointer might read it concurrently.
>
> Is this assumption correct?  It seems patch also modified
> slot->meta.end during discard in function UndoLogDiscard.  I am
> referring below code:
>
> +UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
> {
> ..
> + /* Update shmem to show the new discard and end pointers. */
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);

Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Thomas Munro
Hi Kuntal,

On Thu, Jul 25, 2019 at 5:40 PM Kuntal Ghosh  wrote:
> Here are some review comments on 0003-Add-undo-log-manager.patch. I've
> tried to avoid duplicate comments as much as possible.

Thanks!  Replies inline.  I'll be posting a new patch set shortly with
these and other fixes.

> 1. In UndoLogAllocate,
> + * time this backend as needed to write to an undo log at all or because
> s/as/has

Fixed.

> + * Maintain our tracking of the and the previous transaction start
> Do you mean current log's transaction start as well?

Right, fixed.

> 2. In UndoLogAllocateInRecovery,
> we try to find the current log from the first undo buffer. So, after a
> log switch, we always have to register at least one buffer from the
> current undo log first. If we're updating something in the previous
> log, the respective buffer should be registered after that. I think we
> should document this in the comments.

I'm not sure I understand. Is this working correctly today?

> 3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full),
> it seems the 'full' parameter is not used anywhere. Do we still need this?
>
> + /* It's been recycled.  SO it must have been entirely discarded. */
> s/SO/So

Fixed.

> 4. In CleanUpUndoCheckPointFiles,
> we can emit a debug2 message with something similar to : 'removed
> unreachable undo metadata files'

Done.

> + if (unlink(path) != 0)
> + elog(ERROR, "could not unlink file \"%s\": %m", path);
> according to my observation, whenever we deal with a file operation,
> we usually emit a ereport message with errcode_for_file_access().
> Should we change it to ereport? There are other file operations as
> well including read(), OpenTransientFile() etc.

Done.

> 5. In CheckPointUndoLogs,
> + /* Capture snapshot while holding each mutex. */
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + serialized[num_logs++] = slot->meta;
> + LWLockRelease(&slot->mutex);
> why do we need an exclusive lock to read something from the slot? A
> share lock seems to be sufficient.

OK.

> pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called
> after pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE)
> without calling pgstat_report_wait_end(). I think you've done the
> same to avoid an extra function call. But, it differs from other
> places in the PG code. Perhaps, we should follow this approach
> everywhere.

Ok, changed.

> 6. In StartupUndoLogs,
> + if (fd < 0)
> + elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
> assuming your agreement upon changing above elog to ereport, the
> message should be more user friendly. May be something like 'cannot
> open pg_undo file'.

Done.

> + if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != 
> sizeof(slot->meta))
> The usage of sizeof doesn't look like a problem. But, we can save
> some extra padding bytes at the end if we use (offsetof + sizeof)
> approach similar to other places in PG.

It currently ends in a 64-bit value, so there is no padding here.

> 7. In free_undo_log_slot,
> + /*
> + * When removing an undo log from a slot in shared memory, we acquire
> + * UndoLogLock, log->mutex and log->discard_lock, so that other code can
> + * hold any one of those locks to prevent the slot from being recycled.
> + */
> + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + Assert(slot->logno != InvalidUndoLogNumber);
> + slot->logno = InvalidUndoLogNumber;
> + memset(&slot->meta, 0, sizeof(slot->meta));
> + LWLockRelease(&slot->mutex);
> + LWLockRelease(UndoLogLock);
> you've not taken the discard_lock as mentioned in the comment.

Right, I was half-way between two different ideas about how that
interlocking should work, but I have straightened this out now, and
will write about the overall locking model separately.

> 8. In find_undo_log_slot,
> + * 1.  If the calling code knows that it is attached to this lock or is the
> s/lock/slot

Fixed.

BTW I am experimenting with macros that would actually make assertions
about those programming rules.

> + * 2.  All other code should acquire log->mutex before accessing any members,
> + * and after doing so, check that the logno hasn't moved.  If it is not, the
> + * entire undo log must be assumed to be discarded (as if this function
> + * returned NULL) and the caller must behave accordingly.
> Perhaps, you meant '..check that the logno remains same. If it is not..'.

Fixed.

> + /*
> + * If we didn't find it, then it must already have been entirely
> + * discarded.  We create a negative cache entry so that we can answer
> + * this question quickly next time.
> + *
> + * TODO: We could track the lowest known undo log number, to reduce
> + * the negative cache entry bloat.
> + */
> This is an interesting thought. But, I'm wondering how we are going to
> search the discarded logno in the simple hash. I guess that's why it's
> in the TODO list.

Done.  Each backend tracks its idea of the lowest undo log that
exists.  There is a s

Re: POC: Cleaning up orphaned files using undo logs

2019-08-15 Thread Andres Freund
Hi,

On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar  wrote:
> >
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> 
> > >   I think that batch reading should just copy the underlying data into a
> > >   char* buffer. Only the records that currently are being used by
> > >   higher layers should get exploded into an unpacked record. That will
> > >   reduce memory usage quite noticably (and I suspect it also drastically
> > >   reduce the overhead due to a large context with a lot of small
> > >   allocations that then get individually freed).
> >
> > Ok, I got your idea.  I will analyze it further and work on this if
> > there is no problem.
> 
> I think there is one problem: currently, while unpacking the undo
> record, if the record is compressed (i.e. some of the fields do not
> exist in the record) then we read those fields from the first record
> on the page.  But, if we just memcpy the undo pages to the buffers and
> delay the unpacking until it's needed, it seems that we would need to
> know the page boundary and also the offset of the first complete
> record on the page from where we can get that information (which is
> currently in the undo page header).

I don't understand why that's a problem?


> Even if we leave this issue aside for now, I am not very clear what
> benefit you are seeing in the approach you are describing compared to
> the way I am doing it now.
> 
> a) Is it the multiple palloc? If so then we can allocate memory at
> once and flatten the undo records in that.  Earlier, I was doing that
> but we need to align each unpacked undo record so that we can access
> them directly and based on Robert's suggestion I have modified it to
> multiple palloc.

Part of it.

> b) Is it the memory size problem that the unpacked undo record will take
> more memory compared to the packed record?

Part of it.

> c) Do you think that we will not need to unpack all the records?  But,
> I think eventually, at the higher level we will have to unpack all the
> undo records ( I understand that it will be one at a time)

Part of it. There's a *huge* difference between having a few hundred to
thousand unpacked records, each consisting of several independent
allocations, in memory and having one large block containing all
packed records in a batch, and a few allocations for the few unpacked
records that need to exist.

There's also d) we don't need separate tiny memory copies while holding
buffer locks etc.

Greetings,

Andres Freund




Re: Useless bms_free() calls in build_child_join_rel()

2019-08-15 Thread Etsuro Fujita
On Thu, Aug 15, 2019 at 8:31 PM Etsuro Fujita  wrote:
> While working on the PWJ patch [1], I noticed $SUBJECT.  They seem to
> be leftovers from the original partitionwise-join patch, perhaps.
> Attached is a patch for removing them.

Pushed.

Best regards,
Etsuro Fujita




Re: Global temporary tables

2019-08-15 Thread Craig Ringer
On Tue, 13 Aug 2019 at 21:50, Konstantin Knizhnik 
wrote:

>>> As far as I understand relpages and reltuples are set only when you
>>> perform "analyze" of the table.
>>
>> Also autovacuum's autoanalyze.
>
> When does it happen?
> I have created a normal table, populated it with some data and then waited
> several hours, but pg_class was not updated for this table.

heap_vacuum_rel() in src/backend/access/heap/vacuumlazy.c below

 * Update statistics in pg_class.

which I'm pretty sure is common to explicit vacuum and autovacuum. I
haven't run up a test to verify 100% but most DBs would never have relpages
etc set if autovac didn't do it since most aren't explicitly VACUUMed at
all.

I thought it was done when autovac ran an analyze, but it looks like it's
all autovac. Try setting very aggressive autovac thresholds and inserting +
deleting a bunch of tuples maybe.

> I attach to this mail slightly refactored versions of these patches with
> fixes of issues reported in your review.
>

Thanks.

Did you have a chance to consider my questions too? I see a couple of
things where there's no patch change, which is fine, but I'd be interested
in your thoughts on the question/issue in those cases.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: proposal: type info support functions for functions that use "any" type

2019-08-15 Thread Pavel Stehule
Hi

rebase

Pavel
diff --git a/contrib/decode/.gitignore b/contrib/decode/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/contrib/decode/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/decode/Makefile b/contrib/decode/Makefile
new file mode 100644
index 00..bb29732c61
--- /dev/null
+++ b/contrib/decode/Makefile
@@ -0,0 +1,20 @@
+# contrib/decode/Makefile
+
+MODULES = decode
+
+EXTENSION = decode
+DATA = decode--1.0.sql
+PGFILEDESC = "decode - example of parser support function"
+
+REGRESS = decode
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/decode
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/decode/decode--1.0.sql b/contrib/decode/decode--1.0.sql
new file mode 100644
index 00..34408809ff
--- /dev/null
+++ b/contrib/decode/decode--1.0.sql
@@ -0,0 +1,26 @@
+/* contrib/decode/decode--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION decode" to load this file. \quit
+
+--
+--  PostgreSQL code for decode.
+--
+
+--
+-- Parser support function - allow to specify returning type when
+-- system with polymorphic variables is possible to use.
+--
+CREATE FUNCTION decode_support(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
+
+--
+-- decode function - example of function that returns "any" type
+--
+CREATE FUNCTION decode(variadic "any")
+RETURNS "any"
+AS 'MODULE_PATHNAME'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE SUPPORT decode_support;
+
diff --git a/contrib/decode/decode.c b/contrib/decode/decode.c
new file mode 100644
index 00..32919bf846
--- /dev/null
+++ b/contrib/decode/decode.c
@@ -0,0 +1,459 @@
+/*
+ * contrib/decode/decode.c
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "catalog/pg_type.h"
+#include "nodes/supportnodes.h"
+#include "parser/parse_coerce.h"
+#include "parser/parse_oper.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(decode_support);
+PG_FUNCTION_INFO_V1(decode);
+
+static void decode_detect_types(int nargs, Oid *argtypes, Oid *search_oid, Oid *result_oid);
+static Oid select_common_type_from_vector(int nargs, Oid *typeids, bool noerror);
+
+typedef struct decode_cache
+{
+	Oid			rettype;
+	int			nargs;
+	Oid		   *argtypes;
+	Oid		   *target_types;
+	Oid			search_typid;
+	Oid			result_typid;
+	CoercionPathType *ctype;
+	FmgrInfo   *cast_finfo;
+	FmgrInfo   *input_finfo;
+	Oid		   *typioparam;
+	FmgrInfo	eqop_finfo;
+} decode_cache;
+
+static void
+free_cache(decode_cache *cache)
+{
+	pfree(cache->argtypes);
+	pfree(cache->target_types);
+	pfree(cache->ctype);
+	pfree(cache->cast_finfo);
+	pfree(cache->input_finfo);
+	pfree(cache->typioparam);
+	pfree(cache);
+}
+
+/*
+ * prepare persistent cache used for fast internal parameter casting
+ */
+static decode_cache *
+build_cache(int nargs,
+			Oid *argtypes,
+			MemoryContext target_ctx)
+{
+	Oid		search_typid;
+	Oid		result_typid;
+	Oid			eqop;
+	decode_cache *cache;
+	MemoryContext		oldctx;
+	int		i;
+
+	oldctx = MemoryContextSwitchTo(target_ctx);
+
+	cache = palloc(sizeof(decode_cache));
+
+	cache->argtypes = palloc(nargs * sizeof(Oid));
+	cache->target_types = palloc(nargs * sizeof(Oid));
+	cache->ctype = palloc(nargs * sizeof(CoercionPathType));
+	cache->cast_finfo = palloc(nargs * sizeof(FmgrInfo));
+	cache->input_finfo = palloc(nargs * sizeof(FmgrInfo));
+	cache->typioparam = palloc(nargs * sizeof(Oid));
+
+	MemoryContextSwitchTo(oldctx);
+
+	decode_detect_types(nargs, argtypes, &search_typid, &result_typid);
+
+	cache->search_typid = search_typid;
+	cache->result_typid = result_typid;
+
+	for (i = 0; i < nargs; i++)
+	{
+		Oid		src_typid;
+		Oid		target_typid;
+
+		src_typid = cache->argtypes[i] = argtypes[i];
+
+		if (i == 0)
+			target_typid = search_typid;
+		else if (i % 2) /* even position */
+		{
+			if (i + 1 < nargs)
+				target_typid = search_typid;
+			else
+				/* last even argument is a default value */
+				target_typid = result_typid;
+		}
+		else /* odd position */
+			target_typid = result_typid;
+
+		cache->target_types[i] = target_typid;
+
+		/* prepare cast if it is necessary */
+		if (src_typid != target_typid)
+		{
+			Oid		funcid;
+
+			cache->ctype[i] = find_coercion_pathway(target_typid, src_typid,
+		COERCION_ASSIGNMENT, &funcid);
+			if (cache->ctype[i] == COERCION_PATH_NONE)
+				/* A previously detected cast is not available now */
+				elog(ERROR, "could not find cast from %u to %u",
+					 src_typid, target_typid);
+
+			if (cache->ctype[i] != COERCION_PATH_RELABELTYPE)
+			{
+				if (cache->ctype[i] == COERCION_PATH_FUNC)
+				{
+					fmgr_info(funcid, &cache->cast_finfo[i]);
+				}
+				else
+				{
+					Oid		outfuncoid;
+					Oi