Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-17 Thread Noah Misch
For two-phase commit, PrepareTransaction() needs to execute pending syncs.

On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
>   /* Remember if it's a system catalog */
>   is_system_catalog = IsSystemRelation(OldHeap);
>  
> - /*
> -  * We need to log the copied data in WAL iff WAL archiving/streaming is
> -  * enabled AND it's a WAL-logged rel.
> -  */
> - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> -
>   /* use_wal off requires smgr_targblock be initially invalid */
>   Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);

Since you're deleting the use_wal variable, update that last comment.

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
>   {
>   SMgrRelation srel;
>  
> -			srel = smgropen(pending->relnode, pending->backend);
> -
> -			/* allocate the initial array, or extend it, if needed */
> -			if (maxrels == 0)
> +			if (pending->dosync)
>  			{
> -				maxrels = 8;
> -				srels = palloc(sizeof(SMgrRelation) * maxrels);
> +				/* Perform pending sync of WAL-skipped relation */
> +				FlushRelationBuffersWithoutRelcache(pending->relnode,
> +													false);
> +				srel = smgropen(pending->relnode, pending->backend);
> +				smgrimmedsync(srel, MAIN_FORKNUM);

This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.
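
For illustration, a minimal sketch of the shape I mean, inside the
pending->dosync branch quoted above (this is not the actual patch, and
whether each fork needs the smgrexists() test is for you to verify):

	/* Sketch: sync every fork of the WAL-skipped relation, not only MAIN_FORKNUM */
	ForkNumber	fork;

	srel = smgropen(pending->relnode, pending->backend);
	for (fork = 0; fork <= MAX_FORKNUM; fork++)
	{
		if (smgrexists(srel, fork))
			smgrimmedsync(srel, fork);
	}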

The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component
not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point."  Please write the part to WAL-log the contents of small files
instead of syncing them.
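
To sketch the shape of that (a sketch only; wal_skip_threshold is a
placeholder name I'm inventing for illustration, and per the point above
the real thing must cover all forks, not just the main fork):

	/*
	 * Sketch: after the relation's buffers have been flushed, WAL-log the
	 * whole file if it is small, otherwise fall back to syncing it.
	 */
	BlockNumber nblocks = smgrnblocks(srel, MAIN_FORKNUM);

	if ((uint64) nblocks * BLCKSZ <= wal_skip_threshold * 1024)
	{
		BlockNumber blkno;

		for (blkno = 0; blkno < nblocks; blkno++)
		{
			PGAlignedBlock buf;

			smgrread(srel, MAIN_FORKNUM, blkno, buf.data);
			log_newpage(&pending->relnode, MAIN_FORKNUM, blkno,
						buf.data, false);
		}
	}
	else
		smgrimmedsync(srel, MAIN_FORKNUM);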

> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
>* If it does commit, we'll have done the table_finish_bulk_insert() at
>* the bottom of this routine first.
>*
> -  * As mentioned in comments in utils/rel.h, the in-same-transaction test
> -  * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> -  * can be cleared before the end of the transaction. The exact case is
> -  * when a relation sets a new relfilenode twice in same transaction, yet
> -  * the second one fails in an aborted subtransaction, e.g.
> -  *
> -  * BEGIN;
> -  * TRUNCATE t;
> -  * SAVEPOINT save;
> -  * TRUNCATE t;
> -  * ROLLBACK TO save;
> -  * COPY ...

The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.

> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -74,11 +74,13 @@ typedef struct RelationData
>  	SubTransactionId rd_createSubid;	/* rel was created in current xact */
>  	SubTransactionId rd_newRelfilenodeSubid;	/* new relfilenode assigned in
>  												 * current xact */
> +	SubTransactionId rd_firstRelfilenodeSubid;	/* new relfilenode assigned
> +												 * first in current xact */

In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed.  Many bits of code need to look at all three,
e.g. RelationClose().  This field needs to be 100% reliable.  In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.
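
As one example of what that audit turns up, the test in RelationClose()
that decides whether a relcache entry can be dropped presumably needs to
consider the new field as well; something along these lines (illustrative
only, not taken from the attached patch):

	if (RelationHasReferenceCountZero(relation) &&
		relation->rd_createSubid == InvalidSubTransactionId &&
		relation->rd_newRelfilenodeSubid == InvalidSubTransactionId &&
		relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
		RelationClearRelation(relation, false);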

nm
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 6ebe75a..d74e9a5 100644
--- 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Stephen Frost
Greetings,

On Sat, Aug 17, 2019 at 18:30 Ahsan Hadi  wrote:

> The current calendar entry for TDE weekly call will not work for EST
> timezone. I will change the invite so we can accommodate people from
> multiple time zones.
>

I appreciate the thought but at least for my part, I already have regular
conference calls after midnight to support Asian and Australian time zones,
so I’m willing to work to support whatever has already been worked out.  (I
also won’t complain about a time that’s more convenient for everyone, of
course.)

Thanks!

Stephen

>


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Ahsan Hadi
The current calendar entry for TDE weekly call will not work for EST
timezone. I will change the invite so we can accommodate people from
multiple time zones.

Stay tuned.


On Sun, 18 Aug 2019 at 2:29 AM, Sehrope Sarkuni  wrote:

> On Sat, Aug 17, 2019 at 12:43 PM Ibrar Ahmed 
> wrote:
>
>> +1 for voice call, bruce we usually have a weekly TDE call.
>>
>
> Please add me to the call as well. Thanks!
>
> Regards,
> -- Sehrope Sarkuni
> Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
>
>


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Sehrope Sarkuni
On Sat, Aug 17, 2019 at 12:43 PM Ibrar Ahmed  wrote:

> +1 for voice call, bruce we usually have a weekly TDE call.
>

Please add me to the call as well. Thanks!

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Recent failures in IsolationCheck deadlock-hard

2019-08-17 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Aug 6, 2019 at 6:18 PM Tom Lane  wrote:
>> Yeah, there have been half a dozen failures since deadlock-parallel
>> went in, mostly on critters that are slowed by CLOBBER_CACHE_ALWAYS
>> or valgrind.  I've tried repeatedly to reproduce that here, without
>> success :-(.  It's unclear whether the failures represent a real
>> code bug or just a problem in the test case, so I don't really want
>> to speculate about fixes till I can reproduce it.

> I managed to reproduce a failure that looks a lot like lousyjack's
> (note that there are two slightly different failure modes):
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2019-08-05%2011:33:02

> I did that by changing the deadlock_timeout values for sessions d1 and
> d2 to just a few milliseconds on my slowest computer, guessing that
> this might be a race involving the deadlock timeout and the time it
> takes for workers to fork and join a lock queue.

Yeah, I eventually managed to reproduce it (not too reliably) by
introducing a randomized delay into parallel worker startup.

The scenario seems to be: some d1a2 worker arrives so late that it's not
accounted for in the initial DeadLockCheck performed by some d2a1 worker.
The other d1a2 workers are released, and run and finish, but the late one
goes to sleep, with a long deadlock_timeout.  If the next DeadLockCheck is
run by e1l's worker, that prefers to release d2a1 workers, which then all
run to completion.  When the late d1a2 worker finally wakes up and runs
DeadLockCheck, *there is no deadlock to resolve*: the d2 session is idle,
not waiting for any lock.  So the worker goes back to sleep, and we sit
till isolationtester times out.

Another way to look at it is that there is a deadlock condition, but
one of the waits-for constraints is on the client side where DeadLockCheck
can't see it.  isolationtester is waiting for d1a2 to complete before it
will execute d1c which would release session d2, so that d2 is effectively
waiting for d1, but DeadLockCheck doesn't know that and thinks that it's
equally good to unblock either d1 or d2.

The attached proposed patch resolves this by introducing another lock
that is held by d1 and then d2 tries to take it, ensuring that the
deadlock detector will recognize that d1 must be released.

I've run several thousand iterations of the test this way without a
problem, where before the MTBF was maybe a hundred or two iterations
with the variable startup delay active.  So I think this fix is good,
but I could be wrong.  One notable thing is that every so often the
test takes ~10s to complete instead of a couple hundred msec.  I think
that what's happening there is that the last deadlock condition doesn't
form until after all of session d2's DeadLockChecks have run, meaning
that we don't spot the deadlock until some other session runs it.  The
test still passes, though.  This is probably fine given that it would
never happen except with platforms that are horridly slow anyway.
Possibly we could shorten the 10s values to make that case complete
quicker, but I'm afraid of maybe breaking things on slow machines.

> Another thing I noticed is that all 4 times I managed to reproduce
> this, the "rearranged to" queue had only two entries; I can understand
> that d1's workers might not feature yet due to bad timing, but it's
> not clear to me why there should always be only one d2a1 worker and
> not more.

I noticed that too, and eventually realized that it's a
max_worker_processes constraint: we have two parallel workers waiting
in e1l and e2l, so if d1a2 takes four, there are only two slots left for
d2a1; and for reasons that aren't totally clear, we don't get to use the
last slot.  (Not sure if that's a bug in itself.)

The attached patch therefore also knocks max_parallel_workers_per_gather
down to 3 in this test, so that we have room for at least 2 d2a1 workers.

regards, tom lane

diff --git a/src/test/isolation/expected/deadlock-parallel.out b/src/test/isolation/expected/deadlock-parallel.out
index 871a80c..cf4d07e 100644
--- a/src/test/isolation/expected/deadlock-parallel.out
+++ b/src/test/isolation/expected/deadlock-parallel.out
@@ -1,10 +1,10 @@
 Parsed test spec with 4 sessions
 
 starting permutation: d1a1 d2a2 e1l e2l d1a2 d2a1 d1c e1c d2c e2c
-step d1a1: SELECT lock_share(1,x) FROM bigt LIMIT 1;
-lock_share 
+step d1a1: SELECT lock_share(1,x), lock_excl(3,x) FROM bigt LIMIT 1;
+lock_share lock_excl  
 
-1  
+1  1  
 step d2a2: select lock_share(2,x) FROM bigt LIMIT 1;
 lock_share 
 
@@ -16,15 +16,19 @@ step d1a2: SET force_parallel_mode = on;
 			  SET parallel_tuple_cost = 0;
 			  SET min_parallel_table_scan_size = 0;
 			  SET parallel_leader_participation = off;
-			  SET max_parallel_workers_per_gather = 4;
+			  SET max_parallel_workers_per_gather = 3;
 			  SELECT sum(lock_share(2,x)) FROM bigt; 
 step d2a1: SET force_parallel_mode = 

Re: max_parallel_workers can't actually be set?

2019-08-17 Thread Andres Freund
Hi,

On 2019-08-17 13:41:18 -0400, Tom Lane wrote:
> Try this:
> alter system set max_parallel_workers = 20;
> and restart the system.
> 
> max_parallel_workers is still 8, according to both SHOW and
> pg_controldata, nor can you launch more than 8 workers.

Hm. I can't reproduce that. I do get whatever number I configure.  Note
also that pg_controldata shows max_worker_processes, not
max_parallel_workers.


> Even odder, if you just do
> 
> regression=# set max_parallel_workers = 200;
> SET
> regression=# show max_parallel_workers;  
>  max_parallel_workers 
> --
>  200
> (1 row)
> 
> which should certainly not happen for a PGC_POSTMASTER parameter.

It's not PGC_POSTMASTER. That's max_worker_processes. The max_parallel_*
GUCs just control how many of max_worker_processes are used for $task.

Greetings,

Andres Freund




Re: max_parallel_workers can't actually be set?

2019-08-17 Thread Tom Lane
Sergei Kornilov  writes:
>> which should certainly not happen for a PGC_POSTMASTER parameter.

> But max_parallel_workers is PGC_USERSET and this behavior seems to be documented:

Argh!  I was looking at max_worker_processes in one window and
max_parallel_workers in another, and failed to see the discrepancy.

Those GUCs are too confusingly named :-(.

regards, tom lane




Re: max_parallel_workers can't actually be set?

2019-08-17 Thread Ibrar Ahmed
On Sat, Aug 17, 2019 at 10:41 PM Tom Lane  wrote:

> Try this:
> alter system set max_parallel_workers = 20;
> and restart the system.
>
> max_parallel_workers is still 8, according to both SHOW and
> pg_controldata, nor can you launch more than 8 workers.
>
> Even odder, if you just do
>
> regression=# set max_parallel_workers = 200;
> SET
> regression=# show max_parallel_workers;
>  max_parallel_workers
> --
>  200
> (1 row)
>
> which should certainly not happen for a PGC_POSTMASTER parameter.
>
> We seem to have an awful lot of mechanism that's concerned with
> adjustments of max_parallel_workers, for something that apparently
> might as well be a compile-time #define ... so I assume it's supposed
> to be changeable at restart and somebody broke it.  But it's not
> working as I'd expect in any branch from 10 onwards.
>
> regards, tom lane
>
>
>
If I understand that correctly it works for me.

postgres=# alter system set max_parallel_workers = 1;
$ pg_ctl restart -l log
$ psql postgres
psql (13devel)
postgres=# explain analyze select * from test where b > 1;


   QUERY PLAN

---
 Gather  (cost=1000.00..98294.84 rows=1 width=8) (actual
time=1635.959..1636.028 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 1
   ->  Parallel Seq Scan on test  (cost=0.00..97294.74 rows=1 width=8)
(actual time=1632.239..1632.239 rows=0 loops=2)
 Filter: (b > 1)
 Rows Removed by Filter: 505

 Planning Time: 0.533 ms
 Execution Time: 1636.080 ms
(8 rows)



postgres=# alter system set max_parallel_workers = 2;
ALTER SYSTEM
postgres=# \q
vagrant@vagrant:~/percona/postgresql$ pg_ctl restart -l log
vagrant@vagrant:~/percona/postgresql$ psql postgres
psql (13devel)
Type "help" for help.
postgres=# explain analyze select * from test where b > 1;
  QUERY PLAN

---
 Gather  (cost=1000.00..98294.84 rows=1 width=8) (actual
time=1622.210..1622.274 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test  (cost=0.00..97294.74 rows=1 width=8)
(actual time=1616.000..1616.000 rows=0 loops=3)
 Filter: (b > 1)
 Rows Removed by Filter: 337
 Planning Time: 0.699 ms
 Execution Time: 1622.325 ms
(8 rows)




--
Ibrar Ahmed


Re: max_parallel_workers can't actually be set?

2019-08-17 Thread Sergei Kornilov
Hello

> Try this:
> alter system set max_parallel_workers = 20;
> and restart the system.
> max_parallel_workers is still 8

Hmm, I got 20 on my local 11.5 and on HEAD.

> which should certainly not happen for a PGC_POSTMASTER parameter.

But max_parallel_workers is PGC_USERSET and this behavior seems to be documented:

> Also, note that a setting for this value which is higher than 
> max_worker_processes will have no effect, since parallel workers are taken 
> from the pool of worker processes established by that setting.

regards, Sergei




Re: max_parallel_workers can't actually be set?

2019-08-17 Thread Komяpa
Hi,

On my PG11 I set it to 64 at setup time and it propagated to
postgresql.auto.conf and is set after restart. I've upgraded to PG12 since
then, and the parameter is read from postgresql.auto.conf correctly and is
displayed via SHOW (just checked on 12beta3).

I also spent some time trying to get a plan that will give me 32 workers.
Largest I ever got without taking a hammer was 16, which is half of
available cores, or all non-HT ones. I still haven't found a way to set
costs and limits to load all the system with a query.

ALTER TABLE ... SET (parallel_workers=32); is currently my favorite
hammer. I set max_worker_processes to 512 and let the OS scheduler sort
things out during the hours when four queries run 128 CPU-bound processes
on a 32-core machine. It's not as good as if the limits were adjusted
dynamically after the query starts, but it's much better than running a
second query with just 1 worker even after the first one finishes.

On Sat, Aug 17, 2019 at 8:41 PM Tom Lane  wrote:

> Try this:
> alter system set max_parallel_workers = 20;
> and restart the system.
>
> max_parallel_workers is still 8, according to both SHOW and
> pg_controldata, nor can you launch more than 8 workers.
>
> Even odder, if you just do
>
> regression=# set max_parallel_workers = 200;
> SET
> regression=# show max_parallel_workers;
>  max_parallel_workers
> --
>  200
> (1 row)
>
> which should certainly not happen for a PGC_POSTMASTER parameter.
>
> We seem to have an awful lot of mechanism that's concerned with
> adjustments of max_parallel_workers, for something that apparently
> might as well be a compile-time #define ... so I assume it's supposed
> to be changeable at restart and somebody broke it.  But it's not
> working as I'd expect in any branch from 10 onwards.
>
> regards, tom lane
>
>
>

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> I will state what I have already told some people privately, that for
> this feature, we have many people understanding 40% of the problem, but
> thinking they understand 90%.  I do agree we should plan for our
> eventual full feature set, but I can't figure out what that feature set
> looks like beyond full-cluster encryption, and no one is addressing my
> concerns to move that forward.  Vague complaints that they don't like the
> process are not changing that.

I don't particularly care for these "40%" and "90%" characterizations
and I'm concerned that some might also, reasonably, find that to be a
way to dismiss the opinions and comments from anyone who isn't in the
clearly subjective "90%" crowd.

Regarding what the eventual feature-set is, I believe it's abundantly
clear where we want to eventually go, and it's surprising to me that it's
unclear: we should be aiming for parity with the other major database
vendors when it comes to TDE.  That's pretty clear and straightforward
to define as well, based on the facts:

Oracle:
- Supports column-level and tablespace-level.
- Has a Master Encryption Key (MEK), and table keys.
- Supports having the MEK be external to the database system.
- For tablespaces, can also use an external key store with
  different keys for different tablespaces.
- Supports Triple-DES and AES (128, 192, 256 bit)
- Supports a NOMAC parameter to improve performance.
- Has a mechanism for changing the keys/algorithms for tables
  with encrypted columns.

SQL Server:
- Supports database-level encryption
- Has an instance master key and a per-database master key
- Includes a key store for having other keys
- Provides a function-based approach for encrypting at a column level
  (imagine pgcrypto, but where the key can be pulled from a key-store in
  the database which has to be unlocked)

DB2:
- Supports a Master Key and a Data Encryption Key
- Support encryption at a per-database level

Sybase:
- Supports a key encryption key
- Supports column level encryption with column encryption keys

MySQL:
- Supports a master encryption key
- Supports having the master key in an external data store which speaks
  Oasis KMIP
- Supports per-tablespace encryption
- Supports per-table encryption

Every one of the database systems above uses at least a two-tier system
(SQL server seems to possibly support three-tier) where there is a MEK
and then multiple keys under the MEK to allow partial encryption of the
system, at *least* at a database or tablespace level but a number go
down to column-level, either directly or using a function-based approach
with a key store.

Every one has some kind of key store, and a number support an external
key store.

There is not one that uses a single key or which requires that the
entire instance be encrypted.

Being PostgreSQL, I would expect us to shoot for as much flexibility as
possible, similar to what we've done for our ACL system where we
support down to the column level (and row level with RLS).

That's our target end-goal.  Having an incremental plan to get there
where we start with something simpler and then work towards a more
complicated implementation is fine- but that base, as I've said multiple
times and as supported by what we see other database systems have,
should include some kind of key store with support for multiple keys and
a way to encrypt something less than the entire system.  Every other
database system that we consider at all comparable has at least that.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: range_agg

2019-08-17 Thread Paul A Jungwirth
On Mon, Jul 8, 2019 at 9:46 AM Paul A Jungwirth
 wrote:
> - A multirange type is an extra thing you get when you define a range
> (just like how you get a tstzrange[]). Therefore

I've been able to make a little more progress on multiranges the last
few days, but it reminded me of an open question I've had for a while:
typmods! I see places in the range code that gesture toward supporting
typmods, but none of the existing range types permit them. For
example:

postgres=# select '5'::numeric(4,2);
 numeric
---------
    5.00
(1 row)

postgres=# select '[1,4)'::numrange(4,2);
ERROR:  type modifier is not allowed for type "numrange"
LINE 1: select '[1,4)'::numrange(4,2);

So I'm wondering how seriously I should take this for multiranges? I
guess if a range type did support typmods, it would just delegate to
the underlying element type for their meaning, and so a multirange
should delegate it too? Is there any historical discussion around
typmods on range types?

Thanks!
Paul




max_parallel_workers can't actually be set?

2019-08-17 Thread Tom Lane
Try this:
alter system set max_parallel_workers = 20;
and restart the system.

max_parallel_workers is still 8, according to both SHOW and
pg_controldata, nor can you launch more than 8 workers.

Even odder, if you just do

regression=# set max_parallel_workers = 200;
SET
regression=# show max_parallel_workers;  
 max_parallel_workers 
--
 200
(1 row)

which should certainly not happen for a PGC_POSTMASTER parameter.

We seem to have an awful lot of mechanism that's concerned with
adjustments of max_parallel_workers, for something that apparently
might as well be a compile-time #define ... so I assume it's supposed
to be changeable at restart and somebody broke it.  But it's not
working as I'd expect in any branch from 10 onwards.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-08-17 Thread Andres Freund
Hi,

On 2019-08-17 12:05:21 -0400, Robert Haas wrote:
> On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > > Again, I think it's not ok to just assume you can lock an essentially
> > > > unbounded number of buffers. This seems almost guaranteed to result in
> > > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> > >
> > > I think for controlling that we need to put a limit on max prepared
> > > undo?  I am not sure any other way of limiting the number of buffers
> > > because we must lock all the buffer in which we are going to insert
> > > the undo record under one WAL logged operation.
> >
> > I heard that a number of times. But I still don't know why that'd
> > actually be true. Why would it not be sufficient to just lock the buffer
> > currently being written to, rather than all buffers? It'd require a bit
> > of care updating the official current "logical end" of a log, but
> > otherwise ought to not be particularly hard? Only one backend can extend
> > the log after all, and until the log is externally visibly extended,
> > nobody can read or write those buffers, no?
>
> Well, I don't understand why you're on about this.  We've discussed it
> a number of times but I'm still confused.

There's two reasons here:

The primary one in the context here is that if we do *not* have to lock
the buffers all ahead of time, we can simplify the interface. We
certainly can't lock the buffers over IO (due to buffer reclaim) as
we're doing right now, so we'd need another phase, called by the "user"
during undo insertion. But if we do not need to lock the buffers before
the insertion over all starts, the inserting location doesn't have to
care.

Secondarily, all the reasoning for needing to lock all buffers ahead of
time was imo fairly unconvincing. Following the "recipe" for WAL
insertions is a good idea when writing a new run-of-the-mill WAL
inserting location - but when writing a new fundamental facility, that
already needs to modify how WAL works, then I find that much less
convincing.


> 1. It's absolutely fine to just put a limit on this, because the
> higher-level facilities that use this shouldn't be doing a single
> WAL-logged operation that touches a zillion buffers.  We have been
> careful to avoid having WAL-logged operations touch an unbounded
> number of buffers in plenty of other places, like the btree code, and
> we are going to have to be similarly careful here for multiple
> reasons, deadlock avoidance being one.  So, saying, "hey, you're going
> to lock an unlimited number of buffers" is a straw man.  We aren't.
> We can't.

Well, in the version of code that I was reviewing here, I don't think there is
such a limit (there is a limit for buffers per undo record, but no limit
on the number of records inserted together). I think Dilip added a limit
since.  And we have the issue of a lot of IO happening while holding
content locks on several pages.  So I don't think it's a straw man at
all.


> 2. The write-ahead logging protocol says that you're supposed to lock
> all the buffers at once.  See src/backend/access/transam/README.  If
> you want to go patch that file, then this patch can follow whatever
> the locking rules in the patched version are.  But until then, the
> patch should follow *the actual rules* not some other protocol based
> on a hand-wavy explanation in an email someplace. Otherwise, you've
> got the same sort of undocumented disaster-waiting-to-happen that you
> keep complaining about in other parts of this patch.  We need fewer of
> those, not more!

But that's not what I'm asking for? I don't even know where you take
from that I don't want this to be documented. I'm mainly asking for a
comment explaining why the current behaviour is what it is. Because I
don't think an *implicit* "normal WAL logging rules" is sufficient
explanation, because all the locking here happens one or two layers away
from the WAL logging site - so it's absolutely *NOT* obvious that that's
the explanation. And I don't think any of the locking sites actually has
comments explaining why the locks are acquired at that time (in fact,
IIRC until the review some even only mentioned pinning, not locking).


> > > Suppose you insert one record for the transaction which split in
> > > block1 and 2.  Now, before this block is actually going to the disk
> > > the transaction committed and become all visible the undo logs are
> > > discarded.  It's possible that block 1 is completely discarded but
> > > block 2 is not because it might have undo for the next transaction.
> > > Now, during recovery (FPW is off) if block 1 is missing but block 2 is
> > > their so we need to skip inserting undo for block 1 as it does not
> > > exist.
> >
> > Hm. I'm quite doubtful this is a good idea. How will this not force us
> > to emit a lot more expensive durable operations while writing undo?
> > And doesn't this reduce error detection quite remarkably?
> >
> > Thomas, Robert?
> 
> I think you're going to 

Re: Patch: New GUC prepared_statement_limit to limit memory used by prepared statements

2019-08-17 Thread Ibrar Ahmed
On Sat, Aug 17, 2019 at 6:58 PM Daniel Migowski 
wrote:

> Hello,
>
> attached you find a patch that adds a new GUC:
>

Quick questions before looking at the patch.

>
> prepared_statement_limit:
>
 - Do we have a consensus about the name of the GUC? I don't think it is
   the right name for that.

 - Is this a WIP patch or the final patch? Because I can see TODOs and
   non-standard comments in the patch.



>  Specifies the maximum amount of memory used in each session to cache
>  parsed-and-rewritten queries and execution plans. This affects the
>  maximum memory a backend thread will reserve when many prepared
>  statements are used. The default value of 0 disables this setting, but
>  it is recommended to set this value a bit lower than the maximum memory
>  a backend worker thread should reserve permanently.
>
> If the GUC is configured after each save of a CachedPlanSource, or after
> creating a CachedPlan from it, the function
> EnforcePreparedStatementLimit is called now. It checks the mem usage of
> the existing saved CachedPlanSources and invalidates the query_list and
> the gplan if available until the memory limit is met again.
>
> CachedPlanSource are removed-and-tailadded in the saved_plan_list
> everytime GetCachedPlan is called on them so it can be used as a LRU list.
>
> I also reworked ResetPlanCache, PlanCacheRelCallback and
> PlanCacheObjectCallback a bit so when a CachedPlanSource is invalidated
> the query_list is not only marked as invalid but it is also fully
> released to free memory here.
>
> Regards,
> Daniel Migowski
>
> PS@Konstantin: This patch also includes the CachedPlanMemoryUsage
> function you like; maybe you'd like to review the patch for me?
>
>



-- 
Ibrar Ahmed


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Sat, Aug 17, 2019 at 08:16:06AM +0200, Antonin Houska wrote:
> > Bruce Momjian  wrote:
> > 
> > > On Thu, Aug 15, 2019 at 09:01:05PM -0400, Stephen Frost wrote:
> > > > * Bruce Momjian (br...@momjian.us) wrote:
> > > > > Why would it not be simpler to have the cluster_passphrase_command run
> > > > > whatever command-line program it wants?  If you don't want to use a
> > > > > shell command, create an executable and call that.
> > > > 
> > > > Having direct integration with a KMS would certainly be valuable, and I
> > > > don't see a reason to deny users that option if someone would like to
> > > > spend time implementing it- in addition to a simpler mechanism such as a
> > > > passphrase command, which I believe is what was being suggested here.
> > > 
> > > OK,  I am just trying to see why we would not use the
> > > cluster_passphrase_command-like interface to do that.
> > 
> > One problem that occurs to me is that PG may need to send some sort of
> > credentials to the KMS. If it runs a separate process to execute the 
> > command,
> > it needs to pass those credentials to it. Whether it does so via parameters 
> > or
> > environment variables, both can be seen by other users.
> 
> Yes, that would be a good reason to use an external library, if we can't
> figure out a clean API like opening a pipe into the command-line tool
> and piping in the secret.

Having to install something additional to make that whole mechanism
happen would also be less than ideal, imv.  That includes even something
as install-X and then configure passphrase_command.  Our experience with
archive_command shows that it really isn't a very good approach, even
when everything can be passed in on a command line.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Stephen Frost
Greetings,

* Ibrar Ahmed (ibrar.ah...@gmail.com) wrote:
> On Sat, Aug 17, 2019 at 3:04 AM Bruce Momjian  wrote:
> > +1 for voice call, bruce we usually have a weekly TDE call. I will include
> you in that call.  Currently, in that group are

> moon_insung...@lab.ntt.co.jp,
> sawada.m...@gmail.com,
> shawn.w...@highgo.ca,
> ahsan.h...@highgo.ca,
> ibrar.ah...@gmail.com
> 
> I am able to do that for others as well.

If you could add me to that call, I'll do my best to attend.

(If it's a gmail calendar invite, please send to frost.stephen.p @
gmail).

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Ibrar Ahmed
On Sat, Aug 17, 2019 at 3:04 AM Bruce Momjian  wrote:

> On Fri, Aug 16, 2019 at 07:47:37PM +0200, Antonin Houska wrote:
> > Bruce Momjian  wrote:
> >
> > > I have seen no one present a clear description of how anything beyond
> > > all-cluster encryption would work or be secure.  Wishing that were not
> > > the case doesn't change things.
> >
> > Since this email thread has grown a lot and is difficult to follow, it
> might
> > help if we summarized various approaches on the wiki, with their pros and
> > cons, and included some links to the corresponding emails in the
> > archive. There might be people who would like think about the problems
> but
> > don't have time to read the whole thread. Overview of the pending
> problems of
> > particular approaches might be useful for newcomers, but also for people
> who
> > followed only part of the discussion. I mean an overview of the storage
> > problems; the key management seems to be less controversial.
> >
> > If you think it makes sense, I can spend some time next week on the
> > research. However I'd need at least an outline of the approaches proposed
> > because I also missed some parts of the thread.
>
> I suggest we schedule a voice call and I will go over all the issues and
> explain why I came to the conclusions listed.  It is hard to know what
> level of detail to explain that in an email, beyond what I have already
> posted on this thread.  The only other option is to read all the emails
> _I_ sent on the thread to get an idea.
>
> +1 for voice call, bruce we usually have a weekly TDE call. I will include
you in that call.  Currently, in that group are
moon_insung...@lab.ntt.co.jp,
sawada.m...@gmail.com,
shawn.w...@highgo.ca,
ahsan.h...@highgo.ca,
ibrar.ah...@gmail.com

I am able to do that for others as well.
>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>


-- 
Ibrar Ahmed


Re: POC: Cleaning up orphaned files using undo logs

2019-08-17 Thread Robert Haas
On Wed, Aug 14, 2019 at 12:39 PM Andres Freund  wrote:
> > > Again, I think it's not ok to just assume you can lock an essentially
> > > unbounded number of buffers. This seems almost guaranteed to result in
> > > deadlocks. And there's limits on how many lwlocks one can hold etc.
> >
> > I think for controlling that we need to put a limit on max prepared
> > undo?  I am not sure any other way of limiting the number of buffers
> > because we must lock all the buffer in which we are going to insert
> > the undo record under one WAL logged operation.
>
> I heard that a number of times. But I still don't know why that'd
> actually be true. Why would it not be sufficient to just lock the buffer
> currently being written to, rather than all buffers? It'd require a bit
> of care updating the official current "logical end" of a log, but
> otherwise ought to not be particularly hard? Only one backend can extend
> the log after all, and until the log is externally visibly extended,
> nobody can read or write those buffers, no?

Well, I don't understand why you're on about this.  We've discussed it
a number of times but I'm still confused.  I'll repeat my previous
arguments on-list:

1. It's absolutely fine to just put a limit on this, because the
higher-level facilities that use this shouldn't be doing a single
WAL-logged operation that touches a zillion buffers.  We have been
careful to avoid having WAL-logged operations touch an unbounded
number of buffers in plenty of other places, like the btree code, and
we are going to have to be similarly careful here for multiple
reasons, deadlock avoidance being one.  So, saying, "hey, you're going
to lock an unlimited number of buffers" is a straw man.  We aren't.
We can't.

2. The write-ahead logging protocol says that you're supposed to lock
all the buffers at once.  See src/backend/access/transam/README.  If
you want to go patch that file, then this patch can follow whatever
the locking rules in the patched version are.  But until then, the
patch should follow *the actual rules* not some other protocol based
on a hand-wavy explanation in an email someplace. Otherwise, you've
got the same sort of undocumented disaster-waiting-to-happen that you
keep complaining about in other parts of this patch.  We need fewer of
those, not more!

3. There is no reason to care about all of the buffers being locked at
once, because they are not unlimited in number (see point #1) and
nobody else is looking at them anyway (see the final sentence of what
I quoted above).

I think we are, or ought to be, talking about locking 2 (or maybe in
rare cases 3 or 4) undo buffers in connection with a single WAL
record.  If we're talking about more than that, then I think the
higher-level code needs to be changed.  If we're talking about that
many, then we don't need to be clever.  We can just do the standard
thing that the rest of the system does, and it will be fine just like
it is everywhere else.

> > Suppose you insert one record for the transaction which split in
> > block1 and 2.  Now, before this block is actually going to the disk
> > the transaction committed and become all visible the undo logs are
> > discarded.  It's possible that block 1 is completely discarded but
> > block 2 is not because it might have undo for the next transaction.
> > Now, during recovery (FPW is off) if block 1 is missing but block 2 is
> > their so we need to skip inserting undo for block 1 as it does not
> > exist.
>
> Hm. I'm quite doubtful this is a good idea. How will this not force us
> to emit a lot more expensive durable operations while writing undo?
> And doesn't this reduce error detection quite remarkably?
>
> Thomas, Robert?

I think you're going to need to spell out your assumptions in order
for me to be able to comment intelligently.  This is another thing
that seems pretty normal to me.  Generally, WAL replay might need to
recreate objects whose creation is not separately WAL-logged, and it
might need to skip operations on objects that have been dropped later
in the WAL stream and thus don't exist any more. This seems like an
instance of the latter pattern.  There's no reason to try to put valid
data into pages that we know have been discarded, and both inserting
and discarding undo data need to be logged anyway.

As a general point, I think the hope is that undo generated by
short-running transactions that commit and become all-visible quickly
will be cheap.  We should be able to dirty shared buffers but then
discard the data without ever writing it out to disk if we've logged a
discard of that data. Obviously, if you've got long-running
transactions that are either generating undo or holding old snapshots,
you're going to have to really flush the data, but we want to avoid
that when we can.  And the same is true on the standby: even if we
write the dirty data into shared buffers instead of skipping the write
altogether, we hope to be able to forget about those buffers when we
encounter a 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Bruce Momjian
On Sat, Aug 17, 2019 at 08:16:06AM +0200, Antonin Houska wrote:
> Bruce Momjian  wrote:
> 
> > On Thu, Aug 15, 2019 at 09:01:05PM -0400, Stephen Frost wrote:
> > > * Bruce Momjian (br...@momjian.us) wrote:
> > > > Why would it not be simpler to have the cluster_passphrase_command run
> > > > whatever command-line program it wants?  If you don't want to use a
> > > > shell command, create an executable and call that.
> > > 
> > > Having direct integration with a KMS would certainly be valuable, and I
> > > don't see a reason to deny users that option if someone would like to
> > > spend time implementing it- in addition to a simpler mechanism such as a
> > > passphrase command, which I believe is what was being suggested here.
> > 
> > OK,  I am just trying to see why we would not use the
> > cluster_passphrase_command-like interface to do that.
> 
> One problem that occurs to me is that PG may need to send some sort of
> credentials to the KMS. If it runs a separate process to execute the command,
> it needs to pass those credentials to it. Whether it does so via parameters or
> environment variables, both can be seen by other users.

Yes, that would be a good reason to use an external library, if we can't
figure out a clean API like opening a pipe into the command-line tool
and piping in the secret.
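
Roughly, I am thinking of something like this (a sketch only;
cluster_passphrase_command, secret and secret_len are illustrative names
here, not an existing API):

	FILE	   *fh;

	/*
	 * Run the configured command and write the credentials to its stdin, so
	 * they never appear in argv or the environment.
	 */
	fh = OpenPipeStream(cluster_passphrase_command, "w");
	if (fh == NULL)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not execute command \"%s\": %m",
						cluster_passphrase_command)));

	if (fwrite(secret, 1, secret_len, fh) != secret_len)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write secret to command \"%s\"",
						cluster_passphrase_command)));

	ClosePipeStream(fh);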

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Bruce Momjian
On Fri, Aug 16, 2019 at 06:04:39PM -0400, Bruce Momjian wrote:
> I suggest we schedule a voice call and I will go over all the issues and
> explain why I came to the conclusions listed.  It is hard to know what
> level of detail to explain that in an email, beyond what I have already
> posted on this thread.  The only other option is to read all the emails
> _I_ sent on the thread to get an idea.
> 
> I am able to do that for others as well.  

Also, people can certainly ask questions on this list, and I can answer
them, or I can do a Skype/Zoom/IRC chat/call if people want.  The points
of complexity are really the amount of data encrypted, the number of
keys and who controls them, and performance overhead.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Patch: New GUC prepared_statement_limit to limit memory used by prepared statements

2019-08-17 Thread Daniel Migowski

Hello,

attached you find a patch that adds a new GUC:

prepared_statement_limit:

        Specifies the maximum amount of memory used in each session to cache
        parsed-and-rewritten queries and execution plans. This affects the
        maximum memory a backend thread will reserve when many prepared
        statements are used. The default value of 0 disables this setting,
        but it is recommended to set this value a bit lower than the maximum
        memory a backend worker thread should reserve permanently.

If the GUC is configured after each save of a CachedPlanSource, or after 
creating a CachedPlan from it, the function 
EnforcePreparedStatementLimit is called now. It checks the mem usage of 
the existing saved CachedPlanSources and invalidates the query_list and 
the gplan if available until the memory limit is met again.


CachedPlanSource are removed-and-tailadded in the saved_plan_list 
everytime GetCachedPlan is called on them so it can be used as a LRU list.


I also reworked ResetPlanCache, PlanCacheRelCallback and 
PlanCacheObjectCallback a bit so when a CachedPlanSource is invalidated 
the query_list is not only marked as invalid but it is also fully 
released to free memory here.
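
In rough outline, the enforcement pass works like this (a simplified
sketch; the attached patch has the real details, and the unit handling of
the GUC is assumed here to be kB):

static void
EnforcePreparedStatementLimit(CachedPlanSource *ignoreThis)
{
	Size		total = 0;
	dlist_iter	iter;

	/* sum up the memory used by all saved plan sources */
	dlist_foreach(iter, &saved_plan_list)
	{
		CachedPlanSource *ps = dlist_container(CachedPlanSource, node, iter.cur);

		total += CachedPlanMemoryUsage(ps);
	}

	/* walk from the least recently used end, releasing cached state */
	dlist_foreach(iter, &saved_plan_list)
	{
		CachedPlanSource *ps = dlist_container(CachedPlanSource, node, iter.cur);

		if (total <= (Size) prep_statement_limit * 1024L)
			break;
		if (ps == ignoreThis)
			continue;

		/* drop the rewritten query list and the generic plan to free memory */
		total -= CachedPlanMemoryUsage(ps);
		ReleaseGenericPlan(ps);
		ReleaseQueryList(ps);
	}
}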


Regards,
Daniel Migowski

PS@Konstantin: This patch also includes the CachedPlanMemoryUsage
function you like; maybe you'd like to review the patch for me?




diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cdc30fa..2da4ba8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1702,6 +1702,24 @@ include_dir 'conf.d'
   
  
 
+ 
+  prepared_statement_limit (integer)
+  
+   prepared_statement_limit configuration 
parameter
+  
+  
+  
+   
+    Specifies the maximum amount of memory used in each session to cache
+    parsed-and-rewritten queries and execution plans. This affects the
+    maximum memory a backend thread will reserve when many prepared
+    statements are used. The default value of 0 disables this setting,
+    but it is recommended to set this value a bit lower than the maximum
+    memory a backend worker thread should reserve permanently.
+   
+  
+ 
+
  
   max_stack_depth (integer)
   
diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index abc3062..dbeb5a2 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -88,6 +88,9 @@
  * those that are in long-lived storage and are examined for sinval events).
  * We use a dlist instead of separate List cells so that we can guarantee
  * to save a CachedPlanSource without error.
+ *
+ * This list is used as a LRU list for prepared statements when a
+ * prepared_statement_limit is configured.
  */
 static dlist_head saved_plan_list = DLIST_STATIC_INIT(saved_plan_list);
 
@@ -97,6 +100,7 @@ static dlist_head saved_plan_list = DLIST_STATIC_INIT(saved_plan_list);
 static dlist_head cached_expression_list = DLIST_STATIC_INIT(cached_expression_list);
 
 static void ReleaseGenericPlan(CachedPlanSource *plansource);
+static void ReleaseQueryList(CachedPlanSource *plansource);
 static List *RevalidateCachedQuery(CachedPlanSource *plansource,
 								   QueryEnvironment *queryEnv);
 static bool CheckCachedPlan(CachedPlanSource *plansource);
@@ -114,6 +118,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list);
 static void PlanCacheRelCallback(Datum arg, Oid relid);
 static void PlanCacheObjectCallback(Datum arg, int cacheid, uint32 hashvalue);
 static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue);
+static void EnforcePreparedStatementLimit(CachedPlanSource* ignoreThis);
 
 /* GUC parameter */
 intplan_cache_mode;
@@ -482,6 +487,11 @@ SaveCachedPlan(CachedPlanSource *plansource)
 	dlist_push_tail(&saved_plan_list, &plansource->node);
 
plansource->is_saved = true;
+
+   if( prep_statement_limit > 0 ) {
+   /* Clean up statements when mem limit is hit */
+   EnforcePreparedStatementLimit(plansource);
+   }
 }
 
 /*
@@ -536,6 +546,31 @@ ReleaseGenericPlan(CachedPlanSource *plansource)
 }
 
 /*
+ * ReleaseQueryList: release a CachedPlanSource's query list, relationOids,
+ * invalItems and search_path.
+ */
+static void
+ReleaseQueryList(CachedPlanSource *plansource)
+{
+   if (plansource->query_list)
+   {
+   plansource->is_valid = false;
+   plansource->query_list = NIL;
+   plansource->relationOids = NIL;
+   plansource->invalItems = NIL;
+   plansource->search_path = NULL;
+
+   if (plansource->query_context)
+   {
+   MemoryContext qcxt = plansource->query_context;
+
+   plansource->query_context = NULL;
+   MemoryContextDelete(qcxt);
+  

Re: default_table_access_method is not in sample config file

2019-08-17 Thread Michael Paquier
On Fri, Aug 16, 2019 at 03:29:30PM -0700, Andres Freund wrote:
> but I don't quite see GUCs like default_tablespace, search_path (due to
> determining a created table's schema), temp_tablespace,
> default_table_access_method fit reasonably well under that heading. They
> all can affect persistent state. That seems pretty different from a
> number of other settings (client_min_messages,
> default_transaction_isolation, lock_timeout, ...) which only have
> transient effects.

Agreed.

> Should we perhaps split that group? Not that I have a good proposal for
> better names.

We could have a section for transaction-related parameters, and move
the vacuum ones into the section for autovacuum so that they get
grouped, renaming the section "autovacuum and vacuum".  An idea of a
group name for search_path, temp_tablespace, default_tablespace & co
would be "object parameters", or "relation parameters", for all the
parameters which interfere with object definitions?
--
Michael


signature.asc
Description: PGP signature


Re: some SCRAM read_any_attr() confusion

2019-08-17 Thread Michael Paquier
On Sat, Aug 17, 2019 at 10:11:27AM +0200, Peter Eisentraut wrote:
> I was a bit confused by some of the comments around the SCRAM function
> read_any_attr(), used to skip over extensions.
> 
> The comment "Returns NULL if there is attribute.", besides being
> strangely worded, appears to be wrong anyway, because the function never
> returns NULL.

This may have come from an incorrect merge with a past version when
this code was being developed.  +1 from me for your suggestions and your
patch.
--
Michael


signature.asc
Description: PGP signature


[PATCH] minor doc fix for create-role

2019-08-17 Thread tara
Hi,

I'm not subscribed to the list, so please include me directly if you want my 
attention. This is a "drive-by" patch as it were.

Attached is a minor patch to fix the name param documentation for create role, 
just adding a direct quote from user-manag.sgml talking about what the role 
name is allowed to be.  I was searching for this information and figured the 
reference page should have it as well.

Hope this helps,
-Tara


docs-create-role.v1.patch
Description: Binary data


some SCRAM read_any_attr() confusion

2019-08-17 Thread Peter Eisentraut
I was a bit confused by some of the comments around the SCRAM function
read_any_attr(), used to skip over extensions.

The comment "Returns NULL if there is attribute.", besides being
strangely worded, appears to be wrong anyway, because the function never
returns NULL.

This led me to wonder how this loop would terminate if there is no "p"
attribute in the message:

/* ignore optional extensions */
do
{
proof = p - 1;
value = read_any_attr(, );
} while (attr != 'p');

What actually happens is

ERROR:  malformed SCRAM message
DETAIL:  Attribute expected, but found invalid character "0x00".

which serves the purpose but was probably not quite intended that way.

I propose the attached patch to clean this up a bit, with better
comments and a better error message.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 48c38857cb5dbccbe512bb489a4a69214d03d2e4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 17 Aug 2019 10:04:11 +0200
Subject: [PATCH] Clean up some SCRAM attribute processing

Correct the comment for read_any_attr().  Give a clearer error message
when parsing at the end of the string, when the client-final-message
does not contain a "p" attribute (for some reason).
---
 src/backend/libpq/auth-scram.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index aa918839fb..68792cb45e 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -790,7 +790,8 @@ sanitize_str(const char *s)
 /*
  * Read the next attribute and value in a SCRAM exchange message.
  *
- * Returns NULL if there is attribute.
+ * The attribute character is set in *attr_p, the attribute value is the
+ * return value.
  */
 static char *
 read_any_attr(char **input, char *attr_p)
@@ -799,6 +800,12 @@ read_any_attr(char **input, char *attr_p)
char   *end;
charattr = *begin;
 
+   if (attr == '\0')
+   ereport(ERROR,
+   (errcode(ERRCODE_PROTOCOL_VIOLATION),
+errmsg("malformed SCRAM message"),
+errdetail("Attribute expected, but found end 
of string.")));
+
/*--
 * attr-val= ALPHA "=" value
 *   ;; Generic syntax of any attribute sent
@@ -1298,7 +1305,7 @@ read_client_final_message(scram_state *state, const char *input)
 
state->client_final_nonce = read_attr_value(, 'r');
 
-   /* ignore optional extensions */
+   /* ignore optional extensions, read until we find "p" attribute */
do
{
proof = p - 1;
-- 
2.22.0



Re: [proposal] de-TOAST'ing using a iterator

2019-08-17 Thread Binguo Bao
Hi John,

> >> Also, I don't think
> >> the new logic for the ctrl/c variables is an improvement:
> >>
> >> 1. iter->ctrlc is intialized with '8' (even in the uncompressed case,
> >> which is confusing). Any time you initialize with something not 0 or
> >> 1, it's a magic number, and here it's far from where the loop variable
> >> is used. This is harder to read.
> >
> > `iter->ctrlc` is used to record the value of `ctrl` in pglz_decompress
> > at the end of the last iteration (or loop). In pglz_decompress, `ctrlc`’s
> > valid value is 0~7. When `ctrlc` reaches 8, a control byte is read from
> > the source buffer to `ctrl` and then `ctrlc` is set to 0. A control byte
> > should be read from the source buffer to `ctrlc` on the first iteration,
> > so `iter->ctrlc` should be initialized with '8'.
>
> My point here is it looks strange out of context, but "0" looked
> normal. Maybe a comment in init_detoast_buffer(), something like "8
> means read a control byte from the source buffer on the first
> iteration, see pg_lzdecompress_iterate()".
>
> Or, possibly, we could have a macro like INVALID_CTRLC. That might
> even improve the readability of the original function. This is just an
> idea, and maybe others would disagree, so you don't need to change it
> for now.
>

All in all, the idea is much better than a magic number 8. So, I've
implemented it.
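
As a simplified fragment, the idea now looks roughly like this (the exact
code is in pglz_decompress_iterate() in the attached patch):

	/*
	 * INVALID_CTRLC (defined as 8, per the discussion above) means "no
	 * control bits left", so the iterator must fetch a fresh control byte
	 * from the source buffer before decoding any more data.
	 */
	if (iter->ctrlc == INVALID_CTRLC)
	{
		if (sp >= srcend)
			return;				/* need more input first */
		iter->ctrl = (unsigned char) *sp++;
		iter->ctrlc = 0;
	}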


> At this point, there are no functional things that I think we need to
> change. It's close to ready-for-committer. For the next version, I'd
> like you to go through the comments and edit for grammar, spelling, and
> clarity as you see fit. I know you're not a native speaker of English,
> so I can help you with anything that remains.


I've tried my best to improve the comments, but there should be room for
further improvement. I hope you can help me perfect it.


> Also note we use braces
> on their own lines
> {
> like this
> }
>
Done.
-- 
Best regards,
Binguo Bao
From 13648adb56b96a75910bb5d0a8e21d358b266d51 Mon Sep 17 00:00:00 2001
From: BBG 
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 470 +++
 src/backend/utils/adt/varlena.c  |  40 ++-
 src/include/access/tuptoaster.h  |  97 
 src/include/fmgr.h   |   7 +
 4 files changed, 608 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 74233bb..a8924f1 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static void fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static void pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+	DetoastIterator iter);
 
 
 /* --
@@ -347,6 +354,125 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* --
+ * init_detoast_iterator -
+ *
+ * The "iterator" variable is normally just a local variable in the caller.
+ * It only makes sense to initialize a de-TOAST iterator for an external on-disk value.
+ *
+ * --
+ */
+bool init_detoast_iterator(struct varlena *attr, DetoastIterator iterator)
+{
+	struct varatt_external toast_pointer;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- initialize fetch datum iterator
+		 */
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->ctrl = 0;
+			iterator->ctrlc = INVALID_CTRLC;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->ctrl = 0;
+			iterator->ctrlc = INVALID_CTRLC;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+		return true;
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		return 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-17 Thread Antonin Houska
Bruce Momjian  wrote:

> On Thu, Aug 15, 2019 at 09:01:05PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > > Why would it not be simpler to have the cluster_passphrase_command run
> > > whatever command-line program it wants?  If you don't want to use a
> > > shell command, create an executable and call that.
> > 
> > Having direct integration with a KMS would certainly be valuable, and I
> > don't see a reason to deny users that option if someone would like to
> > spend time implementing it- in addition to a simpler mechanism such as a
> > passphrase command, which I believe is what was being suggested here.
> 
> OK,  I am just trying to see why we would not use the
> cluster_passphrase_command-like interface to do that.

One problem that occurs to me is that PG may need to send some sort of
credentials to the KMS. If it runs a separate process to execute the command,
it needs to pass those credentials to it. Whether it does so via parameters or
environment variables, both can be seen by other users.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com