Re: Problem with default partition pruning

2019-08-08 Thread Kyotaro Horiguchi
At Fri, 9 Aug 2019 14:02:36 +0900, Amit Langote wrote:
> On Fri, Aug 9, 2019 at 1:17 PM Amit Langote  wrote:
> > On Fri, Aug 9, 2019 at 12:09 PM Kyotaro Horiguchi
> >  wrote:
> > > At Thu, 8 Aug 2019 14:50:54 +0900, Amit Langote wrote:
> > > > When working on it, I realized
> > > > that the way RelOptInfo.partition_qual is processed is a bit
> > > > duplicative, so I created a separate patch to make that a bit more
> > > > consistent.
> > >
> > > 0001 seems reasonable. By the way, the patch doesn't touch
> > > get_relation_constraints(), but I suppose it can use the modified
> > > partition constraint qual already stored in rel->partition_qual
> > > in set_relation_partition_info. And we could move constifying to
> > > set_relation_partition_info?
> >
> > Ah, good advice.  This makes partition constraint usage within the
> > planner quite a bit more consistent.
> 
> Hmm, oops.  I think that judgement was a bit too rushed on my part.  I
> unintentionally ended up making the partition constraint *always*
> be fetched, whereas we don't need it in most cases.  I've reverted

(v2 was withdrawn before I saw it :p)

Yeah, agreed. It is needed only by (sub)partitioned parents.

> that change.  RelOptInfo.partition_qual is poorly named in retrospect.
> :(  It's not set for all partitions, only those that are partitioned
> themselves.
> 
> Attached updated patches.

+++ b/src/backend/optimizer/util/plancat.c
@@ -1267,10 +1267,14 @@ get_relation_constraints(PlannerInfo *root,
*/
   if (include_partition && relation->rd_rel->relispartition)
   {
...
+else
 {
+  /* Nope, fetch from the relcache. */

It seems to me that include_partition is true both and only for
modern and old-fashioned partition parents.
set_relation_partition_info() is currently called only for modern
partition parents. If we need the constraint at the place above,
set_relation_partition_info() could also be called for old-fashioned
partition parents, and get_relation_constraints() could drop the
else branch entirely.


+  /* Nope, fetch from the relcache. */
+  List   *pcqual = RelationGetPartitionQual(relation);

If the comment above is right, this would be duplicative. What we
should do instead is only eval_const_expressions(). And we could
move that to set_relation_partition_info() completely. Constifying
would be useful in both cases.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Global temporary tables

2019-08-08 Thread Craig Ringer
On Thu, 8 Aug 2019 at 15:03, Konstantin Knizhnik wrote:

>
>
> On 08.08.2019 5:40, Craig Ringer wrote:
>
> On Tue, 6 Aug 2019 at 16:32, Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>> New version of the patch with several fixes is attached.
>> Many thanks to Roman Zharkov for testing.
>>
>
> FWIW I still don't understand your argument with regards to using
> shared_buffers for temp tables having connection pooling benefits. Are you
> assuming the presence of some other extension in your extended version of
> PostgreSQL? In community PostgreSQL a temp table's contents in one backend
> will not be visible in another backend. So if your connection pooler in
> transaction pooling mode runs txn 1 on backend 42 and it populates temp
> table X, then the pooler runs the same app session's txn 2 on backend 45,
> the contents of temp table X are not visible anymore.
>
>
> Certainly, here I mean the built-in connection pooler, which is not
> currently present in Postgres; it is part of PgPRO-EE, and there is my
> patch for vanilla at the commitfest:
> https://commitfest.postgresql.org/24/2067
>

OK, that's what I assumed.

You're trying to treat this change as if it's a given that the other
functionality you want/propose is present in core or will be present in
core. That's far from given. My suggestion is to split it up so that the
parts can be reviewed and committed separately.


> In PgPRO-EE this problem was solved by binding sessions to backends,
> i.e. one backend can manage multiple sessions, but a session cannot
> migrate to another backend. The drawback of such a solution is obvious:
> one long-living transaction can block the transactions of all other
> sessions scheduled to that backend.
> The possibility of migrating a session to another backend is one of the
> obvious solutions to the problem. But the main showstopper for it is
> temporary tables.
> This is why I consider moving temporary tables to shared buffers a very
> important step.
>

I can see why it's important for your use case.

I am not disagreeing.

I am however strongly suggesting that your patch has two fairly distinct
functional changes in it, and you should separate them out.

* Introduce global temp tables, a new relkind that works like a temp table
but doesn't require catalog changes. Uses per-backend relfilenode and
cleanup like existing temp tables. You could extend the relmapper to handle
the mapping of relation oid to per-backend relfilenode.

* Associate global temp tables with session state and manage them in
shared_buffers so they can work with the in-core connection pooler (if
committed)

Historically we've had a few efforts to get in-core connection pooling that
haven't gone anywhere. Without your pooler patch the changes you make to
use shared_buffers etc are going to be unhelpful at best, if not actively
harmful to performance, and will add unnecessary complexity. So I think
there's a logical series of patches here:

* global temp table relkind and support for it
* session state separation
* connection pooling
* pooler-friendly temp tables in shared_buffers

Make sense?


> But even if we forget about the built-in connection pooler, don't you think
> that the possibility of using parallel query plans with temporary tables is
> itself strong enough motivation to access global temp tables through shared
> buffers?
>

I can see a way to share temp tables across parallel query backends being
very useful for DW/OLAP workloads, yes. But I don't know if putting them in
shared_buffers is the right answer for that. We have DSM/DSA, we have
shm_mq, various options for making temp buffers share-able with parallel
worker backends.

I'm suggesting that you not tie the whole (very useful) global temp tables
feature to this, but instead split it up into logical units that can be
understood, reviewed and committed separately.

I would gladly participate in review.

> > Would DISCARD TEMP_BUFFERS meet your needs?
>
> Actually I have already implemented a DropLocalBuffers function (three
> lines of code :)
>
> [...]
>
> I do not think that we need such a command at the user level (i.e. a
> corresponding SQL command).
>

I'd be very happy to have it personally, but don't think it needs to be
tied in with your patch set here. Maybe I can cook up a patch soon.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Problem with default partition pruning

2019-08-08 Thread Amit Langote
On Fri, Aug 9, 2019 at 1:17 PM Amit Langote  wrote:
> On Fri, Aug 9, 2019 at 12:09 PM Kyotaro Horiguchi
>  wrote:
> > At Thu, 8 Aug 2019 14:50:54 +0900, Amit Langote wrote:
> > > When working on it, I realized
> > > that the way RelOptInfo.partition_qual is processed is a bit
> > > duplicative, so I created a separate patch to make that a bit more
> > > consistent.
> >
> > 0001 seems reasonable. By the way, the patch doesn't touch
> > get_relation_constraints(), but I suppose it can use the modified
> > partition constraint qual already stored in rel->partition_qual
> > in set_relation_partition_info. And we could move constifying to
> > set_relation_partition_info?
>
> Ah, good advice.  This makes partition constraint usage within the
> planner quite a bit more consistent.

Hmm, oops.  I think that judgement was a bit too rushed on my part.  I
unintentionally ended up making the partition constraint *always*
be fetched, whereas we don't need it in most cases.  I've reverted
that change.  RelOptInfo.partition_qual is poorly named in retrospect.
:(  It's not set for all partitions, only those that are partitioned
themselves.

Attached updated patches.

Thanks,
Amit


v3-0001-Improve-RelOptInfo.partition_qual-usage.patch
Description: Binary data


v3-0002-Improve-constraint-exclusion-usage-in-partprune.c.patch
Description: Binary data


Re: Regression test failure in regression test temp.sql

2019-08-08 Thread Michael Paquier
On Wed, Aug 07, 2019 at 10:17:25AM -0400, Tom Lane wrote:
> Not objecting to the patch, exactly, just feeling like there's
> more here than meets the eye.  Not quite sure if it's worth
> investigating closer, or what we'd even need to do to do so.

Yes, something's weird here.  I'd think that the index only scan
ensures a proper ordering in this case, so it could be possible that a
different plan got selected here?  That would mean that the plan
selected would not be an index-only scan or an index scan.  So perhaps
that was a bitmap scan?

> BTW, I realize from looking at the plan that LIKE is interpreting the
> underscores as wildcards.  Maybe it's worth s/_/\_/ while you're

Right.  Looking around, there are many more tests with the same
problem.  This could become an issue if other tests running in
parallel use relation names matching the same pattern, which is not an
issue as of HEAD, so I'd rather just back-patch the ORDER BY part of
it (temp.sql is the only test missing that).  What do you think about
the attached?
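To illustrate the point about the underscores: in LIKE, `%` matches any sequence and an unescaped `_` matches any single character, so `'at_partitioned%'` can match relations it was never meant to. A small Python model of LIKE semantics (an illustrative sketch, not Postgres code):

```python
import re

def like_to_regex(pattern: str) -> str:
    """Model SQL LIKE: % -> .*, unescaped _ -> any one char, backslash escapes."""
    out, i = "", 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            out += re.escape(pattern[i + 1])  # escaped char is literal
            i += 2
            continue
        if c == "%":
            out += ".*"
        elif c == "_":
            out += "."  # wildcard: any single character
        else:
            out += re.escape(c)
        i += 1
    return out

def like(name: str, pattern: str) -> bool:
    return re.fullmatch(like_to_regex(pattern), name) is not None

# Unescaped: the underscore matches 'X', catching an unrelated relation.
assert like("atXpartitioned_1", "at_partitioned%")
# Escaped: only a literal underscore matches.
assert not like("atXpartitioned_1", "at\\_partitioned%")
assert like("at_partitioned_child", "at\\_partitioned%")
```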
--
Michael
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index e5407bbf0f..577210e1ad 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2067,7 +2067,7 @@ insert into at_partitioned values(1, 'foo');
 insert into at_partitioned values(3, 'bar');
 create temp table old_oids as
   select relname, oid as oldoid, relfilenode as oldfilenode
-  from pg_class where relname like 'at_partitioned%';
+  from pg_class where relname like 'at\_partitioned%';
 select relname,
   c.oid = oldoid as orig_oid,
   case relfilenode
@@ -2078,7 +2078,7 @@ select relname,
 end as storage,
   obj_description(c.oid, 'pg_class') as desc
   from pg_class c left join old_oids using (relname)
-  where relname like 'at_partitioned%'
+  where relname like 'at\_partitioned%'
   order by relname;
relname| orig_oid | storage | desc  
 --+--+-+---
@@ -2091,7 +2091,7 @@ select relname,
 (6 rows)
 
 select conname, obj_description(oid, 'pg_constraint') as desc
-  from pg_constraint where conname like 'at_partitioned%'
+  from pg_constraint where conname like 'at\_partitioned%'
   order by conname;
conname|desc
 --+
@@ -2112,7 +2112,7 @@ select relname,
 end as storage,
   obj_description(c.oid, 'pg_class') as desc
   from pg_class c left join old_oids using (relname)
-  where relname like 'at_partitioned%'
+  where relname like 'at\_partitioned%'
   order by relname;
relname| orig_oid | storage | desc 
 --+--+-+--
@@ -2125,7 +2125,7 @@ select relname,
 (6 rows)
 
 select conname, obj_description(oid, 'pg_constraint') as desc
-  from pg_constraint where conname like 'at_partitioned%'
+  from pg_constraint where conname like 'at\_partitioned%'
   order by conname;
conname|   desc
 --+---
@@ -2189,7 +2189,7 @@ Inherits: test_inh_check
 
 select relname, conname, coninhcount, conislocal, connoinherit
   from pg_constraint c, pg_class r
-  where relname like 'test_inh_check%' and c.conrelid = r.oid
+  where relname like 'test\_inh\_check%' and c.conrelid = r.oid
   order by 1, 2;
relname|conname | coninhcount | conislocal | connoinherit 
 --++-++--
@@ -2220,7 +2220,7 @@ Inherits: test_inh_check
 
 select relname, conname, coninhcount, conislocal, connoinherit
   from pg_constraint c, pg_class r
-  where relname like 'test_inh_check%' and c.conrelid = r.oid
+  where relname like 'test\_inh\_check%' and c.conrelid = r.oid
   order by 1, 2;
relname|conname | coninhcount | conislocal | connoinherit 
 --++-++--
@@ -2260,7 +2260,7 @@ Inherits: test_inh_check
 
 select relname, conname, coninhcount, conislocal, connoinherit
   from pg_constraint c, pg_class r
-  where relname like 'test_inh_check%' and c.conrelid = r.oid
+  where relname like 'test\_inh\_check%' and c.conrelid = r.oid
   order by 1, 2;
relname|conname | coninhcount | conislocal | connoinherit 
 --++-++--
@@ -2300,7 +2300,7 @@ Inherits: test_inh_check
 
 select relname, conname, coninhcount, conislocal, connoinherit
   from pg_constraint c, pg_class r
-  where relname like 'test_inh_check%' and c.conrelid = r.oid
+  where relname like 'test\_inh\_check%' and c.conrelid = r.oid
   order by 1, 2;
relname|conname | coninhcount | conislocal | connoinherit 
 

Re: Problem with default partition pruning

2019-08-08 Thread Amit Langote
Horiguchi-san,

Thanks for the review.

On Fri, Aug 9, 2019 at 12:09 PM Kyotaro Horiguchi
 wrote:
> At Thu, 8 Aug 2019 14:50:54 +0900, Amit Langote wrote:
> > When working on it, I realized
> > that the way RelOptInfo.partition_qual is processed is a bit
> > duplicative, so I created a separate patch to make that a bit more
> > consistent.
>
> 0001 seems reasonable. By the way, the patch doesn't touch
> get_relation_constraints(), but I suppose it can use the modified
> partition constraint qual already stored in rel->partition_qual
> in set_relation_partition_info. And we could move constifying to
> set_relation_partition_info?

Ah, good advice.  This makes partition constraint usage within the
planner quite a bit more consistent.

> Also, I'd like to see comments noting that partition_qual is
> already varnode-fixed.

Added a one-line comment.

> And 0002, yeah, just +1 from me.

Thanks.

Attached updated patches; only 0001 changed per above comments.

Regards,
Amit


v2-0001-Improve-RelOptInfo.partition_qual-usage.patch
Description: Binary data


v2-0002-Improve-constraint-exclusion-usage-in-partprune.c.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-08-08 Thread Amit Kapila
On Fri, Aug 9, 2019 at 1:57 AM Robert Haas  wrote:
>
> On Thu, Aug 8, 2019 at 9:31 AM Andres Freund  wrote:
> > I know that Robert is working on a patch that revises the undo request
> > layer somewhat, it's possible that this is best discussed afterwards.
>
> Here's what I have at the moment.  This is not by any means a complete
> replacement for Amit's undo worker machinery, but it is a significant
> redesign (and I believe a significant improvement to) the queue
> management stuff from Amit's patch.
>

Thanks for working on this.  Neither Kuntal nor I have had time to
look into this part in detail.

>  I wrote this pretty quickly, so
> while it passes simple testing, it probably has a number of bugs, and
> to actually use it, it would need to be integrated with xact.c;
>

I can look into this and integrate it with other parts of the patch next
week unless you are planning to do so.  Right now, I am working on
addressing some other comments raised on the patches, which I will share
today or early next week, after which I can start looking into this.  I
hope that is fine with you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Problem with default partition pruning

2019-08-08 Thread Kyotaro Horiguchi
Sorry for being late. I found it was committed before I caught up
with this thread again.

At Thu, 8 Aug 2019 14:50:54 +0900, Amit Langote wrote:
> Hi Alvaro,
> 
> On Thu, Aug 8, 2019 at 5:27 AM Alvaro Herrera wrote:
> > On 2019-Aug-07, Simon Riggs wrote:
> > > I saw your recent commit and it scares me in various places, noted below.
> > >
> > > "Commit: Apply constraint exclusion more generally in partitioning"
> > >
> > > "This applies particularly to the default partition..."
> > >
> > > My understanding of the thread was the complaint was about removing the
> > > default partition. I would prefer to see code executed just for that case,
> > > so that people who do not define a default partition are unaffected.
> >
> > Well, as the commit message noted, it applies to other cases also, not
> > just the default partition.  The default partition just happens to be
> > the most visible case.
> 
> Just to be clear, I don't think there was any patch posted on this
> thread that was to address *non-default* partitions failing to be
> pruned by "partition pruning".  If that had been the problem, we'd be
> fixing the bugs of the partition pruning code, not apply constraint
> exclusion more generally to paper over such bugs.  I may be misreading
> what you wrote here though.
> 
> The way I interpret the "generally" in the "apply constraint exclusion
> more generally" is thus: we can't prune the default partition without
> the constraint exclusion crutches for evidently a broader set of
> clauses than the previous design assumed.  The previous design assumed
> that only OR clauses whose arguments contradicted the parent's
> partition constraint are problematic, but evidently any clause set
> that contradicts the partition constraint is problematic.  Again, the
> problem is that it's impossible to prune the "default" partition with
> such clauses, not the *non-default* ones -- values extracted from
> contradictory clauses would not match any of the bounds so all
> non-default partitions would be pruned that way.
> 
> By the way, looking closer at the patch committed today, I realized I
> had misunderstood what you proposed as the *4th* possible place to
> move the constraint exclusion check to.  I had misread the proposal
> and thought you meant to move it outside the outermost loop of
> gen_partprune_steps_internal(), but that's not where the check is now.
> I think it's better to call predicate_refuted_by() only once by
> passing the whole list of clauses instead of for each clause
> separately.  The result would be the same but the former would be more
> efficient, because it avoids repeatedly paying the cost of setting up
> predtest.c data structures when predicate_refuted_by() is called.
> Sorry that I'm only saying this now.

+1 as I mentioned in [1].

> Also it wouldn't be incorrect to do the check only if the parent has a
> default partition.  That will also address Simon's concern that this
> might slow down the cases where this effort is useless.
> 
> I've attached a patch that does that.  When working on it, I realized
> that the way RelOptInfo.partition_qual is processed is a bit
> duplicative, so I created a separate patch to make that a bit more
> consistent.

0001 seems reasonable. By the way, the patch doesn't touch
get_relation_constraints(), but I suppose it can use the modified
partition constraint qual already stored in rel->partition_qual
in set_relation_partition_info. And we could move constifying to
set_relation_partition_info?

Also, I'd like to see comments noting that partition_qual is
already varnode-fixed.

And 0002, yeah, just +1 from me.


[1] 
https://www.postgresql.org/message-id/20190409.173725.31175835.horiguchi.kyot...@lab.ntt.co.jp

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add "password_protocol" connection parameter to libpq

2019-08-08 Thread Michael Paquier
On Thu, Aug 08, 2019 at 03:38:20PM -0700, Jeff Davis wrote:
> Libpq doesn't have a way to control which password protocols are used.
> For example, the client might expect the server to be using SCRAM, but
> it actually ends up using plain password authentication instead.

Thanks for working on this!

> I'm not 100% happy with the name "password_protocol", but other names I
> could think of seemed likely to cause confusion.

What about auth_protocol then?  It seems to me that it could be useful
to have the restriction on AUTH_REQ_MD5 as well.

> Sets the least-secure password protocol allowable when using password
> authentication. Options are: "plaintext", "md5", "scram-sha-256", or
> "scram-sha-256-plus".

This makes it sound like there is a linear hierarchy among all those
protocols, which is true in this case, but if the list of supported
protocols is extended in the future it may not be.
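That linear-hierarchy assumption could be sketched as follows (a hypothetical Python model, not libpq code; the list and function names are assumptions). The index comparison is only meaningful while the protocol list stays totally ordered by strength, which is exactly the caveat above:

```python
# Hypothetical strength ordering, weakest first, mirroring the proposed
# "least-secure allowable" semantics of the connection parameter.
PROTOCOL_ORDER = ["plaintext", "md5", "scram-sha-256", "scram-sha-256-plus"]

def protocol_allowed(negotiated: str, minimum: str) -> bool:
    """True if the server-chosen protocol is at least as strong as the floor."""
    return PROTOCOL_ORDER.index(negotiated) >= PROTOCOL_ORDER.index(minimum)

# The client would refuse plain password auth when SCRAM is the floor:
assert protocol_allowed("scram-sha-256", "md5")
assert not protocol_allowed("plaintext", "scram-sha-256")
```

A future protocol that is neither strictly weaker nor stronger than SCRAM would have no natural position in such a list, which is why a set of authorized protocols may age better than a single floor.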

I think that this should have TAP tests in src/test/authentication/ so
that we make sure of the semantics.  For the channel-binding part, the
right place for the tests would be src/test/ssl/.

+#define DefaultPasswordProtocol "plaintext"
I think that we are going to need another default value for that, like
"all", to reduce confusion about SCRAM, MD5 and co still being
included in the authorized set in this case.

Another thing that was discussed on the topic would be to allow a list
of authorized protocols instead.  I personally don't think that we
necessarily need to go this way, but it could make things like
scram-sha-256,scram-sha-256-plus easier to integrate into
application flows.
--
Michael


signature.asc
Description: PGP signature


Re: Small patch to fix build on Windows

2019-08-08 Thread Michael Paquier
On Thu, Aug 08, 2019 at 10:46:07PM +0300, Dmitry Igrishin wrote:
> This looks nice for a Perl hacker :-). As for me, it looks unusual and
> a bit confusing. I have never programmed in Perl, but I was able to
> quickly understand where the problem lies thanks to the style adopted
> in other languages, where the contents are enclosed in quotation marks
> and the quotation marks are escaped if they are part of the contents.
> So, should I fix it?  Any thoughts?

FWIW, I like Alvaro's suggestion about qq{} in this case, as it makes
sure that double-quotes are correctly applied where they should be.
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Sehrope Sarkuni
On Thu, Aug 8, 2019 at 6:31 PM Stephen Frost  wrote:

> Strictly speaking, that isn't actually crash recovery, it's physical
> replication / HA, and while those are certainly nice to have it's no
> guarantee that they're required or that you'd want to have the same keys
> for them- conceptually, at least, you could have WAL with one key that
> both sides know and then different keys for the actual data files, if we
> go with the approach where the WAL is encrypted with one key and then
> otherwise is plaintext.
>

I like the idea of separating the WAL key from the rest of the data files.
It'd all be unlocked by the MDEK and you'd still need derived keys per
WAL-file, but disconnecting all that from the data files solves a lot of
the problems with promoted replicas.

This would complicate cloning a replica, as using a different MDEK would
involve decrypting / encrypting everything rather than just copying the
files. Even if that's not baked into a first version, the separation allows
for eventually supporting it.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Sehrope Sarkuni
On Thu, Aug 8, 2019 at 2:16 PM Bruce Momjian  wrote:

> On Wed, Aug  7, 2019 at 08:56:18AM -0400, Sehrope Sarkuni wrote:
> > Simplest approach for derived keys would be to use immutable attributes
> of the
> > WAL files as an input to the key derivation. Something like HKDF(MDEK,
> "WAL:" |
>
> So, I am thinking we should use "WAL:" for WAL and "REL:" for heap/index
> files.
>

Sounds good. Any unique convention is fine. Main thing to keep in mind is
that they're directly tied to the master key so it's not possible to rotate
them without changing the master key.
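As a sketch of the kind of derivation being discussed (assumptions: HKDF with SHA-256 per RFC 5869, a placeholder MDEK, and an illustrative byte encoding of the `"WAL:" || timeline_id || wal_segment_num` context; none of this is the actual patch code):

```python
import hashlib
import hmac

def hkdf_sha256(key: bytes, info: bytes, length: int = 32) -> bytes:
    """Minimal HKDF (RFC 5869) with SHA-256: extract, then expand."""
    salt = b"\x00" * hashlib.sha256().digest_size  # default zero salt
    prk = hmac.new(salt, key, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step: T(i) = HMAC(PRK, T(i-1)|info|i)
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

MDEK = b"\x01" * 32  # placeholder master key, for illustration only

def derive_wal_key(timeline_id: int, segment_num: int) -> bytes:
    # Context mirrors the proposed "WAL:" || timeline_id || segment_num input;
    # the fixed-width encoding here is an assumption.
    info = b"WAL:" + timeline_id.to_bytes(4, "big") + segment_num.to_bytes(8, "big")
    return hkdf_sha256(MDEK, info)
```

Each WAL segment gets its own key deterministically, but every derived key changes whenever the MDEK changes, which is the rotation coupling described above.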

This is in contrast to saving a WDEK key to a file (similar to how the MDEK
key would be saved) and unlocking it with the MDEK. That has more moving
parts but would allow that key to be independent of the MDEK. In a later
message Stephen refers to an example of a replica receiving encrypted WAL
and applying it with a different MDEK for the page buffers. That's doable
with an independent WDEK.


> > | timeline_id || wal_segment_num) should be fine for this as it is:
>
> I considered using the timeline in the nonce, but then remembered that
> in timeline switch, we _copy_ the part of the old WAL up to the timeline
> switch to the new timeline;  see:
>
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlog.c;h=f55352385732c6b0124eff5265462f3883fe7435;hb=HEAD#l5502
>
>* Initialize the starting WAL segment for the new timeline. If the
> switch
>* happens in the middle of a segment, copy data from the last WAL
> segment
>* of the old timeline up to the switch point, to the starting WAL
> segment
>* on the new timeline.
>
> We would need to decrypt/encrypt to do the copy, and I just wasn't sure
> of the value of the timeline in the nonce.  One value to it is that if
> there is some WAL generated after the timeline switch in the old
> primary that isn't transferred, there would potentially be new data
> encrypted with the same key/nonce in the new primary, but if that WAL is
> not used, odds are it is gone/destroyed/inaccessible, or it would have
> been used during the switchover, so it didn't seem worth worrying about.
>
> One _big_ reason to add the timeline is if you had a standby that you
> recovered and rolled forward only to a specific transaction, then
> continued running it as a new primary.  In that case, you would have
> different WAL encrypted with the same key/nonce, but that sounds like
> the same as promoting two standbys, and we should just document not to
> do it.
>
> Maybe we need to consider this further.
>

Good points. Yes, anything short of generating a new key at promotion time
will have these issues. If we're not going to do that, no point in adding
the timeline id if it does not change anything. I had thought only the
combo was unique but sounds like the segment number is unique on its own.
One thing I like about a unique per-file key is that it simplifies the IV
generation (i.e. can start at zero).

What about discarding the rest of the WAL file at promotion and skipping to
a new file? A random per-file key in the first page header would
ensure that going forward all WAL data is encrypted differently. Combine
that with independent WAL and MDEK keys and everything would be different
between two replicas promoted from the same point.


> > A unique WDEK per WAL file that is derived from the segment number would
> not
> > have that problem. A unique key per-file means the IVs can all start at
> zero
> > and the each file can be treated as one encrypted stream. Any encryption/
> > decryption code would only need to touch the write/read callsites.
>
> So, I am now wondering when we should be using a non-zero nonce to
> start, and when we should be using derived keys.   Should we add the
> page-number to the derived key for heap/index files too and just use the
> LSN for nonce, or add the LSN to the derived key too?
>

The main cost of using multiple keys is that you need to derive or unlock
them for each usage.

A per-type, per-relation, or per-file derived key with the same
non-repeating guarantees for the IV (ex: LSN + Page Number) is as secure
but allows for caching all needed derived keys in memory (it's one per open
file descriptor).
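A minimal sketch of that per-open-file caching idea (hypothetical Python; a plain HMAC stands in for the full HKDF above, and `relfile_id` plus the `"REL:"` context are illustrative assumptions):

```python
import hashlib
import hmac
from functools import lru_cache

MDEK = b"\x02" * 32  # placeholder master key, for illustration only

@lru_cache(maxsize=None)
def derive_file_key(relfile_id: int) -> bytes:
    """Derive (and cache) one key per relation file: one derivation per
    open file, not one per page. HMAC here stands in for a full HKDF."""
    info = b"REL:" + relfile_id.to_bytes(8, "big")
    return hmac.new(MDEK, info, hashlib.sha256).digest()

# Every page read/write on the same file reuses the cached key:
k = derive_file_key(16384)
assert derive_file_key(16384) == k   # second lookup is a cache hit
```

With one 32-byte key per file descriptor the whole cache stays small, whereas per-page derived keys would multiply both the derivation calls and the cache footprint, as noted below.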

Having page-level derived keys incorporating the LSN + Page Number and
starting the per-page IV at zero works, but you'd have to perform an HKDF
for each page read or write. A cache of those derived keys would be much
larger (32 bytes per page), so presumably you're not going to have them all
cached, or maybe not bother with any caching at all.


> > Even without a per-page MAC, a MAC at some level for WAL has its own
> benefits
> > such as perfect corruption detection. It could be per-record,
> per-N-records,
> > per-checkpoint, or per-file. The current WAL file format already handles
> > arbitrary gaps so there is significantly more flexibility in adding it vs
> > pages. I'm not saying it should be a requirement but, unlike pages, I
> would not

Re: Refactoring code stripping trailing \n and \r from strings

2019-08-08 Thread Michael Paquier
On Wed, Aug 07, 2019 at 10:10:36AM +0900, Michael Paquier wrote:
> Thanks for the review, Bruce!  Tom, do you have any objections?

Hearing nothing but cicadas from outside, applied.  There were some
warnings I missed with the first version, which are fixed.
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Bruce Momjian
On Thu, Aug  8, 2019 at 06:31:42PM -0400, Stephen Frost wrote:
> > >Crash recovery doesn't happen "all the time" and neither does vacuum
> > >freeze, and autovacuum processes are independent of individual client
> > >backends- we don't need to (and shouldn't) have the keys in shared
> > >memory.
> > 
> > Don't people do physical replication / HA pretty much all the time?
> 
> Strictly speaking, that isn't actually crash recovery, it's physical
> replication / HA, and while those are certainly nice to have it's no
> guarantee that they're required or that you'd want to have the same keys
> for them- conceptually, at least, you could have WAL with one key that
> both sides know and then different keys for the actual data files, if we
> go with the approach where the WAL is encrypted with one key and then
> otherwise is plaintext.

Uh, yes, you could have two encryption keys in the data directory, one
for heap/indexes, one for WAL, both unlocked with the same passphrase,
but what would be the value in that?

> > >>That might allow crash recovery and the freeze part of VACUUM FREEZE to
> > >>work.  (I don't think we could vacuum since we couldn't read the index
> > >>pages to find the matching rows since the index values would be encrypted
> > >>too.  We might be able to not encrypt the tid in the index typle.)
> > >
> > >Why do we need the indexed values to vacuum the index..?  We don't
> > >today, as I recall.  We would need the tids though, yes.
> > 
> > Well, we also do collect statistics on the data, for example. But even
> > if we assume we wouldn't do that for encrypted indexes (which seems like
> > a pretty bad idea to me), you'd probably end up leaking information
> > about ordering of the values. Which is generally a pretty serious
> > information leak, AFAICS.
> 
> I agree entirely that order information would be bad to leak- but this
> is all new ground here and we haven't actually sorted out what such a
> partially encrypted btree would look like.  We don't actually have to
> have the down-links in the tree be unencrypted to allow vacuuming of
> leaf pages, after all.

Agreed, but I think we kind of know that the value in cluster-wide
encryption is different from multi-key encryption --- both have their
value, but right now cluster-wide is the easiest and simplest, and
probably meets more user needs than multi-key encryption.  If others
want to start scoping out what multi-key encryption would look like, we
can discuss it.  I personally would like to focus on cluster-wide
encryption for PG 13.

> > >>Is this something considering in version one of this feature?  Probably
> > >>not, but later?  Never?  Would the information leakage be too great,
> > >>particularly from indexes?
> > >
> > >What would be leaking from the indexes..?  That an encrypted blob in the
> > >index pointed to a given tid?  Wouldn't someone be able to see that same
> > >information by looking directly at the relation too?
> > 
> > Ordering of values, for example. Depending on how exactly the data is
> > encrypted we might also be leaking information about which values are
> > equal, etc. It also seems quite a bit more expensive to use such index.
> 
> Using an encrypted index isn't going to be free.  It's not clear that
> this would be much more expensive than if the entire index is encrypted,
> or that people would actually be unhappy if there was such an additional
> expense if it meant that they could have vacuum run without the keys.

Yes, I think it is information leakage that is always going to keep
multi-key encryption from fulfilling all the features of cluster-wide
encryption.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: POC: converting Lists into arrays

2019-08-08 Thread David Rowley
On Fri, 9 Aug 2019 at 09:55, Tom Lane  wrote:
> Attached is a patch that removes simple_rte_array in favor of just
> accessing the query's rtable directly.  I concluded that there was
> not much point in messing with simple_rel_array or append_rel_array,
> because they are not in fact just mirrors of some List.  There's no
> List at all of baserel RelOptInfos, and while we do have a list of
> AppendRelInfos, it's a compact, random-order list not one indexable
> by child relid.
>
> Having done this, though, I'm a bit discouraged about whether to commit
> it.  In light testing, it's not any faster than HEAD and in complex
> queries seems to actually be a bit slower.  I suspect the reason is
> that we've effectively replaced
> root->simple_rte_array[i]
> with
> root->parse->rtable->elements[i-1]
> and the two extra levels of indirection are taking their toll.

If there are no performance gains from this then -1 from me.  We're
all pretty used to it the way it is.

> I realized that it's pretty dumb not to
> just merge setup_append_rel_array into setup_simple_rel_arrays.
> The division of labor there is inconsistent with the fact that
> there's no such division in expand_planner_arrays.

ha, yeah I'd vote for merging those. It was coded that way originally
until someone objected! :)

> I still have hopes for getting rid of es_range_table_array though,
> and will look at that tomorrow or so.

Yes, please. I've measured that to be quite an overhead with large
partitioning setups. However, that was with some additional code which
didn't lock partitions until it was... well, too late... as it
turned out. But it seems pretty good to remove code that could be a
future bottleneck if we ever manage to do something else with the
locking of all partitions during UPDATE/DELETE.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: POC: converting Lists into arrays

2019-08-08 Thread David Rowley
On Fri, 9 Aug 2019 at 04:24, Tom Lane  wrote:
> I wrote:
> > Perhaps there's an argument for doing something to change the behavior
> > of list_union and list_difference and friends.  Not sure --- it could
> > be a foot-gun for back-patching.  I'm already worried about the risk
> > of back-patching code that assumes the new semantics of list_concat.
> > (Which might be a good argument for renaming it to something else?
> > Just not list_union, please.)
>
> Has anyone got further thoughts about naming around list_concat
> and friends?
>
> If not, I'm inclined to go ahead with the concat-improvement patch as
> proposed in [1], modulo the one improvement David spotted.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/6704.1563739...@sss.pgh.pa.us

I'm okay with the patch once that one improvement is done.

I think if we want to think about freeing the 2nd input List then we
can do that in another commit. Removing the redundant list_copy()
calls seems quite separate from that.

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: block-level incremental backup

2019-08-08 Thread Jeevan Ladhe
Hi Jeevan,

I have reviewed the backup part at the code level and am still looking into
the restore (combine) and functional parts. Here are my comments so
far:

The patches need rebase.

+   if (!XLogRecPtrIsInvalid(previous_lsn))
+   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
+                    (uint32) (previous_lsn >> 32), (uint32) previous_lsn);

Maybe we should rename it to something like
"INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START
LOCATION"
to make it more intuitive?
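For illustration, with the suggested rename the label file of an incremental
backup might then read like this (the LSN and file values here are made up,
not taken from the patch):

```text
START WAL LOCATION: 0/C000028 (file 00000001000000000000000C)
INCREMENTAL BACKUP START WAL LOCATION: 0/B000060
```

which makes it obvious at a glance which line is the incremental baseline.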



+typedef struct
+{
+   uint32      magic;
+   pg_crc32c   checksum;
+   uint32      nblocks;
+   uint32      blocknumbers[FLEXIBLE_ARRAY_MEMBER];
+} partial_file_header;


File header structure is defined in both the files basebackup.c and
pg_combinebackup.c. I think it is better to move this to
replication/basebackup.h.



+   bool        isrelfile = false;

I think we can avoid having the isrelfile flag in sendFile().
Something like this:

if (startincrptr && OidIsValid(dboid) && looks_like_rel_name(filename))
{
    /* include here the code that is under the "if (isrelfile)" block */
}
else
{
    _tarWriteHeader(tarfilename, NULL, statbuf, false);
    while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
    {
        ...
    }
}



Also, having isrelfile as part of the following condition:
{code}
+   while (!isrelfile &&
+          (cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
{code}

is confusing, because even the relation files in a full backup are going to be
backed up by this loop only, and yet the condition reads '(!isrelfile && ...)'.



verify_page_checksum()
{
    while (1)
    {
        ...
        break;
    }
}

IMHO, while labels are not advisable in general, it may be better to use a
label here rather than a while(1) loop, so that we can jump to the label when
we want to retry once. The while(1) form opens the door to future bugs: if
someone adds code here and ends up making the break conditional, we are
left with an infinite loop.
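A minimal sketch of the suggested shape (the function and variable names here
are invented for illustration, not the patch's actual code): a label makes
the single retry explicit, so a later conditional break cannot quietly turn
the loop into an infinite one.

```c
#include <stdbool.h>

static int reads;               /* counts simulated page reads */

/* stand-in for re-reading the page and recomputing its checksum */
static bool
checksum_matches(void)
{
    /* pretend the first read raced with a concurrent write */
    return ++reads > 1;
}

/* retry exactly once via a label instead of while(1)/break */
static bool
verify_page_checksum(void)
{
    bool        retried = false;

recheck:
    if (!checksum_matches())
    {
        if (!retried)
        {
            retried = true;
            goto recheck;       /* one re-read, then give up */
        }
        return false;           /* genuine checksum failure */
    }
    return true;
}
```

The control flow cannot loop more than twice no matter what code is later
added between the label and the checks.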



+/* magic number in incremental backup's .partial file */
+#define INCREMENTAL_BACKUP_MAGIC   0x494E4352

Similar to the partial_file_header structure, I think the above macro can
also be moved to basebackup.h instead of being defined twice.



In sendFile():

+   buf = (char *) malloc(RELSEG_SIZE * BLCKSZ);

I think this is a huge memory request (1GB) and may fail on a busy/loaded
server at times. We should check for malloc failure and throw an error on
getting ENOMEM as errno.
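As a sketch of that suggestion (the error reporting here is plain
stderr/exit for illustration; the backend would use its own error-reporting
machinery, and the helper name is invented):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RELSEG_SIZE 131072      /* blocks per segment, as in the review */
#define BLCKSZ      8192

/* allocate the per-segment read buffer, failing loudly instead of
 * returning NULL to a caller that would dereference it */
static char *
alloc_segment_buffer(size_t nbytes)
{
    char       *buf = malloc(nbytes);

    if (buf == NULL)
    {
        fprintf(stderr, "could not allocate %zu bytes: %s\n",
                nbytes, strerror(errno));
        exit(EXIT_FAILURE);
    }
    return buf;
}
```

A caller would then do `buf = alloc_segment_buffer((size_t) RELSEG_SIZE * BLCKSZ);`
and never see a NULL pointer.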



+   /* Perform incremental backup stuff here. */
+   if ((cnt = fread(buf, 1, Min(RELSEG_SIZE * BLCKSZ,
statbuf->st_size), fp)) > 0)
+   {

Here, should we not expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ)? It
should be safe to always read just statbuf->st_size, I guess. But, I am OK
with having this extra guard here.



In sendFile(), I am sorry if I am missing something, but I am not able to
understand why 'cnt' and 'i' should have different values when they are
being
passed to verify_page_checksum(). I think passing only one of them should be
sufficient.



+   XLogRecPtr  pglsn;
+
+   for (i = 0; i < cnt / BLCKSZ; i++)
+   {

Maybe we should just have a variable no_of_blocks to store the number of
blocks, rather than computing cnt / BLCKSZ, say, RELSEG_SIZE (i.e. 131072)
times in the worst case.


+   len += cnt;
+   throttle(cnt);
+   }

Sorry if I am missing something, but should it not be just:

len = cnt;



As I said in my previous email, we no longer need decode_lsn_internal(),
as it is already taken care of by the introduction of
pg_lsn_in_internal().

Regards,
Jeevan Ladhe


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Bruce Momjian
On Thu, Aug  8, 2019 at 03:07:59PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Tue, Jul  9, 2019 at 11:09:01AM -0400, Bruce Momjian wrote:
> > > On Tue, Jul  9, 2019 at 10:59:12AM -0400, Stephen Frost wrote:
> > > > * Bruce Momjian (br...@momjian.us) wrote:
> > > > I agree that all of that isn't necessary for an initial implementation,
> > > > I was rather trying to lay out how we could improve on this in the
> > > > future and why having the keying done at a tablespace level makes sense
> > > > initially because we can then potentially move forward with further
> > > > segregation to improve the situation.  I do believe it's also useful in
> > > > its own right, to be clear, just not as nice since a compromised backend
> > > > could still get access to data in shared buffers that it really
> > > > shouldn't be able to, even broadly, see.
> > > 
> > > I think TDE is a feature of questionable value at best and the idea that
> > > we would fundamentally change the internals of Postgres to add more
> > > features to it seems very unlikely.  I realize we have to discuss it so
> > > we don't block reasonable future feature development.
> > 
> > I have a new crazy idea.  I know we concluded that allowing multiple
> > independent keys, e.g., per user, per table, didn't make sense since
> > they have to be unlocked all the time, e.g., for crash recovery and
> > vacuum freeze.
> 
> I'm a bit confused as I never agreed that made any sense and I continue
> to feel that it doesn't make sense to have one key for everything.

I clearly explained why multiple keys, while desirable, have many
negatives.  If you want to address my replies, we can go over them
again.  What people want, and what we can reasonably accomplish, are two
different things.

> Crash recovery doesn't happen "all the time" and neither does vacuum
> freeze, and autovacuum processes are independent of individual client
> backends- we don't need to (and shouldn't) have the keys in shared
> memory.

Uh, I just don't know what that would look like, honestly.  I am trying
to get us toward something that is easily implemented and easy to
control.
 
> > However, that assumes that all heap/index pages are encrypted, and all
> > of WAL.  What if we encrypted only the user-data part of the page, i.e.,
> > tuple data.  We left xmin/xmax unencrypted, and only stored the
> > encrypted part of that data in WAL, and didn't encrypt any more of WAL. 
> 
> This is pretty much what Alvaro was suggesting a while ago, isn't it..?
> Have just the user data be encrypted in the table and in the WAL stream.

Well, I think he was saying that to reduce the overhead of encryption. 
I didn't see it as a way of allowing recovery and vacuum freeze.  My
exact reply was:

> Well, you would need to decide what WAL information needs to be secured.
> Is the fact an insert was performed on a table a security issue?
> Depends on your risks.  My point is that almost anything you do beyond
> cluster-level encryption either adds complexity that is bug-prone or
> fragile, or adds unacceptable overhead, or leaks security information.

> > That might allow crash recovery and the freeze part of VACUUM FREEZE to
> > work.  (I don't think we could vacuum since we couldn't read the index
> > pages to find the matching rows since the index values would be encrypted
> > too.  We might be able to not encrypt the tid in the index tuple.)
> 
> Why do we need the indexed values to vacuum the index..?  We don't
> today, as I recall.  We would need the tids though, yes.

Uh, well, if we are doing index cleaning by doing a sequential scan of
the index, which I think we have done for many years, I think just
looking at the tids should work.  However, I don't know if we ever
adjust index entries, like re-balance the trees.

> > Is this something considered in version one of this feature?  Probably
> > not, but later?  Never?  Would the information leakage be too great,
> > particularly from indexes?
> 
> What would be leaking from the indexes..?  That an encrypted blob in the
> index pointed to a given tid?  Wouldn't someone be able to see that same
> information by looking directly at the relation too?

Well, I assume we would encrypt the heap and its indexes.  For example,
if there is an employee table, and there is an index on the employee
last name and employee salary, it would be trivial to get a list of
employee salaries sorted by last name by just joining the tids, though
you would not know the last names.  That seems like an information leak
to me.  Plus, which tables were updated would be visible in WAL.  And we
would have issues with system tables, pg_statistics, and lots of other
complexity.

I can see value in eventually doing this, perhaps before we perform
cluster-wide encryption, but doing it without cluster-wide encryption
seems like it would leak too much information to be useful.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB  

Add "password_protocol" connection parameter to libpq

2019-08-08 Thread Jeff Davis
Libpq doesn't have a way to control which password protocols are used.
For example, the client might expect the server to be using SCRAM, but
it actually ends up using plain password authentication instead.

This patch adds:

  password_protocol = {plaintext|md5|scram-sha-256|scram-sha-256-plus}

as a connection parameter. Libpq will then reject any authentication
request from the server that is less secure than this setting. Setting
it to "plaintext" (default) will answer to any kind of authentication
request.
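For example, a client that refuses anything weaker than SCRAM could use a
connection URI like this (the host and database names are made up, and the
parameter exists only with this patch applied):

```text
postgresql://app@db.example.com/appdb?password_protocol=scram-sha-256
```

With that setting, a server that asks for plaintext or MD5 authentication
would be rejected by libpq before any password is sent.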

I'm not 100% happy with the name "password_protocol", but other names I
could think of seemed likely to cause confusion.

Regards,
Jeff Davis

From 9d82cbad4e1bf1c3e159df6e7c8972c8fa2313ae Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Wed, 7 Aug 2019 20:17:44 -0700
Subject: [PATCH] Add "password_protocol" connection parameter to libpq.

Sets the least-secure password protocol allowable when using password
authentication. Options are: "plaintext", "md5", "scram-sha-256", or
"scram-sha-256-plus".

Without setting this option, it's possible that the server will use a
less-secure authentication method than the client expects.
---
 doc/src/sgml/libpq.sgml   | 34 +
 src/interfaces/libpq/fe-auth.c| 28 -
 src/interfaces/libpq/fe-connect.c | 42 +++
 src/interfaces/libpq/libpq-int.h  |  2 ++
 4 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index e7295abda28..b337a781560 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1118,6 +1118,40 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
  
 
+ 
+  password_protocol
+  
+  
+   Specifies the least-secure password protocol allowable when
+   authenticating with a password:
+   plaintext, md5,
+   scram-sha-256, or
+   scram-sha-256-plus. The default
+   is plaintext, meaning that any password protocol is
+   acceptable.
+  
+  
+Note that this setting is unrelated to the use of SSL. Use of the
+plaintext password protocol over SSL will be
+encrypted over the network, but the server will have access to the
+plaintext password.
+  
+  
+The scram-sha-256-plus password protocol uses
+channel binding, supported when communicating
+with PostgreSQL 11.0 or later
+servers. Channel binding additionally requires an SSL connection.
+  
+  
+The plaintext password protocol must be used when
+the server is using one of the following authentication
+methods: password,
+ldap, radius,
+pam, or bsd.
+  
+  
+ 
+
  
   connect_timeout
   
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index ab227421b3b..f9b23b457ca 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/interfaces/libpq/fe-auth.c
@@ -493,6 +493,16 @@ pg_SASL_init(PGconn *conn, int payloadlen)
 			selected_mechanism = SCRAM_SHA_256_NAME;
 	}
 
+	if (selected_mechanism &&
+		strcmp(selected_mechanism, SCRAM_SHA_256_NAME) == 0 &&
+		strcmp(conn->password_protocol, "scram-sha-256-plus") == 0)
+	{
+		printfPQExpBuffer(&conn->errorMessage,
+		  libpq_gettext(
+			  "server doesn't support SCRAM_SHA_256_PLUS, but it is required\n"));
+		goto error;
+	}
+
 	if (!selected_mechanism)
 	{
 		printfPQExpBuffer(&conn->errorMessage,
@@ -914,11 +924,27 @@ pg_fe_sendauth(AuthRequest areq, int payloadlen, PGconn *conn)
 			  libpq_gettext("Crypt authentication not supported\n"));
 			return STATUS_ERROR;
 
-		case AUTH_REQ_MD5:
 		case AUTH_REQ_PASSWORD:
+if (conn->password_protocol[0] == 'm')
+{
+	printfPQExpBuffer(&conn->errorMessage,
+	  "server not configured for MD5, but it was required\n");
+	return STATUS_ERROR;
+}
+/* FALL THROUGH */
+
+		case AUTH_REQ_MD5:
 			{
 char	   *password;
 
+if (conn->password_protocol[0] == 's')
+{
+	printfPQExpBuffer(&conn->errorMessage,
+	  "server not configured for %s, but it was required\n",
+	  SCRAM_SHA_256_NAME);
+	return STATUS_ERROR;
+}
+
 conn->password_needed = true;
 password = conn->connhost[conn->whichhost].password;
 if (password == NULL)
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index d262b57021d..b42f08ebbdd 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -123,6 +123,7 @@ static int	ldapServiceLookup(const char *purl, PQconninfoOption *options,
 #define DefaultTty		""
 #define DefaultOption	""
 #define DefaultAuthtype		  ""
+#define DefaultPasswordProtocol "plaintext"
 #define DefaultTargetSessionAttrs	"any"
 #ifdef USE_SSL
 #define DefaultSSLMode "prefer"
@@ -210,6 +211,10 @@ static const internalPQconninfoOption PQconninfoOptions[] = {
 		"Database-Password-File", "", 64,
 	offsetof(struct pg_conn, pgpassfile)},

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Thu, Aug 08, 2019 at 03:07:59PM -0400, Stephen Frost wrote:
> >* Bruce Momjian (br...@momjian.us) wrote:
> >>On Tue, Jul  9, 2019 at 11:09:01AM -0400, Bruce Momjian wrote:
> >>> On Tue, Jul  9, 2019 at 10:59:12AM -0400, Stephen Frost wrote:
> >>> > * Bruce Momjian (br...@momjian.us) wrote:
> >>> > I agree that all of that isn't necessary for an initial implementation,
> >>> > I was rather trying to lay out how we could improve on this in the
> >>> > future and why having the keying done at a tablespace level makes sense
> >>> > initially because we can then potentially move forward with further
> >>> > segregation to improve the situation.  I do believe it's also useful in
> >>> > its own right, to be clear, just not as nice since a compromised backend
> >>> > could still get access to data in shared buffers that it really
> >>> > shouldn't be able to, even broadly, see.
> >>>
> >>> I think TDE is a feature of questionable value at best and the idea that
> >>> we would fundamentally change the internals of Postgres to add more
> >>> features to it seems very unlikely.  I realize we have to discuss it so
> >>> we don't block reasonable future feature development.
> >>
> >>I have a new crazy idea.  I know we concluded that allowing multiple
> >>independent keys, e.g., per user, per table, didn't make sense since
> >>they have to be unlocked all the time, e.g., for crash recovery and
> >>vacuum freeze.
> >
> >I'm a bit confused as I never agreed that made any sense and I continue
> >to feel that it doesn't make sense to have one key for everything.
> >
> >Crash recovery doesn't happen "all the time" and neither does vacuum
> >freeze, and autovacuum processes are independent of individual client
> >backends- we don't need to (and shouldn't) have the keys in shared
> >memory.
> 
> Don't people do physical replication / HA pretty much all the time?

Strictly speaking, that isn't actually crash recovery, it's physical
replication / HA, and while those are certainly nice to have it's no
guarantee that they're required or that you'd want to have the same keys
for them- conceptually, at least, you could have WAL with one key that
both sides know and then different keys for the actual data files, if we
go with the approach where the WAL is encrypted with one key and then
otherwise is plaintext.

> >>However, that assumes that all heap/index pages are encrypted, and all
> >>of WAL.  What if we encrypted only the user-data part of the page, i.e.,
> >>tuple data.  We left xmin/xmax unencrypted, and only stored the
> >>encrypted part of that data in WAL, and didn't encrypt any more of WAL.
> >
> >This is pretty much what Alvaro was suggesting a while ago, isn't it..?
> >Have just the user data be encrypted in the table and in the WAL stream.
> 
> It's also moving us much closer to pgcrypto-style encryption ...

Yes, it is, and there's good parts and bad parts to that, to be sure.

> >>That might allow crash recovery and the freeze part of VACUUM FREEZE to
> >>work.  (I don't think we could vacuum since we couldn't read the index
> >>pages to find the matching rows since the index values would be encrypted
> >>too.  We might be able to not encrypt the tid in the index tuple.)
> >
> >Why do we need the indexed values to vacuum the index..?  We don't
> >today, as I recall.  We would need the tids though, yes.
> 
> Well, we also do collect statistics on the data, for example. But even
> if we assume we wouldn't do that for encrypted indexes (which seems like
> a pretty bad idea to me), you'd probably end up leaking information
> about ordering of the values. Which is generally a pretty serious
> information leak, AFAICS.

I agree entirely that order information would be bad to leak- but this
is all new ground here and we haven't actually sorted out what such a
partially encrypted btree would look like.  We don't actually have to
have the down-links in the tree be unencrypted to allow vacuuming of
leaf pages, after all.

> >>Is this something considered in version one of this feature?  Probably
> >>not, but later?  Never?  Would the information leakage be too great,
> >>particularly from indexes?
> >
> >What would be leaking from the indexes..?  That an encrypted blob in the
> >index pointed to a given tid?  Wouldn't someone be able to see that same
> >information by looking directly at the relation too?
> 
> Ordering of values, for example. Depending on how exactly the data is
> encrypted we might also be leaking information about which values are
> equal, etc. It also seems quite a bit more expensive to use such index.

Using an encrypted index isn't going to be free.  It's not clear that
this would be much more expensive than if the entire index is encrypted,
or that people would actually be unhappy if there was such an additional
expense if it meant that they could have vacuum run without the keys.

Thanks,

Stephen



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Tomas Vondra

On Thu, Aug 08, 2019 at 03:07:59PM -0400, Stephen Frost wrote:

Greetings,

* Bruce Momjian (br...@momjian.us) wrote:

On Tue, Jul  9, 2019 at 11:09:01AM -0400, Bruce Momjian wrote:
> On Tue, Jul  9, 2019 at 10:59:12AM -0400, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > I agree that all of that isn't necessary for an initial implementation,
> > I was rather trying to lay out how we could improve on this in the
> > future and why having the keying done at a tablespace level makes sense
> > initially because we can then potentially move forward with further
> > segregation to improve the situation.  I do believe it's also useful in
> > its own right, to be clear, just not as nice since a compromised backend
> > could still get access to data in shared buffers that it really
> > shouldn't be able to, even broadly, see.
>
> I think TDE is a feature of questionable value at best and the idea that
> we would fundamentally change the internals of Postgres to add more
> features to it seems very unlikely.  I realize we have to discuss it so
> we don't block reasonable future feature development.

I have a new crazy idea.  I know we concluded that allowing multiple
independent keys, e.g., per user, per table, didn't make sense since
they have to be unlocked all the time, e.g., for crash recovery and
vacuum freeze.


I'm a bit confused as I never agreed that made any sense and I continue
to feel that it doesn't make sense to have one key for everything.

Crash recovery doesn't happen "all the time" and neither does vacuum
freeze, and autovacuum processes are independent of individual client
backends- we don't need to (and shouldn't) have the keys in shared
memory.



Don't people do physical replication / HA pretty much all the time?



However, that assumes that all heap/index pages are encrypted, and all
of WAL.  What if we encrypted only the user-data part of the page, i.e.,
tuple data.  We left xmin/xmax unencrypted, and only stored the
encrypted part of that data in WAL, and didn't encrypt any more of WAL.


This is pretty much what Alvaro was suggesting a while ago, isn't it..?
Have just the user data be encrypted in the table and in the WAL stream.



It's also moving us much closer to pgcrypto-style encryption ...


That might allow crash recovery and the freeze part of VACUUM FREEZE to
work.  (I don't think we could vacuum since we couldn't read the index
pages to find the matching rows since the index values would be encrypted
too.  We might be able to not encrypt the tid in the index tuple.)


Why do we need the indexed values to vacuum the index..?  We don't
today, as I recall.  We would need the tids though, yes.



Well, we also do collect statistics on the data, for example. But even
if we assume we wouldn't do that for encrypted indexes (which seems like
a pretty bad idea to me), you'd probably end up leaking information
about ordering of the values. Which is generally a pretty serious
information leak, AFAICS.


Is this something considered in version one of this feature?  Probably
not, but later?  Never?  Would the information leakage be too great,
particularly from indexes?


What would be leaking from the indexes..?  That an encrypted blob in the
index pointed to a given tid?  Wouldn't someone be able to see that same
information by looking directly at the relation too?



Ordering of values, for example. Depending on how exactly the data is
encrypted we might also be leaking information about which values are
equal, etc. It also seems quite a bit more expensive to use such index.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Locale support

2019-08-08 Thread Thomas Munro
On Fri, Aug 9, 2019 at 6:19 AM Peter Geoghegan  wrote:
> On Thu, Aug 8, 2019 at 6:29 AM Yonatan Misgan  wrote:
> > So my question is: after developing the converter function, where do I put
> > it so that it can be accessed from PostgreSQL?
>
> Maybe you can take some inspiration from the postgresql-unit extension:
>
> https://github.com/df7cb/postgresql-unit

Here's a 5 minute bare bones extension with place holders functions
showing what I had in mind.  That is, assuming that "date" is a
reasonable type, and we're just talking about different ways of
converting to/from text.

https://github.com/macdice/calendars

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: converting Lists into arrays

2019-08-08 Thread Tom Lane
I wrote:
> BTW, further on the subject of performance --- I'm aware of at least
> these topics for follow-on patches:

> * Fix places that are maintaining arrays parallel to Lists for
> access-speed reasons (at least simple_rte_array, append_rel_array,
> es_range_table_array).

Attached is a patch that removes simple_rte_array in favor of just
accessing the query's rtable directly.  I concluded that there was
not much point in messing with simple_rel_array or append_rel_array,
because they are not in fact just mirrors of some List.  There's no
List at all of baserel RelOptInfos, and while we do have a list of
AppendRelInfos, it's a compact, random-order list not one indexable
by child relid.

Having done this, though, I'm a bit discouraged about whether to commit
it.  In light testing, it's not any faster than HEAD and in complex
queries seems to actually be a bit slower.  I suspect the reason is
that we've effectively replaced
root->simple_rte_array[i]
with
root->parse->rtable->elements[i-1]
and the two extra levels of indirection are taking their toll.
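To make the indirection count concrete, here is a toy mock-up of the two
access paths being compared (struct layouts drastically simplified; the List
field names follow the post-conversion array-based layout, and everything
else is a stand-in for the much larger real planner structs):

```c
#include <stddef.h>

/* drastically simplified stand-ins for the real planner structs */
typedef struct RangeTblEntry { int relid; } RangeTblEntry;
typedef union ListCell { void *ptr_value; } ListCell;
typedef struct List { int length; ListCell *elements; } List;
typedef struct Query { List *rtable; } Query;
typedef struct PlannerInfo
{
    Query          *parse;
    RangeTblEntry **simple_rte_array;   /* 1-based mirror of rtable */
} PlannerInfo;

/* one dependent load: root -> array slot */
static RangeTblEntry *
fetch_via_array(PlannerInfo *root, int rti)
{
    return root->simple_rte_array[rti];
}

/* three dependent loads: root -> parse -> rtable -> element array */
static RangeTblEntry *
fetch_via_rtable(PlannerInfo *root, int rti)
{
    return (RangeTblEntry *) root->parse->rtable->elements[rti - 1].ptr_value;
}
```

Both return the same RTE, but the second chases two extra pointers before it
can index the element array, which is consistent with the small slowdown
observed on complex queries.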

It'd be possible to get rid of one of those indirections by maintaining a
copy of root->parse->rtable directly in PlannerInfo; but that throws away
most of the intellectual appeal of not having two sources of truth to
maintain, and it won't completely close the performance gap.

Other minor objections include:

* Many RTE accesses now look randomly different from adjacent 
RelOptInfo accesses.

* The inheritance-expansion code is a bit sloppy about how much it will
expand these arrays, which means it's possible in corner cases for
length(parse->rtable) to be less than root->simple_rel_array_size-1.
This resulted in a crash in add_other_rels_to_query, which was assuming
it could fetch a possibly-null RTE pointer from indexes up to
simple_rel_array_size-1.  While that wasn't hard to fix, I wonder
whether any third-party code has similar assumptions.

So on the whole, I'm inclined not to do this.  There are some cosmetic
bits of this patch that I do want to commit though: I found some
out-of-date comments, and I realized that it's pretty dumb not to
just merge setup_append_rel_array into setup_simple_rel_arrays.
The division of labor there is inconsistent with the fact that
there's no such division in expand_planner_arrays.

I still have hopes for getting rid of es_range_table_array though,
and will look at that tomorrow or so.

regards, tom lane

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 06a2058..8bd1c47 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2198,7 +2198,7 @@ postgresPlanDirectModify(PlannerInfo *root,
 	}
 	else
 		foreignrel = root->simple_rel_array[resultRelation];
-	rte = root->simple_rte_array[resultRelation];
+	rte = planner_rt_fetch(resultRelation, root);
 	fpinfo = (PgFdwRelationInfo *) foreignrel->fdw_private;
 
 	/*
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index e9ee32b..fe7f8b1 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -307,7 +307,7 @@ set_base_rel_sizes(PlannerInfo *root)
 		if (rel->reloptkind != RELOPT_BASEREL)
 			continue;
 
-		rte = root->simple_rte_array[rti];
+		rte = planner_rt_fetch(rti, root);
 
 		/*
 		 * If parallelism is allowable for this query in general, see whether
@@ -349,7 +349,7 @@ set_base_rel_pathlists(PlannerInfo *root)
 		if (rel->reloptkind != RELOPT_BASEREL)
 			continue;
 
-		set_rel_pathlist(root, rel, rti, root->simple_rte_array[rti]);
+		set_rel_pathlist(root, rel, rti, planner_rt_fetch(rti, root));
 	}
 }
 
@@ -1008,7 +1008,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 			continue;
 
 		childRTindex = appinfo->child_relid;
-		childRTE = root->simple_rte_array[childRTindex];
+		childRTE = planner_rt_fetch(childRTindex, root);
 
 		/*
 		 * The child rel's RelOptInfo was already created during
@@ -1239,7 +1239,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Re-locate the child RTE and RelOptInfo */
 		childRTindex = appinfo->child_relid;
-		childRTE = root->simple_rte_array[childRTindex];
+		childRTE = planner_rt_fetch(childRTindex, root);
 		childrel = root->simple_rel_array[childRTindex];
 
 		/*
@@ -3742,9 +3742,8 @@ print_relids(PlannerInfo *root, Relids relids)
 	{
 		if (!first)
 			printf(" ");
-		if (x < root->simple_rel_array_size &&
-			root->simple_rte_array[x])
-			printf("%s", root->simple_rte_array[x]->eref->aliasname);
+		if (x <= list_length(root->parse->rtable))
+			printf("%s", planner_rt_fetch(x, root)->eref->aliasname);
 		else
 			printf("%d", x);
 		first = false;
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index d19ff41..2576439 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -609,7 +609,7 @@ 

Re: Small const correctness patch

2019-08-08 Thread Ibrar Ahmed
On Fri, Aug 9, 2019 at 1:25 AM Mark G  wrote:

>
>
> On Thu, Aug 8, 2019 at 8:25 PM Ibrar Ahmed  wrote:
>
>> +1
>>
>> Patch successfully applied to the master (
>> 43211c2a02f39d6568496168413dc00e0399dc2e)
>>
>
> That change looks like an unrelated patch for initdb. I'm still not seeing
> my patch there.
>

I said I checked and verified the patch against that hash; it applied
without any failure. Sorry for the confusion.


>
> -Mark
>
>

-- 
Ibrar Ahmed


Re: Small const correctness patch

2019-08-08 Thread Mark G
On Thu, Aug 8, 2019 at 8:25 PM Ibrar Ahmed  wrote:

> +1
>
> Patch successfully applied to the master (43211c2a02f39d6568496168413dc00e0399dc2e)
>

That change looks like an unrelated patch for initdb. I'm still not seeing
my patch there.

-Mark


Re: SQL/JSON path: collation for comparisons, minor typos in docs

2019-08-08 Thread Alexander Korotkov
Hi, Markus!

On Thu, Aug 8, 2019 at 11:53 AM Markus Winand  wrote:
> The patch makes my tests pass.

Cool.

> I wonder about a few things:
>
> - Isn’t there any code that could be re-used for that (the one triggered by 
> ‘a’ < ‘A’ COLLATE ucs_basic)?

PostgreSQL supports ucs_basic, but it's an alias for the C collation and
works only with UTF-8.  Jsonpath code may run under different encodings,
and the new string comparison code can handle them.

> - For object key members, the standard also refers to unicode code point 
> collation (SQL-2:2016 4.46.3, last paragraph).
> - I guess it also applies to the “starts with” predicate, but I cannot find 
> this explicitly stated in the standard.

For object keys we don't actually care whether strings are less or
greater.  We only search for equal keys.  So, the per-byte comparison
we currently use should be fine.  The same holds for the "starts with"
predicate.
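To make the distinction concrete, here is a small C sketch (not PostgreSQL source; the function name is invented for illustration) of why key lookup can stay encoding-agnostic: equality needs only byte identity, while collation rules would only matter for ordering comparisons.

```c
#include <assert.h>
#include <string.h>

/*
 * For jsonpath object-key lookup only equality matters, so a per-byte
 * comparison suffices in any server encoding.  Collation-aware rules
 * (e.g. 'a' < 'A') would only be needed for < / > comparisons.
 */
static int
jsonpath_keys_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```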

> My tests check whether those cases do case-sensitive comparisons. With my 
> default collation "en_US.UTF-8" I cannot discover potential issues there. I 
> haven’t played around with nondeterministic ICU collations yet :(

That's OK. There should be other beta testers around :)


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-08 Thread Robert Haas
On Thu, Aug 8, 2019 at 9:31 AM Andres Freund  wrote:
> I know that Robert is working on a patch that revises the undo request
> layer somewhat, it's possible that this is best discussed afterwards.

Here's what I have at the moment.  This is not by any means a complete
replacement for Amit's undo worker machinery, but it is a significant
redesign of (and, I believe, a significant improvement to) the queue
management stuff from Amit's patch.  I wrote this pretty quickly, so
while it passes simple testing, it probably has a number of bugs, and
to actually use it, it would need to be integrated with xact.c; right
now it's just a standalone module that doesn't do anything except let
itself be tested.

Some of the ways it is different from Amit's patches include:

* It uses RBTree rather than binaryheap, so when we look ahead, we
look ahead in the right order.

* There's no limit to the lookahead distance; when looking ahead, it
will search the entirety of all 3 RBTrees for an entry from the right
database.

* It doesn't have a separate hash table keyed by XID.  I didn't find
that necessary.

* It's better-isolated, as you can see from the fact that I've
included a test module that tests this code without actually ever
putting an UndoRequestManager in shared memory.  I would've liked to
expand this test module, but I don't have time to do that today and
felt it better to get this much sent out.

* It has a lot of comments explaining the design and how it's intended
to integrate with the rest of the system.

Broadly, my vision for how this would get used is:

- Create an UndoRequestManager in shared memory.
- Before a transaction first attaches to a permanent or unlogged undo
log, xact.c would call RegisterUndoRequest(); thereafter, xact.c would
store a pointer to the UndoRequest for the lifetime of the toplevel
transaction.
- Immediately after attaching to a permanent or unlogged undo log,
xact.c would call UndoRequestSetLocation.
- xact.c would track the number of bytes of permanent and unlogged
undo records the transaction generates.  If the transaction goes onto
abort, it reports these by calling FinalizeUndoRequest.
- If the transaction commits, it doesn't need that information, but
does need to call UnregisterUndoRequest() as a post-commit step in
CommitTransaction().
- In the case of an abort, after calling FinalizeUndoRequest, xact.c
would call PerformUndoInBackground() to find out whether to do undo in
the background or the foreground.  If undo is to be done in the
foreground, the backend must go on to call UnregisterUndoRequest() if
undo succeeds, and RescheduleUndoRequest() if it fails.

- In the case of a prepared transaction, a pointer to the UndoRequest
would get stored in the GlobalTransaction (but nothing extra would get
stored in the twophase state file).
- COMMIT PREPARED calls UnregisterUndoRequest().
- ROLLBACK PREPARED calls PerformUndoInBackground; if told to do undo
in the foreground, it must go on to call either
UnregisterUndoRequest() on success or RescheduleUndoRequest() on
failure, just like in the regular abort case.

- After a crash, once recovery is complete but before we open for
connections, or at least before we allow any new undo activity, the
discard worker scans all the logs and makes a bunch of calls to
RecreateUndoRequest(). Then, for each prepared transaction that still
exists, it calls SuspendPreparedUndoRequest() and uses the return value
to reset the UndoRequest pointer in the GlobalTransaction.  Only once
both of those steps are completed can undo workers be safely started.
- Undo workers call GetNextUndoRequest() to get the next task that
they should perform, and once they do, they "own" the undo request.
When undo succeeds or fails, they must call either
UnregisterUndoRequest() or RescheduleUndoRequest(), as appropriate,
just like for foreground undo.  Making sure this is water-tight will
probably require some well-done integration with xact.c, so that an
undo request that we "own" because we got it in a background undo
apply process looks exactly the same as one we "own" because it's our
transaction originally.
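For illustration, the abort path described above could be wired up roughly as follows. This is a hedged C sketch with stub implementations: the function names come from the description in this mail, but the signatures and the UndoRequest contents are invented here, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for the real UndoRequest (contents are illustrative). */
typedef struct UndoRequest
{
    bool registered;   /* still known to the request manager? */
    bool rescheduled;  /* pushed back for background processing? */
} UndoRequest;

static UndoRequest *
RegisterUndoRequest(void)
{
    static UndoRequest req;     /* one request per backend, for the sketch */
    req.registered = true;
    req.rescheduled = false;
    return &req;
}

static void
FinalizeUndoRequest(UndoRequest *req, size_t perm_bytes, size_t unlogged_bytes)
{
    /* would record undo sizes so the request can be prioritized */
    (void) req; (void) perm_bytes; (void) unlogged_bytes;
}

static bool
PerformUndoInBackground(UndoRequest *req)
{
    (void) req;
    return false;               /* pretend: small request, undo in foreground */
}

static bool
ExecuteUndoActions(UndoRequest *req)
{
    (void) req;
    return true;                /* pretend foreground undo succeeded */
}

static void UnregisterUndoRequest(UndoRequest *req) { req->registered = false; }
static void RescheduleUndoRequest(UndoRequest *req) { req->rescheduled = true; }

/* The abort path sketched above: finalize, then undo here or hand off. */
static void
AbortTransactionWithUndo(UndoRequest *req, size_t perm_bytes,
                         size_t unlogged_bytes)
{
    FinalizeUndoRequest(req, perm_bytes, unlogged_bytes);
    if (!PerformUndoInBackground(req))
    {
        if (ExecuteUndoActions(req))
            UnregisterUndoRequest(req);     /* undo succeeded */
        else
            RescheduleUndoRequest(req);     /* let a worker retry later */
    }
}
```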

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-Draft-of-new-undo-request-manager.patch
Description: Binary data


Re: Small const correctness patch

2019-08-08 Thread Mark G
On Thu, Aug 8, 2019 at 8:51 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:


> How did you find this?  Any special compiler settings?
>

16 hours stuck in a plane on an international flight. I was just eyeballing
the code to kill the boredom.

-Mark


Re: no default hash partition

2019-08-08 Thread Alvaro Herrera
On 2019-Aug-08, Amit Langote wrote:

> On Thu, Aug 8, 2019 at 6:22 AM Tom Lane  wrote:

> > OK, but maybe also s/created as a default partition/created as the default
> > partition/ ?  Writing "a" carries the pretty clear implication that there
> > can be more than one, and contradicting that a sentence later doesn't
> > improve it.
> 
> +1.  Maybe also remove the last sentence of the 2nd paragraph, that
> is, this one:
> 
> There can be only one default partition for a given parent table.

Thanks!  I pushed with these two changes.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Small patch to fix build on Windows

2019-08-08 Thread Dmitry Igrishin
чт, 8 авг. 2019 г. в 20:07, Alvaro Herrera :
>
> On 2019-Aug-08, Dmitry Igrishin wrote:
>
> >   my $prefixcmd =
> > -   $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
> > +   "\"$solution->{options}->{python}\\python\" -c \"$pythonprog\"";
>
> I think you can make this prettier like this:
>
>    my $prefixcmd = qq{"$solution->{options}->{python}\\python" -c "$pythonprog"};
This looks nice for a Perl hacker :-). As for me, it looks unusual and a
bit confusing. I never programmed in Perl, but I was able to quickly
understand where the problem lies thanks to the style adopted in other
languages, where the contents are enclosed in quotation marks and the
quotation marks are escaped if they are part of the contents.
So, should I fix it? Any thoughts?




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, Jul  9, 2019 at 11:09:01AM -0400, Bruce Momjian wrote:
> > On Tue, Jul  9, 2019 at 10:59:12AM -0400, Stephen Frost wrote:
> > > * Bruce Momjian (br...@momjian.us) wrote:
> > > I agree that all of that isn't necessary for an initial implementation,
> > > I was rather trying to lay out how we could improve on this in the
> > > future and why having the keying done at a tablespace level makes sense
> > > initially because we can then potentially move forward with further
> > > segregation to improve the situation.  I do believe it's also useful in
> > > its own right, to be clear, just not as nice since a compromised backend
> > > could still get access to data in shared buffers that it really
> > > shouldn't be able to, even broadly, see.
> > 
> > I think TDE is a feature of questionable value at best and the idea that
> > we would fundamentally change the internals of Postgres to add more
> > features to it seems very unlikely.  I realize we have to discuss it so
> > we don't block reasonable future feature development.
> 
> I have a new crazy idea.  I know we concluded that allowing multiple
> independent keys, e.g., per user, per table, didn't make sense since
> they have to be unlocked all the time, e.g., for crash recovery and
> vacuum freeze.

I'm a bit confused as I never agreed that made any sense and I continue
to feel that it doesn't make sense to have one key for everything.

Crash recovery doesn't happen "all the time" and neither does vacuum
freeze, and autovacuum processes are independent of individual client
backends- we don't need to (and shouldn't) have the keys in shared
memory.

> However, that assumes that all heap/index pages are encrypted, and all
> of WAL.  What if we encrypted only the user-data part of the page, i.e.,
> tuple data.  We left xmin/xmax unencrypted, and only stored the
> encrypted part of that data in WAL, and didn't encrypt any more of WAL. 

This is pretty much what Alvaro was suggesting a while ago, isn't it..?
Have just the user data be encrypted in the table and in the WAL stream.

> That might allow crash recovery and the freeze part of VACUUM FREEZE to
> work.  (I don't think we could vacuum since we couldn't read the index
> pages to find the matching rows since the index values would be encrypted
> too.  We might be able to not encrypt the tid in the index tuple.)

Why do we need the indexed values to vacuum the index..?  We don't
today, as I recall.  We would need the tids though, yes.

> Is this something to consider in version one of this feature?  Probably
> not, but later?  Never?  Would the information leakage be too great,
> particularly from indexes?

What would be leaking from the indexes..?  That an encrypted blob in the
index pointed to a given tid?  Wouldn't someone be able to see that same
information by looking directly at the relation too?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: First draft of back-branch release notes is done

2019-08-08 Thread Jonathan S. Katz
On 8/8/19 2:45 PM, Alvaro Herrera wrote:
> On 2019-Aug-08, Jonathan S. Katz wrote:
> 
>> On 8/8/19 2:40 PM, Alvaro Herrera wrote:
>>> On 2019-Aug-08, Jonathan S. Katz wrote:
>>>
 I modified the copy of the announcement on the website to include the
 pg_upgrade option.

 https://www.postgresql.org/about/news/1960/
>>>
>>> Ooh, had I thought you were going to do that, I would have told you
>>> about the item ending in a comma :-)
>>
>> :) I made a quick modification and opted for an "either" at the
>> beginning of that clause and a capitalized "OR" towards the end.
> 
> Oh, heh ...  I was thinking of this line:
> 
>Fix for multi-column foreign keys when rebuilding a foreign key constraint,

Oh oops. Fixed :) Thanks,

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: First draft of back-branch release notes is done

2019-08-08 Thread Alvaro Herrera
On 2019-Aug-08, Jonathan S. Katz wrote:

> On 8/8/19 2:40 PM, Alvaro Herrera wrote:
> > On 2019-Aug-08, Jonathan S. Katz wrote:
> > 
> >> I modified the copy of the announcement on the website to include the
> >> pg_upgrade option.
> >>
> >> https://www.postgresql.org/about/news/1960/
> > 
> > Ooh, had I thought you were going to do that, I would have told you
> > about the item ending in a comma :-)
> 
> :) I made a quick modification and opted for an "either" at the
> beginning of that clause and a capitalized "OR" towards the end.

Oh, heh ...  I was thinking of this line:

   Fix for multi-column foreign keys when rebuilding a foreign key constraint,

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: First draft of back-branch release notes is done

2019-08-08 Thread Jonathan S. Katz
On 8/8/19 2:40 PM, Alvaro Herrera wrote:
> On 2019-Aug-08, Jonathan S. Katz wrote:
> 
>> I modified the copy of the announcement on the website to include the
>> pg_upgrade option.
>>
>> https://www.postgresql.org/about/news/1960/
> 
> Ooh, had I thought you were going to do that, I would have told you
> about the item ending in a comma :-)

:) I made a quick modification and opted for an "either" at the
beginning of that clause and a capitalized "OR" towards the end.

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: First draft of back-branch release notes is done

2019-08-08 Thread Alvaro Herrera
On 2019-Aug-08, Jonathan S. Katz wrote:

> I modified the copy of the announcement on the website to include the
> pg_upgrade option.
> 
> https://www.postgresql.org/about/news/1960/

Ooh, had I thought you were going to do that, I would have told you
about the item ending in a comma :-)

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: converting Lists into arrays

2019-08-08 Thread Tom Lane
Andres Freund  writes:
> On 2019-07-31 19:40:09 -0400, Tom Lane wrote:
>> That makes the other idea (of a foreach-ish macro declaring the
>> listcell value variable) problematic, too :-(.

> Hm. One way partially around that would be using an anonymous struct
> inside the for(). Something like
> #define foreach_node(membertype, name, lst)   \
> for (struct {membertype *node; ListCell *lc; const List *l; int i;} name = 
> {...}; \
>  ...)
> which then would allow code like

> foreach_node(OpExpr, cur, list)
> {
> do_something_with_node(cur.node);
> foreach_delete_current(cur);
> }

I'm hesitant to change the look of our loops quite that much, mainly
because it'll be a pain for back-patching.  If you write some code
for HEAD like this, and then have to back-patch it, you'll need to
insert/change significantly more code than if it's just a matter
of whether there's a ListCell variable or not.

I experimented with the "aforeach" idea I suggested upthread,
to the extent of writing the macros and then converting
parse_clause.c (a file chosen more or less at random) to use
aforeach instead of foreach.  I was somewhat surprised to find
that every single foreach() did convert pleasantly.  (There are
several forboth's that I didn't try to do anything with, though.)

If we do go in this direction, I wouldn't suggest trying to
actually do wholesale conversion of existing code like this;
that seems more likely to create back-patching land mines than
do anything helpful.  I am slightly tempted to try to convert
everyplace using foreach_delete_current, though, since those
loops are different from v12 already.

Thoughts?

regards, tom lane
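
For illustration, a minimal self-contained sketch of what such an aforeach macro could look like, using the anonymous-struct-in-for() trick Andres suggested upthread. This is hypothetical: the real macros in the proposal work against PostgreSQL's array-based List and will differ in detail; the List struct here is a simplified stand-in.

```c
#include <assert.h>

/* Simplified stand-in for PostgreSQL's array-based List (illustrative). */
typedef struct List
{
    int     length;
    void  **elements;
} List;

/*
 * aforeach: iterate over "lst" with an anonymous cursor, so the caller
 * no longer declares a ListCell.  The struct declared in the for() init
 * clause is scoped to the loop; nested uses simply shadow the outer one.
 */
#define aforeach(lst) \
    for (struct { const List *l; int i; } cur__ = {(lst), 0}; \
         cur__.i < cur__.l->length; \
         cur__.i++)

/* Fetch the element at the current loop position. */
#define aforeach_current()          (cur__.l->elements[cur__.i])
#define aforeach_current_node(type) ((type *) aforeach_current())
```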

diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index 2a6b2ff..39d8d8e 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -117,8 +117,6 @@ static Node *transformFrameOffset(ParseState *pstate, int frameOptions,
 void
 transformFromClause(ParseState *pstate, List *frmList)
 {
-	ListCell   *fl;
-
 	/*
 	 * The grammar will have produced a list of RangeVars, RangeSubselects,
 	 * RangeFunctions, and/or JoinExprs. Transform each one (possibly adding
@@ -128,9 +126,9 @@ transformFromClause(ParseState *pstate, List *frmList)
 	 * Note we must process the items left-to-right for proper handling of
 	 * LATERAL references.
 	 */
-	foreach(fl, frmList)
+	aforeach(frmList)
 	{
-		Node	   *n = lfirst(fl);
+		Node	   *n = (Node *) aforeach_current();
 		RangeTblEntry *rte;
 		int			rtindex;
 		List	   *namespace;
@@ -267,11 +265,10 @@ extractRemainingColumns(List *common_colnames,
 	{
 		char	   *colname = strVal(lfirst(lnames));
 		bool		match = false;
-		ListCell   *cnames;
 
-		foreach(cnames, common_colnames)
+		aforeach(common_colnames)
 		{
-			char	   *ccolname = strVal(lfirst(cnames));
+			char	   *ccolname = strVal(aforeach_current());
 
 			if (strcmp(colname, ccolname) == 0)
 			{
@@ -475,7 +472,6 @@ transformRangeFunction(ParseState *pstate, RangeFunction *r)
 	List	   *coldeflists = NIL;
 	bool		is_lateral;
 	RangeTblEntry *rte;
-	ListCell   *lc;
 
 	/*
 	 * We make lateral_only names of this level visible, whether or not the
@@ -505,9 +501,9 @@ transformRangeFunction(ParseState *pstate, RangeFunction *r)
 	 * Likewise, collect column definition lists if there were any.  But
 	 * complain if we find one here and the RangeFunction has one too.
 	 */
-	foreach(lc, r->functions)
+	aforeach(r->functions)
 	{
-		List	   *pair = (List *) lfirst(lc);
+		List	   *pair = aforeach_current_node(List);
 		Node	   *fexpr;
 		List	   *coldeflist;
 		Node	   *newfexpr;
@@ -551,11 +547,9 @@ transformRangeFunction(ParseState *pstate, RangeFunction *r)
 fc->over == NULL &&
 coldeflist == NIL)
 			{
-ListCell   *lc;
-
-foreach(lc, fc->args)
+aforeach(fc->args)
 {
-	Node	   *arg = (Node *) lfirst(lc);
+	Node	   *arg = (Node *) aforeach_current();
 	FuncCall   *newfc;
 
 	last_srf = pstate->p_last_srf;
@@ -700,7 +694,6 @@ transformRangeTableFunc(ParseState *pstate, RangeTableFunc *rtf)
 	Oid			docType;
 	RangeTblEntry *rte;
 	bool		is_lateral;
-	ListCell   *col;
 	char	  **names;
 	int			colno;
 
@@ -743,9 +736,9 @@ transformRangeTableFunc(ParseState *pstate, RangeTableFunc *rtf)
 	names = palloc(sizeof(char *) * list_length(rtf->columns));
 
 	colno = 0;
-	foreach(col, rtf->columns)
+	aforeach(rtf->columns)
 	{
-		RangeTableFuncCol *rawc = (RangeTableFuncCol *) lfirst(col);
+		RangeTableFuncCol *rawc = aforeach_current_node(RangeTableFuncCol);
 		Oid			typid;
 		int32		typmod;
 		Node	   *colexpr;
@@ -837,15 +830,13 @@ transformRangeTableFunc(ParseState *pstate, RangeTableFunc *rtf)
 	/* Namespaces, if any, also need to be transformed */
 	if (rtf->namespaces != NIL)
 	{
-		ListCell   *ns;
-		ListCell   *lc2;
 		List	   *ns_uris = NIL;
 		List	   *ns_names = NIL;
 		bool		default_ns_seen = false;
 
-		foreach(ns, rtf->namespaces)
+		

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Bruce Momjian
On Tue, Jul  9, 2019 at 11:09:01AM -0400, Bruce Momjian wrote:
> On Tue, Jul  9, 2019 at 10:59:12AM -0400, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > I agree that all of that isn't necessary for an initial implementation,
> > I was rather trying to lay out how we could improve on this in the
> > future and why having the keying done at a tablespace level makes sense
> > initially because we can then potentially move forward with further
> > segregation to improve the situation.  I do believe it's also useful in
> > its own right, to be clear, just not as nice since a compromised backend
> > could still get access to data in shared buffers that it really
> > shouldn't be able to, even broadly, see.
> 
I think TDE is a feature of questionable value at best and the idea that
we would fundamentally change the internals of Postgres to add more
> features to it seems very unlikely.  I realize we have to discuss it so
> we don't block reasonable future feature development.

I have a new crazy idea.  I know we concluded that allowing multiple
independent keys, e.g., per user, per table, didn't make sense since
they have to be unlocked all the time, e.g., for crash recovery and
vacuum freeze.

However, that assumes that all heap/index pages are encrypted, and all
of WAL.  What if we encrypted only the user-data part of the page, i.e.,
tuple data.  We left xmin/xmax unencrypted, and only stored the
encrypted part of that data in WAL, and didn't encrypt any more of WAL. 
That might allow crash recovery and the freeze part of VACUUM FREEZE to
work.  (I don't think we could vacuum since we couldn't read the index
pages to find the matching rows since the index values would be encrypted
too.  We might be able to not encrypt the tid in the index tuple.)

Is this something to consider in version one of this feature?  Probably
not, but later?  Never?  Would the information leakage be too great,
particularly from indexes?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Bruce Momjian
On Wed, Aug  7, 2019 at 07:40:05PM -0400, Sehrope Sarkuni wrote:
> With the provisos above, yes I think that would work though I don't think it's
> a good idea. Better to start off using the functions directly and then look
> into optimizing only if they're a bottleneck. As a first pass I'd break it up
> as separate writes with the encryption happening at write time. If that works
> fine there's no need to complicate things further.

OK, sounds like a plan!

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: First draft of back-branch release notes is done

2019-08-08 Thread Jonathan S. Katz
On 8/8/19 2:15 PM, Tom Lane wrote:
> Alvaro Herrera  writes:
>> On 2019-Aug-04, Jonathan S. Katz wrote:
>>> * Ensure that partition key columns will not be dropped as the result of an
>>> "indirect drop," such as from a cascade from dropping the key column's data
>>> type (e.g. a custom data type). This fix is applied only to newly created
>>> partitioned tables: if you believe you have an affected partition table 
>>> (e.g.
>>> one where the partition key uses a custom data type), you will need to
>>> create a new table and move your data into it.
> 
>> Hmm, if I have this problem, I can pg_upgrade and the new database will
>> have correct dependencies, right?  For some people, doing that might be
>> easier than creating and reloading large tables.
> 
> Yeah, that should work.

I modified the copy of the announcement on the website to include the
pg_upgrade option.

https://www.postgresql.org/about/news/1960/

Thanks!

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: Locale support

2019-08-08 Thread Peter Geoghegan
On Thu, Aug 8, 2019 at 6:29 AM Yonatan Misgan  wrote:
> So my question is: after developing the converter function, where do I put it
> so it can be accessed from PostgreSQL?

Maybe you can take some inspiration from the postgresql-unit extension:

https://github.com/df7cb/postgresql-unit

Note that it is built on top of GNU units, which is itself highly extensible.

I'm not sure if this will be useful, since I am not an expert on
calendar systems.

-- 
Peter Geoghegan




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-08 Thread Bruce Momjian
On Wed, Aug  7, 2019 at 08:56:18AM -0400, Sehrope Sarkuni wrote:
> On Mon, Aug 5, 2019 at 9:02 PM Bruce Momjian  wrote:
> 
> On Wed, Jul 31, 2019 at 09:25:01AM -0400, Sehrope Sarkuni wrote:
> > Even if we do not include a separate per-relation salt or things like
> > relfilenode when generating a derived key, we can still include other
> > types of immutable attributes. For example the fork type could be
> > included to eventually allow multiple forks for the same relation to be
> > encrypted with the same IV = LSN + Page Number, as the derived key
> > per-fork would be distinct.
> 
> Yes, the fork number could be useful in this case.  I was thinking we
> would just leave the extra bits as zeros and we can then set it to '1'
> or something else for a different fork.
> 
> 
> Key derivation has more flexibility as you're not limited by the number of
> unused bits in the IV.

Understood, though I was not aware of the usefulness of key derivation
until this thread.

> > WAL encryption should not use the same key as page encryption so there's
> > no need to design the IV to try to avoid matching the page IVs. Even a
> > basic derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and
> > TDEK = HKDF(MDEK, "PAGE") would ensure separate keys. That's the literal
> > string "WAL" or "PAGE" being added as a salt to generate the respective
> > keys; all that matters is they're different.
> 
> I was thinking the WAL would use the same key since the nonce is unique
> between the two.  What value is there in using a different key?
> 
> 
> Never having to worry about overlap in Key + IV usage is the main advantage.
> While it's possible to structure IVs to avoid that from happening, it's much easier
> to completely avoid that situation by ensuring different parts of an
> application are using separate derived keys.

Understood.

> > Ideally WAL encryption would generate new derived keys as part of the
> > WAL stream. The WAL stream is not fixed so you have the luxury of being
> > able to add a "Use new random salt XYZ going forward" record. Forcing
> > generation of a new salt/key upon promotion of a replica would ensure
> > that at least the WAL is unique going forward. Could also generate a new
> > one upon server startup, after
> 
> Ah, yes, good point, and using a derived key would make that easier.
> The tricky part is what to use to create the new derived key, unless we
> generate a random number and store that somewhere in the data directory,
> but that might lead to fragility, so I am worried. 
> 
> 
> Simplest approach for derived keys would be to use immutable attributes of the
> WAL files as an input to the key derivation. Something like HKDF(MDEK, "WAL:" 
> |

So, I am thinking we should use "WAL:" for WAL and "REL:" for heap/index
files.

> | timeline_id || wal_segment_num) should be fine for this as it is:

I considered using the timeline in the nonce, but then remembered that
in a timeline switch, we _copy_ the part of the old WAL up to the switch
point to the new timeline; see:


https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlog.c;h=f55352385732c6b0124eff5265462f3883fe7435;hb=HEAD#l5502

   * Initialize the starting WAL segment for the new timeline. If the switch
   * happens in the middle of a segment, copy data from the last WAL segment
   * of the old timeline up to the switch point, to the starting WAL segment
   * on the new timeline.

We would need to decrypt/encrypt to do the copy, and I just wasn't sure
of the value of the timeline in the nonce.  One argument for it is that
if some WAL generated after the timeline switch in the old primary isn't
transferred, there could be new data encrypted with the same key/nonce
in the new primary; but if that WAL is not used, odds are it is
gone/destroyed/inaccessible, or it would have been used during the
switchover, so it didn't seem worth worrying about.

One _big_ reason to add the timeline is if you had a standby that you
recovered and rolled forward only to a specific transaction, then
continued running it as a new primary.  In that case, you would have
different WAL encrypted with the same key/nonce, but that sounds the
same as promoting two standbys, and we should just document not to
do it.

Maybe we need to consider this further.

> * Unique per WAL file
> * Known prior to writing to a given WAL file
> * Known prior to reading a given WAL file
> * Does not require any additional persistence

Agreed.
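As a concrete illustration of that derivation scheme, here is a hedged Python sketch. The HKDF-Expand construction follows RFC 5869; the exact byte encoding of timeline_id and wal_segment_num into the "info" input is an assumption made for the example, not something the thread specifies.

```python
import hashlib
import hmac
import struct


def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    """Single-block HKDF-Expand (RFC 5869); enough for one <=32-byte key."""
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()[:length]


def derive_wal_key(mdek: bytes, timeline_id: int, wal_segment_num: int) -> bytes:
    # "info" is built only from immutable attributes of the WAL file, as
    # suggested: HKDF(MDEK, "WAL:" || timeline_id || wal_segment_num).
    # Encoding assumption: 4-byte big-endian tli, 8-byte big-endian segno.
    info = b"WAL:" + struct.pack(">IQ", timeline_id, wal_segment_num)
    return hkdf_expand(mdek, info)
```

Because the key is a pure function of the timeline and segment number, it is known before writing or reading a given WAL file and needs no extra persistence, matching the four properties listed above.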

> We have pg_rewind,
> which allows making the WAL go backwards.  What is the value in doing
> this?
> 
> 
> Good point re: pg_rewind. Having key rotation records in the stream would
> complicate that as you'd have to jump back / forward to figure out which
> key to use. It's doable but much more 

Re: First draft of back-branch release notes is done

2019-08-08 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Aug-04, Jonathan S. Katz wrote:
>> * Ensure that partition key columns will not be dropped as the result of an
>> "indirect drop," such as from a cascade from dropping the key column's data
>> type (e.g. a custom data type). This fix is applied only to newly created
>> partitioned tables: if you believe you have an affected partition table (e.g.
>> one where the partition key uses a custom data type), you will need to
>> create a new table and move your data into it.

> Hmm, if I have this problem, I can pg_upgrade and the new database will
> have correct dependencies, right?  For some people, doing that might be
> easier than creating and reloading large tables.

Yeah, that should work.

regards, tom lane




Re: Small const correctness patch

2019-08-08 Thread Peter Eisentraut
On 2019-08-08 08:46, Mark G wrote:
> Please find attached a trivial patch making a few arrays const (in
> addition to the data they point to).

How did you find this?  Any special compiler settings?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Small const correctness patch

2019-08-08 Thread Ibrar Ahmed
+1

Patch successfully applied to the master (43211c2a02f39d6568496168413dc00e0399dc2e)

On Thu, Aug 8, 2019 at 12:30 PM Mark G  wrote:

> Hello all,
>
> Please find attached a trivial patch making a few arrays const (in
> addition to the data they point to).
>
>
>

-- 
Ibrar Ahmed


Re: First draft of back-branch release notes is done

2019-08-08 Thread Alvaro Herrera
I realize that this has now been sent, but I wanted to comment on one
item:

On 2019-Aug-04, Jonathan S. Katz wrote:

> * Ensure that partition key columns will not be dropped as the result of an
> "indirect drop," such as from a cascade from dropping the key column's data
> type (e.g. a custom data type). This fix is applied only to newly created
> partitioned tables: if you believe you have an affected partition table (e.g.
> one where the partition key uses a custom data type), you will need to
> create a new table and move your data into it.

Hmm, if I have this problem, I can pg_upgrade and the new database will
have correct dependencies, right?  For some people, doing that might be
easier than creating and reloading large tables.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Small patch to fix build on Windows

2019-08-08 Thread Alvaro Herrera
On 2019-Aug-08, Dmitry Igrishin wrote:

>   my $prefixcmd =
> -   $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
> +   "\"$solution->{options}->{python}\\python\" -c \"$pythonprog\"";

I think you can make this prettier like this:

   my $prefixcmd = qq{"$solution->{options}->{python}\\python" -c "$pythonprog"};

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: converting Lists into arrays

2019-08-08 Thread Tom Lane
[ returning to this topic now that the CF is over ]

I wrote:
> Perhaps there's an argument for doing something to change the behavior
> of list_union and list_difference and friends.  Not sure --- it could
> be a foot-gun for back-patching.  I'm already worried about the risk
> of back-patching code that assumes the new semantics of list_concat.
> (Which might be a good argument for renaming it to something else?
> Just not list_union, please.)

Has anyone got further thoughts about naming around list_concat
and friends?

If not, I'm inclined to go ahead with the concat-improvement patch as
proposed in [1], modulo the one improvement David spotted.

regards, tom lane

[1] https://www.postgresql.org/message-id/6704.1563739...@sss.pgh.pa.us




Re: clean up obsolete initdb locale handling

2019-08-08 Thread Tom Lane
I wrote:
> ... In particular, I'm concerned that this patch will
> result in subtle changes in what settings get chosen during initdb.

OK, after reviewing the code a bit more I take that back --- initdb's
choices are entirely made within initdb.

However, I don't much like the choice to set LC_COLLATE and LC_CTYPE
differently.  That seems to be risking weird behavior, and for what?
I'd be inclined to just remove the WIN32 stanza, initialize all
three of these variables with "", and explain it along the lines of

* In the postmaster, absorb the environment values for LC_COLLATE
* and LC_CTYPE.  Individual backends will change these later to
* settings taken from pg_database, but the postmaster cannot do
* that.  If we leave these set to "C" then message localization
* might not work well in the postmaster.

That ends up being no code change in main.c, except on Windows.
I concur that we can drop the transmission of LC_COLLATE and
LC_CTYPE via environment variables.

regards, tom lane




Re: clean up obsolete initdb locale handling

2019-08-08 Thread Tom Lane
Peter Eisentraut  writes:
> A long time ago, we changed LC_COLLATE and LC_CTYPE from cluster-wide to
> per-database (61d967498802ab86d8897cb3c61740d7e9d712f6).  There is some
> leftover code from that in initdb.c and backend/main/main.c to pass
> these environment variables around in the expectations that the backend
> will write them to pg_control during bootstrap, which is of course all a
> lie now.

Well, the comments' references to pg_control are indeed obsolete, but why
wouldn't we just replace that with references to "the appropriate entry in
pg_database"?  I don't see why that movement changed anything about what
should happen here.  In particular, I'm concerned that this patch will
result in subtle changes in what settings get chosen during initdb.

regards, tom lane




Re: PG_TRY()/CATCH() in a loop reports uninitialized variables

2019-08-08 Thread Tom Lane
Adam Lee  writes:
> That's fine actually, but if we put the PG_TRY()/CATCH() in a loop, high
> version gcc might complain.

I'd be inclined to say "so don't do that then".  Given this interpretation
(which sure looks like a bug to me, gcc maintainers' opinion or no),
you're basically going to have to mark EVERYTHING in that function
volatile.  Better to structure the code so you don't have to do that,
which would mean not putting the TRY and the loop in the same level
of function.

I've seen other weird-maybe-bug gcc behaviors in the vicinity of
setjmp calls, too, which is another factor that pushes me not to
want to assume too much about what you can do in the same function
as a TRY call.

regards, tom lane




clean up obsolete initdb locale handling

2019-08-08 Thread Peter Eisentraut
A long time ago, we changed LC_COLLATE and LC_CTYPE from cluster-wide to
per-database (61d967498802ab86d8897cb3c61740d7e9d712f6).  There is some
leftover code from that in initdb.c and backend/main/main.c to pass
these environment variables around in the expectations that the backend
will write them to pg_control during bootstrap, which is of course all a
lie now.

The attached patch cleans that up.  (Not totally sure about the WIN32
block, but the change seems good to me.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 51568d2923ce26ecf43f028dd4075376cfb864fb Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 8 Aug 2019 12:33:07 +0200
Subject: [PATCH] initdb: Remove obsolete locale handling

The method of passing LC_COLLATE and LC_CTYPE to the backend during
initdb is obsolete as of 61d967498802ab86d8897cb3c61740d7e9d712f6.
This can all be removed.
---
 src/backend/main/main.c | 39 ++-
 src/bin/initdb/initdb.c | 14 --
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 7b18f8c758..1bd849fb51 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -99,42 +99,23 @@ main(int argc, char *argv[])
MemoryContextInit();
 
/*
-* Set up locale information from environment.  Note that LC_CTYPE and
-* LC_COLLATE will be overridden later from pg_control if we are in an
-* already-initialized database.  We set them here so that they will be
-* available to fill pg_control during initdb.  LC_MESSAGES will get set
-* later during GUC option processing, but we set it here to allow startup
-* error messages to be localized.
+* Set up locale information
 */
-
set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("postgres"));
 
-#ifdef WIN32
-
/*
-* Windows uses codepages rather than the environment, so we work around
-* that by querying the environment explicitly first for LC_COLLATE and
-* LC_CTYPE. We have to do this because initdb passes those values in 
the
-* environment. If there is nothing there we fall back on the codepage.
+* LC_COLLATE and LC_CTYPE are set at backend start to values from
+* pg_database.  We start out LC_COLLATE as "C" (nothing in the postmaster
+* should care).  But LC_CTYPE should be set to something reasonable so
+* that gettext can work.
 */
-   {
-   char   *env_locale;
-
-   if ((env_locale = getenv("LC_COLLATE")) != NULL)
-   init_locale("LC_COLLATE", LC_COLLATE, env_locale);
-   else
-   init_locale("LC_COLLATE", LC_COLLATE, "");
-
-   if ((env_locale = getenv("LC_CTYPE")) != NULL)
-   init_locale("LC_CTYPE", LC_CTYPE, env_locale);
-   else
-   init_locale("LC_CTYPE", LC_CTYPE, "");
-   }
-#else
-   init_locale("LC_COLLATE", LC_COLLATE, "");
+   init_locale("LC_COLLATE", LC_COLLATE, "C");
init_locale("LC_CTYPE", LC_CTYPE, "");
-#endif
 
+   /*
+* LC_MESSAGES will get set later during GUC option processing, but we set
+* it here to allow startup error messages to be localized.
+*/
 #ifdef LC_MESSAGES
init_locale("LC_MESSAGES", LC_MESSAGES, "");
 #endif
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 551d379d85..88a261d9bd 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1417,20 +1417,6 @@ bootstrap_template1(void)
	bki_lines = replace_token(bki_lines, "LC_CTYPE",
							  escape_quotes_bki(lc_ctype));
 
-   /*
-* Pass correct LC_xxx environment to bootstrap.
-*
-* The shell script arranged to restore the LC settings afterwards, but
-* there doesn't seem to be any compelling reason to do that.
-*/
-   snprintf(cmd, sizeof(cmd), "LC_COLLATE=%s", lc_collate);
-   putenv(pg_strdup(cmd));
-
-   snprintf(cmd, sizeof(cmd), "LC_CTYPE=%s", lc_ctype);
-   putenv(pg_strdup(cmd));
-
-   unsetenv("LC_ALL");
-
/* Also ensure backend isn't confused by this environment var: */
unsetenv("PGCLIENTENCODING");
 
-- 
2.22.0



Re: Locale support

2019-08-08 Thread Chapman Flack
On 8/8/19 9:29 AM, Yonatan Misgan wrote:
> From: Thomas Munro 
>> Perl, Python etc.  If I were you I think I'd experiment with a
>> prototype implementation using  PL/Perl, PL/Python etc (a way to

As a bit of subtlety that might matter, the internal representation
in PostgreSQL, as in ISO 8601, applies the Gregorian calendar
'proleptically', that is, forever into the future, and forever into
the past, before it was even invented or in use anywhere.

That matches the documented behavior of the standard 'datetime'
class included with Python (though in Python you need an add-on
module to support time zones).

In Perl, an add-on module may be required.

(In Java, the classes in the java.time package match the PostgreSQL /
ISO 8601 behavior, while the classes in the java.sql package do not.)

The effects of any mismatch are most likely to show up in dates
earlier than 15 October 1582 Gregorian.

This was freshly on my mind from a recent thread over here[1].

Regards,
-Chap


[1]
https://www.postgresql.org/message-id/5D3AF944.6020900%40anastigmatix.net




Re: Resume vacuum and autovacuum from interruption and cancellation

2019-08-08 Thread Rafia Sabih
On Tue, 16 Jul 2019 at 13:57, Masahiko Sawada  wrote:
>
> On Wed, Jun 12, 2019 at 1:30 PM Masahiko Sawada  wrote:
> >
> > Hi all,
> >
> > Long-running vacuums can sometimes be cancelled by the administrator, and
> > autovacuums can be cancelled by concurrent processes. Even if vacuum
> > retries after cancellation, it always restarts from the first block of
> > the table, so it may vacuum blocks again that we already vacuumed last
> > time. We have the visibility map to skip scanning all-visible blocks, but
> > in cases where the table is large and often modified, we're more likely
> > to reclaim garbage from blocks other than the ones we processed last time
> > than by scanning from the first block.
> >
> > So I'd like to propose making vacuum save its progress and resume
> > vacuuming based on it. The mechanism I'm thinking of is simple: vacuum
> > periodically reports the current block number to the stats collector.
> > If the table has indexes, it is reported after each heap vacuum; otherwise
> > it is reported every certain number of blocks (e.g. 1024 blocks = 8MB).
> > We can see that value in the new column vacuum_resume_block of
> > pg_stat_all_tables. I'm going to add one vacuum command option RESUME
> > and one new reloption vacuum_resume. If the option is true, vacuum
> > fetches the block number from the stats collector before starting and
> > starts vacuuming from that block. I wonder if we could make it true by
> > default for autovacuums, but it must be false for aggressive vacuums.
> >
> > If we start vacuuming from somewhere other than the first block, we can
> > update neither relfrozenxid nor relminmxid. And we might not be able to
> > update even the relation statistics.
> >

Sounds like an interesting idea, but does it really help? Because if
vacuum was interrupted previously, wouldn't it already find the dead
tuples, etc. in the next run quite quickly, as the VM and FSM were already
updated for those pages in the previous run?

A few minor things I noticed in the first look,
+/*
+ * When a table has no indexes, save the progress every 8GB so that we can
+ * resume vacuum from the middle of table. When table has indexes we save it
+ * after the second heap pass finished.
+ */
+#define VACUUM_RESUME_BLK_INTERVAL 1024 /* 8MB */
Discrepancy with the memory unit here.

/* No found valid saved block number, resume from the first block */
Can be better framed.

--
Regards,
Rafia Sabih




Re: POC: Cleaning up orphaned files using undo logs

2019-08-08 Thread Andres Freund
Hi,

On 2019-08-07 14:50:17 +0530, Amit Kapila wrote:
> On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> > On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > > +/*
> > > + * Binary heap comparison function to compare the time at which an error
> > > + * occurred for transactions.
> > > + *
> > > + * The error queue is sorted by next_retry_at and err_occurred_at.  
> > > Currently,
> > > + * the next_retry_at has some constant delay time (see 
> > > PushErrorQueueElem), so
> > > + * it doesn't make much sense to sort by both values.  However, in 
> > > future, if
> > > + * we have some different algorithm for next_retry_at, then it will work
> > > + * seamlessly.
> > > + */
> >
> > Why is it useful to have error_occurred_at be part of the comparison at
> > all? If we need a tiebreaker, err_occurred_at isn't that (if we can get
> > conflicts for next_retry_at, then we can also get conflicts in
> > err_occurred_at).  Seems better to use something actually guaranteed to
> > be unique for a tiebreaker.
> >
> 
> This was to distinguish the cases where the request is failing
> multiple times with the request failed the first time.  I agree that
> we need a better unique identifier like FullTransactionid though.  Do
> let me know if you have any other suggestion?

Sure, I get why you have the field. Even if it were just for debugging
or such. Was just commenting upon it being used as part of the
comparison.  I'd just go for (next_retry_at, fxid).


> > > +  * backends. This will ensure that it won't get filled.
> > > +  */
> >
> > How does this ensure anything?
> >
> 
> Because based on this we will have a hard limit on the number of undo
> requests after which we won't allow more requests.  See some more
> detailed explanation for the same later in this email.   I think the
> comment needs to be updated.

Well, as your code stands, I don't think there is an actual hard limit
on the number of transactions needing to be undone due to the way errors
are handled. There's no consideration of prepared transactions.



> > > + START_CRIT_SECTION();
> > > +
> > > + /* Update the progress in the transaction header. */
> > > + UndoRecordUpdateTransInfo(, 0);
> > > +
> > > + /* WAL log the undo apply progress. */
> > > + {
> > > + XLogRecPtr  lsn;
> > > + xl_undoapply_progress xlrec;
> > > +
> > > + xlrec.urec_ptr = progress_urec_ptr;
> > > + xlrec.progress = block_num;
> > > +
> > > + XLogBeginInsert();
> > > + XLogRegisterData((char *) , sizeof(xlrec));
> > > +
> > > + RegisterUndoLogBuffers(, 1);
> > > + lsn = XLogInsert(RM_UNDOACTION_ID, 
> > > XLOG_UNDO_APPLY_PROGRESS);
> > > + UndoLogBuffersSetLSN(, lsn);
> > > + }
> > > +
> > > + END_CRIT_SECTION();
> > > +
> > > + /* Release undo buffers. */
> > > + FinishUndoRecordInsert();
> > > +}
> >
> > This whole prepare/execute split for updating apply pregress, and next
> > undo pointers makes no sense to me.
> >
> 
> Can you explain what is your concern here?  Basically, in the prepare
> phase, we do read and lock the buffer and in the actual update phase
> (which is under critical section), we update the contents in the
> shared buffer.  This is the same idea as we use in many places in
> code.

I'll comment on the concerns with the whole API separately.


> > >  typedef struct TwoPhaseFileHeader
> > >  {
> > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader
> > >   uint16  gidlen; /* length of the GID - GID 
> > > follows the header */
> > >   XLogRecPtr  origin_lsn; /* lsn of this record at 
> > > origin node */
> > >   TimestampTz origin_timestamp;   /* time of prepare at origin node */
> > > +
> > > + /*
> > > +  * We need the locations of the start and end undo record pointers 
> > > when
> > > +  * rollbacks are to be performed for prepared transactions using 
> > > undo-based
> > > +  * relations.  We need to store this information in the file as the 
> > > user
> > > +  * might rollback the prepared transaction after recovery and for 
> > > that we
> > > +  * need its start and end undo locations.
> > > +  */
> > > + UndoRecPtr  start_urec_ptr[UndoLogCategories];
> > > + UndoRecPtr  end_urec_ptr[UndoLogCategories];
> > >  } TwoPhaseFileHeader;
> >
> > Why do we not need that knowledge for undo processing of a non-prepared
> > transaction?

> The non-prepared transaction also needs to be aware of that.  It is
> stored in TransactionStateData.  I am not sure if I understand your
> question here.

My concern is that I think it's fairly ugly to store data like this in
the 2pc state file. And it's not an insubstantial amount of additional
data either, compared to the current size, even when no undo is in
use. There's a difference between an unused feature increasing backend
local memory and increasing the 

RE: Locale support

2019-08-08 Thread Yonatan Misgan
Thank you for your quick response. I am also interested in developing the
Ethiopian calendar as an extension for PostgreSQL, and I have already developed
a function that converts Gregorian calendar time to Ethiopian calendar time.
The difficulty is in how to use this function in PostgreSQL: month names are
keywords there, and the date data type does not accept an Ethiopian month name
as a value; only the numeric representation of the months is accepted.
So my question is: after developing the converter function, where do I put it
so that it can be accessed from PostgreSQL?



 Original message 
From: Thomas Munro 
Date: 8/8/19 11:34 AM (GMT+03:00)
To: Yonatan Misgan 
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: Locale support

On Thu, Aug 8, 2019 at 7:31 PM Yonatan Misgan  wrote:
> I am Yonathan Misgan from Ethiopia, want to add some functionalities on 
> PostgreSQL to support Ethiopian locales. I want your advice where I start to 
> hack the PostgresSQL source code. I have attached some synopsis about the 
> existing problems of PostgresSQL related with Ethiopian locale specially 
> related with calendar, date and time format.

Hi Yonatan,

I'm not sure if this requires hacking the PostgreSQL source code.  It
sounds more like an extension.  My first impression is that you might
not need new types like "date".  Instead you might be able to develop
a suite of functions that can convert the existing types to and from
the display formats (ie strings) and perhaps also components (year,
month, day etc) that you use in your calendar system.  For example:

SELECT to_char_ethiopian(CURRENT_DATE, 'YYYY-MM-DD'), or whatever kind
of format control string would be more appropriate.

However, I see from https://en.wikipedia.org/wiki/Time_in_Ethiopia
that new days start at 1 o'clock, not midnight, so that makes
CURRENT_DATE a bit more confusing -- you might need to write a
function current_eth_date() to deal with that small difference.  Other
than that detail, which is really a detail of CURRENT_DATE and not of
the date type, dates are internally represented as a number of days
since some arbitrary "epoch" day (I think Gregorian 2000-01-01), not
as the components you see when you look at the output of SELECT
CURRENT_DATE.  That is, the Gregorian calendar concepts exist mostly
in the display/input functions, and the operators that can add
intervals etc.  You could supply a different set of functions, but use
the same types, and I suspect that'd be convenient because then you'll
be able to use Gregorian and Ethiopian conventions with the same data,
whenever you need to.  It's much the same for timestamps, but with
more complicated details.

I see that there are libraries and bits of example code around to do
the various kinds of calendar maths required for Ethiopian dates in
Perl, Python etc.  If I were you I think I'd experiment with a
prototype implementation using  PL/Perl, PL/Python etc (a way to
define new PostgreSQL functions written in those languages), and if
that goes well, try writing an extension in C to do it more
efficiently.

The end goal of that wouldn't need to be part of PostgreSQL itself, but
just an extension that anyone can download and install to use
Ethiopian dates conveniently.

--
Thomas Munro
https://enterprisedb.com




Re: partition routing layering in nodeModifyTable.c

2019-08-08 Thread Etsuro Fujita
Hi,

On Thu, Aug 8, 2019 at 10:10 AM Amit Langote  wrote:
> On Wed, Aug 7, 2019 at 6:00 PM Etsuro Fujita  wrote:
> > On Wed, Aug 7, 2019 at 4:28 PM Amit Langote  wrote:
> > > I just noticed obsolete references to es_result_relation_info that
> > > 0002 failed to remove.  One of them is in fdwhandler.sgml:
> > >
> > > 
> > > TupleTableSlot *
> > > IterateDirectModify(ForeignScanState *node);
> > > 
> > >
> > > ... The data that was actually inserted, updated
> > >  or deleted must be stored in the
> > >  
> > > es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple
> > >  of the node's EState.
> > >
> > > We will need to rewrite this without mentioning
> > > es_result_relation_info.  How about as follows:
> > >
> > > - es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple
> > > - of the node's EState.
> > > + ri_projectReturning->pi_exprContext->ecxt_scantuple
> > > + of the result relation's ResultRelInfo that has
> > > + been made available via node.
> > >
> > > I've updated 0001 with the above change.

> > This would be nitpicking, but:
> >
> > * IIUC, we don't use the term "result relation" in fdwhandler.sgml.
> > For consistency with your change to the doc for BeginDirectModify, how
> > about using the term "target foreign table" instead of "result
> > relation"?
>
> Agreed, done.
>
> > * ISTM that "ResultRelInfo that has been made
> > available via node" would be a bit fuzzy to FDW authors.  To be more
> > specific, how about changing it to
> > "ResultRelInfo passed to
> > BeginDirectModify" or something like that?
>
> That works for me, although an FDW author reading this still has got
> to make the connection.
>
> Attached updated patches; only 0001 changed in this version.

Thanks for the updated version, Amit-san!  I updated the 0001 patch a
bit further:

* Tweaked comments in plannodes.h, createplan.c, and nodeForeignscan.c.
* Made cosmetic changes to postgres_fdw.c.
* Adjusted doc changes a bit, mainly not to produce unnecessary diff.
* Modified the commit message.

Attached is an updated version of the 0001 patch.  Does that make sense?

Best regards,
Etsuro Fujita


v7-0001-Remove-dependency-on-estate-es_result_relation_info-efujita.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-08-08 Thread Amit Kapila
On Wed, Aug 7, 2019 at 5:06 PM Thomas Munro  wrote:
>
> On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila  wrote:
> > On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila  
> > wrote:
> > > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro  
> > > wrote:
> > > >  but
> > > > here's a small thing: I managed to reach an LWLock self-deadlock in
> > > > the undo worker launcher:
> > > >
> > >
> > > I could see the problem, will fix in next version.
> >
> > Fixed both of these problems in the patch just posted by me [1].
>
> I reran the script that found that problem, so I could play with the
> linger logic.
>

Thanks for the test.  I will look into it and get back to you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Problem with default partition pruning

2019-08-08 Thread yuzuko
Hello,

On Thu, Aug 8, 2019 at 5:09 PM Amit Langote  wrote:
>
> Hi Simon,
>
> On Thu, Aug 8, 2019 at 4:54 PM Simon Riggs  wrote:
> > On Wed, 7 Aug 2019 at 21:27, Alvaro Herrera  
> > wrote:
> >> Well, yes, avoiding that is the point of this commit also: we were
> >> scanning some partitions for some queries, after this patch we're
> >> supposed not to.
> >
> >
> > Understood
> >
> > My concern was about the additional execution time caused when there would 
> > be no benefit, especially if the algorithmic cost is O(N) or similar (i.e. 
> > worse than O(k))
>
> Note that we apply constraint exclusion only to the (sub-partitioned)
> parent, not to all partitions, so the cost is not O(N) in the number
> of partitions.  The idea is that if the parent is excluded, all of its
> partitions are.  We normally wouldn't need to use constraint exclusion,
> because partition pruning can handle most cases.  What partition
> pruning can't handle sufficiently well though is the case where a
> clause set that contradicts the partition constraint is specified --
> while all non-default partitions are correctly pruned, the default
> partition is not.  Using constraint exclusion is a workaround for that
> deficiency of the partition pruning logic.
>
Besides that, the additional code will not be executed if people don't
define any default partition, in the latest patch Amit proposed.  So I think
this patch does not have the problem Simon is concerned about.

I looked at Amit's patches and found it worked correctly.

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center




Re: s/rewinded/rewound/?

2019-08-08 Thread Michael Paquier
On Wed, Aug 07, 2019 at 12:48:29PM +0300, Liudmila Mantrova wrote:
> If we decide to fix this, we should probably revise and back-patch the whole
> paragraph where it appears as it seems to mix up scanning target cluster
> WALs and applying source cluster WALs. A small patch is attached for your
> consideration (originally proposed on pgsql-docs [1]).

Okay, I can see the confusion, and your proposed rewording looks fine
to me.  Any objections?
--
Michael


signature.asc
Description: PGP signature


Re: Documentation clarification re: ANALYZE

2019-08-08 Thread Michael Paquier
On Wed, Aug 07, 2019 at 05:54:14PM -0400, Tom Lane wrote:
> Actually, looking in the source code finds
> 
>  * We allow the user to vacuum or analyze a table if he is superuser, the
>  * table owner, or the database owner (but in the latter case, only if
>  * it's not a shared relation).
> 
> It's definitely a documentation omission that this isn't spelled out in
> the ANALYZE reference page (VACUUM's page does have text about it).

As far as I recall we have been doing that for ages, so +1 for the
documentation fix you have just done.
--
Michael


signature.asc
Description: PGP signature


Re: SQL/JSON path: collation for comparisons, minor typos in docs

2019-08-08 Thread Markus Winand
Hi!

The patch makes my tests pass.

I wonder about a few things:

- Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?

- For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last paragraph).
- I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.

My tests check whether those cases do case-sensitive comparisons. With my 
default collation "en_US.UTF-8” I cannot discover potential issues there. I 
haven’t played around with nondeterministic ICU collations yet :(

-markus
ps.: for me, testing the regular expression dialect of like_regex is out of 
scope


> On 8 Aug 2019, at 02:27, Alexander Korotkov  wrote:
> 
> On Thu, Aug 8, 2019 at 3:05 AM Alexander Korotkov wrote:
>> On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
>>  wrote:
>>> On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
>>>  wrote:
 On Wed, Aug 7, 2019 at 2:25 PM Markus Winand  
 wrote:
> I was playing around with JSON path quite a bit and might have found one 
> case where the current implementation doesn’t follow the standard.
> 
> The functionality in question are the comparison operators except ==. 
> They use the database default collation rather then the standard-mandated 
> "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, 
> last sentence in first paragraph).
 
 Thank you for pointing!  Nikita is about to write a patch fixing that.
>>> 
>>> Please, see the attached patch.
>>> 
>>> Our idea is to not sacrifice "==" operator performance for standard
>>> conformance.  So, "==" remains per-byte comparison.  For consistency
>>> in other operators we compare code points first, then do per-byte
>>> comparison.  In some edge cases, when same Unicode codepoints have
>>> different binary representations in database encoding, this behavior
>>> diverges from the standard.  In future we can implement strict standard
>>> conformance by normalization of input JSON strings.
>> 
>> Previous version of patch has buggy implementation of
>> compareStrings().  Revised version is attached.
> 
> Nikita pointed me that for UTF-8 strings per-byte comparison result
> matches codepoints comparison result.  That allows simplify patch a
> lot.
> 
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com 
> 
> The Russian Postgres Company
> <0001-Use-Unicode-codepoint-collation-in-jsonpath-4.patch>



Re: Some memory not freed at the exit of RelationBuildPartitionDesc()

2019-08-08 Thread Amit Langote
On Thu, Aug 8, 2019 at 5:33 PM amul sul  wrote:
> On Thu, Aug 8, 2019 at 1:27 PM Amit Langote  wrote:

>> Thanks for the patch.  This was discussed recently in the "hyrax vs.
>> RelationBuildPartitionDesc()" thread [1] and I think Alvaro proposed
>> an approach that's similar to yours.  Not sure why it wasn't pursued
>> though.  Maybe the reason is buried somewhere in that discussion.
>
> Oh, quite similar, thanks Amit for pointing that out.
>
> Looks like "hyrax vs. RelationBuildPartitionDesc()" is under discussion for
> the master branch only (not sure though), but we need a similar fix for the
> back branches as well.

Well, this is not a bug as such, so it's very unlikely that a fix like
this will be back-patched.  Also, if this becomes an issue only with
more than 1000 partitions, then it's not very relevant for PG 10
and PG 11, because we don't recommend using so many partitions with
them.  Maybe we can consider fixing PG 12 though.

Thanks,
Amit




Re: Locale support

2019-08-08 Thread Thomas Munro
On Thu, Aug 8, 2019 at 7:31 PM Yonatan Misgan  wrote:
> I am Yonathan Misgan from Ethiopia, want to add some functionalities on 
> PostgreSQL to support Ethiopian locales. I want your advice where I start to 
> hack the PostgresSQL source code. I have attached some synopsis about the 
> existing problems of PostgresSQL related with Ethiopian locale specially 
> related with calendar, date and time format.

Hi Yonatan,

I'm not sure if this requires hacking the PostgreSQL source code.  It
sounds more like an extension.  My first impression is that you might
not need new types like "date".  Instead you might be able to develop
a suite of functions that can convert the existing types to and from
the display formats (ie strings) and perhaps also components (year,
month, day etc) that you use in your calendar system.  For example:

SELECT to_char_ethiopian(CURRENT_DATE, 'YYYY-MM-DD'), or whatever kind
of format control string would be more appropriate.

However, I see from https://en.wikipedia.org/wiki/Time_in_Ethiopia
that new days start at 1 o'clock, not midnight, so that makes
CURRENT_DATE a bit more confusing -- you might need to write a
function current_eth_date() to deal with that small difference.  Other
than that detail, which is really a detail of CURRENT_DATE and not of
the date type, dates are internally represented as a number of days
since some arbitrary "epoch" day (I think Gregorian 2000-01-01), not
as the components you see when you look at the output of SELECT
CURRENT_DATE.  That is, the Gregorian calendar concepts exist mostly
in the display/input functions, and the operators that can add
intervals etc.  You could supply a different set of functions, but use
the same types, and I suspect that'd be convenient because then you'll
be able to use Gregorian and Ethiopian conventions with the same data,
whenever you need to.  It's much the same for timestamps, but with
more complicated details.

I see that there are libraries and bits of example code around to do
the various kinds of calendar maths required for Ethiopian dates in
Perl, Python etc.  If I were you I think I'd experiment with a
prototype implementation using  PL/Perl, PL/Python etc (a way to
define new PostgreSQL functions written in those languages), and if
that goes well, try writing an extension in C to do it more
efficiently.

The end goal of that wouldn't need to be part of PostgreSQL itself, but
just an extension that anyone can download and install to use
Ethiopian dates conveniently.

-- 
Thomas Munro
https://enterprisedb.com




Re: Some memory not freed at the exit of RelationBuildPartitionDesc()

2019-08-08 Thread amul sul
On Thu, Aug 8, 2019 at 1:27 PM Amit Langote  wrote:

> Hi Amul,
>
> On Thu, Aug 8, 2019 at 4:15 PM amul sul  wrote:
> >
> > Hi,
> >
> > In RelationBuildPartitionDesc(), the memory used to gather partitioning
> > bound info isn't freed at the end.  This might not be a problem, because the
> > allocated memory will eventually be recovered when the top-level context is
> > freed, but in the case where a partitioned table has 1000s or more partitions
> > and the partitioned relation is opened & closed, with its cached entry
> > invalidated, in a loop, we'll have too many calls to
> > RelationBuildPartitionDesc(), which will keep wasting some space with every
> > iteration.
> >
> > For demonstration purposes, I made the following changes to
> > heap_drop_with_catalog() and tried to drop a partitioned table having
> > 5000 partitions (attached create script), which hit OOM on a machine
> > in no time:
> >
> > diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> > index b7bcdd9d0f..6b7bc0d7ae 100644
> > --- a/src/backend/catalog/heap.c
> > +++ b/src/backend/catalog/heap.c
> > @@ -1842,6 +1842,8 @@ heap_drop_with_catalog(Oid relid)
> > parentOid = get_partition_parent(relid);
> > LockRelationOid(parentOid, AccessExclusiveLock);
> >
> > +   rel = relation_open(parentOid, NoLock);
> > +   relation_close(rel, NoLock);
> > /*
> >  * If this is not the default partition, dropping it will change
> the
> >  * default partition's partition constraint, so we must lock it.
> >
> >
> > I think we should do all the partition bound information gathering and
> > calculation in a temporary memory context, which can be released at the
> > end of RelationBuildPartitionDesc().  Thoughts/comments?
> >
> > I did the same in the attached patch, and the aforesaid OOM issue
> > disappeared.
>
> Thanks for the patch.  This was discussed recently in the "hyrax vs.
> RelationBuildPartitionDesc()" thread [1] and I think Alvaro proposed
> an approach that's similar to yours.  Not sure why it wasn't pursued
> though.  Maybe the reason is buried somewhere in that discussion.
>
> Thanks,
> Amit
>
> [1]
> https://www.postgresql.org/message-id/flat/CA%2BTgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw%40mail.gmail.com


Oh, quite similar; thanks, Amit, for pointing that out.

It looks like "hyrax vs. RelationBuildPartitionDesc()" is being discussed
for the master branch only (not sure though), but we need a similar fix
for the back branches as well.

Regards,
Amul


PG_TRY()/CATCH() in a loop reports uninitialized variables

2019-08-08 Thread Adam Lee
Hi, hackers

"The local variables that do not have the volatile type and have been changed
between the setjmp() invocation and longjmp() call are indeterminate". This is
what POSIX (and the C standard, for setjmp) says.

That's fine actually, but if we put PG_TRY()/CATCH() in a loop, newer gcc
versions might complain.

Version:
$ gcc-9 --version
gcc-9 (Ubuntu 9.1.0-2ubuntu2~19.04) 9.1.0
(Actually from gcc 7)

Reproducer:
```
#include <setjmp.h>

extern int other(void);
extern void trigger(int *cond1);
extern sigjmp_buf *PG_exception_stack;

void
trigger(int *cond1)
{
while (1)
{
if (*cond1 == 0)
*cond1 = other();

while (*cond1)
{
sigjmp_buf *save_exception_stack = PG_exception_stack;
sigjmp_buf local_sigjmp_buf;

if (sigsetjmp(local_sigjmp_buf, 0) == 0)
PG_exception_stack = &local_sigjmp_buf;
else
PG_exception_stack = (sigjmp_buf *) 
save_exception_stack;

PG_exception_stack = (sigjmp_buf *) 
save_exception_stack;
}
}
}
```

```
$ gcc-9 -O1 -Werror=uninitialized -fexpensive-optimizations -ftree-pre -c -o 
/dev/null reproducer.c
reproducer.c: In function 'trigger':
reproducer.c:17:16: error: 'save_exception_stack' is used uninitialized in this 
function [-Werror=uninitialized]
   17 |sigjmp_buf *save_exception_stack = PG_exception_stack;
  |^~~~
cc1: some warnings being treated as errors
```

Code re-ordering matters.  When it warns:
```
sigjmp_buf *save_exception_stack = PG_exception_stack;
  2f:   48 8b 1d 00 00 00 00mov0x0(%rip),%rbx# 36 
  36:   48 89 5c 24 18  mov%rbx,0x18(%rsp)
sigjmp_buf local_sigjmp_buf;

if (sigsetjmp(local_sigjmp_buf, 0) == 0)
```

When it doesn't complain:
```
sigjmp_buf *save_exception_stack = PG_exception_stack;
sigjmp_buf local_sigjmp_buf;

if (sigsetjmp(local_sigjmp_buf, 0) == 0)
  29:   48 8d 44 24 20  lea0x20(%rsp),%rax
  2e:   48 89 44 24 08  mov%rax,0x8(%rsp)
...
sigjmp_buf *save_exception_stack = PG_exception_stack;
  3c:   48 8b 1d 00 00 00 00mov0x0(%rip),%rbx# 43 
  43:   48 89 5c 24 18  mov%rbx,0x18(%rsp)
```

Greenplum had an issue reporting save_exception_stack and save_context_stack
not initialized.
https://github.com/greenplum-db/gpdb/issues/8262

I filed a bug report with gcc; they think it's expected.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91395

Since newer gcc versions consider this warning expected, do we need to make
these two variables volatile? Or have I missed something?
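As a minimal illustration of the rule quoted above (plain C, not PostgreSQL code; the names are invented), a local modified between setjmp() and longjmp() keeps a well-defined value afterwards only if it is volatile:

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf env;

/* Returns the value of a local that is modified between setjmp() and
 * longjmp().  Because the local is volatile, the C standard guarantees
 * it still holds the last stored value after the longjmp(); without
 * volatile the value would be indeterminate, which is the situation
 * gcc's uninitialized-use analysis is flagging. */
static int
modified_across_longjmp(void)
{
	volatile int v = 0;

	if (setjmp(env) == 0)
	{
		v = 1;				/* modified after setjmp() ... */
		longjmp(env, 1);	/* ... and before longjmp() */
	}
	return v;				/* well-defined only because v is volatile */
}
```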

-- 
Adam Lee
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index dbfd8efd26..97e8405bfa 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -300,8 +300,8 @@ extern PGDLLIMPORT ErrorContextCallback 
*error_context_stack;
  */
 #define PG_TRY()  \
do { \
-   sigjmp_buf *save_exception_stack = PG_exception_stack; \
-   ErrorContextCallback *save_context_stack = error_context_stack; 
\
+   sigjmp_buf * volatile save_exception_stack = 
PG_exception_stack; \
+   ErrorContextCallback * volatile save_context_stack = 
error_context_stack; \
sigjmp_buf local_sigjmp_buf; \
if (sigsetjmp(local_sigjmp_buf, 0) == 0) \
{ \


Re: Problem with default partition pruning

2019-08-08 Thread Amit Langote
Hi Simon,

On Thu, Aug 8, 2019 at 4:54 PM Simon Riggs  wrote:
> On Wed, 7 Aug 2019 at 21:27, Alvaro Herrera  wrote:
>> Well, yes, avoiding that is the point of this commit also: we were
>> scanning some partitions for some queries, after this patch we're
>> supposed not to.
>
>
> Understood
>
> My concern was about the additional execution time caused when there would be 
> no benefit, especially if the algorithmic cost is O(N) or similar (i.e. worse 
> than O(k))

Note that we apply constraint exclusion only to the (sub-partitioned)
parent, not to all partitions, so the cost is not O(N) in the number
of partitions.  The idea is that if the parent is excluded, all of its
partitions are.  We normally wouldn't need to use constraint exclusion,
because partition pruning can handle most cases.  What partition
pruning can't handle sufficiently well though is the case where a
clause set that contradicts the partition constraint is specified --
while all non-default partitions are correctly pruned, the default
partition is not.  Using constraint exclusion is a workaround for that
deficiency of the partition pruning logic.
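For illustration (table names invented, not taken from the thread), the problematic case looks roughly like this:

```sql
-- Hypothetical setup: p1 is itself partitioned, and its own partition
-- constraint (as a child of p) is 0 <= a < 100.
CREATE TABLE p (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p
    FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (b);
CREATE TABLE p1_1 PARTITION OF p1 FOR VALUES FROM (0) TO (10);
CREATE TABLE p1_def PARTITION OF p1 DEFAULT;

-- The clause a = 200 contradicts p1's partition constraint, so no row
-- can be in any of its partitions.  Pruning within p1 works on b, so it
-- cannot help here; applying constraint exclusion to the parent p1
-- excludes the whole subtree, default partition included.
EXPLAIN (COSTS OFF) SELECT * FROM p1 WHERE a = 200;
```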

Thanks,
Amit




Re: [PROPOSAL] Temporal query processing with range types

2019-08-08 Thread Peter Moser
Hi Ibrar, Thomas and Robert,

On 8/2/19 11:00 PM, Ibrar Ahmed wrote:
> I have rebased the patch and currently reviewing the patch 
> on master (1e2fddfa33d3c7cc93ca3ee0f32852699bd3e012).

Thanks a lot for your effort. We are now trying again to put more work
and time into this patch.
We are grateful for any feedback.

Thanks,
Peter





Re: Some memory not freed at the exit of RelationBuildPartitionDesc()

2019-08-08 Thread Amit Langote
Hi Amul,

On Thu, Aug 8, 2019 at 4:15 PM amul sul  wrote:
>
> Hi,
>
> In RelationBuildPartitionDesc(), the memory used to gather partition
> bound info isn't freed at the end.  This might not be a problem because
> the allocated memory will eventually be recovered when the top-level
> context is freed, but when a partitioned table has 1000s or more
> partitions and the partitioned relation is opened & closed, with its
> cached entry invalidated, in a loop, we'll have too many calls to
> RelationBuildPartitionDesc(), which will keep wasting some space with
> every iteration.
>
> For demonstration purposes, I made the following changes to
> heap_drop_with_catalog() and tried to drop a partitioned table having
> 5000 partitions (attached create script), which hit OOM on a machine in
> no time:
>
> diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> index b7bcdd9d0f..6b7bc0d7ae 100644
> --- a/src/backend/catalog/heap.c
> +++ b/src/backend/catalog/heap.c
> @@ -1842,6 +1842,8 @@ heap_drop_with_catalog(Oid relid)
> parentOid = get_partition_parent(relid);
> LockRelationOid(parentOid, AccessExclusiveLock);
>
> +   rel = relation_open(parentOid, NoLock);
> +   relation_close(rel, NoLock);
> /*
>  * If this is not the default partition, dropping it will change the
>  * default partition's partition constraint, so we must lock it.
>
>
> I think we should do all the partition bound information gathering and
> calculation in a temporary memory context, which can be released at the
> end of RelationBuildPartitionDesc().  Thoughts/comments?
>
> I did the same in the attached patch, and the aforesaid OOM issue
> disappeared.

Thanks for the patch.  This was discussed recently in the "hyrax vs.
RelationBuildPartitionDesc()" thread [1] and I think Alvaro proposed
an approach that's similar to yours.  Not sure why it wasn't pursued
though.  Maybe the reason is buried somewhere in that discussion.

Thanks,
Amit

[1] 
https://www.postgresql.org/message-id/flat/CA%2BTgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw%40mail.gmail.com




Re: Problem with default partition pruning

2019-08-08 Thread Simon Riggs
On Wed, 7 Aug 2019 at 21:27, Alvaro Herrera 
wrote:

> On 2019-Aug-07, Simon Riggs wrote:
>
> > I saw your recent commit and it scares me in various places, noted below.
> >
> > "Commit: Apply constraint exclusion more generally in partitioning"
> >
> > "This applies particularly to the default partition..."
> >
> > My understanding of the thread was the complaint was about removing the
> > default partition. I would prefer to see code executed just for that
> case,
> > so that people who do not define a default partition are unaffected.
>
> Well, as the commit message noted, it applies to other cases also, not
> just the default partition.  The default partition just happens to be
> the most visible case.
>
> > "So in certain cases
> > we're scanning partitions that we don't need to."
> >
> > Avoiding that has been the subject of months of work.
>
> Well, yes, avoiding that is the point of this commit also: we were
> scanning some partitions for some queries, after this patch we're
> supposed not to.
>

Understood

My concern was about the additional execution time caused when there would
be no benefit, especially if the algorithmic cost is O(N) or similar (i.e.
worse than O(k)).

If people have a default partition, I have no problem in there being
additional execution time in that case only since there is only ever one
default partition.


> > Please could we do perf checks, with tests up to 1000s of partitions? And
> > if there is a regression, I would vote to revoke this patch or address
> the
> > request in a less general way.
>
> I'll have a look.
>

Thanks

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: Small patch to fix build on Windows

2019-08-08 Thread Dmitry Igrishin
> > As for the replace comment, I'm not
> > sure it is needed since I think quoting is not the task for
> > AddLibrary/AddIncludeDir in the first place (and AddIncludeDir
> > doesn't have the same comment).
The attached 3rd version of the patch contains no comment in AddLibrary().

Sorry, forgot to attach the patch to the previous mail.
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index d1d0aed07e..76834f5188 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -495,7 +495,7 @@ sub mkvcbuild
 		my $pythonprog = "import sys;print(sys.prefix);"
 		  . "print(str(sys.version_info[0])+str(sys.version_info[1]))";
 		my $prefixcmd =
-		  $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
+		  "\"$solution->{options}->{python}\\python\" -c \"$pythonprog\"";
 		my $pyout = `$prefixcmd`;
 		die "Could not query for python version!\n" if $?;
 		my ($pyprefix, $pyver) = split(/\r?\n/, $pyout);
diff --git a/src/tools/msvc/Project.pm b/src/tools/msvc/Project.pm
index b5d1dc6e89..88e9e3187d 100644
--- a/src/tools/msvc/Project.pm
+++ b/src/tools/msvc/Project.pm
@@ -132,11 +132,6 @@ sub AddLibrary
 {
 	my ($self, $lib, $dbgsuffix) = @_;
 
-	if ($lib =~ m/\s/)
-	{
-		$lib = '' . $lib . "";
-	}
-
 	push @{ $self->{libraries} }, $lib;
 	if ($dbgsuffix)
 	{
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 318594db5d..327e556c53 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -126,7 +126,7 @@ sub GetOpenSSLVersion
 	# Attempt to get OpenSSL version and location.  This assumes that
 	# openssl.exe is in the specified directory.
 	my $opensslcmd =
-	  $self->{options}->{openssl} . "\\bin\\openssl.exe version 2>&1";
+	  "\"$self->{options}->{openssl}\\bin\\openssl.exe\" version 2>&1";
 	my $sslout = `$opensslcmd`;
 
 	$? >> 8 == 0


Re: Small patch to fix build on Windows

2019-08-08 Thread Dmitry Igrishin
чт, 8 авг. 2019 г. в 06:18, Kyotaro Horiguchi :
>
> Hello.
>
> At Wed, 7 Aug 2019 12:14:48 +0300, Dmitry Igrishin  wrote 
> in 
> > > -if ($lib =~ m/\s/)
> > > -{
> > > -$lib = '' . $lib . "";
> > > -}
> > > +# Since VC automatically quotes paths specified as the data of
> > > +#  in VC project file, it's mistakably
> > > +# to quote them here. Thus, it's okay if $lib contains spaces.
> > >
> > > I'm not sure, but it's not likely that someone adds it without
> > > actually stumbling on space-containing paths with the ealier
> > > version. Anyway if we shouldn't touch this unless the existing
> > > code makes actual problem.
> > So, do you think a comment is not needed here?
>
> # Sorry the last phrase above is broken.
>
> I meant "if it ain't broke don't fix it".
>
> I doubt that some older versions of VC might need it. I confirmed
> that the extra  actually harms at least VC2019 and the code
> removal in the patch works.
The code removal is required also to build on VC2017.

> As for the replace comment, I'm not
> sure it is needed since I think quoting is not the task for
> AddLibrary/AddIncludeDir in the first place (and AddIncludeDir
> doesn't have the same comment).
The attached 3rd version of the patch contains no comment in AddLibrary().

>
> Now I'm trying to install VS2015 into my almost-filled-up disk..
Thank you!




Re: Problem with default partition pruning

2019-08-08 Thread Shawn Wang
Hi Hosoya-san, 




I am sorry to be so late in replying to you.



I merged the patches into master (commit: 
44460d7017cde005d7a2e246db0b32375bfec15d).

I tested the case I used in the previous patches and didn't find any issues. 



Now I find that you are rethinking some of the details. 

I will continue to pay attention to this and will follow up and give 
feedback in time.



Regards, 


 

-- 

Shawn Wang






 On Thu, 27 Jun 2019 10:34:13 +0800 yuzuko  wrote 




Hello, 
 
On Tue, Jun 25, 2019 at 1:45 PM yuzuko  wrote: 
> 
> Hello Shawn, Alvaro, 
> 
> Thank you for testing patches and comments. 
> Yes, there are two patches: 
> (1) v4_default_partition_pruning.patch fixes problems with default 
> partition pruning 
> and (2) v3_ignore_contradictory_where_clauses_at_partprune_step.patch 
> determines 
> if a given clause contradicts a sub-partitioned table's partition constraint. 
> I'll post two patches together next time. 
> 
> Anyway, I'll rebase two patches to apply on master and fix space. 
> 
 
Attach the latest patches discussed in this thread.  I rebased the second 
patch (v5_ignore_contradictory_where_clauses_at_partprune_step.patch) 
on the current master.  Could you please check them again? 
 
-- 
Best regards, 
Yuzuko Hosoya 
NTT Open Source Software Center

Locale support

2019-08-08 Thread Yonatan Misgan
I am Yonatan Misgan from Ethiopia, and I want to add some functionality to 
PostgreSQL to support Ethiopian locales. I would like your advice on where to 
start hacking the PostgreSQL source code. I have attached a synopsis of the 
existing problems PostgreSQL has with the Ethiopian locale, especially related 
to the calendar and to date and time formats. Please don't mind that some dates 
and times are written in Amharic; I used it only to show the problems.
Calendar: A calendar is a system of organizing days for social, religious, 
commercial or administrative purposes. This is done by giving names to periods 
of time, typically days, weeks, months and years. A date is the designation of 
a single, specific day within such a system. The Gregorian calendar is the most 
widely used calendar in the world today, especially for databases and other 
computer systems. It is the calendar used in the international standard for 
representation of dates and times: ISO 8601:2004. It is a solar calendar based 
on a 365-day common year divided into 12 months of irregular lengths: 11 of the 
months have either 30 or 31 days, while the second month, February, has only 28 
days during the common year. However, nearly every four years is a leap year, 
when one extra or intercalary day is added in February, making the leap year in 
the Gregorian calendar 366 days long. The days of the year in the calendar are 
divided into 7-day weeks. The international standard is to start the week on 
Monday. However, several countries, including Ethiopia, the US and Canada, 
count Sunday as the first day of the week.
The Ethiopian calendar is the principal calendar used in Ethiopia and serves as 
the liturgical year for Christians in Eritrea and Ethiopia belonging to the 
Eritrean Orthodox Tewahedo Church, Ethiopian Orthodox Tewahedo Church, Eastern 
Catholic Churches and Coptic Orthodox Church of Alexandria. The Ethiopian 
calendar differs from the Gregorian calendar in terms of day, month and year. 
Like the Coptic calendar, the Ethiopic calendar has 12 months of 30 days plus 
5 or 6 epagomenal days, which comprise a thirteenth month. Maybe the clearest 
example of the different ways cultures view the world around them is 
demonstrated by our perceptions of the nontangible entity of time. The 
seven-day week is nearly universal, but we disagree on what we consider the 
first day of the week to be. This is the case even under the same calendar 
system.
Date and Time: Ethiopia shares the 24-hour day convention with the rest of the 
world but differs on when the day begins. The two-part day division around 
mid-day (AM for ante meridiem and PM for post meridiem) is also a foreign 
notion taken as universal in localization systems, where the “” and “” day 
divisions of the Amharic translation are an approximation that is no more than 
serviceable:
ጠዋት and ከሰዓት
While these translations can be understood in the context of the foreign 
conventions that they map, they are not ideal for Ethiopia. Naturally, Ethiopia 
will want to apply its own conventions, which are already millennia old. ጠዋት, 
ረፋድ, እኩለ ቀን and እኩለ ሌሊት are some examples of day divisions in Ethiopia. Ethiopia 
does not have a well-established preference for digital time formats in 
database systems and other computer systems, but we want to introduce 
computerized systems into the society. An example digital time format under 
United States English conventions appears as: Mon 27 Feb 2018 12:00:00 PM EAT
The equivalent date and time under the Ethiopian Amharic convention as 
available on Linux systems today appears as:
ማክሰ ፌብሩ 26 ቀን 12: 00: 00 ከሰዓት EAT 2018 ዓ/ም
This represents a loose mapping of some Amharic conventions onto an external 
reckoning of time. This is only translation, not localization in its truest 
sense. A hypothetical Ethiopic date and time presentation might look like:
ማክሰኞ፣ የካቲት 19 ቀን 6: 00: 00 እኩለ ቀን 2010 ዓ/ም or
ማክሰኞ፣ የካቲት 19 ቀን 6: 00: 00 እኩለ ቀን፳ ፻ ፲ዓ/ም
Let's see the drawbacks that exist in PostgreSQL, with examples.
Most database systems use timestamps that store data in the Gregorian calendar 
system for a specific time zone.
However, most of the data in Ethiopia is kept in the Ethiopian calendar system, 
and users in Ethiopia are not comfortable with the Gregorian calendar, as they 
use the Ethiopian calendar in their day-to-day activities. This usually creates 
inconvenience when users want a reference to the Ethiopic date while the 
database system uses the Gregorian calendar. Example queries to demonstrate 
this are given below.
 Q2: Select current_date;
Q2 returns '2019-08-08', but currently in the Ethiopian calendar the year is 
'2011'.
 Q3: Select to_char(to_timestamp(to_char(4, '999'), 'MM'), 'Month');
Q3 returns 'April', whereas the 4th month is ታህሳስ (December) in the Ethiopian 
calendar system.
 Q4: Select to_char(to_timestamp (to_char(13, '999'), 'MM'), 'Month');
Q4 returns an error message, since the Gregorian calendar has only 12 months 
per year.
Here Q2, Q3 and Q4 are queries.


Small const correctness patch

2019-08-08 Thread Mark G
Hello all,

Please find attached a trivial patch making a few arrays const (in addition
to the data they point to).
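For context (a small sketch, not part of the patch), the difference the patch exploits is between an array of pointers to const data, which is itself still writable, and an array whose pointer elements are also const and can therefore be placed in read-only storage:

```c
#include <assert.h>
#include <string.h>

/* Pointers to const data: the strings can't be modified through these
 * pointers, but the pointer slots themselves are still writable. */
static const char *names_mut[] = {"any", "archive", "pg_wal", "stream"};

/* With the second const the pointer slots are frozen too, so the whole
 * array may live in a read-only section. */
static const char *const names_ro[] = {"any", "archive", "pg_wal", "stream"};

static const char *
swap_first(void)
{
	names_mut[0] = "none";		/* legal: the slot is writable */
	/* names_ro[0] = "none"; */	/* error: assignment of read-only location */
	return names_mut[0];
}
```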
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f553523857..aafc7187b0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -769,7 +769,7 @@ typedef enum
 } XLogSource;
 
 /* human-readable names for XLogSources, for debugging output */
-static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
 
 /*
  * openLogFile is -1 or a kernel FD for an open log file segment.
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d5f9b617c8..0bba32cde5 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -119,7 +119,7 @@ static bool noverify_checksums = false;
  * Note: this list should be kept in sync with the filter lists in pg_rewind's
  * filemap.c.
  */
-static const char *excludeDirContents[] =
+static const char *const excludeDirContents[] =
 {
 	/*
 	 * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
@@ -160,7 +160,7 @@ static const char *excludeDirContents[] =
 /*
  * List of files excluded from backups.
  */
-static const char *excludeFiles[] =
+static const char *const excludeFiles[] =
 {
 	/* Skip auto conf temporary file. */
 	PG_AUTOCONF_FILENAME ".tmp",


Re: Global temporary tables

2019-08-08 Thread Konstantin Knizhnik



On 08.08.2019 5:40, Craig Ringer wrote:
On Tue, 6 Aug 2019 at 16:32, Konstantin Knizhnik 
mailto:k.knizh...@postgrespro.ru>> wrote:


New version of the patch with several fixes is attached.
Many thanks to Roman Zharkov for testing.


FWIW I still don't understand your argument with regards to using 
shared_buffers for temp tables having connection pooling benefits. Are 
you assuming the presence of some other extension in your extended  
version of PostgreSQL ? In community PostgreSQL a temp table's 
contents in one backend will not be visible in another backend. So if 
your connection pooler in transaction pooling mode runs txn 1 on 
backend 42 and it populates temp table X, then the pooler runs the 
same app session's txn 2 on backend 45, the contents of temp table X 
are not visible anymore.


Certainly, here I mean the built-in connection pooler, which is not currently 
present in Postgres but is part of PgPRO-EE; there is also my patch for 
vanilla at commitfest:
https://commitfest.postgresql.org/24/2067
https://commitfest.postgresql.org/24/2067



Can you explain? Because AFAICS so long as temp table contents are 
backend-private there's absolutely no point ever using shared buffers 
for their contents.



Sure, there is no such problem with temporary tables now.
There is another problem: you cannot use a temporary table with any 
existing connection pooler (pgbouncer, ...) at a pooling level other than 
session, unless the temporary table is used inside one transaction.
One of the advantages of a built-in connection pooler is that it can 
provide session semantics (GUCs, prepared statements, temporary 
tables, ...) with a limited number of backends (smaller than the number of 
sessions).


In PgPRO-EE this problem was solved by binding a session to a backend, i.e. 
one backend can manage multiple sessions, but a session cannot migrate to 
another backend. The drawback of such a solution is obvious: one 
long-living transaction can block the transactions of all other sessions 
scheduled to that backend.
The possibility of migrating a session to another backend is one of the 
obvious solutions to the problem, but the main showstopper for it is 
temporary tables.
This is why I consider moving temporary tables to shared buffers a very 
important step.


In the vanilla version of the built-in connection pooler the situation is 
slightly different.
Right now, if a client uses temporary tables without the "ON COMMIT DROP" 
clause, the backend is marked as "tainted" and is pinned to this session.
So it is actually excluded from the connection pool and serves only this 
session. Once again, if I were able to access temporary table data from 
another backend, there would be no need to mark the backend as tainted in 
this case.
Certainly it also requires shared metadata. And here we come to the 
concept of global temp tables (if we forget for a moment that global temp 
tables were "invented" a long time ago by Oracle and many other DBMSes :)


Perhaps you mean that in a connection pooling case, each backend may 
land up filling up temp buffers with contents from *multiple different 
temp tables*? If so, sure, I get that, but the answer there seems to 
be to improve eviction and memory accounting, not make backends waste 
precious shared_buffers space on non-shareable data.


Anyhow, I strongly suggest you simplify the feature to add the basic 
global temp table feature so the need to change pg_class, pg_attribute 
etc to use temp tables is removed, but separate changes to temp table 
memory handling etc into a follow-up patch. That'll make it smaller 
and easier to review and merge too. The two changes are IMO logically 
quite separate anyway.


I agree that they are separate.
But even if we forget about the built-in connection pooler, don't you think 
that the possibility of using parallel query plans for temporary tables is 
itself strong enough motivation to access global temp tables through shared 
buffers (while still supporting a private page pool for local temp tables)? 
So both approaches (shared vs. private buffers) have their pros and cons. 
This is why it seems reasonable to support both of them and let the user 
make the choice most suitable for the concrete application. 
Certainly it is possible to provide both "global shared temp tables" and 
"global private temp tables", but IMHO that is overkill.


Come to think of it, I think connection poolers might benefit from an 
extension to the DISCARD command, say "DISCARD TEMP_BUFFERS", which 
evicts temp table buffers from memory *without* dropping the temp 
tables. If they're currently in-memory tuplestores they'd be written 
out and evicted. That way a connection pooler could "clean" the 
backend, at the cost of some probably pretty cheap buffered writes to 
the system buffer cache. The kernel might not even bother to write out 
the buffercache and it won't be forced to do so by fsync, checkpoints, 
etc, nor will the writes go via WAL so such evictions could be pretty 
cheap - and if not under lots of memory pressure the backend could 
often read the temp table 

Re: POC: converting Lists into arrays

2019-08-08 Thread Craig Ringer
On Thu, 8 Aug 2019 at 12:18, Andres Freund  wrote:

> Hi,
>
> On 2019-08-08 11:36:44 +0800, Craig Ringer wrote:
> > > you can only put one  into the first element of a
> > > for (;;).
> > >
> >
> > Use an anonymous block outer scope? Or if not permitted even by C99
> (which
> > I think it is), a do {...} while (0);  hack?
>
> You can't easily - the problem is that there's no real way to add the
> closing }, because that's after the macro.


Ah, right. Hence our

PG_TRY();
{
}
PG_CATCH();
{
}
PG_END_TRY();

construct in all its beauty.

I should've seen that.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise