Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-12 Thread Masahiko Sawada
On Sat, Aug 10, 2019 at 12:18 AM Tomas Vondra
 wrote:
>
> On Fri, Aug 09, 2019 at 11:51:23PM +0900, Masahiko Sawada wrote:
> >On Fri, Aug 9, 2019 at 10:25 AM Bruce Momjian  wrote:
> >>
> >> On Thu, Aug  8, 2019 at 06:31:42PM -0400, Stephen Frost wrote:
> >> > > >Crash recovery doesn't happen "all the time" and neither does vacuum
> >> > > >freeze, and autovacuum processes are independent of individual client
> >> > > >backends- we don't need to (and shouldn't) have the keys in shared
> >> > > >memory.
> >> > >
> >> > > Don't people do physical replication / HA pretty much all the time?
> >> >
> >> > Strictly speaking, that isn't actually crash recovery, it's physical
> >> > replication / HA, and while those are certainly nice to have it's no
> >> > guarantee that they're required or that you'd want to have the same keys
> >> > for them- conceptually, at least, you could have WAL with one key that
> >> > both sides know and then different keys for the actual data files, if we
> >> > go with the approach where the WAL is encrypted with one key and then
> >> > otherwise is plaintext.
> >>
> >> Uh, yes, you could have two encryption keys in the data directory, one
> >> for heap/indexes, one for WAL, both unlocked with the same passphrase,
> >> but what would be the value in that?
> >>
> >> > > >>That might allow crash recovery and the freeze part of VACUUM FREEZE
> >> > > >>to work.  (I don't think we could vacuum since we couldn't read the
> >> > > >>index pages to find the matching rows since the index values would be
> >> > > >>encrypted too.  We might be able to not encrypt the tid in the index
> >> > > >>tuple.)
> >> > > >
> >> > > >Why do we need the indexed values to vacuum the index..?  We don't
> >> > > >today, as I recall.  We would need the tids though, yes.
> >> > >
> >> > > Well, we also do collect statistics on the data, for example. But even
> >> > > if we assume we wouldn't do that for encrypted indexes (which seems like
> >> > > a pretty bad idea to me), you'd probably end up leaking information
> >> > > about ordering of the values. Which is generally a pretty serious
> >> > > information leak, AFAICS.
> >> >
> >> > I agree entirely that order information would be bad to leak- but this
> >> > is all new ground here and we haven't actually sorted out what such a
> >> > partially encrypted btree would look like.  We don't actually have to
> >> > have the down-links in the tree be unencrypted to allow vacuuming of
> >> > leaf pages, after all.
> >>
> >> Agreed, but I think we kind of know that the value in cluster-wide
> >> encryption is different from multi-key encryption --- both have their
> >> value, but right now cluster-wide is the easiest and simplest, and
> >> probably meets more user needs than multi-key encryption.  If others
> >> want to start scoping out what multi-key encryption would look like, we
> >> can discuss it.  I personally would like to focus on cluster-wide
> >> encryption for PG 13.
> >
> >I agree that cluster-wide is simpler but I'm not sure that it meets
> >real needs of users. One example is re-encryption; if a key leak
> >happens, with cluster-wide encryption we end up re-encrypting the
> >whole database regardless of how much user-sensitive data it holds.
> >I think that's a big constraint for users, because it's common that
> >the data that needs to be encrypted, such as a master table, doesn't
> >account for a large portion of the database. That's one reason why I
> >think finer-granularity encryption, such as per table/tablespace, is
> >required.
> >
>
> TBH I think it's mostly pointless to design for key leakage.
>
> My understanding is that all this work is motivated by the assumption that
> Bob can obtain access to the data directory (say, a backup of it). So if
> he also manages to get access to the encryption key, we probably have to
> assume he already has access to current snapshot of the data directory,
> which means any re-encryption is pretty futile.
>
> What we can (and should) optimize for is key rotation, but as that only
> changes the master key and not the actual encryption keys, the overhead is
> pretty low.
>
> We can of course support "forced" re-encryption, but I think it's
> acceptable if that's fairly expensive as long as it can be throttled and
> executed in the background (kinda similar to the patch to enable checksums
> in the background).

I'm not sure that we can ignore the risk of MDEK leakage. Once the MDEK
is leaked, for whatever reason, all that is left for an attacker is to
steal the data. A user who realizes that the MDEK has leaked will have
to re-encrypt the data. Even if the data has already been stolen, the
user will want to re-encrypt it to protect against further attacks. KEK
rotation is futile in this case.

>
> >And in terms of feature development we would implement
> >fine-granularity encryption in the future even if the first step is
> >cluster-wide encryption? And both TDEs encrypt the same kind of
> >database

Re: default_table_access_method is not in sample config file

2019-08-12 Thread Michael Paquier
On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> On 11/04/2019 19:49, Andres Freund wrote:
>> Hm, I think we should rather add it to sample. That's an oversight, not
>> intentional.
> 
> I just noticed that this is still an issue. default_table_access_method is
> not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> I'll add this to the open items list so we don't forget.

I think that we should give it the same visibility as default_tablespace,
so adding it to the sample file sounds good to me.
--
Michael
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 65a6da18b3..39fc787851 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -622,6 +622,7 @@
 #default_tablespace = ''		# a tablespace name, '' uses the default
 #temp_tablespaces = ''			# a list of tablespace names, '' uses
 	# only default tablespace
+#default_table_access_method = 'heap'
 #check_function_bodies = on
 #default_transaction_isolation = 'read committed'
 #default_transaction_read_only = off
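
For anyone not yet familiar with the new GUC, here is a quick illustration of
how it behaves once visible (illustrative SQL only, not part of the patch; the
table names are made up):

    SHOW default_table_access_method;                  -- 'heap' unless changed
    SET default_table_access_method = 'heap';          -- can also be set per session
    CREATE TABLE am_demo_default (id int);             -- picks up the default AM
    CREATE TABLE am_demo_explicit (id int) USING heap; -- explicit per-table choice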


signature.asc
Description: PGP signature


Re: POC: Cleaning up orphaned files using undo logs

2019-08-12 Thread Dilip Kumar
On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro  wrote:
>

> Hi Dilip,
>
> > commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e
> > Author: Dilip Kumar 
> > Date:   Thu May 2 11:28:13 2019 +0530
> >
> >Provide interfaces to store and fetch undo records.
>
> +#include "commands/tablecmds.h"
> +#include "storage/block.h"
> +#include "storage/buf.h"
> +#include "storage/buf_internals.h"
> +#include "storage/bufmgr.h"
> +#include "miscadmin.h"
>
> "miscadmin.h" comes before "storage...".
Right, fixed.
>
> +/*
> + * Compute the size of the partial record on the undo page.
> + *
> + * Compute the complete record size by uur_info and variable field length
> + * stored in the page header and then subtract the offset of the record so that
> + * we can get the exact size of partial record in this page.
> + */
> +static inline Size
> +UndoPagePartialRecSize(UndoPageHeader phdr)
> +{
> +Sizesize;
>
> We decided to use size_t everywhere in new code (except perhaps
> functions conforming to function pointer types that historically use
> Size in their type).
>
> +/*
> + * Compute the header size from undo record uur_info, stored in the page
> + * header.
> + */
> +size = UndoRecordHeaderSize(phdr->uur_info);
> +
> +/*
> + * Add length of the variable part and undo length. Now, we know the
> + * complete length of the undo record.
> + */
> +size += phdr->tuple_len + phdr->payload_len + sizeof(uint16);
> +
> +/*
> + * Subtract the size which is stored in the previous page to get the
> + * partial record size stored in this page.
> + */
> +size -= phdr->record_offset;
> +
> +return size;
>
> This is probably a stupid question but why isn't it enough to just
> store the offset of the first record that begins on this page, or 0
> for none yet?  Why do we need to worry about the partial record's
> payload etc?
Right, as this patch stands it would be enough to just store the offset
where the first complete record starts.  But for the undo page
consistency checker we need to mask the CID field in the partial record
as well.  So we need to know how many bytes of the partial record are
already written in the previous page (phdr->record_offset), which fields
are present in the partial record (uur_info), and the variable part, in
order to compute the next record offset.  Currently, I have improved it
by storing the complete record length instead of the payload and tuple
lengths, but we can improve this further by storing the next record
offset directly, which will avoid some computation.  I haven't worked on
the undo consistency patch much in this version, so I will analyze this
further in the next version.

>
> +UndoRecPtr
> +PrepareUndoInsert(UndoRecordInsertContext *context,
> +  UnpackedUndoRecord *urec,
> +  Oid dbid)
> +{
> ...
> +/* Fetch compression info for the transaction. */
> +compression_info = GetTopTransactionUndoCompressionInfo(category);
>
> How can this work correctly in recovery?  [Edit: it doesn't, as you
> just pointed out]
>
> I had started reviewing an older version of your patch (the version
> that had made it as far as the undoprocessing branch as of recently),
> before I had the bright idea to look for a newer version.  I was going
> to object to the global variable you had there in the earlier version.
> It seems to me that you have to be able to reproduce the exact same
> compression in recovery that you produced as "do" time, no?  How can
> TopTranasctionStateData be the right place for this in recovery?
>
> One data structure that could perhaps hold this would be
> UndoLogTableEntry (the per-backend cache, indexed by undo log number,
> with pretty fast lookups; used for things like
> UndoLogNumberGetCategory()).  As long as you never want to have
> inter-transaction compression, that should have the right scope to
> give recovery per-undo log tracking.  If you ever wanted to do
> compression between transactions too, maybe UndoLogSlot could work,
> but that'd have more complications.
Currently, I read it from the first record on the page.
>
> +/*
> + * Read undo records of the transaction in bulk
> + *
> + * Read undo records between from_urecptr and to_urecptr until we exhaust the
> + * memory size specified by undo_apply_size.  If we could not read all the
> + * records till to_urecptr then the caller should consume current set of
> + * records and call this function again.
> + *
> + * from_urecptr - Where to start fetching the undo records.  If we can not
> + *  read all the records because of memory limit then this will be set to
> + *  the previous undo record pointer from where we need to start fetching
> + *  on next call.  Otherwise it will be set to InvalidUndoRecPtr.
> + * to_urecptr - Last undo record pointer to be fetched.
> + * undo_apply_size - Memory segment limit to collect und

Re: [HACKERS] CLUSTER command progress monitor

2019-08-12 Thread Tatsuro Yamada

Hi Alvaro!

On 2019/08/02 3:43, Alvaro Herrera wrote:

Hmm, I'm trying this out now and I don't see the index_rebuild_count
ever go up.  I think it's because the indexes are built using parallel
index build ... or maybe it was the table AM changes that moved things
around, not sure.  There's a period at the end when the CLUSTER command
keeps working, but it's gone from pg_stat_progress_cluster.



Thanks for your report.
I'll investigate it. :)


Thanks,
Tatsuro Yamada





Re: Global temporary tables

2019-08-12 Thread Craig Ringer
On Tue, 13 Aug 2019 at 00:47, Pavel Stehule  wrote:


> But Postgres is not storing this information now anywhere else except
>> statistic, isn't it?
>>
>
> not only - critical numbers are reltuples, relpages from pg_class
>

That's a very good point. relallvisible too. How's the global temp table
impl handling that right now, since you won't be changing the pg_class row?
AFAICS relpages doesn't need to be up to date (and reltuples certainly
doesn't) so presumably you're just leaving them as zero?
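
For reference, the planner inputs in question can be checked straight from
pg_class; a quick illustrative query (the table name here is made up):

    SELECT relname, reltuples, relpages, relallvisible
    FROM pg_class
    WHERE relname = 'my_global_temp_table';
    -- For a never-analyzed table these stay at zero until VACUUM/ANALYZE
    -- (or whatever the global temp table machinery does instead) updates them.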

What happens right now if you ANALYZE or VACUUM ANALYZE a global temp
table? Is it just disallowed?

I'll need to check, but I wonder if periodically updating those fields in
pg_class impacts logical decoding too. Logical decoding must treat
transactions with catalog changes as special cases where it creates custom
snapshots and does other expensive additional work.
(See ReorderBufferXidSetCatalogChanges in reorderbuffer.c and its
callsites). We don't actually need to know relpages and reltuples during
logical decoding. It can probably ignore relfrozenxid and relminmxid
changes too, maybe even pg_statistic changes though I'd be less confident
about that one.

At some point I need to patch in a bunch of extra tracepoints and do some
perf tracing to see how often we do potentially unnecessary snapshot
related work in logical decoding.


> There was proposal to cache relation size,  but it is not implemented yet.
>> If such cache exists, then we can use it to store local information about
>> global temporary tables.
> >> So if 99% of users do not perform analyze for temporary tables, then they
> >> will not be faced with this problem, right?
>>
>
> they use default statistics based on relpages. But for 1% of applications
> statistics are critical - almost always for OLAP applications.
>

Agreed. It's actually quite a common solution to user problem reports /
support queries about temp table performance: "Run ANALYZE. Consider
creating indexes too."
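
Spelled out, that advice amounts to something like this (object names are
made up, purely illustrative):

    CREATE TEMP TABLE work_items AS
        SELECT * FROM orders WHERE status = 'open';  -- hypothetical source table
    CREATE INDEX ON work_items (order_id);           -- hypothetical column
    ANALYZE work_items;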

Which reminds me - if global temp tables do get added, it'll further
increase the desirability of exposing a feature to let users
disable+invalidate and then later reindex+enable indexes without icky
catalog hacking. So they can disable indexes for their local copy, load
data, re-enable indexes. That'd be "interesting" to implement for global
temp tables given that index state is part of the pg_index row associated
with an index rel though.


1. hold these data only in memory in special buffers
>>
>>
I don't see that working well for pg_statistic or anything else that holds
nontrivial user data though.

> 2. hold these data in global temporary tables - it is similar to normal
>> tables - we can use global temp tables for metadata like classic persistent
>> tables are used for metadata of classic persistent tables. Next syscache
>> can be enhanced to work with union of two system tables.
>>
>>
Very meta. Syscache and relcache are extremely performance critical but
could probably skip scans entirely on backends that haven't used any global
temp tables.

I don't know the relevant caches well enough to have a useful opinion here.

> >> I think that it is not possible to assume that temporary data will always
> >> fit in memory.
>> So 1) seems to be not acceptable solution.
>>
>
It'd only be the metadata, but if it includes things like column histograms
and most frequent value data that'd still be undesirable to have pinned in
backend memory.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: progress report for ANALYZE

2019-08-12 Thread Tatsuro Yamada

Hi Robert and All!


On 2019/08/02 2:48, Robert Haas wrote:
> On Thu, Aug 1, 2019 at 4:45 AM Thomas Munro  wrote:

On Tue, Jul 23, 2019 at 4:51 PM Tatsuro Yamada
 wrote:

Attached v4 patch file only includes this fix.


I've moved this to the September CF, where it is in "Needs review" state.


/me reviews.



Thanks for your comments! :)



+  scanning_table

I think this should be retitled to something that ends in 'relid',
like all of the corresponding cases in existing progress views.
Perhaps 'active_relid' or 'current_relid'.



Fixed.
I changed "scanning_table" to "current_relid" for analyze in monitoring.sgml.
However, I didn't change "relid" in other places for other commands because
I'd like to create a separate patch for that later. :)


> +   The command is computing extended stats from the samples obtained in the previous phase.

I think you should change this (and the previous one) to say "from the
samples obtained during the table scan."



Fixed.



+   Total number of heap blocks in the scanning_table.

Perhaps I'm confused, but it looks to me like what you are advertising
is the number of blocks that will be sampled, not the total number of
blocks in the table.  I think that's the right thing to advertise, but
then the column should be named and documented that way.



Ah, you are right. Fixed.
I used the following sentence based on Vinayak's patch created two years ago.

- heap_blks_total
- bigint
- 
-   Total number of heap blocks in the current_relid.
- 

+ sample_blks_total
+ bigint
+ 
+   Total number of heap blocks that will be sampled.
+




+ {
+ const int   index[] = {
+ PROGRESS_ANALYZE_TOTAL_BLOCKS,
+ PROGRESS_ANALYZE_SCANREL
+ };
+ const int64 val[] = {
+ nblocks,
+ RelationGetRelid(onerel)
+ };
+
+ pgstat_progress_update_multi_param(2, index, val);
+ }

This block seems to be introduced just so you can declare variables; I
don't think that's good style. It's arguably unnecessary because we
now are selectively allowing variable declarations within functions,
but I think you should just move the first array to the top of the
function and the second declaration to the top of the function
dropping const, and then just do val[0] = nblocks and val[1] =
RelationGetRelid(onerel).  Maybe you can also come up with better
names than 'index' and 'val'.  Same comment applies to another place
where you have something similar.



I agreed and fixed.
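
For reference, a minimal sketch of the shape being suggested (illustrative
only, with made-up names; not the actual patch):

    /* At the top of the function, with the other declarations: */
    const int   progress_index[] = {
        PROGRESS_ANALYZE_TOTAL_BLOCKS,
        PROGRESS_ANALYZE_SCANREL
    };
    int64       progress_vals[2];

    /* Later, once the sampling block count is known: */
    progress_vals[0] = nblocks;
    progress_vals[1] = RelationGetRelid(onerel);
    pgstat_progress_update_multi_param(2, progress_index, progress_vals);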



Patch seems to need minor rebasing.

Maybe "scanning table" should be renamed "acquiring sample rows," to
match the names used in the code?



I fixed it as follows:

s/PROGRESS_ANALYZE_PHASE_SCAN_TABLE/
  PROGRESS_ANALYZE_PHASE_ACQUIRING_SAMPLE_ROWS/

s/WHEN 1 THEN 'scanning table'::text/
  WHEN 1 THEN 'acquiring sample rows'::text/



I'm not a fan of the way you set the scan-table phase and inh flag in
one place, and then slightly later set the relation OID and block
count.  That creates a race during which users could see the first bit
of data set and the second not set.  I don't see any reason not to set
all four fields together.



Hmm... I understand, but it's a little difficult because if there are
child rels, acquire_inherited_sample_rows() calls acquire_sample_rows()
(see below). So, it would be possible to set all four fields together if
the inh flag were given as a parameter to those functions, I suppose.
But I'm not sure whether it's okay to add the parameter to both
functions or not. Do you have any ideas? :)


# do_analyze_rel()
...
if (inh)
numrows = acquire_inherited_sample_rows(onerel, elevel,
rows, targrows,
&totalrows, &totaldeadrows);
else
numrows = (*acquirefunc) (onerel, elevel,
  rows, targrows,
  &totalrows, &totaldeadrows);


# acquire_inherited_sample_rows()
...
foreach(lc, tableOIDs)
{
...
   /* Check table type (MATVIEW can't happen, but might as well allow) */
if (childrel->rd_rel->relkind == RELKIND_RELATION ||
childrel->rd_rel->relkind == RELKIND_MATVIEW)
{
/* Regular table, so use the regular row acquisition function */
acquirefunc = acquire_sample_rows;
...
/* OK, we'll process this child */
has_child = true;
rels[nrels] = childrel;
acquirefuncs[nrels] = acquirefunc;
...
}
...
for (i = 0; i < nrels; i++)
{
...
AcquireSampleRowsFunc acquirefunc = acquirefuncs[i];
...
if (childtargrows > 0)
{
...
/* Fetch a random sample of the child's rows */
childrows = (*acquirefunc) (childrel, elevel,
rows + numrows, childtargrows,
&trows, &tdrows)



Please be sure to make the names of the constants you use match up
with the e

Re: POC: Cleaning up orphaned files using undo logs

2019-08-12 Thread Dilip Kumar
On Wed, Jul 24, 2019 at 11:28 AM Rushabh Lathia
 wrote:
>
> Hi,
>
> I have stated review of
> 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch, here are few
> quick comments.
>
> 1) README.undointerface should provide more information like API details or
> the sequence in which API should get called.
I have improved the README by describing more user-specific details
based on Robert's suggestions offlist.  I think it needs further
improvement to describe the order in which the APIs should be called.
Unfortunately, that is not yet included in this patch set.
>
> 2) Information about the APIs in the undoaccess.c file header block would
> be good.  For reference please look at heapam.c.
Done
>
> 3) typo
>
> + * Later, during insert phase we will write actual records into thse buffers.
> + */
>
> %s/thse/these
Fixed
>
> 4) UndoRecordUpdateTransInfo() comments says that this must be called under
> the critical section, but seems like undo_xlog_apply_progress() do call it
> outside of critical section?  Is there exception, then should add comments?
> or Am I missing anything?
During recovery there is an exception, but we can add a comment for that.
I think I missed this in the latest patch; I will keep a note of it
and will do this in the next version.

>
>
> 5) In function UndoBlockGetFirstUndoRecord() below code:
>
> /* Calculate the size of the partial record. */
> partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
>phdr->tuple_len + phdr->payload_len -
>phdr->record_offset;
>
> can directly use UndoPagePartialRecSize().

This function is part of another patch in the undoprocessing patch set.
>
> 6)
>
> +static int
> +UndoGetBufferSlot(UndoRecordInsertContext *context,
> +  RelFileNode rnode,
> +  BlockNumber blk,
> +  ReadBufferMode rbm)
> +{
> +inti;
>
> In the above code variable "i" is mean "block index".  It would be good
> to give some valuable name to the variable, maybe "blockIndex" ?
>
Fixed
> 7)
>
> * We will also keep a previous undo record pointer to the first and last undo
>  * record of the transaction in the previous log.  The last undo record
>  * location is used find the previous undo record pointer during rollback.
>
>
> %s/used fine/used to find
Fixed
>
> 8)
>
> /*
>  * Defines the number of times we try to wait for rollback hash table to get
>  * initialized.  After these many attempts it will return error and the user
>  * can retry the operation.
>  */
> #define ROLLBACK_HT_INIT_WAIT_TRY  60
>
> %s/error/an error
This is part of a different patch in the undoprocessing patch set.
>
> 9)
>
>  * we can get the exact size of partial record in this page.
>  */
>
> %s/of partial/of the partial"
This comment is removed in the latest code
>
> 10)
>
> * urecptr - current transaction's undo record pointer which need to be set in
> * the previous transaction's header.
>
> %s/need/needs
Done
>
> 11)
>
> /*
>  * If we are writing first undo record for the page the we can set the
>  * compression so that subsequent records from the same transaction can
>  * avoid including common information in the undo records.
>  */
>
>
> %s/the page the/the page then
>
> 12)
>
> /*
>  * If the transaction's undo records are split across the undo logs.  So
>  * we need to  update our own transaction header in the previous log.
>  */
>
> double space between "to" and "update"
Fixed
>
> 13)
>
> * The undo record should be freed by the caller by calling ReleaseUndoRecord.
>  * This function will old the pin on the buffer where we read the previous 
> undo
>  * record so that when this function is called repeatedly with the same 
> context
>
> %s/old/hold
Fixed
>
> I will continue further review for the same patch.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Problem with default partition pruning

2019-08-12 Thread Amit Langote
Hi Alvaro,

On Tue, Aug 13, 2019 at 2:45 AM Alvaro Herrera  wrote:
> v3-0001 still seems to leave things a bit duplicative.  I think we can
> make it better if we move the logic to set RelOptInfo->partition_qual to
> a separate routine (set_baserel_partition_constraint mirroring the
> existing set_baserel_partition_key_exprs), and then call that from both
> places that need access to partition_qual.
>
> So I propose that the attached v4 patch should be the final form of this
> (also rebased across today's list_concat API change).  I verified that
> constraint exclusion is not being called by partprune unless a default
> partition exists (thanks errbacktrace()); I think that should appease
> Simon's performance concern for the most common case of default
> partition not existing.
>
> I think I was not really understanding the comments being added by
> Amit's v3, so I reworded them.  I hope I understood the intent of the
> code correctly.

Thanks a lot for revising.  Looks neat, except:

+ * This is a measure of last resort only to be used because the default
+ * partition cannot be pruned using the steps; regular pruning, which is
+ * cheaper, is sufficient when no default partition exists.

This text appears to imply that the default can *never* be pruned with
steps.  Maybe, the first sentence should read something like: "...the
default cannot be pruned using the steps generated from clauses that
contradict the parent's partition constraint".

> I'm not comfortable with RelOptInfo->partition_qual.  But I'd rather
> leave that for another time.

Sure.

Regards,
Amit




Re: Fix typos and inconsistencies for HEAD (take 10)

2019-08-12 Thread Michael Paquier
On Sun, Aug 11, 2019 at 11:00:00AM +0300, Alexander Lakhin wrote:
> 10.44. serendipitiously -> serendipitously

I didn't know that this even was a word:
https://www.thefreedictionary.com/serendipitously
But it seems to come from Horace Walpole.

> 10.50. sigsetmask -> pgsigsetmask

It should be pqsigsetmask.

> 10.81. SUE -> ShareUpdateExclusive

Your new comment had a typo here.

Please note that I have discarded the portion about the isolation
tests for now.  I need to have a second look at that part, and there
were already a lot of changes to review.
--
Michael


signature.asc
Description: PGP signature


Re: POC: Cleaning up orphaned files using undo logs

2019-08-12 Thread Dilip Kumar
On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila  wrote:
>
> On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar  wrote:
> >
> > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas  wrote:
> > >
> > > I don't like the fact that undoaccess.c has a new global,
> > > undo_compression_info.  I haven't read the code thoroughly, but do we
> > > really need that?  I think it's never modified (so it could just be
> > > declared const),
> >
> > Actually, this will get modified otherwise across undo record
> > insertion how we will know what was the values of the common fields in
> > the first record of the page.  Another option could be that every time
> > we insert the record, read the value from the first complete undo
> > record on the page but that will be costly because for every new
> > insertion we need to read the first undo record of the page.
> >
>
> This information won't be shared across transactions, so can't we keep
> it in top transaction's state?   It seems to me that will be better
> than to maintain it as a global state.

As replied separately, during recovery we would not have the
transaction state, so I have decided to read it from the first record
on the page; please check the latest patch.
>
> Few more comments on this patch:
> 1.
> PrepareUndoInsert()
> {
> ..
> + if (logswitched)
> + {
> ..
> + }
> + else
> + {
> ..
> + resize = true;
> ..
> + }
> +
> ..
> +
> + do
> + {
> + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, rbm);
> ..
> + rbm = RBM_ZERO;
> + cur_blk++;
> + } while (cur_size < size);
> +
> + /*
> + * Set/overwrite compression info if required and also exclude the common
> + * fields from the undo record if possible.
> + */
> + if (UndoSetCommonInfo(compression_info, urec, urecptr,
> +   context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf))
> + resize = true;
> +
> + if (resize)
> + size = UndoRecordExpectedSize(urec);
>
> I see that in some cases where resize is possible are checked before
> buffer allocation and some are after.  Isn't it better to do all these
> checks before buffer allocation?  Also, isn't it better to even
> compute changed size before buffer allocation as that might sometimes
> help in lesser buffer allocations?

Right, fixed.
>
> Can you find a better way to write
> :context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)?
>  It makes the line too long and difficult to understand.  Check for
> similar instances in the patch and if possible, change them as well.
This code is gone.  While replying I realised that I haven't scanned
the complete code for such occurrences.  I will work on that in the
next version.
>
> 2.
> +InsertPreparedUndo(UndoRecordInsertContext *context)
> {
> ..
> /*
> + * Try to insert the record into the current page. If it
> + * doesn't succeed then recall the routine with the next page.
> + */
> + InsertUndoData(&ucontext, page, starting_byte);
> + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> + {
> + MarkBufferDirty(buffer);
> + break;
> + }
> + MarkBufferDirty(buffer);
> ..
> }
>
> Can't we call MarkBufferDirty(buffer) just before 'if' check?  That
> will avoid calling it twice.

Done
>
> 3.
> + * Later, during insert phase we will write actual records into thse buffers.
> + */
> +struct PreparedUndoBuffer
>
> /thse/these
Done
>
> 4.
> + /*
> + * If we are writing first undo record for the page the we can set the
> + * compression so that subsequent records from the same transaction can
> + * avoid including common information in the undo records.
> + */
> + if (first_complete_undo)
>
> /page the we/page then we
This code is gone
>
> 5.
> PrepareUndoInsert()
> {
> ..
> After
> + * allocation We'll only advance by as many bytes as we turn out to need.
> + */
> + UndoRecordSetInfo(urec);
>
> Change the beginning of comment as: "After allocation, we'll .."

Done
>
> 6.
> PrepareUndoInsert()
> {
> ..
> * TODO:  instead of storing this in the transaction header we can
> + * have separate undo log switch header and store it there.
> + */
> + prevlogurp =
> + MakeUndoRecPtr(UndoRecPtrGetLogNo(prevlog_insert_urp),
> +(UndoRecPtrGetOffset(prevlog_insert_urp) - prevlen));
> +
>
> I don't think this TODO is valid anymore because now the patch has a
> separate log-switch header.

Yup.  Anyway, the log-switch design has changed now.
>
> 7.
> /*
> + * If undo log is switched then set the logswitch flag and also reset the
> + * compression info because we can use same compression info for the new
> + * undo log.
> + */
> + if (UndoRecPtrIsValid(prevlog_xact_start))
>
> /can/can't
Right.  But the compression code has changed now, so this comment no longer exists.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-12 Thread Dilip Kumar
On Thu, Jul 18, 2019 at 4:58 PM Amit Kapila  wrote:
>
> On Tue, Jul 16, 2019 at 2:20 PM Dilip Kumar  wrote:
> >
>
> Few comments on the new patch:
>
> 1.
> Additionally,
> +there is a mechanism for multi-insert, wherein multiple records are prepared
> +and inserted at a time.
>
> Which mechanism are you talking about here?  By any chance is this
> related to some old code?

The current code also has an option to prepare multiple records and
insert them at once.  I have enhanced the comments to make this
clearer.
>
> 2.
> +Fetching and undo record
> +
> +To fetch an undo record, a caller must provide a valid undo record pointer.
> +Optionally, the caller can provide a callback function with the information of
> +the block and offset, which will help in faster retrieval of undo record,
> +otherwise, it has to traverse the undo-chain.
>
> I think this is out-dated information.  You seem to forget updating
> README after latest changes in API.
Right, fixed.
>
> 3.
> + * The cid/xid/reloid/rmid information will be added in the undo record 
> header
> + * in the following cases:
> + * a) The first undo record of the transaction.
> + * b) First undo record of the page.
> + * c) All subsequent record for the transaction which is not the first
> + *   transaction on the page.
> + * Except above cases,  If the rmid/reloid/xid/cid is same in the subsequent
> + * records this information will not be stored in the record, these 
> information
> + * will be retrieved from the first undo record of that page.
> + * If any of the member rmid/reloid/xid/cid has changed, the changed
> information
> + * will be stored in the undo record and the remaining information will be
> + * retrieved from the first complete undo record of the page
> + */
> +UndoCompressionInfo undo_compression_info[UndoLogCategories];
>
> a. Do we want to compress fork_number also?  It is an optional field
> and is only included when the undo record is not for MAIN_FORKNUM.  For
> zheap, this means it will never be included, but in future, it could
> be included for some other AM or some other use case.  So, not sure if
> there is any benefit in compressing the same.
Yeah, as of now I haven't compressed the fork number.
>
> b. cid/xid/reloid/rmid - I think it is better to write it as rmid,
> reloid, xid, cid in the same order as you declare them in
> UndoPackStage.
>
> c. Some minor corrections. /Except above/Except for above/; /, If
> the/, if the/;  /is same/is the same/; /record, these
> information/record rather this information/
>
> d. I think there is no need to start the line "If any of the..." from
> a new line, it can be continued where the previous line ends.  Also,
> at the end of that line, add a full stop.

These comments are removed in the new patch.
>
> 4.
> /*
> + * Copy the compression global compression info to our context before
> + * starting prepare because this value might get updated multiple time in
> + * case of multi-prepare but the global value should be updated only after
> + * we have successfully inserted the undo record.
> + */
>
> In the above comment, the first 'compression' is not required. /time/times/
These comments are changed now, as the design has changed.
>
> 5.
> +/*
> + * The below common information will be stored in the first undo
> record of the page.
> + * Every subsequent undo record will not store this information, if
> required this information
> + * will be retrieved from the first undo record of the page.
> + */
> +typedef struct UndoCompressionInfo
>
> The line length in the above comments exceeds the 80-char limit.  You
> might want to run pgindent to avoid such problems.

Fixed.
>
> 6.
> +/*
> + * Exclude the common info in undo record flag and also set the compression
> + * info in the context.
> + *
>
> 'flag' seems to be a redundant word here?
Obsolete comment as per new changes
>
> 7.
> +UndoSetCommonInfo(UndoCompressionInfo *compressioninfo,
> +   UnpackedUndoRecord *urec, UndoRecPtr urp,
> +   Buffer buffer)
> +{
> +
> + /*
> + * If we have valid compression info and the for the same transaction and
> + * the current undo record is on the same block as the last undo record
> + * then exclude the common information which are same as first complete
> + * record on the page.
> + */
> + if (compressioninfo->valid &&
> + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) &&
> + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp))
>
> Here the comment is just a verbal form of the if-check.  How about writing
> it as: "Exclude the common information from the record which is same
> as the first record on the page."

Tried to improve this in the new code.
>
> 8.
> UndoSetCommonInfo()
> {
> ..
> if (compressioninfo->valid &&
> + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) &&
> + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp))
> + {
> + urec->uur_info &= ~UREC_INFO_XID;
> +
> + /* Don't include rmid if it's same. */
> + if (urec->uur_rmid == compressioninfo->rmid)
>

small improvement of the elapsed time for truncating heap in vacuum

2019-08-12 Thread Kasahara Tatsuhito
Hi,

I got the following log messages when I measured the heap truncation
duration in a vacuum.

=
INFO:  "dst": suspending truncate due to conflicting lock request
INFO:  "dst": truncated 550073 to 101472 pages
DETAIL:  CPU: user: 0.35 s, system: 4.92 s, elapsed: 6.96 s
INFO:  "dst": truncated 101472 to 164 pages
DETAIL:  CPU: user: 0.35 s, system: 11.02 s, elapsed: 13.46 s
=

The messages above show that postgres detected an access to the table
during heap truncation, so it suspended the truncation and then resumed
it after the access finished. The messages themselves are fine.
But the "usage" and "elapsed (time)" values are a bit confusing.
The total truncation duration was about 13.5s, but the log says 6.96s
(before the suspend) + 13.46s (the remainder).
# I confirmed the total truncation duration by elog debugging.

In lazy_truncate_heap(), pg_rusage_init() is only called once, at the
start of truncation.
So the log line for the last truncation phase shows the usage and
elapsed time of the whole truncation phase.
The attached patch makes pg_rusage_init() be called again after each
ereport() of heap truncation, so the log messages will change as
follows.

=
INFO:  "dst": suspending truncate due to conflicting lock request
INFO:  "dst": truncated 550073 to 108288 pages
DETAIL:  CPU: user: 0.20 s, system: 4.88 s, elapsed: 7.41 s
INFO:  "dst": truncated 108288 to 164 pages
DETAIL:  CPU: user: 0.00 s, system: 7.36 s, elapsed: 7.92 s
=
(Total truncating time was about 15.3s in above case)
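
In code terms the change amounts to something like the sketch below (a
simplified illustration of the idea, not the attached patch verbatim; the
real loop in lazy_truncate_heap() does more than this):

    PGRUsage    ru0;

    pg_rusage_init(&ru0);
    do
    {
        /* ... count_nondeletable_pages() / RelationTruncate() ... */

        ereport(elevel,
                (errmsg("\"%s\": truncated %u to %u pages",
                        RelationGetRelationName(onerel),
                        old_rel_pages, new_rel_pages),
                 errdetail_internal("%s", pg_rusage_show(&ru0))));

        /* Reset here so the next report covers only the next round. */
        pg_rusage_init(&ru0);

        old_rel_pages = new_rel_pages;
    } while (new_rel_pages > vacrelstats->nonempty_pages &&
             vacrelstats->lock_waiter_detected);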

Any thoughts ?
Best regards,

-- 
Tatsuhito Kasahara
NTT Open Source Software Center


reset_usage.patch
Description: Binary data


Re: SegFault on 9.6.14

2019-08-12 Thread Amit Kapila
On Tue, Aug 13, 2019 at 3:18 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > Being able to do that sort of thing was one of my goals in designing
> > the ExecShutdownNode stuff.  Unfortunately, it's clear from this bug
> > report that it's still a few bricks short of a load, and Tom doesn't
> > seem real optimistic about how easy it will be to buy those bricks at
> > discount prices. But I hope we persist in trying to get there, because
> > I don't like the idea of saying that we'll never be smart enough to
> > know we're done with any part of the plan until we're definitely done
> > with the whole thing. I think that's leaving too much money on the
> > table.
>
> To clarify my position --- I think it's definitely possible to improve
> the situation a great deal.  We "just" have to pass down more information
> about whether rescans are possible.
>

Right, you have speculated above that it is possible by adding some
eflag bits.  Can you please describe that idea in a bit more detail, so
that somebody else can try to write a patch?  I think if someone other
than you tries to write a patch without some sort of upfront
design, it might lead to a lot of re-work.  It would be great if you
have an interest in doing the leg work which can then be extended to
fix the issue in the parallel query, but if not at least let us know
the idea you have in mind in a bit more detail.

>  What I don't believe is that that
> leads to a bug fix that would be sane to back-patch as far as 9.6.
>

Fair enough.  In such a situation, we can plan to revert the earlier
fix for Limit node and tell people that the same will be fixed in
PG-13.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Small patch to fix build on Windows

2019-08-12 Thread Michael Paquier
On Fri, Aug 09, 2019 at 11:21:52AM +0300, Dmitry Igrishin wrote:
> Personally I don't care. I used || notation only in order to be
> consistent, since this notation is already used in Solution.pm. If
> this consistency is not required let me provide a patch with {}
> notation. What do you think?

We are talking about one place in src/tools/msvc/ using qq on HEAD.
So one or the other is fine by me as long as we remain in the
acceptable ASCII range.
--
Michael


signature.asc
Description: PGP signature


Re: Regression test failure in regression test temp.sql

2019-08-12 Thread Michael Paquier
On Tue, Aug 13, 2019 at 02:51:03PM +1200, Thomas Munro wrote:
> Here's another one that seems to fit that pattern.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2019-08-11%2007%3A33%3A39

Indeed.  Good catch!  Perhaps you would like to fix it?  There are two
queries in need of an ORDER BY, and the second query even uses two
semicolons (spoiler warning: that's a nit).
--
Michael


signature.asc
Description: PGP signature


Re: pg_stat_replication lag fields return non-NULL values even with NULL LSNs

2019-08-12 Thread Thomas Munro
On Tue, Aug 13, 2019 at 2:20 PM Michael Paquier  wrote:
> On Tue, Aug 13, 2019 at 11:15:42AM +1200, Thomas Munro wrote:
> > One thing I noticed in passing is that you always get the same times
> > in the write_lag and flush_lag columns, in --synchronous mode, and the
> > times updates infrequently.  That's not the case with regular
> > replicas; I suspect there is a difference in the time and frequency of
> > replies sent to the server, which I guess might make synchronous
> > commit a bit "lumpier", but I didn't dig further today.
>
> The messages are sent by pg_receivewal via sendFeedback() in
> receivelog.c.  It gets triggered for the --synchronous case once a
> flush is done (but you are not surprised by my reply here, right!),
> and most likely the matches you are seeing come from the messages sent
> at the beginning of HandleCopyStream() where the flush and write
> LSNs are equal.  This code behaves as I would expect based on your
> description and a read of the code I have just done to refresh my
> mind, but we may of course have some issues or potential
> improvements.

Right.  For a replica server we call XLogWalRcvSendReply() after
writing, and then again inside XLogWalRcvFlush().  So the primary gets
to measure write_lag and flush_lag separately.  If pg_receivewal just
sends one reply after flushing, then turning on --synchronous has the
effect of showing the flush lag in both write_lag and flush_lag
columns.

Of course those things aren't quite as independent as they should be
anyway, since the flush is blocking and therefore delays the next
write.  That's why Simon probably wants to move the
flush to the WAL writer process, and Andres probably wants to change
the whole thing to use some kind of async IO[1].

[1] https://lwn.net/Articles/789024/

-- 
Thomas Munro
https://enterprisedb.com




Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Michael Paquier
On Mon, Aug 12, 2019 at 07:05:08PM +0200, Peter Eisentraut wrote:
> In this context, I would prefer #2, but I would expand that to cover all
> authentication methods, not only password methods.

I tend to prefer #2 as well and that's the kind of approach we were
tending to agree on when we discussed this issue during the v11 beta
for the downgrade issues with libpq.  And as you say extend it so as
we can apply filtering of more AUTH_REQ requests, inclusing GSS and
krb5.
--
Michael


signature.asc
Description: PGP signature


Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Michael Paquier
On Fri, Aug 09, 2019 at 09:28:50AM -0400, Stephen Frost wrote:
> I don't really care for auth_protocol as that's pretty close to
> "auth_method" and that isn't what we're talking about here- this isn't
> the user picking the auth method, per se, but rather saying which of the
> password-based mechanisms for communicating that the user knows the
> password is acceptable.  Letting users choose which auth methods are
> allowed might also be interesting (as in- we are in a Kerberized
> environment and therefore no client should ever be using any auth method
> except GSS, could be a reasonable ask) but it's not the same thing.
>
> What restriction are you suggesting here wrt krb5..?

What I suggested in this previous set of emails is if it would make
sense to extend what libpq can restrict at authentication time to not
only be password-based authentication methods, but also if we could
have a connection parameter allowing us to say "please I want krb5/gss
and nothing else".  My point is that password-based authentication is
only one portion of the problem as what we are looking at is applying
a filtering on AUTH_REQ messages that libpq receives from the server
(SCRAM with and without channel binding is an exception as that's
handled as part of the SASL set of messages), but at a high level we
are going to need a filtering of the first authentication message
received anyway.

But that's also basically what you outline in this previous paragraph
of yours.
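
To make the shape of the proposal concrete, the kind of knob being discussed
would be used roughly like this (hypothetical: neither the final parameter
name nor its accepted values exist yet, they are exactly what this thread is
debating):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* "password_protocol" here is the proposed, not-yet-existing option. */
        PGconn *conn = PQconnectdb("dbname=app password_protocol=scram-sha-256");

        if (PQstatus(conn) != CONNECTION_OK)
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 0;
    }
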
--
Michael


signature.asc
Description: PGP signature


Re: Regression test failure in regression test temp.sql

2019-08-12 Thread Thomas Munro
On Tue, Aug 13, 2019 at 1:58 PM Michael Paquier  wrote:
> On Sun, Aug 11, 2019 at 03:59:06PM -0400, Tom Lane wrote:
> > I hacked temp.sql to print a couple different plans (doing it that way,
> > rather than manually, just to ensure that I was getting plans matching
> > what would actually happen right there).  And what I see, as attached,
> > is that IOS and plain index and bitmap scans all have pretty much the
> > same total cost.  The planner then ought to prefer IOS or plain on the
> > secondary grounds of cheaper startup cost.  However, it's not so hard
> > to believe that it might switch to bitmap if something caused the cost
> > estimates to change by a few percent.  So probably we should write this
> > off as "something affected the plan choice" and just add the ORDER BY
> > as you suggest.
>
> That matches what I was seeing, except that I have done those tests
> manually.  Still my plans matched with yours.

Here's another one that seems to fit that pattern.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2019-08-11%2007%3A33%3A39

+++ /home/andres/build/buildfarm/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/collate.icu.utf8.out  2019-08-11 08:29:11.792695714 +
@@ -1622,15 +1622,15 @@
 SELECT typname FROM pg_type WHERE typname LIKE 'int_' AND typname <>
'INT2'::text COLLATE case_insensitive;
  typname
 -
- int4
  int8
+ int4
 (2 rows)
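
As with the temp.sql failure, one way to make this output order deterministic
would be an explicit ORDER BY (shown only to illustrate the kind of fix under
discussion, not necessarily what will be committed):

    SELECT typname FROM pg_type
    WHERE typname LIKE 'int_' AND typname <> 'INT2'::text COLLATE case_insensitive
    ORDER BY typname;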

-- 
Thomas Munro
https://enterprisedb.com




Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2019-08-12 Thread Michael Paquier
On Mon, Aug 12, 2019 at 05:32:08PM -0400, Alvaro Herrera wrote:
> So do we have an updated patch for this?  It's been a while since this
> patch saw any movement ...

Please note that this involves a couple of people in Japan, and this
week is the Obon vacation season for a lot of people.  So there could
be delays in replies.
--
Michael


signature.asc
Description: PGP signature


Re: pg_stat_replication lag fields return non-NULL values even with NULL LSNs

2019-08-12 Thread Michael Paquier
On Tue, Aug 13, 2019 at 11:15:42AM +1200, Thomas Munro wrote:
> Hmm.  It's working as designed, but indeed it's not very newsworthy
> information in this case.  If you run pg_receivewal --synchronous then
> you get sensible looking flush_lag times.  Without that, flush_lag
> only goes up, and of course replay_lag only goes up, so although it's
> telling the truth, I think your proposal makes sense.

Thanks!

> One question I had is what would happen with your patch without
> --synchronous, once it flushes a whole file and opens a new one; I
> wondered if your new boring-information-hiding behaviour would stop
> working after one segment file because of that.

Indeed.

> I tested that and the column remains NULL when we move to a new
> file, so that's good. 

Thanks for looking.

> One thing I noticed in passing is that you always get the same times
> in the write_lag and flush_lag columns, in --synchronous mode, and the
> times update infrequently.  That's not the case with regular
> replicas; I suspect there is a difference in the time and frequency of
> replies sent to the server, which I guess might make synchronous
> commit a bit "lumpier", but I didn't dig further today.

The messages are sent by pg_receivewal via sendFeedback() in
receivelog.c.  It gets triggered for the --synchronous case once a
flush is done (but you are not surprised by my reply here, right!),
and most likely the matches you are seeing some from the messages sent
at the beginning of HandleCopyStream() where the flush and write
LSNs are equal.  This code behaves as I would expect based on your
description and a read of the code I have just done to refresh my
mind, but we may of course have some issues or potential
improvements.
--
Michael


signature.asc
Description: PGP signature


Re: Regression test failure in regression test temp.sql

2019-08-12 Thread Michael Paquier
On Sun, Aug 11, 2019 at 03:59:06PM -0400, Tom Lane wrote:
> I hacked temp.sql to print a couple different plans (doing it that way,
> rather than manually, just to ensure that I was getting plans matching
> what would actually happen right there).  And what I see, as attached,
> is that IOS and plain index and bitmap scans all have pretty much the
> same total cost.  The planner then ought to prefer IOS or plain on the
> secondary grounds of cheaper startup cost.  However, it's not so hard
> to believe that it might switch to bitmap if something caused the cost
> estimates to change by a few percent.  So probably we should write this
> off as "something affected the plan choice" and just add the ORDER BY
> as you suggest.

That matches what I was seeing, except that I have done those tests
manually.  Still my plans matched with yours.

> Hmm, I wasn't thinking of changing anything more than this one query.
> I'm not sure that a wide-ranging patch is going to be worth the
> potential back-patching land mines it'd introduce.  However, if you
> want to do it anyway, please at least patch v12 as well --- that
> should still be a pretty painless back-patch, even if it's not so
> easy to go further.

Okay, I have gone with a minimal fix of only changing some of the
quals in temp.sql, as it could become a problem if other tests begin to
use relations whose names begin with "temp".  If it turns out that we
have other problems in this area later on, let's address them at that
time.

> BTW, most of the problem here seems to be that the SQL committee
> made an infelicitous choice of wildcard characters for LIKE.
> I wonder if it'd be saner to fix this by switching to regexes?

So that enforces the match at the start of the string.  This has the
merit of making the relation name cleaner to grab.  I have gone with
your suggestion, thanks for the advice!
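
For illustration, the difference being talked about (relation names here are
made up):

    -- With LIKE, '_' is itself a wildcard, so a literal underscore has to be
    -- escaped to avoid matching unintended relation names:
    SELECT relname FROM pg_class WHERE relname LIKE 'temp\_parted%';

    -- With a regex, the underscore is literal and '^' pins the match to the
    -- start of the name:
    SELECT relname FROM pg_class WHERE relname ~ '^temp_parted';
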
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-12 Thread Bruce Momjian
On Sat, Aug 10, 2019 at 08:06:17AM -0400, Bruce Momjian wrote:
> So, I just had an idea: if we use separate encryption keys for
> heap/index and for WAL --- we already know we will have an offline tool
> that can rotate the passphrase or encryption keys.  If we allow the
> encryption keys to be rotated independently, we can create a standby,
> and immediately rotate its heap/index encryption key.  We can then start
> streaming replication.  When we promote the standby to primary, we can
> then shut it down and rotate the WAL encryption key --- the new primary
> would then have no shared keys with the old primary.

To help move this forward, I created a new wiki TDE section titled "TODO
for Full-Cluster Encryption" and marked some unresolved items with
question marks:


https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption

I have also updated some of the other text to match conclusions we have
made.

I know some of the items are done, but if we have agreement on moving
forward, I can help with some of the missing code.  This looks doable
for PG 13 if we start soon.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: SegFault on 9.6.14

2019-08-12 Thread Robert Haas
On Mon, Aug 12, 2019 at 5:48 PM Tom Lane  wrote:
> To clarify my position --- I think it's definitely possible to improve
> the situation a great deal.  We "just" have to pass down more information
> about whether rescans are possible.  What I don't believe is that that
> leads to a bug fix that would be sane to back-patch as far as 9.6.

Sounds like a fair opinion.  I'm not sure how complicated the fix
would be so I don't know whether I agree with your opinion, but you
usually have a fairly good intuition for such things, so...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_stat_replication lag fields return non-NULL values even with NULL LSNs

2019-08-12 Thread Thomas Munro
On Wed, Jul 17, 2019 at 1:52 PM Michael Paquier  wrote:
> I got surprised by the following behavior from pg_stat_get_wal_senders
> when connecting for example pg_receivewal to a primary:
> =# select application_name, flush_lsn, replay_lsn, flush_lag,
> replay_lag from pg_stat_replication;
>  application_name | flush_lsn | replay_lsn |flush_lag|replay_lag
> --+---++-+-
>  receivewal   | null  | null   | 00:09:13.578185 | 00:09:13.578185
> (1 row)
>
> It makes little sense to me, as we are reporting a replay lag on a
> position which has never been reported yet, so it cannot actually be
> used as a comparison base for the lag.  Am I missing something or
> should we return NULL for those fields if we have no write, flush or
> apply LSNs like in the attached?

Hmm.  It's working as designed, but indeed it's not very newsworthy
information in this case.  If you run pg_receivewal --synchronous then
you get sensible looking flush_lag times.  Without that, flush_lag
only goes up, and of course replay_lag only goes up, so although it's
telling the truth, I think your proposal makes sense.

One question I had is what would happen with your patch without
--synchronous, once it flushes a whole file and opens a new one; I
wondered if your new boring-information-hiding behaviour would stop
working after one segment file because of that.  I tested that and the
column remains NULL when we move to a new file, so that's good.

One thing I noticed in passing is that you always get the same times
in the write_lag and flush_lag columns, in --synchronous mode, and the
times update infrequently.  That's not the case with regular
replicas; I suspect there is a difference in the time and frequency of
replies sent to the server, which I guess might make synchronous
commit a bit "lumpier", but I didn't dig further today.

-- 
Thomas Munro
https://enterprisedb.com




Re: Do not check unlogged indexes on standby

2019-08-12 Thread Peter Geoghegan
On Mon, Aug 12, 2019 at 2:58 AM Andrey Borodin  wrote:
> BTW I really want to enable rightlink-leftlink invariant validation on 
> standby..

That seems very hard. My hope was that bt_check_index() can detect the
same problem a different way. The bt_right_page_check_scankey()
cross-page check (which needs nothing more than an AccessShareLock)
will often detect such problems, because the page image itself will be
totally wrong in some way.

I'm guessing that you have direct experience with that *not* being
good enough, though. Can you share further details? I suppose that
bt_right_page_check_scankey() helps with transposed pages, but doesn't
help so much when you have WAL-level inconsistencies.

-- 
Peter Geoghegan




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-12 Thread Bruce Momjian
On Wed, Aug  7, 2019 at 08:56:18AM -0400, Sehrope Sarkuni wrote:
> On Mon, Aug 5, 2019 at 9:02 PM Bruce Momjian  wrote:
> I don't think we want to add a MAC at this point since the MAC for 8k
> pages seems unattainable.
> 
> Even without a per-page MAC, a MAC at some level for WAL has its own benefits
> such as perfect corruption detection. It could be per-record, per-N-records,
> per-checkpoint, or per-file. The current WAL file format already handles
> arbitrary gaps so there is significantly more flexibility in adding it vs
> pages. I'm not saying it should be a requirement but, unlike pages, I would not
> rule it out just yet as it may not be that complicated.

FYI, the WAL already has a CRC that detects corruption and
partially-written records (which are ignored and stop the reading of
WAL).

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Do not check unlogged indexes on standby

2019-08-12 Thread Peter Geoghegan
On Mon, Aug 12, 2019 at 2:58 AM Andrey Borodin  wrote:
> Currently, if we check indexes on standby we often get
>
> man-psbpshn0skhsxynd/xiva_xtable_testing_01 R # select 
> bt_index_check('xiva_loadtest.pk_uid');
> ERROR:  58P01: could not open file "base/16453/125407": No such file or 
> directory
>
> I think that we should print a warning and that's it. Amcheck should not give
> false positives.

I agree -- amcheck should just skip over unlogged tables during
recovery, since there is simply nothing to check.

I pushed your patch to all branches that have amcheck just now, so now
we skip over unlogged relations when in recovery, though I made some
revisions.

Your patch didn't handle temp tables/indexes that were created in the
first session correctly -- we must be careful about the distinction
between unlogged tables, and tables that don't require WAL logging
(the latter includes temp tables). Also, I thought that it was a good
idea to actively test for the presence of a main fork when we don't
skip (i.e. when the system isn't in recovery and the B-Tree index
isn't unlogged) -- we now give a clean report of corruption when that
happens, rather than letting an ambiguous "can't happen" error get
raised by low-level code. This might be possible with system catalog
corruption, for example. Finally, I thought that the WARNING was a bit
strong -- a NOTICE is more appropriate.

Thanks!

--
Peter Geoghegan




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-12 Thread Bruce Momjian
On Wed, Jul 31, 2019 at 09:43:00AM -0400, Sehrope Sarkuni wrote:
> On Wed, Jul 31, 2019 at 2:32 AM Masahiko Sawada  wrote:
> For WAL encryption,  before flushing WAL we encrypt whole 8k WAL page
> and then write only the encrypted data of the new WAL record using
> pg_pwrite() rather than write whole encrypted page. So each time we
> encrypt 8k WAL page we end up with encrypting different data with the
> same key+nonce but since we don't write to the disk other than space
> where we actually wrote WAL records it's not a problem. Is that right?
> 
> Ah, this is what I was referring to in my previous mail. I'm not familiar with
> how the writes happen yet (reading up...) but, yes, we would need to ensure
> that encrypted data is not written more than once (i.e. no writing of encrypt
> (zero) followed by writing of encrypt(non-zero) at the same spot).

No one replied to this comment, so I will state here that we never write
zeros to the WAL and go back and write something else --- the WAL is
append-only.  We might do that for heap/index pages, but they would get
a new LSN (and hence a new IV) when that happens.
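
(As a purely hypothetical illustration of why a new LSN means a new IV, one
way a 16-byte CTR counter block could be derived from the page LSN and block
number is sketched below.  The field layout is an assumption made up for this
example, not a design anyone has settled on in this thread.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical layout: bytes 0-7 = page LSN, bytes 8-11 = block number,
 * bytes 12-15 = per-page CTR block counter (left at zero here).  Any write
 * that stamps a new LSN on the page therefore selects a fresh IV.
 */
static void
build_page_iv(uint8_t iv[16], uint64_t page_lsn, uint32_t block_number)
{
    memset(iv, 0, 16);
    memcpy(iv, &page_lsn, sizeof(page_lsn));
    memcpy(iv + 8, &block_number, sizeof(block_number));
}

int
main(void)
{
    uint8_t     iv_old[16];
    uint8_t     iv_new[16];

    build_page_iv(iv_old, UINT64_C(0x16B3AD0), 42);
    build_page_iv(iv_new, UINT64_C(0x16B4FF8), 42);    /* same page, new LSN */

    printf("IVs differ: %s\n", memcmp(iv_old, iv_new, 16) != 0 ? "yes" : "no");
    return 0;
}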

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-12 Thread Bruce Momjian
On Tue, Jul 30, 2019 at 04:48:31PM -0400, Bruce Momjian wrote:
> I am not even clear if pg_upgrade preserving relfilenode is possible ---
> when we wrap the relfilenode counter, does it start at 1 or at the
> first-user-relation-oid?  If the former, it could conflict with oids
> assigned to new system tables in later major releases.  Tying the
> preservation of relations to two restrictions seems risky.

For the curious, when relfilenode wraps, it starts at
FirstNormalObjectId, because GetNewRelFileNode eventually calls
GetNewObjectId(), so the concern above is wrong, though this is not an
issue anymore.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: SegFault on 9.6.14

2019-08-12 Thread Tom Lane
Robert Haas  writes:
> Being able to do that sort of thing was one of my goals in designing
> the ExecShutdownNode stuff.  Unfortunately, it's clear from this bug
> report that it's still a few bricks short of a load, and Tom doesn't
> seem real optimistic about how easy it will be to buy those bricks at
> discount prices. But I hope we persist in trying to get there, because
> I don't like the idea of saying that we'll never be smart enough to
> know we're done with any part of the plan until we're definitely done
> with the whole thing. I think that's leaving too much money on the
> table.

To clarify my position --- I think it's definitely possible to improve
the situation a great deal.  We "just" have to pass down more information
about whether rescans are possible.  What I don't believe is that that
leads to a bug fix that would be sane to back-patch as far as 9.6.

regards, tom lane




Re: SegFault on 9.6.14

2019-08-12 Thread Thomas Munro
On Tue, Aug 13, 2019 at 7:07 AM Alvaro Herrera  wrote:
> On 2019-Aug-12, Thomas Munro wrote:
> > That's possibly relevant because it means we'd have a ParallelContext
> > or some new overarching object that has a lifetime that is longer than
> > the individual Gather nodes' processes and instrumentation data.  I'm
> > not saying we need to discuss any details of this other concern now,
> > I'm just wondering out loud if the whole problem in this thread goes
> > away automatically when we fix it.
>
> How likely is it that we would ever be able to release memory from a
> Sort (or, say, a hashjoin hash table) when it's done being read, but
> before completing the whole plan?  As I understand, right now we hold
> onto a lot of memory after such plans have been fully read, for no good
> reason other than executor being unaware of this.  This might not be
> directly related to the problem at hand, since it's not just parallel
> plans that are affected.

Right, AIUI we hold onto that memory because it's a nice optimisation
to be able to rescan the sorted data or reuse the hash table (single
batch, non-parallel hash joins only for now).  We have no
disincentive, because our memory model doesn't care about the total
peak memory usage (ie all nodes).  Some other RDBMSs do care about
that, and somehow consider the peak memory usage (that is, considering
early memory release) when comparing join orders.

However, I think it's independent of the DSM lifetime question,
because the main Parallel Context DSM segment is really small, it has
a small fixed header and then a small object per parallel-aware node,
and isn't used for holding the hash table for Parallel Hash and
probably wouldn't be used for a future hypothetical Parallel Sort (if
it's implemented the way I imagine at least).  It contains a DSA area,
which creates more DSM segments as it needs them, and nodes can opt to
free DSA memory sooner, which will likely result in those extra DSM
segments being freed; you can see that happening in Parallel Hash
which in fact does give back memory quite eagerly.  (I'm the first to
admit that it's weird that DSM segments can hold DSA areas and DSA
areas are made up of DSM segments; that falls out of the choice to use
DSM segments both for storage and as a lifetime management system for
shared resources, and I wouldn't be surprised if we reconsider that as
we get more experience and ideas.)

-- 
Thomas Munro
https://enterprisedb.com




Re: SegFault on 9.6.14

2019-08-12 Thread Robert Haas
On Mon, Aug 12, 2019 at 3:07 PM Alvaro Herrera  wrote:
> How likely is it that we would ever be able to release memory from a
> Sort (or, say, a hashjoin hash table) when it's done being read, but
> before completing the whole plan?  As I understand, right now we hold
> onto a lot of memory after such plans have been fully read, for no good
> reason other than executor being unaware of this.  This might not be
> directly related to the problem at hand, since it's not just parallel
> plans that are affected.

Being able to do that sort of thing was one of my goals in designing
the ExecShutdownNode stuff.  Unfortunately, it's clear from this bug
report that it's still a few bricks short of a load, and Tom doesn't
seem real optimistic about how easy it will be to buy those bricks at
discount prices. But I hope we persist in trying to get there, because
I don't like the idea of saying that we'll never be smart enough to
know we're done with any part of the plan until we're definitely done
with the whole thing. I think that's leaving too much money on the
table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Duplicated LSN in ReorderBuffer

2019-08-12 Thread Alvaro Herrera
On 2019-Aug-07, Andres Freund wrote:

> I think we would need to do this for all values of
> SnapBuildCurrentState() - after all the problem occurs because we
> *previously* didn't assign subxids to the toplevel xid.  Compared to the
> cost of catalog changes, ReorderBufferAssignChild() is really cheap. So
> I don't think there's any problem just calling it unconditionally (when
> top_xid <> xid, of course).

BTW I wrote the code as suggested and it passes all the tests ... but I
then noticed that the unpatched code doesn't fail Ildar's original
pgbench-based test for me, either.  So maybe my laptop is not powerful
enough to reproduce it, or maybe I'm doing something wrong.

I'm tempted to just push it, since it seems "obviously" more correct
than the original.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2019-08-12 Thread Alvaro Herrera
On 2018-Nov-16, Tom Lane wrote:

> Etsuro Fujita  writes:
> > [ fix-foreign-modify-efujita-2.patch ]
> 
> Um ... wow, I do not like anything about this.  Adding a "tableoid = X"
> constraint to every remote update query seems awfully expensive,
> considering that (a) it's useless for non-partitioned tables, and
> (b) the remote planner will have exactly no intelligence about handling
> it.  We could improve (b) probably, but that'd be another big chunk of
> work, and it wouldn't help when talking to older servers.

So do we have an updated patch for this?  It's been a while since this
patch saw any movement ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-12 Thread Bruce Momjian
On Thu, Jul 25, 2019 at 11:30:55PM -0400, Alvaro Herrera wrote:
> On 2019-Jul-25, Alvaro Herrera wrote:
> > On the other hand if the Key and IV are reused between messages then
> > the same plaintext will lead to the same ciphertext, so you can
> > potentially decrypt a message using a sufficiently large corpus of known
> > matching plaintext/ciphertext pairs, even without ever recovering the
> > key.
> 
> Actually the attack being described presumes that you know *both the*
> *unencrypted data and the encrypted data* for a certain key/IV pair,
> and only then you can decrypt some other data.  It doesn't follow that
> you can decrypt data just because somebody reused the IV for a second
> page ... I haven't seen any literature referenced that explains what
> this attack is.

I never addressed this exact comment.  If someone can guess at some
known heap/index format markers at specific offsets in a page, they can
XOR that with the encrypted data to get the encryption bit stream.  They
could then use that encrypted bit stream to XOR against another
encrypted page at the same offsets and with the same key/IV to see
unenrypted user data if it exists at the same page offsets.  (The
all-zero empty space is a huge known format marker area.)

This is why CTR is so sensitive to reuse of the key/IV settings for
encrypting different data.
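
(To make that concrete, here is a minimal standalone C sketch.  It is not
PostgreSQL code, and the fixed byte pattern below merely stands in for the
AES-CTR keystream produced from one key/IV pair; the XOR relationship it
demonstrates is the same with a real cipher: C1 XOR C2 = P1 XOR P2, so a
guessable region in one page exposes the matching bytes of the other.)

#include <stdio.h>
#include <string.h>

#define LEN 24

static void
xor_buf(unsigned char *dst, const unsigned char *a, const unsigned char *b)
{
    for (int i = 0; i < LEN; i++)
        dst[i] = a[i] ^ b[i];
}

int
main(void)
{
    /* Stand-in for the keystream generated from a single key/IV pair. */
    unsigned char keystream[LEN] = "0123456789abcdefghijklmn";
    unsigned char p1[LEN] = "known page format marker";   /* guessable bytes */
    unsigned char p2[LEN] = "secret user data here!!!";   /* unknown bytes   */
    unsigned char c1[LEN];
    unsigned char c2[LEN];
    unsigned char diff[LEN];
    unsigned char recovered[LEN + 1];

    xor_buf(c1, p1, keystream);     /* "encrypt" page 1 */
    xor_buf(c2, p2, keystream);     /* "encrypt" page 2 with the same key/IV */

    xor_buf(diff, c1, c2);          /* keystream cancels out: diff = p1 ^ p2 */
    xor_buf(recovered, diff, p1);   /* known p1 lets the attacker read p2 */
    recovered[LEN] = '\0';

    printf("recovered: %s\n", recovered);
    return 0;
}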

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Support for jsonpath .datetime() method

2019-08-12 Thread Alexander Korotkov
On Thu, Aug 1, 2019 at 1:31 PM Thomas Munro  wrote:
> On Sat, Jul 27, 2019 at 2:43 AM Andrew Dunstan
>  wrote:
> > On 7/23/19 6:48 PM, Nikita Glukhov wrote:
> > > Some concrete pieces of review:
> > >> +   
> > >> +FF1
> > >> +decisecond (0-9)
> > >> +   
> > >>
> > >> Let's not use such weird terms as "deciseconds".  We could say
> > >> "fractional seconds, 1 digit" etc. or something like that.
> > > And what about "tenths of seconds", "hundredths of seconds"?
> >
> > Yes, those are much better.
>
> I've moved this to the September CF, still in "Waiting on Author" state.

I'd like to summarize differences between standard datetime parsing
and our to_timestamp()/to_date().

1) The standard defines far fewer datetime template parts.  Namely, it defines:
YYYY | YYY | YY | Y
RRRR | RR
MM
DD
DDD
HH | HH12
HH24
MI
SS
SSSSS
FF1 | FF2 | FF3 | FF4 | FF5 | FF6 | FF7 | FF8 | FF9
A.M. | P.M.
TZH
TZM

We support the majority of them and much more.  Incompatibilities are:
 * SSSSS (our name is SSSS),
 * We don't support RRRR | RR,
 * Our handling of YYYY | YYY | YY | Y is different.  What we have
here is more like RRRR | RR in the standard (Nikita explained that
upthread [1]),
 * We don't support FF[1-9].  FF[1-6] are implemented in the patch.  We
can't support FF[7-9], because our binary representation of the timestamp
datatype doesn't have enough precision.

2) The standard defines only the following delimiters: <minus sign>,
<period>, <solidus>, <comma>, <apostrophe>, <semicolon>, <colon>, <space>.
And it requires strict matching of separators between template and input
strings.  We don't do so in either FX or non-FX mode.

For instance, we allow both to_date('2019/12/31', 'YYYY-MM-DD') and
to_date('2019/12/31', 'FXYYYY-MM-DD').  But according to the standard this
date should be written only as '2019-12-31' to match the given template
string.

3) The standard prescribes recognition of digits according to the \p{Nd}
regex.  \p{Nd} matches to "a digit zero through nine in any script
except ideographic scripts".  As far as I remember, we currently do
recognize only ASCII digits.

4) For non-delimited template parts the standard requires matching to
digit sequences of lengths between 1 and the maximum number of characters
of that template part.  We don't always do so.  For instance, we allow
more than 4 digits to correspond to YYYY, more than 3 digits to
correspond to YYY, and so on.

# select to_date('2019-12-31', 'YYY-MM-DD');
  to_date
------------
 2019-12-31
(1 row)

Links.

1. 
https://www.postgresql.org/message-id/d6efab15-f3a4-40d6-8ddb-6fd8f64cbc08%40postgrespro.ru

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: SegFault on 9.6.14

2019-08-12 Thread Alvaro Herrera
On 2019-Aug-12, Thomas Munro wrote:

> That's possibly relevant because it means we'd have a ParallelContext
> or some new overarching object that has a lifetime that is longer than
> the individual Gather nodes' processes and instrumentation data.  I'm
> not saying we need to discuss any details of this other concern now,
> I'm just wondering out loud if the whole problem in this thread goes
> away automatically when we fix it.

How likely is it that we would ever be able to release memory from a
Sort (or, say, a hashjoin hash table) when it's done being read, but
before completing the whole plan?  As I understand, right now we hold
onto a lot of memory after such plans have been fully read, for no good
reason other than executor being unaware of this.  This might not be
directly related to the problem at hand, since it's not just parallel
plans that are affected.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Improve search for missing parent downlinks in amcheck

2019-08-12 Thread Alexander Korotkov
On Fri, Jul 19, 2019 at 3:21 AM Peter Geoghegan  wrote:
> On Tue, Apr 30, 2019 at 5:58 PM Peter Geoghegan  wrote:
> > I will think about a simple fix, but after the upcoming point release.
> > There is no hurry.
>
> Attached draft patch uses RelationGetNumberOfBlocks() to size each of
> the two Bloom filters that may be used by amcheck to perform
> verification.
>
> The basic heapallindexed Bloom filter is now sized based on the
> conservative assumption that there must be *at least*
> "RelationGetNumberOfBlocks()  * 50" elements to fingerprint (reltuples
> will continue to be used to size the basic heapallindexed Bloom filter
> in most cases, though). The patch also uses the same
> RelationGetNumberOfBlocks() value to size the downlink Bloom filter.
> This second change will fix your problem very non-invasively.
>
> I intend to backpatch this to v11 in the next few days.

Thank you for backpatching this!

BTW, here is the next revision of the patch I'm proposing for v13.

In this revision the check for missing downlinks is combined with
bt_downlink_check().  So pages visited by bt_downlink_check() don't
cause extra accesses.  The patch only causes the following additional
page accesses:
1) Downlinks corresponding to "negative infinity" keys,
2) Pages of child level, which are not referenced by downlinks.

But I think these two kinds are a small minority, and those accesses
could be traded off against a more precise missing-downlink check that
doesn't need a Bloom filter.  What do you think?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


amcheck-btree-improve-missing-parent-downlinks-check-2.patch
Description: Binary data


Re: clean up obsolete initdb locale handling

2019-08-12 Thread Tom Lane
Peter Eisentraut  writes:
> OK, let's do it like that.  Updated patch attached.

LGTM, but I don't have the ability to test it on Windows.

regards, tom lane




Re: clean up obsolete initdb locale handling

2019-08-12 Thread Peter Eisentraut
On 2019-08-08 17:51, Tom Lane wrote:
> However, I don't much like the choice to set LC_COLLATE and LC_CTYPE
> differently.  That seems to be risking weird behavior, and for what?
> I'd be inclined to just remove the WIN32 stanza, initialize all
> three of these variables with "", and explain it along the lines of
> 
> * In the postmaster, absorb the environment values for LC_COLLATE
> * and LC_CTYPE.  Individual backends will change these later to
> * settings taken from pg_database, but the postmaster cannot do
> * that.  If we leave these set to "C" then message localization
> * might not work well in the postmaster.

OK, let's do it like that.  Updated patch attached.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From e3deca5b7236d1f83a286f347338f320ca952147 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 12 Aug 2019 20:09:40 +0200
Subject: [PATCH v2] initdb: Remove obsolete locale handling

The method of passing LC_COLLATE and LC_CTYPE to the backend during
initdb is obsolete as of 61d967498802ab86d8897cb3c61740d7e9d712f6.
This can all be removed.

Discussion: 
https://www.postgresql.org/message-id/flat/eeaf2f99-a1a6-8aca-3f43-9ab0b2fb112a%402ndquadrant.com
---
 src/backend/main/main.c | 38 ++
 src/bin/initdb/initdb.c | 14 --
 2 files changed, 10 insertions(+), 42 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 7b18f8c758..a9edbfd4a4 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -99,42 +99,24 @@ main(int argc, char *argv[])
MemoryContextInit();
 
/*
-* Set up locale information from environment.  Note that LC_CTYPE and
-* LC_COLLATE will be overridden later from pg_control if we are in an
-* already-initialized database.  We set them here so that they will be
-* available to fill pg_control during initdb.  LC_MESSAGES will get set
-* later during GUC option processing, but we set it here to allow 
startup
-* error messages to be localized.
+* Set up locale information
 */
-
set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("postgres"));
 
-#ifdef WIN32
-
/*
-* Windows uses codepages rather than the environment, so we work around
-* that by querying the environment explicitly first for LC_COLLATE and
-* LC_CTYPE. We have to do this because initdb passes those values in 
the
-* environment. If there is nothing there we fall back on the codepage.
+* In the postmaster, absorb the environment values for LC_COLLATE and
+* LC_CTYPE.  Individual backends will change these later to settings
+* taken from pg_database, but the postmaster cannot do that.  If we 
leave
+* these set to "C" then message localization might not work well in the
+* postmaster.
 */
-   {
-   char   *env_locale;
-
-   if ((env_locale = getenv("LC_COLLATE")) != NULL)
-   init_locale("LC_COLLATE", LC_COLLATE, env_locale);
-   else
-   init_locale("LC_COLLATE", LC_COLLATE, "");
-
-   if ((env_locale = getenv("LC_CTYPE")) != NULL)
-   init_locale("LC_CTYPE", LC_CTYPE, env_locale);
-   else
-   init_locale("LC_CTYPE", LC_CTYPE, "");
-   }
-#else
init_locale("LC_COLLATE", LC_COLLATE, "");
init_locale("LC_CTYPE", LC_CTYPE, "");
-#endif
 
+   /*
+* LC_MESSAGES will get set later during GUC option processing, but we 
set
+* it here to allow startup error messages to be localized.
+*/
 #ifdef LC_MESSAGES
init_locale("LC_MESSAGES", LC_MESSAGES, "");
 #endif
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 551d379d85..88a261d9bd 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1417,20 +1417,6 @@ bootstrap_template1(void)
bki_lines = replace_token(bki_lines, "LC_CTYPE",
  
escape_quotes_bki(lc_ctype));
 
-   /*
-* Pass correct LC_xxx environment to bootstrap.
-*
-* The shell script arranged to restore the LC settings afterwards, but
-* there doesn't seem to be any compelling reason to do that.
-*/
-   snprintf(cmd, sizeof(cmd), "LC_COLLATE=%s", lc_collate);
-   putenv(pg_strdup(cmd));
-
-   snprintf(cmd, sizeof(cmd), "LC_CTYPE=%s", lc_ctype);
-   putenv(pg_strdup(cmd));
-
-   unsetenv("LC_ALL");
-
/* Also ensure backend isn't confused by this environment var: */
unsetenv("PGCLIENTENCODING");
 
-- 
2.22.0



Re: Problem with default partition pruning

2019-08-12 Thread Simon Riggs
On Mon, 12 Aug 2019 at 18:45, Alvaro Herrera 
wrote:

> I think that should appease
> Simon's performance concern for the most common case of default
> partition not existing.
>

Much appreciated, thank you.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Peter Eisentraut
On 2019-08-12 19:26, Tom Lane wrote:
> What problem do we actually need to solve here?
> 
> If the known use-case is just "don't send my password in the clear",
> maybe we should just change libpq to refuse to do that, ie reject
> plain-password auth methods unless SSL is on (except maybe over
> unix sockets?).  Or invent a bool connection option that enables
> exactly that.

There are several overlapping problems:

1) A downgrade attack by a malicious server.  The server can collect
passwords from unsuspecting clients by just requesting some weak
authentication like plain-text or md5.  This can currently be worked
around by using SSL with server verification, except when considering
the kind of attack that channel binding is supposed to address.

2) A downgrade attack to evade channel binding.  This cannot currently
be worked around.

3) A user not wanting to expose a weakly hashed password to the
(otherwise trusted) server.  This cannot currently be done.

4) A user not wanting to send a password in plain text over the wire.
This can currently be done by requiring SSL.
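
To illustrate (4) concretely, a minimal libpq client sketch follows.  The
host/dbname/user values are placeholders; the point is only that
sslmode=require (or verify-full) is the existing way to insist on TLS, so
that even a plain-text password exchange at least never crosses the wire in
the clear.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /*
     * Build with something like:
     *   cc client.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq
     */
    PGconn     *conn = PQconnectdb("host=db.example.com dbname=postgres "
                                   "user=app sslmode=require");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    printf("connected; SSL in use: %s\n", PQsslInUse(conn) ? "yes" : "no");
    PQfinish(conn);
    return EXIT_SUCCESS;
}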

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Problem with default partition pruning

2019-08-12 Thread Alvaro Herrera
v3-0001 still seems to leave things a bit duplicative.  I think we can
make it better if we move the logic to set RelOptInfo->partition_qual to
a separate routine (set_baserel_partition_constraint mirroring the
existing set_baserel_partition_key_exprs), and then call that from both
places that need access to partition_qual.

So I propose that the attached v4 patch should be the final form of this
(also rebased across today's list_concat API change).  I verified that
constraint exclusion is not being called by partprune unless a default
partition exists (thanks errbacktrace()); I think that should appease
Simon's performance concern for the most common case of default
partition not existing.

I think I was not really understanding the comments being added by
Amit's v3, so I reworded them.  I hope I understood the intent of the
code correctly.

I'm not comfortable with RelOptInfo->partition_qual.  But I'd rather
leave that for another time.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From eccc45f4e62be8ded446e8e5afd81aa248eed937 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 9 Aug 2019 13:23:56 -0400
Subject: [PATCH v4] Rejigger code

---
 src/backend/optimizer/util/plancat.c | 59 +++
 src/backend/partitioning/partprune.c | 70 ++--
 2 files changed, 64 insertions(+), 65 deletions(-)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 98e99481c6..cf1761401d 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -78,6 +78,9 @@ static void set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
 static PartitionScheme find_partition_scheme(PlannerInfo *root, Relation rel);
 static void set_baserel_partition_key_exprs(Relation relation,
 			RelOptInfo *rel);
+static void set_baserel_partition_constraint(Relation relation,
+			 RelOptInfo *rel);
+
 
 /*
  * get_relation_info -
@@ -1267,25 +1270,9 @@ get_relation_constraints(PlannerInfo *root,
 	 */
 	if (include_partition && relation->rd_rel->relispartition)
 	{
-		List	   *pcqual = RelationGetPartitionQual(relation);
-
-		if (pcqual)
-		{
-			/*
-			 * Run the partition quals through const-simplification similar to
-			 * check constraints.  We skip canonicalize_qual, though, because
-			 * partition quals should be in canonical form already; also,
-			 * since the qual is in implicit-AND format, we'd have to
-			 * explicitly convert it to explicit-AND format and back again.
-			 */
-			pcqual = (List *) eval_const_expressions(root, (Node *) pcqual);
-
-			/* Fix Vars to have the desired varno */
-			if (varno != 1)
-ChangeVarNodes((Node *) pcqual, 1, varno, 0);
-
-			result = list_concat(result, pcqual);
-		}
+		/* make sure rel->partition_qual is set */
+		set_baserel_partition_constraint(relation, rel);
+		result = list_concat(result, rel->partition_qual);
 	}
 
 	table_close(relation, NoLock);
@@ -2149,7 +2136,7 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
 	rel->boundinfo = partdesc->boundinfo;
 	rel->nparts = partdesc->nparts;
 	set_baserel_partition_key_exprs(relation, rel);
-	rel->partition_qual = RelationGetPartitionQual(relation);
+	set_baserel_partition_constraint(relation, rel);
 }
 
 /*
@@ -2324,3 +2311,35 @@ set_baserel_partition_key_exprs(Relation relation,
 	 */
 	rel->nullable_partexprs = (List **) palloc0(sizeof(List *) * partnatts);
 }
+
+/*
+ * set_baserel_partition_constraint
+ *
+ * Builds the partition constraint for the given base relation and sets it
+ * in the given RelOptInfo.  All Var nodes are restamped with the relid of the
+ * given relation.
+ */
+static void
+set_baserel_partition_constraint(Relation relation, RelOptInfo *rel)
+{
+	List	   *partconstr;
+
+	if (rel->partition_qual)	/* already done */
+		return;
+
+	/*
+	 * Run the partition quals through const-simplification similar to check
+	 * constraints.  We skip canonicalize_qual, though, because partition
+	 * quals should be in canonical form already; also, since the qual is in
+	 * implicit-AND format, we'd have to explicitly convert it to explicit-AND
+	 * format and back again.
+	 */
+	partconstr = RelationGetPartitionQual(relation);
+	if (partconstr)
+	{
+		partconstr = (List *) expression_planner((Expr *) partconstr);
+		if (rel->relid != 1)
+			ChangeVarNodes((Node *) partconstr, 1, rel->relid, 0);
+		rel->partition_qual = partconstr;
+	}
+}
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index a1bd4efd0b..f619171ace 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -619,31 +619,16 @@ gen_partprune_steps(RelOptInfo *rel, List *clauses, PartClauseTarget target,
 	context->target = target;
 
 	/*
-	 * For sub-partitioned tables there's a corner case where if the
-	 * sub-partitioned table shares any partiti

Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I'm not really thrilled with approach #2 because it means the user
> > will have to know which of the PG authentication methods involve, eg,
> > sending the password in the clear to the server, and which don't, if
> > what they're really looking for is "don't send my password in the clear
> > to the server" which seems like a really useful and sensible thing to
> > ask for.
> 
> What problem do we actually need to solve here?
> 
> If the known use-case is just "don't send my password in the clear",
> maybe we should just change libpq to refuse to do that, ie reject
> plain-password auth methods unless SSL is on (except maybe over
> unix sockets?).  Or invent a bool connection option that enables
> exactly that.

Right, inventing a bool connection option for it was discussed up-thread and
seems like a reasonable idea to me (note: we should absolutely allow the
user to refuse to send the password to the server even over SSL, if they
would prefer to not do so).

> I'm not really convinced that there is a use-case for client side
> specification of allowed auth methods beyond that.  In the end,
> if you don't trust the server you're connecting to to handle your
> password with reasonable safety, you have got bigger problems than
> this one.  And we already have coverage for MITM problems (or if
> we don't, this sideshow isn't fixing it).

Uh, no, we really don't have MITM protection in certain cases, which is
exactly what channel-binding is intended to address, but we can't have
the "server" be able to say "oh, well, I don't support channel binding"
and have the client go "oh, ok, that's just fine, we won't use it then"-
that's a downgrade attack.

What was suggested up-thread to deal with that downgrade risk was a clear
connection option along the lines of "require channel binding" to
prevent that kind of a MITM downgrade attack from working.

I could possibly see some value in a client-side option along the lines
of "only authenticate using GSSAPI", which could prevent some users from
being fooled into sending their PW to a malicious server.  GSSAPI does
prevent MITM attacks (as much as it's able to anyway- each key is
specific to a particular server, so you'd have to have the specific
server's key in order to become a MITM), but if the server says "we
don't do GSSAPI, we do password, please give us your password" then psql
will happily go along with that even in an otherwise properly set up
GSSAPI environment.

Requiring GSSAPI Encryption on the client side should prevent that
though, as of v12, since psql will just refuse if the server claims to
not support GSSAPI Encryption.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Tom Lane
Stephen Frost  writes:
> I'm not really thrilled with approach #2 because it means the user
> will have to know which of the PG authentication methods involve, eg,
> sending the password in the clear to the server, and which don't, if
> what they're really looking for is "don't send my password in the clear
> to the server" which seems like a really useful and sensible thing to
> ask for.

What problem do we actually need to solve here?

If the known use-case is just "don't send my password in the clear",
maybe we should just change libpq to refuse to do that, ie reject
plain-password auth methods unless SSL is on (except maybe over
unix sockets?).  Or invent a bool connection option that enables
exactly that.

I'm not really convinced that there is a use-case for client side
specification of allowed auth methods beyond that.  In the end,
if you don't trust the server you're connecting to to handle your
password with reasonable safety, you have got bigger problems than
this one.  And we already have coverage for MITM problems (or if
we don't, this sideshow isn't fixing it).

regards, tom lane




Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> On 2019-08-12 18:02, Jeff Davis wrote:
> > https://postgr.es/m/daf0017a1a5c2caabf88a4e00f66b4fcbdfeccad.camel%40j-davis.com
> > 
> > The weakness of proposal #1 is that it's not very "future-proof" and we
> > would likely need to change something about it later when we support
> > new methods. That wouldn't break clients, but it would be annoying to
> > need to support some old syntax and some new syntax for the connection
> > parameters.
> > 
> > Proposal #3 does not have this weakness. When we add sha-512, we could
> > also add a parameter to specify that the client requires a certain hash
> > algorithm for SCRAM.
> > 
> > Do you favor that existing proposal #3, or are you proposing a fourth
> > option?
> 
> In this context, I would prefer #2, but I would expand that to cover all
> authentication methods, not only password methods.

I'm not really thrilled with approach #2 because it means the user
will have to know which of the PG authentication methods involve, eg,
sending the password in the clear to the server, and which don't, if
what they're really looking for is "don't send my password in the clear
to the server" which seems like a really useful and sensible thing to
ask for.

It also ends up not being very future-proof either, since a user who is
fine with scram-sha-256-plus will probably also be ok with
scram-sha-512-plus, should we ever implement it.

Not to mention that, at least at the moment, we don't let users pick
authentication methods with that kind of specificity on the server side
(how do you require channel binding..?), so the set of "authentication
methods" on the client side and those on the server side end up being
different sets, which strikes me as awfully confusing...

Or did I misunderstand what you were suggesting here wrt "all
authentication methods"?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: errbacktrace

2019-08-12 Thread Pavel Stehule
On Mon, 12 Aug 2019 at 19:06, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-08-12 13:19, Pavel Stehule wrote:
> > If I understand well, the backtrace is displayed only when edata->funcname
> > is the same as the backtrace_function GUC. Isn't that too strong a limit?
> >
> > For example, I want to see a backtrace for all PANIC level errors in
> > production, and I would not want to limit the source function?
>
> We can add additional ways to invoke this once we have the basic
> functionality in.
>

ok

Pavel

>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: errbacktrace

2019-08-12 Thread Peter Eisentraut
On 2019-08-12 13:19, Pavel Stehule wrote:
> If I understand well, the backtrace is displayed only when edata->funcname
> is the same as the backtrace_function GUC. Isn't that too strong a limit?
> 
> For example, I want to see a backtrace for all PANIC level errors in
> production, and I would not want to limit the source function?

We can add additional ways to invoke this once we have the basic
functionality in.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Peter Eisentraut
On 2019-08-12 18:02, Jeff Davis wrote:
> https://postgr.es/m/daf0017a1a5c2caabf88a4e00f66b4fcbdfeccad.camel%40j-davis.com
> 
> The weakness of proposal #1 is that it's not very "future-proof" and we
> would likely need to change something about it later when we support
> new methods. That wouldn't break clients, but it would be annoying to
> need to support some old syntax and some new syntax for the connection
> parameters.
> 
> Proposal #3 does not have this weakness. When we add sha-512, we could
> also add a parameter to specify that the client requires a certain hash
> algorithm for SCRAM.
> 
> Do you favor that existing proposal #3, or are you proposing a fourth
> option?

In this context, I would prefer #2, but I would expand that to cover all
authentication methods, not only password methods.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [survey] New "Stable" QueryId based on normalized query text

2019-08-12 Thread legrand legrand
my understanding is

* pg_stat_statements.track = 'none' or 'top' (default) or 'all' 
to make queryId optionally computed

* a new GUC: pg_stat_statements.queryid_based = 'oids' (default) or 'names'
or 'fullnames'
to choose the queryid computation algorithm

am I right?




--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: Global temporary tables

2019-08-12 Thread Pavel Stehule
On Mon, 12 Aug 2019 at 18:19, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> Hi,
>
> On 11.08.2019 10:14, Pavel Stehule wrote:
>
>
> Hi
>
>
>> There is one more problem with global temporary tables for which I do not
>> know a good solution now: collecting statistics.
>> As far as each backend has its own data, generally they may need
>> different query plans.
>> Right now if you perform "analyze table" in one backend, then it will
>> affect plans in all backends.
>> It can be considered not as a bug, but as a feature if we assume that
>> the distribution of data in all backends is similar.
>> But if this assumption is not true, then it can be a problem.
>>
>
> The last point is probably the most difficult issue, and I have been thinking
> about it for years.
>
> I have experience with my customers: 99% of temp table usage is without
> statistics - just with good information only about rows. Only a few
> customers know that a manual ANALYZE is necessary for temp tables (when it
> is really necessary).
>
> Sharing metadata about global temporary tables can be a real problem -
> probably not about statistics, but surely about the number of pages and
> number of rows.
>
>
> But Postgres is not storing this information anywhere else except the
> statistics now, is it?
>

Not only that - the critical numbers are reltuples and relpages from pg_class.

There was a proposal to cache relation size, but it is not implemented yet.
> If such a cache exists, then we can use it to store local information about
> global temporary tables.
> So if 99% of users do not perform analyze on temporary tables, then they
> will not be faced with this problem, right?
>

they use default statistics based on relpages. But for 1% of applications
statistics are critical - almost always for OLAP applications.


>
>
> There are two requirements:
>
> a) we need some special meta data for any instance (per session) of global
> temporary table (row, pages, statistics, maybe multicolumn statistics, ...)
>
> b) we would not want to use a persistent global catalogue (against catalogue
> bloating)
>
> I see two possible solution:
>
> 1. hold these data only in memory in special buffers
>
> 2. hold these data in global temporary tables - it is similar to normal
> tables - we can use global temp tables for metadata like classic persistent
> tables are used for metadata of classic persistent tables. Next syscache
> can be enhanced to work with union of two system tables.
>
> I prefer @2 because changes can be implemented on deeper level.
>
> Sharing metadata for global temp tables (the current state, if I understand
> well) is good enough for the development stage, but it is hard to expect that
> it can work generally in a production environment.
>
>
> I think that it is not possible to assume that temporary data will always fit
> in memory.
> So 1) seems not to be an acceptable solution.
>

I spoke only about metadata. Data should be stored in temp buffers (and
possibly in temp files)

Pavel

>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: Removing unneeded downlink field from nbtree stack struct

2019-08-12 Thread Anastasia Lubennikova

16.07.2019 2:16, Peter Geoghegan wrote:

Attached patch slightly simplifies _bt_getstackbuf() by making it
accept a child BlockNumber argument, rather than requiring that
callers store the child block number in the parent stack item's
bts_btentry field. We can remove the bts_btentry field from the
BTStackData struct, because we know where we ended up when we split a
page and need to relocate parent to insert new downlink -- it's only
truly necessary to remember what pivot tuple/downlink we followed to
arrive at the page being split. There is no point in remembering the
child block number during our initial descent of a B-Tree, since it's
never actually used at a later point, and can go stale immediately
after the buffer lock on parent is released. Besides,
_bt_getstackbuf() callers can even redefine the definition of child to
be child's right sibling after the descent is over. For example, this
happens when we move right, or when we step right during unique index
insertion.

This slightly simplifies the code. Our stack is inherently
approximate, because we might have to move right for a number of
reasons.

I'll add the patch to the 2019-09 CF.



The refactoring is clear, so I set the status to Ready for Committer.
I have just a couple of notes about comments:

1) I think that it's worth adding an explanation of the case when we use the 
right sibling to this comment:
+                * stack to work back up to the parent page.  We use the child block
+                * number (or possibly the block number of a page to its right)


2) It took me quite some time to understand why the page deletion case 
doesn't need a lock.
I propose to add something like "For more, see comments for 
_bt_lock_branch_parent()" to this line:


Page deletion caller
+ *             can get away with a lock on leaf level page when locating topparent
+ *             downlink, though.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global temporary tables

2019-08-12 Thread Konstantin Knizhnik

Hi,

On 11.08.2019 10:14, Pavel Stehule wrote:


Hi


There is one more problem with global temporary tables for which I
do not know a good solution now: collecting statistics.
As far as each backend has its own data, generally they may need
different query plans.
Right now if you perform "analyze table" in one backend, then it
will affect plans in all backends.
It can be considered not as a bug, but as a feature if we assume that
the distribution of data in all backends is similar.
But if this assumption is not true, then it can be a problem.


The last point is probably the most difficult issue, and I have been 
thinking about it for years.


I have experience with my customers: 99% of temp table usage is 
without statistics - just with good information only about rows. Only 
a few customers know that a manual ANALYZE is necessary for temp tables 
(when it is really necessary).


Sharing metadata about global temporary tables can be a real problem - 
probably not about statistics, but surely about the number of pages and 
number of rows.


But Postgres is not storing this information anywhere else except the 
statistics now, is it?
There was a proposal to cache relation size, but it is not implemented 
yet. If such a cache exists, then we can use it to store local information 
about global temporary tables.
So if 99% of users do not perform analyze on temporary tables, then 
they will not be faced with this problem, right?





There are two requirements:

a) we need some special meta data for any instance (per session) of 
global temporary table (row, pages, statistics, maybe multicolumn 
statistics, ...)


b) we would not want to use a persistent global catalogue (against catalogue 
bloating)


I see two possible solution:

1. hold these data only in memory in special buffers

2. hold these data in global temporary tables - it is similar to 
normal tables - we can use global temp tables for metadata like 
classic persistent tables are used for metadata of classic persistent 
tables. Next syscache can be enhanced to work with union of two system 
tables.


I prefer @2 because changes can be implemented on deeper level.

Sharing metadata for global temp tables (the current state, if I understand 
well) is good enough for the development stage, but it is hard to expect 
that it can work generally in a production environment.




I think that it is not possible to assume that temporary data will always 
fit in memory.

So 1) seems not to be an acceptable solution.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: POC: converting Lists into arrays

2019-08-12 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Aug-09, Tom Lane wrote:
>> I poked at this, and attached is a patch, but again I'm not seeing
>> that there's any real performance-based argument for it.

> I'm confused.  I thought that the point of doing this wasn't that we
> wanted to improve performance, but rather that we're now able to remove
> the array without *losing* performance.  I mean, those arrays were there
> to improve performance for code that wanted fast access to specific list
> items, but now we have fast access to the list items without it.  So a
> measurement that finds no performance difference is good news, and we
> can get rid of the now-pointless optimization ...

Yeah, fair enough, so pushed.

In principle, this change adds an extra indirection in exec_rt_fetch,
so I went looking to see if there were any such calls in arguably
performance-critical paths.  Unsurprisingly, most calls are in executor
initialization, and they tend to be adjacent to table_open() or other
expensive operations, so it's pretty hard to claim that there could
be any measurable hit.  However, I did notice that trigger.c uses
ExecUpdateLockMode() and GetAllUpdatedColumns() in ExecBRUpdateTriggers
which executes per-row, and so might be worth trying to optimize.
exec_rt_fetch itself is not the main cost in either of those, but I wonder
why we are doing those calculations over again for each row in the first
place.  I'm not excited enough about the issue to do anything right now,
but the next time somebody whines about trigger-firing overhead, there
might be an easy win available there.

regards, tom lane




Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Jeff Davis
On Sun, 2019-08-11 at 19:00 +0200, Peter Eisentraut wrote:
> On 2019-08-09 23:56, Jeff Davis wrote:
> > 1. Hierarchical semantics, where you specify the least-secure
> > acceptable method:
> > 
> >   password_protocol = {any,md5,scram-sha-256,scram-sha-256-plus}
> 
> What would the hierarchy be if scram-sha-512 and scram-sha-512-plus
> are
> added?


https://postgr.es/m/daf0017a1a5c2caabf88a4e00f66b4fcbdfeccad.camel%40j-davis.com

The weakness of proposal #1 is that it's not very "future-proof" and we
would likely need to change something about it later when we support
new methods. That wouldn't break clients, but it would be annoying to
need to support some old syntax and some new syntax for the connection
parameters.

Proposal #3 does not have this weakness. When we add sha-512, we could
also add a parameter to specify that the client requires a certain hash
algorithm for SCRAM.

Do you favor that existing proposal #3, or are you proposing a fourth
option?

Regards,
Jeff Davis






Re: POC: converting Lists into arrays

2019-08-12 Thread Tom Lane
David Rowley  writes:
> On Fri, 9 Aug 2019 at 04:24, Tom Lane  wrote:
>> Has anyone got further thoughts about naming around list_concat
>> and friends?
>> If not, I'm inclined to go ahead with the concat-improvement patch as
>> proposed in [1], modulo the one improvement David spotted.
>> [1] https://www.postgresql.org/message-id/6704.1563739...@sss.pgh.pa.us

> I'm okay with the patch once that one improvement is done.

Pushed with that fix.

> I think if we want to think about freeing the 2nd input List then we
> can do that in another commit. Removing the redundant list_copy()
> calls seems quite separate from that.

The reason I was holding off is that this patch obscures the distinction
between places that needed to preserve the second input (which were
doing list_copy on it) and those that didn't (and weren't).  If somebody
wants to rethink the free-second-input business they'll now have to do
a bit of software archaeology to determine which calls to change.  But
I don't think we're going to bother.

regards, tom lane




Re: Add "password_protocol" connection parameter to libpq

2019-08-12 Thread Jonathan S. Katz
On 8/11/19 3:56 PM, Peter Eisentraut wrote:
> On 2019-08-11 21:46, Jonathan S. Katz wrote:
>> On 8/11/19 1:00 PM, Peter Eisentraut wrote:
>>> On 2019-08-09 23:56, Jeff Davis wrote:
 1. Hierarchical semantics, where you specify the least-secure
 acceptable method:

   password_protocol = {any,md5,scram-sha-256,scram-sha-256-plus}
>>>
>>> What would the hierarchy be if scram-sha-512 and scram-sha-512-plus are
>>> added?
>>
>> password_protocol =
>> {any,md5,scram-sha-256,scram-sha-512,scram-sha-256-plus,scram-sha-512-plus}?
>>
>> I'd put one length of digest over another, but I'd still rank a method
>> that uses channel binding has more protections than one that does not.
> 
> Sure, but the opposite opinion is also possible.

That's true, and when I originally started composing my note, I had it as
(256,512,256-plus,512-plus).

But upon further reflection, the reason I ranked the digest-plus methods
above the digest methods is that there is an additional requirement
imposed by them. The digest methods could be invoked either with/without
TLS, whereas the digest-plus methods *must* use TLS. As such, 256-plus
is explicitly asking for an additional security parameter over 512, i.e.
transmission over TLS, so even if it's a smaller digest, it has the
additional channel binding requirement.

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: [survey] New "Stable" QueryId based on normalized query text

2019-08-12 Thread Julien Rouhaud
On Mon, Aug 12, 2019 at 4:01 PM Evgeniy Efimkin  wrote:
>
> > One problem with pg_stat_statement's normalized query is that it's not
> > stable, it's storing the normalized version of the first query string
> > passed when an entry is created. So you could have different strings
> > depending on whether the query was fully qualified or relying on
> > search path for instance.
> I think the normalized query stored in pg_stat_statements is not very important.
> It might look something like this:
> `
>  query                 | calls |  queryid   |  sql_id
> -----------------------+-------+------------+------------
>  Select * from   t     |     4 |  762359559 |  680388963
>  select * from t       |     7 | 3438533065 |  680388963
>  select * from test2.t |     1 |  230362373 |  680388963
> `
> We can cut the schema name in the SQL normalization
> algorithm.

Not only the schema name but all kinds of qualification and indeed extra
whitespace. Things get harder for other differences that aren't
meaningful (LIKE vs ~~, IN vs = ANY...).  That would also imply that
everyone wants to ignore schemas in query normalization; I'm not sure
how realistic that is.




Re: [survey] New "Stable" QueryId based on normalized query text

2019-08-12 Thread Evgeniy Efimkin



> One problem with pg_stat_statement's normalized query is that it's not
> stable, it's storing the normalized version of the first query string
> passed when an entry is created. So you could have different strings
> depending on whether the query was fully qualified or relying on
> search path for instance.
I think the normalized query stored in pg_stat_statements is not very important.
It might look something like this:
`
 query                 | calls |  queryid   |  sql_id
-----------------------+-------+------------+------------
 Select * from   t     |     4 |  762359559 |  680388963
 select * from t       |     7 | 3438533065 |  680388963
 select * from test2.t |     1 |  230362373 |  680388963
`
We can cut the schema name in the SQL normalization 
algorithm. 
 
Efimkin Evgeny





Re: [survey] New "Stable" QueryId based on normalized query text

2019-08-12 Thread Julien Rouhaud
On Mon, Aug 12, 2019 at 2:52 PM Evgeniy Efimkin  wrote:
>
> Hi!
> What about adding a new column in pg_stat_statements, e.g. sql_id, which is a
> hash of the normalized query? And add a function which gets that hash (using
> raw_parser, raw_expression_tree_walker) for any query:
> `
> postgres=# select get_queryid('select 1');
>  get_queryid
> -------------
>  680388963
> (1 row)

One problem with pg_stat_statement's normalized query is that it's not
stable, it's storing the normalized version of the first query string
passed when an entry is created.  So you could have different strings
depending on whether the query was fully qualified or relying on
search path for instance.

Exposing the queryid computation at SQL level has already been
proposed, and FWIW I'm all for it.




Re: [survey] New "Stable" QueryId based on normalized query text

2019-08-12 Thread Julien Rouhaud
On Mon, Aug 12, 2019 at 2:40 PM Jim Finnerty  wrote:
>
> If hashing names instead of using OIDs is too expensive for some workload,
> then that workload would need to be able to turn statement hashing off.  So
> it needs to be optional, just like queryId is optionally computed today.
> For many cases the extra overhead of hashing object names is small compared
> to optimization time plus execution time.

Are you suggesting a fallback to oid based queryid or to entirely
disable queryid generation?

How would that work with pg_stat_statements or similar extension?




Re: [survey] New "Stable" QueryId based on normalized query text

2019-08-12 Thread Evgeniy Efimkin
Hi!
What about adding a new column in pg_stat_statements, e.g. sql_id, which is a 
hash of the normalized query? And add a function which gets that hash (using 
raw_parser, raw_expression_tree_walker) for any query:
`
postgres=# select get_queryid('select 1');
 get_queryid
-------------
 680388963
(1 row)
`
That function can be used on pg_stat_activity's query column to join with 
pg_stat_statements if needed.

12.08.2019, 14:51, "legrand legrand" :
> Hi Jim,
>
> It's never too late, as nothing has been concluded about that survey ;o)
>
> For information, I thought it would be possible to get a more stable QueryId
> by hashing relation names or fully qualified names.
>
> With the support of Julien Rouhaud, I tested with this kind of code:
>
>  case RTE_RELATION:
>      if (pgss_queryid_oid)
>      {
>          APP_JUMB(rte->relid);
>      }
>      else
>      {
>          rel = RelationIdGetRelation(rte->relid);
>          APP_JUMB_STRING(RelationGetRelationName(rel));
>          APP_JUMB_STRING(get_namespace_name(get_rel_namespace(rte->relid)));
>          RelationClose(rel);
>      }
>
> thinking that 3 hash options would be interesting in pgss:
> 1- actual OID
> 2- relation names only (for databases WITHOUT distinct schemas containing
> same tables)
> 3- fully qualified names schema.relname (for databases WITH distinct schemas
> containing same tables)
>
> but performance was quite bad (it was a few months ago, but I remember
> about a 1-5% decrease).
> I also remember that this was not portable between distinct pg versions 11/12,
> and I'm also not sure it was stable between Windows / Linux ports ...
>
> So I stopped here ... Maybe it's time to test this alternative more deeply
> (to get fully qualified name hashes in one call), knowing that such
> transformations will have to be done for all object types (not only relations)?
>
> I'm ready to continue testing as it seems the least impactful solution to
> keep the current pgss ...
>
> If this doesn't work, then trying with a normalized query text (associated
> with search_path) would be the other alternative, but the impact on the
> current pgss would be higher ...
>
> Regards
> PAscal
>
> --
> Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

 
Efimkin Evgeny






Re: block-level incremental backup

2019-08-12 Thread Robert Haas
On Mon, Aug 12, 2019 at 7:57 AM Jeevan Chalke
 wrote:
> Agree that we can certainly use open(), read(), write(), and close() here, but
> given that pg_basebackup.c and basebackup.c are using file operations, I think
> using fopen(), fread(), fwrite(), and fclose() will be better here, at least
> for consistency.

Oh, that's fine.  Whatever's more consistent with the pre-existing
code. Just, let's not use system().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-08-12 Thread Jeevan Chalke
On Mon, Aug 12, 2019 at 5:29 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe <
> jeevan.la...@enterprisedb.com> wrote:
>
>> Hi Robert,
>>
>> On Fri, Aug 9, 2019 at 6:40 PM Robert Haas  wrote:
>>
>>> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
>>>  wrote:
>>> > +   if (!XLogRecPtrIsInvalid(previous_lsn))
>>> > +   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION:
>>> %X/%X\n",
>>> > +(uint32) (previous_lsn >> 32), (uint32)
>>> previous_lsn);
>>> >
>>> > May be we should rename to something like:
>>> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
>>> START LOCATION"
>>> > to make it more intuitive?
>>>
>>> So, I think that you are right that PREVIOUS WAL LOCATION might not be
>>> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
>>> LOCATION is definitely not clear.  This backup is an incremental
>>> backup, and it has a start WAL location, so you'd end up with START
>>> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
>>> like they ought to both be the same thing, but they're not.  Perhaps
>>> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
>>> INCREMENTAL BACKUP would be clearer.
>>>
>>
>> Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
>>
>
> +1 for INCREMENTAL BACKUP REFERENCE WA.
>

Sorry for the typo:
+1 for the INCREMENTAL BACKUP REFERENCE WAL LOCATION.


>
>>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-12 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 08:07:49PM -0400, Bruce Momjian wrote:
> On Thu, Jul 11, 2019 at 12:18:47AM +0200, Tomas Vondra wrote:
> > On Wed, Jul 10, 2019 at 06:04:30PM -0400, Stephen Frost wrote:
> > > Greetings,
> > > 
> > > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> > > > On Wed, Jul 10, 2019 at 04:11:21PM -0400, Alvaro Herrera wrote:
> > > > >On 2019-Jul-10, Bruce Momjian wrote:
> > > > >
> > > > >>Uh, what if a transaction modifies page 0 and page 1 of the same table
> > > > >>--- don't those pages have the same LSN.
> > > > >
> > > > >No, because WAL being a physical change log, each page gets its own
> > > > >WAL record with its own LSN.
> > > > >
> > > > 
> > > > What if you have wal_log_hints=off? AFAIK that won't change the page 
> > > > LSN.
> > > 
> > > Alvaro suggested elsewhere that we require checksums for these, which
> > > would also force wal_log_hints to be on, and therefore the LSN would
> > > change.
> > > 
> > 
> > Oh, I see - yes, that would solve the hint bits issue. Not sure we want
> > to combine the features like this, though, as it increases the costs of
> > TDE. But maybe it's the best solution.
> 
> Uh, why can't we just force log_hint_bits for encrypted tables?  Why
> would we need to use checksums as well?

When we were considering CBC mode for heap/index pages, a change of a
hint bit would change all later 16-byte encrypted blocks.  Now that we
are using CTR mode, a change of a hint bit will only change a bit on the
stored page.

Someone could compare the old and new pages and see that a bit was
changed.  This would make log_hint_bits less of a requirement with CTR
mode, though leaking hint bit changes is probably not ideal.  Perhaps
log_hint_bits should be a recommendation and not a requirement.
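
To make the CTR point concrete, here is a toy, self-contained illustration
(it is NOT real AES-CTR; a seeded PRNG stands in for the keystream): because
CTR mode just XORs the plaintext with a keystream, re-encrypting a page under
the same key/nonce after a single hint-bit flip changes exactly one
ciphertext bit, which is what an observer comparing page versions could see.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGESZ 8192

/* Stand-in for a CTR keystream: XOR the page with a seeded PRNG stream. */
static void
toy_ctr_encrypt(const unsigned char *in, unsigned char *out, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < PAGESZ; i++)
        out[i] = in[i] ^ (unsigned char) rand();
}

int
main(void)
{
    unsigned char page[PAGESZ], page2[PAGESZ];
    unsigned char ct1[PAGESZ], ct2[PAGESZ];
    int     diffbits = 0;

    memset(page, 0xAB, sizeof(page));
    memcpy(page2, page, sizeof(page));
    page2[100] ^= 0x04;             /* flip one "hint bit" */

    toy_ctr_encrypt(page, ct1, 42);
    toy_ctr_encrypt(page2, ct2, 42);    /* same key/nonce -> same keystream */

    for (int i = 0; i < PAGESZ; i++)
        for (int b = 0; b < 8; b++)
            if (((ct1[i] ^ ct2[i]) >> b) & 1)
                diffbits++;

    printf("ciphertext bits changed: %d\n", diffbits);  /* prints 1 */
    return 0;
}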

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: block-level incremental backup

2019-08-12 Thread Jeevan Chalke
On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe 
wrote:

> Hi Robert,
>
> On Fri, Aug 9, 2019 at 6:40 PM Robert Haas  wrote:
>
>> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
>>  wrote:
>> > +   if (!XLogRecPtrIsInvalid(previous_lsn))
>> > +   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION:
>> %X/%X\n",
>> > +(uint32) (previous_lsn >> 32), (uint32)
>> previous_lsn);
>> >
>> > May be we should rename to something like:
>> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
>> START LOCATION"
>> > to make it more intuitive?
>>
>> So, I think that you are right that PREVIOUS WAL LOCATION might not be
>> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
>> LOCATION is definitely not clear.  This backup is an incremental
>> backup, and it has a start WAL location, so you'd end up with START
>> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
>> like they ought to both be the same thing, but they're not.  Perhaps
>> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
>> INCREMENTAL BACKUP would be clearer.
>>
>
> Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
>

+1 for INCREMENTAL BACKUP REFERENCE WA.


>

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-12 Thread Jeevan Chalke
On Fri, Aug 9, 2019 at 6:36 PM Robert Haas  wrote:

> On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
>  wrote:
> > So, do you mean we should just do fread() and fwrite() for the whole
> file?
> >
> > I thought it is better if it was done by the OS itself instead of
> reading 1GB
> > into the memory and writing the same to the file.
>
> Well, 'cp' is just a C program.  If they can write code to copy a
> file, so can we, and then we're not dependent on 'cp' being installed,
> working properly, being in the user's path or at the hard-coded
> pathname we expect, etc.  There's an existing copy_file() function in
> src/backend/storage/file/copydir.c which I'd probably look into
> adapting for frontend use.  I'm not sure whether it would be important
> to adapt the data-flushing code that's present in that routine or
> whether we could get by with just the loop to read() and write() data.
>

Agree that we can certainly use open(), read(), write(), and close() here,
but given that pg_basebackup.c and basebackup.c are using file operations, I
think using fopen(), fread(), fwrite(), and fclose() will be better here, at
least for consistency.

Let me know if we still want to go with native OS calls.
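
In case it helps the discussion, here is a minimal sketch of what the
stdio-based copy loop could look like on the frontend side (the function
name and buffer size are made up for illustration, and real code would use
pg_log_error() and proper cleanup rather than exit()):

#include <stdio.h>
#include <stdlib.h>

#define COPY_BUF_SIZE (128 * 1024)

static void
copy_file_frontend(const char *srcpath, const char *dstpath)
{
    FILE   *src = fopen(srcpath, "rb");
    FILE   *dst = fopen(dstpath, "wb");
    char   *buf = malloc(COPY_BUF_SIZE);
    size_t  nread;

    if (src == NULL || dst == NULL || buf == NULL)
    {
        fprintf(stderr, "could not open input/output file or allocate buffer\n");
        exit(1);
    }

    /* read the source in chunks and write each chunk to the destination */
    while ((nread = fread(buf, 1, COPY_BUF_SIZE, src)) > 0)
    {
        if (fwrite(buf, 1, nread, dst) != nread)
        {
            fprintf(stderr, "could not write to \"%s\"\n", dstpath);
            exit(1);
        }
    }
    if (ferror(src))
    {
        fprintf(stderr, "could not read from \"%s\"\n", srcpath);
        exit(1);
    }

    free(buf);
    fclose(src);
    if (fclose(dst) != 0)
    {
        fprintf(stderr, "could not close \"%s\"\n", dstpath);
        exit(1);
    }
}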


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: errbacktrace

2019-08-12 Thread Pavel Stehule
Hi

> so I agree with unconditionally defining that symbol.
>
> Nitpicking dept: I think in these tests:
>
> +   if (!edata->backtrace &&
> +   edata->funcname &&
> +   backtrace_function[0] &&
> +   strcmp(backtrace_function, edata->funcname) == 0)
> +   set_backtrace(edata, 2);
>
>
If I understand correctly, the backtrace is displayed only when
edata->funcname matches the backtrace_function GUC. Isn't that too strong a
limit?

For example, I might want to see a backtrace for all PANIC-level errors in
production, without restricting it to a single source function.
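
Something along these lines, as a sketch only (backtrace_min_level would be a
new, hypothetical GUC, e.g. set to PANIC; the other names are from the patch
under discussion):

    /* also emit a backtrace when the severity reaches backtrace_min_level */
    if (!edata->backtrace &&
        ((backtrace_function[0] && edata->funcname &&
          strcmp(backtrace_function, edata->funcname) == 0) ||
         (backtrace_min_level >= 0 && edata->elevel >= backtrace_min_level)))
        set_backtrace(edata, 2);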

Regards

Pavel





> we should test for backtrace_function[0] before edata->funcname, since
> it seems more likely to be unset.
>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
>
>
>


Do not check unlogged indexes on standby

2019-08-12 Thread Andrey Borodin
Hi hackers!

Currently, if we check unlogged indexes on a standby, we often get:

man-psbpshn0skhsxynd/xiva_xtable_testing_01 R # select bt_index_check('xiva_loadtest.pk_uid');
ERROR:  58P01: could not open file "base/16453/125407": No such file or directory

I think that we should print a warning and that's it. Amcheck should not give
false positives.
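
Roughly what I have in mind, as an untested sketch (assuming the index
relation is in a variable like indrel, as in verify_nbtree.c):

    /* Unlogged indexes have no data during recovery; skip them with a warning. */
    if (RecoveryInProgress() && !RelationNeedsWAL(indrel))
    {
        ereport(WARNING,
                (errmsg("cannot verify unlogged index \"%s\" during recovery, skipping",
                        RelationGetRelationName(indrel))));
        return;
    }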

Or maybe there are some design considerations that I am missing?


BTW I really want to enable rightlink-leftlink invariant validation on standby..

Thanks!

Best regards, Andrey Borodin.



0001-Do-not-verify-unlogged-indexes-during-amcheck.patch
Description: Binary data


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-12 Thread Antonin Houska
Masahiko Sawada  wrote:

> Attached are the draft version patch sets of per-tablespace transparent
> data-at-rest encryption. The patch doesn't support full functionality yet;
> it includes:
> 
> * Per-tablespace encryption
> * Encryption and decryption of buffer data at disk I/O
> * 2-tier key hierarchy and key rotation
> * Temporary file encryption (based on the patch Antonin proposed)
> * System catalog encryption
> * Generic key management API and test module
> * Simple TAP test

I've checked your patch series to find out how to adjust [1] to make future
merge easier.

The biggest issue I see is that front-end applications won't be able to load
the KMGR plugin. We also used some sort of external library in the first
version of [1], but when I was trying to make the front-ends aware of
encryption, I found out that dfmgr.c cannot be linked to them w/o significant
rework. So I gave up and moved the encrypt_block() and decrypt_block()
functions to the core.

A few more notes regarding key management:

* InitializeKmgr()

  ** the function probably does not have to acquire KeyringControlLock, for
  the same reason that load_keyring_file() does not (i.e. it's only called
  by the postmaster during startup)

  ** the lines

char *key = NULL;

 as well as

/* Get the master key */
key = KmgrPluginGetKey(KmgrCtl->masterKeyId);

Assert(key != NULL);

  should be enclosed in #ifdef USE_ASSERT_CHECKING ... #endif, otherwise I
  suppose (but haven't verified) the compiler will produce a warning that the
  variable is set but not used.
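
  A sketch of the wrapping I mean (using the names quoted above from the
  patch):

  #ifdef USE_ASSERT_CHECKING
  {
      char   *key;

      /* Get the master key */
      key = KmgrPluginGetKey(KmgrCtl->masterKeyId);
      Assert(key != NULL);
  }
  #endif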

  Actually, ERROR might be more suitable for an external (loadable) KMGR
  plugin, but, as explained above, I'm not sure such an approach is viable.

* KmgrPluginGetKey() only seems to deal with the master key, not with the
  tablespace keys, so I suggest that the name contain the word 'Master'.

* KmgrPluginRemoveKey() seems to be unused.

* KeyringCreateKey() - I wondered why the key is returned encrypted. Actually
  the only call of the function that I found is that in CreateTableSpace(),
  and it does not use the return value at all. Shouldn't KeyringGetKey()
  handle creation of the key if it does not exist yet?

* KeyringAddKey() seems to be unused.

* keyring size (kmgr.c):

/*
 * Since we have encryption keys per tablespace, we expect this value is
 * enough for most use cases.
 */
#define KMGR_KEYRING_SIZE 128

There's no guarantee that the number of tablespaces won't exceed any
(reasonably low) constant value. The KMGR module should be able to
allocate additional memory dynamically.
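
For instance, something along these lines (sketch only; the keyring and
KeyringKey names are made up, and if the keyring lives in shared memory a DSA
area or dshash table would be needed instead of repalloc):

    /* grow the keyring on demand instead of relying on a fixed-size array */
    if (keyring->nkeys >= keyring->capacity)
    {
        keyring->capacity *= 2;
        keyring->keys = repalloc(keyring->keys,
                                 keyring->capacity * sizeof(KeyringKey));
    }
    keyring->keys[keyring->nkeys++] = newkey;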


[1] https://commitfest.postgresql.org/23/2104/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com