Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-08 Thread Antonin Houska
Moon, Insung  wrote:

> Hello.
> 
> On Tue, Oct 8, 2019 at 8:52 PM Antonin Houska  wrote:
> >
> > Robert Haas  wrote:
> >
> > > On Mon, Oct 7, 2019 at 3:01 PM Antonin Houska  wrote:
> > > > However the design doesn't seem to be stable enough at the
> > > > moment for coding to make sense.
> > >
> > > Well, I think the question is whether working further on your patch
> > > could produce some things that everyone would agree are a step
> > > forward.
> >
> > It would have made a lot of sense several months ago (Masahiko Sawada
> > actually used parts of our patch in the previous version of his patch,
> > see [1]), but the requirement to use a different IV for each execution
> > of the encryption changes things quite a bit.
> >
> > Besides the relation pages and SLRU (CLOG), which are already being
> > discussed elsewhere in the thread, let's consider two other file types:
> >
> > * Temporary files (buffile.c): we derive the IV from PID of the process that
> >   created the file + segment number + block within the segment. This
> >   information does not change if you need to write the same block again. If
> >   new IV should be used for each encryption run, we can simply introduce an
> >   in-memory counter that generates the IV for each block. However it becomes
> >   trickier if the temporary file is shared by multiple backends. I think it
> >   might still be easier to expose the IV values to other backends via shared
> >   memory than to store them on disk ...
> 
> I think we can encrypt temporary files in a slightly different way.
> Previously I had a lot of trouble with IV uniqueness, so instead I have
> proposed a unique encryption key for each file.
> 
> First, in the CTR mode to be used, 32 bits of the 128-bit nonce value
> are used for the counter.
> The counter increases every time 16 bytes are encrypted, so
> theoretically a total of 64 GiB can be encrypted under the same 96-bit
> nonce.

> Therefore, in the case of buffile.c, which creates temporary files when
> work_mem runs out, each file can grow to at most 1GiB, so it can be
> encrypted safely enough with a simple IV value.
> The problem is that a vulnerability arises when the 96-bit nonce values
> (excluding the counter) collide.

I don't think the lower 32 bits impose any limitation, see
CRYPTO_ctr128_encrypt_ctr32() in OpenSSL: if this lower part overflows, the
upper part is simply incremented. So it's up to the user to decide what
portion of the IV he wants to control and what portion should be controlled by
OpenSSL internally. Of course the application design should be such that no
overflows into the upper (user specific) part occur because those would result
in duplicate IVs.

> I also tried at first to generate the IV using PID (32 bits) +
> tempCounter (64 bits), but in the worst case PID and tempCounter can
> repeat the same values.
> Therefore I focused on the uniqueness of the encryption key rather than
> the uniqueness of the IV value.

If you consider a 64-bit counter insufficient (here it seems that
tempCounter counts the 1GB segments), then we can't even use the LSN as
the IV for relation pages.

> The encryption key uses a separate key for each file, as described earlier.

Do you mean a separate key for the whole temporary file, or for a single (1GB)
segment?

> First, it generates a random hash value for the file, and derives the
> key from that hash value and the KEK (or MDEK) with HMAC-SHA256.
> In this case there is no need to store the encryption key separately,
> nor to keep the IV in a separate file or in memory.
> (The IV is a 64-bit hash value plus a 32-bit counter.)

You seem to miss the fact that a user of buffile.c can seek in the file
and rewrite an arbitrary part. Thus you'd have to generate a new key for
the part being changed.

I think it's easier to use the same key for the whole 1GB segment, if
not for the whole temporary file, and generate a unique IV each time we
write a chunk (BLCKSZ bytes).

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: WIP: raise error when submitting invalid ALTER SYSTEM command

2019-10-08 Thread Tom Lane
"Jordan Deitch"  writes:
> ALTER SYSTEM currently does not raise error upon invalid entry.

You mean on invalid combinations of entries.

> Take for example:
> alter system set superuser_reserved_connections = 10;
> ALTER SYSTEM
> alter system set max_connections = 5;
> ALTER SYSTEM

> The database will now fail to restart without manual intervention by way of 
> editing the autoconf file (which says "# Do not edit this file manually!" on 
> its first line).

Yeah.  That's unfortunate, but ...

> The attached WIP patch is intended to raise an error on invalid ALTER SYSTEM 
> commands before being written out to the filesystem. Hopefully this behavior 
> is more intuitive.

There's no chance that you can make this work.  We've had unpleasant
experiences with previous attempts to implement cross-checks between
GUC variables; in general, they created more problems than they fixed.

A specific issue with what you've got here is that it only checks values
that are proposed to be put into postgresql.auto.conf, without regard to
other value sources such as postgresql.conf or built-in defaults.  You
also managed to break the system's defenses against invalid combinations
that arise from such other sources --- taking out those error checks in
PostmasterMain is completely unsafe.

Also, even if you believe that O(N^2) behavior isn't a problem, this
programming approach doesn't scale to cases where more than two variables
contribute to an issue.  Somewhere around O(N^3) or O(N^4) there is
definitely going to be a threshold of pain.  This aspect doesn't seem
that hard to fix ... but it's just an efficiency issue, and doesn't
speak at all to the fundamental problem that you don't have enough
visibility into what the next postmaster run will be seeing.

Also, from a code maintenance standpoint, having code in
AlterSystemSetConfigFile that tries to know all about not only specific
GUCs, but every possible combination of specific GUCs, is just not going
to be maintainable.  (The real underlying problem there is that those
checks in PostmasterMain are merely the tip of the iceberg of error
conditions that might cause a postmaster startup failure.)

regards, tom lane




Re: maintenance_work_mem used by Vacuum

2019-10-08 Thread Masahiko Sawada
On Tue, Oct 8, 2019 at 2:45 PM Amit Kapila  wrote:
>
> On Tue, Oct 8, 2019 at 1:48 AM Peter Geoghegan  wrote:
> >
> > On Mon, Oct 7, 2019 at 12:28 PM Robert Haas  wrote:
> > > I would say that sucks, because it makes it harder to set
> > > maintenance_work_mem correctly.  Not sure how hard it would be to fix,
> > > though.
> >
> > ginInsertCleanup() may now be the worst piece of code in the entire
> > tree, so no surprise that it gets this wrong too.
> >
> > 2016's commit e2c79e14d99 ripped out the following comment about the
> > use of maintenance_work_mem by ginInsertCleanup():
> >
> > @@ -821,13 +847,10 @@ ginInsertCleanup(GinState *ginstate,
> >   * Is it time to flush memory to disk? Flush if we are at the end of
> >   * the pending list, or if we have a full row and memory is getting
> >   * full.
> > - *
> > - * XXX using up maintenance_work_mem here is probably unreasonably
> > - * much, since vacuum might already be using that much.
> >   */
> >
> > ISTM that the use of maintenance_work_mem wasn't given that much
> > thought originally.
> >
>
> One idea to do something better could be to check whether there is a GIN
> index on the table, and if so use 1/4 (25% or whatever) of
> maintenance_work_mem for the GIN indexes and 3/4 (75%) of
> maintenance_work_mem for collecting dead tuples.
>

I feel that it would not be easy for users to tune
maintenance_work_mem if it controls more than one thing.  If this is
an index AM (GIN) specific issue we might rather want to control the
memory limit of pending list cleanup by a separate GUC parameter akin
to gin_pending_list_limit, say gin_pending_list_work_mem.  And we can
then assign (the memory for GIN pending list cleanup / # of GIN
indexes) to each parallel worker.

Regards,

--
Masahiko Sawada




WIP: raise error when submitting invalid ALTER SYSTEM command

2019-10-08 Thread Jordan Deitch
Hi all,

ALTER SYSTEM currently does not raise an error upon invalid entry. Take for
example:

alter system set superuser_reserved_connections = 10;
> ALTER SYSTEM
alter system set max_connections = 5;
> ALTER SYSTEM

The database will now fail to restart without manual intervention by way of 
editing the autoconf file (which says "# Do not edit this file manually!" on 
its first line).

The attached WIP patch is intended to raise an error on invalid ALTER SYSTEM
commands before they are written out to the filesystem. Hopefully this
behavior is more intuitive.

Thanks
--
Jordan Deitch
https://id.rsa.pub/

001-alter-system-raise-error-on-invalid-combination.patch
Description: Binary data


Re: format of pg_upgrade loadable_libraries warning

2019-10-08 Thread Bruce Momjian
On Mon, Oct  7, 2019 at 01:47:57PM -0400, Bruce Momjian wrote:
> On Fri, Oct  4, 2019 at 11:55:21PM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > On Fri, Oct  4, 2019 at 05:40:08PM -0500, Justin Pryzby wrote:
> > >> I would argue to include this in 12.1, since 12 is what almost
> > >> everyone will use for upgrades, and a patch in .1 will help people
> > >> upgrading for 11 of the next 12 months.  (But your patch is more
> > >> general than mine.)
> > 
> > > No, there might be tools that depend on the existing format, and this is
> > > the first report of confusion I have read.
> > 
> > Translations will also lag behind any such change.  Speaking of which,
> > it might be a good idea to include some translator: annotations to
> > help translators understand what context these fragmentary phrases
> > are used in.  I'd actually say that my biggest concern with these
> > messages is whether they can translate into something sane.
> 
> Uh, I looked at the pg_upgrade code and the error message is:
> 
> pg_fatal("Your installation contains \"contrib/isn\" functions which rely on the\n"
>  "bigint data type.  Your old and new clusters pass bigint values\n"
>  "differently so this cluster cannot currently be upgraded.  You can\n"
>  "manually upgrade databases that use \"contrib/isn\" facilities and remove\n"
>  "\"contrib/isn\" from the old cluster and restart the upgrade.  A list of\n"
>  "the problem functions is in the file:\n"
>  "%s\n\n", output_path);
> 
> and the "in database" (which I have changed to capitalized "In database"
> in the attached patch), looks like:
> 
>fprintf(script, "In database: %s\n", active_db->db_name);
> 
> meaning it _isn't_ an output error message, but rather something that
> appears in an error file.  I don't think either of these are translated.
> Is that wrong?

Patch applied to head.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Block level parallel vacuum

2019-10-08 Thread Masahiko Sawada
On Fri, Oct 4, 2019 at 8:55 PM vignesh C  wrote:
>
> On Fri, Oct 4, 2019 at 4:18 PM Amit Kapila  wrote:
> >
> > On Wed, Oct 2, 2019 at 7:29 PM Masahiko Sawada  
> > wrote:
> >>
> >> On Sat, Sep 21, 2019 at 9:31 PM Amit Kapila  
> >> wrote:
> >> >
> One comment:

Thank you for reviewing this patch.

> We can check that parallel_workers is within range, i.e. within
> MAX_PARALLEL_WORKER_LIMIT.
> + int parallel_workers = 0;
> +
> + if (optarg != NULL)
> + {
> + parallel_workers = atoi(optarg);
> + if (parallel_workers <= 0)
> + {
> + pg_log_error("number of parallel workers must be at least 1");
> + exit(1);
> + }
> + }
>

Fixed.

Regards,

--
Masahiko Sawada




Re: [HACKERS] Block level parallel vacuum

2019-10-08 Thread Masahiko Sawada
On Sat, Oct 5, 2019 at 4:36 PM Dilip Kumar  wrote:
>
> Few more comments
> 
>
> 1.
> +static int
> +compute_parallel_workers(Relation onerel, int nrequested, int nindexes)
> +{
> + int parallel_workers;
> + bool leaderparticipates = true;
>
> Seems like this function is not using onerel parameter so we can remove this.
>

Fixed.

>
> 2.
> +
> + /* Estimate size for dead tuples -- PARALLEL_VACUUM_KEY_DEAD_TUPLES */
> + maxtuples = compute_max_dead_tuples(nblocks, true);
> + est_deadtuples = MAXALIGN(add_size(SizeOfLVDeadTuples,
> +mul_size(sizeof(ItemPointerData), maxtuples)));
> + shm_toc_estimate_chunk(&pcxt->estimator, est_deadtuples);
> + shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> + /* Finally, estimate VACUUM_KEY_QUERY_TEXT space */
> + querylen = strlen(debug_query_string);
>
> for consistency with other comments change
> VACUUM_KEY_QUERY_TEXT  to PARALLEL_VACUUM_KEY_QUERY_TEXT
>

Fixed.

>
> 3.
> @@ -2888,6 +2888,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
>   (!wraparound ? VACOPT_SKIP_LOCKED : 0);
>   tab->at_params.index_cleanup = VACOPT_TERNARY_DEFAULT;
>   tab->at_params.truncate = VACOPT_TERNARY_DEFAULT;
> + /* parallel lazy vacuum is not supported for autovacuum */
> + tab->at_params.nworkers = -1;
>
> What is the reason for the same?  Can we explain in the comments?

I think it's just that we don't want to support parallel autovacuum
because it could consume more CPU resources despite being a background
job, which might be unexpected behavior for autovacuum. I've updated
the comment.

Regards,

--
Masahiko Sawada




Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-08 Thread Moon, Insung
Hello.

On Tue, Oct 8, 2019 at 8:52 PM Antonin Houska  wrote:
>
> Robert Haas  wrote:
>
> > On Mon, Oct 7, 2019 at 3:01 PM Antonin Houska  wrote:
> > > However the design doesn't seem to be stable enough at the
> > > moment for coding to make sense.
> >
> > Well, I think the question is whether working further on your patch
> > could produce some things that everyone would agree are a step
> > forward.
>
> It would have made a lot of sense several months ago (Masahiko Sawada actually
> used parts of our patch in the previous version of his patch, see [1]), but
> the requirement to use a different IV for each execution of the encryption
> changes things quite a bit.
>
> Besides the relation pages and SLRU (CLOG), which are already being discussed
> elsewhere in the thread, let's consider two other file types:
>
> * Temporary files (buffile.c): we derive the IV from PID of the process that
>   created the file + segment number + block within the segment. This
>   information does not change if you need to write the same block again. If
>   new IV should be used for each encryption run, we can simply introduce an
>   in-memory counter that generates the IV for each block. However it becomes
>   trickier if the temporary file is shared by multiple backends. I think it
>   might still be easier to expose the IV values to other backends via shared
>   memory than to store them on disk ...

I think we can encrypt temporary files in a slightly different way.
Previously I had a lot of trouble with IV uniqueness, so instead I have
proposed a unique encryption key for each file.

First, in the CTR mode to be used, 32 bits of the 128-bit nonce value
are used for the counter.
The counter increases every time 16 bytes are encrypted, so
theoretically a total of 64 GiB can be encrypted under the same 96-bit
nonce.

Therefore, in the case of buffile.c, which creates temporary files when
work_mem runs out, each file can grow to at most 1GiB, so it can be
encrypted safely enough with a simple IV value.
The problem is that a vulnerability arises when the 96-bit nonce values
(excluding the counter) collide.

I also tried at first to generate the IV using PID (32 bits) +
tempCounter (64 bits), but in the worst case PID and tempCounter can
repeat the same values.
Therefore I focused on the uniqueness of the encryption key rather than
the uniqueness of the IV value.

The encryption key uses a separate key for each file, as described earlier.
First, it generates a random hash value for the file, and derives the
key from that hash value and the KEK (or MDEK) with HMAC-SHA256.
In this case there is no need to store the encryption key separately,
nor to keep the IV in a separate file or in memory.
(The IV is a 64-bit hash value plus a 32-bit counter.)

Also, currently the temporary file name is specified as
PID.tempFileCounter, but if this is changed to
PID.tempFileCounter.hashvalue, any process can encrypt and decrypt the
file.

Reference URL
https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#TODO_for_Full-Cluster_Encryption


>
> * "Buffered transient file". This is to be used instead of OpenTransientFile()
>   if user needs the option to encrypt the file. (Our patch adds this API to
>   buffile.c. Currently we use it in reorderbuffer.c to encrypt the data
>   changes produced by logical decoding, but there should be more use cases.)

Agreed.

Best regards.
Moon.

>
>   In this case we cannot keep the IVs in memory because user can close the
>   file anytime and open it much later. So we derive the IV by hashing the file
>   path. However if we should generate the IV again and again, we need to store
>   it on disk in another way, probably one IV value per block (PGAlignedBlock).
>
>   However since our implementation of both these file types shares some code,
>   it might yet be easier if the shared temporary file also stored the IV on
>   disk instead of exposing it via shared memory ...
>
> Perhaps this is what I can work on, but I definitely need some feedback.
>
> [1] 
> https://www.postgresql.org/message-id/CAD21AoBjrbxvaMpTApX1cEsO=8N=nc2xVZPB0d9e-VjJ=ya...@mail.gmail.com
>
> --
> Antonin Houska
> Web: https://www.cybertec-postgresql.com
>
>




Re: pgsql: Remove pqsignal() from libpq's official exports list.

2019-10-08 Thread Tom Lane
Christoph Berg  writes:
> Re: Tom Lane 2018-09-28 
>> Remove pqsignal() from libpq's official exports list.

> This is starting to hurt in several places:
> 04 11:41  mha@xindi:~$ psql
> 04 11:41  /usr/lib/postgresql/9.2/bin/psql: symbol lookup error:
>/usr/lib/postgresql/9.2/bin/psql: undefined symbol: 
> pqsignal

I poked into this a little.  Reviewing the commit history, pqsignal()
was a part of libpq (so far as frontend code is concerned) up until
9.3, when commit da5aeccf6 moved it into libpgport.  Since then we've
expected client programs to get it from libpgport not libpq, and indeed
they do so AFAICT --- I can reproduce the described failure with 9.2
and below psql linking to current libpq.so, but not with 9.3 and up.

libpq itself indeed has no need for pqsignal at all, if
--enable-thread-safety is set, which it usually is these days.

I also notice that we've made at least one nontrivial semantics change
to pqsignal: commit 873ab9721 made it use SA_RESTART for SIGALRM
handlers, which it did not do before 9.3.  So really, none of the
post-9.2 versions of libpq have faithfully duplicated what an older
client would expect from pqsignal.  This isn't at all academic, because
I see that pgbench uses pqsignal(SIGALRM,...), and so does pg_test_fsync.
Now, I don't see any indication that we've adjusted either of those
programs for the different behavior, so maybe it's fine.  But we've been
treating this function as strictly internal, and so I'm not pleased with
the idea of putting it back into the exported symbol list.

I'm especially not pleased with doing so to support pre-9.3 client
programs.  Those have been below our support horizon for some time;
notably, they (presumably) haven't been patched for CVE-2018-1058.
Why are you still shipping them in current OS releases?

> pg_repack linked against libpq5 11 breaks with libpq5 12:

This probably means it needs to be linked with libpgport
not only libpq.

Having said all that, if we conclude we can't break compatibility
with this legacy code quite yet, I'd be inclined to put a
separate, clearly-marked-as-legacy-code version of pqsignal()
back into libpq, using the pre-9.3 SA_RESTART semantics.
But I'd like some pretty well-defined sunset time for that,
because it'd just be trouble waiting to happen.  When are
you going to remove 9.2 psql?

regards, tom lane




Re: Non-null values of recovery functions after promote or crash of primary

2019-10-08 Thread Martín Marqués
Hi,

> IMV, and not unlike other similar cases I've talked about on another
> thread, these should be cleared when the system is promoted as they're
> otherwise confusing and nonsensical.

Keep in mind that this also happens when the server crashes and has to
perform crash recovery. In that case the server was always a primary.

--
Martín Marqués  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-08 Thread Robert Haas
On Tue, Oct 8, 2019 at 7:52 AM Antonin Houska  wrote:
> * Temporary files (buffile.c): we derive the IV from PID of the process that
>   created the file + segment number + block within the segment. This
>   information does not change if you need to write the same block again. If
>   new IV should be used for each encryption run, we can simply introduce an
>   in-memory counter that generates the IV for each block. However it becomes
>   trickier if the temporary file is shared by multiple backends. I think it
>   might still be easier to expose the IV values to other backends via shared
>   memory than to store them on disk ...
>
> * "Buffered transient file". This is to be used instead of OpenTransientFile()
>   if user needs the option to encrypt the file. (Our patch adds this API to
>   buffile.c. Currently we use it in reorderbuffer.c to encrypt the data
>   changes produced by logical decoding, but there should be more use cases.)
>
>   In this case we cannot keep the IVs in memory because user can close the
>   file anytime and open it much later. So we derive the IV by hashing the file
>   path. However if we should generate the IV again and again, we need to store
>   it on disk in another way, probably one IV value per block (PGAlignedBlock).
>
>   However since our implementation of both these file types shares some code,
>   it might yet be easier if the shared temporary file also stored the IV on
>   disk instead of exposing it via shared memory ...
>
> Perhaps this is what I can work on, but I definitely need some feedback.

I think this would be a valuable thing upon which to work. I'm not
sure exactly what the right solution is, but it seems to me that it
would be a good thing if we tried to reuse the same solution in as
many places as possible. I don't know if it's realistic to use the
same method for storing IVs for temporary/transient files as we do for
SLRUs, but it would be nice if it were.

I think that one problem with trying to store the data in memory is
that these files get big enough that N bytes/block could still be
pretty big. For instance, if you're sorting 100GB of data with 8GB of
work_mem, you'll need to write 13 tapes and then merge them. Supposing
an IV of 12 bytes/block, the IV vector for each 8GB tape will be 12MB,
so once you've written all 13 tapes and are ready to merge them,
you're going to have 156MB of IV data floating around. If you keep it
in memory, it ought to count against your work_mem budget, and while
it's not a big fraction of your available memory, it's also not
negligible.  Worse (but less realistic) cases can also be constructed.
To avoid this kind of problem, you could write the IV data to disk.
But notice that tuplesort.c goes to a lot of work to make I/O
sequential, and that helps performance.  If you have to intersperse
reads of separate IV files with the reads of the main data files,
you're going to degrade the I/O pattern. It would really be best if
the IVs were in line with the data itself, I think. (The same probably
applies, and for not unrelated reasons, to SLRU data, if we're going
to try to encrypt that.)

Now, if you could store some kind of an IV "seed" where we only need
one per buffile rather than one per block, then that'd probably be
fine to store in memory. But I don't see how that would work given
that we can overwrite already-written blocks and need a new IV if we
do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: abort-time portal cleanup

2019-10-08 Thread Andres Freund
Hi,

On 2019-10-07 12:14:52 -0400, Robert Haas wrote:
> > - if (portal->status == PORTAL_READY)
> > - MarkPortalFailed(portal);
> >
> > Why it is safe to remove this check?  It has been explained in commit
> > 7981c342 why we need that check.  I don't see any explanation in email
> > or patch which justifies this code removal.  Is it because you removed
> > PortalCleanup?  If so, that is still called from PortalDrop?
>
> All MarkPortalFailed() does is change the status to PORTAL_FAILED and
> call the cleanup hook. PortalDrop() calls the cleanup hook, and we
> don't need to change the status if we're removing it completely.

Note that currently PortalCleanup() behaves differently depending on
whether the portal is set to failed or not...

- Andres




Re: v12 and pg_restore -f-

2019-10-08 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Andrew Gierth  writes:
> > "Tom" == Tom Lane  writes:
> >  Tom> Perhaps we could change the back branches so that they interpret
> >  Tom> "-f -" as "write to stdout", but without enforcing that you use
> >  Tom> that syntax.
> 
> > We should definitely do that.

I agree that this would be a reasonable course of action.  Really, it
should have always meant that...

> >  Tom> Alternatively, we could revert the v12 behavior change. On the
> >  Tom> whole that might be the wiser course. I do not think the costs and
> >  Tom> benefits of this change were all that carefully thought through.
> 
> > Failing to specify -d is a _really fricking common_ mistake for
> > inexperienced users, who may not realize that the fact that they're
> > seeing a ton of SQL on their terminal is not the normal result.
> > Seriously, this comes up on a regular basis on IRC (which is why I
> > suggested initially that we should do something about it).
> 
> No doubt, but that seems like a really poor excuse for breaking
> maintenance scripts in a way that basically can't be fixed.  Even
> with the change suggested above, scripts couldn't rely on "-f -"
> working anytime soon, because you couldn't be sure whether a
> back-rev pg_restore had the update or not.

Maintenance scripts break across major versions.  We completely
demolished everything around how recovery works, and some idea that you
could craft up something easy that would work in a backwards-compatible
way is outright ridiculous, so I don't see why we're so concerned about
a change to how pg_restore works here.

> The idea I'm leaning to after more thought is that we should change
> *all* the branches to accept "-f -", but not throw an error if you
> don't use it.  Several years from now, we could put the error back in;
> but not until there's a plausible argument that nobody is running
> old versions of pg_restore anymore.

No, I don't agree with this, at all.

Thanks,

Stephen




Re: PATCH: Add uri percent-encoding for binary data

2019-10-08 Thread Anders Åstrand
On Mon, Oct 7, 2019 at 11:38 PM Isaac Morland  wrote:
>
> On Mon, 7 Oct 2019 at 03:15, Anders Åstrand  wrote:
>>
>> Hello
>>
>> Attached is a patch for adding uri as an encoding option for
>> encode/decode. It uses what's called "percent-encoding" in rfc3986
>> (https://tools.ietf.org/html/rfc3986#section-2.1).
>>
>> The background for this patch is that I could easily build urls in
>> plpgsql, but doing the actual encoding of the url parts is painfully
>> slow. The list of available encodings for encode/decode looks quite
>> arbitrary to me, so I can't see any reason this one couldn't be in
>> there.
>>
>> In modern web scenarios one would probably most likely want to encode
>> the utf8 representation of a text string for inclusion in a url, in
>> which case correct invocation would be ENCODE(CONVERT_TO('some text in
>> database encoding goes here', 'UTF8'), 'uri'), but uri
>> percent-encoding can of course also be used for other text encodings
>> and arbitrary binary data.
>
>
> This seems like a useful idea to me. I've used the equivalent in Python and 
> it provides more options:
>
> https://docs.python.org/3/library/urllib.parse.html#url-quoting
>
> I suggest reviewing that documentation there, because there are a few details 
> that need to be checked carefully. Whether or not space should be encoded as 
> plus and whether certain byte values should be exempt from %-encoding is 
> something that depends on the application. Unfortunately, as far as I can 
> tell there isn't a single version of URL encoding that satisfies all 
> situations (thus explaining the complexity of the Python implementation). It 
> might be feasible to suppress some of the Python options (I'm wondering about 
> the safe= parameter) but I'm pretty sure you at least need the equivalent of 
> quote and quote_plus.

Thanks a lot for your reply!

I agree that some (but not all) of the options available to that
python lib could be helpful for developers wanting to build urls
without having to encode the separate parts of it and stitching it
together, but not necessary for this patch to be useful. For generic
uri encoding the slash (/) must be percent encoded, because it has
special meaning in the standard. Some other extra characters may
appear unencoded though depending on context, but it's generally safer
to just encode them all and not hope that the encoder will know about
the context and skip over certain characters.

This does bring up an interesting point however. Maybe decode should
validate that only characters that are allowed unencoded appear in the
input?

Luckily, the plus-encoding of spaces is not part of the uri standard
at all, but instead part of the format referred to as
application/x-www-form-urlencoded. Fortunately that format is
close to dying now that forms more often post JSON.

Regards,
Anders




Re: Non-null values of recovery functions after promote or crash of primary

2019-10-08 Thread Stephen Frost
Greetings,

* Martín Marqués (mar...@2ndquadrant.com) wrote:
> pg_last_wal_receive_lsn()
> pg_last_wal_replay_lsn()
> pg_last_xact_replay_timestamp()
> 
> Under normal circumstances we would expect to receive NULLs from all
> three functions on a primary node, and code comments back up my thoughts.

Agreed.

> The problem is, what if the node is a standby which was promoted without
> restarting, or that had to perform crash recovery?
> 
> So during the time it's recovering the values in ` XLogCtl` are updated
> with recovery information, and once the recovery finishes, due to crash
> recovery reaching a consistent state, or a promotion of a standby
> happening, those values are not reset to startup defaults.
> 
> That's when you start seeing non-null values returned by
> `pg_last_wal_replay_lsn()`and `pg_last_xact_replay_timestamp()`.
> 
> Now, I don't know if we should call this a bug, or an undocumented
> anomaly. We could fix the bug by resetting the values from ` XLogCtl`
> after finishing recovery, or document that we might see non-NULL values
> in certain cases.

IMV, and not unlike other similar cases I've talked about on another
thread, these should be cleared when the system is promoted as they're
otherwise confusing and nonsensical.

Thanks,

Stephen




Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-08 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> Unless we are *absolutely* certain, I bet someone will be able to find a
> side-channel that somehow leaks some data or data-about-data, if we don't
> encrypt everything. If nothing else, you can get use patterns out of it,
> and you can make a lot from that. (E.g. by whether transactions are using
> multixacts or not you can potentially determine which transaction they are,
> if you know what type of transactions are being issued by the application.
> In the simplest case, there might be a single pattern where multixacts end
> up actually being used, and in that case being able to see the multixact
> data tells you a lot about the system).

Thanks for bringing up the concern but this still doesn't strike me, at
least, as being a huge gaping hole that people will have large issues
with.  In other words, I don't agree that this is a high bandwidth side
channel and I don't think that it, alone, brings up a strong need to
encrypt clog and multixact.

> As for other things -- by default, we store the log files in text format in
> the data directory. That contains *loads* of sensitive data in a lot of
> cases. Will those also be encrypted?

imv, this is a largely independent thing, as I said elsewhere, and has
its own set of challenges and considerations to deal with.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: PATCH: Add uri percent-encoding for binary data

2019-10-08 Thread Anders Åstrand
On Mon, Oct 7, 2019 at 9:52 PM Bruce Momjian  wrote:
>
> On Mon, Oct  7, 2019 at 09:14:38AM +0200, Anders Åstrand wrote:
> > Hello
> >
> > Attached is a patch for adding uri as an encoding option for
> > encode/decode. It uses what's called "percent-encoding" in rfc3986
> > (https://tools.ietf.org/html/rfc3986#section-2.1).
>
> Oh, that's a cool idea.  Can you add it to the commit-fest?
>
> https://commitfest.postgresql.org/25/
>
>

Thanks for your reply! I added it but was unsure of what topic was
appropriate and couldn't find a description of them anywhere. I went
with Miscellaneous for now.




Re: pg_init

2019-10-08 Thread Antonin Houska
Tomas Vondra  wrote:

> On Tue, Oct 08, 2019 at 10:03:03PM +0530, Natarajan R wrote:
> >I want to read pg_database from pg_init...
> >
> >Is it possible to use heap_open()? Or is there any other way?
> 
> This is way too vague a question - I have no idea what you mean by
> pg_init, for example. And it's probably a good idea to explain what
> you're trying to achieve.

This question was familiar to me so I searched the archives. It seems related
to

https://www.postgresql.org/message-id/17058.1570166272%40sss.pgh.pa.us

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: pg_init

2019-10-08 Thread Tomas Vondra

On Tue, Oct 08, 2019 at 10:03:03PM +0530, Natarajan R wrote:

I want to read pg_database from pg_init...

Is it possible to use heap_open()? Or is there any other way?


This is way too vague a question - I have no idea what you mean by
pg_init, for example. And it's probably a good idea to explain what
you're trying to achieve.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pg_init

2019-10-08 Thread Natarajan R
I want to read pg_database from pg_init...

Is it possible to use heap_open()? Or is there any other way?


Re: Standby accepts recovery_target_timeline setting?

2019-10-08 Thread Stephen Frost
Greetings,

* Fujii Masao (masao.fu...@gmail.com) wrote:
> On Tue, Oct 8, 2019 at 10:58 PM Stephen Frost  wrote:
> > Having these options end up set but then hacking all of the other code
> > that looks at them to check if we're actually in recovery or not would
> > end up being both confusing to users as well as an ongoing source of
> > bugs (which has already been made clear by the fact that we're having
> > this discussion...).  Wouldn't that also mean we would need to hack the
> > 'show' code, to blank out the recovery_target_name variable if we aren't
> > in recovery?  Ugh.
> 
Isn't this overkill? This doesn't seem to be a problem only for
recovery-related settings. We already have a similar issue with other
settings. For example, the log_directory parameter is ignored when
logging_collector is not enabled. But SHOW log_directory reports the
setting value even when logging_collector is disabled. This seems like
a similar issue and might be confusing, but we could live with that.

I agree it's a similar issue.  I disagree that it's actually sensible
for us to do so and would rather contend that it's confusing and not
good.

We certainly do a lot of smart things in PG, but showing the value of
variables that aren't accurate, and we *know* they aren't, hardly seems
like something we should be saying "this is good and ok, so let's do
more of this."

I'd rather argue that this just shows that we need to come up with a
solution in this area.  I don't think it's *as* big of a deal when it
comes to logging_collector/log_directory because, at least there, you
don't even start to get into the same code paths where it matters, like
you end up doing with the recovery targets and crash recovery, so the
chances of bugs creeping in are less in the log_directory case.

I still don't think it's great though and, yes, would prefer that we
avoid having log_directory set when logging_collector is not in use.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-08 Thread Craig Ringer
On Thu, 11 Jul 2019 at 04:34, Tom Lane  wrote:

> Robert Haas  writes:
> > On Wed, Jul 10, 2019 at 9:59 AM Dave Cramer  wrote:
> >> I'm still a bit conflicted about what to do with search_path as I do
> believe this is potentially a security issue.
> >> It may be that we always want to report that and possibly back patch it.
>
> > I don't see that as a feasible option unless we make the logic that
> > does the reporting smarter.  If it changes transiently inside of a
> > security-definer function, and then changes back, my recollection is
> > that right now we would report both changes.  I think that could cause
> > a serious efficiency problem if you are calling such a function in a
> > loop.
>
> And, even more to the point, what's the client side going to do with
> the information?  If there was a security problem inside the
> security-definer function, it's too late.  And the client can't do
> much about it anyway.
>

Yep. For search_path I definitely agree.

In other places I've (ab)used GUC_REPORT to push information back to the
client as a workaround for the lack of protocol extensibility / custom
messages. It's a little less ugly than abusing NOTICE messages. I'd prefer
a real way to add optional/extension messages, but in the absence of that
extension-defined GUCs may have good reasons to want to report multiple
changes within a single statement/function/etc.

With that said, most of the time I think we could argue that it's
reasonable to fire a GUC_REPORT if the *top-level* GUC nestlevel values
change. That'd solve the search_path spam issue while still giving Dave
what he wants, amongst other things.

BTW, a good argument for the client wanting to be notified of search_path
changes might be for clients that want to cache name=>oid mappings for
types, relations, etc. The JDBC driver would be able to benefit from that
if it could trust that the same name would map to the same type (for a
top-level statement) in future, but currently it can't.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-08 Thread Craig Ringer
On Fri, 12 Jul 2019 at 01:27, Robert Haas  wrote:

>
> We have steadfastly refused to provide protocol-level tools for things
> like "please change my user ID, and don't let anyone change it again
> via SQL," and that's a huge problem for things like connection poolers
> which can't parse all the SQL flowing through the connection (because
> figuring out what it does requires solving the Halting Problem) and
> wouldn't want to if they could for performance reasons. I think that's
> a huge mistake.


I very strongly agree. The inability to limit SET and RESET of SESSION
AUTHORIZATION and ROLE is a huge pain point and it's far from the only one.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Created feature for to_date() conversion using patterns 'YYYY-WW', 'YYYY-WW-D', 'YYYY-MM-W' and 'YYYY-MM-W-D'

2019-10-08 Thread Mark Lorenz

Hi,

I apologize for the mistake.

For the mailing list correspondence I created this mail account, but I
forgot to change the sender name, so "postgres" appeared as the sender
name in the mailing list. I have changed it.


Kind regards,
Mark/S-Man42


Hi,

some days ago I ran into a problem with the to_date() function. I
originally described it on StackExchange:
https://dba.stackexchange.com/questions/250111/unexpected-behaviour-for-to-date-with-week-number-and-week-day

The problem:

If you want to parse a date string with year, week and day of week,
you can do this using the ISO week pattern: 'IYYY-IW-ID'. This works
as expected:

date string |  to_date()
+
'2019-1-1'  |  2018-12-31  -> Monday of the first week of the year
(defined as the week that includes the 4th of January)
'2019-1-2'  |  2019-01-01
'2019-1-3'  |  2019-01-02
'2019-1-4'  |  2019-01-03
'2019-1-5'  |  2019-01-04
'2019-1-6'  |  2019-01-05
'2019-1-7'  |  2019-01-06

'2019-2-1'  |  2019-01-07
'2019-2-2'  |  2019-01-08
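For reference, Python's ISO-week strptime directives implement the same ISO 8601 week-date rules, so the expected 'IYYY-IW-ID' results above can be reproduced with a short sketch (the helper name is mine, for illustration only):

```python
from datetime import datetime

def iso_week_to_date(s):
    # %G = ISO year, %V = ISO week, %u = ISO weekday (Monday = 1)
    # -- the same week-date rules as PostgreSQL's 'IYYY-IW-ID'.
    return datetime.strptime(s, "%G-%V-%u").date()
```

For example, `iso_week_to_date("2019-1-1")` gives 2018-12-31, matching the first row of the table.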

But if you try this with the non-ISO pattern 'YYYY-WW-D', the
result is not what you would expect:

date string |  to_date()
-
'2019-1-1'  |  2019-01-01
'2019-1-2'  |  2019-01-01
'2019-1-3'  |  2019-01-01
'2019-1-4'  |  2019-01-01
'2019-1-5'  |  2019-01-01
'2019-1-6'  |  2019-01-01
'2019-1-7'  |  2019-01-01

'2019-2-1'  |  2019-01-08
'2019-2-2'  |  2019-01-08

As you can see, the 'D' part of the pattern doesn't influence the
resulting date.

Laurenz Albe's answer pointed to a part of the documentation I had
missed so far:

"In to_timestamp and to_date, weekday names or numbers (DAY, D, and
related field types) are accepted but are ignored for purposes of
computing the result. The same is true for quarter (Q) fields."
(https://www.postgresql.org/docs/12/functions-formatting.html)

So, I had a look at the relevant code part. I decided to try a patch
by myself. Now it works as I would expect it:

date string |  to_date()
-
'2019-1-1'  |  2018-12-30 -> Sunday (!) of the first week of the year
(the first week is at the first day of year)
'2019-1-2'  |  2018-12-31
'2019-1-3'  |  2019-01-01
'2019-1-4'  |  2019-01-02
'2019-1-5'  |  2019-01-03
'2019-1-6'  |  2019-01-04
'2019-1-7'  |  2019-01-05

'2019-2-1'  |  2019-01-06
'2019-2-2'  |  2019-01-07

Furthermore, if you leave out the 'D' part, the date is always set to
the first day of the corresponding week (in that case it is Sunday, in
contrast to the ISO week, which starts on Mondays).

To be consistent, I added similar code for the week-of-month pattern
('W'). So, using the pattern 'YYYY-MM-W-D' yields:

date string   |  to_date()
---
'2018-12-5-1' |  2018-12-23
'2018-12-6-1' |  2018-12-30
'2019-1-1-1'  |  2018-12-30 -> First day (Su) of the first week of the
first month of the year
'2019-2-2-1'  |  2019-02-03 -> First day (Su) of the second week of 
February
'2019-10-3-5' |  2019-10-17 -> Fifth day (Th) of the third week of 
October


If you leave out the 'D', it is set to 1 as well.
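My reading of the proposed behaviour, based on the tables above, can be sketched in a few lines (illustrative Python, not the patch code; the helper names are mine):

```python
from datetime import date, timedelta

def to_date_ww_d(year, week, day):
    # Sketch of the proposed 'YYYY-WW-D' semantics: week 1 is the week
    # containing Jan 1, weeks start on Sunday, and D counts from Sunday=1.
    jan1 = date(year, 1, 1)
    # Python's weekday(): Monday=0 .. Sunday=6, so this steps back to the
    # Sunday on or before Jan 1.
    week_start = jan1 - timedelta(days=(jan1.weekday() + 1) % 7)
    return week_start + timedelta(days=(week - 1) * 7 + (day - 1))

def to_date_mm_w_d(year, month, week, day):
    # Same idea for 'YYYY-MM-W-D': week 1 is the week containing the 1st
    # of the given month.
    first = date(year, month, 1)
    week_start = first - timedelta(days=(first.weekday() + 1) % 7)
    return week_start + timedelta(days=(week - 1) * 7 + (day - 1))
```

With these, `to_date_ww_d(2019, 1, 1)` gives 2018-12-30 and `to_date_mm_w_d(2019, 10, 3, 5)` gives 2019-10-17, matching the tables above.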

The code can be seen here:
https://github.com/S-Man42/postgres/commit/534e6bd70e23864f385d60ecf187496c7f4387c9

I hope, keeping the code style of the surrounding code (especially the
ISO code) is ok for you.

Now the questions:
1. Although the ignoring of the 'D' pattern is well documented, might
the new behaviour be interesting for you?
2. Does it work as you'd expect it?
3. Because this could be my very first contribution to the PostgreSQL
code base, I really want you to be as critical as possible. I am not
quite sure if I didn't miss something important.
4. Currently something like '2019-1-8' does not throw an exception but
results in the same as '2019-2-1' (the 8th is the same as the 1st of
the next week). On the other hand, the ISO week conversion currently
gives the result of '2019-1-7' for every 'D' >= 7. I am not sure which
is better. I think consistent exception handling should be discussed
separately (date roll-over vs. out-of-range exception vs. ISO week
behaviour).

So far, I am very curious about your opinions!

Kind regards,
Mark/S-Man42





Re: Standby accepts recovery_target_timeline setting?

2019-10-08 Thread Fujii Masao
On Tue, Oct 8, 2019 at 10:58 PM Stephen Frost  wrote:
>
> Greetings,
>
> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Mon, Oct 7, 2019 at 9:14 AM Fujii Masao  wrote:
> > > Agreed, too. Do you have any idea to implement that? I've not found out
> > > "smart" way to do that yet.
> > >
> > > One idea is, as Michael suggested, to use SetConfigOption() for all the
> > > archive recovery parameters at the beginning of the startup process as 
> > > follows,
> > > to forcibly set the default values if crash recovery is running. But this
> > > seems not smart for me.
> > >
> > > SetConfigOption("restore_command", ...);
> > > SetConfigOption("archive_cleanup_command", ...);
> > > SetConfigOption("recovery_end_command", ...);
> > > ...
> > >
> > > Maybe we should make GUC mechanism notice signal files and ignore
> > > archive recovery-related parameters when none of those files exist?
> > > This change seems overkill at least in v12, though.
> >
> > I think this approach is going in the wrong direction. In every other
> > part of the system, it's the job of the code around the GUC system to
> > use parameters when they're relevant and ignore them when they should
> > be ignored. Deciding that the parameters that were formerly part of
> > recovery.conf are an exception to that rule and that the GUC system is
> > responsible for making sure they're set only when we pay attention to
> > them seems like it's bringing back or exacerbating a code-level split
> > between recovery.conf parameters and postgresql.conf parameters when,
> > meanwhile, we've been wanting to eradicate that split so that the
> > things we allow for postgresql.conf parameters -- e.g. changing them
> > while they are running -- can be applied to these parameters also.
>
> I don't think we necessarily need to be thinking about trying to
> eliminate all differences between certain former recovery.conf settings
> and things like work_mem, even as we make it such that those former
> settings can be changed while we're running.
>
> > I don't particularly like the use of SetConfigOption() either,
> > although it does have some precedent in autovacuum, for example.
>
> It's pretty explicitly the job of SetConfigOption to manage the fact
> that only certain options can be set at certain times, as noted at the
> top of guc.h where we're talking about GUC contexts (and which
> SetConfigOption references as being what its job is to manage -
> guc.c:6776 currently).
>
> > Generally, it's right and proper that the GUC system sets the
> > variables to which the parameters it controls are tied -- and then the
> > rest of the code has to do the right thing around that. It sounds like
> > the patch that got rid of recovery.conf wasn't considered carefully
> > enough, and missed the fact that it was introducing some inadvertent
> > behavior changes. That's too bad, but let's not overreact. It seems
> > totally fine to me to just add ad-hoc checks that rule out
> > inappropriately relying on these parameters while performing crash
> > recovery - and be done with it.

Yeah, I agree.

> The patch that got rid of recovery.conf also removed the inherent
> understanding and knowledge that there are certain options that can only
> be set (and make sense ...) at certain times- namely, when we're doing
> recovery.  Having these options set at other times is entirely wrong and
> will be confusing to both users, and, as seen, code.  From a user
> perspective, what happens when you've started up PG as a primary, since
> you don't have a signal file in place to indicate that you're doing
> recovery, and you have a recovery_target set, so some user does
> "show recovery_target_name" and sees a value?  How is that sensible?
>
> Those options should only be set when we're actually doing recovery,
> which is governed by the signal file.  Recovery is absolutely a specific
> kind of state that the system is in, not unlike postmaster, we've even
> got a specific pg_is_in_recovery() function for it.
>
> Having these options end up set but then hacking all of the other code
> that looks at them to check if we're actually in recovery or not would
> end up being both confusing to users as well as an ongoing source of
> bugs (which has already been made clear by the fact that we're having
> this discussion...).  Wouldn't that also mean we would need to hack the
> 'show' code, to blank out the recovery_target_name variable if we aren't
> in recovery?  Ugh.

Isn't this overkill? This doesn't seem to be a problem only for
recovery-related settings. We already have a similar issue with other
settings. For example, the log_directory parameter is ignored when
logging_collector is not enabled. But SHOW log_directory reports the
setting value even when logging_collector is disabled. This seems like
a similar issue and might be confusing, but we could live with that.

Regards,

-- 
Fujii Masao




Re: Ordering of header file inclusion

2019-10-08 Thread Tom Lane
Amit Kapila  writes:
> On Wed, Oct 2, 2019 at 2:57 PM vignesh C  wrote:
>> I noticed that some of the header file inclusions are not ordered as
>> per the usual standard that is followed.
>> The attached patch contains the fix for the order in which the header
>> files are included.
>> Let me know your thoughts on the same.

> +1.

FWIW, I'm not on board with reordering system-header inclusions.
Some platforms have (had?) ordering dependencies for those, and where
that's true, it's seldom alphabetical.  It's only our own headers
where we can safely expect that any arbitrary order will work.

> I think we shouldn't remove the extra line as part of the above change.

I would take out the blank lines between our own #includes.  Those are
totally arbitrary and unnecessary.  The whole point of style rules like
this one is to reduce the differences between the way one person might
write something and the way that someone else might.  Deciding to throw
in a blank line is surely in the realm of unnecessary differences.

regards, tom lane




Re: pg_upgrade fails with non-standard ACL

2019-10-08 Thread Stephen Frost
Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> Cool. It seems that everyone agrees on this patch.

Thanks for working on this, I took a quick look over it and I do have
some concerns.

> I attached the updated version. Now it prints a better error message
> and generates an SQL script instead of multiple warnings. The attached test
> script shows that.

Have you tested this with extensions, where the user has changed the
privileges on the extension?  I'm concerned that we'll throw out
warnings and tell users to go REVOKE privileges on any case where the
privileges on an extension's object were changed, but that shouldn't be
necessary and we should be excluding those.

Changes to privileges on extensions should be handled just fine using
the existing code, at least for upgrades of PG.

Otherwise, at least imv, the code could use more comments (inside the
functions, not just function-level...) and there's a few wording
improvements that could be made.  Again, not a full endorsement, as I
just took a quick look, but it generally seems like a reasonable
approach to go in and the issue with extensions was the only thing that
came to mind as a potential problem.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Standby accepts recovery_target_timeline setting?

2019-10-08 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Oct 7, 2019 at 9:14 AM Fujii Masao  wrote:
> > Agreed, too. Do you have any idea to implement that? I've not found out
> > "smart" way to do that yet.
> >
> > One idea is, as Michael suggested, to use SetConfigOption() for all the
> > archive recovery parameters at the beginning of the startup process as 
> > follows,
> > to forcibly set the default values if crash recovery is running. But this
> > seems not smart for me.
> >
> > SetConfigOption("restore_command", ...);
> > SetConfigOption("archive_cleanup_command", ...);
> > SetConfigOption("recovery_end_command", ...);
> > ...
> >
> > Maybe we should make GUC mechanism notice signal files and ignore
> > archive recovery-related parameters when none of those files exist?
> > This change seems overkill at least in v12, though.
> 
> I think this approach is going in the wrong direction. In every other
> part of the system, it's the job of the code around the GUC system to
> use parameters when they're relevant and ignore them when they should
> be ignored. Deciding that the parameters that were formerly part of
> recovery.conf are an exception to that rule and that the GUC system is
> responsible for making sure they're set only when we pay attention to
> them seems like it's bringing back or exacerbating a code-level split
> between recovery.conf parameters and postgresql.conf parameters when,
> meanwhile, we've been wanting to eradicate that split so that the
> things we allow for postgresql.conf parameters -- e.g. changing them
> while they are running -- can be applied to these parameters also.

I don't think we necessarily need to be thinking about trying to
eliminate all differences between certain former recovery.conf settings
and things like work_mem, even as we make it such that those former
settings can be changed while we're running.

> I don't particularly like the use of SetConfigOption() either,
> although it does have some precedent in autovacuum, for example.

It's pretty explicitly the job of SetConfigOption to manage the fact
that only certain options can be set at certain times, as noted at the
top of guc.h where we're talking about GUC contexts (and which
SetConfigOption references as being what its job is to manage -
guc.c:6776 currently).

> Generally, it's right and proper that the GUC system sets the
> variables to which the parameters it controls are tied -- and then the
> rest of the code has to do the right thing around that. It sounds like
> the patch that got rid of recovery.conf wasn't considered carefully
> enough, and missed the fact that it was introducing some inadvertent
> behavior changes. That's too bad, but let's not overreact. It seems
> totally fine to me to just add ad-hoc checks that rule out
> inappropriately relying on these parameters while performing crash
> recovery - and be done with it.

The patch that got rid of recovery.conf also removed the inherent
understanding and knowledge that there are certain options that can only
be set (and make sense ...) at certain times- namely, when we're doing
recovery.  Having these options set at other times is entirely wrong and
will be confusing to both users, and, as seen, code.  From a user
perspective, what happens when you've started up PG as a primary, since
you don't have a signal file in place to indicate that you're doing
recovery, and you have a recovery_target set, so some user does
"show recovery_target_name" and sees a value?  How is that sensible?

Those options should only be set when we're actually doing recovery,
which is governed by the signal file.  Recovery is absolutely a specific
kind of state that the system is in, not unlike postmaster, we've even
got a specific pg_is_in_recovery() function for it.

Having these options end up set but then hacking all of the other code
that looks at them to check if we're actually in recovery or not would
end up being both confusing to users as well as an ongoing source of
bugs (which has already been made clear by the fact that we're having
this discussion...).  Wouldn't that also mean we would need to hack the
'show' code, to blank out the recovery_target_name variable if we aren't
in recovery?  Ugh.

Thanks,

Stephen


signature.asc
Description: PGP signature


Created feature for to_date() conversion using patterns 'YYYY-WW', 'YYYY-WW-D', 'YYYY-MM-W' and 'YYYY-MM-W-D'

2019-10-08 Thread postgres

Hi,

some days ago I ran into a problem with the to_date() function. I 
originally described it on StackExchange:

https://dba.stackexchange.com/questions/250111/unexpected-behaviour-for-to-date-with-week-number-and-week-day

The problem:

If you want to parse a date string with year, week and day of week, you 
can do this using the ISO week pattern: 'IYYY-IW-ID'. This works as 
expected:


date string |  to_date()
+
'2019-1-1'  |  2018-12-31  -> Monday of the first week of the year 
(defined as the week that includes the 4th of January)

'2019-1-2'  |  2019-01-01
'2019-1-3'  |  2019-01-02
'2019-1-4'  |  2019-01-03
'2019-1-5'  |  2019-01-04
'2019-1-6'  |  2019-01-05
'2019-1-7'  |  2019-01-06

'2019-2-1'  |  2019-01-07
'2019-2-2'  |  2019-01-08

But if you try this with the non-ISO pattern 'YYYY-WW-D', the
result is not what you would expect:


date string |  to_date()
-
'2019-1-1'  |  2019-01-01
'2019-1-2'  |  2019-01-01
'2019-1-3'  |  2019-01-01
'2019-1-4'  |  2019-01-01
'2019-1-5'  |  2019-01-01
'2019-1-6'  |  2019-01-01
'2019-1-7'  |  2019-01-01

'2019-2-1'  |  2019-01-08
'2019-2-2'  |  2019-01-08

As you can see, the 'D' part of the pattern doesn't influence the 
resulting date.


Laurenz Albe's answer pointed to a part of the documentation I had
missed so far:


"In to_timestamp and to_date, weekday names or numbers (DAY, D, and 
related field types) are accepted but are ignored for purposes of 
computing the result. The same is true for quarter (Q) fields." 
(https://www.postgresql.org/docs/12/functions-formatting.html)


So, I had a look at the relevant code part. I decided to try a patch by 
myself. Now it works as I would expect it:


date string |  to_date()
-
'2019-1-1'  |  2018-12-30 -> Sunday (!) of the first week of the year 
(the first week is at the first day of year)

'2019-1-2'  |  2018-12-31
'2019-1-3'  |  2019-01-01
'2019-1-4'  |  2019-01-02
'2019-1-5'  |  2019-01-03
'2019-1-6'  |  2019-01-04
'2019-1-7'  |  2019-01-05

'2019-2-1'  |  2019-01-06
'2019-2-2'  |  2019-01-07

Furthermore, if you leave out the 'D' part, the date is always set to
the first day of the corresponding week (in that case it is Sunday, in
contrast to the ISO week, which starts on Mondays).


To be consistent, I added similar code for the week-of-month pattern
('W'). So, using the pattern 'YYYY-MM-W-D' yields:


date string   |  to_date()
---
'2018-12-5-1' |  2018-12-23
'2018-12-6-1' |  2018-12-30
'2019-1-1-1'  |  2018-12-30 -> First day (Su) of the first week of the 
first month of the year
'2019-2-2-1'  |  2019-02-03 -> First day (Su) of the second week of 
February
'2019-10-3-5' |  2019-10-17 -> Fifth day (Th) of the third week of 
October


If you leave out the 'D', it is set to 1 as well.

The code can be seen here:
https://github.com/S-Man42/postgres/commit/534e6bd70e23864f385d60ecf187496c7f4387c9

I hope, keeping the code style of the surrounding code (especially the 
ISO code) is ok for you.


Now the questions:
1. Although the ignoring of the 'D' pattern is well documented, might
the new behaviour be interesting for you?

2. Does it work as you'd expect it?
3. Because this could be my very first contribution to the PostgreSQL 
code base, I really want you to be as critical as possible. I am not 
quite sure if I didn't miss something important.
4. Currently something like '2019-1-8' does not throw an exception but
results in the same as '2019-2-1' (the 8th is the same as the 1st of
the next week). On the other hand, the ISO week conversion currently
gives the result of '2019-1-7' for every 'D' >= 7. I am not sure which
is better. I think consistent exception handling should be discussed
separately (date roll-over vs. out-of-range exception vs. ISO week
behaviour).


So far, I am very curious about your opinions!

Kind regards,
Mark/S-Man42




Re: dropping column prevented due to inherited index

2019-10-08 Thread Ashutosh Sharma
On Tue, Oct 8, 2019 at 2:12 PM Amit Langote  wrote:
>
> Hi Ashutosh,
>
> On Mon, Oct 7, 2019 at 1:39 PM Ashutosh Sharma  wrote:
> > I think we could have first deleted all the dependency of child object
> > on parent and then deleted the child itself using performDeletion().
>
> So, there are two objects to consider in this case -- column and an
> index that depends on it.
>
> For columns, we don't store any dependency records for the dependency
> between a child column and its corresponding parent column.  That
> dependency is explicitly managed by the code and the attinhcount flag,
> which if set, prevents the column from being dropped on its own.
>
> Indexes do rely on dependency records for the dependency between a
> child index and its corresponding parent index.  This dependency
> prevents a child index from being dropped if the corresponding parent
> index is also not being dropped.
>
> In this case, recursively dropping a child's column will in turn try
> to drop the depending child index.  findDependentObject() complains
> because it can't allow a child index to be dropped, because it can't
> establish that the corresponding parent index is also being dropped.
> That's because the parent index will be dropped when the parent's
> column will be dropped, which due to current coding of
> ATExecDropColumn() is *after* all the child columns and indexes are
> dropped.  If we drop the parent column and depending index first and
> then recurse to children as my proposed patch does, then the parent
> index would already have been dropped when dropping the child column
> and the depending index.
>
> > As an example let's consider the case of toast table where we first
> > delete the dependency of toast relation on main relation and then
> > delete the toast table itself otherwise the toast table deletion would
> > fail. But, the problem I see here is that currently we do not have any
> > entry in pg_attribute table that would tell us about the dependency of
> > child column on parent.
>
> I couldn't imagine how that trick could be implemented for this case. :(
>

I don't think that is possible because the pg_attribute table presently
does not have any column indicating that an index exists on the given
attribute. If we knew that, we could find out whether the given index
on the child table was inherited from the parent and, if so, delete all
the dependencies on the child table's index first and then delete the
column itself in the child table; but that doesn't seem to be doable
here. So, although the standard way, as I see it, to perform an object
deletion is to first remove all its dependencies from the pg_depend
table and then delete the object itself, that doesn't seem to be
possible given the information available in the relevant catalog
tables.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com






Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-08 Thread Antonin Houska
Robert Haas  wrote:

> On Mon, Oct 7, 2019 at 3:01 PM Antonin Houska  wrote:
> > However the design doesn't seem to be stable enough at the
> > moment for coding to make sense.
> 
> Well, I think the question is whether working further on your patch
> could produce some things that everyone would agree are a step
> forward.

It would have made a lot of sense several months ago (Masahiko Sawada actually
used parts of our patch in the previous version of his patch; see [1]), but
the requirement to use a different IV for each execution of the encryption
changes things quite a bit.

Besides the relation pages and SLRU (CLOG), which are already being discussed
elsewhere in the thread, let's consider the other two file types:

* Temporary files (buffile.c): we derive the IV from the PID of the process
  that created the file, plus the segment number and the block within the
  segment. This information does not change if you need to write the same
  block again. If a new IV should be used for each encryption run, we can
  simply introduce an in-memory counter that generates the IV for each
  block. However, it becomes trickier if the temporary file is shared by
  multiple backends. I think it might still be easier to expose the IV
  values to other backends via shared memory than to store them on disk ...

* "Buffered transient file". This is to be used instead of OpenTransientFile()
  if user needs the option to encrypt the file. (Our patch adds this API to
  buffile.c. Currently we use it in reorderbuffer.c to encrypt the data
  changes produced by logical decoding, but there should be more use cases.)

  In this case we cannot keep the IVs in memory because user can close the
  file anytime and open it much later. So we derive the IV by hashing the file
  path. However if we should generate the IV again and again, we need to store
  it on disk in another way, probably one IV value per block (PGAlignedBlock).

  However since our implementation of both these file types shares some code,
  it might yet be easier if the shared temporary file also stored the IV on
  disk instead of exposing it via shared memory ...
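To make the two IV schemes above concrete, here is a toy Python sketch (the actual patch is C; the field layout, block size, and function names are illustrative assumptions, not the patch's actual format):

```python
import struct

BLOCK_SIZE = 8192   # assumed payload size per block (PGAlignedBlock)
IV_SIZE = 16        # AES block/IV size

def make_temp_iv(pid: int, segment: int, block: int, write_ctr: int) -> bytes:
    """Temporary files: pack PID + segment number + block number, plus an
    in-memory counter bumped on every rewrite of the block, into a 16-byte
    IV.  The 4x32-bit big-endian layout is an illustrative assumption."""
    return struct.pack(">IIII", pid, segment, block, write_ctr)

def transient_slot_offsets(block_no: int):
    """Buffered transient files: if each block's IV must be persisted on
    disk, one simple layout stores the 16-byte IV immediately before the
    encrypted block; return (iv_offset, data_offset) for a block."""
    base = block_no * (IV_SIZE + BLOCK_SIZE)
    return base, base + IV_SIZE
```

The key property is that rewriting the same block bumps the counter, so the same IV is never reused with the same key.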

Perhaps this is what I can work on, but I definitely need some feedback.

[1] 
https://www.postgresql.org/message-id/CAD21AoBjrbxvaMpTApX1cEsO=8N=nc2xVZPB0d9e-VjJ=ya...@mail.gmail.com

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: ICU for global collation

2019-10-08 Thread Marius Timmer
Hi everyone,

like the others before me, we (the University of Münster) are happy to
see this feature as well. Thank you for this.

When I applied the patch two weeks ago, I ran into the issue that initdb
did not recognize the new parameters (collation-provider and icu-locale),
but I guess that was caused by my own stupidity.

> When trying databases defined with ICU locales, I see that backends
> that serve such databases seem to have their LC_CTYPE inherited from
> the environment (as opposed to a per-database fixed value).
I am able to recreate the issue described by Daniel on my machine.

Now it works as expected. I just had to update the patch since commit
3f6b3be3 had modified two lines, which resulted in conflicts. You will find
the updated patch attached to this mail.


Best regards,

Marius Timmer




-- 
Westfälische Wilhelms-Universität Münster (WWU)
Zentrum für Informationsverarbeitung (ZIV)
Röntgenstraße 7-13
Besucheradresse: Einsteinstraße 60 - Raum 107
48149 Münster
+49 251 83 31158
marius.tim...@uni-muenster.de
https://www.uni-muenster.de/ZIV
From 0b520194ed164feaeac94af25ddf1429cf4ab24f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 7 Oct 2019 12:21:36 +0200
Subject: [PATCH v1.2] Add option to use ICU as global collation provider

This adds the option to use ICU as the default collation provider for
either the whole cluster or a database.  New options for initdb,
createdb, and CREATE DATABASE are used to select this.
---
 doc/src/sgml/ref/createdb.sgml   |   9 ++
 doc/src/sgml/ref/initdb.sgml |  23 
 src/backend/access/hash/hashfunc.c   |  18 ++-
 src/backend/commands/dbcommands.c|  52 -
 src/backend/regex/regc_pg_locale.c   |   7 +-
 src/backend/utils/adt/formatting.c   |   6 +
 src/backend/utils/adt/like.c |  20 +++-
 src/backend/utils/adt/like_support.c |   2 +
 src/backend/utils/adt/pg_locale.c| 168 ---
 src/backend/utils/adt/varchar.c  |  22 +++-
 src/backend/utils/adt/varlena.c  |  26 -
 src/backend/utils/init/postinit.c|  21 
 src/bin/initdb/Makefile  |   2 +
 src/bin/initdb/initdb.c  |  63 --
 src/bin/initdb/t/001_initdb.pl   |  18 ++-
 src/bin/pg_dump/pg_dump.c|  16 +++
 src/bin/psql/describe.c  |   8 ++
 src/bin/scripts/Makefile |   2 +
 src/bin/scripts/createdb.c   |   9 ++
 src/bin/scripts/t/020_createdb.pl|  19 ++-
 src/include/catalog/pg_database.dat  |   2 +-
 src/include/catalog/pg_database.h|   3 +
 src/include/utils/pg_locale.h|   6 +
 23 files changed, 417 insertions(+), 105 deletions(-)

diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 8fc8128bf9..5b73afad91 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -85,6 +85,15 @@ PostgreSQL documentation
   
  
 
+ 
+  --collation-provider={libc|icu}
+  
+   
+Specifies the collation provider for the database.
+   
+  
+ 
+
  
   -D tablespace
   --tablespace=tablespace
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index da5c8f5307..9ad7b2e112 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -165,6 +165,18 @@ PostgreSQL documentation
   
  
 
+ 
+  --collation-provider={libc|icu}
+  
+   
+This option sets the collation provider for databases created in the
+new cluster.  It can be overridden in the CREATE
+DATABASE command when new databases are subsequently
+created.  The default is libc.
+   
+  
+ 
+
  
   -D directory
   --pgdata=directory
@@ -209,6 +221,17 @@ PostgreSQL documentation
   
  
 
+ 
+  --icu-locale=locale
+  
+   
+Specifies the ICU locale if the ICU collation provider is used.  If
+this is not specified, the value from the --locale
+option is used.
+   
+  
+ 
+
  
   -k
   --data-checksums
diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index 6ec1ec3df3..2f8f220549 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -255,8 +255,13 @@ hashtext(PG_FUNCTION_ARGS)
  errmsg("could not determine which collation to use for string hashing"),
  errhint("Use the COLLATE clause to set the collation explicitly.")));
 
-	if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
-		mylocale = pg_newlocale_from_collation(collid);
+	if (!lc_collate_is_c(collid))
+	{
+		if (collid != DEFAULT_COLLATION_OID)
+			mylocale = pg_newlocale_from_collation(collid);
+		else if (global_locale.provider == COLLPROVIDER_ICU)
			mylocale = &global_locale;
+	}
 
 	if (!mylocale || mylocale->deterministic)
 	{
@@ -311,8 +316,13 @@ hashtextextended(PG_FUNCTION_ARGS)
  errmsg("could not determine which collation to use for string hashing"),
 		

Re: [PATCH] use separate PartitionedRelOptions structure to store partitioned table options

2019-10-08 Thread Nikolay Shaplov
В письме от вторник, 8 октября 2019 г. 16:00:49 MSK пользователь Amit Langote 
написал:

> > > > The idea of this patch is following: If you read the code, partitioned
> > > > tables do not have any options (you will not find
> > > > RELOPT_KIND_PARTITIONED
> > > > in boolRelOpts, intRelOpts, realRelOpts, stringRelOpts and enumRelOpts
> > > > in
> > > > reloption.c), but it uses StdRdOptions to store them (these no
> > > > options).
> > > 
> > > I am not even sure that we actually need that.  What kind of reloption
> > > you would think would suit into this subset?
> > 
> > Actually I do not know. But the author of partitioned patch, added a stub
> > for partitioned tables to have some reloptions in future. But this stub
> > is designed to use StdRdOptions. Which is not correct, as I presume. So
> > here I am correcting the stub.
> 
> I wrote the patch that introduced RELOPT_KIND_PARTITIONED.  Yes, it
> was added as a stub relopt_kind to eventually apply to reloptions that
> could be sensibly applied to partitioned tables.  We never got around
> to working on making the existing reloptions relevant to partitioned
> tables, nor did we add any new partitioned table-specific reloptions,
> so it remained an unused relopt_kind.
Thank you for clarifying things.
 
> IIUC, this patch invents PartitionedRelOptions as the binary
> representation for future RELOPT_KIND_PARTITIONED parameters.  As long
> as others are on board with using different *Options structs for
> different object kinds, I see no problem with this idea.
Yes, this is correct. Some access methods already use their own Options 
structure. As far as I can tell, StdRdOptions still exists only for historical 
reasons and became quite a mess because all kinds of stuff were added there. 
It is better to separate it.

BTW, since you are familiar with this part of the code, maybe you could 
join the review of this particular patch?

-- 
Software Developer: https://www.upwork.com/freelancers/~014a87e140ff02c0da
Body-oriented Therapist: https://vk.com/nataraj_rebalancing  (Russian)




Re: [PATCH] Add some useful asserts into View Options macroses

2019-10-08 Thread Nikolay Shaplov
В письме от понедельник, 7 октября 2019 г. 12:59:27 MSK пользователь Robert 
Haas написал:

> > This thread is a follow up to the thread
> > https://www.postgresql.org/message-id/2620882.s52SJui4ql@x200m where I've
> > been trying to remove StdRdOptions structure and replace it with unique
> > structure for each relation kind.
> > 
> > I've decided to split that patch into smaller parts.
> > 
> > This part adds some asserts to the ViewOptions macros.
> > Since an option pointer there is converted into (ViewOptions *), it would
> > be really good to make sure that these macros are called in the proper
> > context and that we do the conversion properly, at least when running
> > tests with asserts turned on.
> Seems like a good idea.  Should we try to do something similar for the
> macros in that header file that cast to StdRdOptions?

That would not be as easy as for ViewOptions. For example, in the current 
master code, fillfactor from StdRdOptions is used in TOAST, heap, hash index, 
nbtree index, and spgist index. This would make the RelationGetFillFactor macro 
a bit complicated, for example.

Now I have patches that limits usage of StdRdOptions to Heap and Toast.

Once StdRdOptions is not so widely used, we would be able to add asserts 
for it without making the code too complex.

So I would suggest doing the ViewOptions asserts now and dealing with 
StdRdOptions later. When I am finished with my current patches, I will 
take care of it.

-- 
Software Developer: https://www.upwork.com/freelancers/~014a87e140ff02c0da
Body-oriented Therapist: https://vk.com/nataraj_rebalancing  (Russian)




Re: [PATCH] Do not use StdRdOptions in Access Methods

2019-10-08 Thread Nikolay Shaplov
В письме от понедельник, 7 октября 2019 г. 18:55:20 MSK пользователь Dent John 
написал:

> I like the new approach. And I agree with the ambition — to split out the
> representation from StdRdOptions.
Thanks.
 
> However, with that change, in the AM’s *options() function, it looks as if
> you could simply add new fields to the relopt_parse_elt list. That’s still
> not true, because parseRelOptions() will fail to find a matching entry,
> causing numoptions to be left zero, and an early exit made. (At least,
> that’s if I correctly understand how things work.)
> 
> I think that is fine as an interim limitation, because your change has not
> yet fully broken the connection to the boolRelOpts, intRelOpts, realRelOpts
> and stringRelOpts strutures. But perhaps a comment would help to make it
> clear. Perhaps add something like this above the tab[]: "When adding or
> changing a relopt in the relopt_parse_elt tab[], ensure the corresponding
> change is made to boolRelOpts, intRelOpts, realRelOpts or stringRelOpts."
Whoa-whoa!

I am not inventing something new here. The same code is already used in brin 
(brin.c:820), gin (ginutils.c:602), and gist (gistutils.c:909) indexes. I've 
just copied the idea to make all the index code uniform.

This does not mean that this code can't be improved, but as far as I can 
tell it is better to do it in small steps: first make the option code uniform, 
and then improve all of it one way or another...

So here I would suggest discussing whether I copied this code correctly, 
without going very deeply into how we can improve the code I used as a 
source for cloning.

And then I have ideas about how to do it better. But that will come later...

-- 
Software Developer: https://www.upwork.com/freelancers/~014a87e140ff02c0da
Body-oriented Therapist: https://vk.com/nataraj_rebalancing  (Russian)




Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-08 Thread Antonin Houska
Ants Aasma  wrote:

> On Mon, 7 Oct 2019 at 18:02, Bruce Momjian  wrote:
> 
>>  Well, do to encryption properly, there is the requirement of the nonce. 
>>  If you ever rewrite a bit, you technically have to have a new nonce. 
>>  For WAL, since it is append-only, you can use the WAL file name.  For
>>  heap/index files, we change the LSN on every rewrite (with
>>  wal_log_hints=on), and we never use the same LSN for writing multiple
>>  relations, so LSN+page-offset is a sufficient nonce.
>> 
>>  For clog, it is not append-only, and bytes are rewritten (from zero to
>>  non-zero), so there would have to be a new nonce for every clog file
>>  write to the file system.  We can store the nonce in a separate file,
>>  but the clog contents and nonce would have to be always synchronized or
>>  the file could not be properly read.  Basically every file we want to
>>  encrypt, needs this kind of study.

> Yes. That is the reason why our current version doesn't encrypt
> SLRU's.

Actually, there was one more problem: the AES-CBC cipher (or AES-XTS in the
earlier patch version) processes an encryption block of 16 bytes at a time.
Thus, if only a part of the block gets written (a torn page write), decryption
of the block results in garbage. Unlike relations, there is nothing like a
full-page write for SLRU pages, so there is no way to recover from this problem.

However, if the current plan is to use the CTR mode, this problem should not
happen.
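To see why CTR avoids the torn-write problem: in CTR mode both encryption and decryption are a byte-wise XOR with a keystream, so each byte stands alone and a torn write only garbles the bytes that were actually torn. A toy Python sketch (the XOR keystream below stands in for AES-CTR's keystream; the helper names are made up):

```python
def ctr_xor(keystream: bytes, data: bytes) -> bytes:
    """Toy CTR core: encryption and decryption are the same XOR, with no
    chaining between bytes (unlike CBC's 16-byte block dependency)."""
    return bytes(k ^ d for k, d in zip(keystream, data))

def torn_write(cipher: bytes, written: int) -> bytes:
    """Simulate a torn page write: only the first `written` bytes reach
    disk; the remainder keeps its old (here: zeroed) content."""
    return cipher[:written] + b"\x00" * (len(cipher) - written)
```

With CBC, the torn cipher block would decrypt entirely to garbage; with CTR, the bytes that did reach disk still decrypt correctly.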

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Proposal: Make use of C99 designated initialisers for nulls/values arrays

2019-10-08 Thread Amit Kapila
On Sat, Oct 5, 2019 at 3:10 AM Andres Freund  wrote:
>
> On 2019-10-04 17:08:29 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2019-10-04 16:31:29 -0400, Bruce Momjian wrote:
> > >> Yeah, it is certainly weird that you have to assign the first array
> > >> element to get the rest to be zeros.  By using a macro, we can document
> > >> this behavior in one place.
> >
> > > IDK, to me this seems like something one just has to learn about C, with
> > > the macro just obfuscating that already required knowledge. It's not
> > > like this only applies to stack variables initializes with {0}. It's
> > > also true of global variables, or function-local static ones, for
> > > example.
> >
> > Huh?  For both those cases, the *default* behavior, going back to K C,
> > is that the variable initializes to all-bits-zero.  There's no need to
> > write anything extra.
>
> What I mean is that if there's any initialization, it's to all zeroes,
> except for the parts that are explicitly initialized. And all that the
> {0} does is ensure that the rest of the fields are initialized the way
> other such initialization happens.
>

You have a point, and I think over time everyone will learn this.
However, the fact that so many people are advocating a macro with a
comment to make this behavior more explicit shows that it is not equally
obvious to everyone, or at least they think that it will help future
patch authors.

Now, as the usage ({0}) already exists in the code, if we decide to use a
macro, then ideally those places should also be changed.  I am not saying
that it must be done in the same patch; we can even do it as a separate
patch.

I am personally still in the camp of people advocating the use of a
macro for this purpose.  It is quite possible that after reading your
points some people will change their opinion, or others will share their
opinion against using a macro, in which case we can drop the idea of
using a macro.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Change atoi to strtol in same place

2019-10-08 Thread David Rowley
On Tue, 8 Oct 2019 at 19:46, Joe Nelson  wrote:
>
> David Rowley wrote:
> > The translation string here must be consistent over all platforms. I
> > think this will cause issues if the translation string uses %ld and
> > the platform requires %lld?
>
> A very good and subtle point. I'll change it to %lld so that a single
> format string will work everywhere.

The way to do this is to make a temp buffer, snprintf into that buffer,
and then use %s.

See basebackup.c where it does:

char buf[64];

snprintf(buf, sizeof(buf), INT64_FORMAT, total_checksum_failures);

ereport(WARNING,
(errmsg("%s total checksum verification failures", buf)));

as an example.

> > I think you should maybe aim for 2 patches here.
> >
> > Patch 1: ...
> >
> > Patch 2: Add additional validation where we don't currently do
> > anything. e.g pg_dump -j
> >
> > We can then see if there's any merit in patch 2 or if it's adding more
> > complexity than is really needed.
>
> Are you saying that my current patch adds extra constraints for
> pg_dump's -j argument, or that in the future we could do that? Because I
> don't believe the current patch adds any appreciable complexity for that
> particular argument, other than ensuring the value is positive, which
> doesn't seem too contentious.

> Maybe we can see whether we can reach consensus on the current
> parse-and-check combo patch, and if discussion drags on much longer then
> try to split it up?

I just think you're more likely to get a committer onside if you make it
so they don't have to consider whether throwing errors where we
previously didn't would be a bad thing. It's quite common to get core
infrastructure in first and then follow up with code that uses it. This
would be core infrastructure plus some less controversial usages of it,
followed up with more later. This was really just a suggestion. I didn't
dig into the patch in enough detail to decide how many places could raise
an error where something unexpected would previously have happened
silently.
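As a sketch of the kind of parse-and-check helper being discussed (shown in Python for brevity; the actual patch would use strtol() in C, and the helper name, signature, and error strings here are made up):

```python
def parse_int(text: str, lo: int, hi: int):
    """Parse an integer option the way a strtol()-based helper would:
    reject trailing junk and out-of-range values instead of silently
    accepting whatever prefix atoi() manages to parse.

    Returns (value, None) on success or (None, error_message) on failure.
    """
    try:
        val = int(text.strip(), 10)
    except ValueError:
        return None, 'invalid integer value "%s"' % text
    if not lo <= val <= hi:
        return None, '%d is outside the valid range %d..%d' % (val, lo, hi)
    return val, None
```

The range check is what lets callers such as a hypothetical pg_dump -j validation reject non-positive values instead of silently misbehaving.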

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-08 Thread Ants Aasma
On Mon, 7 Oct 2019 at 18:02, Bruce Momjian  wrote:

> Well, do to encryption properly, there is the requirement of the nonce.
> If you ever rewrite a bit, you technically have to have a new nonce.
> For WAL, since it is append-only, you can use the WAL file name.  For
> heap/index files, we change the LSN on every rewrite (with
> wal_log_hints=on), and we never use the same LSN for writing multiple
> relations, so LSN+page-offset is a sufficient nonce.
>
> For clog, it is not append-only, and bytes are rewritten (from zero to
> non-zero), so there would have to be a new nonce for every clog file
> write to the file system.  We can store the nonce in a separate file,
> but the clog contents and nonce would have to be always synchronized or
> the file could not be properly read.  Basically every file we want to
> encrypt, needs this kind of study.
>

Yes. That is the reason why our current version doesn't encrypt SLRUs.
There is some security in encrypting without a nonce when considering an
attack vector that only sees one version of the encrypted page. But I think
to make headway on this we need to figure out whether the TDE feature is
useful without SLRU encryption (I think yes), how hard it would be to
properly encrypt SLRUs, and whether the solution would be acceptable for
inclusion.

I can think of 3 options:

a) Separate nonce storage. Seems pretty bad complexity-wise. New data
structures would need to be created, and SLRU writes would need to be
WAL-logged with a full page image.
b) Inline nonces, where the number of items per SLRU page varies depending
on whether encryption is enabled.
c) Inline nonces, where we reserve a header structure on all SLRU pages.
pg_upgrade needs to rewrite persistent SLRUs.

None of the options seem great, but c) has the benefit of also carving out
the space for SLRU checksums.
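For illustration, option c) might look something like this on the page (a Python sketch; the 16-byte nonce and 32-bit checksum field sizes are assumptions, not a settled design):

```python
import struct

# Hypothetical on-page header reserved at the start of every SLRU page:
# a per-write nonce plus room for a checksum, as suggested above.
SLRU_PAGE_SIZE = 8192
SLRU_HDR = struct.Struct("<16sI")   # 16-byte nonce + 32-bit checksum

def slru_usable_bytes() -> int:
    """Payload bytes left for SLRU entries once the header is reserved.
    The page holds fewer entries than before, which is why pg_upgrade
    would have to rewrite persistent SLRUs."""
    return SLRU_PAGE_SIZE - SLRU_HDR.size
```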

> As I also said to Stephen, the people who are discussing this here
> > should *really really really* be looking at the Cybertec patch instead
> > of trying to invent everything from scratch - unless that patch has,
>
> Someone from Cybertec is on the voice calls we have, and is actively
> involved.
>

As far as I can tell, no one from us is on the call. I personally missed the
invitation when it was sent out. I would gladly share our learnings; a lot
of what I see here retreads what we already went through with our patch.
However, I think that at the very least the conclusions, problems to work
on, and WIP patch should be shared on the list. It's hard for anybody
outside to have any input if there are no concrete design proposals or code
to review. Moreover, I think e-mail is a much better medium for having a
reasoned discussion about technical design decisions.


> > In other words: maybe I'm wrong here, but it looks to me like we're
>
> laboriously reinventing the wheel when we could be working on
> > improving the working prototype.
>
> The work being done is building on that prototype.
>

We would like to help on that front.

Regards,
Ants Aasma
Web: https://www.cybertec-postgresql.com


Re: dropping column prevented due to inherited index

2019-10-08 Thread Amit Langote
On Tue, Oct 8, 2019 at 6:15 PM Alvaro Herrera  wrote:
> Apologies for not helping much here; I'm on vacation for a couple of
> weeks.

No worries, please take your time.

> On 2019-Oct-08, Amit Langote wrote:
>
> > I couldn't imagine how that trick could be implemented for this case. :(
>
> Can't we pass around an ObjectAddresses pointer to which each recursion
> level adds the object(s) it wants to delete, and in the outermost level
> we drop everything in it?  I vaguely recall we implemented something
> like that somewhere within the past year, but all I can find is
> 20bef2c3110a.

I thought about doing something like that, but wasn't sure if
introducing that much complexity is warranted.

Thanks,
Amit




Re: dropping column prevented due to inherited index

2019-10-08 Thread Alvaro Herrera
Apologies for not helping much here; I'm on vacation for a couple of
weeks.

On 2019-Oct-08, Amit Langote wrote:

> I couldn't imagine how that trick could be implemented for this case. :(

Can't we pass around an ObjectAddresses pointer to which each recursion
level adds the object(s) it wants to delete, and in the outermost level
we drop everything in it?  I vaguely recall we implemented something
like that somewhere within the past year, but all I can find is
20bef2c3110a.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: Wrong value in metapage of GIN INDEX.

2019-10-08 Thread imai.yoshik...@fujitsu.com
Moon-san, kuroda.keisuke-san

On Thu, Aug 29, 2019 at 8:20 AM, Moon, Insung wrote:
> =# CREATE INDEX foo_idx ON foo USING gin (i jsonb_ops);
> =# SELECT * FROM gin_metapage_info(get_raw_page('foo_idx', 0)) WITH 
> (fastupdate=off);

This is not an important thing, but there are some mistakes here.

=# CREATE INDEX foo_idx ON foo USING gin (i jsonb_ops) WITH (fastupdate=off);
=# SELECT * FROM gin_metapage_info(get_raw_page('foo_idx', 0));


> In this example, the nentries value should be 10001 because the gin index
> stores duplicate values in one leaf (posting tree or posting list).
...
> The patch is very simple.
> Fix to increase the value of nEntries only when a non-duplicate GIN index 
> leaf added.

Does nentries show the number of entries in the leaf pages?
If so, the fix seems to be correct.

--
Yoshikazu Imai


RE: Global shared meta cache

2019-10-08 Thread ideriha.take...@fujitsu.com
Hi, Alvaro

>
>The last patch we got here (a prototype) was almost a year ago.  There was
>substantial discussion about it, but no new version of the patch has been 
>posted.  Are
>we getting a proper patch soon, or did we give up on the approach entirely?

I'm sorry for the late response. I have started working on it again for the 
upcoming commitfest.

Regards,
Takeshi Ideriha


Re: Remove some code for old unsupported versions of MSVC

2019-10-08 Thread Peter Eisentraut
On 2019-10-07 08:52, Michael Paquier wrote:
> On Fri, Oct 04, 2019 at 04:35:59PM +0200, Peter Eisentraut wrote:
>> As of d9dd406fe281d22d5238d3c26a7182543c711e74, we require MSVC 2013,
>> which means _MSC_VER >= 1800.  This means that conditionals about
>> older versions of _MSC_VER can be removed or simplified.
>>
>> Previous code was also in some cases handling MinGW, where _MSC_VER is
>> not defined at all, incorrectly, such as in pg_ctl.c and win32_port.h,
>> leading to some compiler warnings.  This should now be handled better.
> 
> Thanks Peter for cleaning up this code.  I have looked at it, did some
> testing and it looks good to me.  No spots are visibly missing.

pushed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: dropping column prevented due to inherited index

2019-10-08 Thread Amit Langote
Hi Ashutosh,

On Mon, Oct 7, 2019 at 1:39 PM Ashutosh Sharma  wrote:
> I think we could have first deleted all the dependencies of the child
> object on the parent and then deleted the child itself using performDeletion().

So, there are two objects to consider in this case -- column and an
index that depends on it.

For columns, we don't store any dependency records for the dependency
between a child column and its corresponding parent column.  That
dependency is explicitly managed by the code and the attinhcount flag,
which if set, prevents the column from being dropped on its own.

Indexes do rely on dependency records for the dependency between a
child index and its corresponding parent index.  This dependency
prevents a child index from being dropped if the corresponding parent
index is also not being dropped.

In this case, recursively dropping a child's column will in turn try
to drop the depending child index.  findDependentObject() complains
because it can't allow a child index to be dropped, because it can't
establish that the corresponding parent index is also being dropped.
That's because the parent index will be dropped when the parent's
column will be dropped, which due to current coding of
ATExecDropColumn() is *after* all the child columns and indexes are
dropped.  If we drop the parent column and depending index first and
then recurse to children as my proposed patch does, then the parent
index would already have been dropped when dropping the child column
and the depending index.
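The ordering argument above can be modeled with a toy sketch (Python, purely illustrative; the dict-based structures and names are made up and are not PostgreSQL code):

```python
def drop_column(tables, name, col, dropped_indexes):
    """Drop `col` on table `name`: first drop the index depending on it,
    then recurse to children, mirroring the parent-first order of the
    proposed patch."""
    t = tables[name]
    idx = t["index_on"].pop(col, None)
    if idx is not None:
        # findDependentObjects()-style rule: a child index may only be
        # dropped if its parent index is also (already) being dropped.
        if idx["parent"] is not None and idx["parent"] not in dropped_indexes:
            raise RuntimeError("cannot drop index %s: parent index %s still exists"
                               % (idx["name"], idx["parent"]))
        dropped_indexes.add(idx["name"])
    t["columns"].remove(col)
    for child in t["children"]:
        drop_column(tables, child, col, dropped_indexes)
```

With parent-first ordering the parent index lands in the dropped set before any child index is checked; recursing to children first hits the error, which is the failure being discussed.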

> As an example let's consider the case of toast table where we first
> delete the dependency of toast relation on main relation and then
> delete the toast table itself otherwise the toast table deletion would
> fail. But, the problem I see here is that currently we do not have any
> entry in pg_attribute table that would tell us about the dependency of
> child column on parent.

I couldn't imagine how that trick could be implemented for this case. :(

Thanks,
Amit




RE: Global shared meta cache

2019-10-08 Thread ideriha.take...@fujitsu.com
Hi, Konstantin 

I'm very sorry for the late response and thank you for your feedback.
(I re-sent this email because my email address changed and couldn't deliver to 
hackers.)

>From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
>
>Takeshi-san,
>
>I am sorry for the late response - I was just waiting for a new version of
>the patch from you to review.

Though I haven't incorporated your idea yet, I made a PoC patch which supports 
regular CREATE TABLE, SELECT, and DROP TABLE.

TBH, the current patch is not very sophisticated yet.
It fails some installcheck items with the global catalog cache on, and it is 
around 2k LOC.

>I read your last proposal and it seems to be very reasonable.
> From my point of view we can not reach acceptable level of performance 
>if we do not have local cache at all.
>So, as you proposed, we should maintain local cache for uncommitted data.
Yeah, I did this in my patch.

>I think that size of global cache should be limited (you have introduced GUC 
>for it).
>In principle it is possible to use dynamic shared memory and have 
>unlimited global cache.
>But I do not see much sense in it.
Yes. I limit the size of the global cache. Right now it doesn't support an 
eviction policy such as LRU.
 
>I do not completely understand from your description when you are going
>to evict an entry from the local cache.
>Just once the transaction is committed? I think it will be more efficient
>to also specify a memory threshold for the local cache size and use LRU or
>some other eviction policy to remove data from the local cache.
>So if the working set (accessed relations) fits in the local cache limit,
>there will be no performance penalty compared with the current
>implementation. There should be no difference at all on pgbench or other
>benchmarks with a relatively small number of relations.
>
>If entry is not found in local cache, then we should look for it in 
>global cache and in case of double cache miss - read it from the disk.
>I do not completely understand why we need to store references to 
>global cache entries in local cache and use reference counters for global 
>cache entries.
>Why can we not maintain just two independent caches?
>
>While there really are databases with hundreds and even thousands of
>tables, an application still tends to work with only a small subset of
>them. So I think that the "working set" can still fit in memory.  This is
>why I think that in case of a local cache miss and a global cache hit, we
>should copy the data from the global cache to the local cache to make it
>possible to access it in the future without any synchronization.
>
>As we need to keep all uncommitted data in the local cache, there is
>still a chance of local memory overflow (if some transaction creates or
>alters too large a number of tables).
>But I think that is a very exotic and rare use case. The problem with
>memory overflow usually takes place if we have a large number of
>backends, each maintaining its own catalog cache.
>So I think that we should have "soft" limit for local cache and "hard"
>limit for global cache.

Oh, I hadn't come up with this idea at all. So the local cache is a sort of 
first-level cache and the global cache is a second-level cache. That sounds 
great.
It would be good for performance, and setting two GUC parameters limiting the 
local cache and the global cache gives complete memory control to the DBA.
Yeah, uncommitted data should be local, but that is the only exception.
Not keeping track of references to global cache entries from the local cache 
header seems less complex to implement. I'll look into the design.
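The two-level lookup described above could be sketched like this (a Python toy; dicts stand in for the shared-memory hash table and the catalog read, and all names are made up):

```python
class TwoLevelCatCache:
    """Toy model: a per-backend local cache in front of a shared global
    cache.  A local miss consults the global cache; a double miss loads
    from "disk"; either way the entry is copied into the local cache so
    later hits need no synchronization."""

    def __init__(self, shared, loader):
        self.local = {}       # per-backend, no locking needed
        self.shared = shared  # stands in for the shared-memory hash
        self.loader = loader  # stands in for reading the catalog

    def lookup(self, key):
        if key in self.local:
            return self.local[key]          # fast path, no sync
        if key not in self.shared:          # double cache miss
            self.shared[key] = self.loader(key)
        entry = self.shared[key]
        self.local[key] = entry             # copy to local cache
        return entry
```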

>I didn't think much about cache invalidation. I read your proposal, but 
>frankly speaking do not understand why it should be so complicated.
>Why can't we immediately invalidate the entry in the global cache and
>lazily (as it is done now using invalidation signals) invalidate the
>local caches?
>

I was overthinking when the local/global cache should be evicted. Simply put, 
the process reads the sinval messages and then invalidates the entry. If the 
refcount is not zero, the process marks the entry dead to prevent other 
processes from finding the obsolete cache entry in the global hash table.
The refcount of a global cache entry is raised between SearchSysCache() and 
ReleaseSysCache().
Invalidating a global cache entry whose refcount is raised would cause invalid 
memory access.
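A minimal sketch of that rule (hypothetical names; the real code would also need locking around the shared hash table):

```python
class GlobalCacheEntry:
    def __init__(self, data):
        self.data = data
        self.refcount = 0   # raised between SearchSysCache/ReleaseSysCache
        self.dead = False   # set instead of freeing while still pinned

def invalidate_entry(table, key):
    """Handle a sinval message: free the entry at once if nobody has it
    pinned, otherwise just mark it dead so new lookups skip it."""
    entry = table.get(key)
    if entry is None:
        return
    if entry.refcount == 0:
        del table[key]          # safe: no backend holds a reference
    else:
        entry.dead = True       # the last release will free it

def release_entry(table, key):
    """ReleaseSysCache() counterpart: drop the pin, free dead entries."""
    entry = table[key]
    entry.refcount -= 1
    if entry.dead and entry.refcount == 0:
        del table[key]
```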

Regards,
Takeshi Ideriha


0002-POC-global-catalog-cache.patch
Description: 0002-POC-global-catalog-cache.patch


0001-MemoryContext-for-shared-memory-based-on-DSA.patch
Description: 0001-MemoryContext-for-shared-memory-based-on-DSA.patch


Re: Regarding extension

2019-10-08 Thread Craig Ringer
On Fri, 4 Oct 2019 at 13:18, Tom Lane  wrote:

> Natarajan R  writes:
> > Me:  Thanks Tomas, But this is for that particular database only, I want
> > to get the *list of database Id's* on which my extension is installed
> > during *PG_INIT* itself...
>
> You can't.  In the first place, that information simply isn't obtainable,
> because a session running within one database doesn't have access to the
> catalogs of other databases in the cluster.  (You could perhaps imagine
> firing up connections to other DBs a la dblink/postgres_fdw, but that will
> fail because you won't necessarily have permissions to connect to every
> database.)  In the second place, it's a pretty terrible design to be
> attempting any sort of database access within _PG_init, because that
> precludes loading that module outside a transaction; for example you
> will not be able to preload it via shared_preload_libraries or allied
> features.
>

Absolutely agreed. Having done this myself, it's much, much harder than
you'd expect and not something I suggest anyone try unless it's absolutely
necessary.

It'd be an absolute dream if extensions could create their own shared
catalogs; that'd make life *so* much easier. But I seem to recall looking
at that and nope-ing right out. That was a while ago so I should probably
revisit it.

Anyhow: BDR and pglogical are extensions that do need to concern themselves
with what's in various databases, so this is an issue I've worked with day
to day for some time.

BDR1 used a custom security label and the pg_shseclabel catalog to mark
databases that were BDR-enabled. It launched a worker that connected to
database InvalidOid, so it could read only the global shared catalogs, then
it scanned them to find out which DBs to launch individual workers for.
This interacted poorly with pg_dump/pg_restore and proved fragile, so I
don't recommend it.

pglogical instead launches a static bgworker with no DB connections. On
startup or when it gets a suitable message over its extension shmem
segment + a latch set, it launches new workers for each DB. Each worker
inspects the DB to check for the presence of the pglogical extension and
exits if it isn't found.

All in all, it's pretty clumsy, though it works very well.

We have to do our own process management and registration. Workarounds must
be put in place for processes failing to launch and then a new process taking
their shmem slot, among various other things. pglogical ends up having to
duplicate quite a bit of the bgworker and postmaster management infrastructure
because it's not extensible and has some serious deficiencies in
error/crash handling.

(We also have our own dependency management, lock management, shared cache
invalidations, syscache/catcache-like mechanism, and other areas where we'd
rather extend Pg's infrastructure but can't. Being able to create our own
dependency types, custom lock types/methods, custom syscaches we could get
invalidations for, etc would just be amazing. But each would likely be a
major effort to get into core, if we could get it accepted at all given the
"in core users" argument, and we'd have to keep the old method around
anyway...)

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: [PATCH] use separate PartitionedRelOptions structure to store partitioned table options

2019-10-08 Thread Amit Langote
Hello,

On Mon, Oct 7, 2019 at 6:43 PM Nikolay Shaplov  wrote:
> В письме от понедельник, 7 октября 2019 г. 14:57:14 MSK пользователь Michael
> Paquier написал:
> > On Sun, Oct 06, 2019 at 03:47:46PM +0300, Nikolay Shaplov wrote:
> > > The idea of this patch is the following: If you read the code, partitioned
> > > tables do not have any options (you will not find RELOPT_KIND_PARTITIONED
> > > in boolRelOpts, intRelOpts, realRelOpts, stringRelOpts and enumRelOpts in
> > > reloption.c), but it uses StdRdOptions to store them (these non-existent options).
> > I am not even sure that we actually need that.  What kind of reloption
> > do you think would fit into this subset?
>
> Actually I do not know. But the author of the partitioning patch added a stub
> for partitioned tables to have some reloptions in the future. But this stub is
> designed to use StdRdOptions, which I presume is not correct. So here I am
> correcting the stub.

I wrote the patch that introduced RELOPT_KIND_PARTITIONED.  Yes, it
was added as a stub relopt_kind to eventually apply to reloptions that
could be sensibly applied to partitioned tables.  We never got around
to working on making the existing reloptions relevant to partitioned
tables, nor did we add any new partitioned table-specific reloptions,
so it remained an unused relopt_kind.

IIUC, this patch invents PartitionedRelOptions as the binary
representation for future RELOPT_KIND_PARTITIONED parameters.  As long
as others are on board with using different *Options structs for
different object kinds, I see no problem with this idea.

Thanks,
Amit




Re: Change atoi to strtol in same place

2019-10-08 Thread Joe Nelson
David Rowley wrote:
> It's not for this patch to decide what ports clients can connect to
> PostgreSQL on. As far as I'm aware Windows has no restrictions on what
> ports unprivileged users can listen on. I think we're likely going to
> upset a handful of people if we block the client tools from connecting
> to ports < 1024.

Makes sense. We can instead allow any port number and if it errors at
connection time then the user will find out at that point.

> This part seems over-complex to me. What's wrong with just returning a
> bool and if the caller gets "false", then just have it emit the error,
> such as:
> 
> "compression level must be between %d and %d"
> 
> I see Michael's patch is adding this new return type, but really, is
> there a good reason why we need to do something special when the user
> does not pass in an integer?

Displaying the range when given a non-integer input could be misleading.
For instance, suppose we put as little restriction on the port number
range as possible, enforcing only that it's positive, between 1 and
INT_MAX.  If someone passes a non-integer value, they would get a
message like:

invalid port number: must be an integer in range 1..2147483647

Sure, the parsing code will accept such a big number, but we don't want
to *suggest* the number. Notice that if the user had passed a well-formed
number for the port, it's unlikely to be greater than INT_MAX, so they
wouldn't have to see this weird message.

Perhaps you weren't suggesting to remove the upper limit from port
checking, just to change the lower limit from 1024 back to 1. In that
case we could keep it capped at 65535 and the error message above would
be OK.

Other utilities do have command line args that allow the whole
non-negative (but signed) int range, and their error messages would show
the big number. It's not misleading in that case, but a little
ostentatious.

> Current patch:
> $ pg_dump -Z blah
> invalid compression level: could not parse 'blah' as integer
> 
> I propose:
> $ pg_dump -Z blah
> compression level must be an integer in range 0..9
> 
> This might save a few round trips, e.g the current patch will do:

I do see your reasoning that we're teasing people with a puzzle they
have to solve with multiple invocations. On the other hand, passing a
non-number for the compression level is pretty strange, and perhaps
explicitly calling out the mistake might make someone look more
carefully at what they -- or their script -- are doing.

> The translation string here must be consistent over all platforms. I
> think this will cause issues if the translation string uses %ld and
> the platform requires %lld?

A very good and subtle point. I'll change it to %lld so that a single
format string will work everywhere.

> I think you should maybe aim for 2 patches here.
> 
> Patch 1: ...
> 
> Patch 2: Add additional validation where we don't currently do
> anything. e.g pg_dump -j
> 
> We can then see if there's any merit in patch 2 or if it's adding more
> complexity than is really needed.

Are you saying that my current patch adds extra constraints for
pg_dump's -j argument, or that in the future we could do that? Because I
don't believe the current patch adds any appreciable complexity for that
particular argument, other than ensuring the value is positive, which
doesn't seem too contentious.

Maybe we can see whether we can reach consensus on the current
parse-and-check combo patch, and if discussion drags on much longer then
try to split it up?

> I also think some compilers won't like:
> 
> + compressLevel = parsed;
> 
> given that "parsed" is int64 and "compressLevel" is int, surely some
> compilers will warn of possible truncation? An explicit cast to int
> should fix those

Good point, I bet some compilers (justly) warn about truncation. We've
checked the range so I'll add an explicit cast.

> or you could consider just writing a version of the function for int32
> and int64 and directly passing in the variable to be set.

One complication is that the destination values are often int rather
than int32, and I don't know their width in general (probably 32, maybe
16, but *possibly* 64?).  The pg_strtoint64_range() function with range
argument of INT_MAX is flexible enough to handle whatever situation we
encounter. Am I overthinking this part?

-- 
Joe Nelson  https://begriffs.com




Re: Ordering of header file inclusion

2019-10-08 Thread Amit Kapila
On Wed, Oct 2, 2019 at 2:57 PM vignesh C  wrote:
>
> Hi,
>
> I noticed that some of the header files inclusion is not ordered as
> per the usual standard that is followed.
> The attached patch contains the fix for the order in which the header
> files are included.
> Let me know your thoughts on the same.
>

+1.   I think this will make the order of header inclusions consistent
throughout the code.   One thing which will be slightly tricky is that we
might not be able to back-patch this, as some of it belongs to recent
version(s) and other parts to older versions as well.   OTOH, I have not
investigated how much of this is relevant to back branches.  I think
most of these will apply to 12, but I am not sure if it is worth the
effort to segregate the changes which apply to back branches.  What do
you think?

A few minor comments after a quick read:
 #include "lib/ilist.h"
-
+#include "miscadmin.h"

I think we shouldn't remove the extra line as part of the above change.

--- a/src/bin/psql/variables.c
+++ b/src/bin/psql/variables.c
@@ -8,10 +8,8 @@
 #include "postgres_fe.h"

 #include "common.h"
-#include "variables.h"
-
 #include "common/logging.h"
-
+#include "variables.h"

Same as above.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Updated some links which are not working with new links

2019-10-08 Thread vignesh C
On Tue, Oct 8, 2019 at 11:06 AM Michael Paquier  wrote:
>
> On Mon, Oct 07, 2019 at 02:14:05PM +0530, vignesh C wrote:
> > Sorry Michael for the miscommunication, the patch was present in the
> > first mail of this mail thread.
> > I'm re-attaching the patch in this mail.
> > Let me know if anything is required.
>
> Thanks.  It looks like I have been able to miss the actual patch, and
> got confused as there were two threads about more or less the same
> matter.  The links were redirected to an https equivalent, so applied
> with that.
Thanks Michael.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: maintenance_work_mem used by Vacuum

2019-10-08 Thread Amit Kapila
On Tue, Oct 8, 2019 at 12:57 AM Robert Haas  wrote:
>
> On Sun, Oct 6, 2019 at 6:55 AM Amit Kapila  wrote:
> > We wanted to decide how a parallel vacuum should use memory.  Can each 
> > worker consume maintenance_work_mem to clean up the gin index, or should 
> > all workers together use no more than maintenance_work_mem?  We were 
> > thinking of the latter, but before we decide what the right behavior for 
> > parallel vacuum is, I thought it better to first discuss whether the 
> > current memory usage model is right.
>
> Well, I had the idea when we were developing parallel query that we
> should just ignore the problem of work_mem: every node can use X
> amount of work_mem, and if there are multiple copies of the node in
> multiple processes, then you probably end up using more memory.  I
> have been informed by Thomas Munro -- in very polite terminology --
> that this was a terrible decision which is causing all kinds of
> problems for users. I haven't actually encountered that situation
> myself, but I don't doubt that it's an issue.
>
> I think it's a lot easier to do better when we're talking about
> maintenance commands rather than queries. Maintenance operations
> typically don't have the problem that queries do with an unknown
> number of nodes using memory; you typically know all of your memory
> needs up front.  So it's easier to budget that out across workers or
> whatever. It's a little harder in this case, because you could have
> any number of GIN indexes (1 to infinity) and the amount of memory you
> can use depends on not only on how many of them there are but,
> presumably also, the number of those that are going to be vacuumed at
> the same time.  So you might have 8 indexes, 3 workers, and 2 of the
> indexes are GIN. In that case, you know that you can't have more than
> 2 GIN indexes being processed at the same time, but it's likely to be
> only one, and maybe with proper scheduling you could make sure it's
> only one. On the other hand, if you dole out the memory assuming it's
> only 1, what happens if you start that one, then process all 6 of the
> non-GIN indexes, and that one isn't done yet. I guess you could wait
> to start cleanup on the other GIN indexes until the previous index
> cleanup finishes, but that kinda sucks too. So I'm not really sure how
> to handle this particular case. I think the principle of dividing up
> the memory rather than just using more is probably a good one, but
> figuring out exactly how that should work seems tricky.
>

Yeah, and what if we have as many workers as indexes?  In that case, doing
the cleanup of Gin indexes serially (waiting for the prior index to finish
before starting the cleanup of the next Gin index) would be bad too.  I
think we can do something simple like choosing the minimum among 'number of
Gin indexes', 'number of workers requested for parallel vacuum' and
'max_parallel_maintenance_workers', and then dividing maintenance_work_mem
by that to get the memory used by each of the Gin indexes.  This has some
caveats; for example, we might not be able to launch the number of workers
we decided on, and in that case we could have computed a bigger value of
work_mem to be used by the Gin indexes.  I think whatever we pick here can
be good for some cases and not-so-good for others, so why not pick something
general and simple.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com