Re: Unused header file inclusion

2019-07-30 Thread Michael Paquier
On Wed, Jul 31, 2019 at 11:19:08AM +0530, vignesh C wrote:
> I noticed that there are many header files being included which need
> not be included.  I have tried this in a few files and found the
> compilation and regression to be working.  I have attached the patch
> for the files that  I tried.  I tried this in CentOS, I did not find
> the header files to be platform specific.
> Should we pursue this further and cleanup in all the files?

Do you use a particular method here or just manual deduction after
looking at each file individually?  If this can be cleaned up a bit, I
think that's welcome.  The removal of headers is easily forgotten when
moving code from one file to another...
--
Michael




Re: Proposal to suppress errors thrown by to_reg*()

2019-07-30 Thread Takuma Hoshiai
On Tue, 30 Jul 2019 12:24:13 -0400
Tom Lane  wrote:

> Takuma Hoshiai  writes:
> > [ fix_to_reg_v2.patch ]
> 
> I took a quick look through this patch.  I'm on board with the goal
> of not having schema-access violations throw an error in these
> functions, but the implementation feels pretty ugly and bolted-on.
> Nobody who had designed the code to do this from the beginning
> would have chosen to parse the names twice, much less check the
> ACLs twice which is what's going to happen in the success path.
> 
> But a worse problem is that this only fixes the issue for the
> object name proper.  to_regprocedure and to_regoperator also
> have type name(s) to think about, and this doesn't fix the
> problem for those:
> 
> regression=# create schema s1;
> CREATE SCHEMA
> regression=# create type s1.d1 as enum('a','b');
> CREATE TYPE
> regression=# create function f1(s1.d1) returns s1.d1 language sql as
> regression-# 'select $1';
> CREATE FUNCTION
> regression=# select to_regprocedure('f1(s1.d1)');
>  to_regprocedure 
> -----------------
>  f1(s1.d1)
> (1 row)
> 
> regression=# create user joe;
> CREATE ROLE
> regression=# \c - joe
> You are now connected to database "regression" as user "joe".
> regression=> select to_regprocedure('f1(s1.d1)');
> ERROR:  permission denied for schema s1
> 
> 
> A closely related issue that's been complained of before is that
> while these functions properly return NULL when the main object
> name includes a nonexistent schema:
> 
> regression=> select to_regprocedure('q1.f1(int)');
>  to_regprocedure 
> -----------------
>  
> (1 row)
> 
> it doesn't work for a nonexistent schema in a type name:
> 
> regression=> select to_regprocedure('f1(q1.d1)');
> ERROR:  schema "q1" does not exist
> 
> 
> Looking at the back-traces for these failures,
> 
> #0  errfinish (dummy=0) at elog.c:411
> #1  0x00553626 in aclcheck_error (aclerr=, 
> objtype=OBJECT_SCHEMA, objectname=) at aclchk.c:3623
> #2  0x0055028f in LookupExplicitNamespace (
> nspname=, missing_ok=false) at namespace.c:2928
> #3  0x005b3433 in LookupTypeName (pstate=0x0, typeName=0x20d87a0, 
> typmod_p=0x7fff94c3ee38, missing_ok=)
> at parse_type.c:162
> #4  0x005b3b29 in parseTypeString (str=, 
> typeid_p=0x7fff94c3ee3c, typmod_p=0x7fff94c3ee38, missing_ok=false)
> at parse_type.c:822
> #5  0x0086fe21 in parseNameAndArgTypes (string=, 
> allowNone=false, names=, nargs=0x7fff94c3f01c, 
> argtypes=0x7fff94c3ee80) at regproc.c:1874
> #6  0x00870b2d in to_regprocedure (fcinfo=0x2134900) at regproc.c:305
> 
> #0  errfinish (dummy=0) at elog.c:411
> #1  0x0054dc7b in get_namespace_oid (nspname=, 
> missing_ok=false) at namespace.c:3061
> #2  0x00550230 in LookupExplicitNamespace (
> nspname=, missing_ok=false) at namespace.c:2922
> #3  0x005b3433 in LookupTypeName (pstate=0x0, typeName=0x216bd20, 
> typmod_p=0x7fff94c3ee38, missing_ok=)
> at parse_type.c:162
> #4  0x005b3b29 in parseTypeString (str=, 
> typeid_p=0x7fff94c3ee3c, typmod_p=0x7fff94c3ee38, missing_ok=false)
> at parse_type.c:822
> #5  0x0086fe21 in parseNameAndArgTypes (string=, 
> allowNone=false, names=, nargs=0x7fff94c3f01c, 
> argtypes=0x7fff94c3ee80) at regproc.c:1874
> #6  0x00870b2d in to_regprocedure (fcinfo=0x2170f50) at regproc.c:305
> 
> it's not *that* far from where we know we need no-error behavior to
> where it's failing to happen.  parseNameAndArgTypes isn't even global,
> so certainly changing its API is not problematic.  For the functions
> below it, we'd have to decide whether it's okay to consider that
> missing_ok=true also enables treating a schema access failure as
> "missing", or whether we should add an additional flag argument
> to decide that.  It seems like it might be more flexible to use a
> separate flag, but that decision could propagate to a lot of places,
> especially if we conclude that all the namespace.c functions that
> expose missing_ok also need to expose schema_access_violation_ok.
> 
> So I think you ought to drop this implementation and fix it properly
> by adjusting the behaviors of the functions cited above.

Thank you for looking at this.
Sorry, I overlooked the access-permission error on the argument types.

Changing the behavior of 'missing_ok = true' would have an impact on
DROP TABLE IF EXISTS, for example.  So I will consider adding an
additional flag like schema_access_violation_ok instead of checking the
ACL twice.
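
To be concrete, the rough shape I have in mind is something like the
following, using LookupExplicitNamespace() as the example.  This is only
a sketch (the pg_temp handling and other details are omitted), and the
new parameter name is just a placeholder for discussion:

Oid
LookupExplicitNamespace(const char *nspname, bool missing_ok,
                        bool schema_access_violation_ok)
{
    Oid         namespaceId;
    AclResult   aclresult;

    namespaceId = get_namespace_oid(nspname, missing_ok);
    if (missing_ok && !OidIsValid(namespaceId))
        return InvalidOid;

    aclresult = pg_namespace_aclcheck(namespaceId, GetUserId(), ACL_USAGE);
    if (aclresult != ACLCHECK_OK)
    {
        if (schema_access_violation_ok)
            return InvalidOid;  /* treat the schema as if it were missing */
        aclcheck_error(aclresult, OBJECT_SCHEMA, nspname);
    }

    return namespaceId;
}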

>   regards, tom lane
> 

Best Regards,

-- 
Takuma Hoshiai 





Unused header file inclusion

2019-07-30 Thread vignesh C
Hi,

I noticed that there are many header files being included which need
not be included.  I have tried this in a few files and found that
compilation and the regression tests still work.  I have attached the
patch for the files that I tried.  I tried this on CentOS; I did not
find the header files to be platform specific.
Should we pursue this further and clean up all the files?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


unused_header_file_exclusion.patch
Description: Binary data


Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-07-30 Thread Amit Langote
On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita  wrote:
> On Fri, Jul 19, 2019 at 10:44 PM Robert Haas  wrote:
> > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita  
> > wrote:
> > > I.e., partition_bounds_merge() is performed for each pair of input
> > > partitioned relations for a join relation in try_partitionwise_join().
> > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> > > think this is acceptable; ISTM that some redesign is needed to avoid
> > > this.  I'm wondering that once we successfully merged partition bounds
> > > from a pair of input partitioned relations for the join relation, by
> > > using the merged partition bounds, we could get the lists of matching
> > > to-be-joined partitions for subsequent pairs of input partitioned
> > > relations for the join relation in a more efficient way than by
> > > performing partition_bounds_merge() as proposed in the patch.
> >
> > I don't know whether partition_bounds_merge() is well-implemented; I
> > haven't looked.
>
> My concern about that is list partitioning.  In that case that
> function calls partition_list_bounds_merge(), which generates the
> partition bounds for a join relation between two given input
> relations, by performing merge join for a pair of the datums arrays
> from both the input relations.

I had similar thoughts upon seeing that partition_bounds_merge() will
be replacing the current way of determining whether partition-wise join
can occur, namely that it will make the handling of currently supported
cases more expensive.

The current way is to compare the PartitionBoundInfos of joining
relations using partition_bounds_equal(), and if equal, simply join
the pairs of matching partitions if the join quals permit doing so.
There's no need to do anything extra to determine which partitions to
join with each other, because it's already established.  Likewise,
partition_bounds_merge() shouldn't have to do anything extra in that
case.  That is, for the cases that are already supported, we should
find a way to make partition_bounds_merge() only as expensive as just
performing partition_bounds_equal(), or maybe just slightly more.
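
Roughly, the kind of shortcut I mean looks like this (a backend-style
sketch only; partition_bounds_equal() is the existing function, while
the signature of partition_bounds_merge() below is just a guess at the
patch's interface, with the output lists assumed to hold partition
indexes):

#include "postgres.h"

#include "nodes/pathnodes.h"
#include "partitioning/partbounds.h"

/* assumed (guessed) interface of the patch's partition_bounds_merge() */
extern PartitionBoundInfo partition_bounds_merge(RelOptInfo *outer_rel,
                                                 RelOptInfo *inner_rel,
                                                 List **outer_parts,
                                                 List **inner_parts);

static PartitionBoundInfo
merge_or_reuse_bounds(RelOptInfo *outer_rel, RelOptInfo *inner_rel,
                      PartitionScheme part_scheme,
                      List **outer_parts, List **inner_parts)
{
    if (outer_rel->nparts == inner_rel->nparts &&
        partition_bounds_equal(part_scheme->partnatts,
                               part_scheme->parttyplen,
                               part_scheme->parttypbyval,
                               outer_rel->boundinfo,
                               inner_rel->boundinfo))
    {
        int         i;

        /* identical bounds: partition i joins partition i on both sides */
        for (i = 0; i < outer_rel->nparts; i++)
        {
            *outer_parts = lappend_int(*outer_parts, i);
            *inner_parts = lappend_int(*inner_parts, i);
        }
        return outer_rel->boundinfo;    /* nothing new needs to be built */
    }

    /* otherwise fall back to the full merge of the datums arrays */
    return partition_bounds_merge(outer_rel, inner_rel,
                                  outer_parts, inner_parts);
}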

Thanks,
Amit




Re: Unused struct member in pgcrypto pgp.c

2019-07-30 Thread vignesh C
On Tue, Jul 30, 2019 at 9:19 PM Daniel Gustafsson  wrote:
>
> Hi,
>
> In contrib/pgcrypto/pgp.c we have a struct member int_name in digest_info which
> isn’t used, and seems to have never been used (a potential copy/pasteo from the
> cipher_info struct?).  Is there a reason for keeping this, or can it be removed
> as per the attached?
>
Agreed.
It seems the member is not used anywhere; only the code and name members
are used in the digest lookup.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 10:19:45PM -0700, Andres Freund wrote:
> I don't really have a problem fixing this case if we think it's
> useful. But I'm a bit bothered by all the "fixes" being submitted that
> don't matter for PG itself. They do eat up resources.

Sure.  In this particular case, though, we can simplify at least one
code path in the backend for temporary path creation.  Such a cleanup
sounds like a sufficient argument to me.

> And sorry, adding in-backend threading to test testing mkdir_p doesn't
> make me inclined to believe that this is all well considered.  There's
> minor issues like us not supporting threads in the backend, pthread not
> being portable, and also it being entirely out of proportion to the
> issue.

Agreed on this one.  The test case may be useful for the purpose of
testing the presented patches, but it does not have enough value to be
merged.
--
Michael




Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Andres Freund
Hi,

On 2019-07-31 13:48:23 +0900, Michael Paquier wrote:
> On Tue, Jul 30, 2019 at 09:11:44PM -0700, Andres Freund wrote:
> > Hm. I'm not really seing much of a point in making mkdir_p safe against
> > all of this. What's the scenario for pg where this matters? I assume
> > you're using it for somewhat different purposes, and that's why it is
> > problematic for you?
> 
> I don't see why it is a problem to make our APIs more stable if we
> have ways to do so.

Well, wanting to support additional use cases often isn't free. There's
additional code for the new use case, there's review & commit time,
there's additional test time, there are bug fixes for the new code, etc.

We're not developing a general application support library...

I don't really have a problem fixing this case if we think it's
useful. But I'm a bit bothered by all the "fixes" being submitted that
don't matter for PG itself. They do eat up resources.


And sorry, adding in-backend threading to test testing mkdir_p doesn't
make me inclined to believe that this is all well considered.  There's
minor issues like us not supporting threads in the backend, pthread not
being portable, and also it being entirely out of proportion to the
issue.

Greetings,

Andres Freund




Re: tap tests driving the database via psql

2019-07-30 Thread Craig Ringer
On Wed, 31 Jul 2019 at 10:20, Andres Freund  wrote:

> Hi,
>
> On 2019-07-31 09:32:10 +0800, Craig Ringer wrote:
> > OK. So rather than building our own $everything from scratch, let's look at
> > solving that.
>
> IDK, a minimal driver that just does what we need it to do is a few
> hundred lines, not more. And there's plenty of stuff that we simply
> won't be able to test with any driver that's not purposefully written
> for testing. There's e.g. basically no way with any of the drivers to
> test intentionally bogus sequences of protocol messages, yet that's
> something we really ought to test.
>
> I've written custom hacks^Wprograms to test things like that a few
> times. That really shouldn't be necessary.
>
>
That's a good point. I've had to write simple protocol hacks myself, and
having a reusable base tool for it would indeed be valuable.

I'm just worried it'll snowball into Yet Another Driver We Don't Need, or
land up as a half-finished poorly-maintained burden nobody wants to touch.
Though the general fondness for and familiarity with Perl in the core
circle of Pg contributors and committers would probably help here, since
it'd inevitably land up being written in Perl...

I'm interested in the evolution of the protocol and ways to enhance it,
and I can see something we can actively use for protocol testing being
valuable.  If done right, the protocol implementation could probably
pass through and inspect messages from regular libpq etc., as well as
serve as either a client or even as a dumb server simulator replaying
pregenerated responses for client testing.

We certainly couldn't do anything like that with existing tools and by
reusing existing drivers.

I wonder if there's any way to make writing and maintaining it less painful
though. Having to make everything work with Perl 5.8.8 and no non-core
modules leads to a whole pile of bespoke code and reinvention.


> > Perl modules support local installations. Would it be reasonable to require
> > DBI to be present, then rebuild DBD::Pg against our libpq during our own
> > test infra compilation, and run it with a suitable LD_LIBRARY_PATH or using
> > rpath? That'd actually get us a side-benefit of some test coverage of
> > DBD::Pg and libpq.
>
> You're intending to download DBD::Pg?


That'd be a reasonable option IMO, yes. Either via git with https, or a
tarball where we check a signature. So long as it can use any pre-existing
local copy without needing to hit the Internet, and the build can also
succeed without the presence of it at all, I think that'd be reasonable
enough.

But I'm probably getting contaminated by excessive exposure to Apache Maven
when I have to work with Java. I'm rapidly getting used to downloading
things being a part of builds. Personally I'd expect that unless
specifically testing new libpq functionality etc, most people would just be
using their system DBD::Pg... and I consider linking the system DBD::Pg
against our new-compiled libpq more feature than bug, as if we retain the
same soname we should be doing a reasonable job of not breaking upward
compatibility anyway. (It'd potentially be an issue if you have a very new
system libpq and are running tests on an old postgres, I guess).


> Or how would you get around the licensing issues? Downloading it will have
> some issues too: For one at least I
> often want to be able to run tests when offline; there's security
> concerns too.
>

Roughly what I was thinking of was:

For Pg source builds (git, or dist tarball), we'd generally curl a tarball
of DBD::Pg from an HTTPS URL specified in Makefile variables (with ?= so it
can be overridden to point to an alt location, different version, local
path, etc.) and unpack it, then build it. The makefile can also store a
signing-key fingerprint so we can download the sig file and check that the
sig is by the expected signer if the user has imported the signing key for
DBD::Pg.

We could test if the target directory already exists and is populated and
re-use it, so people can git-clone DBD::Pg if they prefer.

And we'd allow users to specify --with-system-dbd-pg at configure time, or
--without-dbd-pg .

The Pg perl libraries for our TAP test harness/wrappers would offer a
simple function to skip a test if DBD::Pg is missing, a simple function to
skip a test if the loaded DBD::Pg lacks $some_feature_or_attribute, etc.






>
> Greetings,
>
> Andres Freund
>


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Ning Yu
On Wed, Jul 31, 2019 at 12:41 PM Michael Paquier  wrote:
>
> On Wed, Jul 31, 2019 at 12:26:30PM +0800, Ning Yu wrote:
> > Could I double confirm with you that you made a clean rebuild after
> > applying the patches?  pg_mkdir_p() is compiled as part of libpgport.a,
> > and the postgres makefile will not relink the initdb binary
> > automatically, for myself I must 'make clean' and 'make' to ensure
> > initdb gets relinked.
>
> For any patch I test, I just do a "git clean -d -x -f" before building
> as I switch a lot across stable branches as well.  It looks that you
> are right on this one though, I have just rebuilt from scratch and I
> don't see the failures anymore.

Cool, glad to know that it works.

Best Regards
Ning

> --
> Michael




Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 09:11:44PM -0700, Andres Freund wrote:
> Hm. I'm not really seing much of a point in making mkdir_p safe against
> all of this. What's the scenario for pg where this matters? I assume
> you're using it for somewhat different purposes, and that's why it is
> problematic for you?

I don't see why it is a problem to make our APIs more stable if we
have ways to do so.  I actually fixed one recently as of 754b90f for a
problem that involved a tool linking to our version of readdir() that
we ship.  Even with that, the retries for mkdir() on the base
directory in PathNameCreateTemporaryDir() are basically caused by that
same limitation with the parent paths from this report, no?  So we
could actually remove the dependency on the base directory in this
code path and just rely on pg_mkdir_p() to do the right thing for all
the parent paths.  That's also a point raised by Ning upthread.
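
To illustrate the property I would expect from pg_mkdir_p() for each
path component (losing the mkdir() race is fine as long as the component
ends up being a directory), here is a minimal standalone sketch; it is
only an illustration, not the proposed patch:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

static int
mkdir_p_tolerant(char *path, mode_t mode)
{
    char       *sep = path;

    for (;;)
    {
        sep = strchr(sep + 1, '/');
        if (sep)
            *sep = '\0';        /* temporarily cut the path at this component */

        if (mkdir(path, mode) != 0)
        {
            struct stat st;

            /* Losing the race (EEXIST) is fine if it really is a directory. */
            if (errno != EEXIST || stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
            {
                if (sep)
                    *sep = '/';
                return -1;
            }
        }

        if (sep == NULL)
            return 0;           /* the full path now exists */
        *sep = '/';             /* restore and continue with the next component */
    }
}

int
main(void)
{
    char        path[] = "a/b/c/d";

    if (mkdir_p_tolerant(path, 0700) != 0)
        perror("mkdir_p_tolerant");
    return 0;
}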
--
Michael




Re: POC: Cleaning up orphaned files using undo logs

2019-07-30 Thread Amit Kapila
On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro  wrote:
>
> Hi Amit
>
> I've been testing some undo worker workloads (more on that soon),
>

One small point: there is a small bug in the error queues, namely that
the element pushed into the error queue doesn't have an updated value of
to_urec_ptr, which is needed to construct the hash key.  This leads the
undo launcher/worker to think that the action for it has already been
processed, so it removes it from the hash table.  I have a fix for this
which I will share in the next version of the patch (which I am going to
post in the next day or two).

>  but
> here's a small thing: I managed to reach an LWLock self-deadlock in
> the undo worker launcher:
>

I can see the problem; I will fix it in the next version.

Thank you for reviewing and testing this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Michael Paquier
On Wed, Jul 31, 2019 at 12:26:30PM +0800, Ning Yu wrote:
> Could I double confirm with you that you made a clean rebuild after
> applying the patches?  pg_mkdir_p() is compiled as part of libpgport.a,
> and the postgres makefile will not relink the initdb binary
> automatically, for myself I must 'make clean' and 'make' to ensure
> initdb gets relinked.

For any patch I test, I just do a "git clean -d -x -f" before building,
as I switch a lot across stable branches as well.  It looks like you
are right on this one though; I have just rebuilt from scratch and I
don't see the failures anymore.
--
Michael




Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Ning Yu
On Wed, Jul 31, 2019 at 12:11 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-07-18 16:17:22 +0800, Ning Yu wrote:
> > This seems buggy as it first checks the existence of the dir and makes the
> > dir if it does not exist yet, however when executing concurrently a
> > possible race condition can be as below:
> >
> > A: does a/ exists? no
> > B: does a/ exists? no
> > A: try to create a/, succeed
> > B: try to create a/, failed as it already exists
>
> Hm. I'm not really seing much of a point in making mkdir_p safe against
> all of this. What's the scenario for pg where this matters? I assume
> you're using it for somewhat different purposes, and that's why it is
> problematic for you?

Yes, you are right; postgres itself might not run into this kind of
race condition issue.  The problem we encountered was in a downstream
product of postgres, where multiple postgres clusters are deployed on
the same machine with common parent dirs.

Best Regards
Ning

>
> Greetings,
>
> Andres Freund




Re: SegFault on 9.6.14

2019-07-30 Thread Tom Lane
Amit Kapila  writes:
> On Wed, Jul 31, 2019 at 12:05 AM Robert Haas  wrote:
>> The other option is to do
>> what I understand Amit and Thomas to be proposing, which is to do a
>> better job identifying the case where we're "done for good" and can
>> trigger the shutdown fearlessly.

> Yes, this sounds safe fix for back-branches.

Actually, my point was exactly that I *didn't* think that would be a
safe fix for the back branches --- at least, not unless you're okay
with a very conservative and hence resource-leaky method for deciding
when it's safe to shut down sub-nodes.

We could do something involving (probably) adding new eflags bits to
pass this sort of info down to child plan nodes.  But that's going
to require design and coding, and it will not be backwards compatible.
At least not from the point of view of any extension that's doing
anything in that area.

regards, tom lane




Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Ning Yu
On Wed, Jul 31, 2019 at 12:04 PM Michael Paquier  wrote:
>
> On Tue, Jul 30, 2019 at 06:22:59PM +0800, Ning Yu wrote:
> > In fact personally I'm thinking that whether could we replace all uses of
> > MakePGDirectory() with pg_mkdir_p(), so we could simplify
> > TablespaceCreateDbspace() and PathNameCreateTemporaryDir() and other callers
> > significantly.
>
> I would still keep the wrapper, but I think that as a final result we
> should be able to get the code in PathNameCreateTemporaryDir() shaped
> in such a way that there are no multiple attempts at calling
> MakePGDirectory() on EEXIST.  This has been introduced by dc6c4c9d to
> allow sharing temporary files between backends, which is rather recent
> but a fixed set of two retries is not a deterministic method of
> resolution.
>
> > Well, we should have fixed problem 2, this is our major purpose of the patch
> > 0001, it performs sanity check with stat() after mkdir() at each part of the
> > path.
>
> I just reuse the script presented at the top of the thread with n=2,
> and I get that:
> pgdata/logs/1.log:creating directory
> pgdata/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/1
> ... initdb: error: could not create directory "pgdata/a": File exists

Could I double-check with you that you did a clean rebuild after
applying the patches?  pg_mkdir_p() is compiled as part of libpgport.a,
and the postgres makefile will not relink the initdb binary
automatically; I myself must run 'make clean' and 'make' to ensure
initdb gets relinked.

>
> But the result expected is that all the paths should be created with
> no complains about the parents existing, no?  This reproduces on my
> Debian box 100% of the time, for different sub-paths.  So something
> looks wrong in your solution.  The code comes originally from FreeBSD,
> how do things happen there.  Do we get failures if doing something
> like that?  I would expect this sequence to not fail:
> for i in `seq 1 100`; do mkdir -p b/c/d/f/g/h/j/$i; done
> --
> Michael




Re: tap tests driving the database via psql

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 07:20:53PM -0700, Andres Freund wrote:
> IDK, a minimal driver that just does what we need it to do is a few
> hundred lines, not more. And there's plenty of stuff that we simply
> won't be able to test with any driver that's not purposefully written
> for testing. There's e.g. basically no way with any of the drivers to
> test intentionally bogus sequences of protocol messages, yet that's
> something we really ought to test.

Yes, I don't know if a few hundred lines is the right measure, but we
really should avoid bloating the stuff we rely on, for example by
fetching an independent driver and then not using most of its features.
300~500 lines would be fine for me, 10k definitely not.

> I've written custom hacks^Wprograms to test things like that a few
> times. That really shouldn't be necessary.

Mine are usually plain hacks :), usually on top of libpq, but I have
also done some python-ish scripts to send bogus and hardcoded protocol
messages.  If we can automate this area a bit, that could be useful.

>> Perl modules support local installations. Would it be reasonable to require
>> DBI to be present, then rebuild DBD::Pg against our libpq during our own
>> test infra compilation, and run it with a suitable LD_LIBRARY_PATH or using
>> rpath? That'd actually get us a side-benefit of some test coverage of
>> DBD::Pg and libpq.
> 
> You're intending to download DBD::Pg? Or how would you get around the
> licensing issues? Downloading it will have some issues too: For one at least I
> often want to be able to run tests when offline; there's security
> concerns too.

As Craig has mentioned, CPAN can be configured to install all the
libraries for a local user, so there is no need to be online if a copy
has already been downloaded.
--
Michael




Re: POC: Cleaning up orphaned files using undo logs

2019-07-30 Thread Amit Kapila
On Tue, Jul 30, 2019 at 1:32 PM Andres Freund  wrote:
>
> Hi,
>
> Amit, short note: The patches aren't attached in patch order. Obviously
> a miniscule thing, but still nicer if that's not the case.
>

Noted, I will try to ensure that patches are in order in future posts.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Andres Freund
Hi,

On 2019-07-18 16:17:22 +0800, Ning Yu wrote:
> This seems buggy as it first checks the existence of the dir and makes the
> dir if it does not exist yet, however when executing concurrently a
> possible race condition can be as below:
> 
> A: does a/ exists? no
> B: does a/ exists? no
> A: try to create a/, succeed
> B: try to create a/, failed as it already exists

Hm. I'm not really seeing much of a point in making mkdir_p safe against
all of this. What's the scenario for pg where this matters? I assume
you're using it for somewhat different purposes, and that's why it is
problematic for you?

Greetings,

Andres Freund




Re: SegFault on 9.6.14

2019-07-30 Thread Amit Kapila
On Wed, Jul 31, 2019 at 12:05 AM Robert Haas  wrote:
>
> On Thu, Jul 18, 2019 at 9:45 AM Tom Lane  wrote:
> > I think this is going in the wrong direction.  Nodes should *always*
> > assume that a rescan is possible until ExecEndNode is called.
> > If you want to do otherwise, you are going to be inventing a whole
> > bunch of complicated and doubtless-initially-buggy control logic
> > to pass down information about whether a rescan might be possible.
> > That doesn't sound like a recipe for a back-patchable fix.  Perhaps
> > we could consider redesigning the rules around REWIND in a future
> > version, but that's not where to focus the bug fix effort.
>
> So, if I can summarize how we got here, as best I understand it:
>

Thanks for the summarization.  This looks mostly correct to me.

> 0. The historic behavior of the executor is to assume it's OK to leak
> resources for the lifetime of the query.  Nodes that are executed to
> completion generally do some cleanup, but we feel free (as under
> Limit) to just stop executing a node without giving it any hint that
> it should release resources.  So a Sort may hold onto a terabyte of
> memory and an index scan may keep holding a pin even after there's no
> theoretical way of ever needing those resources again, and we just
> don't care.
>
> 1. Parallel query made that perhaps-already-shaky assumption a lot
> more problematic. Partly that's because workers are a more scarce
> and considerably heavier resource than anything else, and moreover act
> as a container for anything else, so whatever you were leaking before,
> you can now leak N times more of it, plus N processes, until the end
> of the query. However, there's a correctness reason too, which is that
> when a node has a copy in the leader and a copy in every worker, each
> copy has its own instrumentation data (startup time, run time, nloops,
> etc) and we can only fold all that together once the node is done
> executing, because it's really hard to add up a bunch of numbers
> before the numbers are done changing.  We could've made the
> instrumentation shared throughout, but if we had, we could have
> contention for updating the instrumentation data, which seems like
> it'd be bad.
>
> 2. To fix that correctness problem, we decided to try to shut down the
> node under a limit node when we're done with it (commit
> 85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this
> looks fundamentally necessary to me.  If you're going to have N
> separate copies of the instrumentation, and you want to add them up
> when you're done, then you have to decide to be done at some point;
> otherwise you don't know when to add them up, and maybe won't add them
> up at all, and then you'll be sad.  This does not mean that the exact
> timing couldn't be changed somehow, but if you want a correct
> implementation, you have to shut down Limit's sub-node after you're
> done executing it (so that you can get the instrumentation data from
> the workers after it's final) and before you start destroying DSM
> segments and stuff (so that you get the instrumentation data from the
> workers before it vanishes).
>
> 3. The aforementioned commit turned out to be buggy in at least to two
> ways, precisely because it didn't do a good enough job predicting when
> the Limit needed to be shut down.  First, there was commit
> 2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact
> that you could hit the Limit and then back up.
>

We did not miss it; rather, we decided to handle it separately because
it appears to impact some different cases as well [1][2].

>  Second, there's the
> present issue, where the Limit gets rescanned.
>
> So, given all that, if we want to adopt Tom's position that we should
> always cater to a possible rescan, then we're going to have to rethink
> the way that instrumentation data gets consolidated from workers into
> the leader in such a way that we can consolidate multiple times
> without ending up with the wrong answer.
>

The other idea we had discussed which comes closer to adopting Tom's
position was that during ExecShutdownNode, we just destroy parallel
workers, collect instrumentation data and don't destroy the parallel
context.  The parallel context could be destroyed in ExecEndNode
(ExecEndGather(Merge)) code path.  The problem with this idea is that
ExitParallelMode doesn't expect parallel context to be active.  Now,
we can either change the location of Exit/EnterParallelMode or relax
that restriction.  As mentioned above that restriction appears good to
me, so I am not in favor of changing it unless we have some other
solid way to install it.   I am not sure if this idea is better than
other approaches we are discussing.
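
For the "consolidate multiple times" direction mentioned above, one
rough standalone sketch (all names are made up and this is not from the
patch) is to track how much of each worker's counters has already been
folded into the leader and add only the delta each time, so that
repeated shutdowns become harmless:

#include <stdio.h>

#define MAX_WORKERS 8

typedef struct WorkerInstr
{
    double      ntuples;        /* worker's running total so far */
} WorkerInstr;

typedef struct LeaderInstr
{
    double      ntuples;                    /* consolidated total */
    double      folded_in[MAX_WORKERS];     /* per-worker amount already added */
} LeaderInstr;

/* Safe to call any number of times: only deltas are accumulated. */
static void
consolidate(LeaderInstr *leader, const WorkerInstr *workers, int nworkers)
{
    int         i;

    for (i = 0; i < nworkers; i++)
    {
        double      delta = workers[i].ntuples - leader->folded_in[i];

        leader->ntuples += delta;
        leader->folded_in[i] = workers[i].ntuples;
    }
}

int
main(void)
{
    WorkerInstr workers[2] = {{100.0}, {200.0}};
    LeaderInstr leader = {0};

    consolidate(&leader, workers, 2);
    consolidate(&leader, workers, 2);   /* second call adds nothing */
    printf("%.0f\n", leader.ntuples);   /* prints 300 */
    return 0;
}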

>  The other option is to do
> what I understand Amit and Thomas to be proposing, which is to do a
> better job identifying the case where we're "done for good" and can
> trigger the shutdown fearlessly.
>

Yes, this sounds like a safe fix for back-branches.  We might 

Re: Possible race condition in pg_mkdir_p()?

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 06:22:59PM +0800, Ning Yu wrote:
> In fact personally I'm thinking that whether could we replace all uses of
> MakePGDirectory() with pg_mkdir_p(), so we could simplify
> TablespaceCreateDbspace() and PathNameCreateTemporaryDir() and other callers
> significantly.

I would still keep the wrapper, but I think that as a final result we
should be able to get the code in PathNameCreateTemporaryDir() shaped
in such a way that there are no multiple attempts at calling
MakePGDirectory() on EEXIST.  This has been introduced by dc6c4c9d to
allow sharing temporary files between backends, which is rather recent
but a fixed set of two retries is not a deterministic method of
resolution.

> Well, we should have fixed problem 2, this is our major purpose of the patch
> 0001, it performs sanity check with stat() after mkdir() at each part of the
> path.

I just reused the script presented at the top of the thread with n=2,
and I got this:
pgdata/logs/1.log:creating directory
pgdata/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/1
... initdb: error: could not create directory "pgdata/a": File exists

But the expected result is that all the paths should be created with
no complaints about the parents existing, no?  This reproduces on my
Debian box 100% of the time, for different sub-paths.  So something
looks wrong in your solution.  The code originally comes from FreeBSD;
how do things happen there?  Do we get failures when doing something
like that?  I would expect this sequence not to fail:
for i in `seq 1 100`; do mkdir -p b/c/d/f/g/h/j/$i; done
--
Michael




How to retain lesser paths at add_path()?

2019-07-30 Thread Kohei KaiGai
Hello,

When we add a new path using add_path(), it checks the estimated cost
and path-keys, and it also removes dominated paths, if any.  Do we have
a reasonable way to retain these "dominated" paths?  Even if a path is
considered a lesser path at one level, it may lead to a cheaper combined
cost with an upper path node.

In my case, PG-Strom adds CustomPath nodes to support JOIN/GROUP BY
workloads that utilize GPU and NVME storage.  If GpuPreAgg and GpuJoin
are executed back to back, the output buffer of GpuJoin also serves as
the input buffer of GpuPreAgg in GPU device memory, which allows us to
avoid a senseless DMA transfer between GPU and CPU/RAM.  This behavior
affects cost estimation during the path construction steps: GpuPreAgg
discounts the DMA cost if its input path is a GpuJoin.
On the other hand, it looks to me like add_path() does not consider
upper-level optimization other than sorting path-keys.  As long as we
can keep these "lesser" path nodes that have a further optimization
chance, it will help in building more reasonable query plans.
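
To make the discount concrete, roughly this kind of coupling is involved
(a backend-style sketch only, not the actual PG-Strom code; the function
name and the constants below are made up):

#include "postgres.h"

#include <string.h>

#include "nodes/extensible.h"
#include "nodes/pathnodes.h"

/* made-up numbers, for illustration only */
static const double dma_cost_per_page = 4.0;
static const double tuples_per_page = 100.0;

static Cost
gpupreagg_input_dma_cost(Path *input_path)
{
    /* If the input is our own GpuJoin, its result already sits in GPU memory. */
    if (IsA(input_path, CustomPath) &&
        strcmp(((CustomPath *) input_path)->methods->CustomName, "GpuJoin") == 0)
        return 0.0;

    /* Otherwise the input rows must first be DMA-transferred to the device. */
    return dma_cost_per_page * (input_path->rows / tuples_per_page);
}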

Do we have any reasonable way to retain these paths at add_path(), even
if they are dominated by other paths?  Any ideas are welcome.

Best regards,

[*] GpuJoin + GpuPreAgg combined GPU kernel
https://www.slideshare.net/kaigai/20181016pgconfeussd2gpumulti/13
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-07-30 Thread Peter Geoghegan
On Tue, Jul 30, 2019 at 8:07 PM Melanie Plageman
 wrote:
> For the actual write to disk, I'm pretty sure I get that for free from
> the BufFile API, no?
> I was more thinking about optimizing when I call BufFileWrite at all.

Right. Clearly several existing buffile.c users regularly have very
small BufFileWrite() size arguments. tuplestore.c, for one.

-- 
Peter Geoghegan




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-07-30 Thread Melanie Plageman
On Tue, Jul 30, 2019 at 4:36 PM Robert Haas  wrote:

> On Tue, Jul 30, 2019 at 2:47 PM Melanie Plageman
>  wrote:
> > I did the "needlessly dumb implementation" Robert mentioned, though,
> > I thought about it and couldn't come up with a much smarter way to
> > write match bits to a file. I think there might be an optimization
> > opportunity in not writing the current_byte to the file each time that
> > the outer tuple matches and only doing this once we have advanced to a
> > tuple number that wouldn't have its match bit in the current_byte. I
> > didn't do that to keep it simple, and, I suspect there might be a bit
> > of gymnastics needed to make sure that that byte is actually written
> > to the file in case we exit from some other state before we encounter
> > the tuple represented in the last bit in that byte.
>
> I mean, I was assuming we'd write in like 8kB blocks or something.
> Doing it a byte at a time seems like it'd produce way too many
> syscalls.
>
>
For the actual write to disk, I'm pretty sure I get that for free from
the BufFile API, no?
I was more thinking about optimizing when I call BufFileWrite at all.

-- 
Melanie Plageman


Re: Runtime pruning problem

2019-07-30 Thread Amit Langote
On Wed, Jul 31, 2019 at 8:31 AM Tom Lane  wrote:
> David Rowley  writes:
> > On Wed, 31 Jul 2019 at 10:56, Tom Lane  wrote:
> >> The portion of this below the Append is fine, but I argue that
> >> the Vars above the Append should say "part", not "part_p1".
> >> In that way they'd look the same regardless of which partitions
> >> have been pruned or not.
>
> > That seems perfectly reasonable for Append / MergeAppend that are for
> > scanning partitioned tables. What do you propose we do for inheritance
> > and UNION ALLs?
>
> For inheritance, I don't believe there would be any change, precisely
> because we've historically used the parent rel as reference.

I may be missing something, but Vars above an Append/MergeAppend,
whether it's scanning a partitioned table or a regular inheritance
table, always refer to the first child subplan (which may or may not be
the inheritance parent in its role as a child), not to the Append
parent.

create table parent (a int);
alter table only parent add check (a = 1) no inherit;
create table child1 (a int check (a = 2)) inherits (parent);
create table child2 (a int check (a = 3)) inherits (parent);

explain (costs off, verbose) select * from parent where a > 1 order by 1;
  QUERY PLAN
───
 Sort
   Output: child1.a
   Sort Key: child1.a
   ->  Append
 ->  Seq Scan on public.child1
   Output: child1.a
   Filter: (child1.a > 1)
 ->  Seq Scan on public.child2
   Output: child2.a
   Filter: (child2.a > 1)
(10 rows)

I think this is because we replace the original targetlist of such
nodes by a dummy one using set_dummy_tlist_references(), where all the
parent Vars are re-stamped with OUTER_VAR as varno.  When actually
printing the EXPLAIN VERBOSE output, ruleutils.c considers the first
child of Append as the OUTER referent, as set_deparse_planstate()
states:

/*
 * We special-case Append and MergeAppend to pretend that the first child
 * plan is the OUTER referent; we have to interpret OUTER Vars in their
 * tlists according to one of the children, and the first one is the most
 * natural choice.

If I change set_append_references() to comment out the
set_dummy_tlist_references() call, I get this output:

explain (costs off, verbose) select * from parent where a > 1 order by 1;
  QUERY PLAN
───
 Sort
   Output: a
   Sort Key: a
   ->  Append
 ->  Seq Scan on public.child1
   Output: child1.a
   Filter: (child1.a > 1)
 ->  Seq Scan on public.child2
   Output: child2.a
   Filter: (child2.a > 1)
(10 rows)

Not parent.a as I had expected.  That seems to be because parent's RTE
is considered unused in the plan.  One might say that the plan's
Append node belongs to that RTE, but then Append doesn't have any RT
index attached to it, so it escapes ExplainPreScanNode()'s walk of the
plan tree to collect the indexes of "used RTEs".  I changed
set_rtable_names() to get around that as follows:

@@ -3458,7 +3458,7 @@ set_rtable_names(deparse_namespace *dpns, List
*parent_namespaces,
 /* Just in case this takes an unreasonable amount of time ... */
 CHECK_FOR_INTERRUPTS();

-if (rels_used && !bms_is_member(rtindex, rels_used))
+if (rels_used && !bms_is_member(rtindex, rels_used) && !rte->inh)

and I get:

explain (costs off, verbose) select * from parent where a > 1 order by 1;
  QUERY PLAN
───
 Sort
   Output: parent.a
   Sort Key: parent.a
   ->  Append
 ->  Seq Scan on public.child1
   Output: child1.a
   Filter: (child1.a > 1)
 ->  Seq Scan on public.child2
   Output: child2.a
   Filter: (child2.a > 1)
(10 rows)

> For setops we've traditionally used the left input as reference.
> Maybe we could do better, but I'm not very sure how, since SQL
> doesn't actually provide any explicit names for the setop result.
> Making up a name with no basis in the query probably isn't an
> improvement, or at least not enough of one to justify a change.

I too am not sure what we should do about Appends of setops, but with the
above hacks, I get this:

explain (costs off, verbose) select * from child1 union all select *
from child2 order by 1;
  QUERY PLAN
───
 Sort
   Output: "*SELECT* 1".a
   Sort Key: "*SELECT* 1".a
   ->  Append
 ->  Seq Scan on public.child1
   Output: child1.a
 ->  Seq Scan on public.child2
   Output: child2.a
(8 rows)

whereas currently it prints:

explain (costs off, verbose) select * from child1 union all select *
from child2 order by 1;
  QUERY PLAN
───
 Sort
   Output: child1.a
   Sort Key: child1.a
   ->  Append
 ->  Seq Scan on public.child1
   

Re: tap tests driving the database via psql

2019-07-30 Thread Andres Freund
Hi,

On 2019-07-31 09:32:10 +0800, Craig Ringer wrote:
> OK. So rather than building our own $everything from scratch, let's look at
> solving that.

IDK, a minimal driver that just does what we need it to do is a few
hundred lines, not more. And there's plenty of stuff that we simply
won't be able to test with any driver that's not purposefully written
for testing. There's e.g. basically no way with any of the drivers to
test intentionally bogus sequences of protocol messages, yet that's
something we really ought to test.

I've written custom hacks^Wprograms to test things like that a few
times. That really shouldn't be necessary.
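
As a rough illustration of how small such a thing can be, here is a
throwaway client that sends a deliberately bogus startup packet and
dumps whatever the server answers with (the host, port and junk version
number are just placeholders):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    int         sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    char        packet[8];
    char        buf[8192];
    uint32_t    len = htonl(8);              /* length, including itself */
    uint32_t    version = htonl(0xdeadbeef); /* nonsense protocol version */
    ssize_t     n;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5432);             /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (sock < 0 || connect(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        perror("connect");
        return 1;
    }

    /* Startup packet: int32 length, int32 protocol version, no parameters. */
    memcpy(packet, &len, 4);
    memcpy(packet + 4, &version, 4);
    if (write(sock, packet, sizeof(packet)) != (ssize_t) sizeof(packet))
    {
        perror("write");
        return 1;
    }

    /* The server should answer with an ErrorResponse ('E') and close. */
    while ((n = read(sock, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t) n, stdout);

    close(sock);
    return 0;
}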


> Perl modules support local installations. Would it be reasonable to require
> DBI to be present, then rebuild DBD::Pg against our libpq during our own
> test infra compilation, and run it with a suitable LD_LIBRARY_PATH or using
> rpath? That'd actually get us a side-benefit of some test coverage of
> DBD::Pg and libpq.

You're intending to download DBD::Pg? Or how would you get around the
licensing issues? Downloading it will have some issues too: For one at least I
often want to be able to run tests when offline; there's security
concerns too.

Greetings,

Andres Freund




Re: Initdb failure

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 03:30:12PM -0400, Tom Lane wrote:
> On the whole though, I don't have a problem with the "do nothing"
> answer.  There's no security risk here, and no issue that seems
> likely to arise in actual use cases rather than try-to-break-it
> test scripts.

+1.
--
Michael




Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 03:43:58PM -0400, Tom Lane wrote:
> Well, we do not need to have a backwards-compatibility problem
> here, because we have yet to release a version containing
> log_statement_sample_rate.  I do not think it's too late to decide
> that v12's semantics for that are broken, and either revert that
> patch in v12, or back-patch a fix to make it match this idea.

With my RTM hat on, if we think that the current semantics of
log_statement_sample_rate are broken and need a redesign, then I would
take the safest path and just revert the original patch in v12, and
finally make sure that it brews correctly for v13.  We are in beta2
and close to a beta3, so redesigning things at this stage on a stable
branch sounds wrong.
--
Michael




Re: extension patch of CREATE OR REPLACE TRIGGER

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 04:44:11PM -0400, Tom Lane wrote:
> We do not test \h output in any existing regression test, and we're
> not going to start doing so in this one.  For one thing, the expected
> URL would break every time we forked off a new release branch.
> (There would surely be value in having more-than-no test coverage
> of psql/help.c, but that's a matter for its own patch, which would
> need some thought about how to cope with instability of the output.)

One way to get out of that could be some psql-level options to control
which parts of the help output show up.  The recent addition of the URL
may add more weight to doing something in this area.
--
Michael




Re: tap tests driving the database via psql

2019-07-30 Thread Craig Ringer
On Tue, 30 Jul 2019 at 21:40, Tom Lane  wrote:

> Craig Ringer  writes:
> > On Sun, 28 Jul 2019 at 03:15, Andres Freund  wrote:
> >> 1) Just depend on DBD::Pg being installed. It's fairly common, after
> >> all. It'd be somewhat annoying that we'd often end up using a
> >> different version of libpq than what we're testing against. But in
> >> most cases that'd not be particularly problematic.
>
> > I advocated for this in the past, and still think it's the best option.
>
> I think the not-same-libpq issue would be a much bigger problem than either
> of you are accounting for.


OK. So rather than building our own $everything from scratch, let's look at
solving that. I'd argue that the PostgreSQL community has enough work to do
maintaining the drivers that are fairly directly community-owned like
PgJDBC and psqlODBC, let alone writing new ones for less-in-demand
languages just for test use.

Perl modules support local installations. Would it be reasonable to require
DBI to be present, then rebuild DBD::Pg against our libpq during our own
test infra compilation, and run it with a suitable LD_LIBRARY_PATH or using
rpath? That'd actually get us a side-benefit of some test coverage of
DBD::Pg and libpq.

Note that I'm not saying this is the only or right way. Your concerns about
using a system libpq are entirely valid, as are your concerns about trying
to link to our libpq using a DBD::Pg originally built against the system
libpq. I've just been dealing with issues similar to those today and I know
how much of a hassle it can be. However, I don't think "it's hard" is a
good reason to write a whole new driver and potentially do a
Netscape/Mozilla.


> * Since we'd have to presume a possibly-very-back-rev libpq, we could
> never add tests related to any recent libpq bug fixes or new features.
>

We can feature-test. We're not dealing with pg_regress where it's
essentially impossible to do anything conditionally without blocks of
plpgsql. We're dealing with Perl and Test::More where we can make tests
conditional, we can probe the runtime environment to decide whether we can
run a given test, etc.

Also, let's compare to the status quo. Will it be worse than exec()ing psql
with a string argument and trying to do sane database access via stdio?
Definitely. Not.


> * The system libpq might have a different idea of default port
> and/or socket directory than the test build.  Yeah, we could work
> around that by forcing the choice all the time, but it would be a
> constant hazard.
>

I'd call that a minor irritation personally, as all we have to do is set up
the environment and we're done.


> * We don't have control over what else gets brought in beside libpq.
>

That's a very significant point that we must pay attention to.

Anyone who's worked with nss will probably know the pain of surprise
library conflicts arising from nss plugins being loaded unexpectedly into
the program. I still twitch thinking about libnss-ldap.

It'd make a lot of sense to capture "ldd" output and/or do a minimal run
with LD_DEBUG set during buildfarm test runs to help identify any
interesting surprises.


> Depending on the whims of the platform packagers, there might need
> to be other parts of the platform's default postgres installation,
> eg psql, sitting in one's PATH.  Again, this wouldn't be too much
> of a hazard for pre-debugged test scripts --- but it would be a huge
> hazard for developers, who do lots of things manually and would always be
> at risk of invoking the wrong psql. I learned long ago not to have any
> part of a platform's postgres packages in place on my development systems,
> and I will fight hard against any test procedure change that requires me
> to do differently.
>

TBH I think that's a habit/procedure issue. That doesn't make it invalid,
but it might not be as big a concern for everyone as for you. I have a
shell alias that sets up my environment and another that prints the current
setup. I never run into issues like this, despite often multi-tasking while
tired. Not only that, I find platform postgres extremely useful to have on
my systems to use for simple cross-version tests.

If we adopt something like I suggested above where we (re)build DBD::Pg
against our installed pg and libpq, that wouldn't be much of an issue
anyway. We'd want a nice way to set up the runtime environment in a shell
for manual testing of course, like a generated .sh we can source. But
frankly I'd find that a useful thing to have in postgres anyway. I can't
count the number of times I've wished there was an easy way to pause
pg_regress and launch a psql session against its temp instance, or do the
same with a TAP test.

> Now, none of these things are really a problem with DBD/DBI as such
> --- rather, they are reasons not to depend on a pre-packaged build
> of DBD::Pg that depends on a pre-packaged build of libpq.so.
>

Yeah. While I don't agree with your conclusion in terms of how you weigh
the priorities, I agree that all your points 

Re: Unused struct member in pgcrypto pgp.c

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 05:48:49PM +0200, Daniel Gustafsson wrote:
> In contrib/pgcrypto/pgp.c we have a struct member int_name in digest_info which
> isn’t used, and seems to have never been used (a potential copy/pasteo from the
> cipher_info struct?).  Is there a reason for keeping this, or can it be removed
> as per the attached?

I don't see one as this is not used in any logic for the digest
lookups.  So agreed and applied.  This originally comes from e94dd6a.
--
Michael




Re: concerns around pg_lsn

2019-07-30 Thread Michael Paquier
On Tue, Jul 30, 2019 at 02:22:30PM +0530, Jeevan Ladhe wrote:
> On Tue, Jul 30, 2019 at 9:42 AM Michael Paquier  wrote:
>> Agreed about making the code more defensive as you do.  I would keep
>> the initialization in check_recovery_target_lsn and pg_lsn_in_internal
>> though.  That does not hurt and makes the code easier to understand,
>> aka we don't expect an error by default in those paths.
>>
> 
> Sure, understood. I am ok with this.

I am adding Peter Eisentraut in CC as 21f428e is his commit.  I think
that the first patch is a good idea, so I would be fine to apply it,
but let's see the original committer's opinion first.
--
Michael




Re: Attached partition not considering altered column properties of root partition.

2019-07-30 Thread Amit Langote
On Wed, Jul 31, 2019 at 2:38 AM Robert Haas  wrote:
>
> On Tue, Jul 2, 2019 at 9:53 PM Amit Langote  wrote:
> > Thanks for the report.  This seems like a bug.  Documentation claims
> > that the child tables inherit column storage options from the parent
> > table.  That's actually enforced in only some cases.
>
> I realize I'm just repeating the same argument I've already made
> before on other related topics, but I don't think this is a bug.
> Inherited-from-parent is not the same as
> enforced-to-always-be-the-same-as-parent.  Note that this is allowed,
> changing only the child:
>
> rhaas=# create table foo (a int, b text) partition by range (a);
> CREATE TABLE
> rhaas=# create table foo1 partition of foo for values from (0) to (10);
> CREATE TABLE
> rhaas=# alter table foo1 alter column b set storage plain;
> ALTER TABLE
>
> As is this, changing only the parent:
>
> rhaas=# alter table only foo1 alter column b set storage plain;
> ALTER TABLE
>
> How can you possibly argue that the intended behavior is
> everything-always-matches when that's not what's documented and
> there's absolutely nothing that tries to enforce it?

You're right.  The patch as proposed is barely enough to ensure the
everything-always-matches behavior.  Let's update^H^H^H^H^H^H^H forget
about the patch. :)

I do remember that we made a list of things that we decided must match
in all partitions, which ended up being slightly bigger than the same
list for regular inheritance children, but much smaller than the list
of all the things that could be different among children.  I forgot we
did that when replying to Prabhat's report.  In this particular case,
I do have doubts about whether we really need attstorage to be the
same in all the children, so maybe I should've first asked why we
should think of this as a bug.

> I'm getting really tired of people thinking that they can invent rules
> for table partitioning that were (1) never intended by the original
> implementation and (2) not even slightly enforced by the code, and
> then decide that those behavior changes should not only be made but
> back-patched.  That is just dead wrong.  There is no problem here.
> There is no need to change ANYTHING.  There is even less need to do it
> in the back branches.

OK, I'm withdrawing my patch.

Thanks,
Amit




Re: Adding column "mem_usage" to view pg_prepared_statements

2019-07-30 Thread David Fetter
On Tue, Jul 30, 2019 at 10:01:09PM +, Daniel Migowski wrote:
> Hello,
> 
> Will my patch be considered for 12.0? The calculation of the
> mem_usage value might be improved later on but because the system
> catalog is changed I would love to add it before 12.0 becomes
> stable. 

Feature freeze was April 8, so no.

https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg37059.html

You're in plenty of time for 13, though!

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: pgbench - implement strict TPC-B benchmark

2019-07-30 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Jul 30, 2019 at 3:00 PM Tom Lane  wrote:
>> I'm also highly dubious about labeling this script "standard TPC-B",
>> when it resolves only some of the reasons why our traditional script
>> is not really TPC-B.  That's treading on being false advertising.

> IANAL, but it may not even be permissible to claim that we have
> implemented "standard TPC-B".

Yeah, very likely you can't legally say that unless the TPC
has certified your test.  (Our existing code and docs are quite
careful to call pgbench's version "TPC-like" or similar weasel
wording, and never claim that it is really TPC-B or even a close
approximation.)

regards, tom lane




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-07-30 Thread Robert Haas
On Tue, Jul 30, 2019 at 2:47 PM Melanie Plageman
 wrote:
> I did the "needlessly dumb implementation" Robert mentioned, though,
> I thought about it and couldn't come up with a much smarter way to
> write match bits to a file. I think there might be an optimization
> opportunity in not writing the current_byte to the file each time that
> the outer tuple matches and only doing this once we have advanced to a
> tuple number that wouldn't have its match bit in the current_byte. I
> didn't do that to keep it simple, and, I suspect there might be a bit
> of gymnastics needed to make sure that that byte is actually written
> to the file in case we exit from some other state before we encounter
> the tuple represented in the last bit in that byte.

I mean, I was assuming we'd write in like 8kB blocks or something.
Doing it a byte at a time seems like it'd produce way too many
syscalls.
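
As a standalone illustration of the buffering I mean (plain stdio
standing in for the BufFile layer; all of the names here are made up):
accumulate the match bits in an 8kB block and only write when the block
fills up, plus one final write for the last partial block.

#include <stdio.h>
#include <string.h>

#define MATCH_BLOCK_SIZE 8192   /* bytes per write, i.e. 65536 match bits */

typedef struct MatchBitWriter
{
    FILE       *file;
    unsigned char block[MATCH_BLOCK_SIZE];
    unsigned long first_tuple;  /* tuple number of bit 0 of block[0] */
} MatchBitWriter;

static void
matchbits_flush(MatchBitWriter *w)
{
    /* one block-sized write per 64k outer tuples instead of one per byte */
    fwrite(w->block, 1, sizeof(w->block), w->file);
    memset(w->block, 0, sizeof(w->block));
    w->first_tuple += MATCH_BLOCK_SIZE * 8UL;
}

/* Record a match; callers must present tuple numbers in increasing order. */
static void
matchbits_set(MatchBitWriter *w, unsigned long tupleno)
{
    unsigned long bit;

    while (tupleno >= w->first_tuple + MATCH_BLOCK_SIZE * 8UL)
        matchbits_flush(w);     /* intervening blocks are all-unmatched */

    bit = tupleno - w->first_tuple;
    w->block[bit / 8] |= (unsigned char) (1 << (bit % 8));
}

int
main(void)
{
    MatchBitWriter w = {0};

    w.file = tmpfile();
    matchbits_set(&w, 3);
    matchbits_set(&w, 70000);   /* forces one full-block flush first */
    matchbits_flush(&w);        /* final block; a real version would trim it */
    fclose(w.file);
    return 0;
}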

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgbench - implement strict TPC-B benchmark

2019-07-30 Thread Peter Geoghegan
On Tue, Jul 30, 2019 at 3:00 PM Tom Lane  wrote:
> TBH, I think we should reject this patch.  Nobody cares about TPC-B
> anymore, and they care even less about differences between one
> sort-of-TPC-B test and another sort-of-TPC-B test.  (As the lack
> of response on this thread shows.)  We don't need this kind of
> baggage in pgbench; it's got too many "features" already.

+1. TPC-B was officially made obsolete in 1995.

> I'm also highly dubious about labeling this script "standard TPC-B",
> when it resolves only some of the reasons why our traditional script
> is not really TPC-B.  That's treading on being false advertising.

IANAL, but it may not even be permissible to claim that we have
implemented "standard TPC-B".

-- 
Peter Geoghegan




Re: Runtime pruning problem

2019-07-30 Thread Tom Lane
David Rowley  writes:
> On Wed, 31 Jul 2019 at 10:56, Tom Lane  wrote:
>> The portion of this below the Append is fine, but I argue that
>> the Vars above the Append should say "part", not "part_p1".
>> In that way they'd look the same regardless of which partitions
>> have been pruned or not.

> That seems perfectly reasonable for Append / MergeAppend that are for
> scanning partitioned tables. What do you propose we do for inheritance
> and UNION ALLs?

For inheritance, I don't believe there would be any change, precisely
because we've historically used the parent rel as reference.

For setops we've traditionally used the left input as reference.
Maybe we could do better, but I'm not very sure how, since SQL
doesn't actually provide any explicit names for the setop result.
Making up a name with no basis in the query probably isn't an
improvement, or at least not enough of one to justify a change.

regards, tom lane




Re: Runtime pruning problem

2019-07-30 Thread David Rowley
On Wed, 31 Jul 2019 at 10:56, Tom Lane  wrote:
>
> OK, so experimenting, I see that it is a change: HEAD does
>
> regression=# explain verbose select * from part order by a;
>QUERY PLAN
> -
>  Sort  (cost=362.21..373.51 rows=4520 width=8)
>Output: part_p1.a, part_p1.b
>Sort Key: part_p1.a
>->  Append  (cost=0.00..87.80 rows=4520 width=8)
>  ->  Seq Scan on public.part_p1  (cost=0.00..32.60 rows=2260 width=8)
>Output: part_p1.a, part_p1.b
>  ->  Seq Scan on public.part_p2_p1  (cost=0.00..32.60 rows=2260 width=8)
>Output: part_p2_p1.a, part_p2_p1.b
> (8 rows)
>
> The portion of this below the Append is fine, but I argue that
> the Vars above the Append should say "part", not "part_p1".
> In that way they'd look the same regardless of which partitions
> have been pruned or not.

That seems perfectly reasonable for Append / MergeAppend that are for
scanning partitioned tables. What do you propose we do for inheritance
and UNION ALLs?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Runtime pruning problem

2019-07-30 Thread Tom Lane
I wrote:
> This may be arguing for a change in ruleutils' existing behavior,
> not sure.  But when dealing with traditional-style inheritance,
> I've always thought that Vars above the Append were referring to
> the parent rel in its capacity as the parent, not in its capacity
> as the first child.  With new-style partitioning drawing a clear
> distinction between the parent and all its children, it's easier
> to understand the difference.

OK, so experimenting, I see that it is a change: HEAD does

regression=# explain verbose select * from part order by a;
   QUERY PLAN   
-
 Sort  (cost=362.21..373.51 rows=4520 width=8)
   Output: part_p1.a, part_p1.b
   Sort Key: part_p1.a
   ->  Append  (cost=0.00..87.80 rows=4520 width=8)
 ->  Seq Scan on public.part_p1  (cost=0.00..32.60 rows=2260 width=8)
   Output: part_p1.a, part_p1.b
 ->  Seq Scan on public.part_p2_p1  (cost=0.00..32.60 rows=2260 width=8)
   Output: part_p2_p1.a, part_p2_p1.b
(8 rows)

The portion of this below the Append is fine, but I argue that
the Vars above the Append should say "part", not "part_p1".
In that way they'd look the same regardless of which partitions
have been pruned or not.

regards, tom lane




Re: Runtime pruning problem

2019-07-30 Thread Tom Lane
David Rowley  writes:
> On Wed, 31 Jul 2019 at 10:27, Tom Lane  wrote:
>> What I had in mind was to revert 1cc29fe7c's ruleutils changes
>> entirely, so that ruleutils deals only in Plans not PlanStates.
>> Perhaps we've grown some code since then that really needs the
>> PlanStates, but what is that, and could we do it some other way?
>> I'm not thrilled with passing both of these around, especially
>> if the PlanState sometimes isn't there, meaning that no code in
>> ruleutils could safely assume it's there anyway.

> Are you not worried about the confusion that run-time pruning might
> cause if we always show the Vars from the first Append/MergeAppend
> plan node, even though the corresponding executor node might have been
> pruned?

The upper-level Vars should ideally be labeled with the append parent
rel's name anyway, no?  I think it's likely *more* confusing if those
Vars change appearance depending on which partitions get pruned or not.

This may be arguing for a change in ruleutils' existing behavior,
not sure.  But when dealing with traditional-style inheritance,
I've always thought that Vars above the Append were referring to
the parent rel in its capacity as the parent, not in its capacity
as the first child.  With new-style partitioning drawing a clear
distinction between the parent and all its children, it's easier
to understand the difference.

regards, tom lane




Re: Adding column "mem_usage" to view pg_prepared_statements

2019-07-30 Thread Daniel Migowski



On 31.07.2019 at 00:29, Tom Lane wrote:

Daniel Migowski  writes:

Ok, just have read about the commitfest thing. Is the patch OK for that? Or is 
there generally no love for a mem_usage column here? If it was, I really would 
add some memory monitoring in our app regarding this.

You should certainly put it into the next commitfest.  We might or
might not accept it, but if it's not listed in the CF we might
not remember to even review it.  (The CF app is really a to-do
list for patches ...)


OK, added it there. Thanks for your patience :).

Regards,
Daniel Migowski





Re: Runtime pruning problem

2019-07-30 Thread David Rowley
On Wed, 31 Jul 2019 at 10:27, Tom Lane  wrote:
>
> David Rowley  writes:
> > The part I wouldn't mind another set of eyes on is the ruleutils.c
> > changes.
>
> Um, sorry for not getting to this sooner.
>
> What I had in mind was to revert 1cc29fe7c's ruleutils changes
> entirely, so that ruleutils deals only in Plans not PlanStates.
> Perhaps we've grown some code since then that really needs the
> PlanStates, but what is that, and could we do it some other way?
> I'm not thrilled with passing both of these around, especially
> if the PlanState sometimes isn't there, meaning that no code in
> ruleutils could safely assume it's there anyway.

Are you not worried about the confusion that run-time pruning might
cause if we always show the Vars from the first Append/MergeAppend
plan node, even though the corresponding executor node might have been
pruned?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Adding column "mem_usage" to view pg_prepared_statements

2019-07-30 Thread Daniel Migowski

On 31.07.2019 at 00:17, Tomas Vondra wrote:

FWIW not sure what mail client you're using, but it seems to be breaking
the threads for some reason, splitting it into two - see [1].

Also, please stop top-posting. It makes it way harder to follow the
discussion.


I was using Outlook because it's my company's mail client, but I'm switching to 
Thunderbird now for communication with the list, trying to be a good 
citizen.


Regards,
Daniel Migowski





Re: AW: AW: Adding column "mem_usage" to view pg_prepared_statements

2019-07-30 Thread Tom Lane
Daniel Migowski  writes:
> Ok, just have read about the commitfest thing. Is the patch OK for that? Or 
> is there generally no love for a mem_usage column here? If it was, I really 
> would add some memory monitoring in our app regarding this.

You should certainly put it into the next commitfest.  We might or
might not accept it, but if it's not listed in the CF we might
not remember to even review it.  (The CF app is really a to-do
list for patches ...)

regards, tom lane




Re: Runtime pruning problem

2019-07-30 Thread Tom Lane
David Rowley  writes:
> The part I wouldn't mind another set of eyes on is the ruleutils.c
> changes.

Um, sorry for not getting to this sooner.

What I had in mind was to revert 1cc29fe7c's ruleutils changes
entirely, so that ruleutils deals only in Plans not PlanStates.
Perhaps we've grown some code since then that really needs the
PlanStates, but what is that, and could we do it some other way?
I'm not thrilled with passing both of these around, especially
if the PlanState sometimes isn't there, meaning that no code in
ruleutils could safely assume it's there anyway.

regards, tom lane




Re: Adding column "mem_usage" to view pg_prepared_statements

2019-07-30 Thread Tomas Vondra

On Tue, Jul 30, 2019 at 10:01:09PM +, Daniel Migowski wrote:

Hello,

Will my patch be considered for 12.0? The calculation of the mem_usage
value might be improved later on but because the system catalog is
changed I would love to add it before 12.0 becomes stable.



Nope. Code freeze date for PG12 was April 7, 2019. You're a bit late for
that, unfortunately. We're in PG13 land now.

FWIW not sure what mail client you're using, but it seems to be breaking
the threads for some reason, splitting it into two - see [1].

Also, please stop top-posting. It makes it way harder to follow the
discussion.

[1] 
https://www.postgresql.org/message-id/flat/41ED3F5450C90F4D8381BC4D8DF6BBDCF02E10B4%40EXCHANGESERVER.ikoffice.de


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





AW: AW: Adding column "mem_usage" to view pg_prepared_statements

2019-07-30 Thread Daniel Migowski
Ok, just have read about the commitfest thing. Is the patch OK for that? Or is 
there generally no love for a mem_usage column here? If it was, I really would 
add some memory monitoring in our app regarding this.

-Original Message-
From: Tom Lane  
Sent: Wednesday, July 31, 2019 00:09
To: Daniel Migowski 
Cc: Andres Freund ; pgsql-hackers@lists.postgresql.org
Subject: Re: AW: Adding column "mem_usage" to view pg_prepared_statements

Daniel Migowski  writes:
> Will my patch be considered for 12.0? The calculation of the mem_usage value 
> might be improved later on but because the system catalog is changed I would 
> love to add it before 12.0 becomes stable. 

v12 has been feature-frozen for months, and it's pretty hard to paint this as a 
bug fix, so I'd say the answer is "no".

regards, tom lane




Re: AW: Adding column "mem_usage" to view pg_prepared_statements

2019-07-30 Thread Tom Lane
Daniel Migowski  writes:
> Will my patch be considered for 12.0? The calculation of the mem_usage value 
> might be improved later on but because the system catalog is changed I would 
> love to add it before 12.0 becomes stable. 

v12 has been feature-frozen for months, and it's pretty hard to paint
this as a bug fix, so I'd say the answer is "no".

regards, tom lane




AW: Adding column "mem_usage" to view pg_prepared_statements

2019-07-30 Thread Daniel Migowski
Hello,

Will my patch be considered for 12.0? The calculation of the mem_usage value 
might be improved later on but because the system catalog is changed I would 
love to add it before 12.0 becomes stable. 

Regards,
Daniel Migowski

-Original Message-
From: Daniel Migowski  
Sent: Sunday, July 28, 2019 08:21
To: Andres Freund 
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: Adding column "mem_usage" to view pg_prepared_statements

Hello Andres,

how do you want to generalize it? Are you thinking about a view solely for the 
display of the memory usage of different objects? Like functions or views (that 
also have a plan associated with it, when I think about it)? While being 
interesting I still believe monitoring the mem usage of prepared statements is 
a bit more important than that of other objects because of how they change 
memory consumption of the server without using any DDL or configuration options 
and I am not aware of other objects with the same properties, or are there 
some? And for the other volatile objects like tables and indexes and their 
contents PostgreSQL already has its information functions. 

Regardless of that, here is the patch for now. I didn't want to fiddle too much 
with MemoryContexts yet, so it still doesn't recurse in child contexts, but I 
will change that also when I try to build a more compact MemoryContext 
implementation and see how that works out.

Thanks for pointing out the relevant information in the statement column of the 
view.

Regards,
Daniel Migowski

-Original Message-
From: Andres Freund 
Sent: Saturday, July 27, 2019 21:12
To: Daniel Migowski 
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: Adding column "mem_usage" to view pg_prepared_statements

Hi,

On 2019-07-27 18:29:23 +, Daniel Migowski wrote:
> I just implemented a small change that adds another column "mem_usage"
> to the system view "pg_prepared_statements". It returns the memory 
> usage total of CachedPlanSource.context, 
> CachedPlanSource.query_content and if available 
> CachedPlanSource.gplan.context.

FWIW, it's generally easier to comment if you actually provide the patch, even 
if it's just POC, as that gives a better handle on how much additional 
complexity it introduces.

I think this could be a useful feature. I'm not so sure we want it tied to just 
cached statements however - perhaps we ought to generalize it a bit more.


Regarding the prepared statements specific considerations: I don't think we 
ought to explicitly reference CachedPlanSource.query_content, and 
CachedPlanSource.gplan.context.

In the case of actual prepared statements (rather than oneshot plans) 
CachedPlanSource.query_context IIRC should live under CachedPlanSource.context. 
 I think there's no relevant cases where gplan.context isn't a child of 
CachedPlanSource.context either, but not quite sure.

Then we ought to just include child contexts in the memory computation (cf. 
logic in MemoryContextStatsInternal(), although you obviously wouldn't need all 
that). That way, if the cached statements has child contexts, we're going to 
stay accurate.
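
For illustration, a minimal sketch of that recursion (context_size() is a
hypothetical per-context accessor, not an existing function; firstchild and
nextchild are the child links MemoryContextStatsInternal() itself walks, so
this assumes PostgreSQL's memory-context headers):

static Size
total_context_size(MemoryContext context)
{
	/* hypothetical: bytes attributable to this context alone */
	Size		total = context_size(context);
	MemoryContext child;

	/* then add every child context, recursively */
	for (child = context->firstchild; child != NULL; child = child->nextchild)
		total += total_context_size(child);

	return total;
}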


> Also I wonder why the "prepare test as" is part of the statement 
> column. I isn't even part of the real statement that is prepared as 
> far as I would assume. Would prefer to just have the "select *..." in 
> that column.

It's the statement that was executed. Note that you'll not see that in the case 
of protocol level prepared statements.  It will sometimes include relevant 
information, e.g. about the types specified as part of the prepare (as in 
PREPARE foo(int, float, ...) AS ...).

Greetings,

Andres Freund




Re: pgbench - implement strict TPC-B benchmark

2019-07-30 Thread Tom Lane
Fabien COELHO  writes:
> [ pgbench-strict-tpcb-2.patch ]

TBH, I think we should reject this patch.  Nobody cares about TPC-B
anymore, and they care even less about differences between one
sort-of-TPC-B test and another sort-of-TPC-B test.  (As the lack
of response on this thread shows.)  We don't need this kind of
baggage in pgbench; it's got too many "features" already.

I'm also highly dubious about labeling this script "standard TPC-B",
when it resolves only some of the reasons why our traditional script
is not really TPC-B.  That's treading on being false advertising.

regards, tom lane




Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-30 Thread Thomas Munro
On Tue, Jun 25, 2019 at 6:02 AM Andres Freund  wrote:
> On 2019-06-24 10:41:06 -0700, Ashwin Agrawal wrote:
> > Proposing following changes to make predicate locking and checking
> > functions generic and remove dependency on HeapTuple and Heap AM. We
> > made these changes to help with Zedstore. I think the changes should
> > help Zheap and other AMs in general.
>
> Indeed.

+1

> > - Change PredicateLockTuple() to PredicateLockTID(). So, instead of
> >   passing HeapTuple to it, just pass ItemPointer and tuple insert
> >   transaction id if known. This was also discussed earlier in [1],
> >   don't think was done in that context but would be helpful in future
> >   if such requirement comes up as well.
>
> Right.

+1

> > - CheckForSerializableConflictIn() take blocknum instead of
> >   buffer. Currently, the function anyways does nothing with the buffer
> >   just needs blocknum. Also, helps to decouple dependency on buffer as
> >   not all AMs may have one to one notion between blocknum and single
> >   buffer. Like for zedstore, tuple is stored across individual column
> >   buffers. So, wish to have way to lock not physical buffer but
> >   logical blocknum.
>
> Hm.  I wonder if we somehow ought to generalize the granularity scheme
> for predicate locks to not be tuple/page/relation.  But even if, that's
> probably a separate patch.

What do you have in mind?  This is certainly the traditional way to do
lock hierarchies (archeological fun: we used to have
src/backend/storage/lock/multi.c, a "standard multi-level lock manager
as per the Gray paper", before commits 3f7fbf85 and e6e9e18e
decommissioned it;  SSI brought the concept back again in a parallel
lock manager), but admittedly it has assumptions about physical
storage baked into the naming.  Perhaps you just want to give those
things different labels, "TID range" instead of page, for the benefit
of "logical" TID users?  Perhaps you want to permit more levels?  That
seems premature as long as TIDs are defined in terms of blocks and
offsets, so this stuff is reflected all over the source tree...

> > - CheckForSerializableConflictOut() no more takes HeapTuple nor
> >   buffer, instead just takes xid. Push heap specific parts from
> >   CheckForSerializableConflictOut() into its own function
> >   HeapCheckForSerializableConflictOut() which calls
> >   CheckForSerializableConflictOut(). The alternative option could be
> >   CheckForSerializableConflictOut() take callback function and
> >   callback arguments, which gets called if required after performing
> >   prechecks. Though currently I fell AM having its own wrapper to
> >   perform AM specific task and then calling
> >   CheckForSerializableConflictOut() is fine.
>
> I think it's right to move the xid handling out of
> CheckForSerializableConflictOut(). But I think we also ought to move the
> subtransaction handling out of the function - e.g. zheap doesn't
> want/need that.

Thoughts on this Ashwin?

> > Attaching patch which makes these changes.
>
> Please make sure that there's a CF entry for this (I'm in a plane with a
> super slow connection, otherwise I'd check).

This all makes sense, and I'd like to be able to commit this soon.  We
came up with something nearly identical for zheap.  I'm subscribing
Kuntal Ghosh to see if he has comments, as he worked on that.  The
main point of difference seems to be the assumption about how
subtransactions work.

-- 
Thomas Munro
https://enterprisedb.com




Re: [Patch] Adding CORRESPONDING/CORRESPONDING BY to set operation

2019-07-30 Thread David Fetter
On Tue, Jul 30, 2019 at 02:43:05PM -0700, 毛瑞嘉 wrote:
> Hi,
> 
> 
> I wrote a patch for adding CORRESPONDING/CORRESPONDING BY to set operation.
> It is a task in the todo list. This is how the patch works:
> 
> I modified transformSetOperationStmt() to get an intersection target list
> which is the intersection of the target lists of the left clause and right
> clause for a set operation statement (sostmt). The intersection target list
> is calculated in transformSetOperationTree() and then I modified the target
> lists of the larg and rarg of sostmt to make them equal to the intersection
> target list. I also changed the target list in pstate->p_rtable in
> order to make it consistent with the intersection target list.
> 
> 
> I attached the scratch version of this patch to the email. I am not sure
> whether the method used in the patch is acceptable or not, but any
> suggestions are appreciated. I will add tests and other related things to
> the patch if the method used in this patch is acceptable.

Thanks for sending this!

It needs documentation and tests so people can see whether it does
what it's supposed to do.  Would you be so kind as to include those in
the next revision of the patch?  You can just attach the patch(es)
without zipping them.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Patch to document base64 encoding

2019-07-30 Thread Fabien COELHO


My 0.02 €

It seems entirely crazy that encode() and decode() are no longer in the 
same section, likewise that convert_from() and convert_to() aren't 
documented together anymore.


Awkward, yes.  But findable if you know what the categories are.

I suppose there could be 3 different categories: those that input
and output strings, those that input and output binary, and those
that convert -- inputting one data type and outputting another.


Personally, I'd be ok with having a separate "conversion function" table, 
and also with Tom's suggestion to have string functions with "only simple 
string" functions, and if any binary appears it is moved into the binary 
section.


--
Fabien.

[Patch] Adding CORRESPONDING/CORRESPONDING BY to set operation

2019-07-30 Thread 毛瑞嘉
Hi,


I wrote a patch for adding CORRESPONDING/CORRESPONDING BY to set operation.
It is a task in the todo list. This is how the patch works:


I modified transformSetOperationStmt() to get an intersection target list
which is the intersection of the target lists of the left clause and right
clause for a set operation statement (sostmt). The intersection target list
is calculated in transformSetOperationTree() and then I modified the target
lists of the larg and rarg of sostmt to make them equal to the intersection
target list. I also changed the target list in pstate->p_rtable in
order to make it consistent with the intersection target list.


I attached the scratch version of this patch to the email. I am not sure
whether the method used in the patch is acceptable or not, but any
suggestions are appreciated. I will add tests and other related things to
the patch if the method used in this patch is acceptable.



Best,

Ruijia
<>


Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-07-30 Thread Tomas Vondra

On Tue, Jul 30, 2019 at 03:43:58PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

I've started reviewing this patch, thinking that maybe I could get it
committed as it's marked as RFC. In general I agree with having this
feature, but I think we need to rethink the GUC because the current
approach is just confusing.
...
What I think we should do instead is to use two minimum thresholds.
1) log_min_duration_sample - enables sampling of commands, using the
existing GUC log_statement_sample_rate
2) log_min_duration_statement - logs all commands exceeding this
I think this is going to be much easier for users to understand.


I agree with Tomas' idea.


The one difference between those approaches is in how they work with
existing current settings. That is, let's say you have
  log_min_duration_statement = 1000
  log_statement_sample_rate = 0.01
then no queries below 1000ms will be logged, and 1% of longer queries
will be sampled. And with the original config (as proposed in v3 of the
patch), this would still work the same way.
With the new approach (two min thresholds) it'd behave differently,
because we'd log *all* queries longer than 1000ms (not just 1%). And
whether we'd sample any queries (using log_statement_sample_rate) would
depend on how we'd pick the default value for the other threshold.


Well, we do not need to have a backwards-compatibility problem
here, because we have yet to release a version containing
log_statement_sample_rate.  I do not think it's too late to decide
that v12's semantics for that are broken, and either revert that
patch in v12, or back-patch a fix to make it match this idea.



I'm willing to try fixing this to salvage the feature for v12. The
question is what that fix would look like - IMO we'd need to introduce
the new threshold GUC, essentially implementing what this thread is
about. It's not a complex patch, but it kinda flies in the face of
feature freeze. OTOH if we call it a fix ...

The patch itself is not that complicated - attached is a WIP version,
(particularly) the docs need more work.
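
For illustration, a minimal self-contained sketch of the two-threshold
decision being discussed (names and types are assumptions, not the WIP
patch's code): everything at or above log_min_duration_statement is logged,
and statements at or above log_min_duration_sample are logged with
probability log_statement_sample_rate.

#include <stdbool.h>
#include <stdlib.h>

/* duration-based logging decision for one statement, msecs in milliseconds */
static bool
should_log_duration(double msecs,
					int log_min_duration_statement,	/* -1 off, 0 log all */
					int log_min_duration_sample,	/* -1 off, 0 sample all */
					double log_statement_sample_rate)
{
	bool		exceeded_duration;
	bool		exceeded_sample_duration;
	bool		in_sample;

	exceeded_duration = (log_min_duration_statement == 0 ||
						 (log_min_duration_statement > 0 &&
						  msecs >= log_min_duration_statement));

	exceeded_sample_duration = (log_min_duration_sample == 0 ||
								(log_min_duration_sample > 0 &&
								 msecs >= log_min_duration_sample));

	in_sample = exceeded_sample_duration &&
		((double) rand() / RAND_MAX) < log_statement_sample_rate;

	return exceeded_duration || in_sample;
}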


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c91e3e1550..8cb695044b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5850,8 +5850,7 @@ local0.*/var/log/postgresql
 
  Causes the duration of each completed statement to be logged
  if the statement ran for at least the specified number of
- milliseconds, modulated by 
log_statement_sample_rate.
- Setting this to zero prints all statement durations.
+ milliseconds. Setting this to zero prints all statement durations.
  -1 (the default) disables logging statements due to
  exceeding duration threshold; for example, if you set it to
  250ms, then all SQL statements that run 250ms or
@@ -5882,6 +5881,48 @@ local0.*/var/log/postgresql

   
 
+ 
+  log_min_duration_sample (integer)
+  
+   log_min_duration_sample configuration 
parameter
+  
+  
+   
+
+ Causes the duration of each completed statement to be logged
+ if the statement ran for at least the specified number of
+ milliseconds, modulated by 
log_statement_sample_rate.
+ Setting this to zero samples all statement durations.
+ -1 (the default) disables sampling of statements;
+ for example, if you set it to 250ms, then all
+ SQL statements that run 250ms or longer will be sampled for
+ logging. Statements with duration exceeding
+ log_min_duration_statement are not subject to
+ sampling and are logged every time.
+
+
+
+ For clients using extended query protocol, durations of the Parse,
+ Bind, and Execute steps are logged independently.
+
+
+   
+
+ When using this option together with
+ ,
+ the text of statements that are logged because of
+ log_statement will not be repeated in the
+ duration log message.
+ If you are not using syslog, it is 
recommended
+ that you log the PID or session ID using
+ 
+ so that you can link the statement message to the later
+ duration message using the process ID or session ID.
+
+   
+   
+  
+
  
   log_statement_sample_rate (real)
   
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a6505c7335..ae1def0f37 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2213,7 +2213,8 @@ check_log_statement(List *stmt_list)
 /*
  * check_log_duration
  * Determine whether current command's duration should be logged.
- * If log_statement_sample_rate < 1.0, log only a sample.
+ * If log_statement_sample_rate < 1.0, 

Re: heapam_index_build_range_scan's anyvisible

2019-07-30 Thread Ashwin Agrawal
On Tue, Jul 16, 2019 at 10:22 AM Andres Freund  wrote:

> Hi,
>
> On 2019-07-11 17:27:46 -0700, Ashwin Agrawal wrote:
> > Please find attached the patch to remove IndexBuildCallback's dependency
> on
> > HeapTuple, as discussed. Changed to have the argument as ItemPointer
> > instead of HeapTuple. Other larger refactoring if feasible for
> > index_build_range_scan API itself can be performed as follow-up changes.
>
> > From f73b0061795f0c320f96ecfed0c0602ae318d73e Mon Sep 17 00:00:00 2001
> > From: Ashwin Agrawal 
> > Date: Thu, 11 Jul 2019 16:58:50 -0700
> > Subject: [PATCH v1] Remove IndexBuildCallback's dependency on HeapTuple.
> >
> > With IndexBuildCallback taking input as HeapTuple, all table AMs
> > irrespective of storing the tuples in HeapTuple form or not, are
> > forced to construct HeapTuple, to insert the tuple in Index. Since,
> > only thing required by the index callbacks is TID and not really the
> > full tuple, modify callback to only take ItemPointer.
>
> Looks good to me. Planning to apply this unless somebody wants to argue
> against it soon?
>

Andres, I didn't yet register this for the next commitfest. If it's going in
soon anyway I will not do it; otherwise let me know and I will add it to the
list.


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Bruce Momjian
On Tue, Jul 30, 2019 at 10:14:14AM -0400, Sehrope Sarkuni wrote:
> > In general it's fine to use the same IV with different keys. Only reuse
> of Key
> > + IV is a problem and the entire set of possible counter values (IV + 0,
> IV +
> > 1, ...) generated with a key must be unique. That's also why we must
> leave at
> > least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be
> filled
> > in with 0, 1, 2, ... for each 16-byte AES-block on the page. If our
> per-page IV
> > prefix used any of those bits then the counter could overflow into the
> next
> > page's IV's range.
> 
> Agreed.
> 
> Attached is an updated patch that checks only main relation forks, which
> I think are the only files we are going to encrypt, and it has better
> comments.
> 
> 
> Okay that makes sense in the context of using a single key and relying on the
> LSN based IV to be unique.

I had more time to think about the complexity of adding relfilenode to
the IV.  Since relfilenode is only unique within a database/tablespace,
we would need to have pg_upgrade preserve database/tablespace oids
(which I assume are the same as the directory and tablespace symlinks). 
Then, to decode a page, you would need to look up those values.  This is
in addition to the new complexity of CREATE DATABASE and moving files
between tablespaces.  I am also concerned that crash recovery operations
and cluster forensics and repair would need to also deal with this.

I am not even clear if pg_upgrade preserving relfilenode is possible ---
when we wrap the relfilenode counter, does it start at 1 or at the
first-user-relation-oid?  If the former, it could conflict with oids
assigned to new system tables in later major releases.  Tying the
preservation of relations to two restrictions seems risky.

Using just the page LSN and page number allows a page to be
decrypted/encrypted independently of its file name, tablespace, and
database, and I think that is a win for simplicity.  Of course, if it is
insecure we will not do it.

I am thinking for the heap/index IV, it would be:

	uint64	lsn;
	uint32	page number;
	/* only uses 11 bits for a zero-based CTR counter for 32k pages */
	uint32	counter;

and for WAL it would be:

	uint64	segment_number;
	uint32	counter;
	/* guarantees this IV doesn't match any relation IV */
	uint32	2^32-1		/* all 1's */

Anyway, these are my thoughts so far.
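
For illustration, a minimal self-contained sketch of packing those three
fields into a 16-byte AES-CTR IV for a heap/index page (the helper name and
exact byte order are assumptions, not code from the patch):

#include <stdint.h>
#include <string.h>

#define AES_IV_SIZE 16

static void
build_relation_iv(uint8_t iv[AES_IV_SIZE],
				  uint64_t lsn, uint32_t blkno, uint32_t counter)
{
	/* 8-byte LSN + 4-byte block number + 4-byte block counter = 16 bytes */
	memcpy(iv, &lsn, sizeof(lsn));
	memcpy(iv + 8, &blkno, sizeof(blkno));
	memcpy(iv + 12, &counter, sizeof(counter));
}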

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: extension patch of CREATE OR REPLACE TRIGGER

2019-07-30 Thread Tom Lane
"osumi.takami...@fujitsu.com"  writes:
> [ CREATE_OR_REPLACE_TRIGGER_v03.patch ]

I took a quick look through this just to see what was going on.
A few comments:

* Upthread you asked about changing the lock level to be
AccessExclusiveLock if the trigger already exists, but the patch doesn't
actually do that.  Which is fine by me, because that sounds like a
perfectly bad idea.  In the first place, nobody is going to expect that
OR REPLACE changes the lock level, and in the second place, you can't
actually tell whether the trigger exists until you already have some
lock on the table.  I do not put any credit in the argument that it's
more important to lock out pg_dump against a concurrent REPLACE TRIGGER
than it is to lock out a concurrent CREATE TRIGGER, anyway.  So I think
keeping it at ShareRowExclusiveLock is fine.

* I wouldn't recommend adding CreateConstraintEntry's new argument
at the end.  IME, "add at the end" is almost always bad coding style;
the right thing is "add where it belongs, that is where you'd have
put it if you were writing the list from scratch".  To the admittedly
imperfect extent that the order of CreateConstraintEntry's arguments
matches the column order of pg_constraint, there's a good argument
that the OID should be *first*.  (Maybe, as long as we've gotta touch
all the callers anyway, we should fix the other random deviations
from the catalog's column order, too.)

* While you're at it, it wouldn't hurt to fix CreateConstraintEntry's
header comment, maybe like

- * The new constraint's OID is returned.
+ * The new constraint's OID is returned.  (This will be the same as
+ * "conOid" if that is specified as nonzero.)

* The new code added to CreateTrigger could stand a rethink, too.
For starters, this comment does not describe the code stanza
just below it, but something considerably further down:

/*
+* Generate the trigger's OID now, so that we can use it in the name if
+* needed.
+*/

It's also quite confusing because if there is a pre-existing trigger,
we *don't* generate any new OID.  I'd make that say "See if there is a
pre-existing trigger of the same name", and then comment the later OID
generation appropriately.  Also, the code below the pg_trigger search
seems pretty confused and redundant:

+   if (!trigger_exists)
+   // do something
+   if (stmt->replace && trigger_exists)
+   {
+   if (stmt->isconstraint && !OidIsValid(existing_constraint_oid))
+   //  do something
+   else if (!stmt->isconstraint && OidIsValid(existing_constraint_oid))
+   // do something
+   }
+   else if (trigger_exists && !isInternal)
+   {
+   // do something
+   }

I'm not on board with the idea that testing trigger_exists three separate
times, in three randomly-different-looking ways, makes things more
readable.  I'm also not excited about spending the time to scan pg_trigger
at all in the isInternal case, where you're going to ignore the result.
So I think this could use some refactoring.
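
For illustration, one way those tests could be folded so that
trigger_exists is examined only once (a restructuring sketch that keeps the
placeholders above; it is not code from the patch):

	if (trigger_exists)
	{
		if (stmt->replace)
		{
			if (stmt->isconstraint && !OidIsValid(existing_constraint_oid))
				// do something
			else if (!stmt->isconstraint && OidIsValid(existing_constraint_oid))
				// do something
		}
		else if (!isInternal)
		{
			// do something
		}
	}
	else
	{
		// do something
	}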

Also, in the proposed tests:

+\h CREATE TRIGGER;

We do not test \h output in any existing regression test, and we're
not going to start doing so in this one.  For one thing, the expected
URL would break every time we forked off a new release branch.
(There would surely be value in having more-than-no test coverage
of psql/help.c, but that's a matter for its own patch, which would
need some thought about how to cope with instability of the output.)

regards, tom lane




Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-07-30 Thread Tom Lane
Tomas Vondra  writes:
> I've started reviewing this patch, thinking that maybe I could get it
> committed as it's marked as RFC. In general I agree with having this
> feature, but I think we need to rethink the GUC because the current
> approach is just confusing.
> ...
> What I think we should do instead is to use two minimum thresholds.
> 1) log_min_duration_sample - enables sampling of commands, using the
> existing GUC log_statement_sample_rate
> 2) log_min_duration_statement - logs all commands exceeding this
> I think this is going to be much easier for users to understand.

I agree with Tomas' idea.

> The one difference between those approaches is in how they work with
> existing current settings. That is, let's say you have
>   log_min_duration_statement = 1000
>   log_statement_sample_rate = 0.01
> then no queries below 1000ms will be logged, and 1% of longer queries
> will be sampled. And with the original config (as proposed in v3 of the
> patch), this would still work the same way.
> With the new approach (two min thresholds) it'd behave differently,
> because we'd log *all* queries longer than 1000ms (not just 1%). And
> whether we'd sample any queries (using log_statement_sample_rate) would
> depend on how we'd pick the default value for the other threshold.

Well, we do not need to have a backwards-compatibility problem
here, because we have yet to release a version containing
log_statement_sample_rate.  I do not think it's too late to decide
that v12's semantics for that are broken, and either revert that
patch in v12, or back-patch a fix to make it match this idea.

regards, tom lane




Re: Initdb failure

2019-07-30 Thread Tom Lane
Robert Haas  writes:
> Agreed, but I think we should just do nothing.  To actually fix this
> in general, we'd have to get rid of every instance of MAXPGPATH in the
> source tree:
> [rhaas pgsql]$ git grep MAXPGPATH | wc -l
>  611

I don't think it'd really be necessary to go that far.  One of the
reasons we chdir to the data directory at postmaster start is so that
(pretty nearly) all filenames that backends deal with are relative
pathnames of very predictable, short lengths.  So a lot of those
MAXPGPATH uses are probably perfectly safe, indeed likely overkill.

However, identifying which ones are not safe would still take looking
at every use case, so I agree there'd be a lot of work here.

Would there be any value in teaching initdb to do something similar,
ie chdir to the parent of the target datadir location?  Then the set
of places in initdb that have to deal with long pathnames would be
pretty well constrained.

On the whole though, I don't have a problem with the "do nothing"
answer.  There's no security risk here, and no issue that seems
likely to arise in actual use cases rather than try-to-break-it
test scripts.

regards, tom lane




Re: Allow table AM's to cache stuff in relcache

2019-07-30 Thread Heikki Linnakangas

On 12/07/2019 16:07, Julien Rouhaud wrote:

On Fri, Jun 14, 2019 at 5:40 PM Tom Lane  wrote:

Heikki Linnakangas  writes:

In the patch, I documented that rd_amcache must be allocated in
CacheMemoryContext, or in rd_indexcxt if it's an index. It works, but
it's a bit weird.


Given the way the patch is implemented, it doesn't really matter which
context it's in, does it?  The retail pfree is inessential but also
harmless, if rd_amcache is in rd_indexcxt.  So we could take out the
"must".  I think it's slightly preferable to use rd_indexcxt if available,
to reduce the amount of loose junk in CacheMemoryContext.


I agree that for indexes the context used won't make much difference.
But IMHO avoiding some bloat in CacheMemoryContext is a good enough
reason to document using rd_indexcxt when available.


Right, it doesn't really matter whether an index AM uses 
CacheMemoryContext or rd_indexcxt, the code works either way. I think 
it's better to give clear advice though, one way or another. Otherwise, 
different index AM's can end up doing it differently for no particular 
reason, which seems confusing.
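
For illustration, a minimal sketch of the convention being recommended, as
it might appear inside an AM's metadata-loading routine (rel is the Relation
at hand, and MyAmCacheData is a placeholder for the AM's private struct):

	if (rel->rd_amcache == NULL)
	{
		/* indexes get rd_indexcxt; tables fall back to CacheMemoryContext */
		MemoryContext cxt = rel->rd_indexcxt ? rel->rd_indexcxt : CacheMemoryContext;

		rel->rd_amcache = MemoryContextAllocZero(cxt, sizeof(MyAmCacheData));
	}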



It would be nice to have one memory context in every
relcache entry, to hold all the stuff related to it, including
rd_amcache. In other words, it would be nice if we had "rd_indexcxt" for
tables, too, not just indexes. That would allow tracking memory usage
more accurately, if you're debugging an out of memory situation for example.


We had some discussion related to that in the "hyrax
vs. RelationBuildPartitionDesc" thread.  I'm not quite sure where
we'll settle on that, but some redesign seems inevitable.


There wasn't any progress on this since last month, and this patch
won't make the situation any worse.  I'll mark this patch as ready for
committer, as it may save some time for people working on custom table
AM.


Pushed, thanks for the review! As Tom noted, some redesign here seems 
inevitable, but this patch shouldn't get in the way of that, so no need 
to hold this back for the redesign.


- Heikki




Redacting information from logs

2019-07-30 Thread Jeff Davis
Logs are important to diagnose problems or monitor operations, but logs
can contain sensitive information which is often unnecessary for these
purposes. Redacting the sensitive information would enable easier
access and simpler integration with analysis tools without compromising
the sensitive information.

The challenge is that nobody wants to classify all of the log messages;
and even if someone did that today, there would be never-ending work in
the future to try to maintain that classification.

My proposal is:

 * redact every '%s' in an ereport by having a special mode for
snprintf.c (this is possible because we now own snprintf)
 * generate both redacted and unredacted messages (if redaction is
enabled)
 * choose which destinations (stderr, eventlog, syslog, csvlog) get
redacted or plain messages
 * emit_log_hook always has both redacted and plain messages available
 * allow specifying a custom redaction function, e.g. a function that
hashes the string rather than completely redacting it
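
For illustration, a minimal self-contained sketch of such a hashing
redaction function (the name and signature are assumptions, not the hook the
WIP patch defines):

#include <stdio.h>
#include <stdlib.h>

/* Replace a sensitive value with a short hash so equal values stay correlatable. */
static char *
hash_redact(const char *value)
{
	unsigned int h = 5381;			/* djb2 string hash */
	char	   *out = malloc(12);	/* "[xxxxxxxx]" plus terminating NUL */

	for (const char *p = value; *p; p++)
		h = h * 33 + (unsigned char) *p;

	snprintf(out, 12, "[%08x]", h);
	return out;
}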

I think '%s' in a log message is a pretty close match to the kind of
information that might be sensitive. All data goes through type output
functions (e.g. the conflicting datum for a unique constraint violation
message), and most other things that a user might type would go through
%s. A lot of other information useful in logs, like LSNs, %m's, PIDs,
etc. would be preserved.

All object names would be redacted, but that's not as bad as it sounds:
  (a) You can specify a custom redaction function that hashes rather
than completely redacts. That allows you to see if different messages
refer to the same object, and also map back to suspected objects if you
really need to.
  (b) The unredacted object names are still a part of ErrorData, so you
can do something interesting with emit_log_hook.
  (c) You still might have the unredacted logs in a more protected
place, and can access them when you really need to.

A weakness of this proposal is that it could be confusing to use
ereport() in combination with snprintf(). If using snprintf to build
the format string, nothing would be redacted, so you'd have to be
careful not to expand any %s that might be sensitive. If using snprintf
to build up an argument, the entire argument would be redacted. The
first case should not be common, because good coding generally avoids
non-constant format strings. The second case is just over-redaction,
which is not necessarily bad.

One annoying case would be if some of the arguments to ereport() are
used for things like the right number of commas or tabs -- redacting
those would just make the message look horrible. I didn't find such
cases but I'm pretty sure they exist. Another annoying case is time,
which is useful for debugging, but formatted with %s so it gets
redacted (I did find plenty of these cases).

But I don't see a better solution. Right now, it's a pain to treat log
files as sensitive things when there are so many ways they can help
with smooth operations and so many tools available to analyze them.
This proposal seems like a practical solution to enable better use of
log files while protecting potentially-sensitive information.

Attached is a WIP patch.

Regards,
Jeff Davis
From b9e51843f2cf3635e65ee8807ed01ff1cb197fab Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Mon, 29 Jul 2019 16:28:35 -0700
Subject: [PATCH] WIP redaction

---
 src/backend/utils/error/elog.c | 183 -
 src/backend/utils/misc/guc.c   |  28 +
 src/include/utils/elog.h   |   8 ++
 src/port/snprintf.c|  16 +++
 4 files changed, 207 insertions(+), 28 deletions(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 8b4720ef3a..0b9a3d22d1 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -90,6 +90,7 @@ ErrorContextCallback *error_context_stack = NULL;
 sigjmp_buf *PG_exception_stack = NULL;
 
 extern bool redirection_done;
+extern bool snprintf_do_redact;
 
 /*
  * Hook for intercepting messages before they are sent to the server log.
@@ -105,6 +106,9 @@ int			Log_error_verbosity = PGERROR_VERBOSE;
 char	   *Log_line_prefix = NULL; /* format for extra log line info */
 int			Log_destination = LOG_DESTINATION_STDERR;
 char	   *Log_destination_string = NULL;
+bool		redact_messages = false;
+int			redact_destination = LOG_DESTINATION_STDERR;
+char	   *redact_destination_string = NULL;
 bool		syslog_sequence_numbers = true;
 bool		syslog_split_messages = true;
 
@@ -480,14 +484,24 @@ errfinish(int dummy,...)
 	/* Now free up subsidiary data attached to stack entry, and release it */
 	if (edata->message)
 		pfree(edata->message);
+	if (edata->message_r)
+		pfree(edata->message_r);
 	if (edata->detail)
 		pfree(edata->detail);
+	if (edata->detail_r)
+		pfree(edata->detail_r);
 	if (edata->detail_log)
 		pfree(edata->detail_log);
+	if (edata->detail_log_r)
+		pfree(edata->detail_log_r);
 	if (edata->hint)
 		pfree(edata->hint);
+	if 

Re: Initdb failure

2019-07-30 Thread Robert Haas
On Sat, Jul 27, 2019 at 2:22 AM Peter Eisentraut
 wrote:
> I think if you want to make this more robust, get rid of the fixed-size
> array, use dynamic allocation with PQExpBuffer, and let the operating
> system complain if it doesn't like the directory name length.

Agreed, but I think we should just do nothing.  To actually fix this
in general, we'd have to get rid of every instance of MAXPGPATH in the
source tree:

[rhaas pgsql]$ git grep MAXPGPATH | wc -l
 611

If somebody feels motivated to spend that amount of effort improving
this, I will stand back and cheer from the sidelines, but that's gonna
be a LOT of work for a problem that, as Tom says, is probably not
really affecting very many people.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-07-30 Thread Melanie Plageman
So, I've rewritten the patch to use a BufFile for the outer table
batch file tuples' match statuses and write bytes to and from the file.
Each byte starts as 0 and, upon encountering a match for a tuple, I set its
bit in the file to 1 (also rebased with current master).

It, of course, only works for parallel-oblivious hashjoin -- it relies
on deterministic order of tuples encountered in the outer side batch
file to set the right match bit and uses a counter to decide which bit
to set.
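
For illustration, the byte/bit arithmetic that counter implies (a
self-contained sketch; the names are assumptions): with a deterministic
batch order, outer tuple number n always maps to the same byte and bit of
the match-status file.

#include <stdint.h>

#define BITS_PER_BYTE 8

/* locate and set the match bit for the n-th outer tuple of this batch */
static void
set_match_bit(uint8_t *bytes, uint64_t tupleno)
{
	uint64_t	byteno = tupleno / BITS_PER_BYTE;	/* which byte in the file */
	int			bitno = tupleno % BITS_PER_BYTE;	/* which bit within that byte */

	bytes[byteno] |= (uint8_t) (1 << bitno);
}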

I did the "needlessly dumb implementation" Robert mentioned, though,
I thought about it and couldn't come up with a much smarter way to
write match bits to a file. I think there might be an optimization
opportunity in not writing the current_byte to the file each time that
the outer tuple matches and only doing this once we have advanced to a
tuple number that wouldn't have its match bit in the current_byte. I
didn't do that to keep it simple, and, I suspect there might be a bit
of gymnastics needed to make sure that that byte is actually written
to the file in case we exit from some other state before we encounter
the tuple represented in the last bit in that byte.

I plan to work on a separate implementation for parallel hashjoin
next--to understand what is required. I believe the logic to decide
when to fall back should be fairly easy to slot in at the end once
we've decided what that logic is.

On Sat, Jul 13, 2019 at 4:44 PM Tomas Vondra 
wrote:

> On Wed, Jul 03, 2019 at 02:22:09PM -0700, Melanie Plageman wrote:
> >On Tue, Jun 18, 2019 at 3:24 PM Melanie Plageman <
> melanieplage...@gmail.com>
> >
> >Before doing that, I wanted to ask what a desirable fallback condition
> >would be. In this patch, fallback to hashloop join happens only when
> >inserting tuples into the hashtable after batch 0 when inserting
> >another tuple from the batch file would exceed work_mem. This means
> >you can't increase nbatches, which, I would think is undesirable.
> >
>
> Yes, I think that's undesirable.
>
> >I thought a bit about when fallback should happen. So, let's say that
> >we would like to fallback to hashloop join when we have increased
> >nbatches X times. At that point, since we do not want to fall back to
> >hashloop join for all batches, we have to make a decision. After
> >increasing nbatches the Xth time, do we then fall back for all batches
> >for which inserting inner tuples exceeds work_mem? Do we use this
> >strategy but work_mem + some fudge factor?
> >
> >Or, do we instead try to determine if data skew led us to increase
> >nbatches both times and then determine which batch, given new
> >nbatches, contains that data, set fallback to true only for that
> >batch, and let all other batches use the existing logic (with no
> >fallback option) unless they contain a value which leads to increasing
> >nbatches X number of times?
> >
>
> I think we should try to detect the skew and use this hashloop logic
> only for the one batch. That's based on the assumption that the hashloop
> is less efficient than the regular hashjoin.
>

> We may need to apply it even for some non-skewed (but misestimated)
> cases, though. At some point we'd need more than work_mem for BufFiles,
> at which point we ought to use this hashloop.
>
>
I have not yet changed the logic for deciding to fall back from
my original design. It will still only fall back for a given batch if
that batch's inner batch file doesn't fit in memory. I haven't,
however, changed the logic to allow it to increase the number of
batches some number of times or according to some criteria before
falling back for that batch.

-- 
Melanie Plageman


v3-0001-hashloop-fallback.patch
Description: Binary data


Re: tap tests driving the database via psql

2019-07-30 Thread Alvaro Herrera
On 2019-Jul-30, Tom Lane wrote:

> OK, so just lifting DBD::Pg in toto is out for license reasons.
> However, maybe we could consider writing a new DBD driver from
> scratch (while using a platform-provided DBI layer) rather than
> doing everything from scratch.  I'm not sure how much actual
> functionality is in the DBI layer, so maybe that approach
> wouldn't buy much.

Then again, maybe we don't *need* all the functionality that DBI offers.
DBI is enormous, has a lot of extensibility, cross-database
compatibility ... and, well, just the fact that it's a layered design
(requiring a DBD on top of it before it even works) makes it even more
complicated.

I think a pure-perl standalone driver might be a lot simpler than
maintaining our own DBD ... and we don't have to convince animal
maintainers to install the right version of DBI in the first place.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SegFault on 9.6.14

2019-07-30 Thread Robert Haas
On Thu, Jul 18, 2019 at 9:45 AM Tom Lane  wrote:
> I think this is going in the wrong direction.  Nodes should *always*
> assume that a rescan is possible until ExecEndNode is called.
> If you want to do otherwise, you are going to be inventing a whole
> bunch of complicated and doubtless-initially-buggy control logic
> to pass down information about whether a rescan might be possible.
> That doesn't sound like a recipe for a back-patchable fix.  Perhaps
> we could consider redesigning the rules around REWIND in a future
> version, but that's not where to focus the bug fix effort.

So, if I can summarize how we got here, as best I understand it:

0. The historic behavior of the executor is to assume it's OK to leak
resources for the lifetime of the query.  Nodes that are executed to
completion generally do some cleanup, but we feel free (as under
Limit) to just stop executing a node without giving it any hint that
it should release resources.  So a Sort may hold onto a terabyte of
memory and an index scan may keep holding a pin even after there's no
theoretical way of ever needing those resources again, and we just
don't care.

1. Parallel query made that perhaps-already-shaky assumption a lot
more problematic. Partly that's because workers are a more scarce
and considerably heavier resource than anything else, and moreover act
as a container for anything else, so whatever you were leaking before,
you can now leak N times more of it, plus N processes, until the end
of the query. However, there's a correctness reason too, which is that
when a node has a copy in the leader and a copy in every worker, each
copy has its own instrumentation data (startup time, run time, nloops,
etc) and we can only fold all that together once the node is done
executing, because it's really hard to add up a bunch of numbers
before the numbers are done changing.  We could've made the
instrumentation shared throughout, but if we had, we could have
contention for updating the instrumentation data, which seems like
it'd be bad.

2. To fix that correctness problem, we decided to try to shut down the
node under a limit node when we're done with it (commit
85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this
looks fundamentally necessary to me.  If you're going to have N
separate copies of the instrumentation, and you want to add them up
when you're done, then you have to decide to be done at some point;
otherwise you don't know when to add them up, and maybe won't add them
up at all, and then you'll be sad.  This does not mean that the exact
timing couldn't be changed somehow, but if you want a correct
implementation, you have to shut down Limit's sub-node after you're
done executing it (so that you can get the instrumentation data from
the workers after it's final) and before you start destroying DSM
segments and stuff (so that you get the instrumentation data from the
workers before it vanishes).

3. The aforementioned commit turned out to be buggy in at least two
ways, precisely because it didn't do a good enough job predicting when
the Limit needed to be shut down.  First, there was commit
2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact
that you could hit the Limit and then back up.  Second, there's the
present issue, where the Limit gets rescanned.

So, given all that, if we want to adopt Tom's position that we should
always cater to a possible rescan, then we're going to have to rethink
the way that instrumentation data gets consolidated from workers into
the leader in such a way that we can consolidate multiple times
without ending up with the wrong answer.  The other option is to do
what I understand Amit and Thomas to be proposing, which is to do a
better job identifying the case where we're "done for good" and can
trigger the shutdown fearlessly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Patch to document base64 encoding

2019-07-30 Thread Karl O. Pinc
On Tue, 30 Jul 2019 11:40:03 -0400
Tom Lane  wrote:

> "Karl O. Pinc"  writes:
> > On Mon, 15 Jul 2019 23:00:55 +0200 (CEST)
> > Fabien COELHO  wrote:  
> >> The patch clarifies the documentation about encode/decode and
> >> other text/binary string conversion functions.  
> 
> > Other notable changes:
> >   Corrects categorization of functions as string or binary.
> >   Reorders functions alphabetically by function name.  
> 
> So I took a look at this, expecting that after so much discussion it
> ought to just be committable ...

It started simple, just changing the base64 function descriptions,
but critique drew in additional issues.

> but I am befuddled by your choices
> about which functions to move where. 

The grouping is by the data type on which each function operates,
the data type of the input.

If there's to be 2 categories, as in the
existing docs, it seems to me that you have to categorize either by
the data type input or data type output.  To categorize by input
and output together would result in 4 (or more?) categories, which
would be even crazier.

> It seems entirely crazy that
> encode() and decode() are no longer in the same section, likewise that
> convert_from() and convert_to() aren't documented together anymore.

Awkward, yes.  But findable if you know what the categories are.

I suppose there could be 3 different categories: those that input
and output strings, those that input and output binary, and those
that convert -- inputting one data type and outputting another.

I'm not sure that this would really address the issue of documenting,
say encode() and decode() together.  It pretty much makes sense to
alphabetize the functions _within_ each category, because that's
about the only easily defined way to do it.  Going "by feel" and
putting encode() and decode() together raises the question of
where they should be together in the overall ordering within
the category.

> I'm not sure what is the right dividing line between string and binary
> functions, but I don't think that anyone is going to find this
> division helpful.

Maybe there's a way to make more clear what the categories are?
I could be explicit in the description of the section.

> I do agree that documenting some functions twice is a bad plan,
> so we need to clean this up somehow.
>
> After some thought, it seems like maybe a workable approach would be
> to consider that all conversion functions going between text and
> bytea belong in the binary-string-functions section.  I think it's
> reasonable to say that plain "string functions" just means stuff
> dealing with text.

Ok.  Should the section title remain unchanged?
"Binary String Functions and Operators"

I think the summary description of the section will need
a little clarification.

> Possibly we could make a separate table in the binary-functions
> section just for conversions, although that feels like it might be
> overkill.

I have no good answers.  An advantage to a separate section
for conversions is that you _might_ be able to pair the functions,
so that encode() and decode() do show up right next to each other.

I'm not sure exactly how to structure "pairing".  I would have to
play around and see what might look good.

> While we're on the subject, Table 9.11 (conversion names) seems
> entirely misplaced, and I don't just mean that it would need to
> migrate to the binary-functions page.  I don't think it belongs
> in func.sgml at all.  Isn't it pretty duplicative of Table 23.2
> (Client/Server Character Set Conversions)?  I think we should
> unify it with that table, or at least put it next to that one.
> Perhaps that's material for a separate patch though.

I don't know.  But it does seem like something that can be addressed
in isolation and is suitable for its own patch.

Thanks for the help.  I will wait for a response to this
before submitting another patch, just in case someone sees any
ideas here to be followed up on.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




typesafe printf hackery

2019-07-30 Thread Andres Freund
Hi,

I've occasionally wished for a typesafe version of pg_printf() and other
varargs functions. The compiler warnings are nice, but also far from
complete.

Here's a somewhat crazy hack/prototype for how printf could get actual
argument types.  I'm far from certain it's worth pursuing this
further... Nor the contrary.

Note that this requires removing the parentheses from VA_ARGS_NARGS_'s
result (i.e. (N) -> N). To me those parens don't make much sense; we're
pretty much guaranteed to only ever have a number there.

With the following example e.g.

myprintf("boring fmt", 1, 0.1, (char)'c', (void*)0, "crazy stuff");
myprintf("empty argument fmt");

yield

format string "boring fmt", 5 args
arg number 0 is of type int: 1
arg number 1 is of type double: 0.100000
arg number 2 is of type bool: true
arg number 3 is of type void*: (nil)
arg number 4 is of type char*: crazy stuff
format string "empty argument fmt", 0 args


which'd obviously allow for error checking inside myprintf.


#include "c.h"

// hack pg version out of the way
#undef printf

// FIXME: This doesn't correctly work with zero arguments
#define VA_ARGS_EACH(wrap, ...) \
VA_ARGS_EACH_EXPAND(VA_ARGS_NARGS(__VA_ARGS__)) (wrap, __VA_ARGS__)
#define VA_ARGS_EACH_EXPAND(count) VA_ARGS_EACH_EXPAND_REALLY(VA_ARGS_EACH_INT_, count)
#define VA_ARGS_EACH_EXPAND_REALLY(prefix, count) prefix##count

#define VA_ARGS_EACH_INT_1(wrap, el1) wrap(el1)
#define VA_ARGS_EACH_INT_2(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_1(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_3(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_2(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_4(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_3(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_5(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_4(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_6(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_5(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_7(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_6(wrap, __VA_ARGS__)


typedef enum printf_arg_type
{
PRINTF_ARG_BOOL,
PRINTF_ARG_CHAR,
PRINTF_ARG_INT,
PRINTF_ARG_DOUBLE,
PRINTF_ARG_CHARP,
PRINTF_ARG_VOIDP,
} printf_arg_type;

typedef struct arginfo
{
printf_arg_type tp;
} arginfo;

// hackfix empty argument case
#define myprintf(...) myprintf_wrap(__VA_ARGS__, "dummy")
#define myprintf_wrap(fmt, ... ) \
myprintf_impl(fmt, VA_ARGS_NARGS(__VA_ARGS__) - 1, ((arginfo[]){ VA_ARGS_EACH(blurttype, __VA_ARGS__) }), __VA_ARGS__)

// FIXME: Obviously not enough types
#define blurttype(x) ((arginfo){_Generic(x, char: PRINTF_ARG_BOOL, int: PRINTF_ARG_INT, double: PRINTF_ARG_DOUBLE, char *: PRINTF_ARG_CHARP, void *: PRINTF_ARG_VOIDP)})

static const char*
printf_arg_typename(printf_arg_type tp)
{
switch (tp)
{
case PRINTF_ARG_BOOL:
return "bool";
case PRINTF_ARG_CHAR:
return "char";
case PRINTF_ARG_INT:
return "int";
case PRINTF_ARG_DOUBLE:
return "double";
case PRINTF_ARG_CHARP:
return "char*";
case PRINTF_ARG_VOIDP:
return "void*";
}

return "";
}

static void
myprintf_impl(char *fmt, size_t nargs, arginfo arg[], ...)
{
va_list args;
va_start(args, arg);

printf("format string \"%s\", %zu args\n", fmt, nargs);
for (int argno = 0; argno < nargs; argno++)
{
printf("\targ number %d is of type %s: ",
   argno,
   printf_arg_typename(arg[argno].tp));

switch (arg[argno].tp)
{
case PRINTF_ARG_BOOL:
printf("%s", ((bool) va_arg(args, int)) ? 
"true" : "false");
break;
case PRINTF_ARG_CHAR:
printf("%c", (char) va_arg(args, int));
break;
case PRINTF_ARG_INT:
printf("%d", va_arg(args, int));
break;
case PRINTF_ARG_DOUBLE:
printf("%f", va_arg(args, double));
break;
case PRINTF_ARG_CHARP:
printf("%s", va_arg(args, char *));
break;
case PRINTF_ARG_VOIDP:
printf("%p", va_arg(args, void *));
break;
}

printf("\n");
}
}

int main(int argc, char **argv)
{
myprintf("boring fmt", 1, 0.1, (char)'c', (void*)0, "crazy stuff");
myprintf("empty argument fmt");
}
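
As a rough illustration of the error checking this enables (a sketch only, not
part of the prototype above; format_arity_ok is a made-up name), myprintf_impl()
could verify that the number of conversion specifiers matches nargs before
consuming any varargs, and a fuller version would also match each specifier
against arg[i].tp:

/* bool and size_t come from c.h, included above */
static bool
format_arity_ok(const char *fmt, size_t nargs)
{
	size_t		nspec = 0;

	for (const char *p = fmt; *p; p++)
	{
		if (*p != '%')
			continue;
		if (p[1] == '%')
			p++;			/* "%%" is a literal percent, not an argument */
		else
			nspec++;		/* width/precision like "%*d" ignored for brevity */
	}
	return nspec == nargs;
}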



Re: ANALYZE: ERROR: tuple already updated by self

2019-07-30 Thread Tomas Vondra

On Mon, Jul 29, 2019 at 12:18:33PM +0200, Tomas Vondra wrote:

On Sun, Jul 28, 2019 at 09:53:20PM -0700, Andres Freund wrote:

Hi,

On 2019-07-28 21:21:51 +0200, Tomas Vondra wrote:

AFAICS it applies to 10+ versions, because that's where extended stats
were introduced. We certainly can't mess with catalogs there, so this is
about the only backpatchable fix I can think of.


AFAIU the inh version wouldn't be used anyway, and this has never
worked. So imo it's fine to fix it that way for < master. For
master we should have something better, even if perhaps not immediately.



Agreed. I'll get the simple fix committed soon, and put a TODO on my
list for pg13.



I've pushed the simple fix - I've actually simplified it a bit further by
simply not calling BuildRelationExtStatistics() at all when inh=true,
instead of passing the flag to BuildRelationExtStatistics() and making the
decision there. The function is part of the public API, so changing its
signature would have been an ABI break (although it's unlikely anyone else
is calling the function).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: TopoSort() fix

2019-07-30 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jul 30, 2019 at 1:44 PM Tom Lane  wrote:
>> Well, there'd be an actual isolation test that they work ;-), if you
>> override the marking.  Admittedly, one test case does not prove that
>> there's no way to crash the system, but that can be said of most
>> parts of Postgres.

> True.  I'm just talking about what we can foresee.

Sure.  But I think what we can foresee is that if there are any bugs
reachable this way, they'd be reachable and need fixing regardless.
We've already established that parallel workers can take and release locks
that their leader isn't holding.  Apparently, they won't take anything
stronger than RowExclusiveLock; but even AccessShare is enough to let a
process participate in all interesting behaviors of the lock manager,
including blocking, being blocked, and being released early by deadlock
resolution.  And the advisory-lock functions are pretty darn thin wrappers
around the lock manager.  So I'm finding it hard to see where there's
incremental risk, even if a user does intentionally bypass the parallel
safety markings.  And what we get in return is an easier way to add tests
for this area.

regards, tom lane




Re: TopoSort() fix

2019-07-30 Thread Robert Haas
On Tue, Jul 30, 2019 at 1:44 PM Tom Lane  wrote:
> Well, there'd be an actual isolation test that they work ;-), if you
> override the marking.  Admittedly, one test case does not prove that
> there's no way to crash the system, but that can be said of most
> parts of Postgres.

True.  I'm just talking about what we can foresee.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: TopoSort() fix

2019-07-30 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jul 30, 2019 at 1:36 PM Tom Lane  wrote:
>> In any case, my question at the moment is whether we need the belt-and-
>> suspenders-too approach of having both non-parallel-safe marking and an
>> explicit check inside these functions.  We've largely moved away from
>> hard-wired checks for e.g. superuserness, and surely these things are
>> less dangerous than most formerly-superuser-only functions.

> If we can't think of a way that the lack of these checks could crash
> it, then I think it's OK to remove the hardwired checks.  If we can,
> I'd favor keeping them.

Well, there'd be an actual isolation test that they work ;-), if you
override the marking.  Admittedly, one test case does not prove that
there's no way to crash the system, but that can be said of most
parts of Postgres.

regards, tom lane




Re: TopoSort() fix

2019-07-30 Thread Robert Haas
On Tue, Jul 30, 2019 at 1:36 PM Tom Lane  wrote:
> No, there's a sufficient reason why we should force advisory locks
> to be taken in the leader process, namely that the behavior is totally
> different if we don't: they will disappear at the end of the parallel
> worker run, not at end of transaction or session as documented.

Oh, good point.  I forgot about that.

> However, that argument doesn't seem to be a reason why the advisory-lock
> functions couldn't be parallel-restricted rather than parallel-unsafe.

Agreed.

> In any case, my question at the moment is whether we need the belt-and-
> suspenders-too approach of having both non-parallel-safe marking and an
> explicit check inside these functions.  We've largely moved away from
> hard-wired checks for e.g. superuserness, and surely these things are
> less dangerous than most formerly-superuser-only functions.

If we can't think of a way that the lack of these checks could crash
it, then I think it's OK to remove the hardwired checks.  If we can,
I'd favor keeping them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Attached partition not considering altered column properties of root partition.

2019-07-30 Thread Robert Haas
On Tue, Jul 2, 2019 at 9:53 PM Amit Langote  wrote:
> Thanks for the report.  This seems like a bug.  Documentation claims
> that the child tables inherit column storage options from the parent
> table.  That's actually enforced in only some cases.

I realize I'm just repeating the same argument I've already made
before on other related topics, but I don't think this is a bug.
Inherited-from-parent is not the same as
enforced-to-always-be-the-same-as-parent.  Note that this is allowed,
changing only the child:

rhaas=# create table foo (a int, b text) partition by range (a);
CREATE TABLE
rhaas=# create table foo1 partition of foo for values from (0) to (10);
CREATE TABLE
rhaas=# alter table foo1 alter column b set storage plain;
ALTER TABLE

As is this, changing only the parent:

rhaas=# alter table only foo alter column b set storage plain;
ALTER TABLE

How can you possibly argue that the intended behavior is
everything-always-matches when that's not what's documented and
there's absolutely nothing that tries to enforce it?

I'm getting really tired of people thinking that they can invent rules
for table partitioning that were (1) never intended by the original
implementation and (2) not even slightly enforced by the code, and
then decide that those behavior changes should not only be made but
back-patched.  That is just dead wrong.  There is no problem here.
There is no need to change ANYTHING.  There is even less need to do it
in the back branches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: TopoSort() fix

2019-07-30 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jul 30, 2019 at 10:27 AM Tom Lane  wrote:
>> (BTW, why aren't these functions just "parallel restricted"?)

> ...
> But it is really pretty arguable whether we should feel responsible
> for that problem.  We could just decide that if you're doing that, and
> you don't want the scenario described above to happen, you oughta mark
> the function that contains this logic at least PARALLEL RESTRICTED,
> and if you don't, then it's your fault for doing a dumb thing.  I
> believe when we were early on in the development of this we wanted to
> be very conservative lest, ah, someone accuse us of not locking things
> down well enough, but maybe at this point parallel query is a
> sufficiently well-established concept that we should lighten up on
> some cases where we took an overly-stringent line.  If we take that
> view, then I'm not sure why these functions couldn't be just marked
> PARALLEL SAFE.

No, there's a sufficient reason why we should force advisory locks
to be taken in the leader process, namely that the behavior is totally
different if we don't: they will disappear at the end of the parallel
worker run, not at end of transaction or session as documented.

However, that argument doesn't seem to be a reason why the advisory-lock
functions couldn't be parallel-restricted rather than parallel-unsafe.

In any case, my question at the moment is whether we need the belt-and-
suspenders-too approach of having both non-parallel-safe marking and an
explicit check inside these functions.  We've largely moved away from
hard-wired checks for e.g. superuserness, and surely these things are
less dangerous than most formerly-superuser-only functions.

regards, tom lane




Re: Avoid full GIN index scan when possible

2019-07-30 Thread Tom Lane
Marc Cousin  writes:
> By the way, while preparing this, I noticed that it seems that during this 
> kind of index scan, the interrupt signal is masked
> for a very long time. Control-C takes a very long while to cancel the query. 
> But it's an entirely different problem :)

Yeah, that seems like an independent problem/patch, but it's not obvious
where to fix --- can you provide a self-contained test case?

Meanwhile, I looked at the v3 patch, and it seems like it might not be
too far from committable.  I think we should *not* let this get bogged
down in questions of whether EXPLAIN can report which index quals were
used or ignored.  That's a problem that's existed for decades in the
btree code, with more or less zero user complaints.

I do think v3 needs more attention to comments, for instance this
hunk is clearly falsifying the adjacent comment:

@@ -141,7 +141,8 @@ ginFillScanKey(GinScanOpaque so, OffsetNumber attnum,
uint32  i;
 
/* Non-default search modes add one "hidden" entry to each key */
-   if (searchMode != GIN_SEARCH_MODE_DEFAULT)
+   if (searchMode != GIN_SEARCH_MODE_DEFAULT &&
+   (searchMode != GIN_SEARCH_MODE_ALL || nQueryValues))
nQueryValues++;
key->nentries = nQueryValues;
key->nuserentries = nUserQueryValues;

Also, I agree with Julien that this

+   so->forcedRecheck = key->triConsistentFn(key) != GIN_TRUE;

probably needs to be

+   so->forcedRecheck |= key->triConsistentFn(key) != GIN_TRUE;


regards, tom lane




Re: make installcheck-world in a clean environment

2019-07-30 Thread Tom Lane
Alexander Lakhin  writes:
> 01.07.2019 13:47, Thomas Munro wrote:
>> A new CF is here and this is in "Needs Review".  Would you like to
>> provide a rebased patch, or should it really be withdrawn?

> The rebased patch is attached, but I still can't find anyone interested
> in reviewing it.
> So let's withdraw it.

I agree with withdrawing this, and have marked it that way in the CF app.

I think the fundamental problem here is that nobody except Alexander
is unhappy with the current behavior of "make installcheck", and it
is not very clear whether this patch might change the behavior in
ways that do make others unhappy.  So I think it's best to set it
aside pending more people agreeing that there's a problem to solve.

regards, tom lane




Re: TopoSort() fix

2019-07-30 Thread Robert Haas
On Tue, Jul 30, 2019 at 10:27 AM Tom Lane  wrote:
> I also looked into whether one could use SELECT FOR UPDATE/SHARE to get
> stronger locks at a tuple level, but that's been blocked off as well.
> You guys really did a pretty good job of locking that down.

Thanks.  We learned from the master.

> After thinking about this for awhile, though, I believe it might be
> reasonable to just remove PreventAdvisoryLocksInParallelMode()
> altogether.  The "parallel unsafe" markings on the advisory-lock
> functions seem like adequate protection against somebody running
> them in a parallel worker.  If you defeat that by calling them from
> a mislabeled-parallel-safe wrapper (as the proposed test case does),
> then any negative consequences are on your own head.  AFAICT the
> only actual negative consequence is that the locks disappear the
> moment the parallel worker exits, so we'd not be opening any large
> holes even to people who rip off the safety cover.
>
> (BTW, why aren't these functions just "parallel restricted"?)

I don't exactly remember why we installed all of these restrictions
any more.  You might be able to find some discussion of it by
searching the archives.  I believe we may have been concerned about
the fact that group locking would cause advisory locks taken in one
process not to conflict with the same advisory lock taken in some
cooperating process, and maybe that would be unwelcome behavior for
someone.  For example, suppose the user defines a function that takes
an advisory lock on the number 1, does a bunch of stuff that should
never happen multiply at the same time, and then releases the lock.
Without parallel query, that will work.  With parallel query, it
won't, because several workers running the same query might run the
same function simultaneously and their locks won't conflict.

But it is really pretty arguable whether we should feel responsible
for that problem.  We could just decide that if you're doing that, and
you don't want the scenario described above to happen, you oughta mark
the function that contains this logic at least PARALLEL RESTRICTED,
and if you don't, then it's your fault for doing a dumb thing.  I
believe when we were early on in the development of this we wanted to
be very conservative lest, ah, someone accuse us of not locking things
down well enough, but maybe at this point parallel query is a
sufficiently well-established concept that we should lighten up on
some cases where we took an overly-stringent line.  If we take that
view, then I'm not sure why these functions couldn't be just marked
PARALLEL SAFE.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Proposal to suppress errors thrown by to_reg*()

2019-07-30 Thread Tom Lane
Takuma Hoshiai  writes:
> [ fix_to_reg_v2.patch ]

I took a quick look through this patch.  I'm on board with the goal
of not having schema-access violations throw an error in these
functions, but the implementation feels pretty ugly and bolted-on.
Nobody who had designed the code to do this from the beginning
would have chosen to parse the names twice, much less check the
ACLs twice which is what's going to happen in the success path.

But a worse problem is that this only fixes the issue for the
object name proper.  to_regprocedure and to_regoperator also
have type name(s) to think about, and this doesn't fix the
problem for those:

regression=# create schema s1;
CREATE SCHEMA
regression=# create type s1.d1 as enum('a','b');
CREATE TYPE
regression=# create function f1(s1.d1) returns s1.d1 language sql as
regression-# 'select $1';
CREATE FUNCTION
regression=# select to_regprocedure('f1(s1.d1)');
 to_regprocedure 
-
 f1(s1.d1)
(1 row)

regression=# create user joe;
CREATE ROLE
regression=# \c - joe
You are now connected to database "regression" as user "joe".
regression=> select to_regprocedure('f1(s1.d1)');
ERROR:  permission denied for schema s1


A closely related issue that's been complained of before is that
while these functions properly return NULL when the main object
name includes a nonexistent schema:

regression=> select to_regprocedure('q1.f1(int)');
 to_regprocedure 
-
 
(1 row)

it doesn't work for a nonexistent schema in a type name:

regression=> select to_regprocedure('f1(q1.d1)');
ERROR:  schema "q1" does not exist


Looking at the back-traces for these failures,

#0  errfinish (dummy=0) at elog.c:411
#1  0x00553626 in aclcheck_error (aclerr=, 
objtype=OBJECT_SCHEMA, objectname=) at aclchk.c:3623
#2  0x0055028f in LookupExplicitNamespace (
nspname=, missing_ok=false) at namespace.c:2928
#3  0x005b3433 in LookupTypeName (pstate=0x0, typeName=0x20d87a0, 
typmod_p=0x7fff94c3ee38, missing_ok=)
at parse_type.c:162
#4  0x005b3b29 in parseTypeString (str=, 
typeid_p=0x7fff94c3ee3c, typmod_p=0x7fff94c3ee38, missing_ok=false)
at parse_type.c:822
#5  0x0086fe21 in parseNameAndArgTypes (string=, 
allowNone=false, names=, nargs=0x7fff94c3f01c, 
argtypes=0x7fff94c3ee80) at regproc.c:1874
#6  0x00870b2d in to_regprocedure (fcinfo=0x2134900) at regproc.c:305

#0  errfinish (dummy=0) at elog.c:411
#1  0x0054dc7b in get_namespace_oid (nspname=, 
missing_ok=false) at namespace.c:3061
#2  0x00550230 in LookupExplicitNamespace (
nspname=, missing_ok=false) at namespace.c:2922
#3  0x005b3433 in LookupTypeName (pstate=0x0, typeName=0x216bd20, 
typmod_p=0x7fff94c3ee38, missing_ok=)
at parse_type.c:162
#4  0x005b3b29 in parseTypeString (str=, 
typeid_p=0x7fff94c3ee3c, typmod_p=0x7fff94c3ee38, missing_ok=false)
at parse_type.c:822
#5  0x0086fe21 in parseNameAndArgTypes (string=, 
allowNone=false, names=, nargs=0x7fff94c3f01c, 
argtypes=0x7fff94c3ee80) at regproc.c:1874
#6  0x00870b2d in to_regprocedure (fcinfo=0x2170f50) at regproc.c:305

it's not *that* far from where we know we need no-error behavior to
where it's failing to happen.  parseNameAndArgTypes isn't even global,
so certainly changing its API is not problematic.  For the functions
below it, we'd have to decide whether it's okay to consider that
missing_ok=true also enables treating a schema access failure as
"missing", or whether we should add an additional flag argument
to decide that.  It seems like it might be more flexible to use a
separate flag, but that decision could propagate to a lot of places,
especially if we conclude that all the namespace.c functions that
expose missing_ok also need to expose schema_access_violation_ok.

So I think you ought to drop this implementation and fix it properly
by adjusting the behaviors of the functions cited above.

regards, tom lane




Unused struct member in pgcrypto pgp.c

2019-07-30 Thread Daniel Gustafsson
Hi,

In contrib/pgcrypto/pgp.c we have a struct member int_name in digest_info which
isn’t used, and seems to have never been used (a potential copy/pasteo from the
cipher_info struct?).  Is there a reason for keeping this, or can it be removed
as per the attached?

cheers ./daniel



pgcrypto_int_name.diff
Description: Binary data


Re: Patch to document base64 encoding

2019-07-30 Thread Tom Lane
"Karl O. Pinc"  writes:
> On Mon, 15 Jul 2019 23:00:55 +0200 (CEST)
> Fabien COELHO  wrote:
>> The patch clarifies the documentation about encode/decode and other 
>> text/binary string conversion functions.

> Other notable changes:
>   Corrects categorization of functions as string or binary.
>   Reorders functions alphabetically by function name.

So I took a look at this, expecting that after so much discussion it
ought to just be committable ... but I am befuddled by your choices
about which functions to move where.  It seems entirely crazy that
encode() and decode() are no longer in the same section, likewise that
convert_from() and convert_to() aren't documented together anymore.
I'm not sure what is the right dividing line between string and binary
functions, but I don't think that anyone is going to find this
division helpful.

I do agree that documenting some functions twice is a bad plan,
so we need to clean this up somehow.

After some thought, it seems like maybe a workable approach would be
to consider that all conversion functions going between text and
bytea belong in the binary-string-functions section.  I think it's
reasonable to say that plain "string functions" just means stuff
dealing with text.

Possibly we could make a separate table in the binary-functions
section just for conversions, although that feels like it might be
overkill.

While we're on the subject, Table 9.11 (conversion names) seems
entirely misplaced, and I don't just mean that it would need to
migrate to the binary-functions page.  I don't think it belongs
in func.sgml at all.  Isn't it pretty duplicative of Table 23.2
(Client/Server Character Set Conversions)?  I think we should
unify it with that table, or at least put it next to that one.
Perhaps that's material for a separate patch though.

regards, tom lane




Re: pg_upgrade version checking questions

2019-07-30 Thread Daniel Gustafsson
> On 27 Jul 2019, at 08:42, Peter Eisentraut  
> wrote:

> I have committed 0002, 0003, and 0004.

Thanks!

> The implementation in 0001 (Only allow upgrades by the same exact
> version new bindir) has a problem.  It compares (new_cluster.bin_version
> != PG_VERSION_NUM), but new_cluster.bin_version is actually just the
> version of pg_ctl, so this is just comparing the version of pg_upgrade
> with the version of pg_ctl, which is not wrong, but doesn't really
> achieve the full goal of having all binaries match.

Right, it seemed the cleanest option at the time more or less based on the
issues outlined below.

> I think a better structure would be to add a version check for each
> validate_exec() so that each program is checked against pg_upgrade.
> This should mirror what find_other_exec() in src/common/exec.c does.  In
> a better world we would use find_other_exec() directly, but then we
> can't support -B.  Maybe expand find_other_exec() to support this, or
> make a private copy for pg_upgrade to support this.  (Also, we have two
> copies of validate_exec() around.  Maybe this could all be unified.)

I’ll take a stab at tidying all of this up to require less duplication; we’ll
see where that ends up.
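
To sketch the shape of that (purely an illustration of the idea, not the
eventual patch; exec_version_matches is a made-up helper): run each binary
found by validate_exec() with --version and require its first output line to
match pg_upgrade's own version line, roughly the way find_other_exec() does it.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool
exec_version_matches(const char *path, const char *expected_line)
{
	char		cmd[2048];
	char		line[1024];
	FILE	   *fp;
	bool		ok = false;

	snprintf(cmd, sizeof(cmd), "\"%s\" --version", path);
	fp = popen(cmd, "r");
	if (fp == NULL)
		return false;
	if (fgets(line, sizeof(line), fp) != NULL)
		ok = (strncmp(line, expected_line, strlen(expected_line)) == 0);
	pclose(fp);
	return ok;
}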

cheers ./daniel



Re: Parallel grouping sets

2019-07-30 Thread Tomas Vondra

On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:

On Wed, Jun 12, 2019 at 10:58 AM Richard Guo  wrote:


Hi all,

Paul and I have been hacking recently to implement parallel grouping
sets, and here we have two implementations.

Implementation 1


Attached is the patch and also there is a github branch [1] for this
work.



Rebased with the latest master.



Hi Richard,

thanks for the rebased patch. I think the patch is mostly fine (at least I
don't see any serious issues). A couple minor comments:

1) I think get_number_of_groups() would deserve a short explanation of why
it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
same branch. Before, these cases were clearly separated; now it seems a bit
mixed up and it may not be immediately obvious why it's OK.

2) There are new regression tests, but they are not added to any schedule
(parallel or serial), and so are not executed as part of "make check". I
suppose this is a mistake.

3) The regression tests do check plan and results like this:

   EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
   SELECT ... ORDER BY 1, 2, 3;

which however means that the query might easily use a different plan than
what's verified in the explain (thanks to the additional ORDER BY clause).
So I think this should explain and execute the same query.

(In this case the plans seems to be the same, but that may easily change
in the future, and we could miss it here, failing to verify the results.)

4) It might be a good idea to check the negative case too, i.e. a query on a
data set that we should not parallelize (because the number of partial
groups would be too high).


Do you have any plans to hack on the second approach too? AFAICS those two
approaches are complementary (address different data sets / queries), and
it would be nice to have both. One of the things I've been wondering is if
we need to invent gset_id as a new concept, or if we could simply use the
existing GROUPING() function - that uniquely identifies the grouping set.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: tap tests driving the database via psql

2019-07-30 Thread Tom Lane
Andres Freund  writes:
> On 2019-07-30 09:40:54 -0400, Tom Lane wrote:
>> Now, none of these things are really a problem with DBD/DBI as such
>> --- rather, they are reasons not to depend on a pre-packaged build
>> of DBD::Pg that depends on a pre-packaged build of libpq.so.
>> I haven't looked at the size, or the license, of DBD::Pg ... but
>> could it be sane to include our own modified copy in the tree?

> I had that as an alternative too. I think the license (Artistic v1/GPL
> v1) probably makes that non-feasible. The pure-perl version of DBI
> probably would otherwise be realistic.

OK, so just lifting DBD::Pg in toto is out for license reasons.
However, maybe we could consider writing a new DBD driver from
scratch (while using a platform-provided DBI layer) rather than
doing everything from scratch.  I'm not sure how much actual
functionality is in the DBI layer, so maybe that approach
wouldn't buy much.

regards, tom lane




Re: tap tests driving the database via psql

2019-07-30 Thread Andres Freund
Hi,

On 2019-07-30 09:40:54 -0400, Tom Lane wrote:
> Craig Ringer  writes:
> > On Sun, 28 Jul 2019 at 03:15, Andres Freund  wrote:
> >> 1) Just depend on DBD::Pg being installed. It's fairly common, after
> >> all. It'd be somewhat annoying that we'd often end up using a
> >> different version of libpq than what we're testing against. But in
> >> most cases that'd not be particularly problematic.
> 
> > I advocated for this in the past, and still think it's the best option.
> 
> I think the not-same-libpq issue would be a much bigger problem than either
> of you are accounting for.  Some off-the-top-of-the-head reasons:

I came to the same conclusion?


> Now, none of these things are really a problem with DBD/DBI as such
> --- rather, they are reasons not to depend on a pre-packaged build
> of DBD::Pg that depends on a pre-packaged build of libpq.so.
> I haven't looked at the size, or the license, of DBD::Pg ... but
> could it be sane to include our own modified copy in the tree?

I had that as an alternative too. I think the license (Artistic v1/GPL
v1) probably makes that non-feasible. The pure-perl version of DBI
probably would otherwise be realistic.

Greetings,

Andres Freund




Re: TopoSort() fix

2019-07-30 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jul 29, 2019 at 9:48 PM Tom Lane  wrote:
>> I believe the only accessible route to taking any sort of new lock
>> in a parallel worker is catalog lookups causing AccessShareLock on
>> a catalog.

> Can't the worker just query a previously-untouched table, maybe by
> constructing a string and then using EXECUTE to execute it?

Hm, yeah, looks like you could get a new AccessShareLock that way too.
But not any exclusive lock.

I also looked into whether one could use SELECT FOR UPDATE/SHARE to get
stronger locks at a tuple level, but that's been blocked off as well.
You guys really did a pretty good job of locking that down.

After thinking about this for awhile, though, I believe it might be
reasonable to just remove PreventAdvisoryLocksInParallelMode()
altogether.  The "parallel unsafe" markings on the advisory-lock
functions seem like adequate protection against somebody running
them in a parallel worker.  If you defeat that by calling them from
a mislabeled-parallel-safe wrapper (as the proposed test case does),
then any negative consequences are on your own head.  AFAICT the
only actual negative consequence is that the locks disappear the
moment the parallel worker exits, so we'd not be opening any large
holes even to people who rip off the safety cover.

(BTW, why aren't these functions just "parallel restricted"?)

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Sehrope Sarkuni
On Tue, Jul 30, 2019 at 10:06 AM Masahiko Sawada 
wrote:

> On Tue, Jul 30, 2019 at 5:03 AM Sehrope Sarkuni 
> wrote:
> >
> > On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian  wrote:
> >>
> >> > Checking that all buffers using a single LSN are from the same
> >> > relation would be a good idea but I think it's hard to test it and
> >> > regard the test result as okay. Even if we passed 'make check-world',
> >> > it might still be possible to happen. And even assertion failures
> >>
> >> Yes, the problem is that if you embed the relfilenode or tablespace or
> >> database in the encryption IV, you then need to then make sure you
> >> re-encrypt any files that move between these.  I am hesitant to do that
> >> since it then requires these workarounds for encryption going forward.
> >> We know that most people will not be using encryption, so that will not
> >> be well tested either.  For pg_upgrade, I used a minimal-impact
> >> approach, and it has allowed dramatic changes in our code without
> >> requiring changes and retesting of pg_upgrade.
> >
> >
> > Will there be a per-relation salt stored in a separate file? I saw it
> mentioned in a few places (most recently
> https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org)
> but there's also discussion of trying to make the TDEK unique without a
> separate salt so I'm unsure.
> >
> > With a per-relation salt there is no need to include fixed attributes
> (database, relfilenode, or tablespace) to ensure the derived key is unique
> per relation. A long salt (32-bytes from /dev/urandom) alone guarantees
> that uniqueness. Copying or moving files would then be possible by also
> copying the salt. It does not need to be a salt per file on disk either,
> one salt can be used for many files for the same relation by including the
> fork number, type, or segment in the TDEK derivation (so each file on disk
> for that relation ends up with a unique TDEK).
>
> If we can derive unique TDEK using (database, tablespace, relfilenode)
> as info I think it's better to use it rather than using a random salt
> per relation since it doesn't require additional information we need
> to store. As described in HKDF RFC[1], if the input key is already
> present as a cryptographically strong key we can skip the extract part
> where a salt is used.
>
> [1] https://tools.ietf.org/html/rfc5869


Yes, a random salt is not required for security reasons. Any unique values
mixed into the HKDF are fine and the derived keys will still be unique. The
HKDF ensures that uniqueness.
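
For illustration, a minimal sketch of what that could look like (my own
reading of RFC 5869 with made-up names such as derive_relation_key, not code
from the patch): derive a 256-bit per-relation key from the MDEK with a single
HKDF-Expand block, using the relation's physical identifiers as the "info"
input and skipping HKDF-Extract because the MDEK is already a strong key.

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

static void
derive_relation_key(const uint8_t mdek[32],
                    uint32_t dbOid, uint32_t tsOid,
                    uint32_t relfilenode, uint32_t forknum,
                    uint8_t tdek_out[32])
{
    uint8_t      info[4 * sizeof(uint32_t) + 1];
    unsigned int len = 0;

    memcpy(info, &dbOid, 4);
    memcpy(info + 4, &tsOid, 4);
    memcpy(info + 8, &relfilenode, 4);
    memcpy(info + 12, &forknum, 4);
    info[16] = 0x01;        /* HKDF-Expand: T(1) = HMAC(PRK, info || 0x01) */

    /* One SHA-256 output block is exactly the 32 bytes we need. */
    HMAC(EVP_sha256(), mdek, 32, info, sizeof(info), tdek_out, &len);
}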

The separate salt allows you to disconnect the key derivation from the
physical attributes of the file. The physical attributes (ex: database,
tablespace, file node) are very convenient as they're unique and do not
require additional storage. However using them prevents copying or moving
the encrypted files as one or more of them would be different at the
destination (so the derived key would no longer decrypt the existing data).
So you would have to decrypt / encrypt everything as part of a copy.

If copying raw files without a decrypt/encrypt cycle is desired then the
key derivation cannot include physical attributes (or per Bruce's note
above, there would be no separate key derivation relation). I thought it'd
be a nice property to have as it limits the amount of code that needs to be
crypto aware (ex: copying a database or moving a table to a different
tablespace would not change beyond ensuring the salt is also copied).

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Sehrope Sarkuni
On Tue, Jul 30, 2019 at 8:16 AM Bruce Momjian  wrote:

> On Tue, Jul 30, 2019 at 07:44:20AM -0400, Sehrope Sarkuni wrote:
> > If each relation file has its own derived key, the derived TDEK for that
> > relation file, then there is no issue with reusing the same IV = LSN ||
> Page
> > Number. The TDEKs will be different so Key + IV will never collide.
>
> So, this email explains that we are considering not using the
> relfilenode/tablespace/database to create a derived key per relation,
> but use the same key for all relations because the IV will be unique per
> page across all relations:
>
>
> https://www.postgresql.org/message-id/20190729134442.2bxakegiqafxg...@momjian.us
>
> There is talk of using a derived key with a constant to make sure all
> heap/index files use a different derived key than WAL, but I am not
> sure.  This is related to whether WAL IV and per-heap/index IV can
> collide.
>

Ah, I read that to imply that derived keys were a must. Specifically this
piece at the end:

From Joe's email on 2019-07-13 18:41:34:
>> Based on all of that I cannot find a requirement that we use more than
one key per database.
>>
>> But I did find that files in an encrypted file system are encrypted with
derived keys from a master key, and I view this as analogous to what we
are doing.

I read that as the "one key per database" is the MDEK.

And read the part about derived keys as referring to separate derived keys
for relations. Perhaps I misread and it was referring to different keys for
WAL vs pages.


> There are other emails in the thread that also discuss the topic.  The
> issue is that we add a lot of complexity to other parts of the system,
> (e.g. pg_upgrade, CREATE DATABASE, moving relations between tablespaces)
> to create a derived key, so we should make sure we need it before we do
> it.
>

Yes it definitely complicates things both on the derivation and potential
additional storage for the salts (they're small and fixed size, but you
still need to put them somewhere).

I think key rotation for TDEK will be impossible without some stored salt
and per-relation derived key. It might not be needed in a first cut though
as the "default salt" could be no salt or a place holder of all zeroes.
Even if the rotation itself is out of scope for a first pass the potential
to eventually add it should be there.

Should keep in mind that because we do not have a MAC on the encrypted
pages we'll need to know which derived key to use. We can't try multiple
options to see which is correct as any key would "succeed" with garbage
decrypted data.


> > In general it's fine to use the same IV with different keys. Only reuse
> of Key
> > + IV is a problem and the entire set of possible counter values (IV + 0,
> IV +
> > 1, ...) generated with a key must be unique. That's also why we must
> leave at
> > least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be
> filled
> > in with 0, 1, 2, ... for each 16-byte AES-block on the page. If our
> per-page IV
> > prefix used any of those bits then the counter could overflow into the
> next
> > page's IV's range.
>
> Agreed.
>
> Attached is an updated patch that checks only main relation forks, which
> I think are the only files we are going to encrypt, and it has better
> comments.
>

Okay that makes sense in the context of using a single key and relying on
the LSN based IV to be unique.
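
For reference, a sketch of the IV layout being discussed (my own illustration,
not the patch's code): the LSN and block number fill the high bytes and the
low bytes stay zero, so the per-page CTR counter (at most PAGE_SIZE /
AES_BLOCK_SIZE = 512 values) can never run into the next page's IV range.

#include <stdint.h>
#include <string.h>

static void
build_page_iv(uint8_t iv[16], uint64_t page_lsn, uint32_t blkno)
{
    memset(iv, 0, 16);
    memcpy(iv, &page_lsn, sizeof(page_lsn));    /* bytes 0..7: page LSN */
    memcpy(iv + 8, &blkno, sizeof(blkno));      /* bytes 8..11: block number */
    /* bytes 12..15 stay zero: room for the per-16-byte-block CTR counter */
}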

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Masahiko Sawada
On Tue, Jul 30, 2019 at 5:03 AM Sehrope Sarkuni  wrote:
>
> On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian  wrote:
>>
>> > Checking that all buffers using a single LSN are from the same
>> > relation would be a good idea but I think it's hard to test it and
>> > regard the test result as okay. Even if we passed 'make check-world',
>> > it might still be possible to happen. And even assertion failures
>>
>> Yes, the problem is that if you embed the relfilenode or tablespace or
>> database in the encryption IV, you then need to then make sure you
>> re-encrypt any files that move between these.  I am hesitant to do that
>> since it then requires these workarounds for encryption going forward.
>> We know that most people will not be using encryption, so that will not
>> be well tested either.  For pg_upgrade, I used a minimal-impact
>> approach, and it has allowed dramatic changes in our code without
>> requiring changes and retesting of pg_upgrade.
>
>
> Will there be a per-relation salt stored in a separate file? I saw it 
> mentioned in a few places (most recently 
> https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org)
>  but there's also discussion of trying to make the TDEK unique without a 
> separate salt so I'm unsure.
>
> With a per-relation salt there is no need to include fixed attributes 
> (database, relfilenode, or tablespace) to ensure the derived key is unique 
> per relation. A long salt (32-bytes from /dev/urandom) alone guarantees that 
> uniqueness. Copying or moving files would then be possible by also copying 
> the salt. It does not need to be a salt per file on disk either, one salt can 
> be used for many files for the same relation by including the fork number, 
> type, or segment in the TDEK derivation (so each file on disk for that 
> relation ends up with a unique TDEK).

If we can derive unique TDEK using (database, tablespace, relfilenode)
as info I think it's better to use it rather than using a random salt
per relation since it doesn't require additional information we need
to store. As described in HKDF RFC[1], if the input key is already
present as a cryptographically strong key we can skip the extract part
where a salt is used.

[1] https://tools.ietf.org/html/rfc5869

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Sehrope Sarkuni
On Tue, Jul 30, 2019 at 8:16 AM Masahiko Sawada 
wrote:

> On Mon, Jul 29, 2019 at 8:18 PM Sehrope Sarkuni 
> wrote:
> >
> > On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada 
> wrote:
> > > > An argument could be made to push that problem upstream, i.e. let the
> > > > supplier of the passphrase deal with the indirection. You would still
> > > > need to verify the supplied passphrase/key is correct via something
> > > > like authenticating against a stored MAC.
> > >
> > > So do we need the key for MAC of passphrase/key in order to verify?
> >
> > Yes. Any 128 or 256-bit value is a valid AES key and any 16-byte input
> > can be "decrypted" with it in both CTR and CBC mode, you'll just end
> > up with garbage data if the key does not match. Verification of the
> > key prior to usage (i.e. starting DB and encrypting/decrypting data)
> > is a must as otherwise you'll end up with all kinds of corruption or
> > data loss.
> >
>
> Do you mean that we encrypt and store a 16 byte input with the correct
> key to the disk, and then decrypt it with the user supplied key and
> compare the result to the input data?
>

Yes but we don't compare via decryption of a known input. We instead
compute a MAC of the encrypted master key using the user supplied key, and
compare that against an expected MAC stored alongside the encrypted master
key.

The pseudo code would be something like:

// Read key text from user:
string raw_kek = read_from_user()
// Normalize it to a fixed size of 64-bytes
byte[64] kek = SHA512(SHA512(raw_kek))
// Split the 64-bytes into a separate encryption and MAC key
byte[32] user_encryption_key = kek.slice(0,32)
byte[32] user_mac_key = kek.slice(32, 64)

// Read our saved MAC and encrypted master key
byte[80] mac_iv_encrypted_master_key = read_from_file()
// First 32-bytes is the MAC of the rest
byte[32] expected_mac = mac_iv_encrypted_master_key.slice(0, 32)
// Rest is a random IV + Encrypted master key
byte[48] iv_encrypted_master_key = mac_iv_encrypted_master_key.slice(32, 80)

// Compute the MAC with the user supplied key
byte[32] actual_mac = HMAC(user_mac_key, iv_encrypted_master_key)
// If it does not match then the user key is invalid
if (actual_mac != expected_mac) {
  print_err_and_exit("Bad user key!")
}

// Our MAC was correct
// ... so we know user supplied key is valid
// ... and we know our iv and encrypted_key are valid
byte[16] iv = iv_encrypted_master_key.slice(0,16)
byte[32] encrypted_master_key = iv_encrypted_master_key.slice(16, 48)
// ... so we can use all three to decrypt the master key (MDEK)
byte[32] master_key = decrypt_aes_cbc(user_encryption_key, iv,
encrypted_master_key)


> > From a single user supplied passphrase you would derive the MDEK and
> > compute a MAC (either using the same key or via a separate derived
> > MDEK-MAC key). If the computed MAC matches against the previously
> > stored value then you know the MDEK is correct as well.
>
> You meant KEK, not MDEK?
>

If the KEK is incorrect then the MAC validation would fail and the decrypt
would never be attempted.

If the MAC matches then both the KEK (user supplied key) and MDEK
("master_key" in the pseudo code above) would be confirmed to be valid. So
the MDEK is safe to use for deriving keys for encrypt / decrypt.

I'm using the definitions for "KEK" and "MDEK" from Joe's mail
https://www.postgresql.org/message-id/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: tap tests driving the database via psql

2019-07-30 Thread Tom Lane
Craig Ringer  writes:
> On Sun, 28 Jul 2019 at 03:15, Andres Freund  wrote:
>> 1) Just depend on DBD::Pg being installed. It's fairly common, after
>> all. It'd be somewhat annoying that we'd often end up using a
>> different version of libpq than what we're testing against. But in
>> most cases that'd not be particularly problematic.

> I advocated for this in the past, and still think it's the best option.

I think the not-same-libpq issue would be a much bigger problem than either
of you are accounting for.  Some off-the-top-of-the-head reasons:

* Since we'd have to presume a possibly-very-back-rev libpq, we could
never add tests related to any recent libpq bug fixes or new features.

* The system libpq might have a different idea of default port
and/or socket directory than the test build.  Yeah, we could work
around that by forcing the choice all the time, but it would be a
constant hazard.

* We don't have control over what else gets brought in beside libpq.
Depending on the whims of the platform packagers, there might need
to be other parts of the platform's default postgres installation,
eg psql, sitting in one's PATH.  Again, this wouldn't be too much
of a hazard for pre-debugged test scripts --- but it would be a huge
hazard for developers, who do lots of things manually and would always be
at risk of invoking the wrong psql.  I learned long ago not to have any
part of a platform's postgres packages in place on my development systems,
and I will fight hard against any test procedure change that requires me
to do differently.

Now, none of these things are really a problem with DBD/DBI as such
--- rather, they are reasons not to depend on a pre-packaged build
of DBD::Pg that depends on a pre-packaged build of libpq.so.
I haven't looked at the size, or the license, of DBD::Pg ... but
could it be sane to include our own modified copy in the tree?

(Forking DBD::Pg would also let us add custom testing features
to it ...)

> The community IMO wastes *so* much time on not-invented-here make-work and
> on jumping through hoops to avoid depending on anything newer than the late
> '90s.

That's an unnecessary, and false, ad-hominem attack.

regards, tom lane




Re: block-level incremental backup

2019-07-30 Thread Ibrar Ahmed
On Tue, Jul 30, 2019 at 1:28 AM Robert Haas  wrote:

> On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
>  wrote:
> > In attachments, you can find a prototype of incremental pg_basebackup,
> > which consists of 2 features:
> >
> > 1) To perform incremental backup one should call pg_basebackup with a
> > new argument:
> >
> > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
> >
> > where lsn is a start_lsn of parent backup (can be found in
> > "backup_label" file)
> >
> > It calls BASE_BACKUP replication command with a new argument
> > PREV_BACKUP_START_LSN 'lsn'.
> >
> > For datafiles, only pages with LSN > prev_backup_start_lsn will be
> > included in the backup.
> > They are saved into 'filename.partial' file, 'filename.blockmap' file
> > contains an array of BlockNumbers.
> > For example, if we backuped blocks 1,3,5, filename.partial will contain
> > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
>
> I think it's better to keep both the information about changed blocks
> and the contents of the changed blocks in a single file.  The list of
> changed blocks is probably quite short, and I don't really want to
> double the number of files in the backup if there's no real need. I
> suspect it's just overall a bit simpler to keep everything together.
> I don't think this is a make-or-break thing, and welcome contrary
> arguments, but that's my preference.
>

I have experience working on a similar product, and I agree with Robert
that keeping
the changed block info and the changed blocks in a single file makes more
sense.
+1
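
Just to illustrate the "single file" idea (a sketch of the concept, not a
proposed on-disk format; the struct name is made up): a small header carrying
the changed-block count, followed by the block-number array, followed by the
block images in the same order.

#include <stdint.h>

typedef struct IncrFileHeader
{
    uint32_t    magic;      /* identifies an incremental relation file */
    uint32_t    nblocks;    /* number of changed blocks that follow */
    /* then: uint32_t blkno[nblocks];          the block-number array */
    /* then: char     blocks[nblocks][BLCKSZ]; the block images       */
} IncrFileHeader;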

>
> > 2) To merge incremental backup into a full backup call
> >
> > pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
> > --merge-backups
> >
> > It will move all files from 'incremental_basedir' to 'basedir' handling
> > '.partial' files correctly.
>
> This, to me, looks like it's much worse than the design that I
> proposed originally.  It means that:
>
> 1. You can't take an incremental backup without having the full backup
> available at the time you want to take the incremental backup.
>
> 2. You're always storing a full backup, which means that you need more
> disk space, and potentially much more I/O while taking the backup.
> You save on transfer bandwidth, but you add a lot of disk reads and
> writes, costs which have to be paid even if the backup is never
> restored.
>
> > 1) Whether we collect block maps using simple "read everything page by
> > page" approach
> > or WAL scanning or any other page tracking algorithm, we must choose a
> > map format.
> > I implemented the simplest one, while there are more ideas:
>
> I think we should start simple.
>
> I haven't had a chance to look at Jeevan's patch at all, or yours in
> any detail, as yet, so these are just some very preliminary comments.
> It will be good, however, if we can agree on who is going to do what
> part of this as we try to drive this forward together.  I'm sorry that
> I didn't communicate EDB's plans to work on this more clearly;
> duplicated effort serves nobody well.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>

-- 
Ibrar Ahmed


Re: [proposal] de-TOAST'ing using a iterator

2019-07-30 Thread Binguo Bao
John Naylor  wrote on Mon, Jul 29, 2019 at 11:49 AM:

> On Thu, Jul 25, 2019 at 10:21 PM Binguo Bao  wrote:
> My goal for this stage of review was to understand more fully what the
> code is doing, and make it as simple and clear as possible, starting
> at the top level. In doing so, it looks like I found some additional
> performance gains. I haven't looked much yet at the TOAST fetching
> logic.
>
>
> 1). For every needle comparison, text_position_next_internal()
> calculates how much of the value is needed and passes that to
> detoast_iterate(), which then calculates if it has to do something or
> not. This is a bit hard to follow. There might also be a performance
> penalty -- the following is just a theory, but it sounds plausible:
> The CPU can probably correctly predict that detoast_iterate() will
> usually return the same value it did last time, but it still has to
> call the function and make sure, which I imagine is more expensive
> than advancing the needle. Ideally, we want to call the iterator only
> if we have to.
>
> In the attached patch (applies on top of your v5),
> text_position_next_internal() simply compares hptr to the detoast
> buffer limit, and calls detoast_iterate() until it can proceed. I
> think this is clearer.


Yes, I think this is a general scenario where the caller continually
calls detoast_iterate until it gets enough data, so I think such operations
can be extracted as a macro, as I did in patch v6. In the macro, the
detoast_iterate function is called only when the data requested by the caller
is greater than the buffer limit.
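
A sketch of that kind of macro (the field names follow this thread - done,
buf, limit - but the exact form in patch v6 may differ):

/* Keep iterating until the byte the caller needs is inside the buffer. */
#define DETOAST_ITERATE_UNTIL(iter, need_ptr) \
	do { \
		while (!(iter)->done && \
			   (const char *) (need_ptr) >= (const char *) (iter)->buf->limit) \
			detoast_iterate(iter); \
	} while (0)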

> (I'm not sure of the error handling, see #2.)
> In this scheme, the only reason to know length is to pass to
> pglz_decompress_iterate() in the case of in-line compression. As I
> alluded to in my first review, I don't think it's worth the complexity
> to handle that iteratively since the value is only a few kB. I made it
> so in-line datums are fully decompressed as in HEAD and removed struct
> members to match.


Sounds good. This not only simplifies the structure and logic of Detoast
Iterator
but also has no major impact on efficiency.


> I also noticed that no one updates or looks at
> "toast_iter.done" so I removed that as well.
>

toast_iter.done is now updated when the buffer limit reaches the buffer
capacity.
So, I added it back.


> Now pglz_decompress_iterate() doesn't need length at all. For testing
> I just set decompress_all = true and let the compiler optimize away
> the rest. I left finishing it for you if you agree with these changes.
>

Done.


> 2). detoast_iterate() and fetch_datum_iterate() return a value but we
> don't check it or do anything with it. Should we do something with it?
> It's also not yet clear if we should check the iterator state instead
> of return values. I've added some XXX comments as a reminder. We
> should also check the return value of pglz_decompress_iterate().
>

IMO, we need to provide users with a simple iterative interface.
Comparing the required data pointer with the buffer limit is an easy
way to do that.
And the application scenarios of the iterator are mostly read operations.
So I think there is no need to return a value, and the iterator needs to
throw an exception for some wrong calls, such as when all the data has
already been iterated but the user still calls the iterator.


>
> 3). Speaking of pglz_decompress_iterate(), I diff'd it with
> pglz_decompress(), and I have some questions on it:
>
> a).
> + srcend = (const unsigned char *) (source->limit == source->capacity
> ? source->limit : (source->limit - 4));
>
> What does the 4 here mean in this expression?


Since we fetch chunks one by one, if we make srcend equal to the source
buffer limit,
then in the while loop "while (sp < srcend && dp < destend)" sp may exceed
the source buffer limit and
read unallocated bytes. Keeping a four-byte margin prevents sp from
exceeding the source buffer limit.
Once we have read all the chunks, we no longer need to worry about crossing
the boundary,
and can just make srcend equal to the source buffer limit. I've added
comments to explain this in patch v6.



> Is it possible it's
> compensating for this bit in init_toast_buffer()?
>
> + buf->limit = VARDATA(buf->buf);
>
> It seems the initial limit should also depend on whether the datum is
> compressed, right? Can we just do this:
>
> + buf->limit = buf->position;
>

I'm afraid not. buf->position points to the data portion of the buffer, but
the beginning of
the chunks we read may contain header information. For example, for
compressed data chunks,
the first four bytes record the size of the raw data, which means that limit
is four bytes ahead of position.
This initialization doesn't cause errors, even though position is less
than the limit in the other cases,
because we always fetch chunks first and then decompress them.


> b).
> - while (sp < srcend && dp < destend)
> ...
> + while (sp + 1 < srcend && dp < destend &&
> ...
>
> Why is it here "sp + 1"?
>

Ignore it, I set the inactive state of detoast_iter->ctrl to 8 in 

Re: Built-in connection pooler

2019-07-30 Thread Tomas Vondra

On Tue, Jul 30, 2019 at 01:01:48PM +0300, Konstantin Knizhnik wrote:



On 30.07.2019 4:02, Tomas Vondra wrote:


My idea (sorry if it wasn't too clear) was that we might handle some
cases more gracefully.

For example, if we only switch between transactions, we don't quite care
about 'SET LOCAL' (but the current patch does set the tainted flag). The
same thing applies to GUCs set for a function.
For prepared statements, we might count the number of statements we
prepared and deallocated, and treat it as 'not tainted' when there are no
statements. Maybe there's some risk I can't think of.

The same thing applies to temporary tables - if you create and drop a
temporary table, is there a reason to still treat the session as tainted?




I already handle temporary tables with transaction scope (created
using the "create temp table ... on commit drop" command) - the backend is
not marked as tainted in this case.
Thank you for your note about the "set local" command - the attached patch
also handles such GUCs.




Thanks.



To implement prepared statements we need to store them in the session
context, or at least add a session-specific prefix to each prepared
statement name (see the sketch below).
Temporary tables also require per-session temporary table space.
With GUCs the situation is even more complicated - I actually spent most
of the time on my PgPro-EE pooler version fighting with GUCs (default
values, reloading configuration, memory allocation/deallocation, ...).
But the "show stopper" is temporary tables: if they are accessed through
internal (non-shared) buffers, then the session cannot be rescheduled to
some other backend.
This is why I now have a patch implementing global temporary tables (a la
Oracle), which have global metadata and are accessed through shared
buffers (which also allows them to be used in parallel queries).
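
As a concrete illustration of the session-specific prefix idea, something
along these lines would do (qualify_statement_name() is a hypothetical
helper made up for this example, not part of the patch):

#include <stdio.h>

/*
 * Build a pooler-visible name for a prepared statement by prefixing the
 * user-supplied name with the session identifier, so that two pooled
 * sessions preparing "stmt1" on the same backend do not collide.
 */
static void
qualify_statement_name(int session_id, const char *name,
					   char *out, size_t outlen)
{
	snprintf(out, outlen, "s%d_%s", session_id, name);
}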



Yeah, temporary tables are messy. Global temporary tables would be nice,
not just because of this, but also because of catalog bloat.



Global temp tables solve two problems:
1. catalog bloat
2. parallel query execution.

They do not solve the problem of using temporary tables on a replica.
Maybe that problem can be solved by implementing a special table access
method for temporary tables, but I am still not sure how useful such an
implementation would be.  It obviously requires much more effort (a lot of
heapam code would need to be reimplemented), but it would allow us to
eliminate the MVCC overhead for temporary tuples and perhaps also save
space by reducing the size of the tuple header.




Sure. Temporary tables are a hard issue (another place where they cause
trouble are 2PC transactions, IIRC), so I think it's perfectly sensible to
accept the limitation, handle cases that we can handle and see if we can
improve the remaining cases later.






If a Postgres backend is able to work with only one database, then
you will have to start at least as many backends as you have databases.
The situation with users is less clear - it may be possible to implement
multi-user access to the same backend (as can be done now using
"set role").




Yes, that's a direct consequence of the PostgreSQL process model.



I don't think I've said we need anything like that. The way I'd expect
it to work is that when we run out of backend connections, we terminate
some existing ones (and then fork new backends).


I am afraid that this may eliminate most of the positive effect of session
pooling if we terminate and launch new backends without any attempt to
bind backends to databases and reuse them.




I'm not sure I understand. Surely we'd still reuse connections as much as
possible - we'd still keep the per-db/user connection pools, but after
running out of backend connections we'd pick a victim in one of the pools,
close it and open a new connection.

We'd need some logic for picking the 'victim' but that does not seem
particularly hard - idle connections first, then connections from
"oversized" pools (this is one of the reasons why pgbouncer has
min_connection_pool).
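
For illustration, the victim-selection logic could be as simple as the
following sketch; PoolerPool, PoolerBackend and their fields are
assumptions made up for this example, not structures from the patch:

#include <stdbool.h>
#include <stddef.h>

typedef struct PoolerPool
{
	int			n_backends;		/* current size of this per-db/user pool */
	int			min_size;		/* configured minimum size */
} PoolerPool;

typedef struct PoolerBackend
{
	bool		idle;			/* no client currently attached */
	PoolerPool *pool;
} PoolerBackend;

/*
 * Pick a backend to close when we run out of backend slots: idle backends
 * first, then any backend belonging to a pool above its minimum size.
 * Returns NULL if nothing can be evicted.
 */
static PoolerBackend *
choose_victim(PoolerBackend **backends, int n)
{
	PoolerBackend *oversized = NULL;

	for (int i = 0; i < n; i++)
	{
		if (backends[i]->idle)
			return backends[i];
		if (backends[i]->pool->n_backends > backends[i]->pool->min_size)
			oversized = backends[i];
	}
	return oversized;
}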



So I am not sure that implementing a sophisticated configurator, which
allows specifying in some configuration file the maximal/optimal number of
workers for each database/role pair, would completely eliminate the
problem with multiple session pools.




Why would we need to invent any sophisticated configurator? Why couldn't
we use some version of what pgbouncer already does, or maybe integrate
it somehow into pg_hba.conf?


I hadn't thought about that possibility.
But I suspect there would be many problems with reusing pgbouncer code and
moving it into the Postgres core.




To be clear - I wasn't suggesting to copy any code from pgbouncer. It's
far too different (style, ...) compared to core. I'm suggesting to adopt
roughly the same configuration approach, i.e. what parameters are allowed
for each pool, global limits, etc.

I don't know whether we should have a separate configuration file, make it
part of pg_hba.conf 

Re: SQL:2011 PERIODS vs Postgres Ranges?

2019-07-30 Thread Ibrar Ahmed
Hi Paul,

I have rebased the patch to master (1e2fddfa33d3c7cc93ca3ee0f32852699bd3e012)
and fixed some compilation warnings. Now I am reviewing the actual code.


On Fri, Jul 26, 2019 at 6:35 PM Ibrar Ahmed  wrote:

> The patch requires to rebase on the master branch.
>
> The new status of this patch is: Waiting on Author
>


-- 
Ibrar Ahmed


temporal_pks_fks_v005.patch
Description: Binary data


Re: concerns around pg_lsn

2019-07-30 Thread Robert Haas
On Tue, Jul 30, 2019 at 4:52 AM Jeevan Ladhe
 wrote:
> My only concern was something that we internally treat as invalid, why do
> we allow, that as a valid value for that type. While I am not trying to
> reinvent the wheel here, I am trying to understand if there had been any
> idea behind this and I am missing it.

Well, the word "invalid" can mean more than one thing.  Something can
be valid or invalid depending on context.  I can't have -2 dollars in
my wallet, but I could have -2 dollars in my bank account, because the
bank will allow me to pay out slightly more money than I actually have
on the idea that I will pay them back later (and with interest!).  So
as an amount of money in my wallet, -2 is invalid, but as an amount of
money in my bank account, it is valid.

0/0 is not a valid LSN in the sense that (in current releases) we
never write a WAL record there, but it's OK to compute with it.
Subtracting '0/0'::pg_lsn seems useful as a way to convert an LSN to
an absolute byte-index in the WAL stream.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Restricting maximum keep segments by repslots

2019-07-30 Thread Kyotaro Horiguchi
At Tue, 30 Jul 2019 21:30:45 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190730.213045.221405075.horikyota@gmail.com>
> I attach the revised patch. I'll repost the polished version
> sooner.

(Mainly TAP test and documentation, code comments will be revised)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Restricting maximum keep segments by repslots

2019-07-30 Thread Kyotaro Horiguchi
Thanks for reviewing!

At Thu, 27 Jun 2019 16:22:56 +0200, Jehan-Guillaume de Rorthais 
 wrote in <20190627162256.4f4872b8@firost>
> Hi all,
> 
> Being interested by this feature, I did a patch review.
> 
> This feature adds the GUC "max_slot_wal_keep_size". This is the maximum
> amount
> of WAL that can be kept in "pg_wal" by active slots.
> 
> If the amount of WAL exceeds this limit, the slot is deactivated and
> its status (a new field in pg_replication_slots) is set to "lost".

This patch doesn't deactivate walsender. A walsender stops by
itself when it finds that it cannot continue ongoing replication.

> Patching
> 
> 
> The patch v13-0003 does not apply on HEAD anymore.
> 
> The patch v13-0005 applies using "git am --ignore-space-change"
> 
> Other patches applies correctly.
> 
> Please, find attached the v14 set of patches rebased on master.

Sorry for missing this for a long time. It was hit by 67b9b3ca32
again, so I am reposting a rebased version.

> Documentation
> =
> 
> The documentation explains the GUC and the related columns in
> "pg_replication_slots".
> 
> It reflects correctly the current behavior of the patch.
> 
> 
> Usability
> =
> 
> The patch implement what it described. It is easy to enable and disable. The
> GUC name is describing correctly its purpose.
> 
> This feature is useful in some HA scenarios where slots are required (e.g. no
> archiving possible), but where primary availability is more important than
> the standbys.

Yes. Thanks for the clear explanation on the purpose.

> In "pg_replication_slots" view, the new "wal_status" field is misleading.
> Consider this sentence and the related behavior from documentation
> (catalogs.sgml):
> 
>   keeping means that some of them are to be removed by the
>   next checkpoint.
> 
> "keeping" appears when the current checkpoint will delete some WAL further 
> than
> "current_lsn - max_slot_wal_keep_size", but still required by at least one 
> slot.
> As some WAL required by some slots will be deleted quite soon, probably before
> anyone can react, "keeping" status is misleading here. We are already in the
> red zone.

It may be "losing", which would be less misleading.

> I would expect this "wal_status" to be:
> 
> - streaming: slot lag between 0 and "max_wal_size"
> - keeping: slot lag between "max_wal_size" and "max_slot_wal_keep_size". the
>   slot actually protect some WALs from being deleted
> - lost: slot lag superior of max_slot_wal_keep_size. The slot couldn't protect
>   some WAL from deletion

I agree that comparing to max_wal_size is meaningful. The revised
version behaves that way.

> Documentation follows with:
> 
>   The last two states are seen only when max_slot_wal_keep_size is
>   non-negative
> 
> This is true with the current behavior. However, if "keeping" is set as soon
> as the slot lag exceeds "max_wal_size", this status could be useful even
> with "max_slot_wal_keep_size = -1". As soon as a slot is stacking WALs that
> should have been removed by a previous checkpoint, it "keeps" them.

I revised the documentation that way. Both
view-pg-replication-slots.html and
runtime-config-replication.html are reworded.

> Feature tests
> =
> 
> I have played with various traffic shaping setups between nodes, to observe how
> the columns "active", "wal_status" and "remain" behave with regard to each
> other, using:
> 
>   while true; do
> 

> 
> Finally, at least once the following messages appeared in primary logs
> **before** the "wal_status" changed from "keeping" to "streaming":
> 
>  WARNING:  some replication slots have lost required WAL segments
> 
> So the slot lost one WAL, but the standby has been able to catch-up anyway.

Thanks for the intensive test run. It is quite helpful.

> My humble opinion about these results:
> 
> * after many different tests, the status "keeping" appears only when "remain"
>   equals 0. In current implementation, "keeping" really adds no value...

Hmm. I agree, given that the "lost" (or "losing" in the
patch) state is not a definite state; that is, the slot may
still recover from it.

> * "remain" should be NULL if "max_slot_wal_keep_size=-1 or if the slot isn't
>   active

The revised version shows the following statuses (wal_status / remain):

   streaming / NULL    max_slot_wal_keep_size is -1
   unknown   / NULL    max_slot_wal_keep_size >= 0 and restart_lsn is invalid
   ...       / ...     otherwise

> * the "lost" status should be a definitive status
> * it seems related, but maybe the "wal_status" should be set as "lost"
>   only when the slot has been deactivate ?

Agreed. While replication is active, even if required segments seem
to be lost at some point, a delayed walreceiver ack can advance restart_lsn
back into the "safe" zone later. So, in the revised version, if the segment
for restart_lsn has been removed, GetLsnAvailablity() returns
"losing" if the walsender is active and "lost" if not.

> * logs should warn about a failing slot as soon as it is effectively
>   deactivated, not before.

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Bruce Momjian
On Tue, Jul 30, 2019 at 07:44:20AM -0400, Sehrope Sarkuni wrote:
> On Mon, Jul 29, 2019 at 8:35 PM Bruce Momjian  wrote:
> From the patch:
> 
> /*
> ! * The initialization vector (IV) is used for page-level
> ! * encryption.  We use the LSN and page number as the IV, and IV
> ! * values must never be reused since it is insecure. It is safe
> ! * to use the LSN on multiple pages in the same relation since
> ! * the page number is part of the IV.  It is unsafe to reuse the
> ! * LSN in different relations because the page number might be
> ! * the same, and hence the IV.  Therefore, we check here that
> ! * we don't have WAL records for different relations using the
> ! * same LSN.
> ! */
> 
> If each relation file has its own derived key, the derived TDEK for that
> relation file, then there is no issue with reusing the same IV = LSN || Page
> Number. The TDEKs will be different so Key + IV will never collide.

So, this email explains that we are considering not using the
relfilenode/tablespace/database to create a derived key per relation,
but using the same key for all relations because the IV will be unique per
page across all relations:


https://www.postgresql.org/message-id/20190729134442.2bxakegiqafxg...@momjian.us

There is talk of using a derived key with a constant to make sure all
heap/index files use a different derived key than the WAL, but I am not
sure.  This is related to whether the WAL IVs and the per-heap/index IVs
can collide.

There are other emails in the thread that also discuss the topic.  The
issue is that we add a lot of complexity to other parts of the system
(e.g. pg_upgrade, CREATE DATABASE, moving relations between tablespaces)
to create a derived key, so we should make sure we need it before we do
it.

> In general it's fine to use the same IV with different keys. Only reuse of Key
> + IV is a problem and the entire set of possible counter values (IV + 0, IV +
> 1, ...) generated with a key must be unique. That's also why we must leave at
> least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be filled
> in with 0, 1, 2, ... for each 16-byte AES-block on the page. If our per-page 
> IV
> prefix used any of those bits then the counter could overflow into the next
> page's IV's range.

Agreed.
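
For illustration, the IV layout being described could look like the
following sketch (build_page_iv() is a hypothetical helper written for
this mail, not code from the patch):

#include <stdint.h>
#include <string.h>

#define PAGE_IV_LEN 16			/* AES block/IV size */

/*
 * Build a 16-byte AES-CTR IV from the page LSN and block number, leaving
 * the low 4 bytes zero so the per-16-byte-block counter (0..511 for an
 * 8 kB page) can never overflow into the page-identifying bytes.
 */
static void
build_page_iv(uint64_t lsn, uint32_t blkno, uint8_t iv[PAGE_IV_LEN])
{
	memset(iv, 0, PAGE_IV_LEN);

	/* bytes 0-7: page LSN, big-endian */
	for (int i = 0; i < 8; i++)
		iv[i] = (uint8_t) (lsn >> (56 - 8 * i));

	/* bytes 8-11: block number, big-endian */
	for (int i = 0; i < 4; i++)
		iv[8 + i] = (uint8_t) (blkno >> (24 - 8 * i));

	/* bytes 12-15 stay zero: reserved for the AES-CTR block counter */
}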

Attached is an updated patch that checks only main relation forks, which
I think are the only files we are going to encrypt, and it has better
comments.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
new file mode 100644
index 3ec67d4..57f4d71
*** a/src/backend/access/transam/xloginsert.c
--- b/src/backend/access/transam/xloginsert.c
*** XLogResetInsertion(void)
*** 208,213 
--- 208,215 
  /*
   * Register a reference to a buffer with the WAL record being constructed.
   * This must be called for every page that the WAL-logged operation modifies.
+  * Because of page-level encryption, you cannot reference more than one
+  * RelFileNode in a WAL record; an Assert checks for that.
   */
  void
  XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
*** XLogRegisterBuffer(uint8 block_id, Buffe
*** 235,241 
  
  	/*
  	 * Check that this page hasn't already been registered with some other
! 	 * block_id.
  	 */
  #ifdef USE_ASSERT_CHECKING
  	{
--- 237,243 
  
  	/*
  	 * Check that this page hasn't already been registered with some other
! 	 * block_id, and check for different RelFileNodes in the same WAL record.
  	 */
  #ifdef USE_ASSERT_CHECKING
  	{
*** XLogRegisterBuffer(uint8 block_id, Buffe
*** 248,256 
--- 250,274 
  			if (i == block_id || !regbuf_old->in_use)
  continue;
  
+ 			/* check for duplicate block numbers */
  			Assert(!RelFileNodeEquals(regbuf_old->rnode, regbuf->rnode) ||
     regbuf_old->forkno != regbuf->forkno ||
     regbuf_old->block != regbuf->block);
+ 
+ 			/*
+ 			 * The initialization vector (IV) is used for page-level
+ 			 * encryption.  We use the LSN and page number as the IV, and IV
+ 			 * values must never be reused since it is insecure.	It is safe
+ 			 * to use the LSN on multiple pages in the same relation since
+ 			 * the page number is part of the IV.  It is unsafe to reuse the
+ 			 * LSN in different relations because the page number might be
+ 			 * the same, and hence the IV.  Therefore, we check here that
+ 			 * we don't have WAL records for different relations using the
+ 			 * same LSN.  We only encrypt MAIN_FORKNUM files.
+ 			 */
+ 			Assert(RelFileNodeEquals(regbuf_old->rnode, regbuf->rnode) ||
+    regbuf_old->forkno != MAIN_FORKNUM ||
+    regbuf->forkno != MAIN_FORKNUM);
  		}
  	}
  #endif


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-30 Thread Masahiko Sawada
On Mon, Jul 29, 2019 at 8:18 PM Sehrope Sarkuni  wrote:
>
> On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada  wrote:
> > > An argument could be made to push that problem upstream, i.e. let the
> > > supplier of the passphrase deal with the indirection. You would still
> > > need to verify the supplied passphrase/key is correct via something
> > > like authenticating against a stored MAC.
> >
> > So do we need the key for MAC of passphrase/key in order to verify?
>
> Yes. Any 128 or 256-bit value is a valid AES key and any 16-byte input
> can be "decrypted" with it in both CTR and CBC mode, you'll just end
> up with garbage data if the key does not match. Verification of the
> key prior to usage (i.e. starting DB and encrypting/decrypting data)
> is a must as otherwise you'll end up with all kinds of corruption or
> data loss.
>

Do you mean that we encrypt a 16-byte input with the correct key and store
it on disk, and then decrypt it with the user-supplied key and compare the
result to the input data?

> From a single user supplied passphrase you would derive the MDEK and
> compute a MAC (either using the same key or via a separate derived
> MDEK-MAC key). If the computed MAC matches against the previously
> stored value then you know the MDEK is correct as well.

You meant KEK, not MDEK?
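
(For what it's worth, my understanding of the check being described is
roughly the following sketch, assuming OpenSSL's PBKDF2 and HMAC-SHA256
primitives; this is not code from any posted patch:)

#include <openssl/crypto.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <stdbool.h>
#include <string.h>

/*
 * Derive a key-encryption key (KEK) from the user-supplied passphrase and
 * verify it against a MAC stored at key-setup time, before using it to
 * unwrap the real data-encryption keys.
 */
static bool
passphrase_is_valid(const char *passphrase,
					const unsigned char *salt, int saltlen,
					const unsigned char *stored_mac)	/* 32 bytes */
{
	unsigned char kek[32];
	unsigned char mac[32];
	unsigned int maclen;

	if (!PKCS5_PBKDF2_HMAC(passphrase, (int) strlen(passphrase),
						   salt, saltlen, 100000,
						   EVP_sha256(), (int) sizeof(kek), kek))
		return false;

	/* recompute the MAC over a fixed context string and compare */
	HMAC(EVP_sha256(), kek, (int) sizeof(kek),
		 (const unsigned char *) "kek-check", 9, mac, &maclen);

	return CRYPTO_memcmp(mac, stored_mac, sizeof(mac)) == 0;
}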

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



