Re: Unused header file inclusion
On Wed, Jul 31, 2019 at 11:19:08AM +0530, vignesh C wrote:
> I noticed that there are many header files being included which need
> not be included. I have tried this in a few files and found the
> compilation and regression to be working. I have attached the patch
> for the files that I tried. I tried this in CentOS; I did not find
> the header files to be platform specific.
> Should we pursue this further and clean up all the files?

Do you use a particular method here, or just manual deduction after looking at each file individually? If this can be cleaned up a bit, I think that's welcome. The removal of headers is easily forgotten when moving code from one file to another...
--
Michael
Re: Proposal to suppress errors thrown by to_reg*()
On Tue, 30 Jul 2019 12:24:13 -0400 Tom Lane wrote:
> Takuma Hoshiai writes:
> > [ fix_to_reg_v2.patch ]
>
> I took a quick look through this patch. I'm on board with the goal
> of not having schema-access violations throw an error in these
> functions, but the implementation feels pretty ugly and bolted-on.
> Nobody who had designed the code to do this from the beginning
> would have chosen to parse the names twice, much less check the
> ACLs twice, which is what's going to happen in the success path.
>
> But a worse problem is that this only fixes the issue for the
> object name proper. to_regprocedure and to_regoperator also
> have type name(s) to think about, and this doesn't fix the
> problem for those:
>
> regression=# create schema s1;
> CREATE SCHEMA
> regression=# create type s1.d1 as enum('a','b');
> CREATE TYPE
> regression=# create function f1(s1.d1) returns s1.d1 language sql as
> regression-# 'select $1';
> CREATE FUNCTION
> regression=# select to_regprocedure('f1(s1.d1)');
>  to_regprocedure
> -----------------
>  f1(s1.d1)
> (1 row)
>
> regression=# create user joe;
> CREATE ROLE
> regression=# \c - joe
> You are now connected to database "regression" as user "joe".
> regression=> select to_regprocedure('f1(s1.d1)');
> ERROR:  permission denied for schema s1
>
> A closely related issue that's been complained of before is that
> while these functions properly return NULL when the main object
> name includes a nonexistent schema:
>
> regression=> select to_regprocedure('q1.f1(int)');
>  to_regprocedure
> -----------------
>
> (1 row)
>
> it doesn't work for a nonexistent schema in a type name:
>
> regression=> select to_regprocedure('f1(q1.d1)');
> ERROR:  schema "q1" does not exist
>
> Looking at the back-traces for these failures,
>
> #0  errfinish (dummy=0) at elog.c:411
> #1  0x00553626 in aclcheck_error (aclerr=<value optimized out>,
>     objtype=OBJECT_SCHEMA, objectname=<value optimized out>) at aclchk.c:3623
> #2  0x0055028f in LookupExplicitNamespace (
>     nspname=<value optimized out>, missing_ok=false) at namespace.c:2928
> #3  0x005b3433 in LookupTypeName (pstate=0x0, typeName=0x20d87a0,
>     typmod_p=0x7fff94c3ee38, missing_ok=<value optimized out>)
>     at parse_type.c:162
> #4  0x005b3b29 in parseTypeString (str=<value optimized out>,
>     typeid_p=0x7fff94c3ee3c, typmod_p=0x7fff94c3ee38, missing_ok=false)
>     at parse_type.c:822
> #5  0x0086fe21 in parseNameAndArgTypes (string=<value optimized out>,
>     allowNone=false, names=<value optimized out>, nargs=0x7fff94c3f01c,
>     argtypes=0x7fff94c3ee80) at regproc.c:1874
> #6  0x00870b2d in to_regprocedure (fcinfo=0x2134900) at regproc.c:305
>
> #0  errfinish (dummy=0) at elog.c:411
> #1  0x0054dc7b in get_namespace_oid (nspname=<value optimized out>,
>     missing_ok=false) at namespace.c:3061
> #2  0x00550230 in LookupExplicitNamespace (
>     nspname=<value optimized out>, missing_ok=false) at namespace.c:2922
> #3  0x005b3433 in LookupTypeName (pstate=0x0, typeName=0x216bd20,
>     typmod_p=0x7fff94c3ee38, missing_ok=<value optimized out>)
>     at parse_type.c:162
> #4  0x005b3b29 in parseTypeString (str=<value optimized out>,
>     typeid_p=0x7fff94c3ee3c, typmod_p=0x7fff94c3ee38, missing_ok=false)
>     at parse_type.c:822
> #5  0x0086fe21 in parseNameAndArgTypes (string=<value optimized out>,
>     allowNone=false, names=<value optimized out>, nargs=0x7fff94c3f01c,
>     argtypes=0x7fff94c3ee80) at regproc.c:1874
> #6  0x00870b2d in to_regprocedure (fcinfo=0x2170f50) at regproc.c:305
>
> it's not *that* far from where we know we need no-error behavior to
> where it's failing to happen. parseNameAndArgTypes isn't even global,
> so certainly changing its API is not problematic. For the functions
> below it, we'd have to decide whether it's okay to consider that
> missing_ok=true also enables treating a schema access failure as
> "missing", or whether we should add an additional flag argument
> to decide that. It seems like it might be more flexible to use a
> separate flag, but that decision could propagate to a lot of places,
> especially if we conclude that all the namespace.c functions that
> expose missing_ok also need to expose schema_access_violation_ok.
>
> So I think you ought to drop this implementation and fix it properly
> by adjusting the behaviors of the functions cited above.

Thank you for looking at this. Sorry, I overlooked the access permission error on the arguments. Changing the behavior of 'missing_ok = true' would have an impact on DROP TABLE IF EXISTS, for example. So I will consider adding an additional flag like schema_access_violation_ok, instead of checking the ACLs twice.

> regards, tom lane

Best Regards,

--
Takuma Hoshiai
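The separate-flag approach Tom sketches can be pictured in miniature as a bitmask on the lookup routine. Everything below (the flag names, the toy catalog, lookup_oid itself) is an invented illustration of the pattern, not PostgreSQL's actual API:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup flags -- illustration only, not PostgreSQL's API. */
#define LOOKUP_MISSING_OK                  (1 << 0)
#define LOOKUP_SCHEMA_ACCESS_VIOLATION_OK  (1 << 1)

#define INVALID_OID 0

struct entry { const char *name; unsigned oid; int access_denied; };

/* A toy catalog: "s1.f1" exists but its schema is access-denied. */
static const struct entry catalog[] = {
    { "s1.f1", 1001, 1 },
    { "f2",    1002, 0 },
};

/* Return the object's OID, or INVALID_OID instead of failing when the
 * corresponding *_OK flag suppresses that failure.  Returns -1 (standing
 * in for ereport(ERROR)) when no flag covers the failure mode. */
static long lookup_oid(const char *name, int flags)
{
    for (size_t i = 0; i < sizeof(catalog) / sizeof(catalog[0]); i++) {
        if (strcmp(catalog[i].name, name) == 0) {
            if (catalog[i].access_denied)
                return (flags & LOOKUP_SCHEMA_ACCESS_VIOLATION_OK)
                       ? INVALID_OID : -1;
            return (long) catalog[i].oid;
        }
    }
    return (flags & LOOKUP_MISSING_OK) ? INVALID_OID : -1;
}
```

The point of the separate bit is visible in the two failure paths: missing_ok alone keeps schema-access violations fatal, which matches the DROP ... IF EXISTS concern raised above.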
Unused header file inclusion
Hi,

I noticed that there are many header files being included which need not be included. I have tried this in a few files and found the compilation and regression to be working. I have attached the patch for the files that I tried. I tried this in CentOS; I did not find the header files to be platform specific.

Should we pursue this further and clean up all the files?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment: unused_header_file_exclusion.patch
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita wrote:
> On Fri, Jul 19, 2019 at 10:44 PM Robert Haas wrote:
> > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita wrote:
> > > I.e., partition_bounds_merge() is performed for each pair of input
> > > partitioned relations for a join relation in try_partitionwise_join().
> > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> > > think this is acceptable; ISTM that some redesign is needed to avoid
> > > this. I'm wondering that once we successfully merged partition bounds
> > > from a pair of input partitioned relations for the join relation, by
> > > using the merged partition bounds, we could get the lists of matching
> > > to-be-joined partitions for subsequent pairs of input partitioned
> > > relations for the join relation in a more efficient way than by
> > > performing partition_bounds_merge() as proposed in the patch.
> >
> > I don't know whether partition_bounds_merge() is well-implemented; I
> > haven't looked.
>
> My concern about that is list partitioning. In that case that
> function calls partition_list_bounds_merge(), which generates the
> partition bounds for a join relation between two given input
> relations by performing a merge join over the pair of datum arrays
> from the two input relations.

I had similar thoughts upon seeing that partition_bounds_merge() will replace the current way of determining whether a partition-wise join can occur; it will make the handling of the currently supported cases more expensive. The current way is to compare the PartitionBoundInfos of the joining relations using partition_bounds_equal(), and, if they are equal, simply join the pairs of matching partitions if the join quals permit doing so. There's no need to do anything extra to determine which partitions to join with each other, because that's already established. Likewise, partition_bounds_merge() shouldn't have to do anything extra in that case.

That is, for the cases that are already supported, we should find a way to make partition_bounds_merge() only as expensive as just performing partition_bounds_equal(), or maybe just slightly more.

Thanks,
Amit
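The fast path Amit describes, where equal bounds make partition pairing trivial, reduces to something like the following toy sketch. The representation (a flat array of integer bound values) is invented for illustration; PostgreSQL's PartitionBoundInfo is far richer:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for PartitionBoundInfo: a sorted array of bound values. */
struct bounds { const int *vals; size_t n; };

/* Fast path: if the two sides' bound sets are identical, partition i on
 * one side joins partition i on the other -- no merge step is needed. */
static bool bounds_equal(const struct bounds *a, const struct bounds *b)
{
    if (a->n != b->n)
        return false;
    for (size_t i = 0; i < a->n; i++)
        if (a->vals[i] != b->vals[i])
            return false;
    return true;
}

/* Pair up partitions: writes pairs[i] = i when the fast path applies and
 * returns the pair count, or -1 when a real merge would be required. */
static int match_partitions(const struct bounds *a, const struct bounds *b,
                            size_t *pairs)
{
    if (!bounds_equal(a, b))
        return -1;              /* fall back to partition_bounds_merge() */
    for (size_t i = 0; i < a->n; i++)
        pairs[i] = i;
    return (int) a->n;
}
```

The equality check is a single linear scan, which is the cost target Amit suggests for the already-supported cases.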
Re: Unused struct member in pgcrypto pgp.c
On Tue, Jul 30, 2019 at 9:19 PM Daniel Gustafsson wrote:
>
> Hi,
>
> In contrib/pgcrypto/pgp.c we have a struct member int_name in digest_info which
> isn’t used, and seems to have never been used (a potential copy/pasteo from the
> cipher_info struct?). Is there a reason for keeping this, or can it be removed
> as per the attached?

Agreed. The member is not used anywhere; only the code and name members are used in the digest lookup.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Possible race condition in pg_mkdir_p()?
On Tue, Jul 30, 2019 at 10:19:45PM -0700, Andres Freund wrote:
> I don't really have a problem fixing this case if we think it's
> useful. But I'm a bit bothered by all the "fixes" being submitted that
> don't matter for PG itself. They do eat up resources.

Sure. In this particular case, we can at least simplify one code path in the backend, for temporary path creation. Such a cleanup rings like a sufficient argument to me.

> And sorry, adding in-backend threading to test mkdir_p doesn't
> make me inclined to believe that this is all well considered. There's
> minor issues like us not supporting threads in the backend, pthread not
> being portable, and also it being entirely out of proportion to the
> issue.

Agreed on this one. The test case may be useful for the purpose of testing the presented patches, but it does not have enough value to be merged.

--
Michael
Re: Possible race condition in pg_mkdir_p()?
Hi,

On 2019-07-31 13:48:23 +0900, Michael Paquier wrote:
> On Tue, Jul 30, 2019 at 09:11:44PM -0700, Andres Freund wrote:
> > Hm. I'm not really seeing much of a point in making mkdir_p safe against
> > all of this. What's the scenario for pg where this matters? I assume
> > you're using it for somewhat different purposes, and that's why it is
> > problematic for you?
>
> I don't see why it is a problem to make our APIs more stable if we
> have ways to do so.

Well, wanting to support additional use-cases often isn't free. There's additional code for the new use case, there's review & commit time, there's additional test time, there's bug fixes for the new code, etc. We're not developing a general application support library...

I don't really have a problem fixing this case if we think it's useful. But I'm a bit bothered by all the "fixes" being submitted that don't matter for PG itself. They do eat up resources.

And sorry, adding in-backend threading to test mkdir_p doesn't make me inclined to believe that this is all well considered. There are minor issues like us not supporting threads in the backend, pthread not being portable, and it also being entirely out of proportion to the issue.

Greetings,

Andres Freund
Re: tap tests driving the database via psql
On Wed, 31 Jul 2019 at 10:20, Andres Freund wrote:
> Hi,
>
> On 2019-07-31 09:32:10 +0800, Craig Ringer wrote:
> > OK. So rather than building our own $everything from scratch, let's look at
> > solving that.
>
> IDK, a minimal driver that just does what we need it to do is a few
> hundred lines, not more. And there's plenty of stuff that we simply
> won't be able to test with any driver that's not purposefully written
> for testing. There's e.g. basically no way with any of the drivers to
> test intentionally bogus sequences of protocol messages, yet that's
> something we really ought to test.
>
> I've written custom hacks^Wprograms to test things like that a few
> times. That really shouldn't be necessary.

That's a good point. I've had to write simple protocol hacks myself, and having a reusable base tool for it would indeed be valuable. I'm just worried it'll snowball into Yet Another Driver We Don't Need, or land up as a half-finished, poorly-maintained burden nobody wants to touch. Though the general fondness for and familiarity with Perl in the core circle of Pg contributors and committers would probably help here, since it'd inevitably land up being written in Perl...

I'm interested in evolution of the protocol and ways to enhance it, and I can see something we can actively use for protocol testing being valuable. If done right, the protocol implementation could probably pass-through and inspect messages from regular libpq etc., as well as serving as either a client or even as a dumb server simulator replaying pregenerated responses for client testing. We certainly couldn't do anything like that with existing tools or by reusing existing drivers.

I wonder if there's any way to make writing and maintaining it less painful, though. Having to make everything work with Perl 5.8.8 and no non-core modules leads to a whole pile of bespoke code and reinvention.

> > Perl modules support local installations. Would it be reasonable to require
> > DBI to be present, then rebuild DBD::Pg against our libpq during our own
> > test infra compilation, and run it with a suitable LD_LIBRARY_PATH or using
> > rpath? That'd actually get us a side-benefit of some test coverage of
> > DBD::Pg and libpq.
>
> You're intending to download DBD::Pg?

That'd be a reasonable option IMO, yes. Either via git with https, or a tarball where we check a signature. So long as it can use any pre-existing local copy without needing to hit the Internet, and the build can also succeed without it being present at all, I think that'd be reasonable enough. But I'm probably getting contaminated by excessive exposure to Apache Maven when I have to work with Java; I'm rapidly getting used to downloading things being a part of builds.

Personally I'd expect that unless specifically testing new libpq functionality etc., most people would just be using their system DBD::Pg... and I consider linking the system DBD::Pg against our newly compiled libpq more feature than bug, as if we retain the same soname we should be doing a reasonable job of not breaking upward compatibility anyway. (It'd potentially be an issue if you have a very new system libpq and are running tests on an old postgres, I guess.)

> Or how would you get around the licensing issues? Downloading it will have
> some issues too: For one at least I
> often want to be able to run tests when offline; there's security
> concerns too.

Roughly what I was thinking of was:

For Pg source builds (git, or dist tarball), we'd generally curl a tarball of DBD::Pg from an HTTPS URL specified in Makefile variables (with ?= so it can be overridden to point to an alternative location, a different version, a local path, etc.) and unpack it, then build it. The makefile can also store a signing-key fingerprint so we can download the sig file and check the sig is by the expected signer, if the user has imported the signing key for DBD::Pg.

We could test if the target directory already exists and is populated and re-use it, so people can git-clone DBD::Pg if they prefer.

And we'd allow users to specify --with-system-dbd-pg at configure time, or --without-dbd-pg. The Pg perl libraries for our TAP test harness/wrappers would offer a simple function to skip a test if DBD::Pg is missing, a simple function to skip a test if the loaded DBD::Pg lacks $some_feature_or_attribute, etc.

> Greetings,
>
> Andres Freund

--
Craig Ringer    http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
Re: Possible race condition in pg_mkdir_p()?
On Wed, Jul 31, 2019 at 12:41 PM Michael Paquier wrote:
>
> On Wed, Jul 31, 2019 at 12:26:30PM +0800, Ning Yu wrote:
> > Could I double confirm with you that you made a clean rebuild after
> > applying the patches? pg_mkdir_p() is compiled as part of libpgport.a,
> > and the postgres makefile will not relink the initdb binary
> > automatically; for myself I must 'make clean' and 'make' to ensure
> > initdb gets relinked.
>
> For any patch I test, I just do a "git clean -d -x -f" before building
> as I switch a lot across stable branches as well. It looks that you
> are right on this one though; I have just rebuilt from scratch and I
> don't see the failures anymore.

Cool, glad to know that it works.

Best Regards
Ning
Re: Possible race condition in pg_mkdir_p()?
On Tue, Jul 30, 2019 at 09:11:44PM -0700, Andres Freund wrote:
> Hm. I'm not really seeing much of a point in making mkdir_p safe against
> all of this. What's the scenario for pg where this matters? I assume
> you're using it for somewhat different purposes, and that's why it is
> problematic for you?

I don't see why it is a problem to make our APIs more stable if we have ways to do so. I actually fixed one such issue recently, as of 754b90f, for a problem that involved a tool linking to the version of readdir() that we ship.

Even with that, the retries for mkdir() on the base directory in PathNameCreateTemporaryDir() are basically caused by that same limitation with the parent paths from this report, no? So we could actually remove the dependency on the base directory in this code path and just rely on pg_mkdir_p() to do the right thing for all the parent paths. That's also a point raised by Ning upthread.

--
Michael
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro wrote:
>
> Hi Amit
>
> I've been testing some undo worker workloads (more on that soon),

One small point: there is a small bug in the error queues, which is that the element pushed into the error queue doesn't have an updated value of to_urec_ptr, which is important for constructing the hash key. This leads the undo launcher/worker to think that the action for it has already been processed, and it removes the entry from the hash table. I have a fix for this which I will share in the next version of the patch (which I am going to post in the next day or two).

> but
> here's a small thing: I managed to reach an LWLock self-deadlock in
> the undo worker launcher:

I could see the problem; I will fix it in the next version. Thank you for reviewing and testing this.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Possible race condition in pg_mkdir_p()?
On Wed, Jul 31, 2019 at 12:26:30PM +0800, Ning Yu wrote:
> Could I double confirm with you that you made a clean rebuild after
> applying the patches? pg_mkdir_p() is compiled as part of libpgport.a,
> and the postgres makefile will not relink the initdb binary
> automatically; for myself I must 'make clean' and 'make' to ensure
> initdb gets relinked.

For any patch I test, I just do a "git clean -d -x -f" before building as I switch a lot across stable branches as well. It looks that you are right on this one though; I have just rebuilt from scratch and I don't see the failures anymore.

--
Michael
Re: Possible race condition in pg_mkdir_p()?
On Wed, Jul 31, 2019 at 12:11 PM Andres Freund wrote:
>
> Hi,
>
> On 2019-07-18 16:17:22 +0800, Ning Yu wrote:
> > This seems buggy as it first checks the existence of the dir and makes the
> > dir if it does not exist yet; however, when executing concurrently a
> > possible race condition can be as below:
> >
> > A: does a/ exist? no
> > B: does a/ exist? no
> > A: try to create a/, succeed
> > B: try to create a/, failed as it already exists
>
> Hm. I'm not really seeing much of a point in making mkdir_p safe against
> all of this. What's the scenario for pg where this matters? I assume
> you're using it for somewhat different purposes, and that's why it is
> problematic for you?

Yes, you are right: postgres itself might not run into this kind of race condition. The problem we encountered was in a downstream product of postgres, where multiple postgres clusters are deployed on the same machine with common parent dirs.

Best Regards
Ning

> Greetings,
>
> Andres Freund
Re: SegFault on 9.6.14
Amit Kapila writes:
> On Wed, Jul 31, 2019 at 12:05 AM Robert Haas wrote:
>> The other option is to do
>> what I understand Amit and Thomas to be proposing, which is to do a
>> better job identifying the case where we're "done for good" and can
>> trigger the shutdown fearlessly.

> Yes, this sounds safe fix for back-branches.

Actually, my point was exactly that I *didn't* think that would be a safe fix for the back branches --- at least, not unless you're okay with a very conservative and hence resource-leaky method for deciding when it's safe to shut down sub-nodes.

We could do something involving (probably) adding new eflags bits to pass this sort of info down to child plan nodes. But that's going to require design and coding, and it will not be backwards compatible. At least not from the point of view of any extension that's doing anything in that area.

regards, tom lane
Re: Possible race condition in pg_mkdir_p()?
On Wed, Jul 31, 2019 at 12:04 PM Michael Paquier wrote:
>
> On Tue, Jul 30, 2019 at 06:22:59PM +0800, Ning Yu wrote:
> > In fact personally I'm thinking about whether we could replace all uses of
> > MakePGDirectory() with pg_mkdir_p(), so we could simplify
> > TablespaceCreateDbspace() and PathNameCreateTemporaryDir() and other callers
> > significantly.
>
> I would still keep the wrapper, but I think that as a final result we
> should be able to get the code in PathNameCreateTemporaryDir() shaped
> in such a way that there are no multiple attempts at calling
> MakePGDirectory() on EEXIST. This has been introduced by dc6c4c9d to
> allow sharing temporary files between backends, which is rather recent,
> but a fixed set of two retries is not a deterministic method of
> resolution.
>
> > Well, we should have fixed problem 2; this is the major purpose of patch
> > 0001, which performs a sanity check with stat() after mkdir() at each part
> > of the path.
>
> I just reuse the script presented at the top of the thread with n=2,
> and I get that:
>
> pgdata/logs/1.log:creating directory
> pgdata/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/1
> ...
> initdb: error: could not create directory "pgdata/a": File exists

Could I double confirm with you that you made a clean rebuild after applying the patches? pg_mkdir_p() is compiled as part of libpgport.a, and the postgres makefile will not relink the initdb binary automatically; for myself I must 'make clean' and 'make' to ensure initdb gets relinked.

> But the result expected is that all the paths should be created with
> no complaints about the parents existing, no? This reproduces on my
> Debian box 100% of the time, for different sub-paths. So something
> looks wrong in your solution. The code comes originally from FreeBSD;
> how do things happen there? Do we get failures if doing something
> like that? I would expect this sequence to not fail:
>
> for i in `seq 1 100`; do mkdir -p b/c/d/f/g/h/j/$i; done
> --
> Michael
Re: tap tests driving the database via psql
On Tue, Jul 30, 2019 at 07:20:53PM -0700, Andres Freund wrote:
> IDK, a minimal driver that just does what we need it to do is a few
> hundred lines, not more. And there's plenty of stuff that we simply
> won't be able to test with any driver that's not purposefully written
> for testing. There's e.g. basically no way with any of the drivers to
> test intentionally bogus sequences of protocol messages, yet that's
> something we really ought to test.

Yes. I don't know if a count of a few hundred lines measures that correctly, but we should really avoid bloating the stuff we rely on, for example by fetching an independent driver and then not using most of its features. 300~500 lines would be fine for me; 10k definitely not.

> I've written custom hacks^Wprograms to test things like that a few
> times. That really shouldn't be necessary.

Mine are usually plain hacks :), and usually on top of libpq, but I have done some python-ish scripting to send bogus, hardcoded protocol messages. If we can automate this area a bit, that could be useful.

>> Perl modules support local installations. Would it be reasonable to require
>> DBI to be present, then rebuild DBD::Pg against our libpq during our own
>> test infra compilation, and run it with a suitable LD_LIBRARY_PATH or using
>> rpath? That'd actually get us a side-benefit of some test coverage of
>> DBD::Pg and libpq.
>
> You're intending to download DBD::Pg? Or how would you get around the
> licensing issues? Downloading it will have some issues too: For one at least I
> often want to be able to run tests when offline; there's security
> concerns too.

As Craig has mentioned, CPAN can be configured to install all the libraries for a local user, so there is no need to be online if a copy has already been downloaded.

--
Michael
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Jul 30, 2019 at 1:32 PM Andres Freund wrote:
>
> Hi,
>
> Amit, short note: The patches aren't attached in patch order. Obviously
> a miniscule thing, but still nicer if that's not the case.

Noted, I will try to ensure that the patches are in order in future posts.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Possible race condition in pg_mkdir_p()?
Hi,

On 2019-07-18 16:17:22 +0800, Ning Yu wrote:
> This seems buggy as it first checks the existence of the dir and makes the
> dir if it does not exist yet; however, when executing concurrently a
> possible race condition can be as below:
>
> A: does a/ exist? no
> B: does a/ exist? no
> A: try to create a/, succeed
> B: try to create a/, failed as it already exists

Hm. I'm not really seeing much of a point in making mkdir_p safe against all of this. What's the scenario for pg where this matters? I assume you're using it for somewhat different purposes, and that's why it is problematic for you?

Greetings,

Andres Freund
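The check-then-create race Ning describes goes away if the existence check and the creation are collapsed into a single mkdir() whose EEXIST result is then verified with stat(), which is essentially what the proposed 0001 patch does at each path component. A minimal sketch of that idiom (mkdir_one is an invented helper, not the actual pg_mkdir_p() code):

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create one path component, tolerating a concurrent creator: call
 * mkdir() first, and only on EEXIST check that what exists really is a
 * directory.  Returns 0 on success, -1 on failure with errno set. */
static int mkdir_one(const char *path, mode_t mode)
{
    struct stat sb;

    if (mkdir(path, mode) == 0)
        return 0;               /* we created it ourselves */
    if (errno != EEXIST)
        return -1;              /* a real failure */
    /* Someone else may have created it between our call and now; that
     * is fine as long as the existing entry is a directory. */
    if (stat(path, &sb) != 0)
        return -1;
    if (!S_ISDIR(sb.st_mode)) {
        errno = ENOTDIR;
        return -1;
    }
    return 0;
}
```

Both racing processes then succeed: whichever mkdir() loses gets EEXIST, and the follow-up stat() confirms the directory is there.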
Re: SegFault on 9.6.14
On Wed, Jul 31, 2019 at 12:05 AM Robert Haas wrote: > > On Thu, Jul 18, 2019 at 9:45 AM Tom Lane wrote: > > I think this is going in the wrong direction. Nodes should *always* > > assume that a rescan is possible until ExecEndNode is called. > > If you want to do otherwise, you are going to be inventing a whole > > bunch of complicated and doubtless-initially-buggy control logic > > to pass down information about whether a rescan might be possible. > > That doesn't sound like a recipe for a back-patchable fix. Perhaps > > we could consider redesigning the rules around REWIND in a future > > version, but that's not where to focus the bug fix effort. > > So, if I can summarize how we got here, as best I understand it: > Thanks for the summarization. This looks mostly correct to me. > 0. The historic behavior of the executor is to assume it's OK to leak > resources for the lifetime of the query. Nodes that are executed to > completion generally do some cleanup, but we feel free (as under > Limit) to just stop executing a node without giving it any hint that > it should release resources. So a Sort may hold onto a terabyte of > memory and an index scan may keep holding a pin even after there's no > theoretical way of ever needing those resources again, and we just > don't care. > > 1. Parallel query made that perhaps-already-shaky assumption a lot > more problematic. Partly that's because workers are a a more scarce > and considerably heavier resource than anything else, and moreover act > as a container for anything else, so whatever you were leaking before, > you can now leak N times more of it, plus N processes, until the end > of the query. 
However, there's a correctness reason too, which is that > when a node has a copy in the leader and a copy in every worker, each > copy has its own instrumentation data (startup time, run time, nloops, > etc) and we can only fold all that together once the node is done > executing, because it's really hard to add up a bunch of numbers > before the numbers are done changing. We could've made the > instrumentation shared throughout, but if we had, we could have > contention for updating the instrumentation data, which seems like > it'd be bad. > > 2. To fix that correctness problem, we decided to try to shut down the > node under a limit node when we're done with it (commit > 85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this > looks fundamentally necessary to me. If you're going to have N > separate copies of the instrumentation, and you want to add them up > when you're done, then you have to decide to be done at some point; > otherwise you don't know when to add them up, and maybe won't add them > up at all, and then you'll be sad. This does not mean that the exact > timing couldn't be changed somehow, but if you want a correct > implementation, you have to shut down Limit's sub-node after you're > done executing it (so that you can get the instrumentation data from > the workers after it's final) and before you start destroying DSM > segments and stuff (so that you get the instrumentation data from the > workers before it vanishes). > > 3. The aforementioned commit turned out to be buggy in at least to two > ways, precisely because it didn't do a good enough job predicting when > the Limit needed to be shut down. First, there was commit > 2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact > that you could hit the Limit and then back up. > We have not missed it, rather we decided to it separately because it appears to impact some different cases as well [1][2]. > Second, there's the > present issue, where the Limit gets rescanned. 
> > So, given all that, if we want to adopt Tom's position that we should > always cater to a possible rescan, then we're going to have to rethink > the way that instrumentation data gets consolidated from workers into > the leader in such a way that we can consolidate multiple times > without ending up with the wrong answer. > The other idea we had discussed which comes closer to adopting Tom's position was that during ExecShutdownNode, we just destroy parallel workers, collect instrumentation data and don't destroy the parallel context. The parallel context could be destroyed in ExecEndNode (ExecEndGather(Merge)) code path. The problem with this idea is that ExitParallelMode doesn't expect parallel context to be active. Now, we can either change the location of Exit/EnterParallelMode or relax that restriction. As mentioned above that restriction appears good to me, so I am not in favor of changing it unless we have some other solid way to install it. I am not sure if this idea is better than other approaches we are discussing. > The other option is to do > what I understand Amit and Thomas to be proposing, which is to do a > better job identifying the case where we're "done for good" and can > trigger the shutdown fearlessly. > Yes, this sounds safe fix for back-branches. We might
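One way to make the consolidation Robert describes safe to run more than once, as the rescan case would require, is to make the fold idempotent: drain each worker's counters into the leader's copy and reset them, so a second pass adds nothing. A toy sketch of the idea only; the struct is invented, and PostgreSQL's real Instrumentation is far larger:

```c
#include <stddef.h>

/* Toy per-node instrumentation: one copy per worker plus the leader. */
struct instr { double ntuples; double nloops; };

/* Fold workers' counters into the leader's copy.  Each worker's counters
 * are zeroed after being accumulated, so calling this again (e.g. after
 * a rescan triggers another shutdown) cannot double-count anything. */
static void fold_instrumentation(struct instr *leader,
                                 struct instr *workers, size_t nworkers)
{
    for (size_t i = 0; i < nworkers; i++) {
        leader->ntuples += workers[i].ntuples;
        leader->nloops  += workers[i].nloops;
        workers[i].ntuples = 0;     /* drained: repeat folds are no-ops */
        workers[i].nloops  = 0;
    }
}
```

The timing constraints from point 2 above still apply: the fold must happen after the workers stop updating their copies and before their shared memory is torn down; the reset only removes the "exactly once" requirement.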
Re: Possible race condition in pg_mkdir_p()?
On Tue, Jul 30, 2019 at 06:22:59PM +0800, Ning Yu wrote:
> In fact personally I'm thinking about whether we could replace all uses of
> MakePGDirectory() with pg_mkdir_p(), so we could simplify
> TablespaceCreateDbspace() and PathNameCreateTemporaryDir() and other callers
> significantly.

I would still keep the wrapper, but I think that as a final result we should be able to get the code in PathNameCreateTemporaryDir() shaped in such a way that there are no multiple attempts at calling MakePGDirectory() on EEXIST. This has been introduced by dc6c4c9d to allow sharing temporary files between backends, which is rather recent, but a fixed set of two retries is not a deterministic method of resolution.

> Well, we should have fixed problem 2; this is the major purpose of patch
> 0001, which performs a sanity check with stat() after mkdir() at each part
> of the path.

I just reuse the script presented at the top of the thread with n=2, and I get that:

pgdata/logs/1.log:creating directory pgdata/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/1
...
initdb: error: could not create directory "pgdata/a": File exists

But the result expected is that all the paths should be created with no complaints about the parents existing, no? This reproduces on my Debian box 100% of the time, for different sub-paths. So something looks wrong in your solution. The code comes originally from FreeBSD; how do things happen there? Do we get failures if doing something like that? I would expect this sequence to not fail:

for i in `seq 1 100`; do mkdir -p b/c/d/f/g/h/j/$i; done

--
Michael
How to retain lesser paths at add_path()?
Hello, When we add a new path using add_path(), it checks the estimated cost and path-keys, and it also removes dominated paths, if any. Do we have a reasonable way to retain these "dominated" paths? Even if a path is considered lesser at one level, it may yield a cheaper combined cost with an upper path node. In my case, PG-Strom adds CustomPath to support JOIN/GROUP BY workloads that utilize GPU and NVME storage. If GpuPreAgg and GpuJoin are executed back to back, the output buffer of GpuJoin simultaneously serves as the input buffer of GpuPreAgg in GPU device memory, which avoids a pointless DMA transfer between GPU and CPU/RAM. This behavior affects cost estimation during the path construction steps - GpuPreAgg discounts the DMA cost if its input path is GpuJoin. On the other hand, it looks to me like add_path() does not consider upper-level optimizations other than sorting path-keys. As long as we can keep these "lesser" path nodes that have a further optimization chance, it will help build more reasonable query plans. Do we have any reasonable way to retain such a path at add_path() even if it is dominated by other paths? Any idea welcome. Best regards, [*] GpuJoin + GpuPreAgg combined GPU kernel https://www.slideshare.net/kaigai/20181016pgconfeussd2gpumulti/13 -- HeteroDB, Inc / The PG-Strom Project KaiGai Kohei
Re: Avoiding hash join batch explosions with extreme skew and weird stats
On Tue, Jul 30, 2019 at 8:07 PM Melanie Plageman wrote: > For the actual write to disk, I'm pretty sure I get that for free from > the BufFile API, no? > I was more thinking about optimizing when I call BufFileWrite at all. Right. Clearly several existing buffile.c users regularly have very small BufFileWrite() size arguments. tuplestore.c, for one. -- Peter Geoghegan
Re: Avoiding hash join batch explosions with extreme skew and weird stats
On Tue, Jul 30, 2019 at 4:36 PM Robert Haas wrote: > On Tue, Jul 30, 2019 at 2:47 PM Melanie Plageman > wrote: > > I did the "needlessly dumb implementation" Robert mentioned, though, > > I thought about it and couldn't come up with a much smarter way to > > write match bits to a file. I think there might be an optimization > > opportunity in not writing the current_byte to the file each time that > > the outer tuple matches and only doing this once we have advanced to a > > tuple number that wouldn't have its match bit in the current_byte. I > > didn't do that to keep it simple, and, I suspect there might be a bit > > of gymnastics needed to make sure that that byte is actually written > > to the file in case we exit from some other state before we encounter > > the tuple represented in the last bit in that byte. > > I mean, I was assuming we'd write in like 8kB blocks or something. > Doing it a byte at a time seems like it'd produce way too many > syscals. > > For the actual write to disk, I'm pretty sure I get that for free from the BufFile API, no? I was more thinking about optimizing when I call BufFileWrite at all. -- Melanie Plageman
Re: Runtime pruning problem
On Wed, Jul 31, 2019 at 8:31 AM Tom Lane wrote: > David Rowley writes: > > On Wed, 31 Jul 2019 at 10:56, Tom Lane wrote: > >> The portion of this below the Append is fine, but I argue that > >> the Vars above the Append should say "part", not "part_p1". > >> In that way they'd look the same regardless of which partitions > >> have been pruned or not. > > > That seems perfectly reasonable for Append / MergeAppend that are for > > scanning partitioned tables. What do you propose we do for inheritance > > and UNION ALLs? > > For inheritance, I don't believe there would be any change, precisely > because we've historically used the parent rel as reference. I may be missing something, but Vars above an Append/MergeAppend, whether it's scanning a partitioned table or a regular inheritance table, always refer to the first child subplan, which may or may not be for the inheritance parent in its role as a child, not the Append parent. create table parent (a int); alter table only parent add check (a = 1) no inherit; create table child1 (a int check (a = 2)) inherits (parent); create table child2 (a int check (a = 3)) inherits (parent); explain (costs off, verbose) select * from parent where a > 1 order by 1; QUERY PLAN ─── Sort Output: child1.a Sort Key: child1.a -> Append -> Seq Scan on public.child1 Output: child1.a Filter: (child1.a > 1) -> Seq Scan on public.child2 Output: child2.a Filter: (child2.a > 1) (10 rows) I think this is because we replace the original targetlist of such nodes by a dummy one using set_dummy_tlist_references(), where all the parent Vars are re-stamped with OUTER_VAR as varno. 
When actually printing the EXPLAIN VERBOSE output, ruleutils.c considers the first child of Append as the OUTER referent, as set_deparse_planstate() states: /* * We special-case Append and MergeAppend to pretend that the first child * plan is the OUTER referent; we have to interpret OUTER Vars in their * tlists according to one of the children, and the first one is the most * natural choice. If I change set_append_references() to comment out the set_dummy_tlist_references() call, I get this output: explain (costs off, verbose) select * from parent where a > 1 order by 1; QUERY PLAN ─── Sort Output: a Sort Key: a -> Append -> Seq Scan on public.child1 Output: child1.a Filter: (child1.a > 1) -> Seq Scan on public.child2 Output: child2.a Filter: (child2.a > 1) (10 rows) Not parent.a as I had expected. That seems to be because parent's RTE is considered unused in the plan. One might say that the plan's Append node belongs to that RTE, but then Append doesn't have any RT index attached to it, so it escapes ExplainPreScanNode()'s walk of the plan tree to collect the indexes of "used RTEs". I changed set_rtable_names() to get around that as follows: @@ -3458,7 +3458,7 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces, /* Just in case this takes an unreasonable amount of time ... */ CHECK_FOR_INTERRUPTS(); -if (rels_used && !bms_is_member(rtindex, rels_used)) +if (rels_used && !bms_is_member(rtindex, rels_used) && !rte->inh) and I get: explain (costs off, verbose) select * from parent where a > 1 order by 1; QUERY PLAN ─── Sort Output: parent.a Sort Key: parent.a -> Append -> Seq Scan on public.child1 Output: child1.a Filter: (child1.a > 1) -> Seq Scan on public.child2 Output: child2.a Filter: (child2.a > 1) (10 rows) > For setops we've traditionally used the left input as reference. > Maybe we could do better, but I'm not very sure how, since SQL > doesn't actually provide any explicit names for the setop result. 
> Making up a name with no basis in the query probably isn't an > improvement, or at least not enough of one to justify a change. I too am not sure what we should do about Appends of setops, but with the above hacks, I get this: explain (costs off, verbose) select * from child1 union all select * from child2 order by 1; QUERY PLAN ─── Sort Output: "*SELECT* 1".a Sort Key: "*SELECT* 1".a -> Append -> Seq Scan on public.child1 Output: child1.a -> Seq Scan on public.child2 Output: child2.a (8 rows) whereas currently it prints: explain (costs off, verbose) select * from child1 union all select * from child2 order by 1; QUERY PLAN ─── Sort Output: child1.a Sort Key: child1.a -> Append -> Seq Scan on public.child1
Re: tap tests driving the database via psql
Hi, On 2019-07-31 09:32:10 +0800, Craig Ringer wrote: > OK. So rather than building our own $everything from scratch, lets look at > solving that. IDK, a minimal driver that just does what we need it to do is a few hundred lines, not more. And there's plenty of stuff that we simply won't be able to test with any driver that's not purposefully written for testing. There's e.g. basically no way with any of the drivers to test intentionally bogus sequences of protocol messages, yet that's something we really ought to test. I've written custom hacks^Wprograms to test things like that a few times. That really shouldn't be necessary. > Perl modules support local installations. Would it be reasonable to require > DBI to be present, then rebuild DBD::Pg against our libpq during our own > test infra compilation, and run it with a suitable LD_LIBRARY_PATH or using > rpath? That'd actually get us a side-benefit of some test coverage of > DBD::Pg and libpq. You're intending to download DBD::Pg? Or how would you get around the licensing issues? Downloading it will have some issues too: For one at least I often want to be able to run tests when offline; there are security concerns too. Greetings, Andres Freund
Re: Initdb failure
On Tue, Jul 30, 2019 at 03:30:12PM -0400, Tom Lane wrote: > On the whole though, I don't have a problem with the "do nothing" > answer. There's no security risk here, and no issue that seems > likely to arise in actual use cases rather than try-to-break-it > test scripts. +1. -- Michael
Re: idea: log_statement_sample_rate - bottom limit for sampling
On Tue, Jul 30, 2019 at 03:43:58PM -0400, Tom Lane wrote: > Well, we do not need to have a backwards-compatibility problem > here, because we have yet to release a version containing > log_statement_sample_rate. I do not think it's too late to decide > that v12's semantics for that are broken, and either revert that > patch in v12, or back-patch a fix to make it match this idea. With my RTM hat on, if we think that the current semantics of log_statement_sample_rate are broken and need a redesign, then I would take the safest path and just revert the original patch in v12, and finally make sure that it brews correctly for v13. We are in beta2 and close to a beta3, so redesigning things at this stage on a stable branch sounds wrong. -- Michael
Re: extension patch of CREATE OR REPLACE TRIGGER
On Tue, Jul 30, 2019 at 04:44:11PM -0400, Tom Lane wrote: > We do not test \h output in any existing regression test, and we're > not going to start doing so in this one. For one thing, the expected > URL would break every time we forked off a new release branch. > (There would surely be value in having more-than-no test coverage > of psql/help.c, but that's a matter for its own patch, which would > need some thought about how to cope with instability of the output.) One way to get out of that could be some psql-level options to control which parts of the help output show up. The recent addition of the URL may bring more weight for doing something in this area. -- Michael
Re: tap tests driving the database via psql
On Tue, 30 Jul 2019 at 21:40, Tom Lane wrote: > Craig Ringer writes: > > On Sun, 28 Jul 2019 at 03:15, Andres Freund wrote: > >> 1) Just depend on DBD::Pg being installed. It's fairly common, after > >> all. It'd be somewhat annoying that we'd often end up using a > >> different version of libpq than what we're testing against. But in > >> most cases that'd not be particularly problematic. > > > I advocated for this in the past, and still think it's the best option. > > I think the not-same-libpq issue would be a much bigger problem than either > of you are accounting for. OK. So rather than building our own $everything from scratch, let's look at solving that. I'd argue that the PostgreSQL community has enough work to do maintaining the drivers that are fairly directly community-owned like PgJDBC and psqlODBC, let alone writing new ones for less-in-demand languages just for test use. Perl modules support local installations. Would it be reasonable to require DBI to be present, then rebuild DBD::Pg against our libpq during our own test infra compilation, and run it with a suitable LD_LIBRARY_PATH or using rpath? That'd actually get us a side-benefit of some test coverage of DBD::Pg and libpq. Note that I'm not saying this is the only or right way. Your concerns about using a system libpq are entirely valid, as are your concerns about trying to link to our libpq using a DBD::Pg originally built against the system libpq. I've just been dealing with issues similar to those today and I know how much of a hassle it can be. However, I don't think "it's hard" is a good reason to write a whole new driver and potentially do a Netscape/Mozilla. > * Since we'd have to presume a possibly-very-back-rev libpq, we could > never add tests related to any recent libpq bug fixes or new features. > We can feature-test. We're not dealing with pg_regress where it's essentially impossible to do anything conditionally without blocks of plpgsql.
We're dealing with Perl and Test::More where we can make tests conditional, we can probe the runtime environment to decide whether we can run a given test, etc. Also, lets compare to the status quo. Will it be worse than exec()ing psql with a string argument and trying to do sane database access via stdio? Definitely. Not. > * The system libpq might have a different idea of default port > and/or socket directory than the test build. Yeah, we could work > around that by forcing the choice all the time, but it would be a > constant hazard. > I'd call that a minor irritation personally, as all we have to do is set up the environment and we're done. > * We don't have control over what else gets brought in beside libpq. > That's a very significant point that we must pay attention to. Anyone who's worked with nss will probably know the pain of surprise library conflicts arising from nss plugins being loaded unexpectedly into the program. I still twitch thinking about libnss-ldap. It'd make a lot of sense to capture "ldd" output and/or do a minimal run with LD_DEBUG set during buildfarm test runs to help identify any interesting surprises. > Depending on the whims of the platform packagers, there might need > to be other parts of the platform's default postgres installation, > eg psql, sitting in one's PATH. Again, this wouldn't be too much > of a hazard for pre-debugged test scripts --- but it would be a huge > hazard for developers, who do lots of things manually and would always be > at risk of invoking the wrong psql. I learned long ago not to have any > part of a platform's postgres packages in place on my development systems, > and I will fight hard against any test procedure change that requires me > to do differently. > TBH I think that's a habit/procedure issue. That doesn't make it invalid, but it might not be as big a concern for everyone as for you. I have a shell alias that sets up my environment and another that prints the current setup. 
I never run into issues like this, despite often multi-tasking while tired. Not only that, I find platform postgres extremely useful to have on my systems to use for simple cross-version tests. If we adopt something like I suggested above where we (re)build DBD::Pg against our installed pg and libpq, that wouldn't be much of an issue anyway. We'd want a nice way to set up the runtime environment in a shell for manual testing of course, like a generated .sh we can source. But frankly I'd find that a useful thing to have in postgres anyway. I can't count the number of times I've wished there was an easy way to pause pg_regress and launch a psql session against its temp instance, or do the same with a TAP test. Now, none of these things are really a problem with DBD/DBI as such > --- rather, they are reasons not to depend on a pre-packaged build > of DBD::Pg that depends on a pre-packaged build of libpq.so. > Yeah. While I don't agree with your conclusion in terms of how you weigh the priorities, I agree that all your points
Re: Unused struct member in pgcrypto pgp.c
On Tue, Jul 30, 2019 at 05:48:49PM +0200, Daniel Gustafsson wrote: > In contrib/pgcrypto/pgp.c we have a struct member int_name in digest_info > which > isn’t used, and seems to have never been used (a potential copy/pasteo from > the > cipher_info struct?). Is there a reason for keeping this, or can it be > removed > as per the attached? I don't see one as this is not used in any logic for the digest lookups. So agreed and applied. This originally comes from e94dd6a. -- Michael
Re: concerns around pg_lsn
On Tue, Jul 30, 2019 at 02:22:30PM +0530, Jeevan Ladhe wrote: > On Tue, Jul 30, 2019 at 9:42 AM Michael Paquier wrote: >> Agreed about making the code more defensive as you do. I would keep >> the initialization in check_recovery_target_lsn and pg_lsn_in_internal >> though. That does not hurt and makes the code easier to understand, >> aka we don't expect an error by default in those paths. >> > > Sure, understood. I am ok with this. I am adding Peter Eisentraut in CC as 21f428e is his commit. I think that the first patch is a good idea, so I would be fine to apply it, but let's see the original committer's opinion first. -- Michael
Re: Attached partition not considering altered column properties of root partition.
On Wed, Jul 31, 2019 at 2:38 AM Robert Haas wrote: > > On Tue, Jul 2, 2019 at 9:53 PM Amit Langote wrote: > > Thanks for the report. This seems like a bug. Documentation claims > > that the child tables inherit column storage options from the parent > > table. That's actually enforced in only some cases. > > I realize I'm just repeating the same argument I've already made > before on other related topics, but I don't think this is a bug. > Inherited-from-parent is not the same as > enforced-to-always-be-the-same-as-parent. Note that this is allowed, > changing only the child: > > rhaas=# create table foo (a int, b text) partition by range (a); > CREATE TABLE > rhaas=# create table foo1 partition of foo for values from (0) to (10); > CREATE TABLE > rhaas=# alter table foo1 alter column b set storage plain; > ALTER TABLE > > As is this, changing only the parent: > > rhaas=# alter table only foo1 alter column b set storage plain; > ALTER TABLE > > How can you possibly argue that the intended behavior is > everything-always-matches when that's not what's documented and > there's absolutely nothing that tries to enforce it? You're right. The patch as proposed is barely enough to ensure the everything-always-matches behavior. Let's update^H^H^H^H^H^H^H forget about the patch. :) I do remember that we made a list of things that we decided must match in all partitions, which ended up being slightly bigger than the same list for regular inheritance children, but much smaller than the list of all the things that could be different among children. I forgot we did that when replying to Prabhat's report. In this particular case, I do have doubts about whether we really need attstorage to be the same in all the children, so maybe I should've first asked why we should think of this as a bug. 
> I'm getting really tired of people thinking that they can invent rules > for table partitioning that were (1) never intended by the original > implementation and (2) not even slightly enforced by the code, and > then decide that those behavior changes should not only be made but > back-patched. That is just dead wrong. There is no problem here. > There is no need to change ANYTHING. There is even less need to do it > in the back branches. OK, I'm withdrawing my patch. Thanks, Amit
Re: Adding column "mem_usage" to view pg_prepared_statements
On Tue, Jul 30, 2019 at 10:01:09PM +, Daniel Migowski wrote: > Hello, > > Will my patch be considered for 12.0? The calculation of the > mem_usage value might be improved later on but because the system > catalog is changed I would love to add it before 12.0 becomes > stable. Feature freeze was April 8, so no. https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg37059.html You're in plenty of time for 13, though! Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: pgbench - implement strict TPC-B benchmark
Peter Geoghegan writes: > On Tue, Jul 30, 2019 at 3:00 PM Tom Lane wrote: >> I'm also highly dubious about labeling this script "standard TPC-B", >> when it resolves only some of the reasons why our traditional script >> is not really TPC-B. That's treading on being false advertising. > IANAL, but it may not even be permissible to claim that we have > implemented "standard TPC-B". Yeah, very likely you can't legally say that unless the TPC has certified your test. (Our existing code and docs are quite careful to call pgbench's version "TPC-like" or similar weasel wording, and never claim that it is really TPC-B or even a close approximation.) regards, tom lane
Re: Avoiding hash join batch explosions with extreme skew and weird stats
On Tue, Jul 30, 2019 at 2:47 PM Melanie Plageman wrote: > I did the "needlessly dumb implementation" Robert mentioned, though, > I thought about it and couldn't come up with a much smarter way to > write match bits to a file. I think there might be an optimization > opportunity in not writing the current_byte to the file each time that > the outer tuple matches and only doing this once we have advanced to a > tuple number that wouldn't have its match bit in the current_byte. I > didn't do that to keep it simple, and, I suspect there might be a bit > of gymnastics needed to make sure that that byte is actually written > to the file in case we exit from some other state before we encounter > the tuple represented in the last bit in that byte. I mean, I was assuming we'd write in like 8kB blocks or something. Doing it a byte at a time seems like it'd produce way too many syscalls. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgbench - implement strict TPC-B benchmark
On Tue, Jul 30, 2019 at 3:00 PM Tom Lane wrote: > TBH, I think we should reject this patch. Nobody cares about TPC-B > anymore, and they care even less about differences between one > sort-of-TPC-B test and another sort-of-TPC-B test. (As the lack > of response on this thread shows.) We don't need this kind of > baggage in pgbench; it's got too many "features" already. +1. TPC-B was officially made obsolete in 1995. > I'm also highly dubious about labeling this script "standard TPC-B", > when it resolves only some of the reasons why our traditional script > is not really TPC-B. That's treading on being false advertising. IANAL, but it may not even be permissible to claim that we have implemented "standard TPC-B". -- Peter Geoghegan
Re: Runtime pruning problem
David Rowley writes: > On Wed, 31 Jul 2019 at 10:56, Tom Lane wrote: >> The portion of this below the Append is fine, but I argue that >> the Vars above the Append should say "part", not "part_p1". >> In that way they'd look the same regardless of which partitions >> have been pruned or not. > That seems perfectly reasonable for Append / MergeAppend that are for > scanning partitioned tables. What do you propose we do for inheritance > and UNION ALLs? For inheritance, I don't believe there would be any change, precisely because we've historically used the parent rel as reference. For setops we've traditionally used the left input as reference. Maybe we could do better, but I'm not very sure how, since SQL doesn't actually provide any explicit names for the setop result. Making up a name with no basis in the query probably isn't an improvement, or at least not enough of one to justify a change. regards, tom lane
Re: Runtime pruning problem
On Wed, 31 Jul 2019 at 10:56, Tom Lane wrote: > > OK, so experimenting, I see that it is a change: HEAD does > > regression=# explain verbose select * from part order by a; >QUERY PLAN > - > Sort (cost=362.21..373.51 rows=4520 width=8) >Output: part_p1.a, part_p1.b >Sort Key: part_p1.a >-> Append (cost=0.00..87.80 rows=4520 width=8) > -> Seq Scan on public.part_p1 (cost=0.00..32.60 rows=2260 width=8) >Output: part_p1.a, part_p1.b > -> Seq Scan on public.part_p2_p1 (cost=0.00..32.60 rows=2260 > width=8) >Output: part_p2_p1.a, part_p2_p1.b > (8 rows) > > The portion of this below the Append is fine, but I argue that > the Vars above the Append should say "part", not "part_p1". > In that way they'd look the same regardless of which partitions > have been pruned or not. That seems perfectly reasonable for Append / MergeAppend that are for scanning partitioned tables. What do you propose we do for inheritance and UNION ALLs? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Runtime pruning problem
I wrote: > This may be arguing for a change in ruleutils' existing behavior, > not sure. But when dealing with traditional-style inheritance, > I've always thought that Vars above the Append were referring to > the parent rel in its capacity as the parent, not in its capacity > as the first child. With new-style partitioning drawing a clear > distinction between the parent and all its children, it's easier > to understand the difference. OK, so experimenting, I see that it is a change: HEAD does regression=# explain verbose select * from part order by a; QUERY PLAN - Sort (cost=362.21..373.51 rows=4520 width=8) Output: part_p1.a, part_p1.b Sort Key: part_p1.a -> Append (cost=0.00..87.80 rows=4520 width=8) -> Seq Scan on public.part_p1 (cost=0.00..32.60 rows=2260 width=8) Output: part_p1.a, part_p1.b -> Seq Scan on public.part_p2_p1 (cost=0.00..32.60 rows=2260 width=8) Output: part_p2_p1.a, part_p2_p1.b (8 rows) The portion of this below the Append is fine, but I argue that the Vars above the Append should say "part", not "part_p1". In that way they'd look the same regardless of which partitions have been pruned or not. regards, tom lane
Re: Runtime pruning problem
David Rowley writes: > On Wed, 31 Jul 2019 at 10:27, Tom Lane wrote: >> What I had in mind was to revert 1cc29fe7c's ruleutils changes >> entirely, so that ruleutils deals only in Plans not PlanStates. >> Perhaps we've grown some code since then that really needs the >> PlanStates, but what is that, and could we do it some other way? >> I'm not thrilled with passing both of these around, especially >> if the PlanState sometimes isn't there, meaning that no code in >> ruleutils could safely assume it's there anyway. > Are you not worried about the confusion that run-time pruning might > cause if we always show the Vars from the first Append/MergeAppend > plan node, even though the corresponding executor node might have been > pruned? The upper-level Vars should ideally be labeled with the append parent rel's name anyway, no? I think it's likely *more* confusing if those Vars change appearance depending on which partitions get pruned or not. This may be arguing for a change in ruleutils' existing behavior, not sure. But when dealing with traditional-style inheritance, I've always thought that Vars above the Append were referring to the parent rel in its capacity as the parent, not in its capacity as the first child. With new-style partitioning drawing a clear distinction between the parent and all its children, it's easier to understand the difference. regards, tom lane
Re: Adding column "mem_usage" to view pg_prepared_statements
On 31.07.2019 00:29, Tom Lane wrote: Daniel Migowski writes: Ok, just have read about the commitfest thing. Is the patch OK for that? Or is there generally no love for a mem_usage column here? If it was, I really would add some memory monitoring in our app regarding this. You should certainly put it into the next commitfest. We might or might not accept it, but if it's not listed in the CF we might not remember to even review it. (The CF app is really a to-do list for patches ...) OK, added it there. Thanks for your patience :). Regards, Daniel Migowski
Re: Runtime pruning problem
On Wed, 31 Jul 2019 at 10:27, Tom Lane wrote: > > David Rowley writes: > > The part I wouldn't mind another set of eyes on is the ruleutils.c > > changes. > > Um, sorry for not getting to this sooner. > > What I had in mind was to revert 1cc29fe7c's ruleutils changes > entirely, so that ruleutils deals only in Plans not PlanStates. > Perhaps we've grown some code since then that really needs the > PlanStates, but what is that, and could we do it some other way? > I'm not thrilled with passing both of these around, especially > if the PlanState sometimes isn't there, meaning that no code in > ruleutils could safely assume it's there anyway. Are you not worried about the confusion that run-time pruning might cause if we always show the Vars from the first Append/MergeAppend plan node, even though the corresponding executor node might have been pruned? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Adding column "mem_usage" to view pg_prepared_statements
On 31.07.2019 00:17, Tomas Vondra wrote: FWIW not sure what mail client you're using, but it seems to be breaking the threads for some reason, splitting it into two - see [1]. Also, please stop top-posting. It makes it way harder to follow the discussion. Was using Outlook because it's my company's mail client but switching to Thunderbird now for communication with the list, trying to be a good citizen now. Regards, Daniel Migowski
Re: AW: AW: Adding column "mem_usage" to view pg_prepared_statements
Daniel Migowski writes: > Ok, just have read about the commitfest thing. Is the patch OK for that? Or > is there generally no love for a mem_usage column here? If it was, I really > would add some memory monitoring in our app regarding this. You should certainly put it into the next commitfest. We might or might not accept it, but if it's not listed in the CF we might not remember to even review it. (The CF app is really a to-do list for patches ...) regards, tom lane
Re: Runtime pruning problem
David Rowley writes: > The part I wouldn't mind another set of eyes on is the ruleutils.c > changes. Um, sorry for not getting to this sooner. What I had in mind was to revert 1cc29fe7c's ruleutils changes entirely, so that ruleutils deals only in Plans not PlanStates. Perhaps we've grown some code since then that really needs the PlanStates, but what is that, and could we do it some other way? I'm not thrilled with passing both of these around, especially if the PlanState sometimes isn't there, meaning that no code in ruleutils could safely assume it's there anyway. regards, tom lane
Re: Adding column "mem_usage" to view pg_prepared_statements
On Tue, Jul 30, 2019 at 10:01:09PM +, Daniel Migowski wrote: Hello, Will my patch be considered for 12.0? The calculation of the mem_usage value might be improved later on but because the system catalog is changed I would love to add it before 12.0 becomes stable. Nope. Code freeze for PG12 was April 7, 2019. You're a bit late for that, unfortunately. We're in PG13 land now. FWIW not sure what mail client you're using, but it seems to be breaking the threads for some reason, splitting it into two - see [1]. Also, please stop top-posting. It makes it way harder to follow the discussion. [1] https://www.postgresql.org/message-id/flat/41ED3F5450C90F4D8381BC4D8DF6BBDCF02E10B4%40EXCHANGESERVER.ikoffice.de regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
AW: AW: Adding column "mem_usage" to view pg_prepared_statements
Ok, just have read about the commitfest thing. Is the patch OK for that? Or is there generally no love for a mem_usage column here? If it was, I really would add some memory monitoring in our app regarding this. -----Original Message----- From: Tom Lane Sent: Wednesday, July 31, 2019 00:09 To: Daniel Migowski Cc: Andres Freund ; pgsql-hackers@lists.postgresql.org Subject: Re: AW: Adding column "mem_usage" to view pg_prepared_statements Daniel Migowski writes: > Will my patch be considered for 12.0? The calculation of the mem_usage value > might be improved later on but because the system catalog is changed I would > love to add it before 12.0 becomes stable. v12 has been feature-frozen for months, and it's pretty hard to paint this as a bug fix, so I'd say the answer is "no". regards, tom lane
Re: AW: Adding column "mem_usage" to view pg_prepared_statements
Daniel Migowski writes: > Will my patch be considered for 12.0? The calculation of the mem_usage value > might be improved later on but because the system catalog is changed I would > love to add it before 12.0 becomes stable. v12 has been feature-frozen for months, and it's pretty hard to paint this as a bug fix, so I'd say the answer is "no". regards, tom lane
AW: Adding column "mem_usage" to view pg_prepared_statements
Hello, Will my patch be considered for 12.0? The calculation of the mem_usage value might be improved later on but because the system catalog is changed I would love to add it before 12.0 becomes stable. Regards, Daniel Migowski -Original Message- From: Daniel Migowski Sent: Sunday, July 28, 2019 08:21 To: Andres Freund Cc: pgsql-hackers@lists.postgresql.org Subject: Re: Adding column "mem_usage" to view pg_prepared_statements Hello Andres, how do you want to generalize it? Are you thinking about a view solely for the display of the memory usage of different objects? Like functions or views (which also have a plan associated with them, when I think about it)? While that would be interesting, I still believe monitoring the mem usage of prepared statements is a bit more important than that of other objects because of how they change memory consumption of the server without using any DDL or configuration options, and I am not aware of other objects with the same properties, or are there some? And for the other volatile objects like tables and indexes and their contents, PostgreSQL already has its information functions. Regardless of that, here is the patch for now. I didn't want to fiddle too much with MemoryContexts yet, so it still doesn't recurse into child contexts, but I will change that when I try to build a more compact MemoryContext implementation and see how that works out. Thanks for pointing out the relevant information in the statement column of the view. Regards, Daniel Migowski -Original Message- From: Andres Freund Sent: Saturday, July 27, 2019 21:12 To: Daniel Migowski Cc: pgsql-hackers@lists.postgresql.org Subject: Re: Adding column "mem_usage" to view pg_prepared_statements Hi, On 2019-07-27 18:29:23 +, Daniel Migowski wrote: > I just implemented a small change that adds another column "mem_usage" > to the system view "pg_prepared_statements". 
It returns the memory > usage total of CachedPlanSource.context, > CachedPlanSource.query_content and if available > CachedPlanSource.gplan.context. FWIW, it's generally easier to comment if you actually provide the patch, even if it's just a POC, as that gives a better handle on how much additional complexity it introduces. I think this could be a useful feature. I'm not so sure we want it tied to just cached statements, however - perhaps we ought to generalize it a bit more. Regarding the prepared-statement-specific considerations: I don't think we ought to explicitly reference CachedPlanSource.query_content and CachedPlanSource.gplan.context. In the case of actual prepared statements (rather than oneshot plans) CachedPlanSource.query_context IIRC should live under CachedPlanSource.context. I think there are no relevant cases where gplan.context isn't a child of CachedPlanSource.context either, but I'm not quite sure. Then we ought to just include child contexts in the memory computation (cf. the logic in MemoryContextStatsInternal(), although you obviously wouldn't need all of that). That way, if the cached statement has child contexts, we're going to stay accurate. > Also I wonder why the "prepare test as" is part of the statement > column. It isn't even part of the real statement that is prepared as > far as I would assume. Would prefer to just have the "select *..." in > that column. It's the statement that was executed. Note that you'll not see that in the case of protocol-level prepared statements. It will sometimes include relevant information, e.g. about the types specified as part of the prepare (as in PREPARE foo(int, float, ...) AS ...). Greetings, Andres Freund
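Andres's suggestion to include child contexts in the memory computation amounts to a recursive walk over the context tree. The sketch below is a toy model of that recursion; the struct layout and names here are invented for illustration and are not PostgreSQL's actual MemoryContext API, which tracks allocations per block rather than as a single total.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a memory context tree, in the spirit of the recursion in
 * MemoryContextStatsInternal(). Names are illustrative only.
 */
typedef struct ToyContext
{
    size_t              self_allocated; /* bytes held by this context alone */
    struct ToyContext  *firstchild;     /* head of child list */
    struct ToyContext  *nextchild;      /* next sibling */
} ToyContext;

/* Total usage of a context plus all of its descendants. */
static size_t
toy_context_total(const ToyContext *cxt)
{
    size_t      total = cxt->self_allocated;
    const ToyContext *child;

    for (child = cxt->firstchild; child != NULL; child = child->nextchild)
        total += toy_context_total(child);

    return total;
}
```

With query_context and gplan.context living under CachedPlanSource.context, reporting the recursive total of the parent context alone stays accurate as child contexts come and go.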
Re: pgbench - implement strict TPC-B benchmark
Fabien COELHO writes: > [ pgbench-strict-tpcb-2.patch ] TBH, I think we should reject this patch. Nobody cares about TPC-B anymore, and they care even less about differences between one sort-of-TPC-B test and another sort-of-TPC-B test. (As the lack of response on this thread shows.) We don't need this kind of baggage in pgbench; it's got too many "features" already. I'm also highly dubious about labeling this script "standard TPC-B", when it resolves only some of the reasons why our traditional script is not really TPC-B. That's treading on being false advertising. regards, tom lane
Re: Remove HeapTuple and Buffer dependency for predicate locking functions
On Tue, Jun 25, 2019 at 6:02 AM Andres Freund wrote: > On 2019-06-24 10:41:06 -0700, Ashwin Agrawal wrote: > > Proposing following changes to make predicate locking and checking > > functions generic and remove dependency on HeapTuple and Heap AM. We > > made these changes to help with Zedstore. I think the changes should > > help Zheap and other AMs in general. > > Indeed. +1 > > - Change PredicateLockTuple() to PredicateLockTID(). So, instead of > > passing HeapTuple to it, just pass ItemPointer and tuple insert > > transaction id if known. This was also discussed earlier in [1], > > don't think was done in that context but would be helpful in future > > if such requirement comes up as well. > > Right. +1 > > - CheckForSerializableConflictIn() take blocknum instead of > > buffer. Currently, the function anyways does nothing with the buffer > > just needs blocknum. Also, helps to decouple dependency on buffer as > > not all AMs may have one to one notion between blocknum and single > > buffer. Like for zedstore, tuple is stored across individual column > > buffers. So, wish to have way to lock not physical buffer but > > logical blocknum. > > Hm. I wonder if we somehow ought to generalize the granularity scheme > for predicate locks to not be tuple/page/relation. But even if, that's > probably a separate patch. What do you have in mind? This is certainly the traditional way to do lock hierarchies (archeological fun: we used to have src/backend/storage/lock/multi.c, a "standard multi-level lock manager as per the Gray paper", before commits 3f7fbf85 and e6e9e18e decommissioned it; SSI brought the concept back again in a parallel lock manager), but admittedly it has assumptions about physical storage baked into the naming. Perhaps you just want to give those things different labels, "TID range" instead of page, for the benefit of "logical" TID users? Perhaps you want to permit more levels? 
That seems premature as long as TIDs are defined in terms of blocks and offsets, so this stuff is reflected all over the source tree... > > - CheckForSerializableConflictOut() no longer takes a HeapTuple or a > buffer; instead it just takes an xid. Push heap-specific parts from > CheckForSerializableConflictOut() into its own function > HeapCheckForSerializableConflictOut() which calls > CheckForSerializableConflictOut(). The alternative option could be to have > CheckForSerializableConflictOut() take a callback function and > callback arguments, which gets called if required after performing > prechecks. Though currently I feel an AM having its own wrapper to > perform AM-specific tasks and then calling > CheckForSerializableConflictOut() is fine. > > I think it's right to move the xid handling out of > CheckForSerializableConflictOut(). But I think we also ought to move the > subtransaction handling out of the function - e.g. zheap doesn't > want/need that. Thoughts on this, Ashwin? > > Attaching patch which makes these changes. > > Please make sure that there's a CF entry for this (I'm in a plane with a > super slow connection, otherwise I'd check). This all makes sense, and I'd like to be able to commit this soon. We came up with something nearly identical for zheap. I'm subscribing Kuntal Ghosh to see if he has comments, as he worked on that. The main point of difference seems to be the assumption about how subtransactions work. -- Thomas Munro https://enterprisedb.com
Re: [Patch] Adding CORRESPONDING/CORRESPONDING BY to set operation
On Tue, Jul 30, 2019 at 02:43:05PM -0700, 毛瑞嘉 wrote: > Hi, > > > I wrote a patch for adding CORRESPONDING/CORRESPONDING BY to set operations. > It is a task on the todo list. This is how the patch works: > > I modified transformSetOperationStmt() to get an intersection target list > which is the intersection of the target lists of the left clause and right > clause for a set operation statement (sostmt). The intersection target list > is calculated in transformSetOperationTree() and then I modified the target > lists of the larg and rarg of sostmt to make them equal to the intersection > target list. I also changed the target list in pstate->p_rtable in > order to make it consistent with the intersection target list. > > > I attached the scratch version of this patch to the email. I am not sure > whether the method used in the patch is acceptable or not, but any > suggestions are appreciated. I will add tests and other related things to > the patch if the method used in this patch is acceptable. Thanks for sending this! It needs documentation and tests so people can see whether it does what it's supposed to do. Would you be so kind as to include those in the next revision of the patch? You can just attach the patch(es) without zipping them. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Patch to document base64 encoding
My 0.02 € It seems entirely crazy that encode() and decode() are no longer in the same section, likewise that convert_from() and convert_to() aren't documented together anymore. Awkward, yes. But findable if you know what the categories are. I suppose there could be 3 different categories: those that input and output strings, those that input and output binary, and those that convert -- inputting one data type and outputting another. Personally, I'd be OK with having a separate "conversion function" table, and also with Tom's suggestion to keep only simple string functions in the string section, moving anything that involves binary into the binary section. -- Fabien.
[Patch] Adding CORRESPONDING/CORRESPONDING BY to set operation
Hi, I wrote a patch for adding CORRESPONDING/CORRESPONDING BY to set operations. It is a task on the todo list. This is how the patch works: I modified transformSetOperationStmt() to get an intersection target list which is the intersection of the target lists of the left clause and right clause for a set operation statement (sostmt). The intersection target list is calculated in transformSetOperationTree() and then I modified the target lists of the larg and rarg of sostmt to make them equal to the intersection target list. I also changed the target list in pstate->p_rtable in order to make it consistent with the intersection target list. I attached the scratch version of this patch to the email. I am not sure whether the method used in the patch is acceptable or not, but any suggestions are appreciated. I will add tests and other related things to the patch if the method used in this patch is acceptable. Best, Ruijia
Re: idea: log_statement_sample_rate - bottom limit for sampling
On Tue, Jul 30, 2019 at 03:43:58PM -0400, Tom Lane wrote: Tomas Vondra writes: I've started reviewing this patch, thinking that maybe I could get it committed as it's marked as RFC. In general I agree with having this feature, but I think we need to rethink the GUC because the current approach is just confusing. ... What I think we should do instead is to use two minimum thresholds. 1) log_min_duration_sample - enables sampling of commands, using the existing GUC log_statement_sample_rate 2) log_min_duration_statement - logs all commands exceeding this I think this is going to be much easier for users to understand. I agree with Tomas' idea. The one difference between those approaches is in how they work with current settings. That is, let's say you have log_min_duration_statement = 1000 log_statement_sample_rate = 0.01 then no queries below 1000ms will be logged, and 1% of longer queries will be sampled. And with the original config (as proposed in v3 of the patch), this would still work the same way. With the new approach (two min thresholds) it'd behave differently, because we'd log *all* queries longer than 1000ms (not just 1%). And whether we'd sample any queries (using log_statement_sample_rate) would depend on how we'd pick the default value for the other threshold. Well, we do not need to have a backwards-compatibility problem here, because we have yet to release a version containing log_statement_sample_rate. I do not think it's too late to decide that v12's semantics for that are broken, and either revert that patch in v12, or back-patch a fix to make it match this idea. I'm willing to try fixing this to salvage the feature for v12. The question is what that fix would look like - IMO we'd need to introduce the new threshold GUC, essentially implementing what this thread is about. It's not a complex patch, but it kinda flies in the face of feature freeze. OTOH if we call it a fix ... 
The patch itself is not that complicated - attached is a WIP version, (particularly) the docs need more work. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c91e3e1550..8cb695044b 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5850,8 +5850,7 @@ local0.*/var/log/postgresql Causes the duration of each completed statement to be logged if the statement ran for at least the specified number of - milliseconds, modulated by log_statement_sample_rate. - Setting this to zero prints all statement durations. + milliseconds. Setting this to zero prints all statement durations. -1 (the default) disables logging statements due to exceeding duration threshold; for example, if you set it to 250ms, then all SQL statements that run 250ms or @@ -5882,6 +5881,48 @@ local0.*/var/log/postgresql + + log_min_duration_sample (integer) + + log_min_duration_sample configuration parameter + + + + + Causes the duration of each completed statement to be logged + if the statement ran for at least the specified number of + milliseconds, modulated by log_statement_sample_rate. + Setting this to zero samples all statement durations. + -1 (the default) disables sampling of statements; + for example, if you set it to 250ms, then all + SQL statements that run 250ms or longer will be sampled for + logging. Statements with duration exceeding + log_min_duration_statement are not subject to + sampling and are logged every time. + + + + For clients using extended query protocol, durations of the Parse, + Bind, and Execute steps are logged independently. + + + + + When using this option together with + , + the text of statements that are logged because of + log_statement will not be repeated in the + duration log message. 
+ If you are not using syslog, it is recommended + that you log the PID or session ID using + + so that you can link the statement message to the later + duration message using the process ID or session ID. + + + + + log_statement_sample_rate (real) diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index a6505c7335..ae1def0f37 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -2213,7 +2213,8 @@ check_log_statement(List *stmt_list) /* * check_log_duration * Determine whether current command's duration should be logged. - * If log_statement_sample_rate < 1.0, log only a sample. + * If log_statement_sample_rate < 1.0,
Re: heapam_index_build_range_scan's anyvisible
On Tue, Jul 16, 2019 at 10:22 AM Andres Freund wrote: > Hi, > > On 2019-07-11 17:27:46 -0700, Ashwin Agrawal wrote: > > Please find attached the patch to remove IndexBuildCallback's dependency > on > > HeapTuple, as discussed. Changed to have the argument as ItemPointer > > instead of HeapTuple. Other larger refactoring if feasible for > > index_build_range_scan API itself can be performed as follow-up changes. > > > From f73b0061795f0c320f96ecfed0c0602ae318d73e Mon Sep 17 00:00:00 2001 > > From: Ashwin Agrawal > > Date: Thu, 11 Jul 2019 16:58:50 -0700 > > Subject: [PATCH v1] Remove IndexBuildCallback's dependency on HeapTuple. > > > > With IndexBuildCallback taking input as HeapTuple, all table AMs > > irrespective of storing the tuples in HeapTuple form or not, are > > forced to construct HeapTuple, to insert the tuple in Index. Since, > > only thing required by the index callbacks is TID and not really the > > full tuple, modify callback to only take ItemPointer. > > Looks good to me. Planning to apply this unless somebody wants to argue > against it soon? > Andres, I didn't yet register this for the next commitfest. If it's going in soon anyway I won't do it; otherwise let me know and I will add it to the list.
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 10:14:14AM -0400, Sehrope Sarkuni wrote: > > In general it's fine to use the same IV with different keys. Only reuse > of Key > > + IV is a problem and the entire set of possible counter values (IV + 0, > IV + > > 1, ...) generated with a key must be unique. That's also why we must > leave at > > least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be > filled > > in with 0, 1, 2, ... for each 16-byte AES-block on the page. If our > per-page IV > > prefix used any of those bits then the counter could overflow into the > next > > page's IV's range. > > Agreed. > > Attached is an updated patch that checks only main relation forks, which > I think are the only files we are going to encrypt, and it has better > comments. > > > Okay that makes sense in the context of using a single key and relying on the > LSN based IV to be unique. I had more time to think about the complexity of adding relfilenode to the IV. Since relfilenode is only unique within a database/tablespace, we would need to have pg_upgrade preserve database/tablespace oids (which I assume are the same as the directory and tablespace symlinks). Then, to decode a page, you would need to look up those values. This is in addition to the new complexity of CREATE DATABASE and moving files between tablespaces. I am also concerned that crash recovery operations and cluster forensics and repair would also need to deal with this. I am not even clear if pg_upgrade preserving relfilenode is possible --- when we wrap the relfilenode counter, does it start at 1 or at the first-user-relation-oid? If the former, it could conflict with oids assigned to new system tables in later major releases. Tying the preservation of relations to two restrictions seems risky. Using just the page LSN and page number allows a page to be decrypted/encrypted independently of its file name, tablespace, and database, and I think that is a win for simplicity. 
Of course, if it is insecure we will not do it. I am thinking for the heap/index IV, it would be: uint64 lsn; uint32 page_number; /* only uses 11 bits for a zero-based CTR counter for 32k pages */ uint32 counter; and for WAL it would be: uint64 segment_number; uint32 counter; /* guarantees this IV doesn't match any relation IV */ uint32 2^32-1 /* all 1's */ Anyway, these are my thoughts so far. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
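To make the proposed heap/index IV layout concrete, here is a minimal sketch of packing the LSN, block number, and per-block CTR counter into a 16-byte AES-CTR IV. The function name and the big-endian byte order are my assumptions, not anything from a patch; the only essential property is that incrementing the counter touches just the low-order bytes, so two IVs for the same page differ only in the counter field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative packing of the proposed heap/index IV:
 *   bytes 0-7   LSN (big-endian)
 *   bytes 8-11  block number (big-endian)
 *   bytes 12-15 CTR counter (big-endian; a 32k page needs only
 *               32768/16 = 2048 counter values, i.e. 11 bits)
 */
static void
build_heap_iv(uint8_t iv[16], uint64_t lsn, uint32_t blkno, uint32_t counter)
{
    int         i;

    for (i = 0; i < 8; i++)
        iv[i] = (uint8_t) (lsn >> (56 - 8 * i));
    for (i = 0; i < 4; i++)
        iv[8 + i] = (uint8_t) (blkno >> (24 - 8 * i));
    for (i = 0; i < 4; i++)
        iv[12 + i] = (uint8_t) (counter >> (24 - 8 * i));
}
```

Because the counter occupies its own trailing bytes, the per-page IV prefix (LSN + block number) can never overflow into the next page's counter range, which is the requirement Sehrope describes above.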
Re: extension patch of CREATE OR REPLACE TRIGGER
"osumi.takami...@fujitsu.com" writes: > [ CREATE_OR_REPLACE_TRIGGER_v03.patch ] I took a quick look through this just to see what was going on. A few comments: * Upthread you asked about changing the lock level to be AccessExclusiveLock if the trigger already exists, but the patch doesn't actually do that. Which is fine by me, because that sounds like a perfectly bad idea. In the first place, nobody is going to expect that OR REPLACE changes the lock level, and in the second place, you can't actually tell whether the trigger exists until you already have some lock on the table. I do not put any credit in the argument that it's more important to lock out pg_dump against a concurrent REPLACE TRIGGER than it is to lock out a concurrent CREATE TRIGGER, anyway. So I think keeping it at ShareRowExclusiveLock is fine. * I wouldn't recommend adding CreateConstraintEntry's new argument at the end. IME, "add at the end" is almost always bad coding style; the right thing is "add where it belongs, that is where you'd have put it if you were writing the list from scratch". To the admittedly imperfect extent that the order of CreateConstraintEntry's arguments matches the column order of pg_constraint, there's a good argument that the OID should be *first*. (Maybe, as long as we've gotta touch all the callers anyway, we should fix the other random deviations from the catalog's column order, too.) * While you're at it, it wouldn't hurt to fix CreateConstraintEntry's header comment, maybe like - * The new constraint's OID is returned. + * The new constraint's OID is returned. (This will be the same as + * "conOid" if that is specified as nonzero.) * The new code added to CreateTrigger could stand a rethink, too. For starters, this comment does not describe the code stanza just below it, but something considerably further down: /* +* Generate the trigger's OID now, so that we can use it in the name if +* needed. 
+*/ It's also quite confusing because if there is a pre-existing trigger, we *don't* generate any new OID. I'd make that say "See if there is a pre-existing trigger of the same name", and then comment the later OID generation appropriately. Also, the code below the pg_trigger search seems pretty confused and redundant: + if (!trigger_exists) + // do something + if (stmt->replace && trigger_exists) + { + if (stmt->isconstraint && !OidIsValid(existing_constraint_oid)) + // do something + else if (!stmt->isconstraint && OidIsValid(existing_constraint_oid)) + // do something + } + else if (trigger_exists && !isInternal) + { + // do something + } I'm not on board with the idea that testing trigger_exists three separate times, in three randomly-different-looking ways, makes things more readable. I'm also not excited about spending the time to scan pg_trigger at all in the isInternal case, where you're going to ignore the result. So I think this could use some refactoring. Also, in the proposed tests: +\h CREATE TRIGGER; We do not test \h output in any existing regression test, and we're not going to start doing so in this one. For one thing, the expected URL would break every time we forked off a new release branch. (There would surely be value in having more-than-no test coverage of psql/help.c, but that's a matter for its own patch, which would need some thought about how to cope with instability of the output.) regards, tom lane
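One possible shape for the refactoring Tom suggests above -- test trigger_exists once and branch from there -- is sketched below. The enum and function are invented for illustration; the real CreateTrigger() reports errors via ereport() rather than returning codes, and would also skip the pg_trigger scan entirely in the isInternal case.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative outcomes; the real code raises errors via ereport(). */
typedef enum TriggerAction
{
    DO_CREATE,          /* no pre-existing trigger: create a new one */
    DO_REPLACE,         /* compatible pre-existing trigger: replace it */
    ERR_DUPLICATE,      /* trigger exists and OR REPLACE was not given */
    ERR_KIND_MISMATCH   /* constraint trigger vs. plain trigger conflict */
} TriggerAction;

/* Decide what to do, testing trigger_exists exactly once. */
static TriggerAction
decide_trigger_action(bool trigger_exists, bool replace,
                      bool isconstraint, bool existing_is_constraint)
{
    if (!trigger_exists)
        return DO_CREATE;
    if (!replace)
        return ERR_DUPLICATE;
    if (isconstraint != existing_is_constraint)
        return ERR_KIND_MISMATCH;
    return DO_REPLACE;
}
```

The point is not the specific names but that all of the exists/replace/kind checks fall out of one branch structure instead of three separately-guarded if statements.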
Re: idea: log_statement_sample_rate - bottom limit for sampling
Tomas Vondra writes: > I've started reviewing this patch, thinking that maybe I could get it > committed as it's marked as RFC. In general I agree with having this > feature, but I think we need to rethink the GUC because the current > approach is just confusing. > ... > What I think we should do instead is to use two minimum thresholds. > 1) log_min_duration_sample - enables sampling of commands, using the > existing GUC log_statement_sample_rate > 2) log_min_duration_statement - logs all commands exceeding this > I think this is going to be much easier for users to understand. I agree with Tomas' idea. > The one difference between those approaches is in how they work with > current settings. That is, let's say you have > log_min_duration_statement = 1000 > log_statement_sample_rate = 0.01 > then no queries below 1000ms will be logged, and 1% of longer queries > will be sampled. And with the original config (as proposed in v3 of the > patch), this would still work the same way. > With the new approach (two min thresholds) it'd behave differently, > because we'd log *all* queries longer than 1000ms (not just 1%). And > whether we'd sample any queries (using log_statement_sample_rate) would > depend on how we'd pick the default value for the other threshold. Well, we do not need to have a backwards-compatibility problem here, because we have yet to release a version containing log_statement_sample_rate. I do not think it's too late to decide that v12's semantics for that are broken, and either revert that patch in v12, or back-patch a fix to make it match this idea. regards, tom lane
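The two-threshold behavior being discussed can be sketched as a single decision function. The function and parameter names here are invented for illustration; -1 means "disabled", following the GUC convention, and rand01 stands in for the random sampling draw.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the proposed logic: everything at or above
 * log_min_duration_statement is always logged; below that, statements
 * at or above log_min_duration_sample are logged with probability
 * sample_rate. A threshold of -1 disables that threshold.
 */
static bool
should_log_duration(int duration_ms,
                    int log_min_duration_statement,
                    int log_min_duration_sample,
                    double sample_rate,
                    double rand01)
{
    /* Hard threshold: always log, never sampled. */
    if (log_min_duration_statement >= 0 &&
        duration_ms >= log_min_duration_statement)
        return true;

    /* Sampling threshold: log only the sampled fraction. */
    if (log_min_duration_sample >= 0 &&
        duration_ms >= log_min_duration_sample)
        return rand01 < sample_rate;

    return false;
}
```

With log_min_duration_statement = 1000, log_min_duration_sample = 100 and sample_rate = 0.01, this logs all queries over 1000ms and 1% of those between 100ms and 1000ms, which is exactly the semantics Tomas describes.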
Re: Initdb failure
Robert Haas writes: > Agreed, but I think we should just do nothing. To actually fix this > in general, we'd have to get rid of every instance of MAXPGPATH in the > source tree: > [rhaas pgsql]$ git grep MAXPGPATH | wc -l > 611 I don't think it'd really be necessary to go that far. One of the reasons we chdir to the data directory at postmaster start is so that (pretty nearly) all filenames that backends deal with are relative pathnames of very predictable, short lengths. So a lot of those MAXPGPATH uses are probably perfectly safe, indeed likely overkill. However, identifying which ones are not safe would still take looking at every use case, so I agree there'd be a lot of work here. Would there be any value in teaching initdb to do something similar, ie chdir to the parent of the target datadir location? Then the set of places in initdb that have to deal with long pathnames would be pretty well constrained. On the whole though, I don't have a problem with the "do nothing" answer. There's no security risk here, and no issue that seems likely to arise in actual use cases rather than try-to-break-it test scripts. regards, tom lane
Re: Allow table AM's to cache stuff in relcache
On 12/07/2019 16:07, Julien Rouhaud wrote: On Fri, Jun 14, 2019 at 5:40 PM Tom Lane wrote: Heikki Linnakangas writes: In the patch, I documented that rd_amcache must be allocated in CacheMemoryContext, or in rd_indexcxt if it's an index. It works, but it's a bit weird. Given the way the patch is implemented, it doesn't really matter which context it's in, does it? The retail pfree is inessential but also harmless, if rd_amcache is in rd_indexcxt. So we could take out the "must". I think it's slightly preferable to use rd_indexcxt if available, to reduce the amount of loose junk in CacheMemoryContext. I agree that for indexes the context used won't make much difference. But IMHO avoiding some bloat in CacheMemoryContext is a good enough reason to document using rd_indexcxt when available. Right, it doesn't really matter whether an index AM uses CacheMemoryContext or rd_indexcxt, the code works either way. I think it's better to give clear advice though, one way or another. Otherwise, different index AM's can end up doing it differently for no particular reason, which seems confusing. It would be nice to have one memory context in every relcache entry, to hold all the stuff related to it, including rd_amcache. In other words, it would be nice if we had "rd_indexcxt" for tables, too, not just indexes. That would allow tracking memory usage more accurately, if you're debugging an out-of-memory situation for example. We had some discussion related to that in the "hyrax vs. RelationBuildPartitionDesc" thread. I'm not quite sure where we'll settle on that, but some redesign seems inevitable. There hasn't been any progress on this since last month, and this patch won't make the situation any worse. I'll mark this patch as ready for committer, as it may save some time for people working on a custom table AM. Pushed, thanks for the review! As Tom noted, some redesign here seems inevitable, but this patch shouldn't get in the way of that, so no need to hold this back for the redesign. 
- Heikki
Redacting information from logs
Logs are important to diagnose problems or monitor operations, but logs can contain sensitive information which is often unnecessary for these purposes. Redacting the sensitive information would enable easier access and simpler integration with analysis tools without compromising the sensitive information. The challenge is that nobody wants to classify all of the log messages; and even if someone did that today, there would be never-ending work in the future to try to maintain that classification. My proposal is: * redact every '%s' in an ereport by having a special mode for snprintf.c (this is possible because we now own snprintf) * generate both redacted and unredacted messages (if redaction is enabled) * choose which destinations (stderr, eventlog, syslog, csvlog) get redacted or plain messages * emit_log_hook always has both redacted and plain messages available * allow specifying a custom redaction function, e.g. a function that hashes the string rather than completely redacting it I think '%s' in a log message is a pretty close match to the kind of information that might be sensitive. All data goes through type output functions (e.g. the conflicting datum for a unique constraint violation message), and most other things that a user might type would go through %s. A lot of other information useful in logs, like LSNs, %m's, PIDs, etc. would be preserved. All object names would be redacted, but that's not as bad as it sounds: (a) You can specify a custom redaction function that hashes rather than completely redacts. That allows you to see if different messages refer to the same object, and also map back to suspected objects if you really need to. (b) The unredacted object names are still a part of ErrorData, so you can do something interesting with emit_log_hook. (c) You still might have the unredacted logs in a more protected place, and can access them when you really need to. 
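Point (a) above -- a custom redaction function that hashes a value rather than dropping it -- could look something like the following sketch. The function name and the choice of FNV-1a are mine for illustration, not part of the patch; any stable hash would do, since the goal is only to let different messages about the same object be correlated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative hashing redaction function: map a sensitive '%s' value
 * to a stable 8-hex-digit tag. Identical inputs always produce the
 * same tag, so redacted messages remain correlatable.
 */
static void
redact_hash(const char *value, char out[11])
{
    uint32_t    h = 2166136261u;    /* FNV-1a 32-bit offset basis */
    const unsigned char *p;

    for (p = (const unsigned char *) value; *p; p++)
    {
        h ^= *p;
        h *= 16777619u;             /* FNV-1a 32-bit prime */
    }
    snprintf(out, 11, "[%08x]", h);
}
```

A log line like 'duplicate key value violates unique constraint "users_pkey"' would then come out as 'duplicate key value violates unique constraint [3f2a9c01]' (tag value illustrative), keeping the message's shape while hiding the object name.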
A weakness of this proposal is that it could be confusing to use ereport() in combination with snprintf(). If using snprintf to build the format string, nothing would be redacted, so you'd have to be careful not to expand any %s that might be sensitive. If using snprintf to build up an argument, the entire argument would be redacted. The first case should not be common, because good coding generally avoids non-constant format strings. The second case is just over-redaction, which is not necessarily bad. One annoying case would be if some of the arguments to ereport() are used for things like the right number of commas or tabs -- redacting those would just make the message look horrible. I didn't find such cases but I'm pretty sure they exist. Another annoying case is time, which is useful for debugging, but formatted with %s so it gets redacted (I did find plenty of these cases). But I don't see a better solution. Right now, it's a pain to treat log files as sensitive things when there are so many ways they can help with smooth operations and so many tools available to analyze them. This proposal seems like a practical solution to enable better use of log files while protecting potentially-sensitive information. Attached is a WIP patch. 
Regards, Jeff Davis From b9e51843f2cf3635e65ee8807ed01ff1cb197fab Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Mon, 29 Jul 2019 16:28:35 -0700 Subject: [PATCH] WIP redaction --- src/backend/utils/error/elog.c | 183 - src/backend/utils/misc/guc.c | 28 + src/include/utils/elog.h | 8 ++ src/port/snprintf.c| 16 +++ 4 files changed, 207 insertions(+), 28 deletions(-) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 8b4720ef3a..0b9a3d22d1 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -90,6 +90,7 @@ ErrorContextCallback *error_context_stack = NULL; sigjmp_buf *PG_exception_stack = NULL; extern bool redirection_done; +extern bool snprintf_do_redact; /* * Hook for intercepting messages before they are sent to the server log. @@ -105,6 +106,9 @@ int Log_error_verbosity = PGERROR_VERBOSE; char *Log_line_prefix = NULL; /* format for extra log line info */ int Log_destination = LOG_DESTINATION_STDERR; char *Log_destination_string = NULL; +bool redact_messages = false; +int redact_destination = LOG_DESTINATION_STDERR; +char *redact_destination_string = NULL; bool syslog_sequence_numbers = true; bool syslog_split_messages = true; @@ -480,14 +484,24 @@ errfinish(int dummy,...) /* Now free up subsidiary data attached to stack entry, and release it */ if (edata->message) pfree(edata->message); + if (edata->message_r) + pfree(edata->message_r); if (edata->detail) pfree(edata->detail); + if (edata->detail_r) + pfree(edata->detail_r); if (edata->detail_log) pfree(edata->detail_log); + if (edata->detail_log_r) + pfree(edata->detail_log_r); if (edata->hint) pfree(edata->hint); + if
Re: Initdb failure
On Sat, Jul 27, 2019 at 2:22 AM Peter Eisentraut wrote: > I think if you want to make this more robust, get rid of the fixed-size > array, use dynamic allocation with PQExpBuffer, and let the operating > system complain if it doesn't like the directory name length. Agreed, but I think we should just do nothing. To actually fix this in general, we'd have to get rid of every instance of MAXPGPATH in the source tree: [rhaas pgsql]$ git grep MAXPGPATH | wc -l 611 If somebody feels motivated to spend that amount of effort improving this, I will stand back and cheer from the sidelines, but that's gonna be a LOT of work for a problem that, as Tom says, is probably not really affecting very many people. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Avoiding hash join batch explosions with extreme skew and weird stats
So, I've rewritten the patch to use a BufFile for the match statuses of the outer table batch file's tuples: the bytes written to the file start as 0 and, upon encountering a match for a tuple, I set its bit in the file to 1 (also rebased with current master). It, of course, only works for parallel-oblivious hashjoin -- it relies on the deterministic order of tuples encountered in the outer side batch file to set the right match bit and uses a counter to decide which bit to set. I did the "needlessly dumb implementation" Robert mentioned; though I thought about it, I couldn't come up with a much smarter way to write match bits to a file. I think there might be an optimization opportunity in not writing the current_byte to the file each time the outer tuple matches and only doing so once we have advanced to a tuple number that wouldn't have its match bit in the current_byte. I didn't do that to keep it simple, and I suspect a bit of gymnastics would be needed to make sure that that byte is actually written to the file in case we exit from some other state before we encounter the tuple represented by the last bit in that byte. I plan to work on a separate implementation for parallel hashjoin next -- to understand what is required. I believe the logic to decide when to fall back should be fairly easy to slot in at the end once we've decided what that logic is. On Sat, Jul 13, 2019 at 4:44 PM Tomas Vondra wrote: > On Wed, Jul 03, 2019 at 02:22:09PM -0700, Melanie Plageman wrote: > >On Tue, Jun 18, 2019 at 3:24 PM Melanie Plageman < > melanieplage...@gmail.com> > > > >Before doing that, I wanted to ask what a desirable fallback condition > >would be. In this patch, fallback to hashloop join happens only when > >inserting tuples into the hashtable after batch 0 when inserting > >another tuple from the batch file would exceed work_mem. This means > >you can't increase nbatches, which, I would think is undesirable. > > > > Yes, I think that's undesirable. 
> > >I thought a bit about when fallback should happen. So, let's say that > >we would like to fallback to hashloop join when we have increased > >nbatches X times. At that point, since we do not want to fall back to > >hashloop join for all batches, we have to make a decision. After > >increasing nbatches the Xth time, do we then fall back for all batches > >for which inserting inner tuples exceeds work_mem? Do we use this > >strategy but work_mem + some fudge factor? > > > >Or, do we instead try to determine if data skew led us to increase > >nbatches both times and then determine which batch, given new > >nbatches, contains that data, set fallback to true only for that > >batch, and let all other batches use the existing logic (with no > >fallback option) unless they contain a value which leads to increasing > >nbatches X number of times? > > > > I think we should try to detect the skew and use this hashloop logic > only for the one batch. That's based on the assumption that the hashloop > is less efficient than the regular hashjoin. > > We may need to apply it even for some non-skewed (but misestimated) > cases, though. At some point we'd need more than work_mem for BufFiles, > at which point we ought to use this hashloop. > > I have not yet changed the logic for deciding to fall back from my original design. It will still only fall back for a given batch if that batch's inner batch file doesn't fit in memory. I haven't, however, changed the logic to allow it to increase the number of batches some number of times or according to some criteria before falling back for that batch. -- Melanie Plageman v3-0001-hashloop-fallback.patch Description: Binary data
Re: tap tests driving the database via psql
On 2019-Jul-30, Tom Lane wrote: > OK, so just lifting DBD::Pg in toto is out for license reasons. > However, maybe we could consider writing a new DBD driver from > scratch (while using a platform-provided DBI layer) rather than > doing everything from scratch. I'm not sure how much actual > functionality is in the DBI layer, so maybe that approach > wouldn't buy much. Then again, maybe we don't *need* all the functionality that DBI offers. DBI is enormous, has a lot of extensibility, cross-database compatibility ... and, well, just the fact that it's a layered design (requiring a DBD on top of it before it even works) makes it even more complicated. I think a pure-perl standalone driver might be a lot simpler than maintaining our own DBD ... and we don't have to convince animal maintainers to install the right version of DBI in the first place. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SegFault on 9.6.14
On Thu, Jul 18, 2019 at 9:45 AM Tom Lane wrote: > I think this is going in the wrong direction. Nodes should *always* > assume that a rescan is possible until ExecEndNode is called. > If you want to do otherwise, you are going to be inventing a whole > bunch of complicated and doubtless-initially-buggy control logic > to pass down information about whether a rescan might be possible. > That doesn't sound like a recipe for a back-patchable fix. Perhaps > we could consider redesigning the rules around REWIND in a future > version, but that's not where to focus the bug fix effort. So, if I can summarize how we got here, as best I understand it: 0. The historic behavior of the executor is to assume it's OK to leak resources for the lifetime of the query. Nodes that are executed to completion generally do some cleanup, but we feel free (as under Limit) to just stop executing a node without giving it any hint that it should release resources. So a Sort may hold onto a terabyte of memory and an index scan may keep holding a pin even after there's no theoretical way of ever needing those resources again, and we just don't care. 1. Parallel query made that perhaps-already-shaky assumption a lot more problematic. Partly that's because workers are a more scarce and considerably heavier resource than anything else, and moreover act as a container for anything else, so whatever you were leaking before, you can now leak N times more of it, plus N processes, until the end of the query. However, there's a correctness reason too, which is that when a node has a copy in the leader and a copy in every worker, each copy has its own instrumentation data (startup time, run time, nloops, etc) and we can only fold all that together once the node is done executing, because it's really hard to add up a bunch of numbers before the numbers are done changing. 
We could've made the instrumentation shared throughout, but if we had, we could have contention for updating the instrumentation data, which seems like it'd be bad. 2. To fix that correctness problem, we decided to try to shut down the node under a limit node when we're done with it (commit 85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this looks fundamentally necessary to me. If you're going to have N separate copies of the instrumentation, and you want to add them up when you're done, then you have to decide to be done at some point; otherwise you don't know when to add them up, and maybe won't add them up at all, and then you'll be sad. This does not mean that the exact timing couldn't be changed somehow, but if you want a correct implementation, you have to shut down Limit's sub-node after you're done executing it (so that you can get the instrumentation data from the workers after it's final) and before you start destroying DSM segments and stuff (so that you get the instrumentation data from the workers before it vanishes). 3. The aforementioned commit turned out to be buggy in at least two ways, precisely because it didn't do a good enough job predicting when the Limit needed to be shut down. First, there was commit 2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact that you could hit the Limit and then back up. Second, there's the present issue, where the Limit gets rescanned. So, given all that, if we want to adopt Tom's position that we should always cater to a possible rescan, then we're going to have to rethink the way that instrumentation data gets consolidated from workers into the leader in such a way that we can consolidate multiple times without ending up with the wrong answer. The other option is to do what I understand Amit and Thomas to be proposing, which is to do a better job identifying the case where we're "done for good" and can trigger the shutdown fearlessly. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Patch to document base64 encoding
On Tue, 30 Jul 2019 11:40:03 -0400 Tom Lane wrote: > "Karl O. Pinc" writes: > > On Mon, 15 Jul 2019 23:00:55 +0200 (CEST) > > Fabien COELHO wrote: > >> The patch clarifies the documentation about encode/decode and > >> other text/binary string conversion functions. > > > Other notable changes: > > Corrects categorization of functions as string or binary. > > Reorders functions alphabetically by function name. > > So I took a look at this, expecting that after so much discussion it > ought to just be committable ... It started simple, just changing the base64 function descriptions, but critique drew in additional issues. > but I am befuddled by your choices > about which functions to move where. The grouping is by the data type on which each function operates, the data type of the input. If there's to be 2 categories, as in the existing docs, it seems to me that you have to categorize either by the data type input or data type output. To categorize by input and output together would result in 4 (or more?) categories, which would be even crazier. > It seems entirely crazy that > encode() and decode() are no longer in the same section, likewise that > convert_from() and convert_to() aren't documented together anymore. Awkward, yes. But findable if you know what the categories are. I suppose there could be 3 different categories: those that input and output strings, those that input and output binary, and those that convert -- inputting one data type and outputting another. I'm not sure that this would really address the issue of documenting, say, encode() and decode() together. It pretty much makes sense to alphabetize the functions _within_ each category, because that's about the only easily defined way to do it. Going "by feel" and putting encode() and decode() together raises the question of where they should be together in the overall ordering within the category. 
> I'm not sure what is the right dividing line between string and binary > functions, but I don't think that anyone is going to find this > division helpful. Maybe there's a way to make clearer what the categories are? I could be explicit in the description of the section. > I do agree that documenting some functions twice is a bad plan, > so we need to clean this up somehow. > > After some thought, it seems like maybe a workable approach would be > to consider that all conversion functions going between text and > bytea belong in the binary-string-functions section. I think it's > reasonable to say that plain "string functions" just means stuff > dealing with text. Ok. Should the section title remain unchanged? "Binary String Functions and Operators" I think the summary description of the section will need a little clarification. > Possibly we could make a separate table in the binary-functions > section just for conversions, although that feels like it might be > overkill. I have no good answers. An advantage to a separate section for conversions is that you _might_ be able to pair the functions, so that encode() and decode() do show up right next to each other. I'm not sure exactly how to structure "pairing". I would have to play around and see what might look good. > While we're on the subject, Table 9.11 (conversion names) seems > entirely misplaced, and I don't just mean that it would need to > migrate to the binary-functions page. I don't think it belongs > in func.sgml at all. Isn't it pretty duplicative of Table 23.2 > (Client/Server Character Set Conversions)? I think we should > unify it with that table, or at least put it next to that one. > Perhaps that's material for a separate patch though. I don't know. But it does seem like something that can be addressed in isolation and suitable for its own patch. Thanks for the help. I will wait for a response to this before submitting another patch, just in case someone sees any ideas here to be followed up on. 
Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
typesafe printf hackery
Hi, I've occasionally wished for a typesafe version of pg_printf() and other varargs functions. The compiler warnings are nice, but also far from complete. Here's a somewhat crazy hack/prototype for how printf could get actual argument types. I'm far from certain it's worth pursuing this further... Nor the contrary. Note that this requires removing the parentheses from VA_ARGS_NARGS_'s result (i.e. (N) -> N). To me those parens don't make much sense, we're pretty much guaranteed to only ever have a number there. With the following example e.g.

    myprintf("boring fmt", 1, 0.1, (char)'c', (void*)0, "crazy stuff");
    myprintf("empty argument fmt");

yield

    format string "boring fmt", 5 args
            arg number 0 is of type int: 1
            arg number 1 is of type double: 0.100000
            arg number 2 is of type bool: true
            arg number 3 is of type void*: (nil)
            arg number 4 is of type char*: crazy stuff
    format string "empty argument fmt", 0 args

which'd obviously allow for error checking inside myprintf.

#include "c.h"

// hack pg version out of the way
#undef printf

// FIXME: This doesn't correctly work with zero arguments
#define VA_ARGS_EACH(wrap, ...) \
	VA_ARGS_EACH_EXPAND(VA_ARGS_NARGS(__VA_ARGS__)) (wrap, __VA_ARGS__)
#define VA_ARGS_EACH_EXPAND(count) VA_ARGS_EACH_EXPAND_REALLY(VA_ARGS_EACH_INT_, count)
#define VA_ARGS_EACH_EXPAND_REALLY(prefix, count) prefix##count

#define VA_ARGS_EACH_INT_1(wrap, el1) wrap(el1)
#define VA_ARGS_EACH_INT_2(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_1(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_3(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_2(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_4(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_3(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_5(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_4(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_6(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_5(wrap, __VA_ARGS__)
#define VA_ARGS_EACH_INT_7(wrap, el1, ...) wrap(el1), VA_ARGS_EACH_INT_6(wrap, __VA_ARGS__)

typedef enum printf_arg_type
{
	PRINTF_ARG_BOOL,
	PRINTF_ARG_CHAR,
	PRINTF_ARG_INT,
	PRINTF_ARG_DOUBLE,
	PRINTF_ARG_CHARP,
	PRINTF_ARG_VOIDP,
} printf_arg_type;

typedef struct arginfo
{
	printf_arg_type tp;
} arginfo;

// hackfix empty argument case
#define myprintf(...) myprintf_wrap(__VA_ARGS__, "dummy")

#define myprintf_wrap(fmt, ...) \
	myprintf_impl(fmt, VA_ARGS_NARGS(__VA_ARGS__) - 1, ((arginfo[]){ VA_ARGS_EACH(blurttype, __VA_ARGS__) }), __VA_ARGS__)

// FIXME: Obviously not enough types
#define blurttype(x) ((arginfo){_Generic(x, char: PRINTF_ARG_BOOL, int: PRINTF_ARG_INT, double: PRINTF_ARG_DOUBLE, char *: PRINTF_ARG_CHARP, void *: PRINTF_ARG_VOIDP)})

static const char*
printf_arg_typename(printf_arg_type tp)
{
	switch (tp)
	{
		case PRINTF_ARG_BOOL: return "bool";
		case PRINTF_ARG_CHAR: return "char";
		case PRINTF_ARG_INT: return "int";
		case PRINTF_ARG_DOUBLE: return "double";
		case PRINTF_ARG_CHARP: return "char*";
		case PRINTF_ARG_VOIDP: return "void*";
	}
	return "";
}

static void
myprintf_impl(char *fmt, size_t nargs, arginfo arg[], ...)
{
	va_list args;

	va_start(args, arg);

	printf("format string \"%s\", %zu args\n", fmt, nargs);

	for (int argno = 0; argno < nargs; argno++)
	{
		printf("\targ number %d is of type %s: ", argno, printf_arg_typename(arg[argno].tp));

		switch (arg[argno].tp)
		{
			case PRINTF_ARG_BOOL:
				printf("%s", ((bool) va_arg(args, int)) ? "true" : "false");
				break;
			case PRINTF_ARG_CHAR:
				printf("%c", (char) va_arg(args, int));
				break;
			case PRINTF_ARG_INT:
				printf("%d", va_arg(args, int));
				break;
			case PRINTF_ARG_DOUBLE:
				printf("%f", va_arg(args, double));
				break;
			case PRINTF_ARG_CHARP:
				printf("%s", va_arg(args, char *));
				break;
			case PRINTF_ARG_VOIDP:
				printf("%p", va_arg(args, void *));
				break;
		}
		printf("\n");
	}
}

int
main(int argc, char **argv)
{
	myprintf("boring fmt", 1, 0.1, (char)'c', (void*)0, "crazy stuff");
	myprintf("empty argument fmt");
}
Re: ANALYZE: ERROR: tuple already updated by self
On Mon, Jul 29, 2019 at 12:18:33PM +0200, Tomas Vondra wrote:
> On Sun, Jul 28, 2019 at 09:53:20PM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2019-07-28 21:21:51 +0200, Tomas Vondra wrote:
> > > AFAICS it applies to 10+ versions, because that's where extended stats
> > > were introduced. We certainly can't mess with catalogs there, so this
> > > is about the only backpatchable fix I can think of.
> >
> > AFAIU the inh version wouldn't be used anyway, and this has never
> > worked. So I think it's imo fine to fix it that way for < master. For
> > master we should have something better, even if perhaps not
> > immediately.
>
> Agreed. I'll get the simple fix committed soon, and put a TODO on my
> list for pg13.

I've pushed the simple fix - I've actually simplified it a bit further by simply not calling BuildRelationExtStatistics() at all when inh=true, instead of passing the flag to BuildRelationExtStatistics() and making the decision there. The function is part of the public API, so this would be an ABI break (although it's unlikely anyone else is calling the function). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: TopoSort() fix
Robert Haas writes: > On Tue, Jul 30, 2019 at 1:44 PM Tom Lane wrote: >> Well, there'd be an actual isolation test that they work ;-), if you >> override the marking. Admittedly, one test case does not prove that >> there's no way to crash the system, but that can be said of most >> parts of Postgres. > True. I'm just talking about what we can foresee. Sure. But I think what we can foresee is that if there are any bugs reachable this way, they'd be reachable and need fixing regardless. We've already established that parallel workers can take and release locks that their leader isn't holding. Apparently, they won't take anything stronger than RowExclusiveLock; but even AccessShare is enough to let a process participate in all interesting behaviors of the lock manager, including blocking, being blocked, and being released early by deadlock resolution. And the advisory-lock functions are pretty darn thin wrappers around the lock manager. So I'm finding it hard to see where there's incremental risk, even if a user does intentionally bypass the parallel safety markings. And what we get in return is an easier way to add tests for this area. regards, tom lane
Re: TopoSort() fix
On Tue, Jul 30, 2019 at 1:44 PM Tom Lane wrote: > Well, there'd be an actual isolation test that they work ;-), if you > override the marking. Admittedly, one test case does not prove that > there's no way to crash the system, but that can be said of most > parts of Postgres. True. I'm just talking about what we can foresee. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
Robert Haas writes: > On Tue, Jul 30, 2019 at 1:36 PM Tom Lane wrote: >> In any case, my question at the moment is whether we need the belt-and- >> suspenders-too approach of having both non-parallel-safe marking and an >> explicit check inside these functions. We've largely moved away from >> hard-wired checks for e.g. superuserness, and surely these things are >> less dangerous than most formerly-superuser-only functions. > If we can't think of a way that the lack of these checks could crash > it, then I think it's OK to remove the hardwired checks. If we can, > I'd favor keeping them. Well, there'd be an actual isolation test that they work ;-), if you override the marking. Admittedly, one test case does not prove that there's no way to crash the system, but that can be said of most parts of Postgres. regards, tom lane
Re: TopoSort() fix
On Tue, Jul 30, 2019 at 1:36 PM Tom Lane wrote: > No, there's a sufficient reason why we should force advisory locks > to be taken in the leader process, namely that the behavior is totally > different if we don't: they will disappear at the end of the parallel > worker run, not at end of transaction or session as documented. Oh, good point. I forgot about that. > However, that argument doesn't seem to be a reason why the advisory-lock > functions couldn't be parallel-restricted rather than parallel-unsafe. Agreed. > In any case, my question at the moment is whether we need the belt-and- > suspenders-too approach of having both non-parallel-safe marking and an > explicit check inside these functions. We've largely moved away from > hard-wired checks for e.g. superuserness, and surely these things are > less dangerous than most formerly-superuser-only functions. If we can't think of a way that the lack of these checks could crash it, then I think it's OK to remove the hardwired checks. If we can, I'd favor keeping them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Attached partition not considering altered column properties of root partition.
On Tue, Jul 2, 2019 at 9:53 PM Amit Langote wrote: > Thanks for the report. This seems like a bug. Documentation claims > that the child tables inherit column storage options from the parent > table. That's actually enforced in only some cases. I realize I'm just repeating the same argument I've already made before on other related topics, but I don't think this is a bug. Inherited-from-parent is not the same as enforced-to-always-be-the-same-as-parent. Note that this is allowed, changing only the child:

rhaas=# create table foo (a int, b text) partition by range (a);
CREATE TABLE
rhaas=# create table foo1 partition of foo for values from (0) to (10);
CREATE TABLE
rhaas=# alter table foo1 alter column b set storage plain;
ALTER TABLE

As is this, changing only the parent:

rhaas=# alter table only foo alter column b set storage plain;
ALTER TABLE

How can you possibly argue that the intended behavior is everything-always-matches when that's not what's documented and there's absolutely nothing that tries to enforce it? I'm getting really tired of people thinking that they can invent rules for table partitioning that were (1) never intended by the original implementation and (2) not even slightly enforced by the code, and then decide that those behavior changes should not only be made but back-patched. That is just dead wrong. There is no problem here. There is no need to change ANYTHING. There is even less need to do it in the back branches. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: TopoSort() fix
Robert Haas writes: > On Tue, Jul 30, 2019 at 10:27 AM Tom Lane wrote: >> (BTW, why aren't these functions just "parallel restricted"?) > ... > But it is really pretty arguable whether we should feel responsible > for that problem. We could just decide that if you're doing that, and > you don't want the scenario described above to happen, you oughta mark > the function that contains this logic at least PARALLEL RESTRICTED, > and if you don't, then it's your fault for doing a dumb thing. I > believe when we were early on in the development of this we wanted to > be very conservative lest, ah, someone accuse us of not locking things > down well enough, but maybe at this point parallel query is a > sufficiently well-established concept that we should lighten up on > some cases where we took an overly-stringent line. If we take that > view, then I'm not sure why these functions couldn't be just marked > PARALLEL SAFE. No, there's a sufficient reason why we should force advisory locks to be taken in the leader process, namely that the behavior is totally different if we don't: they will disappear at the end of the parallel worker run, not at end of transaction or session as documented. However, that argument doesn't seem to be a reason why the advisory-lock functions couldn't be parallel-restricted rather than parallel-unsafe. In any case, my question at the moment is whether we need the belt-and- suspenders-too approach of having both non-parallel-safe marking and an explicit check inside these functions. We've largely moved away from hard-wired checks for e.g. superuserness, and surely these things are less dangerous than most formerly-superuser-only functions. regards, tom lane
Re: Avoid full GIN index scan when possible
Marc Cousin writes: > By the way, while preparing this, I noticed that it seems that during this > kind of index scan, the interrupt signal is masked > for a very long time. Control-C takes a very long while to cancel the query. > But it's an entirely different problem :) Yeah, that seems like an independent problem/patch, but it's not obvious where to fix --- can you provide a self-contained test case? Meanwhile, I looked at the v3 patch, and it seems like it might not be too far from committable. I think we should *not* let this get bogged down in questions of whether EXPLAIN can report which index quals were used or ignored. That's a problem that's existed for decades in the btree code, with more or less zero user complaints. I do think v3 needs more attention to comments, for instance this hunk is clearly falsifying the adjacent comment:

@@ -141,7 +141,8 @@
 	uint32		i;
 
 	/* Non-default search modes add one "hidden" entry to each key */
-	if (searchMode != GIN_SEARCH_MODE_DEFAULT)
+	if (searchMode != GIN_SEARCH_MODE_DEFAULT &&
+		(searchMode != GIN_SEARCH_MODE_ALL || nQueryValues))
 		nQueryValues++;
 	key->nentries = nQueryValues;
 	key->nuserentries = nUserQueryValues;

Also, I agree with Julien that this

+	so->forcedRecheck = key->triConsistentFn(key) != GIN_TRUE;

probably needs to be

+	so->forcedRecheck |= key->triConsistentFn(key) != GIN_TRUE;

regards, tom lane
Re: make installcheck-world in a clean environment
Alexander Lakhin writes: > 01.07.2019 13:47, Thomas Munro wrote: >> A new CF is here and this is in "Needs Review". Would you like to >> provide a rebased patch, or should it really be withdrawn? > The rebased patch is attached, but I still can't find anyone interested > in reviewing it. > So let's withdraw it. I agree with withdrawing this, and have marked it that way in the CF app. I think the fundamental problem here is that nobody except Alexander is unhappy with the current behavior of "make installcheck", and it is not very clear whether this patch might change the behavior in ways that do make others unhappy. So I think it's best to set it aside pending more people agreeing that there's a problem to solve. regards, tom lane
Re: TopoSort() fix
On Tue, Jul 30, 2019 at 10:27 AM Tom Lane wrote: > I also looked into whether one could use SELECT FOR UPDATE/SHARE to get > stronger locks at a tuple level, but that's been blocked off as well. > You guys really did a pretty good job of locking that down. Thanks. We learned from the master. > After thinking about this for awhile, though, I believe it might be > reasonable to just remove PreventAdvisoryLocksInParallelMode() > altogether. The "parallel unsafe" markings on the advisory-lock > functions seem like adequate protection against somebody running > them in a parallel worker. If you defeat that by calling them from > a mislabeled-parallel-safe wrapper (as the proposed test case does), > then any negative consequences are on your own head. AFAICT the > only actual negative consequence is that the locks disappear the > moment the parallel worker exits, so we'd not be opening any large > holes even to people who rip off the safety cover. > > (BTW, why aren't these functions just "parallel restricted"?) I don't exactly remember why we installed all of these restrictions any more. You might be able to find some discussion of it by searching the archives. I believe we may have been concerned about the fact that group locking would cause advisory locks taken in one process not to conflict with the same advisory lock taken in some cooperating process, and maybe that would be unwelcome behavior for someone. For example, suppose the user defines a function that takes an advisory lock on the number 1, does a bunch of stuff that should never happen multiply at the same time, and then releases the lock. Without parallel query, that will work. With parallel query, it won't, because several workers running the same query might run the same function simultaneously and their locks won't conflict. But it is really pretty arguable whether we should feel responsible for that problem. 
We could just decide that if you're doing that, and you don't want the scenario described above to happen, you oughta mark the function that contains this logic at least PARALLEL RESTRICTED, and if you don't, then it's your fault for doing a dumb thing. I believe when we were early on in the development of this we wanted to be very conservative lest, ah, someone accuse us of not locking things down well enough, but maybe at this point parallel query is a sufficiently well-established concept that we should lighten up on some cases where we took an overly-stringent line. If we take that view, then I'm not sure why these functions couldn't be just marked PARALLEL SAFE. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal to suppress errors thrown by to_reg*()
Takuma Hoshiai writes: > [ fix_to_reg_v2.patch ] I took a quick look through this patch. I'm on board with the goal of not having schema-access violations throw an error in these functions, but the implementation feels pretty ugly and bolted-on. Nobody who had designed the code to do this from the beginning would have chosen to parse the names twice, much less check the ACLs twice, which is what's going to happen in the success path. But a worse problem is that this only fixes the issue for the object name proper. to_regprocedure and to_regoperator also have type name(s) to think about, and this doesn't fix the problem for those:

regression=# create schema s1;
CREATE SCHEMA
regression=# create type s1.d1 as enum('a','b');
CREATE TYPE
regression=# create function f1(s1.d1) returns s1.d1 language sql as
regression-# 'select $1';
CREATE FUNCTION
regression=# select to_regprocedure('f1(s1.d1)');
 to_regprocedure
-----------------
 f1(s1.d1)
(1 row)

regression=# create user joe;
CREATE ROLE
regression=# \c - joe
You are now connected to database "regression" as user "joe".
regression=> select to_regprocedure('f1(s1.d1)');
ERROR: permission denied for schema s1

A closely related issue that's been complained of before is that while these functions properly return NULL when the main object name includes a nonexistent schema:

regression=> select to_regprocedure('q1.f1(int)');
 to_regprocedure
-----------------

(1 row)

it doesn't work for a nonexistent schema in a type name:

regression=> select to_regprocedure('f1(q1.d1)');
ERROR: schema "q1" does not exist

Looking at the back-traces for these failures,

#0  errfinish (dummy=0) at elog.c:411
#1  0x00553626 in aclcheck_error (aclerr=, objtype=OBJECT_SCHEMA,
    objectname=) at aclchk.c:3623
#2  0x0055028f in LookupExplicitNamespace (nspname=, missing_ok=false)
    at namespace.c:2928
#3  0x005b3433 in LookupTypeName (pstate=0x0, typeName=0x20d87a0,
    typmod_p=0x7fff94c3ee38, missing_ok=) at parse_type.c:162
#4  0x005b3b29 in parseTypeString (str=, typeid_p=0x7fff94c3ee3c,
    typmod_p=0x7fff94c3ee38, missing_ok=false) at parse_type.c:822
#5  0x0086fe21 in parseNameAndArgTypes (string=, allowNone=false,
    names=, nargs=0x7fff94c3f01c, argtypes=0x7fff94c3ee80) at regproc.c:1874
#6  0x00870b2d in to_regprocedure (fcinfo=0x2134900) at regproc.c:305

#0  errfinish (dummy=0) at elog.c:411
#1  0x0054dc7b in get_namespace_oid (nspname=, missing_ok=false)
    at namespace.c:3061
#2  0x00550230 in LookupExplicitNamespace (nspname=, missing_ok=false)
    at namespace.c:2922
#3  0x005b3433 in LookupTypeName (pstate=0x0, typeName=0x216bd20,
    typmod_p=0x7fff94c3ee38, missing_ok=) at parse_type.c:162
#4  0x005b3b29 in parseTypeString (str=, typeid_p=0x7fff94c3ee3c,
    typmod_p=0x7fff94c3ee38, missing_ok=false) at parse_type.c:822
#5  0x0086fe21 in parseNameAndArgTypes (string=, allowNone=false,
    names=, nargs=0x7fff94c3f01c, argtypes=0x7fff94c3ee80) at regproc.c:1874
#6  0x00870b2d in to_regprocedure (fcinfo=0x2170f50) at regproc.c:305

it's not *that* far from where we know we need no-error behavior to where it's failing to happen. 
parseNameAndArgTypes isn't even global, so certainly changing its API is not problematic. For the functions below it, we'd have to decide whether it's okay to consider that missing_ok=true also enables treating a schema access failure as "missing", or whether we should add an additional flag argument to decide that. It seems like it might be more flexible to use a separate flag, but that decision could propagate to a lot of places, especially if we conclude that all the namespace.c functions that expose missing_ok also need to expose schema_access_violation_ok. So I think you ought to drop this implementation and fix it properly by adjusting the behaviors of the functions cited above. regards, tom lane
Unused struct member in pgcrypto pgp.c
Hi, In contrib/pgcrypto/pgp.c we have a struct member int_name in digest_info which isn’t used, and seems to have never been used (a potential copy/pasteo from the cipher_info struct?). Is there a reason for keeping this, or can it be removed as per the attached? cheers ./daniel pgcrypto_int_name.diff Description: Binary data
Re: Patch to document base64 encoding
"Karl O. Pinc" writes: > On Mon, 15 Jul 2019 23:00:55 +0200 (CEST) > Fabien COELHO wrote: >> The patch clarifies the documentation about encode/decode and other >> text/binary string conversion functions. > Other notable changes: > Corrects categorization of functions as string or binary. > Reorders functions alphabetically by function name. So I took a look at this, expecting that after so much discussion it ought to just be committable ... but I am befuddled by your choices about which functions to move where. It seems entirely crazy that encode() and decode() are no longer in the same section, likewise that convert_from() and convert_to() aren't documented together anymore. I'm not sure what is the right dividing line between string and binary functions, but I don't think that anyone is going to find this division helpful. I do agree that documenting some functions twice is a bad plan, so we need to clean this up somehow. After some thought, it seems like maybe a workable approach would be to consider that all conversion functions going between text and bytea belong in the binary-string-functions section. I think it's reasonable to say that plain "string functions" just means stuff dealing with text. Possibly we could make a separate table in the binary-functions section just for conversions, although that feels like it might be overkill. While we're on the subject, Table 9.11 (conversion names) seems entirely misplaced, and I don't just mean that it would need to migrate to the binary-functions page. I don't think it belongs in func.sgml at all. Isn't it pretty duplicative of Table 23.2 (Client/Server Character Set Conversions)? I think we should unify it with that table, or at least put it next to that one. Perhaps that's material for a separate patch though. regards, tom lane
Re: pg_upgrade version checking questions
> On 27 Jul 2019, at 08:42, Peter Eisentraut > wrote: > I have committed 0002, 0003, and 0004. Thanks! > The implementation in 0001 (Only allow upgrades by the same exact > version new bindir) has a problem. It compares (new_cluster.bin_version > != PG_VERSION_NUM), but new_cluster.bin_version is actually just the > version of pg_ctl, so this is just comparing the version of pg_upgrade > with the version of pg_ctl, which is not wrong, but doesn't really > achieve the full goal of having all binaries match. Right, it seemed the cleanest option at the time more or less based on the issues outlined below. > I think a better structure would be to add a version check for each > validate_exec() so that each program is checked against pg_upgrade. > This should mirror what find_other_exec() in src/common/exec.c does. In > a better world we would use find_other_exec() directly, but then we > can't support -B. Maybe expand find_other_exec() to support this, or > make a private copy for pg_upgrade to support this. (Also, we have two > copies of validate_exec() around. Maybe this could all be unified.) I’ll take a stab at tidying all of this up to require less duplication, we’ll see where that ends up. cheers ./daniel
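To make the per-binary version check concrete, here is a rough sketch in Python (pg_upgrade itself is C; the program list, helper names, and the assumed "prog (PostgreSQL) X.Y" version-line format are illustrative assumptions, not what any patch in this thread implements):

```python
import re
import subprocess

def parse_version_line(line):
    """Extract the version from output such as 'pg_ctl (PostgreSQL) 12.0'
    and return it as a comparable tuple like (12, 0)."""
    m = re.search(r'\(PostgreSQL\)\s+(\d+(?:\.\d+)*)', line)
    if m is None:
        raise ValueError("unrecognized version line: %r" % line)
    return tuple(int(p) for p in m.group(1).split("."))

def check_exec_versions(bindir, expected,
                        programs=("pg_ctl", "pg_dump", "pg_resetwal")):
    """Run each program with --version and fail if any of them reports a
    version different from the one pg_upgrade itself was built as."""
    for prog in programs:
        out = subprocess.run([f"{bindir}/{prog}", "--version"],
                             capture_output=True, text=True,
                             check=True).stdout
        got = parse_version_line(out)
        if got != expected:
            raise RuntimeError(f"{prog} is version {got}, expected {expected}")
```

The point mirrors find_other_exec() in src/common/exec.c: every validated executable, not just pg_ctl, gets compared against the checking program's own version.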
Re: Parallel grouping sets
On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
> On Wed, Jun 12, 2019 at 10:58 AM Richard Guo wrote:
>> Hi all, Paul and I have been hacking recently to implement parallel
>> grouping sets, and here we have two implementations.
>>
>> Implementation 1: Attached is the patch and also there is a github
>> branch [1] for this work.
>
> Rebased with the latest master.

Hi Richard, thanks for the rebased patch. I think the patch is mostly
fine (at least I don't see any serious issues). A couple minor comments:

1) I think get_number_of_groups() would deserve a short explanation why
it's OK to handle (non-partial) grouping sets and regular GROUP BY in
the same branch. Before, these cases were clearly separated; now it
seems a bit mixed up and it may not be immediately obvious why it's OK.

2) There are new regression tests, but they are not added to any
schedule (parallel or serial), and so are not executed as part of "make
check". I suppose this is a mistake.

3) The regression tests do check plan and results like this:

EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
SELECT ... ORDER BY 1, 2, 3;

which however means that the query might easily use a different plan
than what's verified in the explain (thanks to the additional ORDER BY
clause). So I think this should explain and execute the same query. (In
this case the plans seem to be the same, but that may easily change in
the future, and we could miss it here, failing to verify the results.)

4) It might be a good idea to check the negative case too, i.e. a query
on a data set that we should not parallelize (because the number of
partial groups would be too high).

Do you have any plans to hack on the second approach too? AFAICS those
two approaches are complementary (address different data sets /
queries), and it would be nice to have both. One of the things I've been
wondering is if we need to invent gset_id as a new concept, or if we
could simply use the existing GROUPING() function - that uniquely
identifies the grouping set.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: tap tests driving the database via psql
Andres Freund writes: > On 2019-07-30 09:40:54 -0400, Tom Lane wrote: >> Now, none of these things are really a problem with DBD/DBI as such >> --- rather, they are reasons not to depend on a pre-packaged build >> of DBD::Pg that depends on a pre-packaged build of libpq.so. >> I haven't looked at the size, or the license, of DBD::Pg ... but >> could it be sane to include our own modified copy in the tree? > I had that as an alternative too. I think the license (Artistic v1/GPL > v1) probably makes that non-feasible. The pure-perl version of DBI > probably would otherwise be realistic. OK, so just lifting DBD::Pg in toto is out for license reasons. However, maybe we could consider writing a new DBD driver from scratch (while using a platform-provided DBI layer) rather than doing everything from scratch. I'm not sure how much actual functionality is in the DBI layer, so maybe that approach wouldn't buy much. regards, tom lane
Re: tap tests driving the database via psql
Hi, On 2019-07-30 09:40:54 -0400, Tom Lane wrote: > Craig Ringer writes: > > On Sun, 28 Jul 2019 at 03:15, Andres Freund wrote: > >> 1) Just depend on DBD::Pg being installed. It's fairly common, after > >> all. It'd be somewhat annoying that we'd often end up using a > >> different version of libpq than what we're testing against. But in > >> most cases that'd not be particularly problematic. > > > I advocated for this in the past, and still think it's the best option. > > I think the not-same-libpq issue would be a much bigger problem than either > of you are accounting for. Some off-the-top-of-the-head reasons: I came to the same conclusion? > Now, none of these things are really a problem with DBD/DBI as such > --- rather, they are reasons not to depend on a pre-packaged build > of DBD::Pg that depends on a pre-packaged build of libpq.so. > I haven't looked at the size, or the license, of DBD::Pg ... but > could it be sane to include our own modified copy in the tree? I had that as an alternative too. I think the license (Artistic v1/GPL v1) probably makes that non-feasible. The pure-perl version of DBI probably would otherwise be realistic. Greetings, Andres Freund
Re: TopoSort() fix
Robert Haas writes: > On Mon, Jul 29, 2019 at 9:48 PM Tom Lane wrote: >> I believe the only accessible route to taking any sort of new lock >> in a parallel worker is catalog lookups causing AccessShareLock on >> a catalog. > Can't the worker just query a previously-untouched table, maybe by > constructing a string and then using EXECUTE to execute it? Hm, yeah, looks like you could get a new AccessShareLock that way too. But not any exclusive lock. I also looked into whether one could use SELECT FOR UPDATE/SHARE to get stronger locks at a tuple level, but that's been blocked off as well. You guys really did a pretty good job of locking that down. After thinking about this for awhile, though, I believe it might be reasonable to just remove PreventAdvisoryLocksInParallelMode() altogether. The "parallel unsafe" markings on the advisory-lock functions seem like adequate protection against somebody running them in a parallel worker. If you defeat that by calling them from a mislabeled-parallel-safe wrapper (as the proposed test case does), then any negative consequences are on your own head. AFAICT the only actual negative consequence is that the locks disappear the moment the parallel worker exits, so we'd not be opening any large holes even to people who rip off the safety cover. (BTW, why aren't these functions just "parallel restricted"?) regards, tom lane
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 10:06 AM Masahiko Sawada wrote: > On Tue, Jul 30, 2019 at 5:03 AM Sehrope Sarkuni > wrote: > > > > On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian wrote: > >> > >> > Checking that all buffers using a single LSN are from the same > >> > relation would be a good idea but I think it's hard to test it and > >> > regard the test result as okay. Even if we passed 'make checkworld', > >> > it might still be possible to happen. And even assertion failures > >> > >> Yes, the problem is that if you embed the relfilenode or tablespace or > >> database in the encryption IV, you then need to then make sure you > >> re-encrypt any files that move between these. I am hesitant to do that > >> since it then requires these workarounds for encryption going forward. > >> We know that most people will not be using encryption, so that will not > >> be well tested either. For pg_upgrade, I used a minimal-impact > >> approach, and it has allowed dramatic changes in our code without > >> requiring changes and retesting of pg_upgrade. > > > > > > Will there be a per-relation salt stored in a separate file? I saw it > mentioned in a few places (most recently > https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org) > but there's also discussion of trying to make the TDEK unique without a > separate salt so I'm unsure. > > > > With a per-relation salt there is no need to include fixed attributes > (database, relfilenode, or tablespace) to ensure the derived key is unique > per relation. A long salt (32-bytes from /dev/urandom) alone guarantees > that uniqueness. Copying or moving files would then be possible by also > copying the salt. It does not need to be a salt per file on disk either, > one salt can be used for many files for the same relation by including the > fork number, type, or segment in the TDEK derivation (so each file on disk > for that relation ends up with a unique TDEK). 
> > If we can derive unique TDEK using (database, tablespace, relfilenode) > as info I think it's better to use it rather than using random salt > per relations since it doesn't require additional information we need > to store. As described in HKDF RFC[1], if the input key is already > present as a cryptographically strong key we can skip the extract part > where use a salt. > > [1] https://tools.ietf.org/html/rfc5869 Yes a random salt is not required for security reasons. Any unique values mixed into the HKDF is fine and the derived keys will still be unique. The HKDF ensures that uniqueness. The separate salt allows you to disconnect the key derivation from the physical attributes of the file. The physical attributes (ex: database, tablespace, file node) are very convenient as they're unique and do not require additional storage. However using them prevents copying or moving the encrypted files as one or more of them would be different at the destination (so the derived key would no longer decrypt the existing data). So you would have to decrypt / encrypt everything as part of a copy. If copying raw files without a decrypt/encrypt cycle is desired then the key derivation cannot include physical attributes (or per Bruce's note above, there would be no separate key derivation relation). I thought it'd be a nice property to have as it limits the amount of code that needs to be crypto aware (ex: copying a database or moving a table to a different tablespace would not change beyond ensuring the salt is also copied). Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
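The two derivation schemes being weighed here - HKDF keyed by fixed physical attributes versus HKDF with a per-relation random salt - can be illustrated with a small stdlib-only sketch of RFC 5869. The key sizes, info encoding, and use of SHA-256 below are assumptions for illustration, not what any patch in this thread implements:

```python
import hashlib
import hmac

def hkdf_sha256(key, salt, info, length=32):
    """Minimal HKDF (RFC 5869) with SHA-256: extract, then expand."""
    if salt is None:
        # RFC 5869: absent salt defaults to a string of HashLen zeros
        salt = b"\x00" * hashlib.sha256().digest_size
    prk = hmac.new(salt, key, hashlib.sha256).digest()   # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                             # expand step
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

mdek = b"\x01" * 32  # stand-in for the master data encryption key

# Fixed-attribute variant: uniqueness comes from (db, tablespace, relfilenode)
tdek_a = hkdf_sha256(mdek, None, b"db=16384|ts=1663|rel=16385")
tdek_b = hkdf_sha256(mdek, None, b"db=16384|ts=1663|rel=16386")

# Per-relation-salt variant: uniqueness comes from the random salt, so the
# info can stay file-local (fork/segment) and files remain copyable
tdek_c = hkdf_sha256(mdek, b"\xaa" * 32, b"fork=main|seg=0")
```

Either way the HKDF guarantees distinct derived keys; the difference is only whether the uniqueness input is tied to the file's physical identity or stored alongside it.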
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 8:16 AM Bruce Momjian wrote:
> On Tue, Jul 30, 2019 at 07:44:20AM -0400, Sehrope Sarkuni wrote:
> > If each relation file has its own derived key, the derived TDEK for
> > that relation file, then there is no issue with reusing the same
> > IV = LSN || Page Number. The TDEKs will be different so Key + IV will
> > never collide.
>
> So, this email explains that we are considering not using the
> relfilenode/tablespace/database to create a derived key per relation,
> but use the same key for all relations because the IV will be unique
> per page across all relations:
>
> https://www.postgresql.org/message-id/20190729134442.2bxakegiqafxg...@momjian.us
>
> There is talk of using a derived key with a constant to make sure all
> heap/index files use a different derived key than WAL, but I am not
> sure. This is related to whether WAL IV and per-heap/index IV can
> collide.

Ah, I read that to imply that derived keys were a must. Specifically
this piece at the end:

From Joe's email on 2019-07-13 18:41:34:
>> Based on all of that I cannot find a requirement that we use more
>> than one key per database.
>>
>> But I did find that files in an encrypted file system are encrypted
>> with derived keys from a master key, and I view this as analogous to
>> what we are doing.

I read that as the "one key per database" is the MDEK. And read the part
about derived keys as referring to separate derived keys for relations.
Perhaps I misread and it was referring to different keys for WAL vs
pages.

> There are other emails in the thread that also discuss the topic. The
> issue is that we add a lot of complexity to other parts of the system
> (e.g. pg_upgrade, CREATE DATABASE, moving relations between
> tablespaces) to create a derived key, so we should make sure we need
> it before we do it.

Yes it definitely complicates things both on the derivation and
potential additional storage for the salts (they're small and fixed
size, but you still need to put it somewhere).
I think key rotation for TDEK will be impossible without some stored
salt and per-relation derived key. It might not be needed in a first cut
though as the "default salt" could be no salt or a placeholder of all
zeroes. Even if the rotation itself is out of scope for a first pass,
the potential to eventually add it should be there.

Should keep in mind that because we do not have a MAC on the encrypted
pages we'll need to know which derived key to use. We can't try multiple
options to see which is correct, as any key would "succeed" with garbage
decrypted data.

> > In general it's fine to use the same IV with different keys. Only
> > reuse of Key + IV is a problem and the entire set of possible
> > counter values (IV + 0, IV + 1, ...) generated with a key must be
> > unique. That's also why we must leave at least
> > log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be
> > filled in with 0, 1, 2, ... for each 16-byte AES-block on the page.
> > If our per-page IV prefix used any of those bits then the counter
> > could overflow into the next page's IV's range.
>
> Agreed.
>
> Attached is an updated patch that checks only main relation forks,
> which I think are the only files we are going to encrypt, and it has
> better comments.

Okay that makes sense in the context of using a single key and relying
on the LSN based IV to be unique.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
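The counter-overflow argument quoted above can be checked with a small sketch. The byte layout here (8-byte LSN, 4-byte page number, 4 reserved zero bytes for the per-block counter) is purely illustrative, not the layout of any patch in this thread:

```python
PAGE_SIZE = 8192
AES_BLOCK_SIZE = 16
BLOCKS_PER_PAGE = PAGE_SIZE // AES_BLOCK_SIZE  # 512 counter values per page

def page_iv(lsn, page_number):
    """Build a 16-byte CTR IV as LSN || page number || zeroed counter bits.
    The trailing zero bytes leave room for the per-block counter
    (0..BLOCKS_PER_PAGE-1) without spilling into the next page's range."""
    assert 0 <= lsn < 2**64 and 0 <= page_number < 2**32
    return lsn.to_bytes(8, "big") + page_number.to_bytes(4, "big") + b"\x00" * 4

def block_counter_value(iv, block_in_page):
    """The counter block actually fed to AES for one 16-byte cipher block."""
    assert 0 <= block_in_page < BLOCKS_PER_PAGE
    return (int.from_bytes(iv, "big") + block_in_page).to_bytes(16, "big")
```

Because at least log2(PAGE_SIZE / AES_BLOCK_SIZE) = 9 low bits are reserved, the largest counter used for one page stays strictly below the first counter of the next page, so Key + IV pairs never repeat under a single key.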
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 5:03 AM Sehrope Sarkuni wrote: > > On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian wrote: >> >> > Checking that all buffers using a single LSN are from the same >> > relation would be a good idea but I think it's hard to test it and >> > regard the test result as okay. Even if we passed 'make checkworld', >> > it might still be possible to happen. And even assertion failures >> >> Yes, the problem is that if you embed the relfilenode or tablespace or >> database in the encryption IV, you then need to then make sure you >> re-encrypt any files that move between these. I am hesitant to do that >> since it then requires these workarounds for encryption going forward. >> We know that most people will not be using encryption, so that will not >> be well tested either. For pg_upgrade, I used a minimal-impact >> approach, and it has allowed dramatic changes in our code without >> requiring changes and retesting of pg_upgrade. > > > Will there be a per-relation salt stored in a separate file? I saw it > mentioned in a few places (most recently > https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org) > but there's also discussion of trying to make the TDEK unique without a > separate salt so I'm unsure. > > With a per-relation salt there is no need to include fixed attributes > (database, relfilenode, or tablespace) to ensure the derived key is unique > per relation. A long salt (32-bytes from /dev/urandom) alone guarantees that > uniqueness. Copying or moving files would then be possible by also copying > the salt. It does not need to be a salt per file on disk either, one salt can > be used for many files for the same relation by including the fork number, > type, or segment in the TDEK derivation (so each file on disk for that > relation ends up with a unique TDEK). 
If we can derive unique TDEK using (database, tablespace, relfilenode) as info I think it's better to use it rather than using random salt per relations since it doesn't require additional information we need to store. As described in HKDF RFC[1], if the input key is already present as a cryptographically strong key we can skip the extract part where use a salt. [1] https://tools.ietf.org/html/rfc5869 Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 8:16 AM Masahiko Sawada wrote: > On Mon, Jul 29, 2019 at 8:18 PM Sehrope Sarkuni > wrote: > > > > On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada > wrote: > > > > An argument could be made to push that problem upstream, i.e. let the > > > > supplier of the passphrase deal with the indirection. You would still > > > > need to verify the supplied passphrase/key is correct via something > > > > like authenticating against a stored MAC. > > > > > > So do we need the key for MAC of passphrase/key in order to verify? > > > > Yes. Any 128 or 256-bit value is a valid AES key and any 16-byte input > > can be "decrypted" with it in both CTR and CBC mode, you'll just end > > up with garbage data if the key does not match. Verification of the > > key prior to usage (i.e. starting DB and encrypting/decrypting data) > > is a must as otherwise you'll end up with all kinds of corruption or > > data loss. > > > > Do you mean that we encrypt and store a 16 byte input with the correct > key to the disk, and then decrypt it with the user supplied key and > compare the result to the input data? > Yes but we don't compare via decryption of a known input. We instead compute a MAC of the encrypted master key using the user supplied key, and compare that against an expected MAC stored alongside the encrypted master key. 
The pseudo code would be something like:

// Read key text from user:
string raw_kek = read_from_user()

// Normalize it to a fixed size of 64-bytes
byte[64] kek = SHA512(SHA512(raw_kek))

// Split the 64-bytes into a separate encryption and MAC key
byte[32] user_encryption_key = kek.slice(0, 32)
byte[32] user_mac_key = kek.slice(32, 64)

// Read our saved MAC and encrypted master key
byte[80] mac_iv_encrypted_master_key = read_from_file()

// First 32-bytes is the MAC of the rest
byte[32] expected_mac = mac_iv_encrypted_master_key.slice(0, 32)

// Rest is a random IV + Encrypted master key
byte[48] iv_encrypted_master_key = mac_iv_encrypted_master_key.slice(32, 80)

// Compute the MAC with the user supplied key
byte[32] actual_mac = HMAC(user_mac_key, iv_encrypted_master_key)

// If it does not match then the user key is invalid
if (actual_mac != expected_mac) {
    print_err_and_exit("Bad user key!")
}

// Our MAC was correct
// ... so we know user supplied key is valid
// ... and we know our iv and encrypted_key are valid
byte[16] iv = iv_encrypted_master_key.slice(0, 16)
byte[32] encrypted_master_key = iv_encrypted_master_key.slice(16, 48)

// ... so we can use all three to decrypt the master key (MDEK)
byte[32] master_key = decrypt_aes_cbc(user_encryption_key, iv, encrypted_master_key)

> > From a single user supplied passphrase you would derive the MDEK and
> > compute a MAC (either using the same key or via a separate derived
> > MDEK-MAC key). If the computed MAC matches against the previously
> > stored value then you know the MDEK is correct as well.
>
> You meant KEK, not MDEK?

If the KEK is incorrect then the MAC validation would fail and the
decrypt would never be attempted. If the MAC matches then both the KEK
(user supplied key) and MDEK ("master_key" in the pseudo code above)
would be confirmed to be valid. So the MDEK is safe to use for deriving
keys for encrypt / decrypt.
I'm using the definitions for "KEK" and "MDEK" from Joe's mail https://www.postgresql.org/message-id/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
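For reference, the verification flow in the pseudo code above can be written with just Python's standard library. The sizes follow the pseudo code; the AES-CBC decrypt of the MDEK is omitted since it needs a cipher library outside the stdlib, and the helper names are invented for illustration:

```python
import hashlib
import hmac

def split_kek(raw_kek: bytes):
    """Normalize the passphrase to 64 bytes with double SHA-512 and split
    it into a 32-byte encryption key and a 32-byte MAC key."""
    kek = hashlib.sha512(hashlib.sha512(raw_kek).digest()).digest()
    return kek[:32], kek[32:]

def wrap_for_storage(raw_kek: bytes, iv: bytes, encrypted_mdek: bytes) -> bytes:
    """Produce the 80-byte blob: MAC(32) || IV(16) || encrypted MDEK(32)."""
    _, user_mac_key = split_kek(raw_kek)
    body = iv + encrypted_mdek
    return hmac.new(user_mac_key, body, hashlib.sha256).digest() + body

def verify_and_unwrap(raw_kek: bytes, stored: bytes):
    """Check the stored MAC against one recomputed with the user-supplied
    key; only on a match return (iv, encrypted_master_key) for the
    subsequent AES-CBC decrypt (not shown here)."""
    assert len(stored) == 80
    _, user_mac_key = split_kek(raw_kek)
    expected_mac, iv_and_key = stored[:32], stored[32:]
    actual_mac = hmac.new(user_mac_key, iv_and_key, hashlib.sha256).digest()
    if not hmac.compare_digest(actual_mac, expected_mac):
        raise ValueError("Bad user key!")
    return iv_and_key[:16], iv_and_key[16:]
```

Note the constant-time comparison (hmac.compare_digest) for the MAC check; a plain == comparison would leak timing information about the expected MAC.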
Re: tap tests driving the database via psql
Craig Ringer writes:
> On Sun, 28 Jul 2019 at 03:15, Andres Freund wrote:
>> 1) Just depend on DBD::Pg being installed. It's fairly common, after
>> all. It'd be somewhat annoying that we'd often end up using a
>> different version of libpq than what we're testing against. But in
>> most cases that'd not be particularly problematic.
>
> I advocated for this in the past, and still think it's the best option.

I think the not-same-libpq issue would be a much bigger problem than
either of you are accounting for. Some off-the-top-of-the-head reasons:

* Since we'd have to presume a possibly-very-back-rev libpq, we could
never add tests related to any recent libpq bug fixes or new features.

* The system libpq might have a different idea of default port and/or
socket directory than the test build. Yeah, we could work around that by
forcing the choice all the time, but it would be a constant hazard.

* We don't have control over what else gets brought in beside libpq.
Depending on the whims of the platform packagers, there might need to be
other parts of the platform's default postgres installation, eg psql,
sitting in one's PATH. Again, this wouldn't be too much of a hazard for
pre-debugged test scripts --- but it would be a huge hazard for
developers, who do lots of things manually and would always be at risk
of invoking the wrong psql. I learned long ago not to have any part of a
platform's postgres packages in place on my development systems, and I
will fight hard against any test procedure change that requires me to do
differently.

Now, none of these things are really a problem with DBD/DBI as such
--- rather, they are reasons not to depend on a pre-packaged build of
DBD::Pg that depends on a pre-packaged build of libpq.so. I haven't
looked at the size, or the license, of DBD::Pg ... but could it be sane
to include our own modified copy in the tree? (Forking DBD::Pg would
also let us add custom testing features to it ...)
> The community IMO wastes *so* much time on not-invented-here make-work and > on jumping through hoops to avoid depending on anything newer than the late > '90s. That's an unnecessary, and false, ad-hominem attack. regards, tom lane
Re: block-level incremental backup
On Tue, Jul 30, 2019 at 1:28 AM Robert Haas wrote:
> On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova wrote:
> > In attachments, you can find a prototype of incremental pg_basebackup,
> > which consists of 2 features:
> >
> > 1) To perform incremental backup one should call pg_basebackup with a
> > new argument:
> >
> > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
> >
> > where lsn is a start_lsn of parent backup (can be found in
> > "backup_label" file)
> >
> > It calls BASE_BACKUP replication command with a new argument
> > PREV_BACKUP_START_LSN 'lsn'.
> >
> > For datafiles, only pages with LSN > prev_backup_start_lsn will be
> > included in the backup. They are saved into 'filename.partial' file,
> > 'filename.blockmap' file contains an array of BlockNumbers.
> > For example, if we backuped blocks 1,3,5, filename.partial will
> > contain 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
>
> I think it's better to keep both the information about changed blocks
> and the contents of the changed blocks in a single file. The list of
> changed blocks is probably quite short, and I don't really want to
> double the number of files in the backup if there's no real need. I
> suspect it's just overall a bit simpler to keep everything together.
> I don't think this is a make-or-break thing, and welcome contrary
> arguments, but that's my preference.

I have experience working on a similar product, and I agree with Robert:
keeping the changed-block info and the changed blocks in a single file
makes more sense. +1

> > 2) To merge incremental backup into a full backup call
> >
> > pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
> > --merge-backups
> >
> > It will move all files from 'incremental_basedir' to 'basedir'
> > handling '.partial' files correctly.
>
> This, to me, looks like it's much worse than the design that I
> proposed originally. It means that:
>
> 1.
> You can't take an incremental backup without having the full backup
> available at the time you want to take the incremental backup.
>
> 2. You're always storing a full backup, which means that you need more
> disk space, and potentially much more I/O while taking the backup.
> You save on transfer bandwidth, but you add a lot of disk reads and
> writes, costs which have to be paid even if the backup is never
> restored.
>
> > 1) Whether we collect block maps using simple "read everything page
> > by page" approach or WAL scanning or any other page tracking
> > algorithm, we must choose a map format. I implemented the simplest
> > one, while there are more ideas:
>
> I think we should start simple.
>
> I haven't had a chance to look at Jeevan's patch at all, or yours in
> any detail, as yet, so these are just some very preliminary comments.
> It will be good, however, if we can agree on who is going to do what
> part of this as we try to drive this forward together. I'm sorry that
> I didn't communicate EDB's plans to work on this more clearly;
> duplicated effort serves nobody well.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- Ibrar Ahmed
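The merge step being debated is essentially a block overlay: apply each changed block from the incremental backup on top of the full-backup file. A hedged sketch, assuming the two-file layout from Anastasia's description (blockmap of BlockNumbers plus a .partial file holding those blocks back to back) - the names and layout are illustrative, not the actual patch:

```python
BLOCK_SIZE = 8192  # standard PostgreSQL page size

def merge_incremental(base: bytearray, blockmap, partial: bytes) -> bytearray:
    """Overlay changed blocks onto a full-backup file image.
    blockmap lists the changed BlockNumbers; partial holds those blocks
    contiguously in the same order."""
    assert len(partial) == len(blockmap) * BLOCK_SIZE
    for i, blkno in enumerate(blockmap):
        chunk = partial[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        end = (blkno + 1) * BLOCK_SIZE
        if end > len(base):
            # relation grew after the base backup was taken
            base.extend(b"\x00" * (end - len(base)))
        base[blkno * BLOCK_SIZE:end] = chunk
    return base
```

Robert's single-file alternative would only change where the blockmap lives (a header in the same file rather than a sibling file); the overlay loop itself is the same either way.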
Re: [proposal] de-TOAST'ing using a iterator
John Naylor wrote on Mon, Jul 29, 2019 at 11:49 AM: > On Thu, Jul 25, 2019 at 10:21 PM Binguo Bao wrote: > My goal for this stage of review was to understand more fully what the > code is doing, and make it as simple and clear as possible, starting > at the top level. In doing so, it looks like I found some additional > performance gains. I haven't looked much yet at the TOAST fetching > logic. > > > 1). For every needle comparison, text_position_next_internal() > calculates how much of the value is needed and passes that to > detoast_iterate(), which then calculates if it has to do something or > not. This is a bit hard to follow. There might also be a performance > penalty -- the following is just a theory, but it sounds plausible: > The CPU can probably correctly predict that detoast_iterate() will > usually return the same value it did last time, but it still has to > call the function and make sure, which I imagine is more expensive > than advancing the needle. Ideally, we want to call the iterator only > if we have to. > > In the attached patch (applies on top of your v5), > text_position_next_internal() simply compares hptr to the detoast > buffer limit, and calls detoast_iterate() until it can proceed. I > think this is clearer. Yes, I think this is a general scenario where the caller continually calls detoast_iterate until it gets enough data, so I think such operations can be extracted as a macro, as I did in patch v6. In the macro, the detoast_iterate function is called only when the data requested by the caller is greater than the buffer limit. (I'm not sure of the error handling, see #2.) > In this scheme, the only reason to know length is to pass to > pglz_decompress_iterate() in the case of in-line compression. As I > alluded to in my first review, I don't think it's worth the complexity > to handle that iteratively since the value is only a few kB. I made it > so in-line datums are fully decompressed as in HEAD and removed struct > members to match. Sounds good.
This not only simplifies the structure and logic of Detoast Iterator but also has no major impact on efficiency. > I also noticed that no one updates or looks at > "toast_iter.done" so I removed that as well. > toast_iter.done is updated when the buffer limit reached the buffer capacity now. So, I added it back. > Now pglz_decompress_iterate() doesn't need length at all. For testing > I just set decompress_all = true and let the compiler optimize away > the rest. I left finishing it for you if you agree with these changes. > Done. > 2). detoast_iterate() and fetch_datum_iterate() return a value but we > don't check it or do anything with it. Should we do something with it? > It's also not yet clear if we should check the iterator state instead > of return values. I've added some XXX comments as a reminder. We > should also check the return value of pglz_decompress_iterate(). > IMO, we need to provide users with a simple iterative interface. Using the required data pointer to compare with the buffer limit is an easy way. And the application scenarios of the iterator are mostly read operations. So I think there is no need to return a value, and the iterator needs to throw an exception for some wrong calls, such as all the data have been iterated, but the user still calls the iterator. > > 3). Speaking of pglz_decompress_iterate(), I diff'd it with > pglz_decompress(), and I have some questions on it: > > a). > + srcend = (const unsigned char *) (source->limit == source->capacity > ? source->limit : (source->limit - 4)); > > What does the 4 here mean in this expression? Since we fetch chunks one by one, if we make srcend equals to the source buffer limit, In the while loop "while (sp < srcend && dp < destend)", sp may exceed the source buffer limit and read unallocated bytes. Giving a four-byte buffer can prevent sp from exceeding the source buffer limit. 
If we have read all the chunks, we no longer need to guard against crossing the boundary, and can simply make srcend equal to the source buffer limit. I've added comments explaining this in patch v6.

> Is it possible it's compensating for this bit in init_toast_buffer()?
>
> + buf->limit = VARDATA(buf->buf);
>
> It seems the initial limit should also depend on whether the datum is compressed, right? Can we just do this:
>
> + buf->limit = buf->position;

I'm afraid not. buf->position points to the data portion of the buffer, but the beginning of the chunks we read may contain header information. For example, for compressed data chunks, the first four bytes record the size of the raw data, which means that limit is four bytes ahead of position. This initialization doesn't cause errors, even though position is less than limit in other cases, because we always fetch chunks first and then decompress them.

> b).
> - while (sp < srcend && dp < destend)
> ...
> + while (sp + 1 < srcend && dp < destend &&
> ...
>
> Why is it "sp + 1" here?

Ignore it; I set the inactive state of detoast_iter->ctrl to 8 in
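The caller-side pattern discussed above — compare the position the caller needs against the detoast buffer limit, and call the iterator only when more data is actually required — can be modeled as a toy sketch. All names here are illustrative, not the actual PostgreSQL identifiers:

```python
# Toy model of the buffer-limit pattern: the consumer compares the byte
# position it needs against the buffer limit, and calls the iterator only
# when more data is actually required (illustrative names only).

class DetoastIterator:
    def __init__(self, chunks):
        self.chunks = list(chunks)   # TOAST chunks still to fetch
        self.buf = bytearray()       # data fetched so far
        self.done = False

    @property
    def limit(self):
        return len(self.buf)         # one past the last valid byte

    def iterate(self):
        """Fetch and append one more chunk (one 'detoast_iterate' call)."""
        if self.chunks:
            self.buf.extend(self.chunks.pop(0))
        self.done = not self.chunks

def need_bytes(it, upto):
    """Macro-like helper: iterate only until byte range [0, upto) is covered."""
    while it.limit < upto and not it.done:
        it.iterate()

it = DetoastIterator([b"abc", b"def", b"gh"])
need_bytes(it, 5)            # two chunks suffice to cover bytes [0, 5)
assert bytes(it.buf) == b"abcdef"
need_bytes(it, 5)            # already satisfied: no further fetch happens
assert bytes(it.buf) == b"abcdef"
need_bytes(it, 8)
assert bytes(it.buf) == b"abcdefgh" and it.done
```

The key property, as John notes, is that in the common case `need_bytes` is a cheap comparison and the iterator function is not called at all.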
Re: Built-in connection pooler
On Tue, Jul 30, 2019 at 01:01:48PM +0300, Konstantin Knizhnik wrote: On 30.07.2019 4:02, Tomas Vondra wrote:

My idea (sorry if it wasn't too clear) was that we might handle some cases more gracefully. For example, if we only switch between transactions, we don't quite care about 'SET LOCAL' (but the current patch does set the tainted flag). The same thing applies to GUCs set for a function. For prepared statements, we might count the number of statements we prepared and deallocated, and treat it as 'not tainted' when there are no statements. Maybe there's some risk I can't think of. The same thing applies to temporary tables - if you create and drop a temporary table, is there a reason to still treat the session as tainted?

I already handle temporary tables with transaction scope (created using the "create temp table ... on commit drop" command) - the backend is not marked as tainted in this case. Thank you for your notice about the "set local" command - the attached patch also handles such GUCs.

Thanks.

To implement prepared statements we need to store them in the session context, or at least add some session-specific prefix to the prepared statement name. Temporary tables also require per-session temporary table space. With GUCs the situation is even more complicated - actually, most of the time on my PgPro-EE pooler version was spent fighting with GUCs (default values, reloading configuration, memory allocation/deallocation, ...). But the show-stopper is temporary tables: if they are accessed through local (non-shared) buffers, then you cannot reschedule a session to some other backend. This is why I now have a patch implementing global temporary tables (a la Oracle), which have global metadata and are accessed through shared buffers (which also allows them to be used in parallel queries).

Yeah, temporary tables are messy. Global temporary tables would be nice, not just because of this, but also because of catalog bloat.

Global temp tables solve two problems: 1.
catalog bloat, and 2. parallel query execution. They do not solve the problem of using temporary tables on a replica. Maybe that problem can be solved by implementing a special table access method for temporary tables, but I am still not sure how useful such an implementation would be. It obviously requires much more effort (a lot of heapam machinery would need to be reimplemented), but it would allow eliminating the MVCC overhead for temporary tuples and perhaps also save space by reducing the tuple header size.

Sure. Temporary tables are a hard issue (another place where they cause trouble is 2PC transactions, IIRC), so I think it's perfectly sensible to accept the limitation, handle the cases that we can handle, and see if we can improve the remaining cases later.

If a Postgres backend is able to work with only one database, then you will have to start at least as many backends as you have databases. The situation with users is more obscure - it may be possible to implement multiuser access to the same backend (as can be done now using "set role").

Yes, that's a direct consequence of the PostgreSQL process model. I don't think I've said we need anything like that. The way I'd expect it to work is that when we run out of backend connections, we terminate some existing ones (and then fork new backends).

I'm afraid that it may eliminate most of the positive effect of session pooling if we terminate and launch new backends without any attempt to bind backends to databases and reuse them.

I'm not sure I understand. Surely we'd still reuse connections as much as possible - we'd still keep the per-db/user connection pools, but after running out of backend connections we'd pick a victim in one of the pools, close it and open a new connection.
We'd need some logic for picking the 'victim' but that does not seem particularly hard - idle connections first, then connections from "oversized" pools (this is one of the reasons why pgbouncer has min_connection_pool).

So I am not sure that even if we implement a sophisticated configurator which allows specifying, in some configuration file, a maximal/optimal number of workers for each database/role pair, it would completely eliminate the problem of multiple session pools.

Why would we need to invent any sophisticated configurator? Why couldn't we use some version of what pgbouncer already does, or maybe integrate it somehow into pg_hba.conf?

I didn't think about such a possibility. But I suspect there would be many problems with reusing pgbouncer code and moving it to Postgres core.

To be clear - I wasn't suggesting copying any code from pgbouncer. It's far too different (style, ...) compared to core. I'm suggesting adopting roughly the same configuration approach, i.e. what parameters are allowed for each pool, global limits, etc. I don't know whether we should have a separate configuration file, make it part of pg_hba.conf
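The victim-picking heuristic described above (idle connections first, then connections from oversized pools) might look roughly like the following. All names and data structures here are hypothetical illustrations, not code from the patch or from pgbouncer:

```python
# Hypothetical sketch of the 'victim' selection heuristic discussed above:
# prefer idle backends, then backends from pools exceeding a configured
# minimum size.  Purely illustrative, not actual PostgreSQL code.

def pick_victim(pools, min_pool_size):
    """pools: dict mapping (db, user) -> list of {'idle': bool} backends."""
    # 1. Any idle backend, preferring the largest pool.
    for key, conns in sorted(pools.items(), key=lambda kv: -len(kv[1])):
        for conn in conns:
            if conn["idle"]:
                return key, conn
    # 2. Otherwise, a backend from an "oversized" pool.
    for key, conns in pools.items():
        if len(conns) > min_pool_size:
            return key, conns[0]
    return None  # nothing safe to evict

pools = {
    ("db1", "alice"): [{"idle": False}, {"idle": True}],
    ("db2", "bob"):   [{"idle": False}],
}
key, victim = pick_victim(pools, min_pool_size=1)
assert key == ("db1", "alice") and victim["idle"]
```

The point is simply that the eviction policy is a small, local decision on top of the existing per-db/user pools, not a new configuration subsystem.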
Re: SQL:2011 PERIODS vs Postgres Ranges?
Hi Paul, I have rebased the patch to master (1e2fddfa33d3c7cc93ca3ee0f32852699bd3e012) and fixed some compilation warnings. Now I am reviewing the actual code. On Fri, Jul 26, 2019 at 6:35 PM Ibrar Ahmed wrote: > The patch requires a rebase on the master branch. > > The new status of this patch is: Waiting on Author > -- Ibrar Ahmed temporal_pks_fks_v005.patch Description: Binary data
Re: concerns around pg_lsn
On Tue, Jul 30, 2019 at 4:52 AM Jeevan Ladhe wrote: > My only concern was something that we internally treat as invalid, why do > we allow, that as a valid value for that type. While I am not trying to > reinvent the wheel here, I am trying to understand if there had been any > idea behind this and I am missing it. Well, the word "invalid" can mean more than one thing. Something can be valid or invalid depending on context. I can't have -2 dollars in my wallet, but I could have -2 dollars in my bank account, because the bank will allow me to pay out slightly more money than I actually have on the idea that I will pay them back later (and with interest!). So as an amount of money in my wallet, -2 is invalid, but as an amount of money in my bank account, it is valid. 0/0 is not a valid LSN in the sense that (in current releases) we never write a WAL record there, but it's OK to compute with it. Subtracting '0/0'::pg_lsn seems useful as a way to convert an LSN to an absolute byte-index in the WAL stream. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
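Since an LSN is just a 64-bit byte position in the WAL stream, printed as two 32-bit hex halves separated by '/', the subtraction Robert describes can be illustrated with a small helper. This is an illustrative model only; the server implements this arithmetic for the pg_lsn type itself:

```python
# An LSN is a 64-bit byte position printed as 'hi/lo' in hex.  Subtracting
# '0/0' therefore yields the absolute byte index in the WAL stream, which is
# the useful computation mentioned above.  Illustrative helper only.

def lsn_to_int(lsn):
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff(a, b):
    """Model of a::pg_lsn - b::pg_lsn (result in bytes)."""
    return lsn_to_int(a) - lsn_to_int(b)

assert lsn_to_int("0/0") == 0
assert lsn_diff("0/16B374D8", "0/0") == 0x16B374D8
assert lsn_diff("1/0", "0/FFFFFFFF") == 1   # crossing a 4 GiB boundary
```

This also shows why '0/0' is perfectly fine to compute with even though no WAL record is ever written there: it is simply the zero of the address space.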
Re: [HACKERS] Restricting maximum keep segments by repslots
At Tue, 30 Jul 2019 21:30:45 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190730.213045.221405075.horikyota@gmail.com> > I attach the revised patch. I'll repost the polished version soon. (Mainly TAP tests and documentation; code comments will be revised.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Restricting maximum keep segments by repslots
Thanks for reviewing! At Thu, 27 Jun 2019 16:22:56 +0200, Jehan-Guillaume de Rorthais wrote in <20190627162256.4f4872b8@firost>

> Hi all,
>
> Being interested in this feature, I did a patch review.
>
> This feature adds the GUC "max_slot_wal_keep_size". This is the maximum amount of WAL that can be kept in "pg_wal" by active slots.
>
> If the amount of WAL exceeds this limit, the slot is deactivated and its status (a new field in pg_replication_slots) is set to "lost".

This patch doesn't deactivate the walsender. A walsender stops by itself when it finds that it cannot continue the ongoing replication.

> Patching
>
> The patch v13-0003 does not apply on HEAD anymore.
> The patch v13-0005 applies using "git am --ignore-space-change"
> Other patches apply correctly.
> Please, find attached the v14 set of patches rebased on master.

Sorry for missing this for a long time. It is hit by 67b9b3ca32 again, so I repost a rebased version.

> Documentation
> =
>
> The documentation explains the GUC and the related columns in "pg_replication_slots".
> It correctly reflects the current behavior of the patch.
>
> Usability
> =
>
> The patch implements what is described. It is easy to enable and disable. The GUC name describes its purpose correctly.
>
> This feature is useful in some HA scenarios where slots are required (e.g. no archiving possible), but where primary availability is more important than standbys.

Yes. Thanks for the clear explanation of the purpose.

> In the "pg_replication_slots" view, the new "wal_status" field is misleading. Consider this sentence and the related behavior from the documentation (catalogs.sgml):
>
> keeping means that some of them are to be removed by the next checkpoint.
>
> "keeping" appears when the current checkpoint will delete some WAL further than "current_lsn - max_slot_wal_keep_size", but still required by at least one slot.
> As some WAL required by some slots will be deleted quite soon, probably before anyone can react, the "keeping" status is misleading here. We are already in the red zone. It could be "losing", which would be less misleading.
> I would expect this "wal_status" to be:
>
> - streaming: slot lag between 0 and "max_wal_size"
> - keeping: slot lag between "max_wal_size" and "max_slot_wal_keep_size". The slot actually protects some WALs from being deleted
> - lost: slot lag greater than "max_slot_wal_keep_size". The slot couldn't protect some WAL from deletion

I agree that comparing to max_wal_size is meaningful. The revised version behaves that way.

> Documentation follows with:
>
> The last two states are seen only when max_slot_wal_keep_size is non-negative
>
> This is true with the current behavior. However, if "keeping" is set as soon as the slot lag is greater than "max_wal_size", this status could be useful even with "max_slot_wal_keep_size = -1". As soon as a slot is stacking WALs that should have been removed by a previous checkpoint, it "keeps" them.

I revised the documentation that way. Both view-pg-replication-slots.html and runtime-config-replication.html are reworded.

> Feature tests
> =
>
> I have played with various traffic-shaping setups between nodes, to observe how the columns "active", "wal_status" and "remain" behave with regard to each other, using:
>
> while true; do
>
> Finally, at least once the following message appeared in the primary's logs **before** the "wal_status" changed from "keeping" to "streaming":
>
> WARNING: some replication slots have lost required WAL segments
>
> So the slot lost one WAL, but the standby has been able to catch up anyway.

Thanks for the intensive test run. It is quite helpful.

> My humble opinion about these results:
>
> * after many different tests, the status "keeping" appears only when "remain" equals 0. In the current implementation, "keeping" really adds no value...

Hmm.
I agree, given that the "lost" (or "losing" in the patch) state is not a definite state; that is, the slot may recover from it.

> * "remain" should be NULL if "max_slot_wal_keep_size = -1" or if the slot isn't active

The revised version shows the following statuses:

- streaming / NULL: max_slot_wal_keep_size is -1
- unknown / NULL: max_slot_wal_keep_size >= 0 and restart_lsn is invalid
- the normal statuses otherwise

> * the "lost" status should be a definitive status
> * it seems related, but maybe the "wal_status" should be set as "lost" only when the slot has been deactivated?

Agreed. While replication is active, even if required segments seem to be lost once, a delayed walreceiver ack can advance restart_lsn into the "safe" zone later. So, in the revised version, if the segment for restart_lsn has been removed, GetLsnAvailability() returns "losing" if the walsender is active and "lost" if not.

> * logs should warn about a failing slot as soon as it is effectively deactivated, not before.
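The three-way classification proposed in the review (streaming / keeping / lost, based on the slot's lag versus max_wal_size and max_slot_wal_keep_size) can be sketched as follows. The thresholds, names, and the decision to keep "keeping" even with an unlimited max_slot_wal_keep_size all follow the review's description; this is an illustrative model, not the patch's code:

```python
# Sketch of the proposed wal_status classification, based on how far the
# slot's restart_lsn lags behind the current insert LSN.  Sizes in bytes;
# a negative max_slot_wal_keep_size means "no limit".  Illustrative only.

def wal_status(slot_lag, max_wal_size, max_slot_wal_keep_size):
    if slot_lag <= max_wal_size:
        return "streaming"   # WAL would be kept anyway by max_wal_size
    if max_slot_wal_keep_size < 0 or slot_lag <= max_slot_wal_keep_size:
        return "keeping"     # the slot actively protects WAL from removal
    return "lost"            # required WAL may already have been removed

MB = 1024 * 1024
assert wal_status(100 * MB, 1024 * MB, 4096 * MB) == "streaming"
assert wal_status(2000 * MB, 1024 * MB, 4096 * MB) == "keeping"
assert wal_status(5000 * MB, 1024 * MB, 4096 * MB) == "lost"
# With max_slot_wal_keep_size = -1, "keeping" is still informative:
assert wal_status(5000 * MB, 1024 * MB, -1) == "keeping"
```

The last assertion captures the reviewer's point that "keeping" is meaningful even without a configured limit, because the slot is then the only thing retaining those segments.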
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Tue, Jul 30, 2019 at 07:44:20AM -0400, Sehrope Sarkuni wrote: > On Mon, Jul 29, 2019 at 8:35 PM Bruce Momjian wrote:

> From the patch:
>
> /*
> ! * The initialization vector (IV) is used for page-level
> ! * encryption. We use the LSN and page number as the IV, and IV
> ! * values must never be reused since it is insecure. It is safe
> ! * to use the LSN on multiple pages in the same relation since
> ! * the page number is part of the IV. It is unsafe to reuse the
> ! * LSN in different relations because the page number might be
> ! * the same, and hence the IV. Therefore, we check here that
> ! * we don't have WAL records for different relations using the
> ! * same LSN.
> ! */
>
> If each relation file has its own derived key, the derived TDEK for that relation file, then there is no issue with reusing the same IV = LSN || Page Number. The TDEKs will be different so Key + IV will never collide.

So, this email explains that we are considering not using the relfilenode/tablespace/database to create a derived key per relation, but instead using the same key for all relations, because the IV will be unique per page across all relations: https://www.postgresql.org/message-id/20190729134442.2bxakegiqafxg...@momjian.us

There is talk of using a derived key with a constant to make sure all heap/index files use a different derived key than WAL, but I am not sure. This is related to whether the WAL IV and the per-heap/index IVs can collide. There are other emails in the thread that also discuss the topic. The issue is that we add a lot of complexity to other parts of the system (e.g. pg_upgrade, CREATE DATABASE, moving relations between tablespaces) to create a derived key, so we should make sure we need it before we do it.

> In general it's fine to use the same IV with different keys. Only reuse of Key + IV is a problem, and the entire set of possible counter values (IV + 0, IV + 1, ...) generated with a key must be unique.
> That's also why we must leave at least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be filled in with 0, 1, 2, ... for each 16-byte AES block on the page. If our per-page IV prefix used any of those bits then the counter could overflow into the next page's IV's range.

Agreed. Attached is an updated patch that checks only main relation forks, which I think are the only files we are going to encrypt, and it has better comments.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
new file mode 100644
index 3ec67d4..57f4d71
*** a/src/backend/access/transam/xloginsert.c
--- b/src/backend/access/transam/xloginsert.c
*** XLogResetInsertion(void)
*** 208,213 ****
--- 208,215 ----
  /*
   * Register a reference to a buffer with the WAL record being constructed.
   * This must be called for every page that the WAL-logged operation modifies.
+  * Because of page-level encryption, you cannot reference more than one
+  * RelFileNode in a WAL record; an Assert checks for that.
   */
  void
  XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
*** XLogRegisterBuffer(uint8 block_id, Buffe
*** 235,241 ****
  	/*
  	 * Check that this page hasn't already been registered with some other
! 	 * block_id.
  	 */
  #ifdef USE_ASSERT_CHECKING
  	{
--- 237,243 ----
  	/*
  	 * Check that this page hasn't already been registered with some other
! 	 * block_id, and check for different RelFileNodes in the same WAL record.
  	 */
  #ifdef USE_ASSERT_CHECKING
  	{
*** XLogRegisterBuffer(uint8 block_id, Buffe
*** 248,256 ****
--- 250,274 ----
  			if (i == block_id || !regbuf_old->in_use)
  				continue;

+ 			/* check for duplicate block numbers */
  			Assert(!RelFileNodeEquals(regbuf_old->rnode, regbuf->rnode) ||
  				   regbuf_old->forkno != regbuf->forkno ||
  				   regbuf_old->block != regbuf->block);
+
+ 			/*
+ 			 * The initialization vector (IV) is used for page-level
+ 			 * encryption.  We use the LSN and page number as the IV, and IV
+ 			 * values must never be reused since it is insecure.  It is safe
+ 			 * to use the LSN on multiple pages in the same relation since
+ 			 * the page number is part of the IV.  It is unsafe to reuse the
+ 			 * LSN in different relations because the page number might be
+ 			 * the same, and hence the IV.  Therefore, we check here that
+ 			 * we don't have WAL records for different relations using the
+ 			 * same LSN.  We only encrypt MAIN_FORKNUM files.
+ 			 */
+ 			Assert(RelFileNodeEquals(regbuf_old->rnode, regbuf->rnode) ||
+ 				   regbuf_old->forkno != MAIN_FORKNUM ||
+ 				   regbuf->forkno != MAIN_FORKNUM);
  		}
  	}
  #endif
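The IV scheme discussed in this thread — LSN plus block number as the per-page IV prefix, with the low bits left zero so AES-CTR can count the 16-byte blocks within the page — can be laid out roughly as below. The exact byte layout is an illustrative assumption, not the format from the patch:

```python
# Illustrative sketch of the per-page IV layout discussed above: a 64-bit
# LSN and a 32-bit block number form the IV prefix, and the trailing bits
# stay zero as the AES-CTR counter space for blocks within the page.

import math

PAGE_SIZE = 8192
AES_BLOCK_SIZE = 16
# Bits that must be reserved at the end of the IV for the in-page counter:
COUNTER_BITS = int(math.log2(PAGE_SIZE // AES_BLOCK_SIZE))

def make_page_iv(lsn, block_no):
    """128-bit IV: 64-bit LSN || 32-bit block number || 32 zero counter bits.
    (Hypothetical layout for illustration.)"""
    assert block_no < 2**32
    return lsn.to_bytes(8, "big") + block_no.to_bytes(4, "big") + bytes(4)

assert COUNTER_BITS == 9          # 8192 / 16 = 512 AES blocks per page
iv = make_page_iv(0x16B374D8, 42)
assert len(iv) == 16
assert iv[-4:] == bytes(4)        # low bits free for the block counter
# Same LSN but a different block number yields a different IV, which is
# exactly the property the Assert in the patch is protecting.
assert make_page_iv(0x16B374D8, 42) != make_page_iv(0x16B374D8, 43)
```

With this layout the counter for one page can never run into the next page's IV range, which is Sehrope's point about reserving log2(PAGE_SIZE / AES_BLOCK_SIZE) bits.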
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 29, 2019 at 8:18 PM Sehrope Sarkuni wrote: > On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada wrote:

> > > An argument could be made to push that problem upstream, i.e. let the supplier of the passphrase deal with the indirection. You would still need to verify the supplied passphrase/key is correct via something like authenticating against a stored MAC.
> >
> > So do we need the key for the MAC of the passphrase/key in order to verify?
>
> Yes. Any 128 or 256-bit value is a valid AES key, and any 16-byte input can be "decrypted" with it in both CTR and CBC mode; you'll just end up with garbage data if the key does not match. Verification of the key prior to usage (i.e. starting the DB and encrypting/decrypting data) is a must, as otherwise you'll end up with all kinds of corruption or data loss.

Do you mean that we encrypt a 16-byte input with the correct key, store it on disk, and then decrypt it with the user-supplied key and compare the result to the input data?

> From a single user-supplied passphrase you would derive the MDEK and compute a MAC (either using the same key or via a separate derived MDEK-MAC key). If the computed MAC matches the previously stored value then you know the MDEK is correct as well.

You meant KEK, not MDEK?

Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
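The verification scheme Sehrope describes — derive a key from the passphrase, store a MAC computed with it, and on startup recompute and compare — might be sketched like this. The KDF parameters and the context string are illustrative assumptions, not values from the thread:

```python
# Sketch of passphrase verification via a stored MAC, as discussed above:
# derive a key from the passphrase, compute an HMAC over a fixed context
# string, store it, and later recompute and compare in constant time.
# All parameters here (PBKDF2, iteration count, context) are illustrative.

import hashlib, hmac, os

def derive_kek(passphrase, salt):
    return hashlib.pbkdf2_hmac("sha256", passphrase, salt, 100_000)

def make_verifier(kek):
    """MAC over a fixed context string, stored alongside the salt."""
    return hmac.new(kek, b"key-check", hashlib.sha256).digest()

def verify_passphrase(passphrase, salt, stored_mac):
    kek = derive_kek(passphrase, salt)
    return hmac.compare_digest(make_verifier(kek), stored_mac)

salt = os.urandom(16)
stored = make_verifier(derive_kek(b"correct horse", salt))
assert verify_passphrase(b"correct horse", salt, stored)
assert not verify_passphrase(b"wrong pass", salt, stored)
```

This answers the "do we need a key for the MAC" question in the affirmative: the MAC key is itself derived from the passphrase, so a matching MAC proves the derived key (and hence the passphrase) is correct without ever storing the key.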