Re: Add "password_protocol" connection parameter to libpq
On Sat, 2019-08-10 at 10:24 +0800, Craig Ringer wrote:
> Before we go too far along with this, let's look at how other
> established protocols do things and the flaws that've been discovered
> in their approaches. If this isn't done with extreme care then
> there's a large risk of negating the benefits offered by adopting
> recent things like SCRAM.

Agreed. I'm happy to hear any proposals better informed by history.

> Frankly I kind of wish we could just use SASL, but there are many
> (many) reasons not to.

I'm curious what the reasons are not to use SASL; do you have a reference?

> Off the top of my head I can think of these risks:
>
> * Protocols that allow naïve pre-auth client/server auth negotiation
>   (e.g. by finding the overlap in exchanged supported auth-mode lists)
>   are subject to MiTM downgrade attacks where the attacker filters out
>   protocols it cannot intercept and break from the proposed
>   alternatives.

We already have the downgrade problem. That's what I'm trying to solve.

> * Protocols that specify a hierarchy tend to be inflexible and result
>   in hard-to-predict auth mode selections as the options grow. If my
>   app wants GSSAPI or SuperStrongAuth but doesn't accept SSPI, and the
>   hierarchy is GSSAPI > SSPI > SuperStrongAuth, it has to fall back to
>   a disconnect-and-retry model like now.

What do you mean by a "disconnect and retry model"?

I agree that hierarchies are unwieldy as the options grow. Then again, as options grow, we need new versions of the client to support them, and those new versions might offer more flexible ways to choose between them. Of course, we should try to think ahead to avoid needing to constantly change client connection syntax, but I'm just pointing out that it's not a one-way door.

> * Protocols that announce supported auth methods before any kind of
>   trust is established make life easier for vulnerability scanners and
>   worms

This discussion is about the client, so I don't see how vulnerability scanners are relevant.
Regards, Jeff Davis
Re: POC: converting Lists into arrays
On Sat, 10 Aug 2019 at 09:03, Tom Lane wrote: > > David Rowley writes: > > On Fri, 9 Aug 2019 at 09:55, Tom Lane wrote: > >> I still have hopes for getting rid of es_range_table_array though, > >> and will look at that tomorrow or so. > > > Yes, please. I've measured that to be quite an overhead with large > > partitioning setups. However, that was with some additional code which > > didn't lock partitions until it was ... well too late... as it > > turned out. But it seems pretty good to remove code that could be a > > future bottleneck if we ever manage to do something else with the > > locking of all partitions during UPDATE/DELETE. > > I poked at this, and attached is a patch, but again I'm not seeing > that there's any real performance-based argument for it. So far > as I can tell, if we've got a lot of RTEs in an executable plan, > the bulk of the startup time is going into lock (re) acquisition in > AcquirePlannerLocks, and/or permissions scanning in ExecCheckRTPerms; > both of those have to do work for every RTE including ones that > run-time pruning drops later on. ExecInitRangeTable just isn't on > the radar. In the code I tested with locally I ended up with a Bitmapset that marked which RTEs required permission checks so that ExecCheckRTPerms() could quickly skip RTEs with requiredPerms == 0. The Bitmapset was set in the planner. Note: expand_single_inheritance_child sets childrte->requiredPerms = 0, so there's nothing to do there for partitions, which is the most likely reason that the rtable list would be big. Sadly the locking is still a big overhead even with that fixed. Robert threw around some ideas in [1], but that seems like a pretty big project. I don't think removing future bottlenecks is such a bad idea if it can be done in such a way that the code remains clean. It may serve to increase our motivation later to solve the remaining issues. 
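The requiredPerms-skip idea described above — a planner-built Bitmapset letting ExecCheckRTPerms() visit only the RTEs that actually need a permission check — can be sketched roughly like this (a Python toy model of the idea, not the actual C code; the field and function names are illustrative):

```python
def build_perm_check_set(rtable):
    """Planner side: set one bit per RTE index that needs a permission check."""
    bms = 0
    for i, rte in enumerate(rtable):
        if rte["requiredPerms"] != 0:
            bms |= 1 << i
    return bms

def exec_check_rt_perms(rtable, bms):
    """Executor side: visit only the flagged RTEs instead of scanning all of them."""
    checked = []
    i = 0
    while bms >> i:
        if (bms >> i) & 1:
            checked.append(rtable[i]["name"])
        i += 1
    return checked

# Partitions expanded by the planner get requiredPerms = 0, so only the
# parent ever needs checking, no matter how many partitions the rtable holds.
rtable = ([{"name": "parent", "requiredPerms": 1}]
          + [{"name": "part_%d" % n, "requiredPerms": 0} for n in range(1000)])
print(exec_check_rt_perms(rtable, build_perm_check_set(rtable)))  # ['parent']
```

The point of the sketch is that the executor's cost becomes proportional to the number of set bits rather than the rtable length — which, as noted above, doesn't help with the lock reacquisition cost.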
We tend to go to greater lengths when there are more gains, and more gains are more easily visible by removing more bottlenecks. Another reason to remove the es_range_table_array is that the reason it was added in the first place is no longer valid. We'd never have added it if we had array-based lists back then. (Reading below, it looks like Alvaro agrees with this too) [1] https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Locale support
On Sat, Aug 10, 2019 at 11:50 AM Thomas Munro wrote:
> On Fri, Aug 9, 2019 at 6:15 PM Yonatan Misgan wrote:
> > Can I implement it as a locale support? When the user wants to change
> > lc_time = am_ET (Amharic, Ethiopia), the date and time representation of
> > the database system would be in the Ethiopian calendar.
>
> I'm not an expert in this stuff, but it seems to me that both the
> operating system and the date/time localisation code in PostgreSQL use
> Gregorian calendar logic, even though they can use local language
> names for them. That is, they allow you to display the French, Greek,
> Ethiopian ... names for the Gregorian months, but not a completely
> different calendar system. For example, according to my operating
> system:
> ...

Reading about that led me to the ICU model of calendars. Here's the C API (there are C++ and Java APIs too, but to maximise the possibility of producing something that could ever be part of core PostgreSQL, I'd stick to pure C):

http://userguide.icu-project.org/datetime/calendar
http://icu-project.org/apiref/icu4c/ucal_8h.html
http://icu-project.org/apiref/icu4c/udat_8h.html

It does in fact work using locales to select calendars as you were suggesting (unlike POSIX, or at least the POSIX systems I tried): it knows that am_ET is associated with the Ethiopic calendar (and likewise for nearly a dozen other calendars). When you call ucal_open() you have to say whether you want UCAL_TRADITIONAL (the traditional calendar associated with the locale) or UCAL_GREGORIAN. In the usual ICU fashion it lets you mix and match more explicitly, so you can use locale names like "fr_FR@calendar=buddhist". Then you can use the udat.h stuff to format dates and so forth.

So now I'm wondering if the best idea would be to write an extension that provides a handful of carefully crafted functions to expose the ICU calendar date/time formatting/parsing/manipulation stuff to SQL.
I think that should be doable in a way that is not specific to Ethiopic, so that users of Indian, Persian, Islamic etc calendars can also benefit from this. I'm not sure if it should use LC_TIME -- possibly not, because that would interfere with libc-based stuff and the locale names don't match up; that might only make sense if this facility completely replaced the built in date/time stuff, which seems unlikely. Perhaps you'd just want to pass in the complete ICU locale string to each of the functions, to_char_icu(current_date, 'Month', 'fr_FR@calendar=buddhist'), or perhaps you'd want a GUC to control it (extensions can have their own GUCs). I'm not sure. There has certainly been interest in exposing other bits of ICU to PostgreSQL, either in core or in extensions: collations, Unicode normalisation, and now this. Hmm. Another thing you might want to look into is whether the SQL standard has anything to say about non-Gregorian calendars, and whether DB2 has anything to support them (it certainly has ICU inside it, as does PostgreSQL (optionally), so I wonder if they've exposed this part of it to SQL and if so how). -- Thomas Munro https://enterprisedb.com
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Thu, Aug 8, 2019 at 10:17:53PM -0400, Sehrope Sarkuni wrote: > On Thu, Aug 8, 2019 at 2:16 PM Bruce Momjian wrote: > > On Wed, Aug 7, 2019 at 08:56:18AM -0400, Sehrope Sarkuni wrote: > > Simplest approach for derived keys would be to use immutable attributes > of the > > WAL files as an input to the key derivation. Something like HKDF(MDEK, > "WAL:" | > > So, I am thinking we should use "WAL:" for WAL and "REL:" for heap/index > files. > > > Sounds good. Any unique convention is fine. Main thing to keep in mind is that > they're directly tied to the master key so it's not possible to rotate them > without changing the master key. A recent email talked about using two different encryption keys for heap/index and WAL, which allows for future features, and allows for key rotation of the two independently. (I already stated how hard key rotation would be with WAL and pg_rewind.) > This is in contrast to saving a WDEK key to a file (similar to how the MDEK > key > would be saved) and unlocking it with the MDEK. That has more moving parts but > would allow that key to be independent of the MDEK. In a later message Stephen > refers to an example of a replica receiving encrypted WAL and applying it with > a different MDEK for the page buffers. That's doable with an independent WDEK. I assumed we would call a command on boot to get a passphrase, which would unlock the encryption keys. Is there more being described above? > > | timeline_id || wal_segment_num) should be fine for this as it is: > > I considered using the timeline in the nonce, but then remembered that > in timeline switch, we _copy_ the part of the old WAL up to the timeline > switch to the new timeline; see: > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/ > > backend/access/transam/xlog.c;h=f55352385732c6b0124eff5265462f3883fe7435;hb > =HEAD#l5502 > > * Initialize the starting WAL segment for the new timeline. 
> * If the switch
> * happens in the middle of a segment, copy data from the last WAL segment
> * of the old timeline up to the switch point, to the starting WAL segment
> * on the new timeline.
>
> We would need to decrypt/encrypt to do the copy, and I just wasn't sure of
> the value of the timeline in the nonce. One value to it is that if there is
> some WAL generated after the timeline switch in the old primary that isn't
> transferred, there would potentially be new data encrypted with the same
> key/nonce in the new primary, but if that WAL is not used, odds are it is
> gone/destroyed/inaccessible, or it would have been used during the
> switchover, so it didn't seem worth worrying about.
>
> One _big_ reason to add the timeline is if you had a standby that you
> recovered and rolled forward only to a specific transaction, then
> continued running it as a new primary. In that case, you would have
> different WAL encrypted with the same key/nonce, but that sounds the same
> as promoting two standbys, and we should just document not to do it.
>
> Maybe we need to consider this further.

> Good points. Yes, anything short of generating a new key at promotion time
> will have these issues. If we're not going to do that, there's no point in
> adding the timeline id if it does not change anything. I had thought only
> the combo was unique, but it sounds like the segment number is unique on
> its own. One thing I like about a unique per-file key is that it simplifies
> the IV generation (i.e. it can start at zero).

Uh, well, the segment number is unique, _except_ for a timeline switch, where the segment number exists in the old timeline file and in the new timeline file, though the new timeline file has more WAL because it has writes only in the new timeline. So, in a way, the WAL data is the same in the old and new timeline files, e.g.
00010001 and 00020001, but 00010001 stops at the timeline switch and 00020001 has more WAL data that represents cluster activity since the WAL switch, though the two files are both 16MB.

> What about discarding the rest of the WAL file at promotion and skipping to
> a new file? A random per-file key in the first page header would ensure
> that going forward all WAL data is encrypted differently. Combine that with
> independent WAL and MDEK keys and everything would be different between two
> replicas promoted from the same point.

Yes, we could do that, but I am hesitant to change the WAL format just for this, unless there is value to the random number. We already discussed that the LSN could be duplicated in different heap/index files, so we would still have issues.

> > A unique WDEK per WAL file that is derived from the segment number would
> > not have that problem. A unique key per-file means the IVs can all start at
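The per-file key derivation under discussion — HKDF over the MDEK with a "WAL:" or "REL:" label plus immutable file attributes — can be sketched as follows. This is an illustration with Python's stdlib, using HKDF-Expand only on the assumption that the MDEK is already uniformly random; the label layout and field widths are just the convention floated in this thread, not a settled design:

```python
import hashlib
import hmac

def hkdf_expand_sha256(key, info, length=32):
    """Minimal HKDF-Expand (RFC 5869) over an already-uniform master key."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(key, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

MDEK = bytes(32)  # placeholder; in reality unlocked via the passphrase command

def wal_key(mdek, timeline_id, segment_num):
    # "WAL:" || timeline_id || wal_segment_num, per the convention above
    info = b"WAL:" + timeline_id.to_bytes(4, "big") + segment_num.to_bytes(8, "big")
    return hkdf_expand_sha256(mdek, info)

def rel_key(mdek, relfilenode):
    # "REL:" prefix keeps heap/index keys in a separate derivation namespace
    return hkdf_expand_sha256(mdek, b"REL:" + relfilenode.to_bytes(4, "big"))
```

Derived keys are deterministic (any replica holding the MDEK can rederive them from the file name alone) and distinct across labels, which is exactly why they can't be rotated independently of the master key, as noted above.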
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Aug 9, 2019 at 05:01:47PM +0200, Tomas Vondra wrote: > I know there were proposals to keep it encrypted in shared buffers, but > I'm not sure that's what we'll end up doing (I have not followed the > recent discussion all that closely, though). There is no plan to encrypt shared buffers. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Add "password_protocol" connection parameter to libpq
On Fri, 9 Aug 2019 at 11:00, Michael Paquier wrote:
> On Thu, Aug 08, 2019 at 03:38:20PM -0700, Jeff Davis wrote:
> > Libpq doesn't have a way to control which password protocols are used.
> > For example, the client might expect the server to be using SCRAM, but
> > it actually ends up using plain password authentication instead.
>
> Thanks for working on this!
>
> > I'm not 100% happy with the name "password_protocol", but other names I
> > could think of seemed likely to cause confusion.
>
> What about auth_protocol then? It seems to me that it could be useful
> to have the restriction on AUTH_REQ_MD5 as well.
>
> > Sets the least-secure password protocol allowable when using password
> > authentication. Options are: "plaintext", "md5", "scram-sha-256", or
> > "scram-sha-256-plus".
>
> This makes it sound like there is a linear hierarchy among all those
> protocols, which is true in this case, but if the list of supported
> protocols is extended in the future it may not be.

Before we go too far along with this, let's look at how other established protocols do things and the flaws that've been discovered in their approaches. If this isn't done with extreme care then there's a large risk of negating the benefits offered by adopting recent things like SCRAM.

Frankly I kind of wish we could just use SASL, but there are many (many) reasons not to.

Off the top of my head I can think of these risks:

* Protocols that allow naïve pre-auth client/server auth negotiation (e.g. by finding the overlap in exchanged supported auth-mode lists) are subject to MiTM downgrade attacks where the attacker filters out protocols it cannot intercept and break from the proposed alternatives.

* Protocols that specify a hierarchy tend to be inflexible and result in hard-to-predict auth mode selections as the options grow.
If my app wants GSSAPI or SuperStrongAuth but doesn't accept SSPI, and the hierarchy is GSSAPI > SSPI > SuperStrongAuth, it has to fall back to a disconnect and retry model like now. * Protocols that announce supported auth methods before any kind of trust is established make life easier for vulnerability scanners and worms and I'm sure there are more when it comes to auth handshakes. -- Craig Ringer http://www.2ndQuadrant.com/ 2ndQuadrant - PostgreSQL Solutions for the Enterprise
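The first bullet — the MiTM downgrade against naive overlap-based negotiation — can be made concrete with a toy model (pure illustration; the method names and negotiation shape are hypothetical, not libpq's actual wire protocol):

```python
# Toy model of naive auth-list negotiation and an active MITM downgrade.

SERVER_METHODS = ["scram-sha-256-plus", "scram-sha-256", "md5", "plaintext"]
client_supported = ["scram-sha-256", "md5", "plaintext"]

def naive_client_pick(offered, supported):
    """Pick the first mutually supported method, trusting the server's list."""
    for m in offered:
        if m in supported:
            return m
    return None

def mitm_filter(offered):
    """An active attacker strips everything it cannot break."""
    return [m for m in offered if m == "plaintext"]

def strict_client_pick(offered, supported, minimum_ok):
    """A client with a local floor refuses anything below it."""
    m = naive_client_pick(offered, supported)
    return m if m in minimum_ok else None

# Honest exchange: both sides land on SCRAM.
print(naive_client_pick(SERVER_METHODS, client_supported))  # scram-sha-256

# MITM'd exchange: the naive client silently downgrades to plaintext...
print(naive_client_pick(mitm_filter(SERVER_METHODS), client_supported))

# ...while a client-side floor (the point of a password_protocol-style
# setting) aborts the connection instead of answering.
print(strict_client_pick(mitm_filter(SERVER_METHODS), client_supported,
                         minimum_ok={"scram-sha-256", "scram-sha-256-plus"}))
```

The defense is purely client-local policy: nothing the attacker removes from the server's list can lower what the client is willing to answer.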
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Thu, Aug 8, 2019 at 10:34:26PM -0400, Sehrope Sarkuni wrote: > On Thu, Aug 8, 2019 at 6:31 PM Stephen Frost wrote: > > Strictly speaking, that isn't actually crash recovery, it's physical > replication / HA, and while those are certainly nice to have it's no > guarantee that they're required or that you'd want to have the same keys > for them- conceptually, at least, you could have WAL with one key that > both sides know and then different keys for the actual data files, if we > go with the approach where the WAL is encrypted with one key and then > otherwise is plaintext. > > > I like the idea of separating the WAL key from the rest of the data files. > It'd all be unlocked by the MDEK and you'd still need derived keys per > WAL-file, but disconnecting all that from the data files solves a lot of the > problems with promoted replicas. > > This would complicate cloning a replica as using a different MDEK would > involve > decrypting / encrypting everything rather than just copying the files. Even if > that's not baked in a first version, the separation allows for eventually > supporting that. OK, I can get behind that idea. One cool idea would be for the WAL on primary and standbys to use the same WAL key, but to use different heap/index keys. When the standby is promoted, there would be a way for the WAL to start using a new encryption key, and the heap/index would already be using its own encryption key. Setting up such a system seems complicated. The big problem is that the base backup would use the primary key, unless we allowed pg_basebackup to decrypt/encrypt with a new heap/index key. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Aug 9, 2019 at 11:51:23PM +0900, Masahiko Sawada wrote:
> I agree that cluster-wide is simpler but I'm not sure that it
> meets real needs of users. One example is re-encryption: when a key
> leak happens, with cluster-wide encryption we end up re-encrypting the
> whole database regardless of the amount of sensitive user data in the
> database. I think that's a big constraint for users because it's
> common that the data that needs to be encrypted, such as a master
> table, doesn't account for a large portion of the database. That's
> one reason why I think finer-granularity encryption such as
> table/tablespace is required.
>
> And in terms of feature development, would we implement
> fine-granularity encryption in the future even if the first step is
> cluster-wide encryption? And would both TDEs encrypt the same kinds of
> database objects (i.e. only tables, indexes and WAL)? If so, how
> do users choose between them depending on the case?
>
> I imagined the case where we had cluster-wide encryption as the
> first TDE feature. We would enable TDE at initdb time by specifying a
> command-line parameter for TDE. TDE would then be enabled cluster-wide
> and all tables/indexes and WAL would automatically be encrypted. Then,
> if we want to implement finer-granularity encryption, how can we let
> users use it? WAL encryption and table/index encryption are enabled at
> the same time, but we want to enable encryption for particular
> tables/indexes after initdb. If cluster-wide encryption is something
> like a shortcut for encrypting all tables/indexes, I personally think
> that implementing the finer-granularity one first and then using it to
> achieve the coarser granularity would be easier.

I don't know how to move this thread forward, so I am going to try to explain how I view it. People want feature X and feature Y. Feature X seems straightforward to implement and use, and seems secure. Feature Y is none of those, so far.
When I explain that feature X is the direction we should go in for PG 13, people give more reasons they want feature Y, but don't give any details about how the problems of implementation, use, and security can be addressed. I just don't see that continuing to discuss feature Y with no details on how to implement it really helps, so I have no reply to these ideas. People can talk about whatever feature they want on these lists, but I have no way to help discussions that don't address facts.

I will close with what I have already stated, that what people want, and what we can reasonably accomplish, are two different things.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Global temporary tables
On Fri, 9 Aug 2019 at 22:07, Konstantin Knizhnik wrote: > > > Ok, here it is: global_private_temp-1.patch > Fantastic. I'll put that high on my queue. I'd love to see something like this get in. Doubly so if it brings us closer to being able to use temp tables on physical read replicas, though I know there are plenty of other barriers there (not least of which being temp tables using persistent txns not vtxids) Does it have a CF entry? > Also I have attached updated version of the global temp tables with shared > buffers - global_shared_temp-1.patch > Nice to see that split out. In addition to giving the first patch more hope of being committed this time around, it'll help with readability and testability too. To be clear, I have long wanted to see PostgreSQL have the "session" state abstraction you have implemented. I think it's really important for high client count OLTP workloads, working with the endless collection of ORMs out there, etc. So I'm all in favour of it in principle so long as it can be made to work reliably with limited performance impact on existing workloads and without making life lots harder when adding new core functionality, for extension authors etc. The same goes for built-in pooling. I think PostgreSQL has needed some sort of separation of "connection", "backend", "session" and "executor" for a long time and I'm glad to see you working on it. With that said: How do you intend to address the likelihood that this will cause performance regressions for existing workloads that use temp tables *without* relying on your session state and connection pooler? Consider workloads that use temp tables for mid-long txns where txn pooling is unimportant, where they also do plenty of read and write activity on persistent tables. Classic OLAP/DW stuff. 
e.g.:

* four clients, four backends, four connections, session-level connections that stay busy with minimal client sleeps
* All sessions run the same bench code
* transactions all read plenty of data from a medium to large persistent table (think fact tables, etc)
* transactions store a filtered, joined dataset with some pre-computed window results or something in temp tables
* benchmark workload makes big-ish temp tables to store intermediate data for its medium-length transactions
* transactions also write to some persistent relations, say to record their summarised results

How does it perform with and without your patch?

I'm concerned that:

* the extra buffer locking and various IPC may degrade performance of temp tables
* the temp table data in shared_buffers may put pressure on shared_buffers space, squeezing out cached pages for persistent tables that all sessions are sharing;
* the temp table data in shared_buffers may put pressure on shared_buffers space for dirty buffers, forcing writes of persistent tables out earlier, therefore reducing write-combining opportunities;

-- Craig Ringer http://www.2ndQuadrant.com/ 2ndQuadrant - PostgreSQL Solutions for the Enterprise
Re: Add "password_protocol" connection parameter to libpq
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Sat, 2019-08-10 at 00:17 +0300, Heikki Linnakangas wrote:
> > auth_methods = 'MITM, -password, -md5'
>
> Keep in mind this is client configuration, so something reasonable in
> postgresql.conf might not be so reasonable in the form:

Yeah, that's a really good point.

> postgresql://foo:secret@myhost/mydb?auth_methods=MITM%2C%20-password%2C%20-md5
>
> Another thing to consider is that there's less control configuring on
> the client than on the server. The server will send at most one
> authentication request based on its own rules, and all the client can
> do is either answer it, or disconnect. And the SSL stuff all happens
> before that, and won't use an authentication request message at all.

Note that GSSAPI encryption works the same as SSL in this regard.

Thanks,

Stephen
Re: Add "password_protocol" connection parameter to libpq
On Sat, 2019-08-10 at 00:17 +0300, Heikki Linnakangas wrote:
> This is a multi-dimensional problem. "channel_binding=require" is one
> way to prevent MITM attacks, but sslmode=verify-ca is another. (Does
> Kerberos also prevent MITM?) Or you might want to enable plaintext
> passwords over SSL, but not without SSL.
>
> I think we'll need something like the 'ssl_ciphers' GUC, where you can
> choose from a few reasonable default rules, but also enable/disable
> specific methods:
> ...
> auth_methods = 'MITM, -password, -md5'

Keep in mind this is client configuration, so something reasonable in postgresql.conf might not be so reasonable in the form:

postgresql://foo:secret@myhost/mydb?auth_methods=MITM%2C%20-password%2C%20-md5

Another thing to consider is that there's less control configuring on the client than on the server. The server will send at most one authentication request based on its own rules, and all the client can do is either answer it, or disconnect. And the SSL stuff all happens before that, and won't use an authentication request message at all.

Some protocols allow negotiation within them, like SASL, which gives the client a bit more freedom. But FE/BE doesn't allow for arbitrary subsets of authentication methods to be negotiated between client and server, so I'm worried trying to express it that way will just lead to clients that break when you upgrade your server.

Regards, Jeff Davis
Re: Locale support
On Fri, Aug 9, 2019 at 6:15 PM Yonatan Misgan wrote:
> Can I implement it as a locale support? When the user wants to change
> lc_time = am_ET (Amharic, Ethiopia), the date and time representation of
> the database system would be in the Ethiopian calendar.

Hi Yonatan,

I'm not an expert in this stuff, but it seems to me that both the operating system and the date/time localisation code in PostgreSQL use Gregorian calendar logic, even though they can use local language names for them. That is, they allow you to display the French, Greek, Ethiopian ... names for the Gregorian months, but not a completely different calendar system. For example, according to my operating system:

$ LC_TIME=en_GB.UTF-8 date
Sat 10 Aug 2019 10:58:42 NZST
$ LC_TIME=fr_FR.UTF-8 date
sam. 10 août 2019 10:58:48 NZST
$ LC_TIME=el_GR.UTF-8 date
Σάβ 10 Αυγ 2019 10:58:51 NZST
$ LC_TIME=am_ET.UTF-8 date
ቅዳሜ ኦገስ 10 10:58:55 NZST 2019

These all say it's Saturday (ቅዳሜ) the 10th of August (ኦገስ) on the Gregorian calendar.

Looking at POSIX date[1] you can see they contemplated the existence of non-Gregorian calendars in a very limited way, but no operating system I have access to does anything other than Gregorian with %x and %c:

"The date string formatting capabilities are intended for use in Gregorian-style calendars, possibly with a different starting year (or years). The %x and %c conversion specifications, however, are intended for local representation; these may be based on a different, non-Gregorian calendar."

PostgreSQL behaves the same way when you ask for the localised month in am_ET:

tmunro=> set lc_time to 'am_ET.UTF-8';
SET
tmunro=> select to_char(now(), 'TMMon');
 to_char
---------
 ኦገስ
(1 row)

This is hard coded into the system, as you can see from src/backend/utils/adt/pg_locale.c where the *twelve* month names are loaded into localized_abbrev_months and other similar arrays.
That's the first clue that this system can't handle the thirteen Ethiopian months, not to mention the maths required to work with them. That's why I think you need a new, different to_char() function (and probably more functions). In that skeleton code I posted, you can see I defined a function to_char(date, format, calendar) that takes a third argument, for the calendar name.

You might also wonder if that new function should respect the locale settings, but I'm not sure if it could in general; you'd have to be able to get (say) the Greek names for the Ethiopian calendar's months, which the OS won't be able to give you. Though perhaps you'd want some way to select between the Ethiopian script and the transliterations into Latin script, which would presumably be hard coded into the extension; I have no idea if that's useful to anyone...

BTW there have been earlier discussions of this:

https://www.postgresql.org/message-id/flat/CAM7dW9iBXDJuwZrEXW%2Bdsa_%3Dew%3D%2BFdv7mcF51nQLGSkTkQp2MQ%40mail.gmail.com

It shows that Apple has an Objective-C NSCalendar class that understands Ethiopian, Persian, Hebrew, ... calendars, which made me wonder if LC_TIME=am_ET.UTF-8 would trigger something special on a Mac, but nope, its libc still just gives Ethiopian names for Gregorian months as I expected.

[1] http://pubs.opengroup.org/onlinepubs/009695399/utilities/date.html

-- Thomas Munro https://enterprisedb.com
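To make the thirteen-month arithmetic concrete, here is a rough sketch of Ethiopic-to-Gregorian date conversion via Julian Day Numbers (an illustration only; an actual extension would lean on ICU's ucal.h/udat.h rather than hand-rolled math):

```python
ETHIOPIC_EPOCH = 1724221  # JDN of Mäskäräm 1, year 1 (Amete Mihret era)

def ethiopic_to_jdn(year, month, day):
    # Twelve 30-day months plus Pagume (5 days, or 6 when year % 4 == 3).
    return (ETHIOPIC_EPOCH + 365 * (year - 1) + year // 4
            + 30 * (month - 1) + day - 1)

def jdn_to_gregorian(jdn):
    # Standard Gregorian-from-JDN integer algorithm.
    a = jdn + 32044
    b = (4 * a + 3) // 146097
    c = a - 146097 * b // 4
    d = (4 * c + 3) // 1461
    e = c - 1461 * d // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return year, month, day

# Ethiopian New Year 2012 fell on 12 September 2019:
print(jdn_to_gregorian(ethiopic_to_jdn(2012, 1, 1)))   # (2019, 9, 12)
# The 6th day of the 13th month (Pagume) of leap year 2011 was the day before:
print(jdn_to_gregorian(ethiopic_to_jdn(2011, 13, 6)))  # (2019, 9, 11)
```

This is exactly the kind of logic that can't be retrofitted onto the twelve-entry month-name arrays in pg_locale.c, which is why a separate function (or extension) is the natural home for it.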
Shrinking tuplesort.c's SortTuple struct (Was: More ideas for speeding up sorting)
On Fri, Sep 9, 2016 at 6:14 AM Heikki Linnakangas wrote: > 1. SortTuple.tupindex is not used when the sort fits in memory. If we > also got rid of the isnull1 flag, we could shrink SortTuple from 24 > bytes to 16 bytes (on 64-bit systems). That would allow you to pack more > tuples into memory, which generally speeds things up. We could for > example steal the least-significant bit from the 'tuple' pointer for > that, because the pointers are always at least 4-byte aligned. (But see > next point) I implemented what you sketched here, back in 2016. That is, I found a way to make SortTuples 16 bytes on 64-bit platforms, down from 24 bytes (24 bytes includes alignment padding). Note that it only became possible to do this after commit 8b304b8b72b removed replacement selection sort, which reduced the number of things that use the SortTuple.tupindex field to just one: it is now only used for tapenum during merging. (I also had to remove SortTuple.isnull1 to get down to 16 bytes.) The easy part was removing SortTuple.tupindex itself -- it was fairly natural to stash that in the slab allocation for each tape. I used the aset.c trick of having a metadata "chunk" immediately prior to address that represents the allocation proper -- we can just back up by a few bytes from stup.tuple to find the place to stash the tape number during merging. The worst thing about this change was that it makes a tape slab allocation mandatory in cases that previously didn't have any need for a stup.tuple allocation (e.g. datum tuplesorts of pass-by-value types), though only during merging. Since we must always store the tapenum when merging, we always need a slab buffer for each tape when merging. This aspect wasn't so bad. The hard/ugly part was getting rid of the remaining "unnecessary" SortTuple field, isnull1. This involved squeezing an extra bit out of the stup.tuple pointer, by stealing the least-significant bit. This was invasive in about the way you'd expect it to be. 
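The bit-stealing trick described above — folding isnull1 into the least-significant bit of a tuple pointer that is always at least 4-byte aligned — can be sketched like this (Python for illustration, modelling the pointer as an integer address; the real patch does this in C on SortTuple):

```python
def pack_tuple_and_null(addr, isnull1):
    """Store the isnull1 flag in the low bit of an aligned 'pointer'."""
    assert addr % 4 == 0, "palloc'd tuples are at least 4-byte aligned"
    return addr | (1 if isnull1 else 0)

def unpack_tuple(word):
    return word & ~1          # mask the flag off before "dereferencing"

def unpack_isnull1(word):
    return bool(word & 1)

word = pack_tuple_and_null(0x7f3a9c00, True)
print(hex(unpack_tuple(word)), unpack_isnull1(word))  # 0x7f3a9c00 True
```

The cost, as the rest of the message describes, is that every place that reads or writes either field must go through accessors like these — which is what makes the change invasive.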
It wasn't awful, but it also wasn't something I'd countenance pursuing without getting a fairly noticeable benefit for users. (Actually, the code that I wrote so far *is* pretty awful, but I could certainly clean it up some more if I felt like it.)

I think that the rough patch that I came up with gives us an accurate picture of what the benefits of having SortTuples that are only 16 bytes wide are. The benefits seem kind of underwhelming at this point. For something like a "SELECT COUNT(distinct my_int4_col) FROM tab" query, which uses the qsort_ssup() qsort specialization, we can easily go from getting an external sort to getting an internal sort. We can maybe end up sorting about 20% faster if things really work out for the patch. But in cases that users really care about, such as REINDEX, the difference is in the noise. ISTM that this is simply not worth the trouble at this time. These days, external sorts are often slightly faster than internal sorts in practice, due to the fact that we can do an on-the-fly merge with external sorts, so we could easily hurt performance by making more memory available!

I don't want to completely close the door on the idea of shrinking SortTuple to 16 bytes. I can imagine a world in which that matters. For example, perhaps there will one day be a strong case to be made for SIMD-based internal sort algorithms for simple cases, such as the "COUNT(distinct my_int4_col)" query case that I mentioned. Probably SIMD-based multiway merge sort. I understand that such algorithms are very sensitive to things like SIMD CPU register sizes, and were only considered plausible competitors to quicksort due to the advent of 512-bit SIMD registers. 512-bit SIMD registers haven't been available in mainstream CPUs for that long.

I have to admit that this SIMD sorting stuff seems like a bit of a stretch, though. The papers in this area all seem to make rather optimistic assumptions about the distribution of values.
And, I think we'd have to be even more aggressive about shrinking SortTuples in order to realize the benefits of SIMD-based sorting. Besides, sorting itself is the bottleneck for tuplesort-using operations less and less these days -- the only remaining interesting bottleneck is probably in code like index_form_tuple(), which is probably a good target for JIT. In general, it's much harder to make tuplesort.c noticeably faster than it used to be -- we've picked all the low-hanging fruit. -- Peter Geoghegan
Re: POC: converting Lists into arrays
On 2019-Aug-09, Tom Lane wrote:
> I poked at this, and attached is a patch, but again I'm not seeing
> that there's any real performance-based argument for it. So far
> as I can tell, if we've got a lot of RTEs in an executable plan,
> the bulk of the startup time is going into lock (re) acquisition in
> AcquirePlannerLocks, and/or permissions scanning in ExecCheckRTPerms;
> both of those have to do work for every RTE including ones that
> run-time pruning drops later on. ExecInitRangeTable just isn't on
> the radar.

I'm confused. I thought that the point of doing this wasn't that we wanted to improve performance, but rather that we're now able to remove the array without *losing* performance. I mean, those arrays were there to improve performance for code that wanted fast access to specific list items, but now we have fast access to the list items without it. So a measurement that finds no performance difference is good news, and we can get rid of the now-pointless optimization ...

-- 
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Add "password_protocol" connection parameter to libpq
On Fri, 2019-08-09 at 16:27 -0400, Jonathan S. Katz wrote:
> Seems to be a lot to configure. I'm more of a fan of the previous
> method; it'd work nicely with how we've presently defined things and
> should be easy to put into a DSN/URI/env variable.

Proposals on the table:

1. Hierarchical semantics, where you specify the least-secure acceptable method:

   password_protocol = {any,md5,scram-sha-256,scram-sha-256-plus}

2. Comma-list approach, where you specify exactly which protocols are acceptable, or "any" to mean that we don't care.

3. Three-setting approach:

   channel_binding = {disable|prefer|require}
   password_plaintext = {disable|enable}
   password_md5 = {disable|enable}

It looks like Jonathan prefers #1. #1 seems direct and clearly applies today, and corresponds to auth methods on the server side. I'm not a fan of #2; it seems likely to result in a bunch of clients with overly-specific lists of things with long names that can never really go away. #3 is a little more abstract, but also seems more future-proof, and may tie in to what Stephen is talking about with respect to controlling auth methods from the client, or moving more protocols within SASL.

Regards, Jeff Davis
Re: Add "password_protocol" connection parameter to libpq
On 09/08/2019 23:27, Jonathan S. Katz wrote:
> On 8/9/19 11:51 AM, Jeff Davis wrote:
>> On Fri, 2019-08-09 at 09:28 -0400, Stephen Frost wrote:
>>> Having an 'any' option, as mentioned before, could be an alternative
>>> though.
>>> ...
>>> I agree with the point that there isn't any guarantee that it'll
>>> always be clear-cut as to which of two methods is "better".
>>>
>>> From a user perspective, it seems like the main things are "don't
>>> send my password in the clear to the server", and "require channel
>>> binding to prove there isn't a MITM". I have to admit that I like
>>> the idea of requiring scram to be used and not allowing md5 though.
>>
>> So it seems like we are leaning toward:
>>
>> password_protocol = any | {plaintext,md5,scram-sha-256,scram-sha-256-plus}[,...]
>
> First, thanks for proposing / working on this, I like the idea! :)
> Happy to test/review.
>
> As long as this one can handle the current upgrade path that's in
> place for going from md5 to SCRAM (which AIUI it should) this makes
> sense to me.
>
> As stated above, there is a clear hierarchy. I would almost argue that
> "plaintext" shouldn't even be an option...if you have "any" set
> (arguably default?) then plaintext is available. With our currently
> supported versions + driver ecosystem, I hope no one needs to support
> a forced plaintext setup.

Keep in mind that RADIUS, LDAP and PAM authentication methods are 'plaintext' over the wire. It's not that bad, when used with sslmode=verify-ca/full.

>> Or maybe:
>>
>> channel_binding = {disable|prefer|require}
>> password_plaintext = {disable|enable}
>> password_md5 = {disable|enable}
>>
>> That seems reasonable. It's three options, but no normal use case
>> would need to set more than two, because channel binding forces
>> scram-sha-256-plus.
>
> Seems to be a lot to configure. I'm more of a fan of the previous
> method; it'd work nicely with how we've presently defined things and
> should be easy to put into a DSN/URI/env variable.

This is a multi-dimensional problem.
"channel_binding=require" is one way to prevent MITM attacks, but sslmode=verify-ca is another. (Does Kerberos also prevent MITM?) Or you might want to enable plaintext passwords over SSL, but not without SSL. I think we'll need something like the 'ssl_ciphers' GUC, where you can choose from a few reasonable default rules, but also enable/disable specific methods: # anything goes (the default) auth_methods = 'ANY' # Disable plaintext password authentication. Anything else is accepted. auth_methods = '-password' # Only authentication methods that protect from # Man-in-the-Middle attacks. This allows anything if the server's SSL # certificate can be verified, and otherwise only SCRAM with # channel binding auth_methods = 'MITM' # The same, but disable plaintext passwords and md5 altogether auth_methods = 'MITM, -password, -md5' I'm tempted to also allow 'SSL' and 'SSL-verify-full' as part of the same string, so that you could configure everything related to connection security in the same option. Not sure if that would make things simpler for users, or create more confusion. - Heikki
Re: POC: converting Lists into arrays
David Rowley writes: > On Fri, 9 Aug 2019 at 09:55, Tom Lane wrote: >> I still have hopes for getting rid of es_range_table_array though, >> and will look at that tomorrow or so. > Yes, please. I've measured that to be quite an overhead with large > partitioning setups. However, that was with some additional code which > didn't lock partitions until it was ... well too late... as it > turned out. But it seems pretty good to remove code that could be a > future bottleneck if we ever manage to do something else with the > locking of all partitions during UPDATE/DELETE. I poked at this, and attached is a patch, but again I'm not seeing that there's any real performance-based argument for it. So far as I can tell, if we've got a lot of RTEs in an executable plan, the bulk of the startup time is going into lock (re) acquisition in AcquirePlannerLocks, and/or permissions scanning in ExecCheckRTPerms; both of those have to do work for every RTE including ones that run-time pruning drops later on. ExecInitRangeTable just isn't on the radar. If we wanted to try to improve things further, it seems like we'd have to find a way to not lock unreferenced partitions at all, as you suggest above. But combining that with run-time pruning seems like it'd be pretty horrid from a system structural standpoint: if we acquire locks only during execution, what happens if we find we must invalidate the query plan? Anyway, the attached might be worth committing just on cleanliness grounds, to avoid two-sources-of-truth issues in the executor. But it seems like there's no additional performance win here after all ... unless you've got a test case that shows differently? 
regards, tom lane

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbd7dd9..7f494ab 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2790,7 +2790,6 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	estate->es_snapshot = parentestate->es_snapshot;
 	estate->es_crosscheck_snapshot = parentestate->es_crosscheck_snapshot;
 	estate->es_range_table = parentestate->es_range_table;
-	estate->es_range_table_array = parentestate->es_range_table_array;
 	estate->es_range_table_size = parentestate->es_range_table_size;
 	estate->es_relations = parentestate->es_relations;
 	estate->es_queryEnv = parentestate->es_queryEnv;
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index c1fc0d5..afd9beb 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -113,7 +113,6 @@ CreateExecutorState(void)
 	estate->es_snapshot = InvalidSnapshot;	/* caller must initialize this */
 	estate->es_crosscheck_snapshot = InvalidSnapshot;	/* no crosscheck */
 	estate->es_range_table = NIL;
-	estate->es_range_table_array = NULL;
 	estate->es_range_table_size = 0;
 	estate->es_relations = NULL;
 	estate->es_rowmarks = NULL;
@@ -720,29 +719,17 @@ ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags)
  * ExecInitRangeTable
  *		Set up executor's range-table-related data
  *
- * We build an array from the range table list to allow faster lookup by RTI.
- * (The es_range_table field is now somewhat redundant, but we keep it to
- * avoid breaking external code unnecessarily.)
- * This is also a convenient place to set up the parallel es_relations array.
+ * In addition to the range table proper, initialize arrays that are
+ * indexed by rangetable index.
  */
 void
 ExecInitRangeTable(EState *estate, List *rangeTable)
 {
-	Index		rti;
-	ListCell   *lc;
-
 	/* Remember the range table List as-is */
 	estate->es_range_table = rangeTable;

-	/* Set up the equivalent array representation */
+	/* Set size of associated arrays */
 	estate->es_range_table_size = list_length(rangeTable);
-	estate->es_range_table_array = (RangeTblEntry **)
-		palloc(estate->es_range_table_size * sizeof(RangeTblEntry *));
-	rti = 0;
-	foreach(lc, rangeTable)
-	{
-		estate->es_range_table_array[rti++] = lfirst_node(RangeTblEntry, lc);
-	}

 	/*
 	 * Allocate an array to store an open Relation corresponding to each
@@ -753,8 +740,8 @@ ExecInitRangeTable(EState *estate, List *rangeTable)
 		palloc0(estate->es_range_table_size * sizeof(Relation));

 	/*
-	 * es_rowmarks is also parallel to the es_range_table_array, but it's
-	 * allocated only if needed.
+	 * es_rowmarks is also parallel to the es_range_table, but it's allocated
+	 * only if needed.
 	 */
 	estate->es_rowmarks = NULL;
 }
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 1fb28b4..39c8b3b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -535,8 +535,7 @@ extern void ExecInitRangeTable(EState *estate, List *rangeTable);
 static inline RangeTblEntry *
 exec_rt_fetch(Index rti, EState *estate)
 {
-	Assert(rti > 0 && rti <= estate->es_range_table_size);
-	return estate->es_range_table_array[
Re: Add "password_protocol" connection parameter to libpq
On 8/9/19 11:51 AM, Jeff Davis wrote:
> On Fri, 2019-08-09 at 09:28 -0400, Stephen Frost wrote:
>> Having an 'any' option, as mentioned before, could be an alternative
>> though.
>
> ...
>
>> I agree with the point that there isn't any guarantee that it'll
>> always be clear-cut as to which of two methods is "better".
>>
>> From a user perspective, it seems like the main things are "don't
>> send my password in the clear to the server", and "require channel
>> binding to prove there isn't a MITM". I have to admit that I like
>> the idea of requiring scram to be used and not allowing md5 though.
>
> So it seems like we are leaning toward:
>
> password_protocol = any | {plaintext,md5,scram-sha-256,scram-sha-256-plus}[,...]

First, thanks for proposing / working on this, I like the idea! :) Happy to test/review.

As long as this one can handle the current upgrade path that's in place for going from md5 to SCRAM (which AIUI it should) this makes sense to me.

As stated above, there is a clear hierarchy. I would almost argue that "plaintext" shouldn't even be an option...if you have "any" set (arguably default?) then plaintext is available. With our currently supported versions + driver ecosystem, I hope no one needs to support a forced plaintext setup.

> Or maybe:
>
> channel_binding = {disable|prefer|require}
> password_plaintext = {disable|enable}
> password_md5 = {disable|enable}
>
> That seems reasonable. It's three options, but no normal use case would
> need to set more than two, because channel binding forces
> scram-sha-256-plus.

Seems to be a lot to configure. I'm more of a fan of the previous method; it'd work nicely with how we've presently defined things and should be easy to put into a DSN/URI/env variable.

Thanks, Jonathan
Re: Some memory not freed at the exit of RelationBuildPartitionDesc()
On Thu, Aug 08, 2019 at 05:42:21PM +0900, Amit Langote wrote:
> On Thu, Aug 8, 2019 at 5:33 PM amul sul wrote:
> > On Thu, Aug 8, 2019 at 1:27 PM Amit Langote wrote:
> >> Thanks for the patch. This was discussed recently in the "hyrax vs.
> >> RelationBuildPartitionDesc()" thread [1] and I think Alvaro proposed
> >> an approach that's similar to yours. Not sure why it wasn't pursued
> >> though. Maybe the reason is buried somewhere in that discussion.
> >
> > Oh, quite similar, thanks Amit for pointing that out.
> >
> > Looks like "hyrax vs. RelationBuildPartitionDesc()" is in discussion for
> > the master branch only, not sure though, but we need a similar fix for
> > the back branches as well.
>
> Well, this is not a bug as such, so it's very unlikely that a fix like
> this will be back-patched. Also, if this becomes an issue only for
> more than 1000 partitions, then it's not very relevant for PG 10
> and PG 11, because we don't recommend using so many partitions with
> them. Maybe we can consider fixing PG 12 though.

A fix for the thousands-of-partitions case would be very welcome for 12.

Best,
David.

-- 
David Fetter http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: block-level incremental backup
Hi Robert,

On Fri, Aug 9, 2019 at 6:40 PM Robert Haas wrote:
> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe wrote:
> > + if (!XLogRecPtrIsInvalid(previous_lsn))
> > + appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
> > + (uint32) (previous_lsn >> 32), (uint32) previous_lsn);
> >
> > Maybe we should rename to something like:
> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
> > to make it more intuitive?
>
> So, I think that you are right that PREVIOUS WAL LOCATION might not be
> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
> LOCATION is definitely not clear. This backup is an incremental
> backup, and it has a start WAL location, so you'd end up with START
> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
> like they ought to both be the same thing, but they're not. Perhaps
> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
> INCREMENTAL BACKUP would be clearer.

Agreed; how about INCREMENTAL BACKUP REFERENCE WAL LOCATION?

> > File header structure is defined in both the files basebackup.c and
> > pg_combinebackup.c. I think it is better to move this to
> > replication/basebackup.h.
>
> Or some other header, but yeah, definitely don't duplicate the struct
> definition (or any other kind of definition).

Thanks.

> > IMHO, while labels are not advisable in general, it may be better to
> > use a label here rather than a while(1) loop, so that we can move to
> > the label in case we want to retry once. I think here it opens doors
> > for future bugs if someone happens to add code here, ending up adding
> > some condition and then the break becomes conditional. That will leave
> > us in an infinite loop.
>
> I'm not sure which style is better here, but I don't really buy this
> argument.

No issues. I am ok either way.

Regards,
Jeevan Ladhe
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Aug 7, 2019 at 6:57 AM Heikki Linnakangas wrote: > Yeah, that's also a problem with complicated WAL record types. Hopefully > the complex cases are an exception, not the norm. A complex case is > unlikely to fit any pre-defined set of fields anyway. (We could look at > how e.g. protobuf works, if this is really a big problem. I'm not > suggesting that we add a dependency just for this, but there might be > some patterns or interfaces that we could mimic.) I think what you're calling the complex cases are going to be pretty normal cases, not something exotic, but I do agree with you that making the infrastructure more generic is worth considering. One idea I had is to use the facilities from pqformat.h; have the generic code read whatever the common fields are, and then pass the StringInfo to the AM which can do whatever it wants with the rest of the record, but probably these facilities would make it pretty easy to handle either a series of fixed-length fields or alternatively variable-length data. What do you think of that idea? (That would not preclude doing compression on top, although I think that feeding everything through pglz or even lz4/snappy may eat more CPU cycles than we can really afford. The option is there, though.) > If you remember, we did a big WAL format refactoring in 9.5, which moved > some information from AM-specific structs to the common headers. Namely, > the information on the relation blocks that the WAL record applies to. > That was a very handy refactoring, and allowed tools like pg_waldump to > print more detailed information about all WAL record types. For WAL > records, moving the block information was natural, because there was > special handling for full-page images anyway. However, I don't think we > have enough experience with UNDO log yet, to know which fields would be > best to include in the common undo header, and which to leave as > AM-specific payload. 
> I think we should keep the common header slim, and
> delegate to the AM routines.

Yeah, I remember. I'm not really sure I totally buy your argument that we don't know what besides XID should go into an undo record: tuples are a pretty important concept, and although there might be some exceptions here and there, I have a hard time imagining that undo is going to be primarily about anything other than identifying a tuple and recording something you did to it. On the other hand, you might want to identify several tuples, or identify a tuple with a TID that's not 6 bytes, so that's a good reason for allowing more flexibility.

Another point in favor of being more flexible is that it's not clear that there's any use case for third-party tools that work using undo. WAL drives replication and logical decoding and could be used to drive incremental backup, but it's not really clear that similar applications exist for undo. If it's just private to the AM, the AM might as well be responsible for it. If that leads to code duplication, we can create a library of common routines and AM users can use them if they want.

> Hmm. If you're following an UNDO chain, from newest to oldest, I would
> assume that the newer record has enough information to decide whether
> you need to look at the previous record. If the previous record is no
> longer interesting, it might already be discarded away, after all.

I actually thought zedstore might need this pattern. If you store an XID with each undo pointer, as the current zheap code mostly does, then you have enough information to decide whether you care about the previous undo record before you fetch it. But if a tuple stores only an undo pointer, then once you determine that the undo isn't discarded, you have to fetch the record first and only then possibly decide that you had the right version in the first place.
Now, maybe that pattern doesn't repeat, because the undo records could be set up to contain both an XMIN and an XMAX, but not necessarily. I don't know exactly what you have in mind, but it doesn't seem totally crazy that an undo record might contain the XID that created that version but not the XID that created the prior version, and if so, you'll iterate backwards until you either hit the end of undo or go one undo record past the version you can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add "password_protocol" connection parameter to libpq
On Fri, 2019-08-09 at 09:28 -0400, Stephen Frost wrote:
> Having an 'any' option, as mentioned before, could be an alternative
> though.

...

> I agree with the point that there isn't any guarantee that it'll
> always be clear-cut as to which of two methods is "better".
>
> From a user perspective, it seems like the main things are "don't
> send my password in the clear to the server", and "require channel
> binding to prove there isn't a MITM". I have to admit that I like the
> idea of requiring scram to be used and not allowing md5 though.

So it seems like we are leaning toward:

password_protocol = any | {plaintext,md5,scram-sha-256,scram-sha-256-plus}[,...]

Or maybe:

channel_binding = {disable|prefer|require}
password_plaintext = {disable|enable}
password_md5 = {disable|enable}

That seems reasonable. It's three options, but no normal use case would need to set more than two, because channel binding forces scram-sha-256-plus.

Regards, Jeff Davis
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Aug 09, 2019 at 11:51:23PM +0900, Masahiko Sawada wrote:
> On Fri, Aug 9, 2019 at 10:25 AM Bruce Momjian wrote:
>> On Thu, Aug 8, 2019 at 06:31:42PM -0400, Stephen Frost wrote:
>>>>> Crash recovery doesn't happen "all the time" and neither does vacuum
>>>>> freeze, and autovacuum processes are independent of individual client
>>>>> backends- we don't need to (and shouldn't) have the keys in shared
>>>>> memory.
>>>>
>>>> Don't people do physical replication / HA pretty much all the time?
>>>
>>> Strictly speaking, that isn't actually crash recovery, it's physical
>>> replication / HA, and while those are certainly nice to have it's no
>>> guarantee that they're required or that you'd want to have the same keys
>>> for them- conceptually, at least, you could have WAL with one key that
>>> both sides know and then different keys for the actual data files, if we
>>> go with the approach where the WAL is encrypted with one key and then
>>> otherwise is plaintext.
>>
>> Uh, yes, you could have two encryption keys in the data directory, one
>> for heap/indexes, one for WAL, both unlocked with the same passphrase,
>> but what would be the value in that?
>>
>>>>>> That might allow crash recovery and the freeze part of VACUUM FREEZE to
>>>>>> work. (I don't think we could vacuum since we couldn't read the index
>>>>>> pages to find the matching rows since the index values would be encrypted
>>>>>> too. We might be able to not encrypt the tid in the index tuple.)
>>>>>
>>>>> Why do we need the indexed values to vacuum the index..? We don't
>>>>> today, as I recall. We would need the tids though, yes.
>>>>
>>>> Well, we also do collect statistics on the data, for example. But even
>>>> if we assume we wouldn't do that for encrypted indexes (which seems like
>>>> a pretty bad idea to me), you'd probably end up leaking information
>>>> about ordering of the values. Which is generally a pretty serious
>>>> information leak, AFAICS.
>>>
>>> I agree entirely that order information would be bad to leak- but this
>>> is all new ground here and we haven't actually sorted out what such a
>>> partially encrypted btree would look like. We don't actually have to
>>> have the down-links in the tree be unencrypted to allow vacuuming of
>>> leaf pages, after all.
>>
>> Agreed, but I think we kind of know that the value in cluster-wide
>> encryption is different from multi-key encryption --- both have their
>> value, but right now cluster-wide is the easiest and simplest, and
>> probably meets more user needs than multi-key encryption. If others
>> want to start scoping out what multi-key encryption would look like, we
>> can discuss it. I personally would like to focus on cluster-wide
>> encryption for PG 13.
>
> I agree that cluster-wide is simpler, but I'm not sure that it meets
> real needs from users. One example is re-encryption; when key leakage
> happens, with cluster-wide encryption we end up re-encrypting the whole
> database regardless of the amount of user-sensitive data in the
> database. I think it's a big constraint for users because it's common
> that the data that needs to be encrypted, such as a master table,
> doesn't account for a large portion of the database. That's one reason
> why I think finer-granularity encryption such as table/tablespace is
> required.

TBH I think it's mostly pointless to design for key leakage. My understanding is that all this work is motivated by the assumption that Bob can obtain access to the data directory (say, a backup of it). So if he also manages to get access to the encryption key, we probably have to assume he already has access to a current snapshot of the data directory, which means any re-encryption is pretty futile. What we can (and should) optimize for is key rotation, but as that only changes the master key and not the actual encryption keys, the overhead is pretty low.

We can of course support "forced" re-encryption, but I think it's acceptable if that's fairly expensive as long as it can be throttled and executed in the background (kinda similar to the patch to enable checksums in the background).

> And in terms of feature development, would we implement
> fine-granularity encryption in the future even if the first step is
> cluster-wide encryption? And would both TDEs encrypt the same kinds of
> database objects (i.e. only tables, indexes and WAL)? If so, how would
> users use them depending on the case? I imagined the case where we had
> the cluster-wide encryption as the first TDE feature. We will enable
> TDE at initdb time by specifying the command-line parameter for TDE.
> Then TDE is enabled cluster-wide and all tables/indexes and WAL are
> automatically encrypted. Then, if we want to implement finer-granularity
> encryption, how can we make users use it? WAL encryption and
> tables/index encryption are enabled at the same time, but we want to
> enable encryption for particular tables/indexes after initdb. If the
> cluster-wide encryption is something like a short-cut of encrypting all
> tables/indexes, I personally think that implementing the more fine gra
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Thu, Aug 08, 2019 at 06:31:42PM -0400, Stephen Frost wrote: Greetings, * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote: On Thu, Aug 08, 2019 at 03:07:59PM -0400, Stephen Frost wrote: >* Bruce Momjian (br...@momjian.us) wrote: >>On Tue, Jul 9, 2019 at 11:09:01AM -0400, Bruce Momjian wrote: >>> On Tue, Jul 9, 2019 at 10:59:12AM -0400, Stephen Frost wrote: >>> > * Bruce Momjian (br...@momjian.us) wrote: >>> > I agree that all of that isn't necessary for an initial implementation, >>> > I was rather trying to lay out how we could improve on this in the >>> > future and why having the keying done at a tablespace level makes sense >>> > initially because we can then potentially move forward with further >>> > segregation to improve the situation. I do believe it's also useful in >>> > its own right, to be clear, just not as nice since a compromised backend >>> > could still get access to data in shared buffers that it really >>> > shouldn't be able to, even broadly, see. >>> >>> I think TDE is feature of questionable value at best and the idea that >>> we would fundmentally change the internals of Postgres to add more >>> features to it seems very unlikely. I realize we have to discuss it so >>> we don't block reasonable future feature development. >> >>I have a new crazy idea. I know we concluded that allowing multiple >>independent keys, e.g., per user, per table, didn't make sense since >>they have to be unlocked all the time, e.g., for crash recovery and >>vacuum freeze. > >I'm a bit confused as I never agreed that made any sense and I continue >to feel that it doesn't make sense to have one key for everything. > >Crash recovery doesn't happen "all the time" and neither does vacuum >freeze, and autovacuum processes are independent of individual client >backends- we don't need to (and shouldn't) have the keys in shared >memory. Don't people do physical replication / HA pretty much all the time? 
Strictly speaking, that isn't actually crash recovery, it's physical replication / HA, and while those are certainly nice to have it's no guarantee that they're required or that you'd want to have the same keys for them- conceptually, at least, you could have WAL with one key that both sides know and then different keys for the actual data files, if we go with the approach where the WAL is encrypted with one key and then otherwise is plaintext.

Uh? IMHO not breaking physical replication / HA should be pretty much required for any new feature, unless it's somehow obviously clear that it's not needed for that particular feature. I very much doubt we can make that conclusion for encrypted instances (at least I don't see why it would be the case in general). One reason is that those features are also used for backups, which I hope we both agree is not an optional feature. Maybe it's possible to modify pg_basebackup to re-encrypt all the data, but to do that it clearly needs to know all encryption keys (although not necessarily on the same side).

>>However, that assumes that all heap/index pages are encrypted, and all
>>of WAL. What if we encrypted only the user-data part of the page, i.e.,
>>tuple data. We left xmin/xmax unencrypted, and only stored the
>>encrypted part of that data in WAL, and didn't encrypt any more of WAL.

>This is pretty much what Alvaro was suggesting a while ago, isn't it..?
>Have just the user data be encrypted in the table and in the WAL stream.

It's also moving us much closer to pgcrypto-style encryption ...

Yes, it is, and there's good parts and bad parts to that, to be sure.

>>That might allow crash recovery and the freeze part of VACUUM FREEZE to
>>work. (I don't think we could vacuum since we couldn't read the index
>>pages to find the matching rows since the index values would be encrypted
>>too. We might be able to not encrypt the tid in the index tuple.)

>Why do we need the indexed values to vacuum the index..? 
We don't >today, as I recall. We would need the tids though, yes. Well, we also do collect statistics on the data, for example. But even if we assume we wouldn't do that for encrypted indexes (which seems like a pretty bad idea to me), you'd probably end up leaking information about ordering of the values. Which is generally a pretty serious information leak, AFAICS. I agree entirely that order information would be bad to leak- but this is all new ground here and we haven't actually sorted out what such a partially encrypted btree would look like. We don't actually have to have the down-links in the tree be unencrypted to allow vacuuming of leaf pages, after all. Well, I'm not all that familiar with the btree code, but I still think you can deduce an awful amount of information from having the leaf pages alone (not sure if we could deduce a total order, but presumably yes). >>Is this something considering in version one of this feature? Probably >>not, but later? Never? Would the information leakage be too great, >>particularly from
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Aug 9, 2019 at 10:25 AM Bruce Momjian wrote:
>
> On Thu, Aug 8, 2019 at 06:31:42PM -0400, Stephen Frost wrote:
> > > >Crash recovery doesn't happen "all the time" and neither does vacuum
> > > >freeze, and autovacuum processes are independent of individual client
> > > >backends- we don't need to (and shouldn't) have the keys in shared
> > > >memory.
> > >
> > > Don't people do physical replication / HA pretty much all the time?
> >
> > Strictly speaking, that isn't actually crash recovery, it's physical
> > replication / HA, and while those are certainly nice to have it's no
> > guarantee that they're required or that you'd want to have the same keys
> > for them- conceptually, at least, you could have WAL with one key that
> > both sides know and then different keys for the actual data files, if we
> > go with the approach where the WAL is encrypted with one key and then
> > otherwise is plaintext.
>
> Uh, yes, you could have two encryption keys in the data directory, one
> for heap/indexes, one for WAL, both unlocked with the same passphrase,
> but what would be the value in that?
>
> > > >>That might allow crash recovery and the freeze part of VACUUM FREEZE to
> > > >>work. (I don't think we could vacuum since we couldn't read the index
> > > >>pages to find the matching rows since the index values would be
> > > >>encrypted
> > > >>too. We might be able to not encrypt the tid in the index tuple.)
> > > >
> > > >Why do we need the indexed values to vacuum the index..? We don't
> > > >today, as I recall. We would need the tids though, yes.
> > >
> > > Well, we also do collect statistics on the data, for example. But even
> > > if we assume we wouldn't do that for encrypted indexes (which seems like
> > > a pretty bad idea to me), you'd probably end up leaking information
> > > about ordering of the values. Which is generally a pretty serious
> > > information leak, AFAICS.
> > > > I agree entirely that order information would be bad to leak- but this > > is all new ground here and we haven't actually sorted out what such a > > partially encrypted btree would look like. We don't actually have to > > have the down-links in the tree be unencrypted to allow vacuuming of > > leaf pages, after all. > > Agreed, but I think we kind of know that the value in cluster-wide > encryption is different from multi-key encryption --- both have their > value, but right now cluster-wide is the easiest and simplest, and > probably meets more user needs than multi-key encryption. If others > want to start scoping out what multi-key encryption would look like, we > can discuss it. I personally would like to focus on cluster-wide > encryption for PG 13. I agree that cluster-wide is simpler, but I'm not sure that it meets real user needs. One example is re-encryption: if a key leaks, with cluster-wide encryption we end up re-encrypting the whole database regardless of how much of it is actually sensitive. I think that's a big constraint for users, because it's common that the data that needs to be encrypted, such as a master table, doesn't account for a large portion of the database. That's one reason why I think finer-grained encryption, such as per-table or per-tablespace, is required. And in terms of feature development, would we implement fine-grained encryption in the future even if the first step is cluster-wide encryption? And would both TDEs encrypt the same kinds of database objects (i.e. only tables, indexes and WAL)? If so, how would users choose between them depending on the case? I imagined the case where we had cluster-wide encryption as the first TDE feature: we enable TDE at initdb time by specifying a command-line parameter, TDE is then enabled cluster-wide, and all tables/indexes and WAL are automatically encrypted. Then, if we later want to implement finer-grained encryption, how can we let users use it? 
WAL encryption and table/index encryption would be enabled at the same time, but we might want to enable encryption for particular tables/indexes after initdb. If cluster-wide encryption is essentially a shortcut for encrypting all tables/indexes, I personally think that implementing the finer-grained feature first, and then using it to achieve the coarser granularity, would be easier. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Problem with default partition pruning
On 2019-Aug-09, Amit Langote wrote: > Hmm, oops. I think that judgement was a bit too rushed on my part. I > unintentionally ended up making the partition constraint to *always* > be fetched, whereas we don't need it in most cases. I've reverted > that change. Yeah, I was quite confused about this point yesterday while I was trying to make sense of your patches. > RelOptInfo.partition_qual is poorly named in retrospect. > :( It's not set for all partitions, only those that are partitioned > themselves. Oh. Hmm, I think this realization further clarifies things. Since we're only changing this in the master branch anyway, maybe we can find a better name for it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SQL/JSON path: collation for comparisons, minor typos in docs
On Thu, Aug 8, 2019 at 11:30 PM Alexander Korotkov wrote: > On Thu, Aug 8, 2019 at 11:53 AM Markus Winand wrote: > > The patch makes my tests pass. > > Cool. > > > I wonder about a few things: > > > > - Isn’t there any code that could be re-used for that (the one triggered by > > ‘a’ < ‘A’ COLLATE ucs_basic)? > > PostgreSQL supports ucs_basic, but it's alias to C collation and works > only for utf-8. Jsonpath code may work in different encodings. New > string comparison code can work in different encodings. > > > - For object key members, the standard also refers to unicode code point > > collation (SQL-2:2016 4.46.3, last paragraph). > > - I guess it also applies to the “starts with” predicate, but I cannot find > > this explicitly stated in the standard. > > For object keys we don't actually care about whether strings are less > or greater. We only search for equal keys. So, per-byte comparison > we currently use should be fine. The same states for "starts with" > predicate. > > > My tests check whether those cases do case-sensitive comparisons. With my > > default collation "en_US.UTF-8” I cannot discover potential issues there. I > > haven’t played around with nondeterministic ICU collations yet :( > > That's OK. There should be other beta testers around :) So, I'm going to push this if no objections. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Rethinking opclass member checks and dependency strength
> > But none of our contrib modules do it like that, and I'd lay long odds > against any third party code doing it either. Thoughts? > PostGIS has some rarely used box operators as part of its GiST opclass, like "overabove". These are a source of misunderstanding, as they hinge on the fact that non-box geometry is coerced into a box on the call, which is not obvious when you call them on something like diagonal linestrings. It may happen that we decide to remove them. In such circumstances, I expect that ALTER OPERATOR CLASS DROP OPERATOR will work. Another thing I would expect is that DROP FUNCTION ... CASCADE will remove the operator and unregister it from the operator class without dropping the operator class itself.
Re: Global temporary tables
On 09.08.2019 8:34, Craig Ringer wrote: On Thu, 8 Aug 2019 at 15:03, Konstantin Knizhnik wrote: On 08.08.2019 5:40, Craig Ringer wrote: On Tue, 6 Aug 2019 at 16:32, Konstantin Knizhnik wrote: New version of the patch with several fixes is attached. Many thanks to Roman Zharkov for testing. FWIW I still don't understand your argument with regards to using shared_buffers for temp tables having connection pooling benefits. Are you assuming the presence of some other extension in your extended version of PostgreSQL ? In community PostgreSQL a temp table's contents in one backend will not be visible in another backend. So if your connection pooler in transaction pooling mode runs txn 1 on backend 42 and it populates temp table X, then the pooler runs the same app session's txn 2 on backend 45, the contents of temp table X are not visible anymore. Certainly here I mean built-in connection pooler which is not currently present in Postgres, but it is part of PgPRO-EE and there is my patch for vanilla at commitfest: https://commitfest.postgresql.org/24/2067 OK, that's what I assumed. You're trying to treat this change as if it's a given that the other functionality you want/propose is present in core or will be present in core. That's far from given. My suggestion is to split it up so that the parts can be reviewed and committed separately. In PgPRO-EE this problem was solved by binding session to backend. I.e. one backend can manage multiple sessions, but session can not migrate to another backend. The drawback of such solution is obvious: one long living transaction can block transactions of all other sessions scheduled to this backend. Possibility to migrate session to another backend is one of the obvious solutions of the problem. But the main show stopper for it is temporary tables. This is why I consider moving temporary tables to shared buffers as very important step. 
I can see why it's important for your use case. I am not disagreeing. I am however strongly suggesting that your patch has two fairly distinct functional changes in it, and you should separate them out.

* Introduce global temp tables, a new relkind that works like a temp table but doesn't require catalog changes. Uses per-backend relfilenode and cleanup like existing temp tables. You could extend the relmapper to handle the mapping of relation oid to per-backend relfilenode.
* Associate global temp tables with session state and manage them in shared_buffers so they can work with the in-core connection pooler (if committed)

Historically we've had a few efforts to get in-core connection pooling that haven't gone anywhere. Without your pooler patch the changes you make to use shared_buffers etc are going to be unhelpful at best, if not actively harmful to performance, and will add unnecessary complexity. So I think there's a logical series of patches here:

* global temp table relkind and support for it
* session state separation
* connection pooling
* pooler-friendly temp tables in shared_buffers

Make sense? But even if we forget about built-in connection pooler, don't you think that possibility to use parallel query plans for temporary tables is itself strong enough motivation to access global temp table through shared buffers? I can see a way to share temp tables across parallel query backends being very useful for DW/OLAP workloads, yes. But I don't know if putting them in shared_buffers is the right answer for that. We have DSM/DSA, we have shm_mq, various options for making temp buffers share-able with parallel worker backends. I'm suggesting that you not tie the whole (very useful) global temp tables feature to this, but instead split it up into logical units that can be understood, reviewed and committed separately. I would gladly participate in review. 
Ok, here it is: global_private_temp-1.patch Also I have attached updated version of the global temp tables with shared buffers - global_shared_temp-1.patch It is certainly larger (~2k lines vs. 1.5k lines) because it is changing BufferTag and related functions. But I do not think that this difference is so critical. Still have a wish to kill two birds with one stone:) -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 1bd579f..2d93f6f 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -153,9 +153,9 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) buf_state = LockBufHdr(bufHdr); fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr); - fctx->record[i].relfilenode = bufHdr->tag.
Re: Store FullTransactionId in TwoPhaseFileHeader/GlobalTransactionData
On Wed, Aug 7, 2019 at 6:56 AM vignesh C wrote: > Going by the discussion shall we conclude that we don't need to > convert the subxids into fxid's as part of this fix. > Let me know if any further changes need to be done. I'm not sure, but I think the prior question is whether we want this patch at all, and I'm not sure we've achieved consensus on that. Thomas's point, at least as I understand it, is that if we start doing 32-bit => 64-bit XID conversions all over the place, it's not going to be long before some incautious developer inserts one that is not actually safe. On the other hand, Andres's point, at least as I understand it, is that putting information that we don't really need into the twophase state file because of some overly rigid coding rule is not smart. Both of those arguments sound right to me, but they lead to opposite conclusions. I am somewhat inclined to Andres's conclusion on balance. I think that we can probably define a set of policies about 32 => 64 bit XID conversions both in terms of when you can do them and what comments you have to include justifying them and how the API actually works that makes it safe. It might help to think about defining the API in terms of a reference FullTransactionId that must be OLDER than the XID you're promoting to an FXID. For instance, we know that all of the relfrozenxid and datfrozenxids are from the current era because we've got freezing machinery to enforce that. So if you got a tuple from the heap, its XID has got to be new, at least modulo bugs. In other cases, we may be able to say, hey, look, this XID can't be from before the CLOG cutoff. I'm not sure of all the details here, but I'm tentatively inclined to think that trying to lay down policies for when promotion can be done safely is more promising than holding our breath and saying we're never going to promote. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add "password_protocol" connection parameter to libpq
Greetings, * Michael Paquier (mich...@paquier.xyz) wrote: > On Thu, Aug 08, 2019 at 11:16:24PM -0700, Jeff Davis wrote: > > On Fri, 2019-08-09 at 12:00 +0900, Michael Paquier wrote: > > > What about auth_protocol then? It seems to me that it could be > > > useful > > > to have the restriction on AUTH_REQ_MD5 as well. > > > > auth_protocol does sound like a good name. I'm not sure what you mean > > regarding MD5 though. I don't really care for auth_protocol as that's pretty close to "auth_method" and that isn't what we're talking about here- this isn't the user picking the auth method, per se, but rather saying which of the password-based mechanisms for communicating that the user knows the password is acceptable. Letting users choose which auth methods are allowed might also be interesting (as in- we are in a Kerberized environment and therefore no client should ever be using any auth method except GSS, could be a reasonable ask) but it's not the same thing. > Sorry, I meant krb5 here. What restriction are you suggesting here wrt krb5..? > > We already have that concept to a lesser extent, with the md5 > > authentication method also permitting scram-sha-256. > > That's present to ease upgrades, and once the AUTH_REQ part is > received the client knows what it needs to go through. I don't think there's any question that, of the methods available to prove that you know what the password is, simply sending the password to the server as cleartext is the least secure. If I, as a user, decide that I don't really care what method is used, it's certainly simpler to just say 'plaintext' than to have to list out every possible option. Having an 'any' option, as mentioned before, could be an alternative though. I agree with the point that there isn't any guarantee that it'll always be clear-cut as to which of two methods is "better". 
From a user perspective, it seems like the main things are "don't send my password in the clear to the server", and "require channel binding to prove there isn't a MITM". I have to admit that I like the idea of requiring scram to be used and not allowing md5 though. > > That sounds good, but there are a lot of possibilities and I can't > > quite decide which way to go. > > > > We could expose it as an SASL option like: > > > >saslmode = {disable|prefer|require-scram-sha-256|require-scram-sha- > > 256-plus} > > Or we could shape password_protocol so as it takes a list of > protocols, as a white list of authorized things in short. I'm not really a fan of "saslmode" or anything else involving SASL for this because we don't *really* do SASL- if we did, we'd have that as an auth method and we'd have our Kerberos support be through SASL along with potentially everything else... I'm not against going there but I don't think that's what you were suggesting here. Thanks, Stephen
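To make the competing shapes concrete, here is a purely hypothetical illustration; neither the parameter syntax nor the host name below exists anywhere yet, both are assumptions drawn only from the proposals quoted above:

```
# Single strongest-acceptable-mechanism form (Jeff's proposal):
psql "host=db.example.com dbname=app password_protocol=scram-sha-256-plus"

# Whitelist form (Michael's suggestion): accept either SCRAM variant,
# but never cleartext or md5:
psql "host=db.example.com dbname=app password_protocol=scram-sha-256,scram-sha-256-plus"
```

The whitelist form avoids baking a fixed hierarchy into the client, at the cost of users having to enumerate what they consider acceptable.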
Re: block-level incremental backup
On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe wrote: > + if (!XLogRecPtrIsInvalid(previous_lsn)) > + appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n", > +(uint32) (previous_lsn >> 32), (uint32) > previous_lsn); > > May be we should rename to something like: > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START > LOCATION" > to make it more intuitive? So, I think that you are right that PREVIOUS WAL LOCATION might not be entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL LOCATION is definitely not clear. This backup is an incremental backup, and it has a start WAL location, so you'd end up with START WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound like they ought to both be the same thing, but they're not. Perhaps something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR INCREMENTAL BACKUP would be clearer. > File header structure is defined in both the files basebackup.c and > pg_combinebackup.c. I think it is better to move this to > replication/basebackup.h. Or some other header, but yeah, definitely don't duplicate the struct definition (or any other kind of definition). > IMHO, while labels are not advisable in general, it may be better to use a > label > here rather than a while(1) loop, so that we can move to the label in case we > want to retry once. I think here it opens doors for future bugs if someone > happens to add code here, ending up adding some condition and then the > break becomes conditional. That will leave us in an infinite loop. I'm not sure which style is better here, but I don't really buy this argument. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: block-level incremental backup
On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke wrote: > So, do you mean we should just do fread() and fwrite() for the whole file? > > I thought it is better if it was done by the OS itself instead of reading 1GB > into the memory and writing the same to the file. Well, 'cp' is just a C program. If they can write code to copy a file, so can we, and then we're not dependent on 'cp' being installed, working properly, being in the user's path or at the hard-coded pathname we expect, etc. There's an existing copy_file() function in src/backend/storage/file/copydir.c which I'd probably look into adapting for frontend use. I'm not sure whether it would be important to adapt the data-flushing code that's present in that routine or whether we could get by with just the loop to read() and write() data. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: SegFault on 9.6.14
On Wed, Aug 7, 2019 at 5:45 AM vignesh C wrote: > I have made a patch based on the above lines. > I have tested the scenarios which Thomas had shared in the earlier > mail and few more tests based on Thomas's tests. > I'm not sure if we will be going ahead with this solution or not. > Let me know your opinion on the same. > If you feel this approach is ok, we can add few of this tests into pg tests. I think this patch is bizarre:

- It introduces a new function called getParallelModeLevel(), which is randomly capitalized differently from the other functions that do similar things, and then uses it to do the same thing that could have been done with the existing function IsInParallelMode().

- It contains an "if" statement whose only content is an Assert(). Don't write if (a) Assert(b); write Assert(!a || b).

- It contains zero lines of comment changes, which is obviously not enough for a patch that proposes to fix a very thorny issue. This failure has two parts. First, it adds no new comments to explain the bug being fixed or the theory of operation of the new code. Second, it does not even bother updating existing comments that are falsified by the patch, such as the function header comments for ExecParallelCleanup and ExecShutdownGather.

- It changes what ExecParallelCleanup does while adjusting only one of the two callers to match the behavior change. nodeGatherMerge.c manages to be completely untouched by this patch. If you change what a function does, you really need to grep for all the calls to that function and adjust all callers to match the new set of expectations. 
It's a little hard to get past all of those issues and look at what the patch actually does, but I'm going to try: the theory of operation of the patch seems to be that we can skip destroying the parallel context when performing ExecParallelCleanup and in fact when exiting parallel mode, and then when we get to executor end time the context will still be there and we can fish the instrumentation out of it. But this seems problematic for several reasons. For one thing, as Amit already observed, the code currently contains an assertion which ensures that a ParallelContext can't outlive the time spent in parallel mode, and it doesn't seem desirable to relax that assertion (this patch removes it). But beyond that, the issue here is that the Limit node is shutting down the Gather node too early, and the right fix must be to stop doing that, not to change the definition of what it means to shut down a node, as this patch does. So maybe a possible approach here - which I think is more or less what Tom is proposing - is:

1. Remove the code from ExecLimit() that calls ExecShutdownNode().

2. Adjust ExecutePlan() so that it ensures that ExecShutdownNode() gets called at the very end of execution, at least when execute_once is set, before exiting parallel mode.

3. Figure out, possibly at a later time or only in HEAD, how to make the early call to ExecShutdownNode() from ExecLimit() safe, and then put it back.

I think we could do this by passing down some information indicating which nodes are potentially rescanned by other nodes higher up in the tree; there's the separate question of whether rescans can happen due to cursor operations, but the execute_once stuff can handle that aspect of it, I think. I'm not quite sure that approach is altogether correct so I'd appreciate some analysis on that point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add "password_protocol" connection parameter to libpq
On Thu, Aug 08, 2019 at 11:16:24PM -0700, Jeff Davis wrote: > On Fri, 2019-08-09 at 12:00 +0900, Michael Paquier wrote: > > What about auth_protocol then? It seems to me that it could be > > useful > > to have the restriction on AUTH_REQ_MD5 as well. > > auth_protocol does sound like a good name. I'm not sure what you mean > regarding MD5 though. Sorry, I meant krb5 here. > We already have that concept to a lesser extent, with the md5 > authentication method also permitting scram-sha-256. That's present to ease upgrades, and once the AUTH_REQ part is received the client knows what it needs to go through. > That sounds good, but there are a lot of possibilities and I can't > quite decide which way to go. > > We could expose it as an SASL option like: > >saslmode = {disable|prefer|require-scram-sha-256|require-scram-sha- > 256-plus} Or we could shape password_protocol so that it takes a list of protocols, in short a whitelist of authorized mechanisms. -- Michael
Re: POC: Cleaning up orphaned files using undo logs
On Fri, Jul 26, 2019 at 9:57 PM Amit Khandekar wrote: > > On Fri, 26 Jul 2019 at 12:25, Amit Kapila wrote: > > I agree with all your other comments. > > Thanks for addressing the comments. Below is the continuation of my > comments from 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch > : > > > + * Perform rollback request. We need to connect to the database for first > + * request and that is required because we access system tables while > for first request and that is required => for the first request. This > is required > Changed as per suggestion. > --- > > +UndoLauncherShmemSize(void) > +{ > +Sizesize; > + > +/* > + * Need the fixed struct and the array of LogicalRepWorker. > + */ > +size = sizeof(UndoApplyCtxStruct); > > The fixed structure size should be offsetof(UndoApplyCtxStruct, > workers) rather than sizeof(UndoApplyCtxStruct) > > --- > Why? I see similar code in ApplyLauncherShmemSize. If there is any problem with this, then we might have a similar problem in the existing code as well. > In UndoWorkerCleanup(), we set individual fields of the > UndoApplyWorker structure, whereas in UndoLauncherShmemInit(), for all > the UndoApplyWorker array elements, we just memset all the > UndoApplyWorker structure elements to 0. I think we should be > consistent for the two cases. I guess we can just memset to 0 as you > do in UndoLauncherShmemInit(), but this will cause the > worker->undo_worker_queue to be 0 i.e. XID_QUEUE , whereas in > UndoWorkerCleanup(), it is set to -1. Is the -1 value essential, or we > can just set it to XID_QUEUE initially ? Either is fine, because before launching the worker we set the valid value. It is better to set it as InvalidUndoWorkerQueue though. > Also, if we just use memset in UndoWorkerCleanup(), we need to first > save generation into a temp variable, and then after memset(), restore > it back. > This sounds like unnecessary trickery. 
> That brought me to another point : > We already have a macro ResetUndoRequestInfo(), so UndoWorkerCleanup() > can just call ResetUndoRequestInfo(). > > Hmm, both (UndoRequestInfo and UndoApplyWorker) are separate structures, so how can we reuse them? > +boolallow_peek; > + > +CHECK_FOR_INTERRUPTS(); > + > +allow_peek = !TimestampDifferenceExceeds(started_at, > Some comments would be good about what is allow_peek used for. Something > like : > "Arrange to prevent the worker from restarting quickly to switch databases" > Added a slightly different comment. > - > +++ b/src/backend/access/undo/README.UndoProcessing > - > > +worker then start reading from one of the queues the requests for that > start=>starts > --- > > +work, it lingers for UNDO_WORKER_LINGER_MS (10s as default). This avoids > As per the latest definition, it is 20s. IMHO, there's no need to > mention the default value in the readme. > --- > > +++ b/src/backend/access/undo/discardworker.c > --- > > + * portion of transaction that is overflowed into a separate log can > be processed > 80-col crossed. > > +#include "access/undodiscard.h" > +#include "access/discardworker.h" > Not in alphabetical order > Fixed all of the above four points. > > +++ b/src/backend/access/undo/undodiscard.c > --- > > +next_insert = UndoLogGetNextInsertPtr(logno); > I checked UndoLogGetNextInsertPtr() definition. It calls > find_undo_log_slot() to get back the slot from logno. Why not make it > accept slot as against logno ? At all other places, the slot->logno is > passed, so it is convenient to just pass the slot there. And in > UndoDiscardOneLog(), first call find_undo_log_slot() just before the > above line (or call it at the end of the do-while loop). > I am not sure if this is a good idea because find_undo_log_slot is purely undolog-module internal; exposing it outside doesn't seem like a good idea to me. 
> This way, > during each of the UndoLogGetNextInsertPtr() calls in undorequest.c, > we will have one less find_undo_log_slot() call. > I am not sure there is any performance benefit either because there is a cache for the slots and it should get from there very quickly. I think we can avoid this repeated call and I have done so in the attached patch. > - > > In UndoDiscardOneLog(), there are at least 2 variable declarations > that can be moved inside the do-while loop : uur and next_insert. I am > not sure about the other variables viz : undofxid and > latest_discardxid. Values of these variables in one iteration continue > across to the second iteration. For latest_discardxid, it looks like > we do want its value to be carried forward, but is it also true for > undofxid ? > undofxid can be moved inside the loop, fixed that and other variables pointed out by you. > + /* If we reach here, this means there is something to discard. */ > + need_di
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar wrote: > > > > Some further review comments for undoworker.c : > > > +/* Sets the worker's lingering status. */ > +static void > +UndoWorkerIsLingering(bool sleep) > The function name sounds like "is the worker lingering ?". Can we > rename it to something like "UndoWorkerSetLingering" ? > makes sense, changed as per suggestion. > - > > + errmsg("undo worker slot %d is empty, cannot attach", > + slot))); > > + } > + > + if (MyUndoWorker->proc) > + { > + LWLockRelease(UndoWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("undo worker slot %d is already used by " > + "another worker, cannot attach", slot))); > > These two error messages can have a common error message "could not > attach to worker slot", with errdetail separate for each of them : > slot %d is empty. > slot %d is already used by another worker. > > -- > Changed as per suggestion. > +static int > +IsUndoWorkerAvailable(void) > +{ > + int i; > + int alive_workers = 0; > + > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > > Should have bool return value. > > Also, why is it keeping track of number of alive workers ? Sounds like > earlier it used to return number of alive workers ? If it indeed needs > to just return true/false, we can do away with alive_workers. > > Also, *exclusive* lock is unnecessary. > > -- > Changed as per suggestion. Additionally, I changed the name of the function to UndoWorkerIsAvailable(), so that it is similar to other functions in the file. > +if (UndoGetWork(false, false, &urinfo, NULL) && > +IsUndoWorkerAvailable()) > +UndoWorkerLaunch(urinfo); > > There is no lock acquired between IsUndoWorkerAvailable() and > UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable() > returns true, there is a small window where UndoWorkerLaunch() does > not find any worker slot with in_use false, causing assertion failure > for (worker != NULL). 
> -- > I have removed the assert and instead added a warning. I have also added a comment at the place where we call UndoWorkerLaunch to mention the race condition. > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > *Exclusive* lock is unnecessary. > - > Right, changed to share lock. > + LWLockRelease(UndoWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("undo worker slot %d is empty", > + slot))); > I believe there is no need to explicitly release an lwlock before > raising an error, since the lwlocks get released during error > recovery. Please check all other places where this is done. > - > Fixed. > + * Start new undo apply background worker, if possible otherwise return > false. > worker, if possible otherwise => worker if possible, otherwise > - > > +static bool > +UndoWorkerLaunch(UndoRequestInfo urinfo) > We don't check UndoWorkerLaunch() return value. Can't we make it's > return value type void ? > Now, the function returns void and accordingly I have adjusted the comment which should address both the above comments. > Also, it would be better to have urinfo as pointer to UndoRequestInfo > rather than UndoRequestInfo, so as to avoid structure copy. > - > Okay, changed as per suggestion. > +{ > + BackgroundWorker bgw; > + BackgroundWorkerHandle *bgw_handle; > + uint16 generation; > + int i; > + int slot = 0; > We can remove variable i, and use slot variable in place of i. > --- > > + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker"); > I think it would be trivial to also append the worker->generation in > the bgw_name. > - > I am not sure if adding 'generation' is useful. It might be better to add the database id, as each worker can work on a particular database, so that could be useful information. 
> > + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle)) > + { > + /* Failed to start worker, so clean up the worker slot. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + UndoWorkerCleanup(worker); > + LWLockRelease(UndoWorkerLock); > + > + return false; > + } > > Is it intentional that there is no (warning?) message logged when we > can't register a bg worker ? > - > Added a warning in that code path. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
default_table_access_method is not in sample config file
On 11/04/2019 19:49, Andres Freund wrote:
> On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
>> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
>> index f7f726b5aec..bbcab9ce31a 100644
>> --- a/src/backend/utils/misc/guc.c
>> +++ b/src/backend/utils/misc/guc.c
>> @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
>>   {"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT,
>>    gettext_noop("Sets the default table access method for new tables."),
>>    NULL,
>> -  GUC_IS_NAME
>> +  GUC_NOT_IN_SAMPLE | GUC_IS_NAME
>>   },
>>   &default_table_access_method,
>>   DEFAULT_TABLE_ACCESS_METHOD,
>
> Hm, I think we should rather add it to sample. That's an oversight, not intentional.

I just noticed that this is still an issue. default_table_access_method is not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE. I'll add this to the open items list so we don't forget.

- Heikki
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar wrote:
>
> On Mon, 22 Jul 2019 at 14:21, Amit Kapila wrote:
>
> I have started review of
> 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below
> are some quick comments to start with:
>
> +++ b/src/backend/access/undo/undoworker.c
>
> +#include "access/xact.h"
> +#include "access/undorequest.h"
>
> Order is not alphabetical

Fixed this and a few others.

> + * Each undo worker then start reading from one of the queue the requests for
>
> start=>starts
> queue=>queues
> -

> + rc = WaitLatch(MyLatch,
> +               WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> +               10L, WAIT_EVENT_BGWORKER_STARTUP);
> +
> + /* emergency bailout if postmaster has died */
> + if (rc & WL_POSTMASTER_DEATH)
> +    proc_exit(1);
>
> I think now, thanks to commit cfdf4dc4fc9635a, you don't have to
> explicitly handle postmaster death; instead you can use
> WL_EXIT_ON_PM_DEATH. Please check all such places where this is
> done in this patch.
> -

Fixed both of the above issues.

> +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> +{
> +    /* Block concurrent access. */
> +    LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> +
> +    MyUndoWorker = &UndoApplyCtx->workers[slot];
>
> Not sure why MyUndoWorker is used here. Can't we use a local variable?
> Or do we intentionally attach to the slot as a side-operation?
> -

I have changed the code around this such that we first attach to the slot and then get the required info. Also, I don't see the need for an exclusive lock here, so I have changed it to a shared lock.

> + * Get the dbid where the wroker should connect to and get the worker
>
> wroker=>worker
> -

> + BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0);
>
> 0, 0 => InvalidOid, 0
>
> + * Set the undo worker request queue from which the undo worker start
> + * looking for a work.
>
> start => should start
> a work => work
> --

Fixed both of these.
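As a reference for the WL_EXIT_ON_PM_DEATH change requested above, the simplified call would look roughly like this (PostgreSQL internals shown for illustration; this fragment is not standalone, and the wait-event name is taken from the quoted code):

```c
rc = WaitLatch(MyLatch,
               WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
               10L, WAIT_EVENT_BGWORKER_STARTUP);
/*
 * No explicit WL_POSTMASTER_DEATH check or proc_exit(1) is needed:
 * with WL_EXIT_ON_PM_DEATH the process exits automatically if the
 * postmaster dies (per commit cfdf4dc4fc9635a).
 */
```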
> + if (!InsertRequestIntoErrorUndoQueue(urinfo))
>
> I was thinking what happens if for some reason
> InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> entry will not be marked invalid, and so there will be no undo action
> carried out because I think the undo worker will exit. What happens
> next with this entry?

I think this will change after integration with Robert's latest patch on queues, so I will address it along with that if required.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Small const correctness patch
At Thu, 8 Aug 2019 22:56:02 +0300, Mark G wrote in
> On Thu, Aug 8, 2019 at 8:51 PM Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>
> > How did you find this? Any special compiler settings?
>
> 16 hours stuck in a plane on an international flight. I was just eyeballing
> the code to kill the boredom.

Similar loose typing is seen, for example, in:

-const char *
+const char * const

src/backend/access/rmgrdesc/*.c:
  relmap_identify(uint8 info)
  seq_identify(uint8 info)
  smgr_identify(uint8 info)
  (many)...

src/backend/access/transam/xact.c:
  BlockStateAsString(TBlockState blockState)

I found them by

  find $(TOP) -type f -exec egrep -nH -e '^(static )?const char \*' {} +

then eyeballing the first matches. I don't know an automated way to detect such possibly-loose constness of variables or functions.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Small patch to fix build on Windows
On Fri, 9 Aug 2019 at 10:23, Kyotaro Horiguchi wrote:
>
> At Fri, 9 Aug 2019 09:56:27 +0300, Dmitry Igrishin wrote in
> > On Fri, 9 Aug 2019 at 05:45, Michael Paquier wrote:
> > >
> > > On Thu, Aug 08, 2019 at 10:46:07PM +0300, Dmitry Igrishin wrote:
> > > > This looks nice for a Perl hacker :-). As for me, it looks unusual and
> > > > a bit confusing. I never programmed in Perl, but I was able to quickly
> > > > understand where the problem lies due to the style adopted in other
> > > > languages, where the contents are enclosed in quotation marks, and
> > > > the quotation marks are escaped if they are part of the contents.
> > > > So, should I fix it? Any thoughts?
> > >
> > > FWIW, I like Alvaro's suggestion about qq{} in this case, as it makes
> > > sure that double-quotes are correctly applied where they should.
> >
> > The attached 4th version of the patch uses qq||. I used qq|| instead
> > of qq{} for consistency, because qq|| is already used in Solution.pm:
> >
> > return qq|VisualStudioVersion = $self->{VisualStudioVersion}
> > MinimumVisualStudioVersion = $self->{MinimumVisualStudioVersion}
> > |;
>
> Hmm. qq is nice, but '|' makes my eyes twitch (a bit). Couldn't we
> use other delimiters like (), ##, or //? (I like {} for use in
> this patch.)
>
> Any opinions?

Personally, I don't care. I used the || notation only in order to be consistent, since that notation is already used in Solution.pm. If this consistency is not required, let me provide a patch with the {} notation. What do you think?
Re: Problem with default partition pruning
Thanks for the comments.

On Fri, Aug 9, 2019 at 2:44 PM Kyotaro Horiguchi wrote:
> +++ b/src/backend/optimizer/util/plancat.c
> @@ -1267,10 +1267,14 @@ get_relation_constraints(PlannerInfo *root,
>   */
>  if (include_partition && relation->rd_rel->relispartition)
>  {
> ...
> +else
>  {
> +    /* Nope, fetch from the relcache. */
>
> It seems to me that include_partition is true both and only for
> modern and old-fashioned partition parents.
> set_relation_partition_info() is currently called only for modern
> partition parents. If we need that at the place above,
> set_relation_partition_info can be called also for old-fashioned
> partition parents, and get_relation_constraints may forget the
> else case in a broad way.

"include_partition" doesn't have anything to do with what kind of partition parent it is. It is true when the input relation that is a partition is directly mentioned in the query (RELOPT_BASEREL) and constraint_exclusion is on (inheritance_planner considerations make the actual code a bit hard to follow, but we'll hopefully simplify that in the near future). That is also the only case where we need to consider the partition constraint when doing constraint exclusion.

Regarding how this relates to partition_qual:

* get_relation_constraints() can use it if it's set, which would be the case if the partition in question is partitioned itself

* It wouldn't be set if the partition in question is a leaf partition, so it will have to get it directly from the relcache

> +    /* Nope, fetch from the relcache. */
> +    List *pcqual = RelationGetPartitionQual(relation);
>
> If the comment above is right, this would be duplicate. What we
> should do instead is only eval_const_expression. And we could
> move it to set_relation_partition_info completely. Constify must
> be useful in both cases.
As described above, this block of code is not really duplicative, in the sense that when it runs, that is the first time in the query that the partition constraint of the relation in question is fetched. Also, note that expression_planner() calls eval_const_expressions(), so constification happens in both cases. I guess different places have grown different styles of processing constraint expressions as the APIs have evolved over time.

Thanks,
Amit
Re: Small patch to fix build on Windows
At Fri, 9 Aug 2019 09:56:27 +0300, Dmitry Igrishin wrote in
> On Fri, 9 Aug 2019 at 05:45, Michael Paquier wrote:
> >
> > On Thu, Aug 08, 2019 at 10:46:07PM +0300, Dmitry Igrishin wrote:
> > > This looks nice for a Perl hacker :-). As for me, it looks unusual and
> > > a bit confusing. I never programmed in Perl, but I was able to quickly
> > > understand where the problem lies due to the style adopted in other
> > > languages, where the contents are enclosed in quotation marks, and
> > > the quotation marks are escaped if they are part of the contents.
> > > So, should I fix it? Any thoughts?
> >
> > FWIW, I like Alvaro's suggestion about qq{} in this case, as it makes
> > sure that double-quotes are correctly applied where they should.
>
> The attached 4th version of the patch uses qq||. I used qq|| instead
> of qq{} for consistency, because qq|| is already used in Solution.pm:
>
> return qq|VisualStudioVersion = $self->{VisualStudioVersion}
> MinimumVisualStudioVersion = $self->{MinimumVisualStudioVersion}
> |;

Hmm. qq is nice, but '|' makes my eyes twitch (a bit). Couldn't we use other delimiters like (), ##, or //? (I like {} for use in this patch.)

Any opinions?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center