Re: First draft of PG 17 release notes
On 5/11/24 09:57, Jelte Fennema-Nio wrote: On Fri, 10 May 2024 at 23:31, Tom Lane wrote: Bruce Momjian writes: > I looked at both of these. In both cases I didn't see why the user > would need to know these changes were made: I agree that the buffering change is not likely interesting, but the fact that you can now control-C out of a psql "\c" command is user-visible. People might have internalized the fact that it didn't work, or created complicated workarounds. The buffering change improved performance up to ~40% in some of the benchmarks. The case it improves mostly is COPY of large rows and streaming a base backup. That sounds user-visible enough to me to warrant an entry imho. +1 -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: 'trusted'/'untrusted' PL in DoD/DISA PostgreSQL STIGs
On 5/5/24 13:53, Chapman Flack wrote: The four STIGs suggest the same email address [5] for comments or proposed revisions. I could send these comments there myself, but I thought it likely that others in the community have already been involved in the development of those documents and might have better connections. Those docs were developed by the respective companies (Crunchy and EDB) in cooperation with DISA. The community has nothing to do with them. I suggest you contact the two companies with corrections and suggestions. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation
On 5/3/24 11:44, Peter Eisentraut wrote: On 03.05.24 16:13, Tom Lane wrote: Peter Eisentraut writes: On 30.04.24 19:29, Tom Lane wrote: Also, the bigger picture here is the seeming assumption that "if we change pg_trgm then it will be safe to replicate from x86 to arm". I don't believe that that's a good idea and I'm unwilling to promise that it will work, regardless of what we do about char signedness. That being the case, I don't want to invest a lot of effort in the signedness issue. Option (1) is clearly a small change with little if any risk of future breakage. But note that option 1 would prevent some replication that is currently working. The point of this thread though is that it's working only for small values of "work". People are rightfully unhappy if it seems to work and then later they get bitten by compatibility problems. Treating char signedness as a machine property in pg_control would signal that we don't intend to make it work, and would ensure that even the most minimal testing would find out that it doesn't work. If we do not do that, it seems to me we have to buy into making it work. That would mean dealing with the consequences of an incompatible change in pg_trgm indexes, and then going through the same dance again the next time(s) similar problems are found. Yes, that is understood. But anecdotally, replicating between x86-64 and arm64 is occasionally used for upgrades or migrations. In practice, this appears to have mostly worked. If we now discover that it won't work with certain index extension modules, it's still usable for most users. Even if we say, you have to reindex everything afterwards, it's probably still useful for these scenarios. +1 I have heard similar anecdotes, and the reported experience goes even further -- many such upgrade/migration uses, with exceedingly rare reported failures. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
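The portability hazard being discussed boils down to the fact that plain char is signed in the x86 ABIs and unsigned in the ARM ABIs, so any on-disk ordering derived from raw char comparisons can differ between the two architectures. A minimal illustration, reinterpreting the same byte both ways as a stand-in for the two ABIs:

```python
import struct

# The byte 0x80 is the classic problem case: as a signed char it is -128,
# as an unsigned char it is 128. x86 ABIs treat plain char as signed,
# ARM ABIs as unsigned, so code that compares chars directly can order
# the same bytes differently on the two architectures.
raw = b"\x80"
as_signed, = struct.unpack("b", raw)    # x86-style plain char
as_unsigned, = struct.unpack("B", raw)  # ARM-style plain char
print(as_signed, as_unsigned)           # -128 128

# A comparison against an ASCII byte (0x41 == 'A') flips between the two:
print(as_signed < 0x41, as_unsigned < 0x41)  # True False
```

An index whose on-disk layout bakes in such comparisons (as pg_trgm's does) is therefore not interchangeable between the two, which is why recording char signedness in pg_control was floated as a way to make the incompatibility detectable.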
Re: New GUC autovacuum_max_threshold ?
On 4/26/24 09:31, Robert Haas wrote: On Fri, Apr 26, 2024 at 9:22 AM Joe Conway wrote: Although I don't think 50 is necessarily too small. In my view, having autovac run very quickly, even if more frequently, provides an overall better user experience. Can you elaborate on why you think that? I mean, to me, that's almost equivalent to removing autovacuum_vacuum_scale_factor entirely, because only for very small tables will that calculation produce a value lower than 500k. If I understood Nathan's proposed calc, for small tables you would still get (thresh + sf * numtuples). Once that number exceeds the new limit parameter, then the latter would kick in. So small tables would retain the current behavior and large enough tables would be clamped. We might need to try to figure out some test cases here. My intuition is that this is going to vacuum large tables insanely aggressively. It depends on workload to be sure. Just because a table is large, it doesn't mean that dead rows are generated that fast. Admittedly it has been quite a while since I looked at all this that closely, but if A/V runs on some large busy table for a few milliseconds once every few minutes, that is far less disruptive than A/V running for tens of seconds once every few hours or for minutes once every few days -- or whatever. The key thing to me is the "few milliseconds" runtime. The short duration means that no one notices an impact, and the longer duration almost guarantees that an impact will be felt. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
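For concreteness, the clamped trigger as described above can be sketched like this; parameter names and default values are illustrative stand-ins, not taken from an actual patch:

```python
def vacuum_trigger(reltuples, base_thresh=50, scale_factor=0.2,
                   max_thresh=500_000):
    """Dead-tuple count at which autovacuum would fire: the classic
    (threshold + scale_factor * reltuples) formula, with the proposed
    upper clamp applied on top."""
    classic = base_thresh + scale_factor * reltuples
    return min(classic, max_thresh)

# Small table: the classic formula wins, so current behavior is retained.
print(vacuum_trigger(1_000))        # 250.0
# Large table: without the clamp this would wait for ~20M dead tuples;
# with it, vacuum fires much earlier and so each run stays short.
print(vacuum_trigger(100_000_000))  # 500000
```

The second call is the crux of the debate: the clamp trades more frequent, shorter vacuums of big tables against the risk of vacuuming them too aggressively.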
Re: New GUC autovacuum_max_threshold ?
On 4/26/24 04:43, Michael Banck wrote: So this proposal (probably along with a higher default threshold than 50, but IMO less than what Robert and Nathan suggested) sounds like a step forward to me. DBAs can set the threshold lower if they want, or maybe we can just turn it off by default if we cannot agree on a sane default, but I think this (using the simplified formula from Nathan) is a good approach that takes some pain away from autovacuum tuning and reserves that for the really difficult cases. +1 to the above Although I don't think 50 is necessarily too small. In my view, having autovac run very quickly, even if more frequently, provides an overall better user experience. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Table AM Interface Enhancements
On 4/10/24 09:19, Robert Haas wrote: When you commit a patch and another committer writes a post-commit review saying that your patch has so many serious problems that he gave up on reviewing before enumerating all of them, that's a really bad sign. That should be a strong signal to you to step back and take a close look at whether you really understand the area of the code that you're touching well enough to be doing whatever it is that you're doing. If I got a review like that, I would have reverted the patch instantly, given up for the release cycle, possibly given up on the patch permanently, and most definitely not tried again to commit unless I was absolutely certain that I'd learned a lot in the meantime *and* had the agreement of the committer who wrote that review (or maybe some other committer who was acknowledged as an expert in that area of the code). It's not Andres's job to make sure my patches are not broken. It's my job. That applies to the patches I write, and the patches written by other people that I commit. If I commit something and it turns out that it is broken, that's my bad. If I commit something and it turns out that it does not have consensus, that is also my bad. It is not the fault of the other people for not helping me get my patches to a state where they are up to project standard. It is my fault, and my fault alone, for committing something that was not ready. Now that does not mean that it isn't frustrating when I can't get the help I need. It is extremely frustrating. But the solution is not to commit anyway and then blame the other people for not providing feedback. +many -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Security lessons from liblzma
On 4/8/24 22:57, Bruce Momjian wrote: On Sat, Mar 30, 2024 at 04:50:26PM -0400, Robert Haas wrote: An awful lot of what we do operates on the principle that we know the people who are involved and trust them, and I'm glad we do trust them, but the world is full of people who trusted somebody too much and regretted it afterwards. The fact that we have many committers rather than a single maintainer probably reduces risk at least as far as the source repository is concerned, because there are more people paying attention to potentially notice something that isn't as it should be. One unwritten requirement for committers is that we are able to communicate with them securely. If we cannot do that, they potentially could be forced by others, e.g., governments, to add code to our repositories. Unfortunately, there is no good way for them to alert us once they are unable to communicate with us securely. I suppose some special word could be used to communicate that status --- that is how it was done in non-electronic communication in the past. I don't know how that really helps. If one of our committers is under duress, they probably cannot risk outing themselves anyway. The best defense, IMHO, is the fact that our source code is open and can be reviewed freely. The trick is to get folks to do the review. I know, for example, at $past_employer we had a requirement to get someone on our staff to review every single commit in order to maintain certain certifications. Of course there is no guarantee that such reviews would catch everything, but maybe we could establish post commit reviews by contributors in a more rigorous way? Granted, getting more qualified volunteers is not a trivial problem... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PostgreSQL 17 Release Management Team & Feature Freeze
On 4/8/24 11:05, Tom Lane wrote: Pavel Borisov writes: IMO the fact that people struggle to work on patches, and make them better, etc. is an immense blessing for the Postgres community. Is the peak of commits really a big problem provided we have 6 months before actual release? I doubt March patches tend to be worse than the November ones. Yes, it's a problem, and yes the average quality of last-minute patches is visibly worse than that of patches committed in a less hasty fashion. We have been through this every year for the last couple decades, seems like, and we keep re-learning that lesson the hard way. I'm just distressed at our utter failure to learn from experience. I don't dispute that we could do better, and this is just a simplistic look based on "number of commits per day", but the attached does put it in perspective to some extent. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Security lessons from liblzma
On 3/31/24 11:49, Tom Lane wrote: Joe Conway writes: I am saying maybe those patches should be eliminated in favor of our tree including build options that would produce the same result. I don't really see how that can be expected to work sanely. It turns the responsibility for platform-specific build issues on its head, and it doesn't work at all for issues discovered after we make a release. The normal understanding of how you can vet a distro's package is that you look at the package contents (the SRPM in Red Hat world and whatever the equivalent concept is elsewhere), check that the contained tarball matches upstream and that the patches and build instructions look sane, and then build it locally and check for a match to the distro's binary package. Even if we could overcome the obstacles to putting the patch files into the upstream tarball, we're surely not going to include the build instructions, so we'd not have moved the needle very far in terms of whether the packager could do something malicious. True enough I guess. But it has always bothered me how many patches get applied to the upstream tarballs by the OS package builders. Some of them, e.g. glibc on RHEL 7, include more than 1000 patches that you would have to manually vet if you cared enough and had the skills. Last time I looked at the openssl package sources it was similar in volume and complexity. They might as well be called forks if everyone were being honest about it... I know our PGDG packages are no big deal compared to that, fortunately. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
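The vetting step Tom describes -- confirming that the tarball embedded in a distro's source package is byte-identical to the upstream release -- is the easy, mechanical part. A rough sketch (the file names here are made up for illustration):

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Streaming SHA-256 so large tarballs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-ins for the upstream tarball and the copy extracted from an SRPM.
with tempfile.TemporaryDirectory() as d:
    upstream = os.path.join(d, "postgresql-16.2.tar.bz2")
    from_srpm = os.path.join(d, "postgresql-16.2.tar.bz2.srpm-copy")
    for p in (upstream, from_srpm):
        with open(p, "wb") as f:
            f.write(b"pretend tarball bytes")
    match = sha256_of(upstream) == sha256_of(from_srpm)

print(match)  # True only when the two files are byte-identical
```

As the email notes, the hard part is everything a checksum comparison does not cover: the hundreds or thousands of patches and the build instructions layered on top.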
Re: Security lessons from liblzma
On 3/30/24 21:52, Bruce Momjian wrote: On Sat, Mar 30, 2024 at 07:54:00PM -0400, Joe Conway wrote: Virtually every RPM source, including ours, contains out of tree patches that get applied on top of the release tarball. At least for the PGDG packages, it would be nice to integrate them into our git repo as build options or whatever so that the packages could be built without any patches applied to it. Add a tarball that is signed and traceable back to the git tag, and we would be in a much better place than we are now. How would someone access the out-of-tree patches? I think Debian includes the patches in its source tarball. I am saying maybe those patches should be eliminated in favor of our tree including build options that would produce the same result. For example, these patches are applied to our release tarball files when the RPM is being built for pg16 on RHEL 9:

- https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-rpm-pgsql.patch;h=d9b6d12b7517407ac81352fa325ec91b05587641;hb=HEAD
- https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-var-run-socket.patch;h=f2528efaf8f4681754b20283463eff3e14eedd39;hb=HEAD
- https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-conf.patch;h=da28ed793232316dd81fdcbbe59a6479b054a364;hb=HEAD
- https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-perl-rpath.patch;h=748c42f0ec2c9730af3143e90e5b205c136f40d9;hb=HEAD

Nothing too crazy, but wouldn't it be better if no patches were required at all? Ideally we should have reproducible builds so that starting with our tarball (which is traceable back to the git release tag) one can easily obtain the same binary as what the RPMs/DEBs deliver.
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Security lessons from liblzma
On 3/30/24 19:54, Joe Conway wrote: On 2024-03-30 16:50:26 -0400, Robert Haas wrote: or what Tom does when he builds the release tarballs. Tom follows this, at least last time I checked: https://wiki.postgresql.org/wiki/Release_process Reading through that, I wonder if this part is true anymore: In principle this could be done anywhere, but again there's a concern about reproducibility, since the results may vary depending on installed bison, flex, docbook, etc versions. Current practice is to always do this as pgsql on borka.postgresql.org, so it can only be done by people who have a login there. In detail: Maybe if we split out the docs from the release tarball, we could also add the script (mk-release) to our git repo? Some other aspects of that wiki page look out of date too. Perhaps it needs an overall update? Maybe Tom and/or Magnus could weigh in here. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Security lessons from liblzma
On 3/30/24 17:12, Andres Freund wrote: Hi, On 2024-03-30 16:50:26 -0400, Robert Haas wrote: We might also want to move toward signing commits and tags. One of the meson maintainers was recommending that on-list not long ago. I don't know how valuable the commit signing really is, but I strongly agree that we should sign both tags and tarballs. +1 We should think about weaknesses that might occur during the packaging process, too. If someone who alleges that they're packaging PG is really packaging PG w/badstuff123.patch, how would we catch that? I don't think we realistically can. The environment, configure and compiler options will influence things too much to do any sort of automatic verification. But that shouldn't stop us from ensuring that at least the packages distributed via *.postgresql.org are reproducibly built. Another good avenue for introducing an attack would be to propose some distro specific changes to the packaging teams - there's a lot fewer eyes there. I think it might be worth working with some of our packagers to integrate more of their changes into our tree. Huge +1 to that. I have thought many times, and even said to Devrim, a huge number of people trust him to not be evil. Virtually every RPM source, including ours, contains out of tree patches that get applied on top of the release tarball. At least for the PGDG packages, it would be nice to integrate them into our git repo as build options or whatever so that the packages could be built without any patches applied to it. Add a tarball that is signed and traceable back to the git tag, and we would be in a much better place than we are now. I can't for example verify what the infrastructure team is doing, Not sure what you feel like you should be able to follow -- anything specific? or what Tom does when he builds the release tarballs. Tom follows this, at least last time I checked: https://wiki.postgresql.org/wiki/Release_process This one however, I think we could improve upon.
Making sure the tarball generation is completely binary reproducible and providing means of checking that would surely help. This will be a lot easier if we, as discussed elsewhere, I believe, split out the generated docs into a separately downloadable archive. We already stopped including other generated files recently. again, big +1 We shouldn't make the mistake of assuming that bad things can't happen to us. +1 and again with the +1 ;-) -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On 3/25/24 11:12, Tom Lane wrote: "Amonson, Paul D" writes: I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time. Just for a note --- the cfbot will re-test existing patches every so often without needing a bump. The current cycle period seems to be about two days. Just an FYI -- there seems to be an issue with all three of the macos cfbot runners (mine included). I spent time over the weekend working with Thomas Munro (added to CC list) trying different fixes to no avail. Help from macos CI wizards would be gratefully accepted... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Possibility to disable `ALTER SYSTEM`
On 3/19/24 07:49, Andrew Dunstan wrote: On Tue, Mar 19, 2024 at 5:26 AM Heikki Linnakangas wrote: I want to remind everyone of this from Gabriele's first message that started this thread: > At the moment, a possible workaround is that `ALTER SYSTEM` can be blocked > by making the postgresql.auto.conf read only, but the returned message is > misleading and that’s certainly bad user experience (which is very > important in a cloud native environment): > > > ``` > postgres=# ALTER SYSTEM SET wal_level TO minimal; > ERROR: could not open file "postgresql.auto.conf": Permission denied > ``` I think making the config file read-only is a fine solution. If you don't want postgres to mess with the config files, forbid it with the permission system. Problems with pg_rewind, pg_basebackup were mentioned with that approach. I think if you want the config files to be managed outside PostgreSQL, by kubernetes, patroni or whatever, it would be good for them to be read-only to the postgres user anyway, even if we had a mechanism to disable ALTER SYSTEM. So it would be good to fix the problems with those tools anyway. The error message is not great, I agree with that. Can we improve it? Maybe just add a HINT like this: postgres=# ALTER SYSTEM SET wal_level TO minimal; ERROR: could not open file "postgresql.auto.conf" for writing: Permission denied HINT: Configuration might be managed outside PostgreSQL Perhaps we could make that even better with a GUC though. I propose a GUC called 'configuration_managed_externally = true / false'. If you set it to true, we prevent ALTER SYSTEM and make the error message more definitive: postgres=# ALTER SYSTEM SET wal_level TO minimal; ERROR: configuration is managed externally As a bonus, if that GUC is set, we could even check at server startup that all the configuration files are not writable by the postgres user, and print a warning or refuse to start up if they are.
(Another way to read this proposal is to rename the GUC that's been discussed in this thread to 'configuration_managed_externally'. That makes it look less like a security feature, and describes the intended use case.) I agree with pretty much all of this. +1 me too. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
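Heikki's bonus idea -- warn or refuse startup when configuration is declared externally managed but the files are still writable -- is straightforward to sketch. This is purely illustrative Python pseudocode for the check, not the server-side C it would actually be:

```python
import os
import stat
import tempfile

def config_locked_down(paths):
    """True if none of the given files carry any write-permission bit --
    a rough stand-in for 'not writable by the postgres user'."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    return all(not (os.stat(p).st_mode & write_bits) for p in paths)

with tempfile.TemporaryDirectory() as d:
    conf = os.path.join(d, "postgresql.auto.conf")
    open(conf, "w").close()

    os.chmod(conf, 0o644)                      # still writable
    writable_state = config_locked_down([conf])

    os.chmod(conf, 0o444)                      # read-only, as the GUC expects
    locked_state = config_locked_down([conf])

print(writable_state, locked_state)  # False True
```

The real check would run once at startup against postgresql.conf, postgresql.auto.conf, and any included files, and either emit a WARNING or refuse to start, per the proposal.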
Re: Improving contrib/tablefunc's error reporting
On 3/9/24 15:39, Tom Lane wrote: Joe Conway writes: On 3/9/24 13:07, Tom Lane wrote: BTW, while I didn't touch it here, it seems fairly bogus that connectby() checks both type OID and typmod for its output columns while crosstab() only checks type OID. I think crosstab() is in the wrong and needs to be checking typmod. That might be fit material for a separate patch though. Something like the attached what you had in mind? (applies on top of your two patches) Yeah, exactly. As far as the comment change goes: - * - attribute [1] of the sql tuple is the category; no need to check it - - * attribute [2] of the sql tuple should match attributes [1] to [natts] + * attribute [1] of the sql tuple is the category; no need to check it + * attribute [2] of the sql tuple should match attributes [1] to [natts - 1] * of the return tuple I suspect that this block looked better when originally committed but didn't survive contact with pgindent. You should check whether your version will; if not, some dashes on the /* line will help. Thanks for the review and heads up. I fiddled with it a bit to make it pgindent clean. I saw you commit your patches so I just pushed mine. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Improving contrib/tablefunc's error reporting
On 3/9/24 13:07, Tom Lane wrote: Joe Conway writes: On 3/5/24 17:04, Tom Lane wrote: After reading the thread at [1], I could not escape the feeling that contrib/tablefunc's error reporting is very confusing. The changes all look good to me and indeed more consistent with the docs. Do you want me to push these? If so, development tip only, or backpatch? I can push that. I was just thinking HEAD, we aren't big on changing error reporting in back branches. BTW, while I didn't touch it here, it seems fairly bogus that connectby() checks both type OID and typmod for its output columns while crosstab() only checks type OID. I think crosstab() is in the wrong and needs to be checking typmod. That might be fit material for a separate patch though. I can take a look at this. Presumably this would not be for backpatching. I'm not sure whether that could produce results bad enough to be called a bug or not. But I too would lean towards not back-patching, in view of the lack of field complaints. Something like the attached what you had in mind? (applies on top of your two patches) -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

Make tablefunc crosstab() check typmod too

tablefunc connectby() checks both type OID and typmod for its output columns while crosstab() only checks type OID. Fix that by making the crosstab() check look more like the connectby() check.
--- tablefunc.c.v0002	2024-03-09 14:38:29.285393890 -0500
+++ tablefunc.c	2024-03-09 14:43:47.021399855 -0500
@@ -1527,10 +1527,10 @@
 compatCrosstabTupleDescs(TupleDesc ret_tupdesc, TupleDesc sql_tupdesc)
 {
 	int			i;
-	Form_pg_attribute ret_attr;
 	Oid			ret_atttypid;
-	Form_pg_attribute sql_attr;
 	Oid			sql_atttypid;
+	int32		ret_atttypmod;
+	int32		sql_atttypmod;
 
 	if (ret_tupdesc->natts < 2)
 		ereport(ERROR,
@@ -1539,34 +1539,39 @@
 				 errdetail("Return row must have at least two columns.")));
 	Assert(sql_tupdesc->natts == 3);	/* already checked by caller */
 
-	/* check the rowid types match */
+	/* check the row_name types match */
 	ret_atttypid = TupleDescAttr(ret_tupdesc, 0)->atttypid;
 	sql_atttypid = TupleDescAttr(sql_tupdesc, 0)->atttypid;
-	if (ret_atttypid != sql_atttypid)
+	ret_atttypmod = TupleDescAttr(ret_tupdesc, 0)->atttypmod;
+	sql_atttypmod = TupleDescAttr(sql_tupdesc, 0)->atttypmod;
+	if (ret_atttypid != sql_atttypid ||
+		(ret_atttypmod >= 0 && ret_atttypmod != sql_atttypmod))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATATYPE_MISMATCH),
 				 errmsg("invalid crosstab return type"),
 				 errdetail("Source row_name datatype %s does not match return row_name datatype %s.",
-						   format_type_be(sql_atttypid),
-						   format_type_be(ret_atttypid))));
+						   format_type_with_typemod(sql_atttypid, sql_atttypmod),
+						   format_type_with_typemod(ret_atttypid, ret_atttypmod))));
 
 	/*
-	 * - attribute [1] of the sql tuple is the category; no need to check it -
-	 * attribute [2] of the sql tuple should match attributes [1] to [natts]
+	 * attribute [1] of the sql tuple is the category; no need to check it
+	 * attribute [2] of the sql tuple should match attributes [1] to [natts - 1]
 	 * of the return tuple
 	 */
-	sql_attr = TupleDescAttr(sql_tupdesc, 2);
+	sql_atttypid = TupleDescAttr(sql_tupdesc, 2)->atttypid;
+	sql_atttypmod = TupleDescAttr(sql_tupdesc, 2)->atttypmod;
 	for (i = 1; i < ret_tupdesc->natts; i++)
 	{
-		ret_attr = TupleDescAttr(ret_tupdesc, i);
-
-		if (ret_attr->atttypid != sql_attr->atttypid)
+		ret_atttypid = TupleDescAttr(ret_tupdesc, i)->atttypid;
+		ret_atttypmod = TupleDescAttr(ret_tupdesc, i)->atttypmod;
+		if (ret_atttypid != sql_atttypid ||
+			(ret_atttypmod >= 0 && ret_atttypmod != sql_atttypmod))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATATYPE_MISMATCH),
 					 errmsg("invalid crosstab return type"),
 					 errdetail("Source value datatype %s does not match return value datatype %s in column %d.",
-							   format_type_be(sql_attr->atttypid),
-							   format_type_be(ret_attr->atttypid),
+							   format_type_with_typemod(sql_atttypid, sql_atttypmod),
+							   format_type_with_typemod(ret_atttypid, ret_atttypmod),
 							   i + 1)));
 	}
Re: Improving contrib/tablefunc's error reporting
On 3/5/24 17:04, Tom Lane wrote: After reading the thread at [1], I could not escape the feeling that contrib/tablefunc's error reporting is very confusing. Looking into the source code, I soon found that it is also very inconsistent, with similar error reports being phrased quite differently. The terminology for column names doesn't match the SGML docs either. And there are some places that are so confused about whether they are complaining about the calling query or the called query that the output is flat-out backwards. So at that point my nascent OCD wouldn't let me rest without fixing it. Here's a quick patch series to do that. For review purposes, I split this into two patches. 0001 simply adds some more test cases to reach currently-unexercised error reports. Then 0002 makes my proposed code changes and shows how the existing error messages change. I'm not necessarily wedded to the phrasings I used here, in case anyone has better ideas. The changes all look good to me and indeed more consistent with the docs. Do you want me to push these? If so, development tip only, or backpatch? BTW, while I didn't touch it here, it seems fairly bogus that connectby() checks both type OID and typmod for its output columns while crosstab() only checks type OID. I think crosstab() is in the wrong and needs to be checking typmod. That might be fit material for a separate patch though. I can take a look at this. Presumably this would not be for backpatching. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 3/8/24 12:28, Andrey M. Borodin wrote: Hello everyone! Thanks for working on this, really nice feature! On 9 Jan 2024, at 01:40, Joe Conway wrote: Thanks -- will have a look Joe, recently folks proposed a lot of patches in this thread that seem like diverted from original way of implementation. As an author of CF entry [0] can you please comment on which patch version needs review? I don't know if I agree with the proposed changes, but I have also been waiting to see how the parallel discussion regarding COPY extensibility shakes out. And there were a couple of issues found that need to be tracked down. Additionally I have had time/availability challenges recently. Overall, chances seem slim that this will make it into 17, but I have not quite given up hope yet either. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Improving contrib/tablefunc's error reporting
On 3/5/24 17:04, Tom Lane wrote: After reading the thread at [1], I could not escape the feeling that contrib/tablefunc's error reporting is very confusing. Looking into the source code, I soon found that it is also very inconsistent, with similar error reports being phrased quite differently. The terminology for column names doesn't match the SGML docs either. And there are some places that are so confused about whether they are complaining about the calling query or the called query that the output is flat-out backwards. So at that point my nascent OCD wouldn't let me rest without fixing it. Here's a quick patch series to do that. For review purposes, I split this into two patches. 0001 simply adds some more test cases to reach currently-unexercised error reports. Then 0002 makes my proposed code changes and shows how the existing error messages change. I'm not necessarily wedded to the phrasings I used here, in case anyone has better ideas. BTW, while I didn't touch it here, it seems fairly bogus that connectby() checks both type OID and typmod for its output columns while crosstab() only checks type OID. I think crosstab() is in the wrong and needs to be checking typmod. That might be fit material for a separate patch though. Been a long time since I gave contrib/tablefunc any love I guess ;-) I will have a look at your patches, and the other issue you mention, but it might be a day or three before I can give it some quality time. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
PostgreSQL Contributors Updates
All, The PostgreSQL Contributor Page (https://www.postgresql.org/community/contributors/) includes people who have made substantial, long-term contributions of time and effort to the PostgreSQL project. The PostgreSQL Contributors Team recognizes the following people for their contributions. New PostgreSQL Contributors: * Bertrand Drouvot * Gabriele Bartolini * Richard Guo New PostgreSQL Major Contributors: * Alexander Lakhin * Daniel Gustafsson * Dean Rasheed * John Naylor * Melanie Plageman * Nathan Bossart Thank you and congratulations to all! Thanks, On behalf of the PostgreSQL Contributors Team -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: An improved README experience for PostgreSQL
On 2/28/24 14:36, Tom Lane wrote: Nathan Bossart writes: On Wed, Feb 28, 2024 at 01:08:03PM -0600, Nathan Bossart wrote: -PostgreSQL Database Management System -= +# PostgreSQL Database Management System This change can be omitted, which makes the conversion even simpler. That's a pretty convincing proof-of-concept. Let's just do this, and then make sure to keep the file legible as plain text. +1 -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: An improved README experience for PostgreSQL
On 2/28/24 12:25, Daniel Gustafsson wrote: On 28 Feb 2024, at 18:02, Nathan Bossart wrote: On Wed, Feb 28, 2024 at 02:46:11PM +0100, Daniel Gustafsson wrote: On 26 Feb 2024, at 21:30, Tom Lane wrote: Nathan Bossart writes: I think this would be nice. If the Markdown version is reasonably readable as plain text, maybe we could avoid maintaining two README files, too. But overall, +1 to modernizing the README a bit. Per past track record, we change the top-level README only once every three years or so, so I doubt it'd be too painful to maintain two versions of it. It won't be, and we kind of already have two since there is another similar README displayed at https://www.postgresql.org/ftp/. That being said, a majority of those reading the README will likely be new developers accustomed to Markdown (or doing so via interfaces such as Github) so going to Markdown might not be a bad idea. We can also render a plain text version with pandoc for release builds should we want to. Sorry, my suggestion wasn't meant to imply that I have any strong concerns about maintaining two README files. If we can automate generating one or the other, that'd be great, but I don't see that as a prerequisite to adding a Markdown version. Agreed, and I didn't say we should do it but rather that we can do it based on the toolchain we already have. Personally I think just having a Markdown version is enough, it's become the de facto standard for such documentation for good reasons. +1 Markdown is pretty readable as text, I'm not sure why we need both. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGC_SIGHUP shared_buffers?
On 2/19/24 13:13, Andres Freund wrote: On 2024-02-19 09:19:16 -0500, Joe Conway wrote: Couldn't we scale the rounding, e.g. allow small allocations as we do now, but above some number always round? E.g. maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, etc? That'd make the translation considerably more expensive. Which is important, given how common an operation this is. Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my intent was that the rounding be done/enforced when setting the GUC value which surely cannot be that often. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGC_SIGHUP shared_buffers?
On 2/18/24 15:35, Andres Freund wrote: On 2024-02-18 17:06:09 +0530, Robert Haas wrote: How many people set shared_buffers to something that's not a whole number of GB these days? I'd say the vast majority of postgres instances in production run with less than 1GB of s_b. Just because numbers wise the majority of instances are running on small VMs and/or many PG instances are running on one larger machine. There are a lot of instances where the total available memory is less than 2GB. I mean I bet it happens, but in practice if you rounded to the nearest GB, or even the nearest 2GB, I bet almost nobody would really care. I think it's fine to be opinionated here and hold the line at a relatively large granule, even though in theory people could want something else. I don't believe that at all unfortunately. Couldn't we scale the rounding, e.g. allow small allocations as we do now, but above some number always round? E.g. maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, etc? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
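To make the tiered-rounding idea above concrete, here is a rough sketch in Python. The thresholds and granules are just the illustrative ones floated in the message ("maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB"), not anything PostgreSQL implements:

```python
# Sketch of the tiered rounding floated above: small settings pass
# through unchanged, larger ones snap to progressively coarser granules.
# All thresholds/granules here are illustrative assumptions.

GIB = 1024 * 1024 * 1024
MIB = 1024 * 1024

def round_shared_buffers(requested_bytes: int) -> int:
    """Round a requested shared_buffers size to a tier-dependent granule."""
    tiers = [  # (minimum size, granule) -- checked from largest down
        (8 * GIB, 1 * GIB),
        (4 * GIB, 512 * MIB),
        (2 * GIB, 256 * MIB),
    ]
    for threshold, granule in tiers:
        if requested_bytes >= threshold:
            # round to the nearest multiple of the granule
            return ((requested_bytes + granule // 2) // granule) * granule
    return requested_bytes  # small allocations are kept as-is
```

As noted in the follow-up message, the rounding would be enforced once, when the GUC value is set, so the cost of the extra arithmetic is irrelevant to the address-translation hot path.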
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
On 2/16/24 04:16, Daniel Gustafsson wrote: On 15 Feb 2024, at 16:49, Peter Eisentraut wrote: 1. All the block ciphers currently supported by crypt() and gen_salt() are not FIPS-compliant. 2. The crypt() and gen_salt() methods built on top of them (modes of operation, kind of) are not FIPS-compliant. I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant ciphers when the OpenSSL it is compiled against is running with FIPS mode enabled, or raise a WARNING when used? It seems rather unlikely that someone running OpenSSL with FIPS=yes wants to use our DES cipher without there being an error or misconfiguration somewhere. Something like the below untested pseudocode.

diff --git a/contrib/pgcrypto/pgcrypto.c b/contrib/pgcrypto/pgcrypto.c
index 96447c5757..3d4391ebe1 100644
--- a/contrib/pgcrypto/pgcrypto.c
+++ b/contrib/pgcrypto/pgcrypto.c
@@ -187,6 +187,14 @@ pg_crypt(PG_FUNCTION_ARGS)
 			   *resbuf;
 	text	   *res;
 
+#if defined FIPS_mode
+	if (FIPS_mode())
+#else
+	if (EVP_default_properties_is_fips_enabled(OSSL_LIB_CTX_get0_global_default()))
+#endif
+		ereport(ERROR,
+				(errmsg("not available when using OpenSSL in FIPS mode")));

Makes sense +1 -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
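As an aside, the same "fail loudly under FIPS" pattern shows up in other stacks: Python builds linked against a FIPS-mode OpenSSL typically refuse MD5 outright. A small illustrative sketch of probing for that condition (behavior depends on how the local OpenSSL/Python were built, so treat this as an assumption, not a guarantee):

```python
import hashlib

def md5_available() -> bool:
    """Return False when the linked crypto library refuses MD5,
    as a FIPS-mode OpenSSL build typically does by raising ValueError."""
    try:
        hashlib.md5(b"probe")
    except ValueError:
        return False
    return True
```

The pgcrypto patch above takes the analogous stance on the server side: rather than silently producing a non-compliant hash, error out as soon as a disallowed primitive is requested.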
Re: pgjdbc is not working with PKCS8 certificates with password
On 2/7/24 06:42, just madhu wrote: On further investigation, with certificate generated as below, JDBC connection is successful.

openssl pkcs8 -topk8 -inform PEM -in client.key -outform DER -out client.pk8 -passout pass:foobar -v1 PBE-MD5-DES

But a connection from pgAdmin (connection failed: \SSLCerts\pk8_pass\client_pass_PBE.pk8": no start line) and psql (psql: error: could not load private key file "client_pass_PBE.pk8": unsupported) is failing. Is there a common way in which certificates with passwords can be created for both libpq and JDBC?

You may want to check with the pgjdbc project on github rather than (or in addition to?) here; see: https://github.com/pgjdbc/pgjdbc/issues

Joe

On Wed, Feb 7, 2024 at 3:17 PM just madhu <justvma...@gmail.com> wrote: Hi, postgresql-42.7.1.jar Trying to establish a connection using a PKCS8 certificate created with password.

openssl pkcs8 -topk8 -inform PEM -in client.key -outform DER -out client.pk8 -passout pass:foobar

I set the properties as below:

...
sslProperties.setProperty("sslkey", "client.pk8");
sslProperties.setProperty("sslpassword", "foobar");
...
Connection connection = DriverManager.getConnection(jdbcUrl, sslProperties);

This is failing with the error:

org.postgresql.util.PSQLException: SSL error: Connection reset
	at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:43)
	at org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:584)
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:168)
...

Regards, Madhu

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [17] CREATE SUBSCRIPTION ... SERVER
On 1/12/24 20:17, Jeff Davis wrote: On Fri, 2024-01-05 at 16:11 +0530, Ashutosh Bapat wrote: I don't think we need to add a test for every FDW. E.g. adding a test in file_fdw would be pointless. But postgres_fdw is special. The test could further create a foreign table ftab_foo on subscriber referencing foo on publisher and then compare the data from foo and ftab_foo to make sure that the replication is happening. This will serve as a good starting point for replicated tables setup in a sharded cluster. Attached updated patch set with added TAP test for postgres_fdw, which uses a postgres_fdw server as the source for subscription connection information. I think 0004 needs a bit more work, so I'm leaving it off for now, but I'll bring it back in the next patch set. I took a quick scan through the patch. The only thing that jumped out at me was that it seems like it might make sense to use quote_literal_cstr() rather than defining your own appendEscapedValue() function? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
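For context on the review comment above: the backend's quote_literal family produces a safely quoted SQL literal, doubling embedded single quotes and switching to the E'...' escape-string form when backslashes are present. A loose Python approximation of those quoting rules, purely as an illustration (not the backend implementation, which also considers settings like standard_conforming_strings):

```python
def quote_literal(s: str) -> str:
    """Approximate PostgreSQL quote_literal(): double any single quotes,
    and use the E'...' escape-string form when backslashes appear."""
    body = s.replace("'", "''")
    if "\\" in s:
        # escape-string syntax: backslashes must be doubled as well
        return "E'" + body.replace("\\", "\\\\") + "'"
    return "'" + body + "'"
```

Reusing an existing, battle-tested helper like quote_literal_cstr() instead of a hand-rolled escaping function avoids exactly the corner cases (quotes, backslashes) sketched here.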
Re: System username in pg_stat_activity
On 1/10/24 08:59, Magnus Hagander wrote: On Wed, Jan 10, 2024 at 2:56 PM Bertrand Drouvot I think it depends what we want the new field to reflect. If it is the exact same thing as the SYSTEM_USER then I think it has to be text (as the SYSTEM_USER is made of "auth_method:identity"). Now if we want it to be "only" the identity part of it, then the `name` datatype would be fine. I'd vote for the exact same thing as the SYSTEM_USER (means auth_method:identity). I definitely think it should be the same. If it's not exactly the same, then it should be *two* columns, one with auth method and one with the name. And thinking more about it maybe that's cleaner, because that makes it easier to do things like filter based on auth method? +1 for the overall feature and +1 for two columns

> + authname name
> +
> + The authentication method and identity (if any) that the user
> + used to log in. It contains the same value as
> + returns in the backend.

I'm fine with auth_method:identity.

> + S.authname,

What about using system_user as the field name? (because if we keep auth_method:identity it's not really the authname anyway). I was worried system_user or sysuser would both be confusing with the fact that we have usesysid -- which would reference a *different* sys... I think if it is exactly "system_user" it would be pretty clearly a match for SYSTEM_USER -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
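If the new column does end up carrying the combined auth_method:identity value, splitting it back apart for filtering is straightforward; a small illustrative sketch (the exact SYSTEM_USER formatting is defined by the server, so the parsing rule below is an assumption):

```python
def split_system_user(value: str):
    """Split a SYSTEM_USER-style 'auth_method:identity' value.

    Only the first colon separates the two parts, since the identity
    itself (e.g. an LDAP DN or Kerberos principal) may contain colons.
    Returns (auth_method, identity), with identity None when the value
    carries no separator at all.
    """
    method, sep, identity = value.partition(":")
    return (method, identity if sep else None)
```

This is essentially the trade-off discussed above: one combined text column matches SYSTEM_USER exactly, while two columns save consumers from doing this split themselves.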
Re: Emitting JSON to file using COPY TO
On 1/8/24 14:36, Dean Rasheed wrote: On Thu, 7 Dec 2023 at 01:10, Joe Conway wrote: The attached should fix the CopyOut response to say one column. Playing around with this, I found a couple of cases that generate an error: COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json); COPY (VALUES (1), (2)) TO stdout WITH (format json); both of those generate the following: ERROR: record type has not been registered Thanks -- will have a look -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Password leakage avoidance
On 1/6/24 15:10, Tom Lane wrote: Joe Conway writes: The only code specific comments were Tom's above, which have been addressed. If there are no serious objections I plan to commit this relatively soon. I had not actually read this patchset before, but now I have, and I have a few minor suggestions: Many thanks!

* The API comment for PQchangePassword should specify that encryption is done according to the server's password_encryption setting, and probably the SGML docs should too. You shouldn't have to read the code to discover that.

Check

* I don't especially care for the useless initializations of encrypted_password, fmtpw, and fmtuser. In all those cases the initial NULL is immediately replaced by a valid value. Probably the compiler will figure out that the initializations are useless, but human readers have to do so as well. Moreover, I think this style is more bug-prone not less so, because if you ever change the logic in a way that causes some code paths to fail to set the variables, you won't get use-of-possibly-uninitialized-value warnings from the compiler.

* Perhaps move the declaration of "buf" to the inner block where it's actually used?

Makes sense -- fixed

* This could be shortened to just "return res":

+	if (!res)
+		return NULL;
+	else
+		return res;

Heh, apparently I needed more coffee at this point :-)

* I'd make the SGML documentation a bit more explicit about the return value, say

+ Returns a PGresult pointer representing
+ the result of the ALTER USER command, or
+ a null pointer if the routine failed before issuing any command.

Fixed. I also ran pgindent. I was kind of surprised/unhappy when it made me change this (which aligned the two var names):

8<
	PQExpBufferData	buf;
	PGresult	   *res;
8<

to this (which leaves the var names unaligned):

8<
	PQExpBufferData buf;
	PGresult   *res;
8<

Anyway, the resulting adjustments attached.
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared 188
*** 191,193 ****
--- 191,194 ----
  PQclosePortal 189
  PQsendClosePrepared 190
  PQsendClosePortal 191
+ PQchangePassword 192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..dcb7c9f 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 ****
--- 1372,1455 ----
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. The password encryption is performed by PQencryptPasswordConn(),
+  * which is passed a NULL for the algorithm argument. Hence encryption
+  * is done according to the server's password_encryption
+  * setting. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the SQL name of the target user,
+  * and the cleartext password.
+  *
+  * Return value is the PGresult of the executed ALTER USER statement
+  * or NULL if we never get there. The caller is responsible to PQclear()
+  * the returned PGresult.
+  *
+  * PQresultStatus() should be called to check the return value for errors,
+  * and PQerrorMessage() used to get more information about such errors.
+  */
+ PGresult *
+ PQchangePassword(PGconn *conn, const char *user, const char *passwd)
+ {
+ 	char	   *encrypted_password = PQencryptPasswordConn(conn, passwd,
+ 														   user, NULL);
+ 
+ 	if (!encrypted_password)
+ 	{
+ 		/* PQencryptPasswordConn() already registered the error */
+ 		return NULL;
+ 	}
+ 	else
+ 	{
+ 		char	   *fmtpw = PQescapeLiteral(conn, encrypted_password,
+ 											strlen(encrypted_password));
+ 
+ 		/* no longer needed, so clean up now */
+ 		PQfreemem(encrypted_password);
+ 
+ 		if (!fmtpw)
+ 		{
+ 			/* PQescapeLiteral() already registered the error */
+ 			return NULL;
+ 		}
+ 		else
+ 		{
+ 			char	   *fmtuser = PQescapeIdentifier(conn, user, strlen(user));
+ 
+ 			if (!fmtuser)
+ 			{
+ 				/* PQescapeIdentifier() already registered th
Re: Password leakage avoidance
On 1/6/24 13:16, Sehrope Sarkuni wrote: On Sat, Jan 6, 2024 at 12:39 PM Joe Conway <mailto:m...@joeconway.com>> wrote: The only code specific comments were Tom's above, which have been addressed. If there are no serious objections I plan to commit this relatively soon. One more thing that we do in pgjdbc is to zero out the input password args so that they don't remain in memory even after being freed. It's kind of odd in Java as it makes the input interface a char[] and we have to convert them to garbage collected Strings internally (which kind of defeats the purpose of the exercise). But in libpq could be done via something like: memset(pw1, 0, strlen(pw1)); memset(pw2, 0, strlen(pw2)); That part is in psql not libpq There was some debate on our end of where to do that and we settled on doing it inside the encoding functions to ensure it always happens. So the input password char[] always gets wiped regardless of how the encoding functions are invoked. Even if it's not added to the password encoding functions (as that kind of changes the after effects if anything was relying on the password still having the password), I think it'd be good to add it to the command.c stuff that has the two copies of the password prior to freeing them. While that change might or might not be worthwhile, I see it as independent of this patch. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Password leakage avoidance
On 12/24/23 10:14, Joe Conway wrote: On 12/23/23 11:00, Tom Lane wrote: Joe Conway writes: The attached patch set moves the guts of \password from psql into the libpq client side -- PQchangePassword() (patch 0001). Haven't really read the patch, just looked at the docs, but here's a bit of bikeshedding: Thanks! * This seems way too eager to promote the use of md5. Surely the default ought to be SCRAM, full stop. I question whether we even need an algorithm parameter. Perhaps it's a good idea for future-proofing, but we could also plan that the function would make its own decisions based on noting the server's version. (libpq is far more likely to be au courant about what to do than the calling application, IMO.) * Parameter order seems a bit odd: to me it'd be natural to write user before password. * Docs claim return type is char *, but then say bool (which is also what the code says). We do not use bool in libpq's API; the admittedly-hoary convention is to use int with 1/0 values. Rather than quibble about that, though, I wonder if we should make the result be the PGresult from the command, which the caller would be responsible to free. That would document error conditions much more flexibly. On the downside, client-side errors would be harder to report, particularly OOM, but I think we have solutions for that in existing functions. * The whole business of nonblock mode seems fairly messy here, and I wonder whether it's worth the trouble to support. If we do want to claim it works then it ought to be tested. All of these (except for the doc "char *" cut-n-pasteo) were due to trying to stay close to the same interface as PQencryptPasswordConn(). But I agree with your assessment and the attached patch series addresses all of them. The original use of PQencryptPasswordConn() in psql passed a NULL for the algorithm, so I dropped that argument entirely. I also swapped user and password arguments because as you pointed out that is more natural. 
This version returns PGresult. As far as special handling for client-side errors like OOM, I don't see anything beyond returning a NULL to signify fatal error, e.g.:

8<--
PGresult *
PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
{
	PGresult   *result;

	result = (PGresult *) malloc(sizeof(PGresult));
	if (!result)
		return NULL;
8<--

That is the approach I took.

One thing I have not done but considered is adding an additional optional parameter to allow "VALID UNTIL" to be set. Seems like it would be useful to be able to set an expiration when setting a new password. No strong opinion about that. Thanks -- hopefully others will weigh in on that.

I just read through the thread and my conclusion is that, specifically related to this patch set (i.e. notwithstanding suggestions for related features), there is consensus in favor of adding this feature. The only code specific comments were Tom's above, which have been addressed. If there are no serious objections I plan to commit this relatively soon.

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared 188
*** 191,193 ****
--- 191,194 ----
  PQclosePortal 189
  PQsendClosePrepared 190
  PQsendClosePortal 191
+ PQchangePassword 192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..a6897f8 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 ****
--- 1372,1459 ----
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the SQL name of the target user,
+  * and the cleartext password.
+  *
+  * Return value is the PGresult of the executed ALTER USER statement
+  * or NULL if we never get there. The caller is responsible to PQclear()
+  * the returned PGresult.
+  *
+  * PQresultStatus() should be called to check the return value for errors,
+  * and PQerrorMessage() used to get more information about such errors.
+  */
+ PGr
Re: Password leakage avoidance
On 1/2/24 07:23, Sehrope Sarkuni wrote: 1. There's two sets of defaults, the client program's default and the server's default. Need to pick one for each implemented function. They don't need to be the same across the board. Is there a concrete recommendation here? 2. Password encoding should be split out and made available as its own functions. Not just as part of a wider "create _or_ alter a user's password" function attached to a connection. It already is split out in libpq[1], unless I don't understand you correctly. [1] https://www.postgresql.org/docs/current/libpq-misc.html#LIBPQ-PQENCRYPTPASSWORDCONN We went a step further and added an intermediate function that generates the "ALTER USER ... PASSWORD" SQL. I don't think we want that in libpq, but in any case separate patch/discussion IMHO. 3. We only added an "ALTER USER ... PASSWORD" function, not anything to create a user. There's way too many options for that and keeping this targeted at just assigning passwords makes it much easier to test. +1 Also separate patch/discussion, but I don't think the CREATE version is needed. 4. RE:defaults, the PGJDBC approach is that the encoding-only function uses the driver's default (SCRAM). The "alterUserPassword(...)" uses the server's default (again usually SCRAM for modern installs but could be something else). It's kind of odd that they're different but the use cases are different as well. Since PQencryptPasswordConn() already exists, and psql's "\password" used it with its defaults, I don't think we want to change the behavior. The patch as written behaves in a backward compatible way. 5. Our SCRAM specific function allows for customizing the algo iteration and salt parameters. That topic came up on hackers previously[1]. Our high level "alterUserPassword(...)" function does not have those options but it is available as part of our PasswordUtil SCRAM API for advanced users who want to leverage it. 
The higher level functions have defaults for iteration counts (4096) and salt size (16-bytes). Again separate patch/discussion, IMHO. 6. Getting the verbiage right for the PGJDBC version was kind of annoying as we wanted to match the server's wording, e.g. "password_encryption", but it's clearly hashing, not encryption. We settled on "password encoding" for describing the overall task with the comments referencing the server's usage of the term "password_encryption". Found a similar topic[2] on changing that recently as well but looks like that's not going anywhere. Really this is irrelevant to this discussion, because the new function is called PQchangePassword(). The function PQencryptPasswordConn() has been around for a while and the horse is out of the gate on that one. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG
On 1/5/24 12:56, Robert Haas wrote: On Sun, Aug 27, 2023 at 4:25 PM Heikki Linnakangas wrote: I think you got that it backwards. 'perl_locale_obj' is set to the perl interpreter's locale, whenever we are *outside* the interpreter. This thread has had no update for more than 4 months, so I'm marking the CF entry RwF for now. It can always be reopened, if Joe or Tristan or Heikki or someone else picks it up again. It is definitely a bug, so I do plan to get back to it at some point, hopefully sooner rather than later... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: SET ROLE x NO RESET
On 12/30/23 17:19, Michał Kłeczek wrote: On 30 Dec 2023, at 17:16, Eric Hanson wrote: What do you think of adding a NO RESET option to the SET ROLE command? What I proposed some time ago is SET ROLE … GUARDED BY 'password', so that you could later: RESET ROLE WITH 'password' I like that too, but see it as a separate feature. FWIW that is also supported by the set_user extension referenced elsewhere on this thread. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Build versionless .so for Android
On 12/31/23 01:24, Michael Paquier wrote: On Sun, Dec 31, 2023 at 06:08:04AM +0100, Matthias Kuhn wrote: I was wondering if there are a) any comments on the approach and if it should be handed in for a commitfest (currently waiting for the cooldown period after account activation, I am not sure how long that is) Such discussions are adapted in a commit fest entry. No idea if there is a cooldown period after account creation before being able to touch the CF app contents, though. FWIW I have expedited the cooldown period, so Matthias can log in right away. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: SET ROLE x NO RESET
On 12/30/23 11:16, Eric Hanson wrote: Hi, What do you think of adding a NO RESET option to the SET ROLE command? Right now Postgres can enforce data security with roles and RLS, but role-per-end-user doesn't really scale: Db connections are per-role, so a connection pooler can't share connections across users. We can work around this with policies that use session variables and checks against current_user, but it seems like role-per end user would be more beautiful. If SET ROLE had a NO RESET option, you could connect through a connection pooler as a privileged user, but downgrade to the user's role for the duration of the session. +1 I agree this would be useful. In the meantime, in case it helps, see https://github.com/pgaudit/set_user Specifically set_session_auth(text): - When set_session_auth(text) is called, the effective session and current user is switched to the rolename supplied, irrevocably. Unlike set_user() or set_user_u(), it does not affect logging nor allowed statements. If set_user.exit_on_error is "on" (the default), and any error occurs during execution, a FATAL error is thrown and the backend session exits. ----- -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
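The "guarded" reset idea from this thread can be modeled in a few lines. This toy sketch (hypothetical semantics, not anything PostgreSQL or set_user implements in this form) shows how a pooler could drop a session to the end user's role, keep a secret reset token on its own side, and refuse to restore privileges without it:

```python
import secrets

class PooledSession:
    """Toy model of 'SET ROLE ... GUARDED BY <secret>' semantics:
    switching to a less-privileged role hands back a one-time token,
    and RESET is honored only when that token is presented."""

    def __init__(self, login_role: str):
        self.current_role = login_role
        self._login_role = login_role
        self._guard = None

    def set_role_guarded(self, role: str) -> str:
        """Switch to a less-privileged role; return the guard token."""
        token = secrets.token_hex(16)
        self._guard = token
        self.current_role = role
        return token  # the pooler keeps this; the application never sees it

    def reset_role(self, token: str) -> None:
        """Restore the login role, but only with the matching token."""
        if self._guard is None or not secrets.compare_digest(token, self._guard):
            raise PermissionError("RESET ROLE not permitted without guard token")
        self.current_role = self._login_role
        self._guard = None
```

The point of both NO RESET and the guarded variant is the same: SQL injected through the application cannot simply RESET ROLE back to the pooler's privileged login role.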
Re: Password leakage avoidance
On 12/27/23 16:09, Tom Lane wrote: Joe Conway writes: On 12/27/23 15:39, Peter Eisentraut wrote: On 23.12.23 16:13, Joe Conway wrote: The attached patch set moves the guts of \password from psql into the libpq client side -- PQchangePassword() (patch 0001). I don't follow how you get from the problem statement to this solution. This proposal doesn't avoid password leakage, does it? It just provides a different way to phrase the existing solution. Yes, a fully built one that is convenient to use, and does not ask everyone to roll their own. It's convenient for users of libpq, I guess, but it doesn't help anyone not writing C code directly atop libpq. If this is the way forward then we need to also press JDBC and other client libraries to implement comparable functionality. That's within the realm of sanity surely, and having a well-thought-through reference implementation in libpq would help those authors. So I don't think this is a strike against the patch; but the answer to Peter's question has to be that this is just part of the solution. As mentioned downthread by Dave Cramer, JDBC is already onboard. And as Jonathan said in an adjacent part of the thread: This should be generally helpful, as it allows libpq-based drivers to adopt this method and provide a safer mechanism for setting/changing passwords! (which we should also promote once available). Which is definitely something I have had in mind all along. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Password leakage avoidance
On 12/27/23 15:39, Peter Eisentraut wrote: On 23.12.23 16:13, Joe Conway wrote: I have recently, once again for the umpteenth time, been involved in discussions around (paraphrasing) "why does Postgres leak the passwords into the logs when they are changed". I know well that the canonical advice is something like "use psql with \password if you care about that". And while that works, it is a deeply unsatisfying answer for me to give and for the OP to receive. The alternative is something like "...well if you don't like that, use PQencryptPasswordConn() to roll your own solution that meets your security needs". Again, not a spectacular answer IMHO. It amounts to "here is a do-it-yourself kit, go put it together". It occurred to me that we can, and really should, do better. The attached patch set moves the guts of \password from psql into the libpq client side -- PQchangePassword() (patch 0001). The usage in psql serves as a ready built-in test for the libpq function (patch 0002). Docs included too (patch 0003). I don't follow how you get from the problem statement to this solution. This proposal doesn't avoid password leakage, does it? Yes, it most certainly does. The plaintext password would never be seen by the server and therefore never logged. This is exactly why the feature already existed in psql. It just provides a different way to phrase the existing solution. Yes, a fully built one that is convenient to use, and does not ask everyone to roll their own. Who is a potential user of this solution? Literally every company that has complained that Postgres pollutes their logs with plaintext passwords. I have heard the request to provide a better solution many times, over many years, while working for three different companies. Right now it just saves a dozen lines in psql, but it's not clear how it improves anything else. It is to me, and so far no one else has complained about that. More opinions would be welcomed of course. 
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
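For the curious, the "encrypted on the client side" step discussed above boils down to building a SCRAM-SHA-256 verifier locally, so only the derived secret ever travels to the server (and can safely appear in logs). A standalone sketch of that derivation in the shape PostgreSQL stores (RFC 5802/7677 style); the 4096-iteration and 16-byte-salt defaults are illustrative, and real clients also apply SASLprep normalization, which is omitted here:

```python
import base64
import hashlib
import hmac
import os

def scram_sha256_verifier(password: str, salt: bytes = None,
                          iterations: int = 4096) -> str:
    """Build a SCRAM-SHA-256 secret of the form
    SCRAM-SHA-256$<iterations>:<salt>$<StoredKey>:<ServerKey>.
    Note: real clients SASLprep the password first (omitted here)."""
    if salt is None:
        salt = os.urandom(16)
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()
    b64 = lambda b: base64.b64encode(b).decode("ascii")
    return (f"SCRAM-SHA-256${iterations}:{b64(salt)}"
            f"${b64(stored_key)}:{b64(server_key)}")
```

Because the verifier is one-way, the server (and anything logging the ALTER USER statement) learns only the salted, hashed material, never the cleartext password; that is the leakage-avoidance property the patch provides via PQencryptPasswordConn().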
Re: Password leakage avoidance
On 12/24/23 12:22, Tom Lane wrote: Joe Conway writes: Completely unrelated process bikeshedding: I changed the naming scheme I used for the split patch-set this time. I don't know if we have a settled/documented pattern for such naming, but the original pattern which I borrowed from someone else's patches was "vX--description.patch". As far as that goes, that filename pattern is what is generated by "git format-patch". I agree that the digit-count choices are a tad odd, but they're not so awful as to be worth trying to override. Ah, knew it was something like that. I am still a curmudgeon doing things the old way ¯\_(ツ)_/¯ The new pattern I picked is "description-vXXX-NN.patch" which fixes all of those issues. Only if you use the same "description" for all patches of a series, which seems kind of not the point. In any case, "git format-patch" is considered best practice for a multi-patch series AFAIK, so we have to cope with its ideas about how to name the files. Even if I wanted some differentiating name for the individual patches in a set, I still like them to be grouped because it is one unit of work from my perspective. Oh well, I guess I will get with the program and put every patch-set into its own directory. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Password leakage avoidance
On 12/23/23 11:00, Tom Lane wrote: Joe Conway writes: The attached patch set moves the guts of \password from psql into the libpq client side -- PQchangePassword() (patch 0001). Haven't really read the patch, just looked at the docs, but here's a bit of bikeshedding: Thanks! * This seems way too eager to promote the use of md5. Surely the default ought to be SCRAM, full stop. I question whether we even need an algorithm parameter. Perhaps it's a good idea for future-proofing, but we could also plan that the function would make its own decisions based on noting the server's version. (libpq is far more likely to be au courant about what to do than the calling application, IMO.) * Parameter order seems a bit odd: to me it'd be natural to write user before password. * Docs claim return type is char *, but then say bool (which is also what the code says). We do not use bool in libpq's API; the admittedly-hoary convention is to use int with 1/0 values. Rather than quibble about that, though, I wonder if we should make the result be the PGresult from the command, which the caller would be responsible to free. That would document error conditions much more flexibly. On the downside, client-side errors would be harder to report, particularly OOM, but I think we have solutions for that in existing functions. * The whole business of nonblock mode seems fairly messy here, and I wonder whether it's worth the trouble to support. If we do want to claim it works then it ought to be tested. All of these (except for the doc "char *" cut-n-pasteo) were due to trying to stay close to the same interface as PQencryptPasswordConn(). But I agree with your assessment and the attached patch series addresses all of them. The original use of PQencryptPasswordConn() in psql passed a NULL for the algorithm, so I dropped that argument entirely. I also swapped user and password arguments because as you pointed out that is more natural. This version returns PGresult. 
As far as special handling for client-side errors like OOM, I don't see anything beyond returning a NULL to signify fatal error, e.g.:

8<--
PGresult *
PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
{
	PGresult   *result;

	result = (PGresult *) malloc(sizeof(PGresult));
	if (!result)
		return NULL;
8<--

That is the approach I took.

One thing I have not done, but considered, is adding an additional optional parameter to allow "VALID UNTIL" to be set. Seems like it would be useful to be able to set an expiration when setting a new password.

No strong opinion about that.

Thanks -- hopefully others will weigh in on that.

Completely unrelated process bikeshedding: I changed the naming scheme I used for the split patch-set this time. I don't know if we have a settled/documented pattern for such naming, but the original pattern which I borrowed from someone else's patches was "vX--description.patch". The problems I have with that are 1/ there may well be more than 10 versions of a patch-set, 2/ there are probably never going to be more than 2 digits worth of files in a patch-set, and 3/ the description coming after the version and file identifiers causes the patches in my local directory to sort poorly, intermingling several unrelated patch-sets. The new pattern I picked is "description-vXXX-NN.patch" which fixes all of those issues. Does that bother anyone? *Should* we try to agree on a desired pattern (assuming there is not one already)?
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared           188
*** 191,193 ****
--- 191,194 ----
  PQclosePortal             189
  PQsendClosePrepared       190
  PQsendClosePortal         191
+ PQchangePassword          192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..a6897f8 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 ****
--- 1372,1459 ----
  	return crypt_pwd;
  }
+
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  *
Password leakage avoidance
I have recently, once again for the umpteenth time, been involved in discussions around (paraphrasing) "why does Postgres leak the passwords into the logs when they are changed". I know well that the canonical advice is something like "use psql with \password if you care about that". And while that works, it is a deeply unsatisfying answer for me to give and for the OP to receive. The alternative is something like "...well if you don't like that, use PQencryptPasswordConn() to roll your own solution that meets your security needs". Again, not a spectacular answer IMHO. It amounts to "here is a do-it-yourself kit, go put it together". It occurred to me that we can, and really should, do better. The attached patch set moves the guts of \password from psql into the libpq client side -- PQchangePassword() (patch 0001). The usage in psql serves as a ready built-in test for the libpq function (patch 0002). Docs included too (patch 0003). One thing I have not done, but considered, is adding an additional optional parameter to allow "VALID UNTIL" to be set. Seems like it would be useful to be able to set an expiration when setting a new password. I will register this in the upcoming commitfest, but meantime thoughts/comments/etc. would be gratefully received.
Thanks,

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared           188
*** 191,193 ****
--- 191,194 ----
  PQclosePortal             189
  PQsendClosePrepared       190
  PQsendClosePortal         191
+ PQchangePassword          192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..3ee67ba 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 ****
--- 1372,1463 ----
  	return crypt_pwd;
  }
+
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the cleartext password, the SQL
+  * name of the user it is for, and a string indicating the algorithm to
+  * use for encrypting the password. If algorithm is NULL,
+  * PQencryptPasswordConn() queries the server for the current
+  * 'password_encryption' value. If you wish to avoid that, e.g. to avoid
+  * blocking, you can execute 'show password_encryption' yourself before
+  * calling this function, and pass it as the algorithm.
+  *
+  * Return value is a boolean, true for success. On error, an error message
+  * is stored in the connection object, and returns false.
+  */
+ bool
+ PQchangePassword(PGconn *conn, const char *passwd, const char *user,
+                  const char *algorithm)
+ {
+ 	char	   *encrypted_password = NULL;
+ 	PQExpBufferData buf;
+ 	bool		success = true;
+
+ 	encrypted_password = PQencryptPasswordConn(conn, passwd, user, algorithm);
+
+ 	if (!encrypted_password)
+ 	{
+ 		/* PQencryptPasswordConn() already registered the error */
+ 		success = false;
+ 	}
+ 	else
+ 	{
+ 		char	   *fmtpw = NULL;
+
+ 		fmtpw = PQescapeLiteral(conn, encrypted_password,
+ 								strlen(encrypted_password));
+
+ 		/* no longer needed, so clean up now */
+ 		PQfreemem(encrypted_password);
+
+ 		if (!fmtpw)
+ 		{
+ 			/* PQescapeLiteral() already registered the error */
+ 			success = false;
+ 		}
+ 		else
+ 		{
+ 			char	   *fmtuser = NULL;
+
+ 			fmtuser = PQescapeIdentifier(conn, user, strlen(user));
+ 			if (!fmtuser)
+ 			{
+ 				/* PQescapeIdentifier() already registered the error */
+ 				success = false;
+ 				PQfreemem(fmtpw);
+ 			}
+ 			else
+ 			{
+ 				PGresult   *res;
+
+ 				initPQExpBuffer(&buf);
+ 				printfPQExpBuffer(&buf, "ALTER USER %s PASSWORD %s",
+ 								  fmtuser, fmtpw);
+
+ 				res = PQexec(conn, buf.data);
+ 				if (!res)
+ 					success = false;
+ 				else
+ 					PQclear(res);
+
+ 				/* clean up */
+ 				termPQExpBuffer(&buf);
+ 				PQfreemem(fmtuser);
+ 				PQfreemem(fmtpw);
+ 			}
+ 		}
+ 	}
+
+ 	return success;
+ }
diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h
index 97762d5..f6e7207 100644
*** a/src/interfaces/libpq/libpq-fe.h
--- b/src/interfaces/libpq/libpq-fe.h
*** e
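[For readers skimming the patch: the command construction it performs can be sketched outside libpq. This is a hedged, self-contained approximation only -- escape_literal() and escape_identifier() below are simplistic stand-ins for the real PQescapeLiteral()/PQescapeIdentifier(), which also handle client encoding and error reporting, and build_alter_user() mirrors the printfPQExpBuffer() step.]

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for PQescapeLiteral(): wrap in single quotes,
 * doubling any embedded single quote. Real code must use libpq. */
static void escape_literal(char *dst, size_t dstlen, const char *src)
{
	size_t		n = 0;

	dst[n++] = '\'';
	for (; *src && n < dstlen - 2; src++)
	{
		if (*src == '\'')
			dst[n++] = '\'';	/* double embedded single quotes */
		dst[n++] = *src;
	}
	dst[n++] = '\'';
	dst[n] = '\0';
}

/* Hypothetical stand-in for PQescapeIdentifier(): wrap in double quotes,
 * doubling any embedded double quote. */
static void escape_identifier(char *dst, size_t dstlen, const char *src)
{
	size_t		n = 0;

	dst[n++] = '"';
	for (; *src && n < dstlen - 2; src++)
	{
		if (*src == '"')
			dst[n++] = '"';		/* double embedded double quotes */
		dst[n++] = *src;
	}
	dst[n++] = '"';
	dst[n] = '\0';
}

/* Build the command text the patch ultimately sends via PQexec(). */
void build_alter_user(char *out, size_t outlen,
					  const char *user, const char *encrypted_password)
{
	char		fmtuser[256];
	char		fmtpw[256];

	escape_identifier(fmtuser, sizeof(fmtuser), user);
	escape_literal(fmtpw, sizeof(fmtpw), encrypted_password);
	snprintf(out, outlen, "ALTER USER %s PASSWORD %s", fmtuser, fmtpw);
}
```

The point of the real function, of course, is that encrypted_password comes from PQencryptPasswordConn(), so only the already-encrypted verifier ever appears in the command text that could be logged.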
Re: Eager page freeze criteria clarification
On 12/21/23 10:56, Melanie Plageman wrote: On Sat, Dec 9, 2023 at 9:24 AM Joe Conway wrote: However, even if we assume a more-or-less normal distribution, we should consider using subgroups in a way similar to Statistical Process Control[1]. The reasoning is explained in this quote:

The Math Behind Subgroup Size

The Central Limit Theorem (CLT) plays a pivotal role here. According to CLT, as the subgroup size (n) increases, the distribution of the sample means will approximate a normal distribution, regardless of the shape of the population distribution. Therefore, as your subgroup size increases, your control chart limits will narrow, making the chart more sensitive to special cause variation and more prone to false alarms.

I haven't read anything about statistical process control until you mentioned this. I read the link you sent and also googled around a bit. I was under the impression that the more samples we have, the better. But, it seems like this may not be the assumption in statistical process control? It may help us to get more specific. I'm not sure what the relationship between "unsets" in my code and subgroup members would be. The article you linked suggests that each subgroup should be of size 5 or smaller. Translating that to my code, were you imagining subgroups of "unsets" (each time we modify a page that was previously all-visible)?

Basically, yes. It might not make sense, but I think we could test the theory by plotting a histogram of the raw data, and then also plot a histogram based on sub-grouping every 5 sequential values in your accumulator. If the former does not look very normal (I would guess most workloads it will be skewed with a long tail) and the latter looks to be more normal, then it would say we were on the right track. There are statistical tests for "normalness" that could be applied too (e.g.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6350423/#sec2-13title ) which would be a more rigorous approach, but the quick look at histograms might be sufficiently convincing. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
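[To make the subgrouping idea above concrete, here is a small hedged sketch -- not tied to the actual accumulator code in the patch -- that collapses a raw series of observations into subgroup means of size 5. Per the CLT argument quoted earlier, the means should look more normal than the raw values, so these are what one would histogram alongside the raw data.]

```c
#include <stddef.h>

/*
 * Collapse a series of raw observations into subgroup means of size n,
 * writing the means to out. Any incomplete trailing group is dropped.
 * Returns the number of subgroup means written.
 */
size_t
subgroup_means(const double *raw, size_t len, size_t n, double *out)
{
	size_t		groups = len / n;

	for (size_t g = 0; g < groups; g++)
	{
		double		sum = 0.0;

		for (size_t i = 0; i < n; i++)
			sum += raw[g * n + i];
		out[g] = sum / (double) n;
	}
	return groups;
}
```

With n = 5 (the subgroup size the linked article suggests), comparing a histogram of raw values against a histogram of the subgroup means would show whether the averaging is enough to justify normal-distribution assumptions.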
Re: Eager page freeze criteria clarification
er table vacuum, especially in case of long-running vacuums.

To benchmark this new heuristic (I'm calling it algo 6 since it is the 6th I've proposed on this thread), I've used a modified subset of my original workloads:

Workloads

C. Shifting hot set: 32 clients inserting multiple rows and then updating an indexed column of a row on a different page containing data they formerly inserted. Only recent data is updated.

H. Append only table: 32 clients, each inserting a single row at a time

I. Work queue: 32 clients, each inserting a row, then updating an indexed column in 2 different rows in nearby pages, then deleting 1 of the updated rows

Workload C:

Algo | Table    | AVs | Page Freezes | Pages Frozen | % Frozen
-----|----------|-----|--------------|--------------|---------
6    | hot_tail | 14  | 2,111,824    | 1,643,217    | 53.4%
M    | hot_tail | 16  | 241,605      | 3,921        | 0.1%

Algo | WAL GB | Cptr Bgwr Writes | Other r/w  | AV IO time | TPS
-----|--------|------------------|------------|------------|-------
6    | 193    | 5,473,949        | 12,793,574 | 14,870     | 28,397
M    | 207    | 5,213,763        | 20,362,315 | 46,190     | 28,461

Algorithm 6 freezes all of the cold data and doesn't freeze the current working set. The notable thing is how much this reduces overall system I/O. On master, autovacuum is doing more than 3x the I/O and the rest of the system is doing more than 1.5x the I/O. I suspect freezing data when it is initially vacuumed is saving future vacuums from having to evict pages of the working set and read in cold data.

Workload H:

Algo | Table     | AVs | Page Freezes | Pages Frozen | % frozen
-----|-----------|-----|--------------|--------------|---------
6    | hthistory | 22  | 668,448      | 668,003      | 87%
M    | hthistory | 22  | 0            | 0            | 0%

Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO time | TPS
-----|--------|------------------|-----------|------------|-------
6    | 14     | 725,103          | 725,575   | 1          | 43,535
M    | 13     | 693,945          | 694,417   | 1          | 43,522

The insert-only table is mostly frozen at the end. There is more I/O done but not at the expense of TPS. This is exactly what we want.
Workload I:

Algo | Table     | AVs | Page Freezes | Pages Frozen | % Frozen
-----|-----------|-----|--------------|--------------|---------
6    | workqueue | 234 | 0            | 4,416        | 78%
M    | workqueue | 234 | 0            | 4,799        | 87%

Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO Time | TPS
-----|--------|------------------|-----------|------------|-------
6    | 74     | 64,345           | 64,813    | 1          | 36,366
M    | 73     | 63,468           | 63,936    | 1          | 36,145

What we want is for the work queue table to freeze as little as possible, because we will be constantly modifying the data. Both on master and with algorithm 6 we do not freeze tuples on any pages. You will notice, however, that much of the table is set frozen in the VM at the end. This is because we set pages all frozen in the VM if they are technically all frozen even if we do not freeze tuples on the page. This is inexpensive and not under the control of the freeze heuristic.

Overall, the behavior of this new adaptive freezing method seems to be exactly what we want. The next step is to decide how many values to remove from the accumulator and benchmark cases where old data is deleted. I'd be delighted to receive any feedback, ideas, questions, or review.

This is well thought out, well described, and a fantastic improvement in my view -- well done!

I do think we will need to consider distributions other than normal, but I don't know offhand what they will be. However, even if we assume a more-or-less normal distribution, we should consider using subgroups in a way similar to Statistical Process Control[1]. The reasoning is explained in this quote:

The Math Behind Subgroup Size

The Central Limit Theorem (CLT) plays a pivotal role here. According to CLT, as the subgroup size (n) increases, the distribution of the sample means will approximate a normal distribution, regardless of the shape of the population distribution. Therefore, as your subgroup size increases, your control chart limits will narrow, making the chart more sensitive to special cause variation and more prone to false alarms.
[1] https://www.qualitygurus.com/rational-subgrouping-in-statistical-process-control/ -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/8/23 14:45, Daniel Verite wrote: Joe Conway wrote: copyto_json.007.diff

When the source has json fields with non-significant line feeds, the COPY output has these line feeds too, which makes the output incompatible with rule #2 at https://jsonlines.org ("2. Each Line is a Valid JSON Value").

create table j(f json);
insert into j values('{"a":1,
"b":2
}');
copy j to stdout (format json);

Result:
{"f":{"a":1,
"b":2
}}

Is that expected? copy.sgml in 007 doesn't describe the output in terms of lines so it's hard to tell from the doc.

The patch as-is just does the equivalent of row_to_json():

8<
select row_to_json(j) from j;
 row_to_json
--------------
 {"f":{"a":1,+
 "b":2       +
 }}
(1 row)
8<

So yeah, that is how it works today. I will take a look at what it would take to fix it.

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/7/23 09:11, David G. Johnston wrote: Those are all the same breakage though - if truly interpreted as data rows the protocol is basically written such that the array format is not supportable and only the lines format can be used. Hence my “format 0 doesn’t work” comment for array output and we should explicitly add format 2 where we explicitly decouple lines of output from rows of data. That said, it would seem in practice format 0 already decouples them and so the current choice of the brackets on their own lines is acceptable. I’d prefer to keep them on their own line.

WFM ¯\_(ツ)_/¯ I am merely responding with options to the many people opining on the thread.

I also don’t know why you introduced another level of object nesting here. That seems quite undesirable.

I didn't add anything. It is an artifact of the particular query I wrote in the copy to statement (I did "select ss from ss" instead of "select * from ss"), mea culpa. This is what the latest patch, as written today, outputs:

8<--
copy (select 1, g.i from generate_series(1, 3) g(i))
to stdout (format json, force_array);
[
 {"?column?":1,"i":1}
,{"?column?":1,"i":2}
,{"?column?":1,"i":3}
]
8<--

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/7/23 08:52, Joe Conway wrote: Or maybe this is preferred?

8<--
[{"ss":{"f1":1,"f2":1}},
 {"ss":{"f1":1,"f2":2}},
 {"ss":{"f1":1,"f2":3}}]
8<--

I don't know why my mail client keeps adding extra spaces, but the intention here is a single space in front of row 2 and 3 in order to line the json objects up at column 2.

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/7/23 08:35, Daniel Verite wrote: Joe Conway wrote: The attached should fix the CopyOut response to say one column. I.e. it ought to look something like:

Spending more time with the doc I came to the opinion that in this bit of the protocol, in CopyOutResponse (B)

... Int16 The number of columns in the data to be copied (denoted N below). ...

this number must be the number of columns in the source. That is for COPY table(a,b,c) the number is 3, independently of whether the result is formatted in text, csv, json or binary.

I think that changing it for json can reasonably be interpreted as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData messages that come next is not a new issue. A reader that doesn't know the field separator and whether it's text or csv cannot parse these messages into fields anyway. But just knowing how many columns there are in the original data might be useful by itself and we don't want to break that.

Ok, that sounds reasonable to me -- I will revert that change.

The other question for me is, in the CopyData message, this bit: "Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a "data row" is a stretch. That might be interpreted as a protocol break, depending on how strict the interpretation is.

If we really think that is a problem I can see about changing it to this format for json array:

8<--
copy
(
  with ss(f1, f2) as
  (
    select 1, g.i from generate_series(1, 3) g(i)
  )
  select ss from ss
) to stdout (format json, force_array);
[{"ss":{"f1":1,"f2":1}}
,{"ss":{"f1":1,"f2":2}}
,{"ss":{"f1":1,"f2":3}}]
8<--

Is this acceptable to everyone?

Or maybe this is preferred?

8<--
[{"ss":{"f1":1,"f2":1}},
 {"ss":{"f1":1,"f2":2}},
 {"ss":{"f1":1,"f2":3}}]
8<--

Or as long as we are painting the shed, maybe this?
8<--
[{"ss":{"f1":1,"f2":1}},
{"ss":{"f1":1,"f2":2}},
{"ss":{"f1":1,"f2":3}}]
8<--

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 21:56, Nathan Bossart wrote: On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote: If Nathan's perf results hold up elsewhere, it seems like some micro-optimization around the text-pushing (appendStringInfoString) might be more useful than caching. The 7% spent in cache lookups could be worth going after later, but it's not the top of the list.

Hah, it turns out my benchmark of 110M integers really stresses the JSONTYPE_NUMERIC path in datum_to_json_internal(). That particular path calls strlen() twice: once for IsValidJsonNumber(), and once in appendStringInfoString(). If I save the result from IsValidJsonNumber() and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster. It's probably worth giving datum_to_json_internal() a closer look in a new thread.

Yep, after looking through that code I was going to make the point that your 110M integer test was over-indexing on that one type. I am sure there are other micro-optimizations to be made here, but I also think that it is outside the scope of the COPY TO JSON patch.

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
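[For anyone following along, the micro-optimization Nathan describes is the classic compute-strlen()-once pattern. A hedged, self-contained sketch: Buf and append_binary() below are toy stand-ins for StringInfo and appendBinaryStringInfo(), and the validity check is only indicated by a comment.]

```c
#include <string.h>

/* Toy stand-in for StringInfo, just to make the pattern runnable here */
typedef struct
{
	char		data[256];
	size_t		len;
} Buf;

/* Append n raw bytes, as appendBinaryStringInfo() would */
static void
append_binary(Buf *buf, const char *s, size_t n)
{
	memcpy(buf->data + buf->len, s, n);
	buf->len += n;
	buf->data[buf->len] = '\0';
}

/*
 * Before the change: IsValidJsonNumber() computed strlen() internally and
 * appendStringInfoString() computed it again. After: compute the length
 * once and reuse it for both the validity check and the append.
 */
void
append_json_number(Buf *buf, const char *numstr)
{
	size_t		n = strlen(numstr); /* computed exactly once */

	/* the IsValidJsonNumber()-style check would reuse n here */
	append_binary(buf, numstr, n);	/* no second strlen() */
}
```

Shaving one strlen() per datum looks trivial, but it is in the innermost per-value loop of COPY, which is how it adds up to the ~8% Nathan measured.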
Re: Emitting JSON to file using COPY TO
On 12/6/23 20:09, David G. Johnston wrote: On Wed, Dec 6, 2023 at 5:57 PM Joe Conway <m...@joeconway.com> wrote: On 12/6/23 19:39, David G. Johnston wrote:
> On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <m...@joeconway.com> wrote:
> But I still cannot shake the belief that using a format code of 1 -
> which really could be interpreted as meaning "textual csv" in practice -
> for this JSON output is unwise and we should introduce a new integer
> value for the new fundamental output format.

No, I am pretty sure you still have that wrong. The "1" means binary mode

Ok. I made the same typo twice, I did mean to write 0 instead of 1.

Fair enough. But the point that we should introduce a 2 still stands. The new code would mean: use text output functions but that there is no inherent tabular structure in the underlying contents. Instead the copy format was JSON and the output layout is dependent upon the json options in the copy command and that there really shouldn't be any attempt to turn the contents directly into a tabular data structure like you presently do with the CSV data under format 0. Ignore the column count and column formats as they are fixed or non-existent.

I think that amounts to a protocol change, which we tend to avoid at all costs.

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 18:09, Joe Conway wrote: On 12/6/23 14:47, Joe Conway wrote: On 12/6/23 13:59, Daniel Verite wrote: Andrew Dunstan wrote: IMNSHO, we should produce either a single JSON document (the ARRAY case) or a series of JSON documents, one per row (the LINES case).

"COPY Operations" in the doc says: "The backend sends a CopyOutResponse message to the frontend, followed by zero or more CopyData messages (always one per row), followed by CopyDone".

In the ARRAY case, the first messages with the copyjsontest regression test look like this (tshark output):

PostgreSQL
    Type: CopyOut response
    Length: 13
    Format: Text (0)
    Columns: 3
        Format: Text (0)
PostgreSQL
    Type: Copy data
    Length: 6
    Copy data: 5b0a
PostgreSQL
    Type: Copy data
    Length: 76
    Copy data: 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify as a row of data with 3 columns as advertised in the CopyOut message. Isn't that a problem?

Is it a real problem, or just a bit of documentation change that I missed? Anything receiving this and looking for a json array should know how to assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal "3" for the json mode case -- that should really say "1" I think, because the row is not represented as 3 columns but rather 1 json object. Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more "rows" than actual output rows (the "[" and the "]"), but maybe those are less likely to cause some hazard?

The attached should fix the CopyOut response to say one column. I.e. it ought to look something like:

PostgreSQL
    Type: CopyOut response
    Length: 13
    Format: Text (0)
    Columns: 1
        Format: Text (0)
PostgreSQL
    Type: Copy data
    Length: 6
    Copy data: 5b0a
PostgreSQL
    Type: Copy data
    Length: 76
    Copy data: [...]
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..8915fb3 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 ****
--- 43,49 ----
      FORCE_QUOTE { ( column_name [, ...] ) | * }
      FORCE_NOT_NULL { ( column_name [, ...] ) | * }
      FORCE_NULL { ( column_name [, ...] ) | * }
+     FORCE_ARRAY [ boolean ]
      ENCODING 'encoding_name'
*** COPY { ta
*** 206,214 ****
--- 207,220 ----
        Selects the data format to be read or written:
        text,
        csv (Comma Separated Values),
+       json (JavaScript Object Notation),
        or binary.
        The default is text.
+
+       The json option is allowed only in
+       COPY TO.
+
*** COPY { ta
*** 372,377 ****
--- 378,396 ----
+
+     FORCE_ARRAY
+
+       Force output of square brackets as array decorations at the beginning
+       and end of output, and commas between the rows. It is allowed only in
+       COPY TO, and only when using
+       JSON format. The default is
+       false.
+
+
    ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..23b570f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 ****
--- 419,425 ----
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;

  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 ****
--- 444,451 ----
  		/* default format */ ;
  	else if (strcmp(fmt, "csv") == 0)
  		opts_out->csv_mode = true;
+ 	else if (strcmp(fmt, "json") == 0)
+ 		opts_out->json_mode = true;
  	else if (strcmp(fmt, "binary") == 0)
  		opts_out->binary = true;
  	else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 ****
--- 543,555 ----
  						 defel->defname),
  				 parser_errposition(pstate, defel->location)));
  	}
+ 	else if (strcmp(defel->defname, "force_array") == 0)
+ 	{
+ 		if (force_array_specified)
+ 			errorConflictingDefElem(defel, pstate);
+ 		force_array_specified = true;
+ 		opts_out->force_array = defGetBoolean(defel);
+ 	}
  	else if (strcmp(defel->defname, "convert_selectively") == 0)
  	{
  		/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 ****
--- 608,625 ----
  			(errcode(ERRCODE_SYNTAX_
Re: Emitting JSON to file using COPY TO
On 12/6/23 19:39, David G. Johnston wrote: On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <m...@joeconway.com> wrote: "The backend sends a CopyOutResponse message to the frontend, followed by zero or more CopyData messages (always one per row), followed by CopyDone" probably "always one per row" would be changed to note that json array format outputs two extra rows for the start/end bracket.

Fair, I was ascribing much more semantic meaning to this than it wants. I don't see any real requirement, given the lack of semantics, to mention JSON at all. It is one CopyData per row, regardless of the contents. We don't delineate between the header and non-header data in CSV. It isn't a protocol concern.

good point

But I still cannot shake the belief that using a format code of 1 - which really could be interpreted as meaning "textual csv" in practice - for this JSON output is unwise and we should introduce a new integer value for the new fundamental output format.

No, I am pretty sure you still have that wrong. The "1" means binary mode. As in

8<--
FORMAT

Selects the data format to be read or written: text, csv (Comma Separated Values), or binary. The default is text.
8<--

That is completely separate from text and csv. It literally means to use the binary output functions instead of the usual ones:

8<--
if (cstate->opts.binary)
    getTypeBinaryOutputInfo(attr->atttypid,
                            &out_func_oid,
                            &isvarlena);
else
    getTypeOutputInfo(attr->atttypid,
                      &out_func_oid,
                      &isvarlena);
8<--

Both "text" and "csv" modes use non-binary output formats. I believe the JSON output format is also non-binary.

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 18:38, David G. Johnston wrote: On Wed, Dec 6, 2023 at 4:28 PM David G. Johnston <david.g.johns...@gmail.com> wrote: On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <m...@joeconway.com> wrote: On 12/6/23 14:47, Joe Conway wrote:
> On 12/6/23 13:59, Daniel Verite wrote:
>> Andrew Dunstan wrote:
>>
>>> IMNSHO, we should produce either a single JSON
>>> document (the ARRAY case) or a series of JSON documents, one per row
>>> (the LINES case).
>>
>> "COPY Operations" in the doc says:
>>
>> "The backend sends a CopyOutResponse message to the frontend, followed
>> by zero or more CopyData messages (always one per row), followed by
>> CopyDone".
>>
>> In the ARRAY case, the first messages with the copyjsontest
>> regression test look like this (tshark output):
>>
>> PostgreSQL
>>     Type: CopyOut response
>>     Length: 13
>>     Format: Text (0)
>>     Columns: 3
>>         Format: Text (0)
> Anything receiving this and looking for a json array should know how to
> assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal "3" for the json mode case -- that should really say "1" I think, because the row is not represented as 3 columns but rather 1 json object. Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more "rows" than actual output rows (the "[" and the "]"), but maybe those are less likely to cause some hazard?

What is the limitation, if any, of introducing new type codes for these. n = 2..N for the different variants? Or even -1 for "raw text"? And document that columns and structural rows need to be determined out-of-band. Continuing to use 1 (text) for this non-csv data seems like a hack even if we can technically make it function. The semantics, especially for the array case, are completely discarded or wrong.
Also, it seems like this answer would be easier to make if we implement COPY FROM now since how is the server supposed to deal with decomposing this data into tables without accurate type information? I don't see implementing only half of the feature being a good idea. I've had much more desire for FROM compared to TO personally. Several people have weighed in on the side of getting COPY TO done by itself first. Given how long this discussion has already become for a relatively small and simple feature, I am a big fan of not expanding the scope now. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 18:28, David G. Johnston wrote: On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mailto:m...@joeconway.com>> wrote: On 12/6/23 14:47, Joe Conway wrote: > On 12/6/23 13:59, Daniel Verite wrote: >> Andrew Dunstan wrote: >> >>> IMNSHO, we should produce either a single JSON >>> document (the ARRAY case) or a series of JSON documents, one per row >>> (the LINES case). >> >> "COPY Operations" in the doc says: >> >> " The backend sends a CopyOutResponse message to the frontend, followed >> by zero or more CopyData messages (always one per row), followed by >> CopyDone". >> >> In the ARRAY case, the first messages with the copyjsontest >> regression test look like this (tshark output): >> >> PostgreSQL >> Type: CopyOut response >> Length: 13 >> Format: Text (0) >> Columns: 3 >> Format: Text (0) > Anything receiving this and looking for a json array should know how to > assemble the data correctly despite the extra CopyData messages. Hmm, maybe the real problem here is that Columns do not equal "3" for the json mode case -- that should really say "1" I think, because the row is not represented as 3 columns but rather 1 json object. Does that sound correct? Assuming yes, there is still maybe an issue that there are two more "rows" that actual output rows (the "[" and the "]"), but maybe those are less likely to cause some hazard? What is the limitation, if any, of introducing new type codes for these. n = 2..N for the different variants? Or even -1 for "raw text"? And document that columns and structural rows need to be determined out-of-band. Continuing to use 1 (text) for this non-csv data seems like a hack even if we can technically make it function. The semantics, especially for the array case, are completely discarded or wrong. I am not following you here. SendCopyBegin looks like this currently: 8< SendCopyBegin(CopyToState cstate) { StringInfoData buf; int natts = list_length(cstate->attnumlist); int16 format = (cstate->opts.binary ? 
1 : 0);
	int			i;

	pq_beginmessage(&buf, PqMsg_CopyOutResponse);
	pq_sendbyte(&buf, format);	/* overall format */
	pq_sendint16(&buf, natts);
	for (i = 0; i < natts; i++)
		pq_sendint16(&buf, format); /* per-column formats */
	pq_endmessage(&buf);
	cstate->copy_dest = COPY_FRONTEND;
}
8<
The "1" is saying are we binary mode or not. JSON mode will never be sending in binary in the current implementation at least. And it always aggregates all the columns as one json object. So the correct answer is (I think):
8<
*** SendCopyBegin(CopyToState cstate)
*** 146,154 ****
  	pq_beginmessage(&buf, PqMsg_CopyOutResponse);
  	pq_sendbyte(&buf, format);	/* overall format */
! 	pq_sendint16(&buf, natts);
! 	for (i = 0; i < natts; i++)
! 		pq_sendint16(&buf, format); /* per-column formats */
  	pq_endmessage(&buf);
  	cstate->copy_dest = COPY_FRONTEND;
  }
--- 150,169 ----
  	pq_beginmessage(&buf, PqMsg_CopyOutResponse);
  	pq_sendbyte(&buf, format);	/* overall format */
! 	if (!cstate->opts.json_mode)
! 	{
! 		pq_sendint16(&buf, natts);
! 		for (i = 0; i < natts; i++)
! 			pq_sendint16(&buf, format); /* per-column formats */
! 	}
! 	else
! 	{
! 		/*
! 		 * JSON mode is always one non-binary column
! 		 */
! 		pq_sendint16(&buf, 1);
! 		pq_sendint16(&buf, 0);
! 	}
  	pq_endmessage(&buf);
  	cstate->copy_dest = COPY_FRONTEND;
  }
8<
That still leaves the need to fix the documentation: " The backend sends a CopyOutResponse message to the frontend, followed by zero or more CopyData messages (always one per row), followed by CopyDone" -- probably "always one per row" would be changed to note that json array format outputs two extra rows for the start/end bracket. In fact, as written the patch does this:
8<
COPY (SELECT x.i, 'val' || x.i as v FROM generate_series(1, 3) x(i) WHERE false) TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
]
8<
Not sure if that is a problem or not. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 14:47, Joe Conway wrote: On 12/6/23 13:59, Daniel Verite wrote: Andrew Dunstan wrote: IMNSHO, we should produce either a single JSON document (the ARRAY case) or a series of JSON documents, one per row (the LINES case). "COPY Operations" in the doc says: " The backend sends a CopyOutResponse message to the frontend, followed by zero or more CopyData messages (always one per row), followed by CopyDone". In the ARRAY case, the first messages with the copyjsontest regression test look like this (tshark output): PostgreSQL Type: CopyOut response Length: 13 Format: Text (0) Columns: 3 Format: Text (0) PostgreSQL Type: Copy data Length: 6 Copy data: 5b0a PostgreSQL Type: Copy data Length: 76 Copy data: 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031… The first Copy data message with contents "5b0a" does not qualify as a row of data with 3 columns as advertised in the CopyOut message. Isn't that a problem? Is it a real problem, or just a bit of documentation change that I missed? Anything receiving this and looking for a json array should know how to assemble the data correctly despite the extra CopyData messages. Hmm, maybe the real problem here is that Columns do not equal "3" for the json mode case -- that should really say "1" I think, because the row is not represented as 3 columns but rather 1 json object. Does that sound correct? Assuming yes, there is still maybe an issue that there are two more "rows" than actual output rows (the "[" and the "]"), but maybe those are less likely to cause some hazard? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 16:42, Sehrope Sarkuni wrote: On Wed, Dec 6, 2023 at 4:29 PM Joe Conway <mailto:m...@joeconway.com>> wrote: > 1. Outputting a top level JSON object without the additional column > keys. IIUC, the top level keys are always the column names. A common use > case would be a single json/jsonb column that is already formatted > exactly as the user would like for output. Rather than enveloping it in > an object with a dedicated key, it would be nice to be able to output it > directly. This would allow non-object results to be outputted as well > (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is > structured, I think this would play nice with the JSON lines v.s. array > concept. > > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, > SOME_OPTION_TO_NOT_ENVELOPE) > {"foo":1} > {"foo":2} > {"foo":3} Your example does not match what you describe, or do I misunderstand? I thought your goal was to eliminate the repeated "foo" from each row... The "foo" in this case is explicit as I'm adding it when building the object. What I was trying to show was not adding an additional object wrapper / envelope. So each row is: {"foo":1} Rather than: "{"json_build_object":{"foo":1}} I am still getting confused ;-) Let's focus on the current proposed patch with a "minimum required feature set". Right now the default behavior is "JSON lines": 8<--- COPY (SELECT x.i, 'val' || x.i as v FROM generate_series(1, 3) x(i)) TO STDOUT WITH (FORMAT JSON); {"i":1,"v":"val1"} {"i":2,"v":"val2"} {"i":3,"v":"val3"} 8<--- and the other, non-default option is "JSON array": 8<--- COPY (SELECT x.i, 'val' || x.i as v FROM generate_series(1, 3) x(i)) TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY); [ {"i":1,"v":"val1"} ,{"i":2,"v":"val2"} ,{"i":3,"v":"val3"} ] 8<--- So the questions are: 1. Do those two formats work for the initial implementation? 2. Is the default correct or should it be switched e.g. 
rather than specifying FORCE_ARRAY to get an array, something like FORCE_NO_ARRAY to get JSON lines and the JSON array is default? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 11:28, Sehrope Sarkuni wrote: Big +1 to this overall feature. cool! Regarding the defaults for the output, I think JSON lines (rather than a JSON array of objects) would be preferred. It's more natural to combine them and generate that type of data on the fly rather than forcing aggregation into a single object. So that is +2 (Sehrope and me) for the status quo (JSON lines), and +2 (Andrew and Davin) for defaulting to json arrays. Anyone else want to weigh in on that issue? Couple more features / use cases come to mind as well. Even if they're not part of a first round of this feature I think it'd be helpful to document them now as it might give some ideas for what does make that first cut: 1. Outputting a top level JSON object without the additional column keys. IIUC, the top level keys are always the column names. A common use case would be a single json/jsonb column that is already formatted exactly as the user would like for output. Rather than enveloping it in an object with a dedicated key, it would be nice to be able to output it directly. This would allow non-object results to be outputted as well (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is structured, I think this would play nice with the JSON lines v.s. array concept. COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_NOT_ENVELOPE) {"foo":1} {"foo":2} {"foo":3} Your example does not match what you describe, or do I misunderstand? I thought your goal was to eliminate the repeated "foo" from each row... 2. An option to ignore null fields so they are excluded from the output. This would not be a default but would allow shrinking the total size of the output data in many situations. This would be recursive to allow nested objects to be shrunk down (not just the top level). 
This might be worthwhile as a standalone JSON function though handling it during output would be more efficient as it'd only be read once. COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS) {} {"foo":2} {"foo":3} clear enough I think 3. Reverse of #2 when copying data in to allow defaulting missing fields to NULL. good to record the ask, but applies to a different feature (COPY FROM instead of COPY TO). -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 11:44, Nathan Bossart wrote: On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote: (format csv) Time: 12295.480 ms (00:12.295) Time: 12311.059 ms (00:12.311) Time: 12305.469 ms (00:12.305) (format json) Time: 24568.621 ms (00:24.569) Time: 23756.234 ms (00:23.756) Time: 24265.730 ms (00:24.266) I should also note that the json output is 85% larger than the csv output. I'll see if I can add some caching to composite_to_json(), but based on the relative data size it does not sound like there is much performance left on the table to go after, no? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 13:59, Daniel Verite wrote: Andrew Dunstan wrote: IMNSHO, we should produce either a single JSON document (the ARRAY case) or a series of JSON documents, one per row (the LINES case). "COPY Operations" in the doc says: " The backend sends a CopyOutResponse message to the frontend, followed by zero or more CopyData messages (always one per row), followed by CopyDone". In the ARRAY case, the first messages with the copyjsontest regression test look like this (tshark output): PostgreSQL Type: CopyOut response Length: 13 Format: Text (0) Columns: 3 Format: Text (0) PostgreSQL Type: Copy data Length: 6 Copy data: 5b0a PostgreSQL Type: Copy data Length: 76 Copy data: 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031… The first Copy data message with contents "5b0a" does not qualify as a row of data with 3 columns as advertised in the CopyOut message. Isn't that a problem? Is it a real problem, or just a bit of documentation change that I missed? Anything receiving this and looking for a json array should know how to assemble the data correctly despite the extra CopyData messages. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 10:44, Tom Lane wrote: Joe Conway writes: I believe this is ready to commit unless there are further comments or objections. I thought we were still mostly at proof-of-concept stage? The concept is narrowly scoped enough that I think we are homing in on the final patch. In particular, has anyone done any performance testing? I'm concerned about that because composite_to_json() has zero capability to cache any metadata across calls, meaning there is going to be a large amount of duplicated work per row. I will devise some kind of test and report back. I suppose something with many rows and many narrow columns comparing time to COPY text/csv/json modes would do the trick? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 10:32, Andrew Dunstan wrote: On 2023-12-06 We 08:49, Joe Conway wrote: On 12/6/23 07:36, Andrew Dunstan wrote: On 2023-12-05 Tu 16:46, Joe Conway wrote: On 12/5/23 16:20, Andrew Dunstan wrote: On 2023-12-05 Tu 16:09, Joe Conway wrote: On 12/5/23 16:02, Joe Conway wrote: On 12/5/23 15:55, Andrew Dunstan wrote: and in any other case (e.g. LINES) I can't see why you would have them. Oh I didn't address this -- I saw examples in the interwebs of MSSQL server I think [1] which had the non-array with commas import and export style. It was not that tough to support and the code as written already does it, so why not? That seems quite absurd, TBH. I know we've catered for some absurdity in the CSV code (much of it down to me), so maybe we need to be liberal in what we accept here too. IMNSHO, we should produce either a single JSON document (the ARRAY case) or a series of JSON documents, one per row (the LINES case). So your preference would be to not allow the non-array-with-commas case but if/when we implement COPY FROM we would accept that format? As in Postel'a law ("be conservative in what you do, be liberal in what you accept from others")? Yes, I think so. Awesome. The attached does it that way. I also ran pgindent. I believe this is ready to commit unless there are further comments or objections. Sorry to bikeshed a little more, I'm a bit late looking at this. I suspect that most users will actually want the table as a single JSON document, so it should probably be the default. In any case FORCE_ARRAY as an option has a slightly wrong feel to it. Sure, I can make that happen, although I figured that for the many-rows-scenario the single array size might be an issue for whatever you are importing into. I'm having trouble coming up with a good name for the reverse of that, off the top of my head. Will think about it and propose something with the next patch revision. 
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/6/23 07:36, Andrew Dunstan wrote: On 2023-12-05 Tu 16:46, Joe Conway wrote: On 12/5/23 16:20, Andrew Dunstan wrote: On 2023-12-05 Tu 16:09, Joe Conway wrote: On 12/5/23 16:02, Joe Conway wrote: On 12/5/23 15:55, Andrew Dunstan wrote: and in any other case (e.g. LINES) I can't see why you would have them. Oh I didn't address this -- I saw examples in the interwebs of MSSQL server I think [1] which had the non-array with commas import and export style. It was not that tough to support and the code as written already does it, so why not? That seems quite absurd, TBH. I know we've catered for some absurdity in the CSV code (much of it down to me), so maybe we need to be liberal in what we accept here too. IMNSHO, we should produce either a single JSON document (the ARRAY case) or a series of JSON documents, one per row (the LINES case). So your preference would be to not allow the non-array-with-commas case but if/when we implement COPY FROM we would accept that format? As in Postel'a law ("be conservative in what you do, be liberal in what you accept from others")? Yes, I think so. Awesome. The attached does it that way. I also ran pgindent. I believe this is ready to commit unless there are further comments or objections. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com Add json format mode to COPY TO Add json format mode support to COPY TO, which includes two output variations: 1) "json lines" which is each row as a json object delimited by newlines (the default); and 2) "json array" which is the same as #1, but with the addition of a leading "[", trailing "]", and comma row delimiters, to form a valid json array. Early versions: helpful hints/reviews provided by Nathan Bossart, Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan and Davin Shearer. 
Requested-by: Davin Shearer Author: Joe Conway Reviewed-by: Andrew Dunstan, Davin Shearer Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5 diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index 18ecc69..8915fb3 100644 *** a/doc/src/sgml/ref/copy.sgml --- b/doc/src/sgml/ref/copy.sgml *** COPY { ta *** 43,48 --- 43,49 FORCE_QUOTE { ( column_name [, ...] ) | * } FORCE_NOT_NULL { ( column_name [, ...] ) | * } FORCE_NULL { ( column_name [, ...] ) | * } + FORCE_ARRAY [ boolean ] ENCODING 'encoding_name' *** COPY { ta *** 206,214 --- 207,220 Selects the data format to be read or written: text, csv (Comma Separated Values), + json (JavaScript Object Notation), or binary. The default is text. + + The json option is allowed only in + COPY TO. + *** COPY { ta *** 372,377 --- 378,396 + + + FORCE_ARRAY + + + Force output of square brackets as array decorations at the beginning + and end of output, and commas between the rows. It is allowed only in + COPY TO, and only when using + JSON format. The default is + false. 
+ + + ENCODING diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfad47b..23b570f 100644 *** a/src/backend/commands/copy.c --- b/src/backend/commands/copy.c *** ProcessCopyOptions(ParseState *pstate, *** 419,424 --- 419,425 bool format_specified = false; bool freeze_specified = false; bool header_specified = false; + bool force_array_specified = false; ListCell *option; /* Support external use for option sanity checking */ *** ProcessCopyOptions(ParseState *pstate, *** 443,448 --- 444,451 /* default format */ ; else if (strcmp(fmt, "csv") == 0) opts_out->csv_mode = true; + else if (strcmp(fmt, "json") == 0) + opts_out->json_mode = true; else if (strcmp(fmt, "binary") == 0) opts_out->binary = true; else *** ProcessCopyOptions(ParseState *pstate, *** 540,545 --- 543,555 defel->defname), parser_errposition(pstate, defel->location))); } + else if (strcmp(defel->defname, "force_array") == 0) + { + if (force_array_specified) + errorConflictingDefElem(defel, pstate); + force_array_specified = true; + opts_out->force_array = defGetBoolean(defel); + } else if (strcmp(defel->defname, "convert_selectively") == 0) { /* *** ProcessCopyOptions(ParseState *pstate, *** 598,603 --- 608,625 (errcode(ERRCODE_SYNTAX_ERROR), errmsg("cannot specify DEFAULT in BINARY mode"))); + if (opts_out->
Re: Emitting JSON to file using COPY TO
On 12/5/23 16:20, Andrew Dunstan wrote: On 2023-12-05 Tu 16:09, Joe Conway wrote: On 12/5/23 16:02, Joe Conway wrote: On 12/5/23 15:55, Andrew Dunstan wrote: and in any other case (e.g. LINES) I can't see why you would have them. Oh I didn't address this -- I saw examples in the interwebs of MSSQL server I think [1] which had the non-array with commas import and export style. It was not that tough to support and the code as written already does it, so why not? That seems quite absurd, TBH. I know we've catered for some absurdity in the CSV code (much of it down to me), so maybe we need to be liberal in what we accept here too. IMNSHO, we should produce either a single JSON document (the ARRAY case) or a series of JSON documents, one per row (the LINES case). So your preference would be to not allow the non-array-with-commas case but if/when we implement COPY FROM we would accept that format? As in Postel's law ("be conservative in what you do, be liberal in what you accept from others")? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/5/23 16:12, Andrew Dunstan wrote: On 2023-12-05 Tu 16:02, Joe Conway wrote: On 12/5/23 15:55, Andrew Dunstan wrote: On 2023-12-05 Tu 14:50, Davin Shearer wrote: Hi Joe, In reviewing the 005 patch, I think that when used with FORCE ARRAY, we should also _imply_ FORCE ROW DELIMITER. I can't envision a use case where someone would want to use FORCE ARRAY without also using FORCE ROW DELIMITER. I can, however, envision a use case where someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe including into a larger array. I definitely appreciate these options and the flexibility that they afford from a user perspective. In the test output, will you also show the different variations with FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), (false, true), (true, true)}? Technically you've already shown me the (false, false) case as those are the defaults. I don't understand the point of FORCE_ROW_DELIMITER at all. There is only one legal delimiter of array items in JSON, and that's a comma. There's no alternative and it's not optional. So in the array case you MUST have commas and in any other case (e.g. LINES) I can't see why you would have them. The current patch already *does* imply row delimiters in the array case. It says so here: 8<--- + + FORCE_ARRAY + + + Force output of array decorations at the beginning and end of output. + This option implies the FORCE_ROW_DELIMITER + option. It is allowed only in COPY TO, and only + when using JSON format. + The default is false. 
+ 8<--- and it does so here: 8<--- + if (opts_out->force_array) + opts_out->force_row_delimiter = true; 8<--- and it shows that here: 8<--- + copy copytest to stdout (format json, force_array); + [ + {"style":"DOS","test":"abc\r\ndef","filler":1} + ,{"style":"Unix","test":"abc\ndef","filler":2} + ,{"style":"Mac","test":"abc\rdef","filler":3} + ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4} + ] 8<--- It also does not allow explicitly setting row delimiters false while force_array is true here: 8<--- + if (opts_out->force_array && + force_row_delimiter_specified && + !opts_out->force_row_delimiter) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot specify FORCE_ROW_DELIMITER false with FORCE_ARRAY true"))); 8<--- Am I understanding something incorrectly? But what's the point of having it if you're not using FORCE_ARRAY? See the follow up email -- other databases support it so why not? It seems to be a thing... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/5/23 16:02, Joe Conway wrote: On 12/5/23 15:55, Andrew Dunstan wrote: and in any other case (e.g. LINES) I can't see why you would have them. Oh I didn't address this -- I saw examples in the interwebs of MSSQL server I think [1] which had the non-array with commas import and export style. It was not that tough to support and the code as written already does it, so why not? [1] https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/5/23 15:55, Andrew Dunstan wrote: On 2023-12-05 Tu 14:50, Davin Shearer wrote: Hi Joe, In reviewing the 005 patch, I think that when used with FORCE ARRAY, we should also _imply_ FORCE ROW DELIMITER. I can't envision a use case where someone would want to use FORCE ARRAY without also using FORCE ROW DELIMITER. I can, however, envision a use case where someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe including into a larger array. I definitely appreciate these options and the flexibility that they afford from a user perspective. In the test output, will you also show the different variations with FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), (false, true), (true, true)}? Technically you've already shown me the (false, false) case as those are the defaults. I don't understand the point of FORCE_ROW_DELIMITER at all. There is only one legal delimiter of array items in JSON, and that's a comma. There's no alternative and it's not optional. So in the array case you MUST have commas and in any other case (e.g. LINES) I can't see why you would have them. The current patch already *does* imply row delimiters in the array case. It says so here: 8<--- + + FORCE_ARRAY + + + Force output of array decorations at the beginning and end of output. + This option implies the FORCE_ROW_DELIMITER + option. It is allowed only in COPY TO, and only + when using JSON format. + The default is false. 
+ 8<--- and it does so here: 8<--- + if (opts_out->force_array) + opts_out->force_row_delimiter = true; 8<--- and it shows that here: 8<--- + copy copytest to stdout (format json, force_array); + [ + {"style":"DOS","test":"abc\r\ndef","filler":1} + ,{"style":"Unix","test":"abc\ndef","filler":2} + ,{"style":"Mac","test":"abc\rdef","filler":3} + ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4} + ] 8<--- It also does not allow explicitly setting row delimiters false while force_array is true here: 8<--- + if (opts_out->force_array && + force_row_delimiter_specified && + !opts_out->force_row_delimiter) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot specify FORCE_ROW_DELIMITER false with FORCE_ARRAY true"))); 8<--- Am I understanding something incorrectly? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/5/23 12:43, Davin Shearer wrote: Joe, those test cases look great and the outputs are the same as `jq`. Forward slash escaping is optional, so not escaping them in Postgres is okay. The important thing is that the software _reading_ JSON interprets both '\/' and '/' as '/'. Thanks for the review and info. I modified the existing regression test thus:
8<--
create temp table copyjsontest (
  id bigserial,
  f1 text,
  f2 timestamptz);

insert into copyjsontest
  select g.i,
         CASE WHEN g.i % 2 = 0 THEN
           'line with '' in it: ' || g.i::text
         ELSE
           'line with " in it: ' || g.i::text
         END,
         'Mon Feb 10 17:32:01 1997 PST'
  from generate_series(1,5) as g(i);

insert into copyjsontest (f1) values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text);

copy copyjsontest to stdout json;
{"id":1,"f1":"line with \" in it: 1","f2":"1997-02-10T20:32:01-05:00"}
{"id":2,"f1":"line with ' in it: 2","f2":"1997-02-10T20:32:01-05:00"}
{"id":3,"f1":"line with \" in it: 3","f2":"1997-02-10T20:32:01-05:00"}
{"id":4,"f1":"line with ' in it: 4","f2":"1997-02-10T20:32:01-05:00"}
{"id":5,"f1":"line with \" in it: 5","f2":"1997-02-10T20:32:01-05:00"}
{"id":1,"f1":"aaa\"bbb","f2":null}
{"id":2,"f1":"aaa\\bbb","f2":null}
{"id":3,"f1":"aaa/bbb","f2":null}
{"id":4,"f1":"aaa\bbbb","f2":null}
{"id":5,"f1":"aaa\fbbb","f2":null}
{"id":6,"f1":"aaa\nbbb","f2":null}
{"id":7,"f1":"aaa\rbbb","f2":null}
{"id":8,"f1":"aaa\tbbb","f2":null}
8<--
I think the code, documentation, and tests are in pretty good shape at this point. Latest version attached. Any other comments or complaints out there?
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com Add json format mode to COPY TO Add json format mode support to COPY TO, which includes three output variations: 1) "json lines" which is each row as a json object delimited by newlines (the default); 2) "json lines", except include comma delimiters between json objects; and 3) "json array" which is the same as #2, but with the addition of a leading "[" and trailing "]" to form a valid json array. Early versions: helpful hints/reviews provided by Nathan Bossart, Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan and Davin Shearer. Requested-by: Davin Shearer Author: Joe Conway Reviewed-by: Andrew Dunstan, Davin Shearer Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5 diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index 18ecc69..af8777b 100644 *** a/doc/src/sgml/ref/copy.sgml --- b/doc/src/sgml/ref/copy.sgml *** COPY { ta *** 43,48 --- 43,50 FORCE_QUOTE { ( column_name [, ...] ) | * } FORCE_NOT_NULL { ( column_name [, ...] ) | * } FORCE_NULL { ( column_name [, ...] ) | * } + FORCE_ARRAY [ boolean ] + FORCE_ROW_DELIMITER [ boolean ] ENCODING 'encoding_name' *** COPY { ta *** 206,214 --- 208,221 Selects the data format to be read or written: text, csv (Comma Separated Values), + json (JavaScript Object Notation), or binary. The default is text. + + The json option is allowed only in + COPY TO. + *** COPY { ta *** 372,377 --- 379,410 + + + FORCE_ROW_DELIMITER + + + Force output of commas as row delimiters, in addition to the usual + end of line characters. This option is allowed only in + COPY TO, and only when using + JSON format. + The default is false. + + + + + + FORCE_ARRAY + + + Force output of array decorations at the beginning and end of output. + This option implies the FORCE_ROW_DELIMITER + option. 
It is allowed only in COPY TO, and only + when using JSON format. + The default is false. + + + ENCODING diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index c
Re: Emitting JSON to file using COPY TO
On 12/4/23 21:54, Joe Conway wrote: On 12/4/23 17:55, Davin Shearer wrote: There are however a few characters that need to be escaped:

1. " (double quote)
2. \ (backslash)
3. / (forward slash)
4. \b (backspace)
5. \f (form feed)
6. \n (new line)
7. \r (carriage return)
8. \t (horizontal tab)

These characters should be represented in the test cases to see how the escaping behaves and to ensure that the escaping is done properly per JSON requirements. I can look at adding these as test cases. So I did a quick check:
8<--
with t(f1) as
(
  values
   (E'aaa\"bbb'::text),
   (E'aaa\\bbb'::text),
   (E'aaa\/bbb'::text),
   (E'aaa\bbbb'::text),
   (E'aaa\fbbb'::text),
   (E'aaa\nbbb'::text),
   (E'aaa\rbbb'::text),
   (E'aaa\tbbb'::text)
)
select length(t.f1), t.f1, row_to_json(t) from t;
 length |     f1     |    row_to_json
--------+------------+-------------------
      7 | aaa"bbb    | {"f1":"aaa\"bbb"}
      7 | aaa\bbb    | {"f1":"aaa\\bbb"}
      7 | aaa/bbb    | {"f1":"aaa/bbb"}
      7 | aaa\x08bbb | {"f1":"aaa\bbbb"}
      7 | aaa\x0Cbbb | {"f1":"aaa\fbbb"}
      7 | aaa       +| {"f1":"aaa\nbbb"}
        | bbb        |
      7 | aaa\rbbb   | {"f1":"aaa\rbbb"}
      7 | aaa bbb    | {"f1":"aaa\tbbb"}
(8 rows)
8<--
This is all independent of my patch for COPY TO. If I am reading that correctly, everything matches Davin's table *except* the forward slash ("/"). I defer to the experts on the thread to debate that... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: introduce dynamic shared memory registry
On 12/4/23 22:46, Nathan Bossart wrote: Every once in a while, I find myself wanting to use shared memory in a loadable module without requiring it to be loaded at server start via shared_preload_libraries. The DSM API offers a nice way to create and manage dynamic shared memory segments, so creating a segment after server start is easy enough. However, AFAICT there's no easy way to teach other backends about the segment without storing the handles in shared memory, which puts us right back at square one. The attached 0001 introduces a "DSM registry" to solve this problem. The API provides an easy way to allocate/initialize a segment or to attach to an existing one. The registry itself is just a dshash table that stores the handles keyed by a module-specified string. 0002 adds a test for the registry that demonstrates basic usage. I don't presently have any concrete plans to use this for anything, but I thought it might be useful for extensions for caching, etc. and wanted to see whether there was any interest in the feature. Notwithstanding any dragons there may be, and not having actually looked at the patches, I love the concept! -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/4/23 17:55, Davin Shearer wrote: Sorry about the top posting / top quoting... the link you sent me gives me a 404. I'm not exactly sure what top quoting / posting means and Googling those terms wasn't helpful for me, but I've removed the quoting that my mail client is automatically "helpfully" adding to my emails. I mean no offense. No offense taken. But it is worthwhile to conform to the very long established norms of the mailing lists on which you participate. See: https://en.wikipedia.org/wiki/Posting_style I would describe the Postgres list style (based on that link) as "inline replying, in which the different parts of the reply follow the relevant parts of the original post...[with]...trimming of the original text" There are however a few characters that need to be escaped:

1. " (double quote)
2. \ (backslash)
3. / (forward slash)
4. \b (backspace)
5. \f (form feed)
6. \n (new line)
7. \r (carriage return)
8. \t (horizontal tab)

These characters should be represented in the test cases to see how the escaping behaves and to ensure that the escaping is done properly per JSON requirements. I can look at adding these as test cases. The latest version of the patch (attached) includes some of that already. For reference, the tests so far include this:
8<---
test=# select * from copytest;
  style  |   test   | filler
---------+----------+--------
 DOS     | abc\r   +|      1
         | def      |
 Unix    | abc     +|      2
         | def      |
 Mac     | abc\rdef |      3
 esc\ape | a\r\\r\ +|      4
         | \nb      |
(4 rows)

test=# copy copytest to stdout (format json);
{"style":"DOS","test":"abc\r\ndef","filler":1}
{"style":"Unix","test":"abc\ndef","filler":2}
{"style":"Mac","test":"abc\rdef","filler":3}
{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
8<---
At this point "COPY TO" should be sending exactly the unaltered output of the postgres JSON processing functions.
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index 18ecc69..af8777b 100644 *** a/doc/src/sgml/ref/copy.sgml --- b/doc/src/sgml/ref/copy.sgml *** COPY { ta *** 43,48 --- 43,50 FORCE_QUOTE { ( column_name [, ...] ) | * } FORCE_NOT_NULL { ( column_name [, ...] ) | * } FORCE_NULL { ( column_name [, ...] ) | * } + FORCE_ARRAY [ boolean ] + FORCE_ROW_DELIMITER [ boolean ] ENCODING 'encoding_name' *** COPY { ta *** 206,214 --- 208,221 Selects the data format to be read or written: text, csv (Comma Separated Values), + json (JavaScript Object Notation), or binary. The default is text. + + The json option is allowed only in + COPY TO. + *** COPY { ta *** 372,377 --- 379,410 + + + FORCE_ROW_DELIMITER + + + Force output of commas as row delimiters, in addition to the usual + end of line characters. This option is allowed only in + COPY TO, and only when using + JSON format. + The default is false. + + + + + + FORCE_ARRAY + + + Force output of array decorations at the beginning and end of output. + This option implies the FORCE_ROW_DELIMITER + option. It is allowed only in COPY TO, and only + when using JSON format. + The default is false. 
+ + + ENCODING diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfad47b..0236a9e 100644 *** a/src/backend/commands/copy.c --- b/src/backend/commands/copy.c *** ProcessCopyOptions(ParseState *pstate, *** 419,424 --- 419,426 bool format_specified = false; bool freeze_specified = false; bool header_specified = false; + bool force_row_delimiter_specified = false; + bool force_array_specified = false; ListCell *option; /* Support external use for option sanity checking */ *** ProcessCopyOptions(ParseState *pstate, *** 443,448 --- 445,452 /* default format */ ; else if (strcmp(fmt, "csv") == 0) opts_out->csv_mode = true; + else if (strcmp(fmt, "json") == 0) + opts_out->json_mode = true; else if (strcmp(fmt, "binary") == 0) opts_out->binary = true; else *** ProcessCopyOptions(ParseState *pstate, *** 540,545 --- 544,563 defel->de
Re: Emitting JSON to file using COPY TO
On 12/4/23 09:25, Andrew Dunstan wrote: On 2023-12-04 Mo 08:37, Joe Conway wrote: On 12/4/23 07:41, Andrew Dunstan wrote: On 2023-12-03 Su 20:14, Joe Conway wrote: (please don't top quote on the Postgres lists) On 12/3/23 17:38, Davin Shearer wrote: " being quoted as \\" breaks the JSON. It needs to be \". This has been my whole problem with COPY TO for JSON. Please validate that the output is in proper format with correct quoting for special characters. I use `jq` on the command line to validate and format the output. I just hooked existing "row-to-json machinery" up to the "COPY TO" statement. If the output is wrong (just for this use case?), that would be a missing feature (or possibly a bug?). Davin -- how did you work around the issue with the way the built-in functions output JSON? Andrew -- comments/thoughts? I meant to mention this when I was making comments yesterday. The patch should not be using CopyAttributeOutText - it will try to escape characters such as \, which produces the effect complained of here, or else we need to change its setup so we have a way to inhibit that escaping. Interesting. I am surprised this has never been raised as a problem with COPY TO before. Should the JSON output, as produced by composite_to_json(), be sent as-is with no escaping at all? If yes, is JSON somehow unique in this regard? Text mode output is in such a form that it can be read back in using text mode input. There's nothing special about JSON in this respect - any text field will be escaped too. But output suitable for text mode input is not what you're trying to produce here; you're trying to produce valid JSON. So, yes, the result of composite_to_json, which is already suitably escaped, should not be further escaped in this case. Gotcha. This patch version uses CopySendData() instead and includes documentation changes. Still lacks regression tests. Hopefully this looks better. Any other particular strings I ought to test with? 
8<-- test=# copy (select * from foo limit 4) to stdout (format json, force_array true); [ {"id":1,"f1":"line with \" in it: 1","f2":"2023-12-03T12:26:41.596053-05:00"} ,{"id":2,"f1":"line with ' in it: 2","f2":"2023-12-03T12:26:41.596173-05:00"} ,{"id":3,"f1":"line with \" in it: 3","f2":"2023-12-03T12:26:41.596179-05:00"} ,{"id":4,"f1":"line with ' in it: 4","f2":"2023-12-03T12:26:41.596182-05:00"} ] 8<-- -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index 18ecc69..af8777b 100644 *** a/doc/src/sgml/ref/copy.sgml --- b/doc/src/sgml/ref/copy.sgml *** COPY { ta *** 43,48 --- 43,50 FORCE_QUOTE { ( column_name [, ...] ) | * } FORCE_NOT_NULL { ( column_name [, ...] ) | * } FORCE_NULL { ( column_name [, ...] ) | * } + FORCE_ARRAY [ boolean ] + FORCE_ROW_DELIMITER [ boolean ] ENCODING 'encoding_name' *** COPY { ta *** 206,214 --- 208,221 Selects the data format to be read or written: text, csv (Comma Separated Values), + json (JavaScript Object Notation), or binary. The default is text. + + The json option is allowed only in + COPY TO. + *** COPY { ta *** 372,377 --- 379,410 + + + FORCE_ROW_DELIMITER + + + Force output of commas as row delimiters, in addition to the usual + end of line characters. This option is allowed only in + COPY TO, and only when using + JSON format. + The default is false. + + + + + + FORCE_ARRAY + + + Force output of array decorations at the beginning and end of output. + This option implies the FORCE_ROW_DELIMITER + option. It is allowed only in COPY TO, and only + when using JSON format. + The default is false. 
+ + + ENCODING diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfad47b..46ec34f 100644 *** a/src/backend/commands/copy.c --- b/src/backend/commands/copy.c *** ProcessCopyOptions(ParseState *pstate, *** 443,448 --- 443,450 /* default format */ ; else if (strcmp(fmt, "csv") == 0) opts_out->csv_mode = true; + else if (strcmp(fmt, "json") == 0) + opts_out->json_mode = true; else if (strcmp(fmt, "binary") == 0) opts_out->binary = true; else *** ProcessCopyOptions(ParseState *pstate, ***
Re: Emitting JSON to file using COPY TO
On 12/4/23 07:41, Andrew Dunstan wrote: On 2023-12-03 Su 20:14, Joe Conway wrote: (please don't top quote on the Postgres lists) On 12/3/23 17:38, Davin Shearer wrote: " being quoted as \\" breaks the JSON. It needs to be \". This has been my whole problem with COPY TO for JSON. Please validate that the output is in proper format with correct quoting for special characters. I use `jq` on the command line to validate and format the output. I just hooked existing "row-to-json machinery" up to the "COPY TO" statement. If the output is wrong (just for this use case?), that would be a missing feature (or possibly a bug?). Davin -- how did you work around the issue with the way the built-in functions output JSON? Andrew -- comments/thoughts? I meant to mention this when I was making comments yesterday. The patch should not be using CopyAttributeOutText - it will try to escape characters such as \, which produces the effect complained of here, or else we need to change its setup so we have a way to inhibit that escaping. Interesting. I am surprised this has never been raised as a problem with COPY TO before. Should the JSON output, as produced by composite_to_json(), be sent as-is with no escaping at all? If yes, is JSON somehow unique in this regard? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
(please don't top quote on the Postgres lists) On 12/3/23 17:38, Davin Shearer wrote: " being quoted as \\" breaks the JSON. It needs to be \". This has been my whole problem with COPY TO for JSON. Please validate that the output is in proper format with correct quoting for special characters. I use `jq` on the command line to validate and format the output. I just hooked existing "row-to-json machinery" up to the "COPY TO" statement. If the output is wrong (just for this use case?), that would be a missing feature (or possibly a bug?). Davin -- how did you work around the issue with the way the built-in functions output JSON? Andrew -- comments/thoughts? Joe On Sun, Dec 3, 2023, 10:51 Joe Conway <mailto:m...@joeconway.com>> wrote: On 12/3/23 10:31, Davin Shearer wrote: > Please be sure to include single and double quotes in the test values > since that was the original problem (double quoting in COPY TO breaking > the JSON syntax). test=# copy (select * from foo limit 4) to stdout (format json); {"id":2456092,"f1":"line with ' in it: 2456092","f2":"2023-12-03T10:44:40.9712-05:00"} {"id":2456093,"f1":"line with \\" in it: 2456093","f2":"2023-12-03T10:44:40.971221-05:00"} {"id":2456094,"f1":"line with ' in it: 2456094","f2":"2023-12-03T10:44:40.971225-05:00"} {"id":2456095,"f1":"line with \\" in it: 2456095","f2":"2023-12-03T10:44:40.971228-05:00"} -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/3/23 14:52, Andrew Dunstan wrote: On 2023-12-03 Su 14:24, Joe Conway wrote: On 12/3/23 11:03, Joe Conway wrote: On 12/3/23 10:10, Andrew Dunstan wrote: I realize this is just a POC, but I'd prefer to see composite_to_json() not exposed. You could use the already public datum_to_json() instead, passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third arguments. Ok, thanks, will do Just FYI, this change does lose some performance in my not massively scientific A/B/A test: 8<--- 8<--- That is somewhere in the 3% range. I assume it's because datum_to_json() constructs a text value from which you then need to extract the cstring, whereas composite_to_json() just gives you back the stringinfo. I guess that's a good enough reason to go with exposing composite_to_json(). Yeah, that was why I went that route in the first place. If you are good with it I will go back to that. The code is a bit simpler too. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/3/23 11:03, Joe Conway wrote: On 12/3/23 10:10, Andrew Dunstan wrote: I realize this is just a POC, but I'd prefer to see composite_to_json() not exposed. You could use the already public datum_to_json() instead, passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third arguments. Ok, thanks, will do

Just FYI, this change does lose some performance in my not massively scientific A/B/A test:

8<---
-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37196.898 ms (00:37.197)
Time: 37408.161 ms (00:37.408)
Time: 38393.309 ms (00:38.393)
Time: 36855.438 ms (00:36.855)
Time: 37806.280 ms (00:37.806)
Avg = 37532

-- original patch
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37426.207 ms (00:37.426)
Time: 36068.187 ms (00:36.068)
Time: 38285.252 ms (00:38.285)
Time: 36971.042 ms (00:36.971)
Time: 35690.822 ms (00:35.691)
Avg = 36888

-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 39083.467 ms (00:39.083)
Time: 37249.326 ms (00:37.249)
Time: 38529.721 ms (00:38.530)
Time: 38704.920 ms (00:38.705)
Time: 39001.326 ms (00:39.001)
Avg = 38513
8<---

That is somewhere in the 3% range.

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/3/23 11:03, Joe Conway wrote: From your earlier post, regarding constructing the aggregate -- not extensive testing but one data point:

8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--

Also if the table is large enough, the aggregate method is not even feasible whereas the COPY TO method works:

8<--
test=# select count(*) from foo;
 count
-------
  2000
(1 row)

test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1 more bytes.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 2000
8<--

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
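The underlying difference is that COPY TO emits one row at a time, while json_agg() must materialize the entire array in a single string buffer (which is capped at about 1GB). A consumer of the COPY output gets the same streaming benefit -- sketched here in Python, assuming one JSON object per line as in the non-array output (illustrative only, not part of the patch):

```python
import io
import json

# Simulated COPY ... TO (format json) output: one JSON object per row.
rows = io.StringIO(
    '{"id":1,"f1":"line: 1"}\n'
    '{"id":2,"f1":"line: 2"}\n'
    '{"id":3,"f1":"line: 3"}\n'
)

# Each row is parsed and discarded independently, so peak memory is
# bounded by the largest single row rather than by the total result
# size -- analogous to COPY TO streaming vs. one giant json_agg() value.
total = 0
for line in rows:
    total += json.loads(line)["id"]

print(total)  # 6
```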
Re: Emitting JSON to file using COPY TO
On 12/3/23 10:10, Andrew Dunstan wrote: On 2023-12-01 Fr 14:28, Joe Conway wrote: On 11/29/23 10:32, Davin Shearer wrote: Thanks for the responses everyone. I worked around the issue using the `psql -tc` method as Filip described. I think it would be great to support writing JSON using COPY TO at some point so I can emit JSON to files using a PostgreSQL function directly. -Davin On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <mailto:fi...@sedlakovi.org>> wrote: This would be a very special case for COPY. It applies only to a single column of JSON values. The original problem can be solved with psql --tuples-only as David wrote earlier. $ psql -tc 'select json_agg(row_to_json(t)) from (select * from public.tbl_json_test) t;' [{"id":1,"t_test":"here's a \"string\""}] Special-casing any encoding/escaping scheme leads to bugs and harder parsing. (moved to hackers) I did a quick PoC patch (attached) -- if there is interest and no hard objections I would like to get it up to speed for the January commitfest. Currently the patch lacks documentation and regression test support. Questions: 1. Is supporting JSON array format sufficient, or does it need to support some other options? How flexible does the support scheme need to be? 2. This only supports COPY TO and we would undoubtedly want to support COPY FROM for JSON as well, but is that required from the start? Thanks for any feedback. I realize this is just a POC, but I'd prefer to see composite_to_json() not exposed. You could use the already public datum_to_json() instead, passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third arguments. Ok, thanks, will do I think JSON array format is sufficient. The other formats make sense from a completeness standpoint (versus other databases) and the latest patch already includes them, so I still lean toward supporting all three formats. I can see both sides of the COPY FROM argument, but I think insisting on that makes this less doable for release 17. 
On balance I would stick to COPY TO for now. WFM. From your earlier post, regarding constructing the aggregate -- not extensive testing but one data point:

8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/3/23 10:31, Davin Shearer wrote: Please be sure to include single and double quotes in the test values since that was the original problem (double quoting in COPY TO breaking the JSON syntax).

test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it: 2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it: 2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it: 2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it: 2456095","f2":"2023-12-03T10:44:40.971228-05:00"}

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
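For the record, the `\\"` sequences in that output are exactly what makes it unparseable: inside a JSON string, `\\` is a complete escape (a literal backslash), so the quote that follows it terminates the string early and leaves trailing junk. Any strict JSON parser will demonstrate the difference -- e.g. in Python (purely illustrative, outside the patch):

```python
import json

# \"  is the correct escape for a double quote inside a JSON string
good = '{"f1":"line with \\" in it: 2456093"}'
assert json.loads(good) == {"f1": 'line with " in it: 2456093'}

# \\" is an escaped backslash followed by a raw quote, which closes
# the string early and leaves trailing junk -- not valid JSON
bad = '{"f1":"line with \\\\" in it: 2456093"}'
try:
    json.loads(bad)
    raise AssertionError("should not parse")
except json.JSONDecodeError as err:
    print("rejected as expected:", err)
```

This is the same check `jq` performs on the command line, just made explicit.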
Re: Emitting JSON to file using COPY TO
On 12/2/23 17:37, Joe Conway wrote: On 12/2/23 16:53, Nathan Bossart wrote: On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote: So if you are writing a production that might need to match FORMAT followed by JSON, you need to match FORMAT_LA too. Thanks for the pointer. That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
 			{
 				$$ = makeDefElem($1, $2, @1);
 			}
+		| FORMAT_LA copy_generic_opt_arg
+			{
+				$$ = makeDefElem("format", $2, @1);
+			}
 		;
 copy_generic_opt_arg:

Yep -- I concluded the same. Thanks Tom!

The attached implements the above repair, as well as adding support for array decoration (or not) and/or comma row delimiters when not an array. This covers the three variations of json import/export formats that I have found after light searching (SQL Server and DuckDB). Still lacks documentation, tests, and COPY FROM support, but here is what it looks like in a nutshell:

8<---
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo select g.i, 'line: ' || g.i::text, clock_timestamp() from generate_series(1,4) as g(i);

copy foo to stdout (format json);
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (format json, force_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

copy foo to stdout (format json, force_row_delimiter);
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (force_array);
ERROR:  COPY FORCE_ARRAY requires JSON mode
copy foo to stdout (force_row_delimiter);
ERROR:  COPY FORCE_ROW_DELIMITER requires JSON mode
8<---

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfad47b..1f9ac31 100644 *** a/src/backend/commands/copy.c --- b/src/backend/commands/copy.c *** ProcessCopyOptions(ParseState *pstate, *** 443,448 --- 443,450 /* default format */ ; else if (strcmp(fmt, "csv") == 0) opts_out->csv_mode = true; + else if (strcmp(fmt, "json") == 0) + opts_out->json_mode = true; else if (strcmp(fmt, "binary") == 0) opts_out->binary = true; else *** ProcessCopyOptions(ParseState *pstate, *** 540,545 --- 542,559 defel->defname), parser_errposition(pstate, defel->location))); } + else if (strcmp(defel->defname, "force_row_delimiter") == 0) + { + if (opts_out->force_row_delimiter) + errorConflictingDefElem(defel, pstate); + opts_out->force_row_delimiter = true; + } + else if (strcmp(defel->defname, "force_array") == 0) + { + if (opts_out->force_array) + errorConflictingDefElem(defel, pstate); + opts_out->force_array = true; + } else if (strcmp(defel->defname, "convert_selectively") == 0) { /* *** ProcessCopyOptions(ParseState *pstate, *** 598,603 --- 612,631 (errcode(ERRCODE_SYNTAX_ERROR), errmsg("cannot specify DEFAULT in BINARY mode"))); + if (opts_out->json_mode) + { + if (opts_out->force_array) + opts_out->force_row_delimiter = true; + } + else if (opts_out->force_array) + ereport(ERROR, + (errcode(ERR
Re: Emitting JSON to file using COPY TO
On 12/2/23 13:50, Maciek Sakrejda wrote: On Fri, Dec 1, 2023 at 11:32 AM Joe Conway wrote: 1. Is supporting JSON array format sufficient, or does it need to support some other options? How flexible does the support scheme need to be? "JSON Lines" is a semi-standard format [1] that's basically just newline-separated JSON values. (In fact, this is what log_destination=jsonlog gives you for Postgres logs, no?) It might be worthwhile to support that, too. [1]: https://jsonlines.org/ Yes, I have seen examples of that associated with other databases (MSSQL and DuckDB at least) as well. It probably makes sense to support that format too. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
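The two shapes differ only in framing: the force_array output is a single JSON document, while JSON Lines is one independent document per line, so a consumer handles them differently. A quick Python sketch of both (illustrative only, not tied to the patch):

```python
import json

# force_array-style output: one JSON array document; the newlines are
# insignificant whitespace and the leading commas are ordinary element
# separators, so a single parse consumes the whole thing.
array_doc = (
    '[\n'
    ' {"id":1,"f1":"line: 1"}\n'
    ',{"id":2,"f1":"line: 2"}\n'
    ']'
)
rows = json.loads(array_doc)
assert [r["id"] for r in rows] == [1, 2]

# JSON Lines: no array wrapper, no commas; parse one line at a time.
jsonl = '{"id":1,"f1":"line: 1"}\n{"id":2,"f1":"line: 2"}\n'
records = [json.loads(line) for line in jsonl.splitlines()]
assert [r["id"] for r in records] == [1, 2]
```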
Re: Emitting JSON to file using COPY TO
On 12/2/23 16:53, Nathan Bossart wrote: On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote: So if you are writing a production that might need to match FORMAT followed by JSON, you need to match FORMAT_LA too. Thanks for the pointer. That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
 			{
 				$$ = makeDefElem($1, $2, @1);
 			}
+		| FORMAT_LA copy_generic_opt_arg
+			{
+				$$ = makeDefElem("format", $2, @1);
+			}
 		;
 copy_generic_opt_arg:

Yep -- I concluded the same. Thanks Tom!

-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/1/23 18:09, Nathan Bossart wrote: On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote: I did a quick PoC patch (attached) -- if there is interest and no hard objections I would like to get it up to speed for the January commitfest. Cool. I would expect there to be interest, given all the other JSON support that has been added thus far. Thanks for the review.

I noticed that, with the PoC patch, "json" is the only format that must be quoted. Without quotes, I see a syntax error. I'm assuming there's a conflict with another json-related rule somewhere in gram.y, but I haven't tracked down exactly which one is causing it. It seems to be because 'json' is also a type name ($$ = SystemTypeName("json")). What do you think about using 'json_array' instead? It is more specific and accurate, and avoids the need to quote.

test=# copy foo to stdout (format json_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

1. Is supporting JSON array format sufficient, or does it need to support some other options? How flexible does the support scheme need to be? I don't presently have a strong opinion on this one. My instinct would be to start with something simple, though. I don't think we offer any special options for log_destination... WFM

2. This only supports COPY TO and we would undoubtedly want to support COPY FROM for JSON as well, but is that required from the start? I would vote for including COPY FROM support from the start. Check. My thought is to only accept the same format we emit -- i.e. only take a json array.

! if (!cstate->opts.json_mode)

I think it's unfortunate that this further complicates the branching in CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's worth refactoring a bunch of stuff to make this nicer. 
Yeah that was my conclusion. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Emitting JSON to file using COPY TO
On 12/1/23 22:00, Davin Shearer wrote: I'm really glad to see this taken up as a possible new feature and will definitely use it if it gets released. I'm impressed with how clean, understandable, and approachable the postgres codebase is in general and how easy it is to read and understand this patch. I reviewed the patch (though I didn't build and test the code) and have a concern with adding the '[' at the beginning and ']' at the end of the json output. Those are already added by `json_agg` (https://www.postgresql.org/docs/current/functions-aggregate.html) as you can see in my initial email. Adding them in the COPY TO may be redundant (e.g., [[{"key":"value"...}]]).

With this patch in place you don't use json_agg() at all. See the example output (this is real output with the patch applied): (oops -- I meant to send this with the same email as the patch)

8<-
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo select g.i, 'line: ' || g.i::text, clock_timestamp() from generate_series(1,4) as g(i);
copy foo to stdout (format 'json');
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]
8<-

I think COPY TO makes good sense to support, though COPY FROM maybe not so much as JSON isn't necessarily flat and rectangular like CSV. Yeah -- definitely not as straightforward, but possibly we just support the array-of-jsonobj-rows as input as well? For my use-case, I'm emitting JSON files to Apache NiFi for processing, and NiFi has superior handling of JSON (via JOLT parsers) versus CSV where parsing is generally done with regex. I want to be able to emit JSON using a postgres function and thus COPY TO. Definitely +1 for COPY TO. 
I don't think COPY FROM will work out well unless the JSON is required to be flat and rectangular. I would vote -1 to leave it out due to the necessary restrictions making it not generally useful. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
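If COPY FROM were ever restricted to the array-of-json-objects shape we emit, the reader would still want to consume elements incrementally rather than slurp the whole array into memory. A rough sketch of that kind of incremental array reader in Python -- `iter_array_elements` is a hypothetical helper for illustration, not anything from the patch:

```python
import json

# Pull objects out of a top-level JSON array one element at a time,
# without materializing the whole array. Uses raw_decode to consume
# exactly one value per call, then skips the separators by hand.
def iter_array_elements(text):
    dec = json.JSONDecoder()
    i = text.index('[') + 1
    while True:
        # skip insignificant whitespace and the element-separating commas
        while i < len(text) and text[i] in ' \t\r\n,':
            i += 1
        if i >= len(text) or text[i] == ']':
            return
        obj, i = dec.raw_decode(text, i)
        yield obj

# Same leading-comma layout that force_array produces
doc = '[\n {"id":1}\n,{"id":2}\n,{"id":3}\n]'
print([row["id"] for row in iter_array_elements(doc)])  # [1, 2, 3]
```

This only works because the input is a flat array of row objects -- which is exactly the restriction being debated above.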
Re: Emitting JSON to file using COPY TO
On 11/29/23 10:32, Davin Shearer wrote: Thanks for the responses everyone. I worked around the issue using the `psql -tc` method as Filip described. I think it would be great to support writing JSON using COPY TO at some point so I can emit JSON to files using a PostgreSQL function directly. -Davin On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <mailto:fi...@sedlakovi.org>> wrote: This would be a very special case for COPY. It applies only to a single column of JSON values. The original problem can be solved with psql --tuples-only as David wrote earlier. $ psql -tc 'select json_agg(row_to_json(t)) from (select * from public.tbl_json_test) t;' [{"id":1,"t_test":"here's a \"string\""}] Special-casing any encoding/escaping scheme leads to bugs and harder parsing. (moved to hackers) I did a quick PoC patch (attached) -- if there is interest and no hard objections I would like to get it up to speed for the January commitfest. Currently the patch lacks documentation and regression test support. Questions: 1. Is supporting JSON array format sufficient, or does it need to support some other options? How flexible does the support scheme need to be? 2. This only supports COPY TO and we would undoubtedly want to support COPY FROM for JSON as well, but is that required from the start? Thanks for any feedback. 
-- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index cfad47b..bc1f684 100644 *** a/src/backend/commands/copy.c --- b/src/backend/commands/copy.c *** ProcessCopyOptions(ParseState *pstate, *** 443,448 --- 443,450 /* default format */ ; else if (strcmp(fmt, "csv") == 0) opts_out->csv_mode = true; + else if (strcmp(fmt, "json") == 0) + opts_out->json_mode = true; else if (strcmp(fmt, "binary") == 0) opts_out->binary = true; else *** ProcessCopyOptions(ParseState *pstate, *** 667,672 --- 669,679 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot specify HEADER in BINARY mode"))); + if (opts_out->json_mode && opts_out->header_line) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot specify HEADER in JSON mode"))); + /* Check quote */ if (!opts_out->csv_mode && opts_out->quote != NULL) ereport(ERROR, diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index c66a047..f6ee771 100644 *** a/src/backend/commands/copyto.c --- b/src/backend/commands/copyto.c *** *** 37,42 --- 37,43 #include "rewrite/rewriteHandler.h" #include "storage/fd.h" #include "tcop/tcopprot.h" + #include "utils/json.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/partcache.h" *** typedef struct *** 112,117 --- 113,120 /* NOTE: there's a copy of this in copyfromparse.c */ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0"; + /* need delimiter to start next json array element */ + static bool json_row_delim_needed = false; /* non-export function prototypes */ static void EndCopy(CopyToState cstate); *** DoCopyTo(CopyToState cstate) *** 845,850 --- 848,861 CopySendEndOfRow(cstate); } + + /* if a JSON has been requested send the opening bracket */ + if (cstate->opts.json_mode) + { + CopySendChar(cstate, '['); + CopySendEndOfRow(cstate); + json_row_delim_needed = false; + } } if 
(cstate->rel) *** DoCopyTo(CopyToState cstate) *** 892,897 --- 903,915 CopySendEndOfRow(cstate); } + /* if a JSON has been requested send the closing bracket */ + if (cstate->opts.json_mode) + { + CopySendChar(cstate, ']'); + CopySendEndOfRow(cstate); + } + MemoryContextDelete(cstate->rowcontext); if (fe_copy) *** DoCopyTo(CopyToState cstate) *** 906,916 static void CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot) { - bool need_delim = false; - FmgrInfo *out_functions = cstate->out_functions; MemoryContext oldcontext; - ListCell *cur; - char *string; MemoryContextReset(cstate->rowcontext); oldcontext = MemoryContextSwitchTo(cstate->rowcontext); --- 924,930 *** CopyOneRowTo(CopyToState cstate, TupleTa *** 921,974 CopySendInt16(cstate, list_length(cstate->attnumlist)); } ! /* Make sure the tuple is fully deconstructed */ ! slot_getallattrs(slot); ! ! foreach(cur, cstate->attnumlist) { ! int attnum = lfirst_int(cur); ! Datum va
Re: How to solve the problem of one backend process crashing and causing other processes to restart?
On 11/13/23 00:53, Laurenz Albe wrote: On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote: yuansong writes: > In PostgreSQL, when a backend process crashes, it can cause other backend > processes to also require a restart, primarily to ensure data consistency. > I understand that the correct approach is to analyze and identify the > cause of the crash and resolve it. However, it is also important to be > able to handle a backend process crash without affecting the operation of > other processes, thus minimizing the scope of negative impact and > improving availability. To achieve this goal, could we mimic the Oracle > process by introducing a "pmon" process dedicated to rolling back crashed > process transactions and performing resource cleanup? I wonder if anyone > has attempted such a strategy or if there have been previous discussions > on this topic. The reason we force a database-wide restart is that there's no way to be certain that the crashed process didn't corrupt anything in shared memory. (Even with the forced restart, there's a window where bad data could reach disk before we kill off the other processes that might write it. But at least it's a short window.) "Corruption" here doesn't just involve bad data placed into disk buffers; more often it's things like unreleased locks, which would block other processes indefinitely. I seriously doubt that anything like what you're describing could be made reliable enough to be acceptable. "Oracle does it like this" isn't a counter-argument: they have a much different (and non-extensible) architecture, and they also have an army of programmers to deal with minutiae like undoing resource acquisition. Even with that, you'd have to wonder about the number of bugs existing in such necessarily-poorly-tested code paths. Yes. 
I think that PostgreSQL's approach is superior: rather than investing in code to mitigate the impact of data corruption caused by a crash, invest in quality code that doesn't crash in the first place. While true, this does nothing to prevent OOM kills, which are becoming more prevalent as, for example, running Postgres in a container (or otherwise) with a cgroup memory limit becomes more popular. And in any case, there are enterprise use cases that necessarily avoid Postgres due to this behavior, which is a shame. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: commitfest app down for repairs
On 9/30/23 08:47, Joe Conway wrote: The commitfest app is down for repairs. We will reply back here once it is back up. The commitfest app is back up. We restored to a backup from one day prior. We will take a look at what changed in between, but it might be up to folks to redo some things. A cooling off period was added to the commitfest app for new community accounts, similar to what was done with the wiki a few years ago. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
commitfest app down for repairs
The commitfest app is down for repairs. We will reply back here once it is back up. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On 9/25/23 14:03, Jeff Davis wrote: On Mon, 2023-09-25 at 12:00 -0400, Joe Conway wrote: Should there be a way to have a separate "execution" search_path? I hadn't considered that and I like that idea for a few reasons: * a lot of the problem cases are for functions that don't need to access tables at all, e.g., in an index expression. * it avoids annoyances with pg_temp, because that's not searched for functions/operators anyway * perhaps we could force the object search_path to be empty for IMMUTABLE functions? I haven't thought it through in detail, but it seems like a promising approach. Related to this, it would be useful if you could grant create on schema for only non-executable objects. You may want to allow a user to create their own tables but not allow them to create their own functions, for example. Right now "GRANT CREATE ON SCHEMA foo" gives the grantee the ability to create "all the things". -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On 9/25/23 11:30, Robert Haas wrote: I don't believe that people want to run their functions under a sanitized search_path that only includes system schemas. That might work for some people, but I think most people will define functions that call other functions that they themselves defined, or access tables that they themselves created. They will therefore need the search_path to include the schemas in which they created those objects. Without diving into all the detailed nuances of this discussion, this particular paragraph made me wonder if at least part of the problem here is that the same search_path is used to find "things that I want to execute" (functions and operators) and "things I want to access" (tables, etc). I think many folks would be well served by only having system schemas in the search_path for the former (augmented by explicit schema qualifying of one's own functions), but agree that almost no one wants that for the latter (needing to schema qualify every table reference). Should there be a way to have a separate "execution" search_path? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
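To make the two-kinds-of-lookup distinction concrete: today the only per-function control is the SET clause at CREATE FUNCTION time, which pins the search_path for both "things I want to execute" and "things I want to access" at once. A minimal sketch of that existing mechanism (the schema and function names here are hypothetical, chosen only for illustration):

```sql
-- Pin the "safe" path discussed in this thread at creation time, so the
-- function's behavior no longer depends on the caller's session search_path.
-- Note this pins lookup of operators/functions AND tables together; a
-- separate "execution" search_path would decouple the two.
CREATE FUNCTION app.add_one(n integer) RETURNS integer
LANGUAGE sql
SET search_path = pg_catalog, pg_temp
AS 'SELECT n + 1';
```

A separate execution search_path would let the function/operator lookups default to system schemas like this while still letting table references resolve against the session's schemas.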
Re: RFC: pg_stat_logmsg
On 7/9/23 14:13, Joe Conway wrote: On 7/7/23 01:38, Gurjeet Singh wrote: In this case, the error message stored in pg_stat_logmsg is just '%s'. The filename and line number columns might help identify the actual error but it requires users to read the source code and cannot know the actual error message. I believe that the number of such error messages, the ones with very little descriptive content, is very low in Postgres code. These kinds of messages are not prevalent, and must be used when code hits obscure conditions, like seen in your example above. These are the kinds of errors that Joe is referring to as "should not happen". In these cases, even if the error message was descriptive, the user will very likely have to dive deep into code to find out the real cause.

Agreed. As an example, after running `make installcheck`:

8<----------
select sum(count) from pg_stat_logmsg;
 sum
------
 6005
(1 row)

select message, sum(count) from pg_stat_logmsg
where length(message) < 30
  and elevel in ('ERROR','FATAL','PANIC')
  and message like '%\%s%' escape '\'
group by message order by length(message);

      message       | sum
--------------------+-----
 %s                 | 107
 "%s" is a view     |   9
 "%s" is a table    |   4
 %s is a procedure  |   1
 invalid size: "%s" |  13
 %s at or near "%s" | 131
 %s at end of input |   3
 ...
8<----------

I only see three message formats there that are generic enough to be of concern (the first and last two shown -- FWIW I did not see any more of them as the fmt string gets longer). So out of 6005 log events, 241 fit this concern. Perhaps given the small number of message format strings affected, it would make sense to special case those, but I am not sure it is worth the effort, at least for version 1.

Attached is an update, this time as a patch against 17devel. It is not quite complete, but getting close IMHO. Changes:

1. Now integrated into contrib as an "Additional Supplied Extension"
2. Documentation added
3.
I had a verbal conversation with Andres and he reminded me that the original idea for this was to collect data across fleets of pg hosts so we could look for how often "should never happen" errors actually happen. As such, we need to think in terms of aggregating the info from many hosts, potentially with many localized languages for the messages. So I converted the "message" column back to an untranslated string, and added a "translated_message" column which is localized. An alternative approach might be to provide a separate function that accepts the message string and returns the translation. Thoughts on that?

4. In the same vein, I added a pgversion column since the filename and line number for the same log message could vary across major or even minor releases.

Not done:

1. The main thing lacking at this point is a regression test.
2. No special casing for message == "%s". I am still not convinced it is worthwhile to do so.

Comments gratefully welcomed.

Thanks, -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

```
diff --git a/contrib/Makefile b/contrib/Makefile
index da4e231..9144cb7 100644
*** a/contrib/Makefile
--- b/contrib/Makefile
*************** SUBDIRS = \
*** 34,39 ****
--- 34,40 ----
  	pg_buffercache \
  	pg_freespacemap \
  	pg_prewarm \
+ 	pg_stat_logmsg \
  	pg_stat_statements \
  	pg_surgery \
  	pg_trgm \
diff --git a/contrib/meson.build b/contrib/meson.build
index 84d4e18..4e8ac39 100644
*** a/contrib/meson.build
--- b/contrib/meson.build
*************** subdir('pgcrypto')
*** 44,49 ****
--- 44,50 ----
  subdir('pg_freespacemap')
  subdir('pg_prewarm')
  subdir('pgrowlocks')
+ subdir('pg_stat_logmsg')
  subdir('pg_stat_statements')
  subdir('pgstattuple')
  subdir('pg_surgery')
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index ab7e38b..4dfc5dd 100644
*** a/doc/src/sgml/contrib.sgml
--- b/doc/src/sgml/contrib.sgml
*************** CREATE EXTENSION extension_
*** 157,162 ****
--- 157,163 ----
+
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 4c63a7e..acede66 100644
*** a/doc/src/sgml/filelist.sgml
--- b/doc/src/sgml/filelist.sgml
***************
*** 145,150 ****
--- 145,151 ----
+
diff --git a/contrib/pg_stat_logmsg/.gitignore b/contrib/pg_stat_logmsg/.gitignore
index ...5dcb3ff .
*** a/contrib/pg_stat_logmsg/.gitignore
--- b/contrib/pg_stat_logmsg/.gitignore
***************
*** 0 ****
--- 1,4 ----
+ # Generated subdirectories
+ /log/
+ /results/
+ /tmp_check/
diff --git a/contrib/pg_stat_logmsg/LICENSE b/contrib/pg_stat_logmsg/LICENSE
index ...998f814 .
*** a/contrib/pg
```
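As a thought experiment on the fleet-aggregation goal described above: assuming per-host snapshots of pg_stat_logmsg were collected into a hypothetical rollup table logmsg_fleet(host, pgversion, filename, lineno, message, count) (that table and its columns are an illustration, not part of the patch), the "how often do should-never-happen errors actually happen" question could be asked like this:

```sql
-- Hypothetical fleet rollup: which untranslated message formats fire,
-- how often, and on how many hosts, keyed by source location and version
-- since filename/lineno can shift across releases.
SELECT pgversion, filename, lineno, message,
       sum(count) AS total,
       count(DISTINCT host) AS hosts_affected
FROM logmsg_fleet
GROUP BY pgversion, filename, lineno, message
ORDER BY total DESC;
```

Keeping the message column untranslated is what makes this GROUP BY meaningful across hosts running different locales.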
Re: Possibility to disable `ALTER SYSTEM`
On 9/7/23 15:51, Gabriele Bartolini wrote: I would like to propose a patch that allows administrators to disable `ALTER SYSTEM` via either a run-time option to pass to the Postgres server process at startup (e.g. `--disable-alter-system=true`, false by default) or a new GUC (or even both), without changing the current default method of the server. Without trying to debate the particulars, a huge +1 for the concept of allowing ALTER SYSTEM to be disabled. FWIW I would vote the same for control over COPY PROGRAM. Not coincidentally both concepts were built into set_user: https://github.com/pgaudit/set_user -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
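For readers unfamiliar with how an extension like set_user can block a utility command today: the usual technique is a ProcessUtility hook. The following is an abbreviated, untested sketch (it uses the PG 14+ hook signature and would need to be built as a loadable module against the server headers; it is not set_user's actual code):

```c
#include "postgres.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"
#include "tcop/utility.h"

PG_MODULE_MAGIC;

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

/* Reject ALTER SYSTEM before it reaches standard_ProcessUtility. */
static void
deny_alter_system(PlannedStmt *pstmt, const char *queryString,
				  bool readOnlyTree, ProcessUtilityContext context,
				  ParamListInfo params, QueryEnvironment *queryEnv,
				  DestReceiver *dest, QueryCompletion *qc)
{
	if (IsA(pstmt->utilityStmt, AlterSystemStmt))
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("ALTER SYSTEM is disabled on this server")));

	if (prev_ProcessUtility)
		prev_ProcessUtility(pstmt, queryString, readOnlyTree,
							context, params, queryEnv, dest, qc);
	else
		standard_ProcessUtility(pstmt, queryString, readOnlyTree,
								context, params, queryEnv, dest, qc);
}

void
_PG_init(void)
{
	prev_ProcessUtility = ProcessUtility_hook;
	ProcessUtility_hook = deny_alter_system;
}
```

An in-core switch, as proposed, would avoid needing shared_preload_libraries for this at all.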
Re: [17] CREATE SUBSCRIPTION ... SERVER
On 8/31/23 12:52, Jeff Davis wrote: On Thu, 2023-08-31 at 10:59 +0530, Ashutosh Bapat wrote: The server's FDW has to be postgres_fdw. So we have to handle the awkward dependency between core and postgres_fdw (an extension). That sounds more than just "awkward". I can't think of any precedent for that and it seems to violate the idea of an "extension" entirely. Can you explain more concretely how we might resolve that? Maybe move postgres_fdw to be a first class built in feature instead of an extension? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG
On 8/15/23 10:40, Heikki Linnakangas wrote: On 01/08/2023 16:48, Joe Conway wrote: Any further comments on the posted patch[1]? I would like to apply/push this prior to the beta and minor releases next week. I'm not sure about the placement of the uselocale() calls. In plperl_spi_exec(), for example, I think we should switch to the global locale right after the check_spi_usage_allowed() call. Otherwise, if an error happens in BeginInternalSubTransaction() or in pg_verifymbstr(), the error would be processed with the perl locale. Maybe that's harmless, error processing hardly cares about LC_MONETARY, but seems wrong in principle. I guess you could probably argue that we should flip this around, and only enter the perl locale when calling into libperl, and exit the perl locale every time we reemerge under plperl.c control. That seems pretty drastic and potentially messy though. Hmm, come to think of it, if BeginInternalSubTransaction() throws an error, we just jump out of the perl interpreter? That doesn't seem cool. But that's not new with this patch. Hmm, true enough I guess. If I'm reading correctly, compile_plperl_function() calls select_perl_context(), which calls plperl_trusted_init(), which calls uselocale(). So it leaves locale set to the perl locale. Who sets it back? No one does it seems, at least not currently How about adding a small wrapper around eval_pl() that sets and unsets the locale(), just when we enter the interpreter? It's easier to see that we are doing the calls in right places, if we make them as close as possible to entering/exiting the interpreter. Are there other functions in addition to eval_pl() that need to be called with the perl locale? I can see that as a better strategy, but "other functions in addition to eval_pv()" (I assume you mean eval_pv rather than eval_pl) is a tricky one to answer. 
I ran the attached script (from cwd src/pl/plperl) like so:

```
symbols-used.sh /lib/x86_64-linux-gnu/libperl.so.5.34 plperl.so
```

and get a fairly long list of exported libperl functions that get linked into plperl.so:

```
Matched symbols:
boot_DynaLoader perl_alloc Perl_av_extend Perl_av_fetch Perl_av_len Perl_av_push
*Perl_call_list *Perl_call_pv *Perl_call_sv perl_construct Perl_croak
Perl_croak_nocontext Perl_croak_sv Perl_croak_xs_usage Perl_die *Perl_eval_pv
Perl_free_tmps Perl_get_sv Perl_gv_add_by_type Perl_gv_stashpv Perl_hv_clear
Perl_hv_common Perl_hv_common_key_len Perl_hv_iterinit Perl_hv_iternext
Perl_hv_iternext_flags Perl_hv_iternextsv Perl_hv_ksplit Perl_looks_like_number
Perl_markstack_grow Perl_mg_get Perl_newRV Perl_newRV_noinc Perl_newSV
Perl_newSViv Perl_newSVpv Perl_newSVpvn Perl_newSVpvn_flags Perl_newSVsv
Perl_newSVsv_flags Perl_newSV_type Perl_newSVuv Perl_newXS Perl_newXS_flags
*perl_parse Perl_pop_scope Perl_push_scope *perl_run Perl_save_item
Perl_savetmps Perl_stack_grow Perl_sv_2bool Perl_sv_2bool_flags Perl_sv_2iv
Perl_sv_2iv_flags Perl_sv_2mortal Perl_sv_2pv Perl_sv_2pvbyte
Perl_sv_2pvbyte_flags Perl_sv_2pv_flags Perl_sv_2pvutf8 Perl_sv_2pvutf8_flags
Perl_sv_bless Perl_sv_free Perl_sv_free2 Perl_sv_isa Perl_sv_newmortal
Perl_sv_setiv Perl_sv_setiv_mg Perl_sv_setsv Perl_sv_setsv_flags Perl_sys_init
Perl_sys_init3 Perl_xs_boot_epilog Perl_xs_handshake
```

I marked the ones that look like perhaps we should care about in the above list with an asterisk:

*Perl_call_list
*Perl_call_pv
*Perl_call_sv
*Perl_eval_pv
*perl_run

but perhaps there are others?

```
/*
 * plperl_xact_callback --- cleanup at main-transaction end.
 */
static void
plperl_xact_callback(XactEvent event, void *arg)
{
	/* ensure global locale is the current locale */
	if (uselocale((locale_t) 0) != LC_GLOBAL_LOCALE)
		perl_locale_obj = uselocale(LC_GLOBAL_LOCALE);
}
```

So the assumption is that if the current locale is not LC_GLOBAL_LOCALE, then it was the perl locale.
Seems true today, but this could confusion if anything else calls uselocale(). In particular, if another PL implementation copies this, and you use plperl and the other PL at the same time, they would get mixed up. I think we need another "bool perl_locale_obj_in_use" variable to track explicitly whether the perl locale is currently active. Or perhaps don't assume that we want the global locale and swap between pg_locale_obj (whatever it is) and perl_locale_obj? If we are careful to put the uselocale() calls in the right places so that we never ereport() while in perl locale, this callback isn't needed. Maybe it's still a good idea, though, to be extra sure that things get reset to a sane state if something unexpected happens. I feel more comfortable that we have a "belt and suspenders" method to restore the locale that was in use by Postgres before entering perl. If multiple interpreters are used, is the single perl_locale_obj variable still enough? Each interpreter can have their own locale I believe. So in other
Re: list of acknowledgments for PG16
On 8/22/23 09:44, Tom Lane wrote: Alvaro Herrera writes: On 2023-Aug-22, Peter Eisentraut wrote: The list of acknowledgments for the PG16 release notes has been committed. It should show up here sometime: <https://www.postgresql.org/docs/16/release-16.html#RELEASE-16-ACKNOWLEDGEMENTS>. Hmm, I think these docs would only regenerate during the RC1 release, so it'll be a couple of weeks, unless we manually poke the doc builder. Yeah. I could produce a new set of tarballs from the v16 branch tip, but I don't know the process (nor have the admin permissions) to extract the HTML docs and put them on the website. These days the docs update is part of a scripted process for doing an entire release. I'm sure we could figure out how to just release the updated docs, but with RC1 a week away, is it really worthwhile? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: C function to return double precision[][]
On 8/21/23 15:31, Markur Sens wrote: Is there any piece of code I could see how to achieve $subject ? I haven’t found anything in the standard library or contrib modules. I’m trying to build ArrayType ** of sorts and return a Datum of those but I can’t seem to manage memory correctly. There is an example in PL/R here: https://github.com/postgres-plr/plr/blob/20a1f133bcf2bc8f37ac23da191aea590d612619/pg_conversion.c#L1275 which points to here with number of dims == 2: https://github.com/postgres-plr/plr/blob/20a1f133bcf2bc8f37ac23da191aea590d612619/pg_conversion.c#L1493 This is all generic to the element type (i.e. not specifically float8), but the needed type conversion stuff happens in here: https://github.com/postgres-plr/plr/blob/20a1f133bcf2bc8f37ac23da191aea590d612619/plr.c#L1109 HTH, -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
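Distilled from those PL/R links, here is a sketch of the float8-specific case using construct_md_array(). It is untested and the function name is illustrative; the key point is that all allocation is palloc-based in the current memory context, so there is no manual free to get wrong:

```c
#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_type.h"
#include "utils/array.h"

/*
 * Build an nrows x ncols double precision[][] Datum from a row-major
 * C array of float8 values.  Memory is palloc'd in the current context;
 * the result is valid for as long as that context lives.
 */
static Datum
make_float8_matrix(const float8 *vals, int nrows, int ncols)
{
	int			nelems = nrows * ncols;
	Datum	   *elems = (Datum *) palloc(nelems * sizeof(Datum));
	int			dims[2] = {nrows, ncols};
	int			lbs[2] = {1, 1};	/* SQL arrays are 1-based by default */
	ArrayType  *result;

	for (int i = 0; i < nelems; i++)
		elems[i] = Float8GetDatum(vals[i]);

	result = construct_md_array(elems, NULL /* no nulls */ , 2, dims, lbs,
								FLOAT8OID, sizeof(float8),
								FLOAT8PASSBYVAL, 'd');
	return PointerGetDatum(result);
}
```

If some elements can be NULL, pass a bool array instead of NULL for the second argument; the element type/len/byval/align quartet must always match the element OID you pass.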
Re: Would it be possible to backpatch Close support in libpq (28b5726) to PG16?
On 8/15/23 15:39, Alvaro Herrera wrote: On 2023-Aug-16, Michael Paquier wrote: Personally I think backpatching 28b5726 has a really low risk of breaking anything. I agree about the low-risk argument, though. This is just new code. Here's a way to think about it. If 16.1 was already out, would we add libpq support for Close to 16.2? Seems pretty clearly a "no" to me. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On 8/12/23 09:15, Joe Conway wrote: On 8/11/23 22:35, Jeff Davis wrote: 2. We can more accurately serve the user's intent. For instance, the safe search_path of "pg_catalog, pg_temp" is arcane and seems to be there just because we don't have a way to specify that pg_temp be excluded entirely. But perhaps in the future we *do* want to exclude pg_temp entirely. Knowing that the user just wants "SEARCH SYSTEM" allows us some freedom to do that. Personally I think having pg_temp in the SYSTEM search path makes sense for temp tables, but I find it easy to forget that functions can be created by unprivileged users in pg_temp, and therefore having pg_temp in the search path for functions is dangerous. Hmm, I guess I was too hasty -- seems we have some magic related to this already. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On 8/11/23 22:35, Jeff Davis wrote: The attached patch implements a new SEARCH clause for CREATE FUNCTION. The SEARCH clause controls the search_path used when executing functions that were created without a SET clause. Background: Controlling search_path is critical for the correctness and security of functions. Right now, the author of a function without a SET clause has little ability to control the function's behavior, because even basic operators like "+" involve search_path. This is a big problem for, e.g. functions used in expression indexes which are called by any user with write privileges on the table. Motivation: I'd like to (eventually) get to safe-by-default behavior. In other words, the simplest function declaration should be safe for the most common use cases. I agree with the general need. Add SEARCH { DEFAULT | SYSTEM | SESSION } clause to CREATE/ALTER function. * SEARCH DEFAULT is the same as no SEARCH clause at all, and ends up stored in the catalog as prosearch='d'. * SEARCH SYSTEM means that we switch to the safe search path of "pg_catalog, pg_temp" when executing the function. Stored as prosearch='y'. * SEARCH SESSION means that we don't switch the search_path when executing the function, and it's inherited from the session. Stored as prosearch='e'. It isn't clear to me what the precise difference is between DEFAULT and SESSION. 2. We can more accurately serve the user's intent. For instance, the safe search_path of "pg_catalog, pg_temp" is arcane and seems to be there just because we don't have a way to specify that pg_temp be excluded entirely. But perhaps in the future we *do* want to exclude pg_temp entirely. Knowing that the user just wants "SEARCH SYSTEM" allows us some freedom to do that.
Personally I think having pg_temp in the SYSTEM search path makes sense for temp tables, but I find it easy to forget that functions can be created by unprivileged users in pg_temp, and therefore having pg_temp in the search path for functions is dangerous. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG
On 7/3/23 12:25, Tristan Partin wrote: On Sat Jun 24, 2023 at 8:09 AM CDT, Joe Conway wrote: Although I have not looked yet, presumably we could have similar problems with plpython. I would like to get agreement on this approach against plperl before diving into that though. Thoughts? I don't see anything immediately wrong with this. Any further comments on the posted patch[1]? I would like to apply/push this prior to the beta and minor releases next week. Joe [1] https://www.postgresql.org/message-id/ec6fa20d-e691-198a-4a13-e761771b9dec%40joeconway.com -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On 7/31/23 12:53, Robert Haas wrote: On Fri, Jun 30, 2023 at 3:41 AM Jeff Davis wrote: I'm not sure that everyone in this thread realizes just how broken it is to depend on search_path in a functional index at all. And doubly so if it depends on a schema other than pg_catalog in the search_path. Let's also not forget that logical replication always uses search_path=pg_catalog, so if you depend on a different search_path for any function attached to the table (not just functional indexes, also functions inside expressions or trigger functions), then those are already broken in version 15. And if a superuser is executing maintenance commands, there's little reason to think they'll have the same search path as the user that created the table. At some point in the very near future (though I realize that point may come after version 16), we need to lock down the search path in a lot of cases (not just maintenance commands), and I don't see any way around that. I agree. I think there are actually two interrelated problems here. One is that virtually all code needs to run with the originally intended search_path rather than some search_path chosen at another time and maybe by a different user. If not, it's going to break, or compromise security, depending on the situation. The other is that running arbitrary code written by somebody else as yourself is basically instant death, from a security perspective. I agree too. But the analysis of the issue needs to go one step further. Even if the search_path does not change from the originally intended one, a newly created function can shadow the intended one based on argument coercion rules. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
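A contrived sketch of the coercion-based shadowing described in that last paragraph (the schemas, function names, and privileges are hypothetical, purely for illustration):

```sql
-- The function author intends callers of f(...) to reach trusted.f(numeric).
CREATE FUNCTION trusted.f(x numeric) RETURNS text
    LANGUAGE sql AS $$ SELECT 'trusted'::text $$;

-- Later, a user with CREATE on schema "attacker" adds an exact-type variant.
CREATE FUNCTION attacker.f(x int) RETURNS text
    LANGUAGE sql AS $$ SELECT 'attacker'::text $$;

-- The search_path is exactly the "originally intended" one; it never changed.
SET search_path = trusted, attacker;

-- Function type resolution prefers an exact argument-type match over a
-- candidate needing a coercion, even when the exact match lives in a
-- LATER schema: f(42) resolves to attacker.f(int), not trusted.f(numeric).
SELECT f(42);
```

This is why locking down search_path alone is not sufficient; what can be created in the schemas already on the path matters too.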